Recommended Posts

I'm continually surprised at how many folks have NO backups ... and apparently think they don't need them because of UnRAID's fault-tolerance.    Since I'm often writing the same thing over-and-over to point out why you should back up, I thought I'd just write a single post that I can refer to in future posts.   

 

So here's my nickel's worth on backups ...

 

(1)  ALL hard drives will fail ... it's not a question of IF, it's just a matter of WHEN.

 

(2)  A RAID system ... whether UnRAID or some other traditional RAID ... is NOT a backup.  It provides fault-tolerance, so a system can keep running when a disk fails; but the RAID can be corrupted; there's no protection from accidental deletions; power-spikes can easily damage multiple disks (resulting in significant data loss); etc.

 

(3)  ALL data that you don't want to lose should be stored in at LEAST two different places.  That's the whole idea of a backup.  Ideally one of those should be off-site, but for personal environments that's often not the case.  But at least have a backup!!  [My backups aren't off-site either; but they ARE stored in a waterproof, fireproof, data-rated safe]

 

(4)  A frequent excuse is that it's too costly to have backups.  I simply don't agree.  If your data is important enough to build a fault-tolerant server to hold it, and you've spent the time and money to acquire it, it's almost certainly worth backing up.  Obviously the economics vary depending on just what you're collecting, but using DVDs as an example:  if you store all your DVDs compressed to a single-sided DVD size of 4.7GB, you can store 212 of them per TB.  At ~ $40/TB (current disk costs) that's roughly $0.18 to back up a DVD.  If you store them completely uncompressed, the average size is probably 50% more than that, so it might cost $0.27 to back up.  BluRays would of course be more ... perhaps as much as $1 each (still a pretty small amount compared to the cost of everything else).  Considering the time to rip the DVD; do whatever processing you do [e.g. extract the movie; possibly recompress; perhaps change the format; etc.]; and catalog it with whatever media cataloging system you use, I'd think $0.18 or so to never have to repeat all that is well worth it.  "I'll just re-rip everything" is an easy excuse to avoid backups ... but when you lose a few hundred (or thousand) movies, that's suddenly not such a simple task.
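
For anyone who wants to plug in their own numbers, here's a quick back-of-the-envelope sketch of that arithmetic (the sizes and the $40/TB figure are just the examples above -- adjust to current prices):

```python
# Rough backup cost per title; the figures are illustrative, not gospel.
COST_PER_TB_USD = 40.0      # example disk price from the post
GB_PER_TB = 1000.0          # decimal TB, as drive makers count it

def cost_per_title(avg_size_gb):
    """Dollars of backup disk consumed by one title of the given size."""
    return COST_PER_TB_USD * avg_size_gb / GB_PER_TB

for label, size_gb in [("DVD compressed to single-sided size", 4.7),
                       ("DVD uncompressed (~50% larger)", 7.0),
                       ("Blu-ray (~25GB)", 25.0)]:
    print(f"{label}: ${cost_per_title(size_gb):.2f}")
```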

 

(5)  The cost I noted above is just for the disks to store the backups on.  In fact, that's all you really need, but it's even more convenient if you have a 2nd backup server that you simply run a synchronization utility against periodically.  If you set it up with WOL, you can remotely turn it on;  run the sync; and then turn it off.  Doing this once/week (or at whatever interval makes sense for you) is a very simple way to ensure you'll never lose any data.
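
If you go the backup-server route, that WOL-then-sync pass is easy to script.  Here's a minimal sketch of the idea in Python -- the MAC address, hostname, share paths, and boot delay are placeholders for your own setup, and it assumes rsync and SSH access between the two boxes:

```python
#!/usr/bin/env python3
"""Wake a backup server, sync to it, then shut it down again.
All names, addresses, and paths below are placeholders -- adjust for your setup."""
import socket, subprocess, time

BACKUP_MAC  = "00:11:22:33:44:55"      # MAC of the backup server's NIC
BACKUP_HOST = "backup-tower"           # hostname or IP of the backup server
SRC  = "/mnt/user/Movies/"             # share on the main server
DEST = f"root@{BACKUP_HOST}:/mnt/user/Movies/"

def wake(mac):
    """Send a standard Wake-on-LAN 'magic packet' (6 x 0xFF + 16 x MAC)."""
    payload = bytes.fromhex("FF" * 6 + mac.replace(":", "") * 16)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(payload, ("255.255.255.255", 9))

wake(BACKUP_MAC)
time.sleep(180)                        # give the backup server time to boot

# Mirror the share; --delete keeps the backup an exact copy, so leave it out
# if you'd rather keep files you've since deleted from the main server.
subprocess.run(["rsync", "-av", "--delete", SRC, DEST], check=True)

# Power the backup server back down over SSH.
subprocess.run(["ssh", f"root@{BACKUP_HOST}", "poweroff"], check=True)
```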

 

If you don't want the expense (and space) of a backup server, you can easily keep current backups by simply ensuring that every movie you copy to your server is also copied to a backup disk.  Just get in the habit of doing this, and it's simple to do.  I keep a "current" backup disk (labeled as "Backupsxx", where xx is the disk#) in an external disk caddy and simply copy everything that goes to the server to that disk as well.  When it gets full, I put it in a plastic DriveBox and put a new disk in the caddy (storing the old one in my safe).  I also save a copy of the disk's directory in a PDF file, so I can easily locate anything on any of my backup disks without physically accessing them.
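
That catalog step is easy to automate too.  A small sketch (the mount point and output path are placeholders; it writes a plain-text listing rather than a PDF, but the idea is the same):

```python
#!/usr/bin/env python3
"""Write a directory listing of a backup disk so its contents can be
searched later without mounting the disk.  Paths are placeholders."""
import os

BACKUP_MOUNT = "/mnt/disks/Backups07"                       # where the backup disk is mounted
LISTING      = "/mnt/user/Documents/Backups07-contents.txt" # where to save the listing

with open(LISTING, "w") as out:
    for root, _dirs, files in os.walk(BACKUP_MOUNT):
        for name in sorted(files):
            path = os.path.join(root, name)
            size_gb = os.path.getsize(path) / 1e9
            out.write(f"{size_gb:8.2f} GB  {path}\n")
```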

 

How much this costs depends on how quickly you're "growing" your array.    I add perhaps 200GB/month to my server, so I need about 2.4TB/year of additional backup space -- a cost of less than $100/year to maintain backups.  I consider that cheap insurance !!

 

(6)  In addition to backups, it's very convenient to have a way of confirming whether or not your data has been corrupted without having to do a full comparison of all your data against the backup disks [I've done this a few times, and it's a week-long process].    You can provide for this by keeping checksums of all your data on the UnRAID disks.    I started doing this a couple of years ago, and it's VERY convenient.    I use and recommend the excellent Corz checksum utility, which lets you do this from Windows [http://corz.org/windows/software/checksum/ ], but there are Linux-based options as well if you prefer to do it on the UnRAID box itself.    This provides a very simple way to check for any corrupted files ... which you can then recover from your backups.  [I also have checksums on all my backup disks, so I can just as easily confirm that they're all okay as well].
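
For the "Linux-based options" side of this, the whole idea fits in a short script on the server itself.  A minimal sketch (MD5 is plenty for detecting corruption -- this isn't a security application; the paths and manifest format are just what I picked for the example):

```python
#!/usr/bin/env python3
"""Create (or verify) an MD5 manifest for everything under a directory.
A minimal sketch of the checksum idea; the Corz utility does the same
job (and more) from Windows."""
import hashlib, os, sys

def md5_of(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def create(top, manifest):
    """Walk 'top' and write one 'checksum  path' line per file."""
    with open(manifest, "w") as out:
        for root, _dirs, files in os.walk(top):
            for name in sorted(files):
                path = os.path.join(root, name)
                out.write(f"{md5_of(path)}  {path}\n")

def verify(manifest):
    """Re-hash every file in the manifest and report anything that changed."""
    bad = 0
    with open(manifest) as f:
        for line in f:
            digest, path = line.rstrip("\n").split("  ", 1)
            if not os.path.exists(path) or md5_of(path) != digest:
                print("CORRUPT or MISSING:", path)
                bad += 1
    print("files failing verification:", bad)

if __name__ == "__main__":
    # usage:  checksums.py create /mnt/disk1 /boot/disk1.md5
    #         checksums.py verify /boot/disk1.md5
    if sys.argv[1] == "create":
        create(sys.argv[2], sys.argv[3])
    else:
        verify(sys.argv[2])
```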

 

(7)  HOW you back up isn't nearly as important as the simple fact that you DO it.  I've outlined how I do it with individual disks.  A backup server;  Crashplan; etc. are all alternatives that work fine.    If you have data that you truly don't care about, then that's fine to leave unprotected ... just be sure that you truly don't care about it.    A bunch of recorded TV shows clearly isn't as important to most folks as the last ten years of family photographs and home videos.    It's always better to err on the side of caution => I've seen MANY folks who were sick that they'd lost things; but have never found anyone who was upset that they'd backed up stuff they didn't really need.

 

(8)  If you don't have backups, and realize after a failure that you had important data on your failed disk(s), ONE recovery through a professional recovery service can easily cost more than a complete set of backups would have.  $500 - $1000 is a typical cost for professional recovery for ONE disk.  This can quickly change your mind about the importance of backups  :)

 

(9)  Nobody NEEDS backups.  Just as none of us NEED our movie/music/picture collections.    But if it would upset you to lose it, it should be backed up  :)    It may sound expensive to back everything up; but when you consider it relative to what you've spent to acquire the media, and the cost you've already got in your infrastructure to store it and play it, it's a very modest cost.    If you've ignored it while you grew a multi-TB collection, it can cost a bit to "catch up" ... but the ongoing cost to maintain good backups is very nominal.  And you can always use your older disks as backups as you replace them with newer, higher capacity drives ... thus further reducing the number of disks you have to buy just for backups.

 


A few thoughts on a somewhat related topic that also comes up fairly often ...

 

Parity check errors.

 

Correcting vs. Non-Correcting checks.

 

 

UnRAID has a built-in capability to check the entire array to confirm that the parity information is still correct, and to update it if not.  Note that BY FAR the most common cause of a parity error in the array is failure of a buffered write to the parity disk to be completed.  This can be caused by a power "glitch";  by an unsafe shutdown; by a software error (almost always in a plugin -- not the core UnRAID NAS); or by a memory error.

 

It is VERY rare that a sync error is due to an error on one of the data disks.  If a read error occurs on a data disk, UnRAID will re-write the data on that disk to correct the data ... and if the write fails, it will disable the disk (so no inappropriate change to the parity disk would occur).    So unless you have multiple drive failures at the same time, you can always recover a failed disk -- that's the whole concept of UnRAID's fault tolerance.    ... it's also why you want to periodically do a parity check to ensure your parity information is good.

 

Bottom line:  I can really think of NO reason to do a "non-correcting" parity check.    The whole idea of doing the check is to ensure your parity is good !!

 

Nevertheless, others disagree -- and it was the "clamor" by the UnRAID community that led Tom to add the option of non-correcting checks.  Unfortunately, that's all many folks do now, and there are frequent questions re: "what to do" when it finds sync errors.    The answer is simple:  run a correcting check !!

 

How is this issue related to backups?    Simple.  If you DO want to isolate whether or not a sync error is "real" (i.e. an error on the parity disk), or is due to corrupted data on one of the data disks, the only way to KNOW that for sure is if you have a complete set of backups or, as a minimum, a set of checksums for all your files.      Backups are best, as that guarantees you can do a bit-by-bit compare of your files, but checksums are certainly "good enough" to test whether or not a file has been corrupted (the odds of a corruption resulting in the same checksum are VERY low ... although it is possible).    One major difference, of course, is that if a checksum verification shows that a file has been corrupted, there's nothing you can do about it if you don't have a backup to replace it with.

 

A common question is "Which file was the sync error in?".    That's not an answerable question.  Remember that a sync error simply identifies the bit position within the longitudinal parity array where the error occurs.    There is one bit from EVERY disk that contributes to each bit of parity ... so the only thing that's computationally possible when there's a sync error would be to provide a "set of files" (one from every disk) that contributed to the parity sync error => but remember that the likelihood is that NONE of those files have errors, as it's far more likely that the error is an actual sync error on the parity disk.  IF you had a utility that would list that set of files;  and IF you then verified the checksum or compared the file to its backup copy for every one of those files, then you could indeed confirm exactly what had occurred.    But if you don't have either checksums or backups, then even if you have the list of "candidates", there's no way to confirm whether or not one of those was the reason for the sync error.
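
To see why that is, here's the parity relationship in miniature -- a toy example with one byte from each of three data disks (real parity works the same way, bit for bit, across every data disk in the array):

```python
# Toy illustration of longitudinal parity: one byte at the same offset
# on each data disk, XORed together to give the parity byte.
data_disks = [0b10110010, 0b01101001, 0b11000111]   # byte N of disks 1-3

parity = 0
for b in data_disks:
    parity ^= b
print(f"parity byte: {parity:08b}")

# A sync error at offset N only tells you that the XOR no longer holds.
# Flipping a bit on ANY one of the disks -- or on the parity disk itself --
# produces exactly the same symptom, so the check alone can't say which
# disk (and hence which file) changed.
data_disks[1] ^= 0b00000100          # corrupt one bit on disk 2
check = parity
for b in data_disks:
    check ^= b
print(f"parity mismatch at this offset: {check != 0}")
```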

 

In the 6 years I've used UnRAID, I've had 3 or 4 sync errors during parity checks.  In EVERY case these have been errors on the parity disk -- which I've confirmed by doing a complete comparison of every file on the array against my backups after each parity check that resulted in corrections.    It's been ~ 3 years or so since I've had a sync error; but the next time I see one I'll do it a bit differently, as I now have checksums for all my files stored on the array in addition to my backups ... so I can check everything without the need to pull out all my backups (which would now only be needed if a checksum verification failed).

 

Bottom line:  I never run non-correcting checks, and really don't think they're necessary.    But if you DO want to use that alternative, be sure you have the ability to confirm whether or not you actually have any file corruption (by either verifying checksums or comparing with backups) and some means of dealing with it if you do (by either deleting the corrupted file or replacing it with your backup).

 

If it makes you feel better, you could always do this:

(a)  Run a non-correcting check

(b)  If any sync errors are found, run a checksum verification against your entire array, and note any corrupted files -- either deleting them or replacing them from backups

(c)  Now run a correcting check

 

Whatever you do, you do NOT want to leave your array in a state where you KNOW there are sync errors in your parity.    That defeats the whole idea of fault-tolerance.

 


Nice write-up gary.

 

If a read error occurs on a data disk, UnRAID will re-write the data on that disk to correct the data

 

If I remember correctly, you are storing your backups on single drives.

How often do you check your backup drives to ensure that you don't have errors on them?

 

Prior to using unRAID I also had my data stored all over the place on different drives.

After having the server ready I found that some files were not readable when I moved them to the array.

 

The monthly parity check ensures that the data on the array is readable, but if you store the backups on single drives and don't verify them regularly you may be lulled into a false sense of security.


My backups have checksums for all the files included on them.  When the backup drive gets full, I run a full verify on it; then put it in a DriveBox and store it in my safe (waterproof, fireproof, data-rated).    About once/year I pull all the backups out and run a checksum verification on them ... then put them back.

 

I don't mention it in this write-up, as I don't think the average person needs to do it ... as you know I'm a "backup fanatic" ==> but I also have a complete spare UnRAID server that's powered up once/month to sync backups from my other two servers to it  :)    So all of my data is on, as a minimum, two fault-tolerant servers and a backup disk in my safe; AND all my non-media data (pictures, documents, spreadsheets, etc.) is also on at least 3 other hard drives on my two main computers, ALL of which are UPS-protected.  [The non-media data is also backed up to the cloud.]

 

 


I don't like the idea of losing any of the data I've spent so many years collecting ... whether it's family pictures, financial records, or just all the music and movies I've collected over the years.    When you factor in the cost and time involved in that, a couple thousand bucks to have it VERY well backed up is really no big deal.

 

 


Very good write-up, wiki-worthy!

 

The necessity of backups can't be stressed enough. I know a guy who lost all records and photographs of the first five years of his son's life due to an HDD failure - he didn't have backups, and they were gone. I can't even imagine how that made him feel. As garycase said, re-ripping hundreds of movies will take a long time. I ripped all my 300+ CDs 10+ years ago, and it took me weeks - and I vowed I'd never do that again.

 

The checksum idea is very good. I'd verify them at least once a quarter to catch creeping data corruption or drive failure due to disuse. I'm not aware of any research on drive failure rates when they go unused for long periods of time, but since drives are mechanical machines, I can see it as a potential risk.

 

garycase, how do you do your checksums? Do you run a massive checksum list for every file on each disk? How do you deal with changing data disk content, or is your data all static?

 

For Windows, SyncBack Pro does checksum checks for you. Here's my basic tutorial for running MD5 checksums on Linux systems. I'll tweak it as I get my unRAID backup system up and running.

 

Below is what I used to do on my non-unRAIDed Windows system, as I wrote in another thread. I will tweak this now that I have unRAID, utilizing the checksum approach.

 

When I was on Windows, I used SyncBack Pro as well. Here's an overview of my backup system.

 

I had incremental backups that ran every day of the week, with a full refresh once a week. Then I had a full backup once a week on another drive, fully cleared at every run. And another full backup which was refreshed once a month, culling deleted files every six months. The latter two drives were "offline", i.e. not attached to my desktop, for security purposes in case of data corruption or electricity issues (I have a UPS but I'm paranoid :) ).

 

That way I would have daily backups of all critical data, and multiple backup levels as a fallback. If I accidentally delete a file or notice data corruption, I'd have it handy for up to six months.

 

Going forward, I will be rotating two HDDs through an offsite location (one at home, the other at work), which protects me from theft, fire, water damage, etc. Crashplan used to fill that role - I will continue using it, but will also rotate HDDs.


I have a question on how to properly use corz checksum.

 

Scenario:

You have created checksums for a folder which contains file "A". You add a new file "B" into that folder. You can now right-click, select create checksum, and click synchronize to add the newly created file "B" to the list. Now I want to verify everything is still the same, so I right-click, select verify on the folder, and it tells me "success, no errors found".

 

Next I modify file "A" which in this example is a text file. Once file "A" has been modified the checksum changes. If I understand this correctly, synchronizing the folder will not change file "A" because the synchronize function only adds new items? Therefore, I then have to create new checksums and overwrite the original checksum file in order to make sure file "A" has been properly updated?

 

Assuming this is the case, I would want to run a verify on the folders and then check the error logs to see which files have changed, and double-check that the file should have been changed. Then, once I am sure that the file was supposed to have changed, I overwrite the checksum?


Nice write up - some sound advice...

 

I work in IT B2B sales, and I'm amazed at the number of clients who wince when I quote a solution for their business and tell them that the backup solution + software and media is often as dear as the server I quoted... and often they reject the backup and rely on RAID5 to safeguard their business. To emphasise the importance of backups I often tell them to come in tomorrow, not turn their PC on, and see how much work they can do - this makes them realise just how important their data is...

 

However, as a home user it isn't practical to spend £1000s on backup drives, Backup Exec, etc... so this is what I do:

 

I use UNRAID for my movies, documents, photographs, music, etc - so I pretty much have everything on it. From the data that's on there, I have some which I consider irreplaceable - my photographs and documents mainly... these total a little under 1TB - even though my data on UNRAID is about 7TB... yes, I would be upset to lose my video collection, or my music - but the way I look at it is that I could replace this if I had to - my other data I would never be able to recreate :(

 

So I have a USB3 1.5TB host-powered hard disk, encrypted using TrueCrypt, connected to my main PC. I back up my "critical" data regularly (every night at 2am) to this USB drive using Cobian Backup 11 Gravity - this has a "pre-backup" script to decrypt & mount the USB drive, run the backup, and then dismount the disk when the backup is complete - so it is encrypted again.

 

Then... just when you thought it was all over - I have another PC sitting at my Dad's house which also has a USB3 hard disk connected and encrypted as above - although this one is decrypted all the time the PC is on (always), so that I can get remote access to it, but it is still encrypted using TrueCrypt when someone disconnects it. I use Syncrify Client to back up my "critical" data over a VPN connection to this USB disk on the remote PC - which actually takes very little time as it is a bit-level backup, so a 2GB PST file might only have a few bits changed and the backup time is very short, unlike other solutions which would upload the whole file again!

 

So... by doing this I reckon I am covered against accidentally deleting files, or against multiple disk failures in my PCs and UNRAID occurring together. By encrypting the backup disks, if some toe-rag breaks in and steals the conveniently light USB drives then my data is secure. The first time I backed up the "remote" USB disk I copied the data onto it locally and then took the disk over to my Dad's - otherwise, yes, this would have taken months to copy the whole lot over the first time ;)

 

OK, it might seem an elaborate backup but I don't have any photos of my family that are not digital, etc... so I don't want to lose these  :D

 

I reckon that if in one day I have all the disks in UNRAID fail, a USB disk fail and the USB drive at my Dads house being stolen and both houses burning down - then I guess someone really doesn't want me to keep my data!!

 

I just thought I would mention my setup as it might give others an idea of how to build a resilient backup for very little, as most of the software above is free for personal use... just Google it :)

 

To any normal person I would say - decide what is irreplaceable to you - buy a couple of USB disks which can hold it all, and once a week back up onto one and store the other at a relative's or friend's house so that if, god forbid, your house burns down, you will still have the majority of your data somewhere else. Just remember to swap the disks over every couple of weeks.

 

There - I said it... sorry it's so long - but it might help someone :)


While I agree with the basic premise [i.e. only back up your important data -- or, as I said it above, back up "... ALL data that you don't want to lose ..."],  I also think that many folks under-estimate what that set of data is -- until they lose it !!  :)

 

As I said earlier, "... If your data is important enough to build a fault-tolerant server to hold it, and you've spent the time and money to acquire it, it's almost certainly worth backing up. " ==> Yes, I know you can always "recreate it" by re-ripping all your media; redoing the catalogs;  re-compressing anything that needs to be in a different format; etc.    But if you think about all the time and effort involved in doing all that, I suspect you may find that backups aren't really all that expensive !!  8)

 


As I said earlier, "... If your data is important enough to build a fault-tolerant server to hold it, and you've spent the time and money to acquire it, it's almost certainly worth backing up."

 

I couldn't agree more - but what I do works for me... with 7TB of "data" I would struggle to routinely back up all of that, as I would need another server to accommodate it, and to also back up offsite would mean having to have a 3rd server... then when I add an 8GB ISO to my main data store, the offsite backup that night would run for a month to upload it :(

 

So, whilst I agree, for me it's not really practical to back up everything - but I am sensible enough to realise that if I lose a film from my collection, to me, it wouldn't be the end of the world... but if I lose the only photos of my daughter's 1st birthday party I would be really upset!!

 

So I guess it's a balance of weighing up cost, time and resources against the importance of your data - the purpose of my post was just to give others an idea of what can be achieved for very little money (most of the software I use is open source / free for non-commercial use), giving a better level of redundancy than a lot of businesses have ;)

 

Nonetheless some really good information in the thread.

 

Thanks.


When the tape drive I was using for backups started to die back in about 2004 or 2005 I ended up writing my own backup utility, initially to store the backups to DVDs and then as the cost of hard drives dropped I switched to using external drives.  This utility is written in Python and I use it to backup my unRAID server to removable drives attached to my Windows desktop.  It is built on the notion of a single full backup followed by an unlimited number of incrementals, so while the first backup takes a lot of time the incrementals run pretty quick.  Typically I run an incremental pass on the weekend to grab all the new media files, a process that might take a half hour or so. 

 

The backups are written in user-configurable chunks, typically about 500MB (the system will automatically split large files across multiple chunks), to a drive in my Windows desktop machine.  From there they get copied to an external drive in one of my backup media sets.  I have two media sets, one is kept at a remote location (to further protect against fire, flood or theft - but not far enough away to protect against a meteor strike).  Periodically I will take the external drive I am currently saving backups to over to the external location, swap it for the last disk in that set and bring that disk back.  When I return with the swapped disk I then update it with the backup chunks that were kept on the workstation in its absence and then I can delete those from the work station and repeat the process. 
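
For anyone curious what the chunking half of a scheme like that looks like, here's a rough sketch of the general idea -- this is not ArcvBack's actual code, just an illustration of streaming files into fixed-size chunks and splitting large files across chunk boundaries (a real tool would also record an index of which byte ranges belong to which file):

```python
import os

CHUNK_SIZE = 500 * 1024 * 1024      # ~500MB chunks, as described above

def write_chunks(paths, out_dir):
    """Stream the given files into numbered chunk files of roughly CHUNK_SIZE
    bytes, splitting files across chunks when necessary (illustrative only)."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_no, used = 0, 0
    out = open(os.path.join(out_dir, f"chunk{chunk_no:05d}.bin"), "wb")
    for path in paths:
        with open(path, "rb") as src:
            while data := src.read(1 << 20):        # copy in 1MB pieces
                if used >= CHUNK_SIZE:              # current chunk is full
                    out.close()
                    chunk_no, used = chunk_no + 1, 0
                    out = open(os.path.join(out_dir, f"chunk{chunk_no:05d}.bin"), "wb")
                out.write(data)
                used += len(data)
    out.close()
```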

 

In this way I have quadruple redundancy for all the backed up data almost all the time:

 

1. the unraid disk where the data resides

2. the unraid parity protection (not truly a copy, but close)

3. the copy on the workstation internal cache drive

4. the copy on the local external drive

 

Once the data is swapped over to the remote site, items 3 and 4 become the local external drive and the remote external drive.

 

About once every year or two I restart the whole process, because by then I'll have some higher-capacity drives that I can use to remove the older (and smaller) backup drives from service.  The last time I did this I was able to retire a handful of 500GB drives, replacing them with 2TB units that I had removed from the unRAID box when I started moving to 4TB drives.

 

The data on the external drives is checksummed both at the chunk level and at the individual file level.  And the database that manages this has a SHA1 hash of all the individual files as well, so in theory I could use it to check against the current contents of the unRAID server without having to access any of the external drives.  But I've not written that code yet.

 

The backup utility is called ArcvBack and is available on:

http://arcvback.com/arcvback.html

 

It currently uses Python 2.5; one of these days I'll have to update it to the Python 3.x series.

 

Regards,

 

Stephen


A few thoughts on a somewhat related topic that also comes up fairly often ...

 

Parity check errors.

 

Correcting vs. Non-Correcting checks.

 

 

There have been some situations where a non-correcting parity check would have come in very handy. Imagine running a parity check that generates thousands and thousands of parity errors. That might indicate that a disk was corrupted, and perhaps should be rebuilt from parity rather than vice versa. In such a case you would not want to blindly update parity.

 

I also remember a user who got a few parity errors every time he ran a check. He didn't know if it was parity or a disk that was causing the issue (memory tests were good). If it was a disk failing, updating parity was actually reducing the ability to recover. Being able to perform non-correcting checks would allow diagnostics to continue without creating more corruption.

 

I suggested that the parity check should always be non-correcting, but should remember the location of each parity anomaly.  If everything looked good and there were a couple of parity errors that the user could rationalize, he/she could request that they all be corrected (with the locations known, this would be near-instant). But if the parity errors were extensive, the user would have some tools to determine what file was on each disk at each parity error location, to try to isolate the issue.

 

This was quite a while ago, but this is what I am remembering.

There were some issues under which the non-correcting parity check would have come in very handy. Imagine running a parity check that generates thousands and thousands of parity errors. Might indicate that a disk was corrupted, and perhaps should be rebuilt based on parity rather than vice versa. In such a case you would not want to blindly update parity.
I have personally lost data because of a correcting parity check. I have tried to stress the importance of NOT writing ANYTHING to the disks if you suspect there is a problem, until you know the nature of the problem. I really wish there were a way to boot the array into a diagnostic mode in which there would be no writes of any sort -- read-only for all data.

 

This topic has been debated to death in the past, and suffice it to say, in an ideal world where the disks and controllers behave as designed when an error is detected, a correcting parity check is the correct action. If something isn't behaving as it should, you can easily get into a situation where you really don't want to write to a drive without running further diagnostics.


As I've noted before, I've never run a non-correcting check ... and likely never will.    On the other hand, I certainly wouldn't start a parity check if the disk errors column wasn't all zeroes.  If a disk is having read errors, then I'd replace the disk, whether or not it's been red-balled (which UnRAID only does on failed writes ... not on failed reads).    In fact, I've done that once.

 

In ~ 6 years of using UnRAID  I've only had a few sync errors during parity checks -- and in EVERY case the errors were in fact on the parity drive, so it was just fine that it corrected them  :)

 

I might run a non-correcting check if there was a convenient tool that would show the possible list of corrupted files for each error ... but since there's not, I don't bother.  Barring a disk failure ... which would show either in the errors column or by the disk being red-balled ... there's a very high probability of sync errors simply being errors on the parity disk for one of the many reasons Tom has outlined in other posts on this topic.    He did, in v5, add the option of non-correcting checks for those who want them ... but I just don't see the need.  If I had any doubts about data integrity after a check (i.e. a lot of sync errors with corresponding read errors on one of the drives) ... I'd just do an integrity check on the drive with the errors and replace any corrupted files.

 

 


I certainly wouldn't start a parity check if the disk errors column wasn't all zeroes. 

I have my browser HOME Screen set to the Tower/Main GUI page. I automatically scan the drives 'Errors' Column several times a day as a result.

 

It *would* be very nice if there were some tool available that would identify the file on each physical drive that corresponds to the bad parity byte... but how to do that is way, way, way over my head. (Or is it at a low, low, low level of the OS?) Either way, I can't help.  :(


As I've noted before, I've never run a non-correcting check ... and likely never will.    On the other hand, I certainly wouldn't start a parity check if the disk errors column wasn't all zeroes.  If a disk is having read errors, then I'd replace the disk, whether or not it's been red-balled (which UnRAID only does on failed writes ... not on failed reads).    In fact, I've done that once.

 

In ~ 6 years of using UnRAID  I've only had a few sync errors during parity checks -- and in EVERY case the errors were in fact on the parity drive, so it was just fine that it corrected them  :)

 

I might run a non-correcting check if there was a convenient tool that would show the possible list of corrupted files for each error ... but since there's not, I don't bother.  Barring a disk failure ... which would show either in the errors column or by the disk being red-balled ... there's a very high probability of sync errors simply being errors on the parity disk for one of the many reasons Tom has outlined in other posts on this topic.    He did, in v5, add the option of non-correcting checks for those who want them ... but I just don't see the need.  If I had any doubts about data integrity after a check (i.e. a lot of sync errors with corresponding read errors on one of the drives) ... I'd just do an integrity check on the drive with the errors and replace any corrupted files.

 

With the current state of hard drive technology, and all of the safeguards built into the disks themselves, chances are a user can go their whole life and never have a "real time failure event". And unRAID gives us two additional levels of protection.

 

1 - the obvious ability to recover if a disk fails, and

2 - it puts us (many of us anyway) in a certain uber-diligent mindset of being proactive about disk issues and taking swift action when the tell-tale signs of a pending failure begin to show

 

So in a world where the chances of data loss are already low and we want to further reduce risk, we start to explore less and less likely situations -- situations that many users will never see. So the fact that you or I or 100 other unRAIDers have never seen something is not necessarily a good reason to dismiss it as a problem. Even in this thread we see a user who HAS lost data as a result of the current parity check scheme, which is enough evidence that it is worth discussing.

 

So if something goes screwy, for example a disk starts spewing garbage in the middle of a parity check (RobJ is going to challenge me on this, I know it is coming), you instantly lose your ability to recover that disk. Instead you'd be recovering from backups while other users would be re-ripping disks for the next month, and others crying in their oatmeal. The scheme I outlined, which runs the parity check in non-correcting mode and then lets the user review the corruptions (if there are any) before allowing them to be applied, would add just one more "9" (in the six sigma sense) to the safety that unRAID provides.

 

The ability to see what files are impacted at a certain disk location would be hard to do. It would require a very detailed knowledge of the underlying file system structure. But even lacking this feature, remembering the parity check locations and putting the human in control of whether to let them be applied would still be worthwhile.

As I've noted before, I've never run a non-correcting check ... and likely never will.    On the other hand, I certainly wouldn't start a parity check if the disk errors column wasn't all zeroes.

 

And here is the issue. UNRAID AUTOMATICALLY STARTS A CORRECTING PARITY CHECK AFTER A CRASH. It doesn't give you the option to evaluate the situation before it starts writing data to the parity disk.


As I've noted before, I've never run a non-correcting check ... and likely never will.    On the other hand, I certainly wouldn't start a parity check if the disk errors column wasn't all zeroes.

 

And here is the issue. UNRAID AUTOMATICALLY STARTS A CORRECTING PARITY CHECK AFTER A CRASH. It doesn't give you the option to evaluate the situation before it starts writing data to the parity disk.

 

I agree!

If unRAID detects an error, it should not start the array and it should not automatically conduct a parity check!


I agree!

If unRAID detects an error, it should not start the array and it should not automatically conduct a parity check!

I think your statement is too broad!

 

I think then the issue is not that a parity check is being started, but that it is a correcting parity check, which can result in writes to the parity disk.  If an unclean shutdown is detected (whatever the reason) and the parity check were a non-correcting one, then most users would not notice anything much happening if the shutdown was caused by something like a power failure, but their array would still be checked for integrity.

 

What I do agree with is that a correcting parity check should not be auto-started outside user control.  As has been mentioned this can lead to data loss under certain (albeit rare) circumstances.  You also do not want an automatic parity check if any disk has been red-balled due to a write failure for the same reasons.

 


You also do not want an automatic parity check if any disk has been red-balled due to a write failure for the same reasons.

 

UnRAID will not start a parity check -- automatic or manual -- if there's a red-balled disk.  So that's not an issue.    The reason it starts an automatic check if it detects an unclean shutdown is that an unclean shutdown is a very likely cause of the parity disk not completing any pending updates -- so it's likely that parity errors were induced by the system.  Note that since Reiser is a journaling file system it's VERY unlikely that any issues would be induced on the data disks themselves, but the parity disk updates are not journaled transactions.

 

I understand the conceptual reasons folks want non-correcting checks ... but the question is what are you able to do with the data?    A non-correcting check lets you know there are sync errors, but doesn't fix them.    So unless you know (a) what location every identified error was at;  (b) which set of files those locations could impact (one on every disk that had data at that location);  and (c) whether or not each of those files is currently good (requiring either checksum data or a backup to compare with) ... then the knowledge that there are errors isn't of much use.    Note also that if you DID replace a file that had been corrupted (and was therefore causing the sync error), you'd STILL have sync errors (actually a LOT more of them) after you replaced the corrupted file ... so you'd need to run a correcting check afterwards.

 

I haven't kept a record of it, so I don't know for sure how many times I've had sync errors ... but I'd guess it's in the range of 6-8 times over 6 years (most in the first couple years when my hardware wasn't as reliable) -- and as I noted earlier, EVERY sync error that was corrected was a legitimate sync error on the parity disk => NONE were on any of the data disks.  There are, of course, potential circumstances that could result in an error being on the data disk ... but the likelihood of this is VERY low.    For the typical UnRAID user, it's far better to just run correcting checks.    IMHO this forum has got too many folks running non-correcting checks and then wondering what to do about the errors (which are almost certainly on the parity disk) -- often to the point where they're paranoid about correcting them.

 

I agree, by the way, that if there were good tools that would allow bjp999's idea to be implemented ["... runs the parity check in non-correcting mode and then lets the user review the corruptions (if there are any) before allowing them to be applied "] it would allow a more-knowledgeable user to at least "click" on each correction before it was applied.    But unless the tool that allowed this was displaying the potentially involved files (and, again, there was a way to validate whether or not each file was good),  then accepting these changes would become about as automatic as accepting software licensing agreements when you install software  :)

... which is effectively just a correcting check with a bunch of extra "clicks" by the user  8) 8)

 
