Should we enable "Write corrections to parity"?


squashem


I searched the threads here for how often to run the parity check. The general consensus seems to be that running it once a month is a good idea.

When the parity check runs, is it a good idea to turn on the "Write corrections to parity" checkbox?

Please take a look at the attachment.

Link to comment

There has been a lot of discussion about this over the years. I guess the recent consensus is to let the scheduled parity check be non-correcting so it won't change parity, just in case you have some reason to suspect a problem with another disk.

 

If it does find any parity errors with the scheduled parity check, and you don't know of any particular suspect disk, then you would have to correct parity by running another parity check, a correcting check.

Link to comment

As noted, there are varying opinions. Personally, I always run it with the box checked -- if there are any errors, I want them corrected. Ask yourself a simple question: What are you going to do if the check finds errors? There ARE cases where the errors are due to a loose SATA cable or a misaligned disk in a drive cage ... and if you've recently "fiddled" with the server and think you might have one of these issues you may want to run a non-correcting check. But in general detected errors are just that -- errors -- and they should be corrected.

 

There is one time I recommend doing a non-correcting check => immediately after you've done a disk rebuild (either because of a failed disk or simply because you're replacing a disk for some other reason ... e.g. with a larger disk).

 

Notwithstanding the arguments for and against, if you want to always run non-correcting checks, that's fine.    Just note that any time an error is detected you'll then need to run a correcting check to fix it  :)    [Or you can reseat all your disks/cables and try the check again -- but in the vast majority of cases an error is indeed an error.]

 

 

Link to comment

I think you're running a risk by scheduling correcting checks - suppose a disk fails, or simply drops off-line, during a correcting check.

 

It's far safer to schedule a non-correcting check. Most of the time the result will be no errors found. On the very odd occasion an error is found then yes, you need to run a correcting check - but on your own terms and after taking whatever remedial action you see fit first.

 

Link to comment

As I noted, there are varying opinions  :)

 

Unless you've "fiddled" with the server (and thus suspect a loose cable), or have some other good reason not to do any corrections (e.g. after a disk rebuild, when a non-correcting check is a good way to confirm the rebuild went well), running a non-correcting check will almost certainly result in running a correcting check right after it if any errors are found -- and the scenario of a disk "dropping off line" can happen during that check just as easily as during any other check.

 

I've NEVER run a non-correcting check except for confirming disk rebuilds in the 8+ years I've been using UnRAID ... and the few sync errors I've had have ALWAYS been legitimate parity errors that the check properly corrected.

 

 

Link to comment

Thing is, Gary, you run the risk of a disk dropping off-line during a parity check every month. I risk it once in a blue moon. So on those rare occasions I do need to run a correcting check I have the opportunity to check first that the cables aren't loose. On one server I've never run a correcting check - just the initial parity build 18 months ago and the more recent conversion to dual parity, and monthly non-correcting checks that have always revealed zero errors. If one day it does reveal an error I'll certainly do a bit of investigating before I start a correcting check.

 

Anyway, what's the point in having a server and not fiddling with it?  :D

Link to comment

Simply a matter of choice -- as I noted earlier, different folks have different preferences.   It certainly doesn't "hurt" anything to do non-correcting checks ... my view is simply that if the non-correcting check identifies an error, the fix is almost certainly going to be a correcting check -- so I simply always do correcting checks.

As I noted, the one exception is if I've done a drive rebuild -- in which case I run a non-correcting check after the rebuild to ensure the rebuild went well (if it didn't, there will be a parity error => but since you haven't changed parity you can repeat the rebuild).

I've never had (and don't expect to have) a loose cable -- I use only high-quality locking SATA cables and am VERY careful when I make any physical changes to the server to confirm that everything is seated properly.   As I noted above, on the few occasions when I've had a parity sync correction, it has ALWAYS been a valid correction of parity, and NOT an error on a drive.

... and I do very little "fiddling" :D

Link to comment

IMO, and based on my experience, scheduled checks should always be non-correcting, mainly for these reasons:

1) Parity checks stress all disks. Every disk may have a perfect SMART report before the check starts and one (or more) can still fail during it, and if there's a failure, parity may be corrupted with no way to correct it.

2) In normal usage there shouldn't be a single sync error during a parity check. I don't remember ever getting a single unexpected error in almost 10 years of using unRAID, so the scheduled non-correcting parity check should always return 0 errors and there's no need to run a correcting one.

The only time I run a correcting check is after an unclean shutdown. In that case it's normal for there to be some sync errors and it's best to fix them ASAP.

Edited by johnnie.black
Link to comment
  • 8 months later...

I would add my voice to the 'Do not write corrections to the parity disk'  side of this discussion.

I recently had a power outage that extended beyond my UPS (Yes I need to configure auto shutdown).

I was off site and asked my wife to restart the server. I connected to the system a day or so later and found a lot of files missing on one of the disks. The parity check was about 90% complete and had written hundreds of millions of corrections to the parity drive. Once the parity check was complete, I corrected the drive and found thousands of corrupted files in the Lost and Found. I could not recover the drive because the correct parity had been overwritten.

In the future, I want to make the decision as to whether the parity or data drive is corrupted.

 

Link to comment

Parity typically will not help with corruption.

Link to comment
20 hours ago, robertyoung127 said:

I would add my voice to the 'Do not write corrections to the parity disk'  side of this discussion.


 

What filesystem were you using? Xfs, btrfs, or rfs?

Link to comment
  • 1 month later...

I find this "write corrections to the parity disk" idea very, very confusing. The whole point of parity is to be able to fix errors in your data if any are found. Why would you want to update the parity disk with what you are currently reading if there are errors? You want to correct the errors, not write new parity so it passes with the current data. The way this is worded in the settings makes me believe that this is indeed what is happening, that parity is being incorrectly updated based on current data, and the discussion in this thread only affirms it. That is, a "correcting" parity check is actually just throwing parity out the window and completely ignoring it, and is simply rebuilding parity with whatever data is currently on the data drives. Why is this something you would want to do? This seems like an insanely bad idea. Have I actually got this wrong? Because the discussion here seems to say I've actually got it right.

 

Given the feature's description and the discussion in this thread, this is what I think would happen if I took a disk to another machine, made corrupting changes to tons of files on it, and put it back into the unRAID machine and ran a correcting parity check.  The corrupted files would cause a parity check failure, and the parity disk would be updated to contain new and current parity data based on the untouched disks and the intentionally corrupted disk.  Why would anyone ever want that to happen?  You should want the corrupted data on the intentionally corrupted disk to be returned to the pre-corrupted state.  You want to update the array based on the existing parity data.  You don't want to update the parity disk to reflect the current state.  Please, please tell me I have this wrong.  Because if I don't have this wrong, unRAID's parity check is completely useless.  I must have this wrong.  But it sure seems like I have it correct in my head, given what I've just read.

Edited by _Shorty
Link to comment
6 minutes ago, _Shorty said:

Because if I don't have this wrong, unRAID's parity check is completely useless

Write corrections to parity is optional. You don't need to, and IMO shouldn't, run a correcting parity check for the scheduled parity checks. You should run one after, e.g., an unclean shutdown, so both options are necessary and available.

Link to comment

Parity cannot tell you where the problem is. In your hypothetical scenario, since you know where the problem is, you would use parity to rebuild the disk you had corrupted on another system.

 

In a more typical scenario, where a disk has failed, parity will let you rebuild it.

 

If you have an unclean shutdown, you could have some parity that was not written and so would be out of sync. So parity needs to be corrected. It is also possible that you had writes to a data disk that were not flushed either, so the data disk could also be wrong. But parity can't help you with that.

 

In any case, you need parity to be in sync. If you let parity errors accumulate, you will only be making things worse. In the absence of any other information about which disk to correct, you correct parity.
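
To make the "parity cannot tell you where the problem is" point concrete, here is a minimal sketch of single parity as a byte-wise XOR across the data disks. It is a toy model with made-up four-byte "disks", not unRAID's actual code, and the helper names (`xor_parity`, `sync_error_offsets`) are invented for illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def sync_error_offsets(data_blocks, parity):
    """Return the byte offsets where stored parity disagrees with the data."""
    expected = xor_parity(data_blocks)
    return [i for i, (p, e) in enumerate(zip(parity, expected)) if p != e]


# Three toy data disks and the parity computed while everything was healthy.
disks = [b"\x10\x20\x30\x40", b"\x01\x02\x03\x04", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(disks)

# Case A: data disk 1 silently changes one byte (offset 1).
bad_data = [disks[0], b"\x01\xff\x03\x04", disks[2]]
errors_a = sync_error_offsets(bad_data, parity)

# Case B: the data is fine, but the parity disk missed a write at offset 1.
bad_parity = bytearray(parity)
bad_parity[1] ^= 0xFD
errors_b = sync_error_offsets(disks, bytes(bad_parity))

# Both cases report the same offset; nothing says which disk is at fault.
print(errors_a, errors_b)

# If you already KNOW disk 1 is the bad one, parity plus the other disks
# rebuild its original contents.
rebuilt = xor_parity([bad_data[0], bad_data[2], parity])
print(rebuilt == disks[1])
```

The check only sees that the XOR no longer adds up at some offset. Whether a data disk or the parity disk is wrong has to come from somewhere else, such as logs, SMART data, or checksums.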

Link to comment
3 hours ago, trurl said:

If you have an unclean shutdown, you could have some parity that was not written and so would be out of sync.

And there isn't really any way around this with a software-based solution. Only a RAID controller card with battery backup can accept multi-disk writes from the OS and then make sure all writes will reach the disks after reboot, in case of a power failure.

 

That's also why people like to use a UPS with their servers, to reduce the possibility of half-written multi-disk data.

There isn't really any difference compared to traditional software RAID in Linux: if it sees on boot that one mirror disk has a higher event counter than the other, it starts a background job to fix this by scanning the drives and copying corrections from the drive with the higher event counter value.

So it's just a question of what expectations are realistic, and of understanding the inherent limitations.

 

A correcting parity scan will make unRAID - and other RAID solutions - accept silent errors and make them permanent. A single-parity system will never be able to figure out which of the disks has the silent error. What such systems are good at is recovering broken disks, not handling silent errors. That's also why newer RAID systems try to complement the parity drives with data checksums, either at the file system level or through secondary means.
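
As a rough illustration of why checksums complement parity, the sketch below models a correcting check as "recompute XOR parity from whatever the data disks hold now". This is a simplification under that assumption, not unRAID's implementation, and `correcting_check` is a hypothetical helper name.

```python
import hashlib
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def correcting_check(data_blocks, parity):
    """Toy correcting check: count mismatched bytes, return rewritten parity."""
    expected = xor_parity(data_blocks)
    errors = sum(1 for p, e in zip(parity, expected) if p != e)
    return errors, expected


original = [b"family photos", b"tax documents", b"music library"]
parity = xor_parity(original)
checksum_before = hashlib.sha256(original[1]).hexdigest()

# One byte on disk 1 silently rots ("o" becomes "0").
damaged = [original[0], b"tax d0cuments", original[2]]

first_errors, parity = correcting_check(damaged, parity)   # finds and "fixes" it
second_errors, parity = correcting_check(damaged, parity)  # now looks clean

# Parity agrees with the data again, but a rebuild of disk 1 from it would
# only reproduce the damaged contents.
rebuilt = xor_parity([damaged[0], damaged[2], parity])
checksum_after = hashlib.sha256(damaged[1]).hexdigest()

print(first_errors, second_errors)        # 1 0
print(rebuilt == damaged[1])              # True
print(checksum_before == checksum_after)  # False
```

After the correcting pass the array is "in sync", yet only the stored checksum still knows that the file changed.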

Link to comment
16 hours ago, _Shorty said:

if I took a disk to another machine, made corrupting changes to tons of files on it, and put it back into the unRAID machine and ran a correcting parity check.

 

This scenario would, of course, cause parity to be incorrect and generate parity errors.    Yes, you're correct that a correcting check would then update parity so it reflected the current "... intentionally corrupted disk ..."  => as it would have no idea that writes had been done to that disk.   In a more realistic scenario, if the disk had encountered a lot of write errors while in the array, there would have been reported errors and the disk would have been "red-balled" ... in which case it would have been taken off line.    You could have then done a rebuild to the same (or a different) disk and the original data would have been restored.    You could do the same in your scenario by forcing a rebuild of the disk after you re-installed it, since YOU would know that it was corrupted.

 

Basically, removing a disk from the array and writing to it on a different system is never a good idea (except in severe recovery scenarios), since this will guarantee parity is no longer valid. If, instead of removing it and "intentionally corrupting" it, you had removed it and written a bunch of new data to it on another system, then the changes to parity with a correcting check would have been exactly what you wanted -- parity would now reflect the actual contents of all of the disks.

 

The bottom line is that parity is not a panacea -- it does not eliminate the need for backups, nor is it able to correct all possible corruptions. It simply provides insurance against a single (or dual) drive failure causing data loss as long as the system is working correctly.

Link to comment

As Johnny said, in normal use, either a correcting check or a non-correcting check will find the same zero errors. It is only after a hard shutdown that there is a legit reason for a parity sync issue, and in such a case, running a correcting check first would avoid a second parity check.

 

But if you ever suspect a disk may have been corrupted, you don't want to run a correcting check. Reasons you might suspect include failing drive attributes, log entries pointing to drive problems, a failing memory chip (which has been corrected), or running a drive outside the array (for example, to run a recovery process on a different platform). In other words, if you are doubting a disk and want to run a parity check, do not run a correcting one. A non-correcting check would provide peace of mind if it comes back clean, or if not, at least give you a sense of the scope of the corruption as you consider what to do.

 

If you had corruption, you'd really want precomputed checksums (e.g. md5 or BTRFS scrub) to compare to the current disk data. And/or data backups. Without them, you might consider pulling the suspect disk from unRaid. UnRaid will simulate the removed disk using parity and the other drives. And you can mount the actual removed disk as a UD (Unassigned Devices) device. You would then have both the real and emulated disks online at the same time and can compute / compare checksums to see if corruption (differences) exists, and if it does, try to figure out which is correct (which may not be easy or even possible without a source of truth). After the analysis you'd have more data to point to the next step.
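
The compute / compare step could look something like the sketch below. It walks two directory trees, takes an MD5 of every file, and lists files whose contents differ or that exist on only one side. The function names are made up for this example, and the demo uses throwaway temp directories as stand-ins for the two mount points. On a real server you would pass the mount point of the pulled disk and the mount point of the emulated disk instead (those paths depend on your setup), and keep both strictly read-only.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_checksums(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_md5(path)
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_trees(real_root, emulated_root):
    """Return (files that differ, files only on real, files only on emulated)."""
    real = tree_checksums(real_root)
    emulated = tree_checksums(emulated_root)
    differing = sorted(
        name for name in real.keys() & emulated.keys() if real[name] != emulated[name]
    )
    only_real = sorted(real.keys() - emulated.keys())
    only_emulated = sorted(emulated.keys() - real.keys())
    return differing, only_real, only_emulated


# Demo: two temp directories stand in for the real and emulated disk mounts.
with tempfile.TemporaryDirectory() as real_dir, tempfile.TemporaryDirectory() as emulated_dir:
    for root, movie_bytes in ((real_dir, b"frame data X"), (emulated_dir, b"frame data Y")):
        Path(root, "docs").mkdir()
        Path(root, "docs", "notes.txt").write_bytes(b"same on both")
        Path(root, "movie.mkv").write_bytes(movie_bytes)
    Path(real_dir, "extra.bin").write_bytes(b"only on the real disk")
    differing, only_real, only_emulated = compare_trees(real_dir, emulated_dir)

print("differ:", differing)
print("only on real disk:", only_real)
print("only on emulated disk:", only_emulated)
```

A difference only tells you that the two copies disagree. Deciding which copy is right still needs a source of truth, such as an older checksum list or a backup.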

 

I would like to point out that a parity check will never tell you what disk is the cause of a sync error. With dual parity, it was hoped that the two parities might be able to triangulate and tell you what disk changed, but that is not possible today (whether that is theoretically possible in the future, I am not sure). So if you ever have a random sync error, with no data disk to suspect, you really have no unRaid help to know if it is parity or a data disk. We tend to assume parity, as filesystems are ruggedized with heavy commercial use. As mentioned above, the best tools to really figure it out are checksums that you'd have to be maintaining before the issue and that represent truth. And/or backups. Otherwise you'd be stuck doing the analysis above on each and every data disk (you'd have to know what you're doing unRaid wise, and keep the array strictly read only). That or just accepting the possible corruption by running a correcting check.

 

I'll point out that leaving a known parity sync error in place is not a good idea. If any drive were to fail and you did a rebuild, it could be corrupted (maybe subtly, maybe in unused parts of the disk, but it would not be the mirror image of the original). A correcting check would fix that, even if it means accepting whatever corruption has already occurred.
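
Here is a small sketch of that rebuild problem, again treating single parity as a plain byte-wise XOR (a toy model with invented four-byte disks, not unRAID's code).

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


disks = [b"AAAA", b"BBBB", b"CCCC"]

# Parity with one known but uncorrected sync error at offset 2.
stale_parity = bytearray(xor_parity(disks))
stale_parity[2] ^= 0x01

# Disk 0 fails and is rebuilt from the surviving disks plus the stale parity.
rebuilt_stale = xor_parity([disks[1], disks[2], bytes(stale_parity)])
bad_offsets = [i for i, (a, b) in enumerate(zip(rebuilt_stale, disks[0])) if a != b]

# Had a correcting check been run before the failure, the rebuild is exact.
fresh_parity = xor_parity(disks)
rebuilt_fresh = xor_parity([disks[1], disks[2], fresh_parity])

print(bad_offsets)                # [2]
print(rebuilt_fresh == disks[0])  # True
```

The leftover sync error lands at the same offset on whichever disk is rebuilt next, even though that disk had nothing to do with the original problem.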

 

Summary ... UnRaid does an extremely good job of maintaining parity. But if hardware, user error, or unexplained random events cause sync errors, with no hard shutdown involved, your best line of defense against data corruption is checksums from a point when you know the data was all valid. And then going to a backup to recover the bad file(s) (or re-obtaining the file in question in another way). Otherwise you are in for a very frustrating experience trying to figure it out. And in the end you may have to settle for knowing you have (or may have) corruption somewhere but having no way to know for sure or figure out the affected disk or files. And deleting every possible file that could be corrupted means losing too much valuable data!

 

(#ssdindex - parity sync errors)

Link to comment
