
(SOLVED) Errors when running a Parity check


Recommended Posts

Hi all, I don't post much because unRAID usually just works for me and I don't really run any plugins.

 

For some reason after updating to v6.1.8 I'm having issues. My version 5.0.6 setup ran well with the exception of my cheap SATA controllers giving me fits. I upgraded to an AOC-SAS2LP-MV8 and haven't looked back.

 

While I was upgrading the HBA I beefed up the CPU to an old Q6600 I had and put in 8 GB of memory. It wasn't needed in v5 so I didn't use it, but I wanted to play around with the new VM and Docker capabilities so I figured why not.

 

I'm using an nForce 750 board that never gave me issues when it was in my main gaming rig, but it is still pretty old.

 

I'm not sure if my aging hardware is the issue or if something else is going on, but I can't get a parity check to complete now. It had been a while since I had done one on my v5 setup so I figured it was time.

 

It starts off fast at around 80 MB/s, but then slowly crawls to a halt with errors in the syslog (attached).

 

I'm not sure what other information you guys need, but I'm worried that v6 may not like my hardware, or there was an issue with the v5 setup I wasn't aware of.

unraid-syslog-20160220-0338.zip

Link to comment

A quick look through the SMART reports for your drives shows the following:

 

WD-WMAZA6962916:  3 Pending Sectors

Any pending sectors are a bad sign, as they indicate sectors that cannot be read reliably and can thus affect building parity or recovering a failed drive.  A small number can sometimes be cleared by writing to the drive, but if you cannot get the count back to 0 then the drive needs replacing.

 

WD-WCAYY0111280:  1341 Pending Sectors

With this number of pending sectors I would assume the drive could fail at any point.  This is likely to be the drive that is affecting your attempts to check parity.  Since the syslog implies you have been running a correcting parity check, it will also have resulted in unreliable parity for this number of sectors, so I would be reluctant to trust your parity.  I would see if you could copy off as much data as possible to another drive and then replace it ASAP.  As the pending sectors could result in file corruption, it would be advisable, if at all possible, to use the copies of the files from your backups.

 

WD-WCAVY5673505:  1 Reallocated Sector

This is OK as long as the number stays stable.  However, it should be watched.  With v6, as long as you have notifications switched on, you should be notified if the value increases.
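
If you want to keep an eye on those two counts from the command line as well as via notifications, something along the following lines would do it. This is only a rough sketch assuming smartmontools (smartctl) is installed; the device names are examples and need changing to match your system.

# Rough sketch: print Reallocated and Pending sector counts for selected drives.
# Assumes smartmontools is installed; device names below are examples only.
import subprocess

DEVICES = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]   # change to match your system
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

for dev in DEVICES:
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if any(attr in line for attr in WATCHED):
            fields = line.split()
            # second column is the attribute name, last column is the raw value
            print(dev, fields[1], fields[-1])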

 

Link to comment

If I can't trust parity, how am I going to get back to a "ready" state if I can't trust the data being rebuilt once I replace the drive?

 

Well ... the point is that parity may reflect inaccurate data on the drive => but it WILL reflect what's actually on the drive;  so if you replace it and let it rebuild it will have the same content that was on the old (unreliable) drive.  Note that if you simply copied all the data off the drive and then re-copied it back to a new replacement drive you'd have exactly the same issue.
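
To make that concrete, here is a toy illustration using plain XOR parity over a few made-up bytes (not unRAID's actual md driver): the rebuilt disk comes back bit-for-bit as whatever the old disk held when parity was last updated, corrupt data included.

# Toy XOR-parity illustration (made-up byte values):
# rebuilding a disk from parity reproduces exactly what it held, good or bad.
disk1 = bytes([0x10, 0x20, 0x30])
disk2 = bytes([0x0A, 0x0B, 0x0C])      # pretend this one holds corrupted data
parity = bytes(a ^ b for a, b in zip(disk1, disk2))

# "Replace" disk2 and rebuild it from parity plus the surviving disk:
rebuilt = bytes(p ^ a for p, a in zip(parity, disk1))
assert rebuilt == disk2                # identical content, errors and all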

 

If you have checksums or backups of your data, it would be a good idea to then validate that your files are good ... if you don't have backups (or checksums) then you'll just have to assume all is okay.

 

Link to comment

One question I didn't think of earlier.

 

If I can't trust parity, how am I going to get back to a "ready" state if I can't trust the data being rebuilt once I replace the drive?

The issue is that the rebuilt drive will have up to 1341 sectors (as that is the number of pending sectors that were present when parity was generated) whose contents cannot be relied on.  That can affect a lot of files and/or directory information and you will not know which ones they are.  Do you have backups of those files?  Do you know which files they are?

Are we talking moving all data off of unRAID and starting over? That won't be fun.

Not as such.  We are only talking about the files that were on the problem disk.  If you go the simple rebuild route then some of them may be corrupted unless you either have checksums for them to detect the corruption, or can restore them from backup disks. 

 

If you have full backup of the files that are on the problem disk then there may be a better way forward than simply rebuilding the disk and then trying to correct any errors in the data on the rebuilt disk.

Link to comment

According to Tom, a disk URE won’t update parity, so if your parity was good it should still be. I would try to rebuild disk10; the problem is that disk1 possibly has bad sectors also, so the rebuild can fail or complete with read errors from disk1. It would be good to have checksums of all your files to check the rebuilt disk.

Link to comment

According to Tom, a disk URE won’t update parity, so if your parity was good it should still be. I would try to rebuild disk10; the problem is that disk1 possibly has bad sectors also, so the rebuild can fail or complete with read errors from disk1. It would be good to have checksums of all your files to check the rebuilt disk.

That might be the case, but the syslog said that many changes were being written to the parity disk so I would not be confident about that.

 

If there are not good backups in place then rebuilding, even with that many suspect sectors, is still worth doing, as it is a relatively small proportion of the millions of sectors on a single disk.  After all, if they are mainly media files, the odd corrupt bit will probably be ignored during playback.
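
To put a rough number on it: assuming, for example, a 2TB data disk (about 3.9 billion 512-byte sectors), 1341 suspect sectors amount to roughly 0.00003% of the disk, or well under 1MB of data.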

 

This is certainly one of the cases where having checksums and a good backup policy can help!  It is also something that would be less likely to happen in the first place on v6, as notifications would have been raised about the pending sectors, so action could have been taken before the number became significant.

Link to comment

I've already made sure my non media file backups are good.

 

The media files on the failing disks are easily recoverable from the original copy so I'm not at a loss there.

 

I don't do checksums, but I wish I had known that was necessary (smart). I will be doing them going forward. With the data on those drives that are having issues, it won't be too hard to find out what data is good and what has problems.

 

I'll report my findings once I finish.

Link to comment

I've already made sure my non media file backups are good.

 

The media files on the failing disks are easily recoverable from the original copy so I'm not at a loss there.

 

I don't do checksums, but I wish I had known that was necessary (smart). I will be doing them going forward. With the data on those drives that are having issues, it won't be too hard to find out what data is good and what has problems.

 

I'll report my findings once I finish.

Good to hear that you are in a position to recover any files that might end up corrupt.

 

It is worth pointing out (if you do not already know) that there is now the Dynamix File Integrity plugin for v6 that makes creating and checking checksums very easy.

Link to comment

According to Tom, any disk URE won’t update parity, so if your parity was good it should still be, I would try to rebuild disk10, the problem is that disk1 possibly has bad sectors also, rebuild can fail, or complete with read errors from disk1, so it would be good to have checksums of all your files to check the rebuilt disk.

 

Well, I just tested this and it's not true: a correcting parity check can update parity in case of a read error and cause corrupted files after rebuilding, which is why I've always done and will continue to do non-correcting checks.

Link to comment

According to Tom, any disk URE won’t update parity, so if your parity was good it should still be, I would try to rebuild disk10, the problem is that disk1 possibly has bad sectors also, rebuild can fail, or complete with read errors from disk1, so it would be good to have checksums of all your files to check the rebuilt disk.

 

Well, I just tested this and it's not true: a correcting parity check can update parity in case of a read error and cause corrupted files after rebuilding, which is why I've always done and will continue to do non-correcting checks.

 

By non-correcting checks you are referring to a parity check, correct? What is the benefit of this practice?

Link to comment

By non-correcting checks you are referring to a parity check, correct? What is the benefit of this practice?

 

I don't expect to find any errors during my monthly parity checks; in fact I don't remember the last time one found an error. But if there are any I will then try to find the cause and decide how to proceed, because, as I suspected, a disk read error can (though this may not always happen) incorrectly update parity, and when you then rebuild that disk one or more files will be corrupt.

Link to comment

The problem with doing non-correcting checks is that the system has no way to identify WHERE the bit error has occurred when it finds an error.  If no disk has any reported errors, the error is far more likely to be a legitimate parity error than a data error on a drive ... which is why the default behavior is to update the parity drive with a correction.

 

It IS true, of course, that if you have the appropriate backups and/or checksums [ideally both] you can identify whether or not there was an error on a data drive.  Unless you have a utility that will identify WHICH specific files on each drive are impacted by the erroneous bit, you would need to run a full set of checksum validations (or comparisons with your backups) to see if any files have been corrupted ... and then the appropriate action would be to replace the corrupted file and then run a correcting parity check to update parity.

 

The simple fact is the actions would be virtually identical if you had run a correcting parity check in the first place => run checksums to see if any data file was corrupted by the error; and then replace any corrupted file from your backup.

 

If you do NOT have backups of your data, but DO have checksums, then the non-correcting checks may be useful ... since in that case if you discover a corrupted file with your checksum validations you could rebuild the disk containing that file and hopefully it would then be correct, assuming the parity error was in fact associated with the corrupted file.

 

Link to comment

The problem with doing non-correcting checks is that the system has no way to identify WHERE the bit error has occurred when it finds an error.  If no disk has any reported errors, the error is far more likely to be a legitimate parity error than a data error on a drive ... which is why the default behavior is to update the parity drive with a correction.

 

It IS true, of course, that if you have the appropriate backups and/or checksums [ideally both] you can identify whether or not there was an error on a data drive.  Unless you have a utility that will identify WHICH specific files on each drive are impacted by the erroneous bit, you would need to run a full set of checksum validations (or comparisons with your backups) to see if any files have been corrupted ... and then the appropriate action would be to replace the corrupted file and then run a correcting parity check to update parity.

 

The simple fact is the actions would be virtually identical if you had run a correcting parity check in the first place => run checksums to see if any data file was corrupted by the error; and then replace any corrupted file from your backup.

 

If you do NOT have backups of your data, but DO have checksums, then the non-correcting checks may be useful ... since in that case if you discover a corrupted file with your checksum validations you could rebuild the disk containing that file and hopefully it would then be correct, assuming the parity error was in fact associated with the corrupted file.

 

I’m sorry but I disagree. A disk read error is pretty obvious: you get a system notification, or can just look at the error column on the main page. If you’re doing a correcting parity check and parity is wrongly updated, you have to rebuild the failed disk, run checksums on all those files, and replace corrupted ones from backups (assuming you have checksums and backups).

 

On the other hand, if it was a non-correcting check, you just replace the bad disk and let it rebuild.

 

If there are any sync errors without disk read errors, and all disks look fine then I’d do a correcting parity check.

 

Link to comment

... I’m sorry but I disagree, a disk read error is pretty obvious, you get a system notification, or just look at the error column on main page ...

 

Agree ... HOWEVER, it is VERY likely that the error column will ALREADY be non-zero BEFORE you run a parity check if there have been errors on a disk.    In that case I would agree that it's a good idea to rebuild the disk ... no parity check needed to know that.    In fact, I implied that in what I said above ...

 

... If no disk has any reported errors, ...

 

If you have your parity checks automated, then that's perhaps a reasonable reason to use non-correcting checks, since you may not notice the disk errors before the parity check runs.  [I don't automate my checks -- so I'd always notice this]

 

 

If, however, you look at the disk status before running the check, and they're all error free, then a correcting check eliminates the second step of running a non-correcting one and then a subsequent correcting check to fix any sync errors ... which is what you end up doing if your automated check is non-correcting ...

 

... If there are any sync errors without disk read errors, and all disks look fine then I’d do a correcting parity check.

 

I agree, however, that there's no harm in simply running a non-correcting check; then looking at the error column if there are any sync errors; and then either rebuilding a disk (if one shows errors in the error column) or running the parity check again as a correcting check otherwise.

 

Link to comment

These screens are from the test I made:

 

1 - Having just run a parity check with 0 errors, I made the disk fail during a correcting parity check; after it found some errors I aborted the check.

2 – Rebuilt to a new disk.

3 – Checksums found 5 corrupted files (it would probably have found many more if I hadn’t stopped the parity check).

 

Some log extracts:

 

Feb 21 16:07:30 Testv6 kernel: ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 21 16:07:30 Testv6 kernel: ata6.00: irq_stat 0x40000001
Feb 21 16:07:30 Testv6 kernel: ata6.00: failed command: READ DMA EXT
Feb 21 16:07:30 Testv6 kernel: ata6.00: cmd 25/00:e0:00:56:4c/00:03:08:00:00/e0 tag 19 dma 507904 in
Feb 21 16:07:30 Testv6 kernel:         res 51/40:00:09:57:4c/00:00:08:00:00/e0 Emask 0x9 (media error)
Feb 21 16:07:30 Testv6 kernel: ata6.00: status: { DRDY ERR }
Feb 21 16:07:30 Testv6 kernel: ata6.00: error: { UNC }
Feb 21 16:07:30 Testv6 kernel: ata6.00: configured for UDMA/133
Feb 21 16:07:30 Testv6 kernel: sd 6:0:0:0: [sdd] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Feb 21 16:07:30 Testv6 kernel: sd 6:0:0:0: [sdd] tag#19 Sense Key : 0x3 [current] [descriptor] 
Feb 21 16:07:30 Testv6 kernel: sd 6:0:0:0: [sdd] tag#19 ASC=0x11 ASCQ=0x4 
Feb 21 16:07:30 Testv6 kernel: sd 6:0:0:0: [sdd] tag#19 CDB: opcode=0x28 28 00 08 4c 56 00 00 03 e0 00
Feb 21 16:07:30 Testv6 kernel: blk_update_request: I/O error, dev sdd, sector 139220745
Feb 21 16:07:30 Testv6 kernel: ata6: EH complete
Feb 21 16:07:30 Testv6 kernel: md: correcting parity, sector=139198600
Feb 21 16:07:30 Testv6 kernel: md: disk2 read error, sector=139198608
Feb 21 16:07:30 Testv6 kernel: md: disk2 read error, sector=139198616
Feb 21 16:07:30 Testv6 kernel: md: disk2 read error, sector=139198624


Link to comment

Which is exactly why I run non-correcting checks myself.  If I get an error on a check and I can't determine which disk it occurred on, I can then just run a correcting check to update parity.  That isn't an option if I leave the default turned on; I lose the opportunity to try to find the error myself before parity is corrupted.  To me it is worth having to run a correcting check immediately after a non-correcting check - it gives me more options than always doing the correcting check.  Most of the time (99%) I never have to do the correcting check anyway.

Link to comment

It seems that there are some different views on how to maintain data integrity. Checksums and parity checks go hand in hand, but not all of that seems to be documented as a best practice. Nor is backup verification.

 

What I mean by this is I'm not as seasoned as you guys when it comes to making sure my data stays intact and I think a "data integrity verification" guide of some sort would be a great addition to standard documentation. Plus it would be a good learning experience for me to learn how to maintain my system with as little oversight as possible.

 

If this is documented then please point it out, but it'd be great to find a way to automate this process.

 

Thoughts?

Link to comment

The best way to approach parity checks has been a debated topic for nearly the entire decade that UnRAID has been around [in fact, UnRAID's 10th anniversary is next week  :) ]

 

Because of the way UnRAID computes parity, it is FAR more likely that any sync errors are on the parity disk than on any other disk -- but it's certainly true that it's not an absolute that this will be the case ... thus the debate over whether the parity disk should be updated to correct detected sync errors.  When UnRAID was first released, there wasn't any choice.  Some time later (I wasn't an UnRAID user in those days, so don't know exactly when) JoeL wrote UnMenu, and included in UnMenu a way to do a non-correcting check ... although the check from the main GUI was still always correcting.

 

When v5 came out, there was then a box on the Web GUI that allowed you to choose whether or not to correct sync errors during a check ... and the debate intensified.    Note that the automatic check that happens after an "unclean shutdown" was still always a correcting check.    With v6, you can change that behavior as well in the system options.

 

So ... which is best?  There are times when it's absolutely best to have a non-correcting check available ... immediately after a disk rebuild and when doing a New Config with the "trust parity" option are two key cases that come to mind.

 

But the simple fact is there is mathematically no way to know WHERE a bit error has occurred when a sync error occurs;  so unless you have some means of identifying what likely happened, simply correcting it on the parity disk is generally the most-likely-to-be-correct choice.    If you DO have a way to identify where it happened -- i.e. a disk has indicated errors -- then it's a different story ... you can, for example, rebuild that disk.    Of course if for some reason that wasn't where the error actually was, you've just corrupted the data on that disk ... but that's a risk you have to take in the absence of more certain validation means.
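
For anyone who wants to see why the error cannot be localized, here is a toy demonstration with plain XOR parity and made-up byte values: flipping the same bit on any one of the data disks produces exactly the same mismatch against parity, so the check alone cannot tell you which device holds the bad bit.

# Toy demo (made-up values): a sync error looks the same no matter which disk caused it.
disks = [0x11, 0x22, 0x44]
parity = disks[0] ^ disks[1] ^ disks[2]

def mismatch(d, p):
    return (d[0] ^ d[1] ^ d[2]) ^ p            # non-zero means a sync error

for i in range(3):
    flipped = list(disks)
    flipped[i] ^= 0x01                         # flip one bit on disk i
    print("bit flipped on disk", i + 1, "-> mismatch bits:", hex(mismatch(flipped, parity)))
# Every case reports the same mismatch (0x1); only external information
# (disk errors, checksums, backups) can say where the flip actually happened.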

 

Personally, I prefer to simply always do correcting checks -- and on the RARE occasion when they actually find and correct a sync error (or errors), I simply run a full CRC validation of my array [This takes a few days, but it only takes about one minute of "my time" (as opposed to "computer time") ... i.e. I go to Windows Explorer, point to my Media share, right-click, and select "Verify checksums"]

 

Note that in the past 5 years or so (since I've had full checksums available) I've probably had 3 occasions when sync errors were detected and automatically corrected ... and NONE of them had any errors in the actual data, so correcting the parity bit was indeed the right thing to do.

 

In the event that a validation DID identify a corrupted file (or files), I'd simply replace them from my backups.    Note that UnRAID -- or any other RAID -- is NOT a substitute for backing up your data.  [And of course the backups should also have checksums]

 

Link to comment

It seems that there are some different views on how to maintain data integrity. Checksums and parity checks go hand in hand, but not all of that seems to be documented as a best practice. Nor is backup verification.

 

What I mean by this is I'm not as seasoned as you guys when it comes to making sure my data stays intact and I think a "data integrity verification" guide of some sort would be a great addition to standard documentation. Plus it would be a good learning experience for me to learn how to maintain my system with as little oversight as possible.

 

If this is documented then please point it out, but it'd be great to find a way to automate this process.

 

Thoughts?

 

A good start is having checksums of all your files. If something happens that makes you doubt the integrity of one or more of your disks, you can run a check to confirm whether there are any corrupted files, and then restore just those files from backups or original media.

 

There’s a plugin for Unraid, but it can cause issues with reiserfs disks. You can convert them to XFS, not just because of this but because reiser is on its way out and XFS has better performance, among other things.

 

If you don't want to convert file systems you can also create and verify checksums from a Windows PC using, for example, corz.
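
If you would rather script it yourself than rely on corz or the plugin, a minimal sketch would be something like the following; the share path and manifest location are only examples, and the resulting file can be re-checked later with sha256sum -c.

# Minimal sketch: create SHA-256 checksums for everything under a share.
# The share path and manifest location are examples only.
import hashlib, os

SHARE = "/mnt/user/Media"
MANIFEST = "/boot/checksums/media.sha256"

def file_hash(path, blocksize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

with open(MANIFEST, "w") as out:
    for root, _, files in os.walk(SHARE):
        for name in files:
            full = os.path.join(root, name)
            out.write(f"{file_hash(full)}  {full}\n")   # format accepted by sha256sum -c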

Link to comment
