4.6 (and 4.7) - Parity Errors after Upgrading Disk in otherwise stable server



Last week I upgraded a 1TB disk to a 2TB one. The operation completed successfully and I didn't think anything of it. This is how it should be, right? I mean, unRAID is a pretty paranoid system, verifying writes to both the data and parity disks during all writes to the protected array (and that's why we all love it).

 

Well, by coincidence, I ran a no-correct parity check later last week to see whether my SIL3132 card would work in my x16 slot (I wanted a clue as to whether a SASLP-MV8 card might work in that slot). That no-correct parity check found 2 errors at the exact same blocks near the very beginning of the disk (0.1%), several times in a row. After moving the card back to its original location (where it had been for several parity checks and preclear operations in the past), the parity errors were still present in a new no-correct check, and still at the exact same sectors. I reported this in another thread but am moving the whole subject to the support forum, hoping to write a more concise account of events and maybe hear back from a broader audience.

 

I understand from Rajahal's comments in the other thread that the most likely source of the parity errors is the disk rebuild, and I concluded from that that most likely all the other data disks, including parity, are correct and the faulty data is on the recently rebuilt disk:

http://lime-technology.com/forum/index.php?topic=12621.0

 

Rajahal also disclosed that unRAID does not verify writes to a disk while it is being rebuilt, which makes this theory plausible. However, the smart reports for all the disks in the array showed no problems. I have 0 reallocated and 0 pending sectors on all the disks.

 

So, I did it over again: yanked the recently rebuilt data disk, assigned another precleared 2TB drive, and rebuilt again. I ran a no-correct parity check immediately afterward, without shutting down the server, and this time I have 5 parity errors. Like last time, they are persistent, within the first 0.1%, and in the exact same sectors on each parity check. I am attaching a syslog covering the disk rebuild, one incomplete parity check, one complete check, and one more incomplete check (25% complete as of now), all with the errors in the same locations.

 

I'm sure stranger things have happened, but I have to say, I do find this pretty strange. I divide the strangeness into two distinct chunks:

A) It's strange that my server (hardware?) can rebuild the data incorrectly onto the new data disk, given all the testing I have done on this hardware.

B) It's strange that unRaid would not verify the writes to the new data disk.

 

On B), I realize this might be in order to complete the disk rebuild faster and get back to a protected state, but what good is it if garbage is being written to the new disk and not being found out until the next parity check? Note that if the next parity check is a regular Parity Sync, then the garbage data would have been made permanent.

 

Does unRaid need an enhancement to verify writes during data rebuild, or am I way off base here? And does anyone have a theory on why the errors can occur in the first place?

 

----

 

Addendum - Further info on my server and my practices:

 

I've had my unRaid server running in good form for about 9 or 10 months now. The server uses the six ICH9R SATA ports, two ports from an onboard JMicron JMB36x chip, and one from a PCIe x1 SIL3132. I'm currently at 9 disks, leaving one of the SIL3132 ports available (the disk being rebuilt is on the SIL3132 card).

 

I have done a fair amount of parity checks including monthly scheduled ones, and have never had as much as a single sync error. Not one. The disk on the SIL3132 card has been in the array for more than 4 months and has therefore been checked at least 3 times (and with no errors).

 

I run on a UPS, and there has not been a power failure, crash, or any form of sudden shutdown. When powering down, I kill all the apps that run against the protected array (Sabnzbd, Transmission and Twonkyserver), stop the array, and power down cleanly. Every single time since I brought the server into 'production' 9 or 10 months ago.

 

Additionally, I have run a fair number of preclear cycles: two on every disk I have added to the array since day one, plus some more run simultaneously on both ports of the SIL3132 card before trusting it and starting to add disks to that card's ports. I'm pretty paranoid about the integrity of this server. The server was memtested for 24 hours before being brought online, I have all my power and data cables separated and tied up nicely, with latches on the data cables - all the measures I have been able to think of or stumble onto in this forum. I also keep a spare twice-precleared 2TB drive ready to drop in if a disk should fail.

syslog-2011-05-09_Rebuild_Again_and_With_5_Errors_After_Rebuild.txt

Link to comment

I am wondering if you might be seeing something like I reported in:

 

http://lime-technology.com/forum/index.php?topic=11515.msg109840#msg109840

 

where one of my drives was returning different data for some reads.  In the end the only way to track down which drive was causing the problem was to do MD5SUMs on each drive, multiple times over the ranges of blocks where parity errors were being reported.

 

In this post I documented the "dd | md5sum" commands I used to do this:

 

http://lime-technology.com/forum/index.php?topic=10364.msg98580#msg98580

 

In your case, since the errors are close to the start of the disk, you can probably run without the "skip" option and with "count" set to something like 100000. For such a short range, your tests will not take very long.

 

 

Regards,

 

Stephen

 

Link to comment

It could be a memory issue. Run memtest overnight. Unfortunately, memtest does not detect all possible memory problems. Recently, a user with the same problem tested his memory for 9+ hours and found no errors, but when he used only one memory stick instead of two, the problem went away.

Link to comment

Status update - Yesterday I upgraded to 4.7 and stripped my go script down to the bare bones, rebuilt the drive again, and voila - no parity errors. That's right, NONE  8)

 

I have decided I do need to get to the bottom of this, so I'm going to try to identify potential sources of the rebuild errors I have been experiencing. Not having parity errors is fundamental to my very perception of data storage, and I need to know that I can rebuild the array when a disk fails. Also, I'd rather go through this now, while all my disks look healthy in their smart reports, than at a later time when two drives might be close to failure at the same time.

 

This is the list of sources that have been proposed, along with my own considerations (listed in likelihood from low to high):

 

1) Various unRaid add-ons and their dependency packages - specifically (and in no particular order) unmenu, apcupsd, email-notify, smarthistory, transmission, sabnzbd and twonkyserver. I would certainly hope none of these could be causing this, but I have included them because they are part of the delta between the working and non-working configurations I have tested so far.

 

2) The drive not returning the same content each time it is read, as suggested by VCA. It's a valid theory, but since I hadn't had any parity errors before this rebuild error, I consider it unlikely. I have constructed a script similar to VCA's to test with the next time I have the system in a state where the rebuild has failed. With the 4.7 'bare bones' configuration that did rebuild successfully, every drive returned a consistent MD5 for the first 0.25%, 10 times in a row. For others building such a script: remember the double >> redirect in order to keep the md5 from all the runs instead of only the latest one (VCA's code in the linked thread had only one >). As an example:

dd if=/dev/sda skip=1953125 count=3906250 | md5sum -b >> sda.log

and repeat as many times as desired (a block is 512 bytes, so this example reads 2.0 GB starting at the 1.0 GB mark).
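To automate the repetition, a small wrapper along these lines can be used. This is only a sketch; the function name is mine, and the device, skip, count, and run values are examples to adjust for your own array:

```shell
#!/bin/sh
# Read the same block range from a drive several times, logging the md5 of
# each pass. A drive that reads consistently yields exactly one distinct sum.
md5_runs() {
    dev=$1; skip=$2; blocks=$3; runs=$4
    log="$(basename "$dev").md5.log"
    : > "$log"                       # start with a fresh log
    i=1
    while [ "$i" -le "$runs" ]; do
        # bs=512 keeps the 512-byte block convention used above
        dd if="$dev" bs=512 skip="$skip" count="$blocks" 2>/dev/null \
            | md5sum -b >> "$log"
        i=$((i + 1))
    done
    sort -u "$log" | wc -l           # prints 1 if every pass matched
}

# Example: 10 passes over 2.0 GB starting at the 1.0 GB mark of sda
# md5_runs /dev/sda 1953125 3906250 10
```

The log keeps one md5 line per pass, so you can also eyeball sda.md5.log afterward to see which pass differed.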

 

3) Memory errors, as suggested by dgaschk. Also a valid theory, although again, since I hadn't had any parity errors before the rebuild, and all the ones after the rebuild have been in the same locations consistently, it doesn't sound likely. The fact that I can rebuild successfully with 4.7 bare bones also doesn't point in this direction. But do note that I'm not crossing it off the list  :)

 

4) Write corruption with my SIL3132 controller. Not really considered likely, as I have been able to preclear successfully several times, and also because the rebuild works with the 4.7 bare-bones configuration. It's on the list more as a possible contributor in combination with one of the other candidates on this particular hardware.

 

5) unRaid version 4.6 - this version was not out for very long before being replaced by 4.7, so most likely the people who upgraded to 4.6 also upgraded to 4.7, and as a consequence probably very few people have rebuilt disks on this release (and a significant portion of those who have may not have performed a parity check afterward). I do realize (and appreciate) that it was in beta for a fair amount of time, but how many people actually rebuilt a disk and performed a parity check as part of the beta testing? Again, this looks unlikely from the changelog for 4.7, but since it is part of my delta from non-working to working rebuild configurations, I'll put it up here.

 

6) Various performance tweaks in my go script - most of which I copied from other forum members and have since forgotten what they do. I'm attaching my go script for anyone interested to review (see the sections under "# Filesystem Tuning" and "# Set Readahead buffer to 2048"). That's my top suspect, based on my (still limited) insight into the unRaid world.
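For context, the readahead section is of roughly this shape. This is a hypothetical reconstruction of that kind of go-script tweak, not a copy of my actual script; the device names and the list of devices are assumptions:

```shell
# Set Readahead buffer to 2048 sectors (1 MB) on each protected-array device
# (hypothetical example of the kind of tweak in question)
for dev in /dev/md1 /dev/md2 /dev/md3; do
    blockdev --setra 2048 "$dev"
done
```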

 

I'm testing today with 4.7 and the performance tweaks mentioned above commented out. If it works now, I'll assume the error was caused by 5) or 6) and take one more step to pinpoint which one is the culprit. I want to wrap this up, knowing I have a stable system and what caused the errors, before starting to test the AOC-SASLP-MV8 card that arrived in the mail yesterday  :)

go.txt

Link to comment

Small update - unRaid 4.7 with the performance tweaks commented out was also successful. That gives me more confidence that I haven't simply been lucky with the 4.7 bare-bones config, and it also rules out all the add-ons and dependency packages. Testing now with 4.7 and all the performance tweaks back in.

Link to comment

4.7 and my full go script worked! Zero parity errors after rebuilding the disk again.

 

I'm not passing judgment yet, but it does look like this was caused by version 4.6 (or at least that version in combination with my particular hardware). I'm testing 4.6 bare bones now to make sure I've found the rotten apple.

Link to comment
B) It's strange that unRaid would not verify the writes to the new data disk.

 

On B), I realize this might be in order to complete the disk rebuild faster and get back to a protected state, but what good is it if garbage is being written to the new disk and not being found out until the next parity check? Note that if the next parity check is a regular Parity Sync, then the garbage data would have been made permanent.

 

Does unRaid need an enhancement to verify writes during data rebuild, or am I way off base here? And does anyone have a theory on why the errors can occur in the first place?

 

This is done by running a parity check after rebuilding. I agree that unRAID should prompt you to perform a parity check after a rebuild.

Link to comment
  • 2 months later...

Boys and Girls - If I'm reading this right

http://lime-technology.com/forum/index.php?topic=13866.0

then Tom has encountered this bug as well and plans to fix it in a forthcoming version 4.7.1.

 

In my analysis, the rebuild errors I found, which were reproducible on 4.6, were caused by the bug Tom describes above. Running without the go script would have eliminated the offending disk activity during the rebuild, which would explain why the errors went away when I stripped the go script down.

As for the fact that I didn't get the error on 4.7 with the full go script, I will write that off as coincidence (the sabnzbd queue could have been empty or paused, and/or transmission paused because of its time-of-day schedule settings, at the time of the 4.7 rebuild test).

 

A nasty bug. I look forward to the forthcoming v4.7.1 and will definitely hold off on any data rebuilding until then. Good times.

 

(EDIT: I added [solved] to the subject title)

Link to comment
  • 4 months later...

Changed topic back. It's been 4½ months and 4.7.1 still hasn't surfaced. So not solved.

 

My server has been up and running stably for 3½ months straight now. All monthly parity checks since the above issue have found zero errors. Everything still points to the bug Tom disclosed 4½ months ago as the cause of the rebuild errors I encountered (see the link in my last post).

 

I do think it's fair to say I've been a patient man, but at this point my patience is wearing thin, and now I need to rebuild a drive that is starting to reallocate sectors (no errors yet). This should have been fixed by now (or 4½ months ago!) if Tom is at all serious about data integrity in the supposedly 'stable' release branch of unRAID.

Link to comment

Changed topic back. It's been 4½ months and 4.7.1 still hasn't surfaced. So not solved.

 

My server has been up and running stably for 3½ months straight now. All monthly parity checks since the above issue have found zero errors. Everything still points to the bug Tom disclosed 4½ months ago as the cause of the rebuild errors I encountered (see the link in my last post).

 

I do think it's fair to say I've been a patient man, but at this point my patience is wearing thin, and now I need to rebuild a drive that is starting to reallocate sectors (no errors yet). This should have been fixed by now (or 4½ months ago!) if Tom is at all serious about data integrity in the supposedly 'stable' release branch of unRAID.

 

+1

Link to comment
