
First Time I've Had Errors After Completing Parity-Check



Hi all, after running a parity-check overnight, I've encountered my first errors.  This parity-check was run after I completed the first phase of a large upgrade I'm making to my server.  More details about my upgrade: https://lime-technology.com/forum/index.php?topic=44097.msg422253#msg422253

 

I ran a parity-check a few weeks back and found 0 errors.  In fact, I've never encountered an error before (system is ~2 years old).  I'm trying to figure out the cause and determine if I've had any data loss.  I'm hoping someone with more experience can share their thoughts as to the errors.  Let me write a short timeline:

 

2013_09

Original system build

 

2015_11_21

I found the system had entered "bad res-counter state".  All commands resulted in "Segmentation fault. (Core dumped)".  I did some reading and figured this was a memory issue.  After running MemTest, I found a couple failing addresses.

 

2015_11_24

Replaced RAM.  Ran MemTest for ~36 hours.  0 failures.

 

2015_11_25

Installed SUPERMICRO AOC-SAS2LP-MV8 and 2 ICY DOCK Cages.  Started parity check.

 

2015_11_26

Results from running a CORRECT Parity-Check (this is the first parity-check I've run since the unclean shutdown that resulted from the failing memory).

Nov 26 01:46:20 unRAID kernel: md: correcting parity, sector=3519069768

Nov 26 01:46:20 unRAID kernel: md: correcting parity, sector=3519069776

Nov 26 01:46:20 unRAID kernel: md: correcting parity, sector=3519069784

Nov 26 01:46:20 unRAID kernel: md: correcting parity, sector=3519069792

Nov 26 01:46:20 unRAID kernel: md: correcting parity, sector=3519069800
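For anyone who wants to compare runs, the sector numbers can be pulled out of log lines like the ones above with a small filter. This is only a sketch: the function name parity_sectors is made up for illustration, and it assumes the log lines have exactly the "md: correcting parity, sector=N" or "md: parity incorrect, sector=N" form shown in this thread.

```bash
#!/bin/bash
# Print the sector number from each md parity line read on stdin.
# Matches "correcting parity" (CORRECT runs) and "parity incorrect" (NO-CORRECT runs).
# parity_sectors is a hypothetical helper name, not part of unRAID.
parity_sectors() {
  sed -n -E 's/.*md: (correcting parity|parity incorrect), sector=([0-9]+).*/\2/p'
}

# Sample input: the first two log lines quoted above.
parity_sectors <<'EOF'
Nov 26 01:46:20 unRAID kernel: md: correcting parity, sector=3519069768
Nov 26 01:46:20 unRAID kernel: md: correcting parity, sector=3519069776
EOF
```

On a live server the same function could be fed the system log instead of the sample lines (for example parity_sectors < /var/log/syslog, assuming that is where the log lives on your install).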

 

Ran md5 script for all drives:

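#!/bin/bash
# Hash the same region of /dev/sdc five times; identical checksums mean the reads are repeatable.
LOG_DIR=/var/log/hashes
mkdir -p "$LOG_DIR"
cd "$LOG_DIR" || exit 1

for i in {1..5}
  do
    echo "Begin sdc pass $i of 5."
    # dd uses 512-byte blocks by default, so this reads blocks 3510000000 through 3519999999.
    dd if=/dev/sdc skip=3510000000 count=10000000 | md5sum -b >> sdc.log
  done
exit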

 

For each drive, all five passes produced the same md5 checksum.
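As a sanity check on the dd arguments, here is a rough sketch that confirms the hashed window covers the reported sectors. It assumes the sector numbers in the log are 512-byte sectors (dd's default block size) and that any offset between the md device and the raw disk is small compared with the window; in_window is a hypothetical helper name.

```bash
#!/bin/bash
# Return success if SECTOR falls inside the dd window [SKIP, SKIP+COUNT).
# All three values are in 512-byte blocks (an assumption about the log's units).
in_window() {
  sector=$1; skip=$2; count=$3
  [ "$sector" -ge "$skip" ] && [ "$sector" -lt $((skip + count)) ]
}

SKIP=3510000000    # same values as the dd command in the script
COUNT=10000000
for s in 3519069768 3519069776 3519069784 3519069792 3519069800; do
  if in_window "$s" "$SKIP" "$COUNT"; then
    echo "sector $s: inside window"
  else
    echo "sector $s: OUTSIDE window"
  fi
done
```

With these values the window ends at block 3520000000, so all five corrected sectors sit close to its end.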

 

I'm now running a NON-CORRECT parity-check.  If it comes back with 0 errors, is it safe to assume that the errors were just some bad data on the parity-drive?  Are there any additional tests I should run?  Thanks!

Link to comment
  • 2 weeks later...

Yesterday I finished running a long pre-clear routine on my new 6TB hard drives.  Everything was OK with them, so I decided to run another NON-CORRECT parity check on my setup without the new drives added.  I'm only 50% through the parity check, but I've unfortunately found 2 errors:

 

Dec  6 23:39:53 unRAID kernel: md: parity incorrect, sector=3519069768

Dec  6 23:39:53 unRAID kernel: md: parity incorrect, sector=3519069800

 

Both of these sectors were among the five that the earlier CORRECT check reported.
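To make the overlap explicit, the two sector lists can be intersected. A minimal sketch, assuming each run's sectors have been saved one per line in a file (the helper name common_sectors and the temp files are just for illustration):

```bash
#!/bin/bash
# Print the sectors from the first file that also appear in the second file.
# Each file holds one sector number per line; -F/-x match whole lines literally.
common_sectors() {
  grep -Fxf "$2" "$1"
}

first=$(mktemp)
second=$(mktemp)
# Nov 26 CORRECT check (sectors from the log earlier in the thread)
printf '%s\n' 3519069768 3519069776 3519069784 3519069792 3519069800 > "$first"
# Dec 6 NON-CORRECT check
printf '%s\n' 3519069768 3519069800 > "$second"

common_sectors "$first" "$second"
rm -f "$first" "$second"
```

Any sector printed here was flagged in both runs, which is the pattern that makes a random cabling fault seem less likely to me.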

 

Does the community have any advice for trying to troubleshoot what is wrong with my build?  I'm worried that the SUPERMICRO AOC-SAS2LP-MV8 or new breakout cables might be causing these errors, but isn't that unlikely since the same sectors are affected?

Link to comment

Only at 55%, but I just noticed something interesting:

 

Device  Identification                                Temp.  Size     Free      Reads     Writes  Errors
parity  ST4000VN000-1H4168_Z300MCV2 (sdg) 3907018532  36°C   4 TB     -         12206615  42      0
disk1   Browse /mnt/disk1 (sdf) 3907018532            40°C   4 TB     45.95 GB  12216606  10      0
disk2   Browse /mnt/disk2 (sde) 3907018532            39°C   4 TB     1.23 GB   12220766  10      0
disk3   Browse /mnt/disk3 (sdd) 3907018532            34°C   4 TB     5.37 GB   7062978   10      0
disk4   Browse /mnt/disk4 (sdk) 3907018532            37°C   4 TB     1.6 GB    12202240  10      0
disk5   Browse /mnt/disk5 (sdj) 3907018532            42°C   4 TB     4.05 GB   12207924  10      0
disk6   Browse /mnt/disk6 (sdi) 3907018532            39°C   4 TB     10.77 GB  12197476  10      0
flash   Browse /boot (flash device) DT_100_G2         -      8.01 GB  7.91 GB   444       15

 

Sorry if this is hard to parse...  But disk3 (sdd) has only 70XXXXX reads vs the other disks that have 122XXXXX reads.  Considering that all my disks are the same capacity and all have only a small amount of free space, why would one disk have a lower read count than the others?

Link to comment

Don't worry about the read counts ... it's normal to have different counts on different drives -- sometimes very significantly different.  It's an I/O count, not a byte count ... and depending on just how much is requested on each read there can be notable differences.
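To put rough numbers on that, here is a small sketch of the arithmetic. The byte total is purely hypothetical (I'm assuming each disk has transferred the same 2,000,000,000,000 bytes, about half of a 4 TB disk); only the read counts come from the table in the post above, and avg_bytes_per_read is a made-up helper name.

```bash
#!/bin/bash
# Average bytes per read request = total bytes read / number of read requests.
avg_bytes_per_read() {
  echo $(( $1 / $2 ))
}

# Hypothetical byte total, assumed identical for both disks.
BYTES=2000000000000
echo "disk1: $(avg_bytes_per_read "$BYTES" 12216606) bytes per read"
echo "disk3: $(avg_bytes_per_read "$BYTES" 7062978) bytes per read"
```

The same amount of data simply works out to a larger average request on the disk with fewer reads, so a lower count by itself doesn't mean less data was read.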

 

As for the parity errors ... this could be an issue with the SAS2LP, as others have seen similar results.  It's unlikely that they're data errors, however, especially since you've confirmed the data with an MD5 check.

 

Do you have ECC memory?    If not, this could indeed be due to a memory issue that Memtest didn't detect.  Also, what version of UnRAID are you running?    Most (but not all) of the known SAS2LP issues were apparently resolved as of v6.1.4, so if you're running a version earlier than that you should update to the latest, 6.1.6.

 

Link to comment

Cool, I won't worry about the read counts.

 

It seems very odd to me that the SAS2LP would have errors with the same sectors.  Now granted my hardware knowledge isn't that great to really explain this.  It just seems that if the controller was having errors, then it would be random rather than consistent sectors.

 

Unfortunately, I do not have ECC memory (the processor/socket I went with doesn't support it - mea culpa).  I did retire my single RAM stick a couple weeks back though and replaced it with a brand new one.  I ran MemTest for ~36 hours without any errors.

 

I am currently running unRAID Server Pro Version: 5.0.6.  I was hoping to get a "stable" build before I upgraded to 6.  Are there known issues in 5 that were resolved in 6?

 

Thanks!

Link to comment

It's been a long time (year or more), but I recall a couple threads where folks were having this same issue .... i.e. recurring parity errors at the same sectors.    Unfortunately I can't find the thread, and I don't recall the exact setting that ultimately resolved it.  As I recall it was NOT a real data error ... but I just don't remember how it was actually resolved (I think there was some setting that had to be changed).

 

However, it's likely that it was resolved in the evolution to v6, so I'd go ahead and upgrade to the v6.1.6 and see if the issue goes away.  If not, then I'll do a more extensive search to see if I can find that specific thread.

 

Link to comment

Alright, I upgraded to 6.1.6 last night and ran another NO-CORRECT Parity-Check.  It found the same two errors as last time:

Dec  8 01:56:27 unRAID kernel: md: parity incorrect, sector=3519069768

Dec  8 01:56:27 unRAID kernel: md: parity incorrect, sector=3519069800

 

I did originally run a CORRECT Parity-Check after I installed the SAS2LP, so maybe I need to correct them again to get them back to what they should be?

Link to comment

That's probably the only way to know whether they really need correcting. Then run a NO-CORRECT check afterwards to see if they are fixed.
Link to comment

So I ran the CORRECT parity-check on 12/8 and the results:

Dec  8 15:30:11 unRAID kernel: md: correcting parity, sector=3519069768

Dec  8 15:30:11 unRAID kernel: md: correcting parity, sector=3519069800

 

After that I started another NO-CORRECT check and it found 0 errors.  It also found 0 errors last time I ran a check after the correcting parity check.  I'm thinking I might shut down the server for 1 day and then run a couple more parity-checks to make sure everything is OK.

Link to comment

Well, I started another parity-check last night.  This was a NO-CORRECT check:

Dec 10 01:34:01 unRAID kernel: md: parity incorrect, sector=3519069768

Dec 10 01:34:01 unRAID kernel: md: parity incorrect, sector=3519069800

 

I've decided that I want to rule out my HDs, memory, and other hardware.  So I unplugged my HDs from the SAS2LP and back into the motherboard.  I'm going to run a CORRECT parity-check and then maybe a few more checks just to make sure everything is good.  If no errors, then we can safely assume something is wrong with either the SAS2LP or the breakout cables.

Link to comment

Here is my latest status.  I unplugged all HDs from the SAS2LP and back into the motherboard.  I ran 3 parity-checks with this configuration.  Here are the results:

 

2015_12_10

Results from running a CORRECT Parity-Check

Dec 10 11:30:25 unRAID kernel: md: correcting parity, sector=3519069768

Dec 10 11:30:25 unRAID kernel: md: correcting parity, sector=3519069800

 

2015_12_11

Results from running a CORRECT Parity-Check - 0 errors

Allowed HDs to spin down and then started another parity-check

Results from running a CORRECT Parity-Check - 0 errors

 

The only two pieces of hardware that differ in the setup that produces the parity-check errors are the SAS2LP and the breakout cables.  I doubt the cables are the issue since I've seen the same sectors fail the parity-check each time.  So my best guess right now is that there is some bug with the SAS2LP.

 

I would like to get the SAS2LP working if possible.  Has anyone been using this card successfully without error?  Maybe there are some settings I can play with?

Link to comment

As I noted earlier, I've seen this exact issue [repeating parity check errors at the same sectors] discussed on the forum, but it's been a good while (perhaps as long as a year ago) ... and I simply don't recall exactly how it was resolved (or even what version of UnRAID was involved).    It WAS something they resolved, but I simply don't remember what that resolution was.    Hopefully someone who was involved with that issue will remember the thread and post a link to it, but I've searched a good bit and haven't found it.

 

Link to comment

This is one of the threads I had been thinking of:

http://lime-technology.com/forum/index.php?topic=38359.0

 

Note that changing the spin-down setting for the parity drive to "Never" seems to have fixed the issue for at least some of the folks ... you might want to give that a try and see if it also helps your situation.    You probably don't want to leave it that way, but it'd be nice to know if it has an impact.    Others have indicated that disabling vt-d in the BIOS has helped with the same general issue.

 

Link to comment

Thanks garycase.  I am running an AMD board, so I don't think it supports vt-d.  I believe AMD has their own version though, so maybe I should try disabling that.

 

I can spend some time to see if changing the spin-down setting is a valid workaround for my situation or not.  You're right though, I wouldn't feel confident leaving it this way for an extended period of time.  Knowing that I would lose confidence in my parity if my array ever had to be shut down would make me uncomfortable.

 

What do you think the long-term solution is here?  Does it seem likely that unRAID could resolve this issue with a future update?  Or should I start looking around for a replacement for the SAS2LP?

Link to comment

Agree, the Dell H310's seem to be a good choice [I don't have one, but they're well thought of on this board ... you simply have to reflash the firmware, which is a simple process].

 

First I'd see if changing the parity spindown to "Never" resolves the issue ... if so, you could just leave it like that for a few months until the next couple versions are released.  Hopefully this issue will (finally) be resolved ... it's hard to believe that such a good card has caused so many issues in this application.

 

Link to comment
  • 5 months later...

Sorry to stir up an old thread.  Just wanted to give an update.

This weekend I upgraded my unRAID server from 6.1.6 to 6.1.9.  I also flashed my AOC-SAS2LP-MV8 to 4.0.0.1812.  I was hoping that one or both of these changes would resolve my parity check issues.  Unfortunately, I am still seeing parity check errors when using the SAS2LP card.  I believe I am going to give up on the SAS2LP card and try buying another card.

Link to comment

It's a shame, as the SAS2LP is a very nice card, but it seems to have recurring problems on many UnRAID setups ever since v6  (it worked fine with v5).    I assume it's some strange Linux driver issue, but it is what it is.

 

Agree the simplest "fix" is to just use another card  :)

Link to comment

Archived

This topic is now archived and is closed to further replies.
