New build 2-4 million parity errors


Hi everyone,

 

I want to start off by saying that I am a first-time Unraid user and inexperienced with Linux. That said, I have many years of experience with OS X and I am a Windows admin. I mention my history because my specific knowledge of Linux/Unraid is lacking, but I am well versed in computers in general, so I should be able to act on any suggestions given. I have done a lot of searching and reading and have tried many things, but I feel like I have hit a wall and need some help reading the logs and/or hardware troubleshooting advice.

 

Problem: after successfully setting up my Unraid server and building the parity disk, I run a parity check, which results in 2-4 million parity sync errors. If I run a parity check with correction and then run the check again, I still get a similar number of errors.

 

Things I have tried to resolve this issue:

 

Updated bios of mother board to the latest version

Performed memtest overnight (no errors)

Isolated each drive in the array, rebuilt parity, and then performed a parity check. (I have 3 drives in total: I built a 2-drive array and rotated the drives so that only 2 were ever in the array, which let me test each drive as both a parity drive and a data drive.)

Replaced all sata cables

Installed a sata controller (later removed as it made no difference)

Replaced power supply (Corsair RM750x)

Downloaded and setup a new copy of Unraid

 

What I’ve learned is that either all of my new drives are bad or they are all good (I do not believe they are bad)

Either my MB is not supported or it is going out?

 

 

Systems specs:

SOFTWARE

Unraid version 6.1.9

INSTALLED PLUGINS: Community Applications, Fix Common Problems, Dynamix WebGUI, Unraid Server OS

 

HARDWARE

Model: Custom

M/B: ASUSTeK COMPUTER INC. - P8P67-M PRO (B3 revision)

CPU: Intel® Core™ i7-2600K CPU @ 3.40GHz

HVM: Enabled

IOMMU: Disabled

Cache: 256 kB, 1024 kB, 8192 kB

Memory: 16384 MB (max. installable capacity 32 GB)

Network: eth0: 1000Mb/s - Full Duplex

Kernel: Linux 4.1.18-unRAID x86_64

OpenSSL: 1.0.1s

Uptime:2 days, 12:56:25

1x wd red 4tb (parity)

1x wd red 3tb

1x Samsung 850 evo 250 gb ssd (cache)

 

Any guidance is greatly appreciated, Diagnostics are attached.

 

I appreciate any and all help and please let me know if there is any additional information i can provide.

 

-Dustin

unraid-diagnostics-20161003-0901.zip

Link to comment

It certainly sounds like you have done your homework and did a lot of troubleshooting.

 

The syslog doesn't actually contain all that troubleshooting though since it starts over when you reboot.

 

Consequently, I don't actually see any evidence that an initial parity sync was ever completed.

 

Do a New Config and let parity build. Then do a non-correcting parity check. Then post another diagnostic.

Link to comment

Will do!

 

Thanks,

 

-Dustin

Link to comment

Here is exactly what I did before running "diagnostics" to gather these reports.

Ran new configuration

Selected drives and started the array, at which point the parity build was initiated

After the parity build completed, I ran a parity check without corrections

Upon completion of the parity check, "diagnostics" was run.

 

I have attached the diagnostics information.

 

Thank you for taking the time to look at this information. I am open to any and all suggestions.

 

-Dustin

unraid-diagnostics-20161004-0956.zip

Link to comment

This is puzzling. I suspect a hardware problem of some kind that is preventing the data from being read or written correctly, possibly a stuck bit somewhere along the way, but I don't know where it might be since you seem to have already checked memory and tried different ports.
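To illustrate why a faulty read/write path would behave this way, here is a minimal simulation sketch. It assumes single parity computed as a byte-wise XOR across the data disks, and it models the fault as a hypothetical `flaky_read` that occasionally flips one bit. The disk sizes, the 5% flip rate, and all function names are made up for illustration; this is not how Unraid itself is implemented.

```python
import random


def xor_blocks(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(blocks):
    """Single parity: XOR of the same stripe from every data disk."""
    result = bytes(len(blocks[0]))
    for block in blocks:
        result = xor_blocks(result, block)
    return result


def flaky_read(block, rng, flip_chance):
    """Return the block, occasionally with one random bit flipped (hypothetical fault)."""
    if rng.random() >= flip_chance:
        return block
    data = bytearray(block)
    data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
    return bytes(data)


def parity_check(data_disks, parity, rng, flip_chance, correct):
    """Count stripes where recomputed parity differs from the parity read back."""
    errors = 0
    for i in range(len(parity)):
        expected = parity_of([flaky_read(d[i], rng, flip_chance) for d in data_disks])
        stored = flaky_read(parity[i], rng, flip_chance)
        if expected != stored:
            errors += 1
            if correct:
                # The "correction" is computed from possibly corrupted reads.
                parity[i] = expected
    return errors


def simulate(flip_chance, stripes=2000, block_size=64, seed=1):
    """Build parity, run a correcting check, then a non-correcting check."""
    rng = random.Random(seed)
    data_disks = [[bytes(block_size) for _ in range(stripes)] for _ in range(2)]
    parity = [
        parity_of([flaky_read(d[i], rng, flip_chance) for d in data_disks])
        for i in range(stripes)
    ]
    first = parity_check(data_disks, parity, rng, flip_chance, correct=True)
    second = parity_check(data_disks, parity, rng, flip_chance, correct=False)
    return first, second


if __name__ == "__main__":
    print("healthy path:", simulate(0.0))
    print("flaky path:  ", simulate(0.05))
```

With a healthy path both checks report zero errors. With the flaky path, the second check still reports errors even though the first one "corrected" parity, which matches the symptom of a similar error count on every run.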

 

You don't actually mention clearing the drives. That wouldn't cause what we are seeing, but I wonder if you did a complete preclear cycle of a single disk whether that would also fail due to some read/write problem.

 

Link to comment

In my testing I believe I have eliminated (to varying degrees of certainty) nearly every component, with the exception of the motherboard and CPU.

 

I'll go ahead and preclear the drives to see what happens. I have not done this before but I'll do a search and see how to do this unless you have a link handy that can walk me through it? I've got preclear running on all of my drives at the moment. I'll report back with results.

 

Part of me wonders if it has something to do with the P67 Express chipset on my motherboard, as I have not been able to find another build using this chipset. Or could it be a missed BIOS setting? In the BIOS I loaded optimized defaults and then enabled AHCI and VT-d.

 

Thank you for looking at the information provided and getting back to me.

 

-Dustin

Link to comment

A single-pass preclear was performed on my SSD; attached (and pasted below) is the message displayed at the end of the process. I am not sure exactly what I'm looking for, but I presume that a returned value of anything other than 00000 is a bad thing.

 

================================================================== 1.15

=                unRAID server Pre-Clear disk /dev/sdb

=              cycle 1 of 1, partition start on sector 64

= Disk Pre-Clear-Read completed                                DONE

= Step 1 of 10 - Copying zeros to first 2048k bytes            DONE

= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE

= Step 3 of 10 - Disk is now cleared from MBR onward.          DONE

= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4      DONE

= Step 5 of 10 - Clearing MBR code area                        DONE

= Step 6 of 10 - Setting MBR signature bytes                    DONE

= Step 7 of 10 - Setting partition 1 to precleared state        DONE

= Step 8 of 10 - Notifying kernel we changed the partitioning  DONE

= Step 9 of 10 - Creating the /dev/disk/by* entries            DONE

= Step 10 of 10 - Verifying if the MBR is cleared.              DONE

= Disk Post-Clear-Read completed                                DONE

Disk Temperature: 38C, Elapsed Time:  1:12:03

========================================================================1.15

== SamsungSSD850EVO250GB  S2R5NXBH314753P

== Disk /dev/sdb has NOT been precleared successfully

== skip=50800 count=200 bs=1000448 returned 00002 instead of 00000 skip=122600 count=200 bs=1000448 returned 00002 instead of 00000 skip=227000 count=200 bs=1000448 returned 00256 instead of 00000 skip=249200 count=200 bs=1000448 returned 00256 instead of 00000

============================================================================

** Changed attributes in files: /tmp/smart_start_sdb  /tmp/smart_finish_sdb

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

  Airflow_Temperature_Cel =    62      72            0        ok          38

No SMART attributes are FAILING_NOW

 

 

    the number of sectors pending re-allocation did not change.

0 sectors had been re-allocated before the start of the preclear.

0 sectors are re-allocated at the end of the preclear,

    the number of sectors re-allocated did not change.

 

Thanks for having a look

 

-Dustin

250gb_SSD_preclear_results.txt

Link to comment

Looks like it tried to write zeros then later read back non-zeros so that seems consistent with the problems you are having with parity. If the SSD wasn't part of your parity testing then that would seem to eliminate drive issues and point toward something else. What I don't know.
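For anyone who wants to see where on the disk those failures landed, here is a rough parsing sketch. It assumes the `skip`, `count` and `bs` fields are dd-style parameters (so a region starts at `skip * bs` bytes and spans `count * bs` bytes), and it treats the "returned" number as an opaque value that should be zero for a cleared region. The `bits_set` field is only a curiosity: I have not confirmed how the preclear script derives that number, so don't read too much into it. The function and field names are mine, not part of the preclear script.

```python
import re

# One failure record as printed in the preclear summary above.
FAILURE = re.compile(
    r"skip=(\d+) count=(\d+) bs=(\d+) returned (\d+) instead of (\d+)"
)

# Copied from the SSD report in this thread.
SAMPLE = (
    "skip=50800 count=200 bs=1000448 returned 00002 instead of 00000 "
    "skip=122600 count=200 bs=1000448 returned 00002 instead of 00000 "
    "skip=227000 count=200 bs=1000448 returned 00256 instead of 00000 "
    "skip=249200 count=200 bs=1000448 returned 00256 instead of 00000"
)


def parse_verify_failures(text):
    """Turn the run-together failure list into one dict per failed region."""
    failures = []
    for skip, count, bs, got, want in FAILURE.findall(text):
        skip, count, bs, got, want = (int(v) for v in (skip, count, bs, got, want))
        failures.append(
            {
                "start_byte": skip * bs,      # assumed dd semantics
                "length_bytes": count * bs,   # assumed dd semantics
                "returned": got,
                "expected": want,
                "bits_set": bin(got).count("1"),
            }
        )
    return failures


if __name__ == "__main__":
    for failure in parse_verify_failures(SAMPLE):
        print(failure)
```

The failed regions are spread across the whole SSD rather than clustered in one spot, which fits a problem somewhere in the data path better than a single bad area on the drive.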
Link to comment

The SSD is the one constant that has always been in the array. But it has always been set up as a cache drive, so I assume that it does not affect a parity check. Am I correct in that assumption?

 

I figure at this point I'll let the other drives preclear to see what happens. After they have completed, I'll run memtest again just to make 100% sure.

 

I'll report back with results as they come.

 

Thank you very much trurl for the assistance.

 

-dustin

Link to comment

You should not run pre-clear on an SSD.  If you insist maybe use 'blkdiscard' command to tell the SSD to discard all the data you wrote, and then let unRAID format it and add to cache pool.

 

Good to know!

 

I'll be sure to not run preclear on an SSD in the future.

 

I'll look into how to use "blkdiscard" to discard all written data to the SSD.
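As a note to self, here is a small wrapper sketch for that step. It assumes the util-linux `blkdiscard` tool is on the server and that, when given only a device path, it discards every block on that device. The helper names and the dry-run default are my own additions; `/dev/sdX` is a placeholder and must be replaced with the real SSD device, since this destroys all data on whatever device is named.

```python
import subprocess


def blkdiscard_command(device):
    """Build the command line for discarding a whole block device."""
    if not device.startswith("/dev/"):
        raise ValueError("expected a device path such as /dev/sdX, got %r" % device)
    return ["blkdiscard", device]


def discard_all(device, dry_run=True):
    """Return the command as text; only execute it when dry_run is False."""
    command = blkdiscard_command(device)
    if not dry_run:
        # Destructive: every block on the device is discarded.
        subprocess.run(command, check=True)
    return " ".join(command)


if __name__ == "__main__":
    # Dry run only: prints the command without touching any disk.
    print(discard_all("/dev/sdX"))
```

Keeping the dry run as the default means the command can be read and double-checked against the device list before anything is actually discarded.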

 

Thank you for the reply.

 

-Dustin

Link to comment

Drive results from the 3tb Red and the 4tb Red

 

3tb:

================================================================== 1.15

=                unRAID server Pre-Clear disk /dev/sdc

=              cycle 1 of 1, partition start on sector 1

= Disk Pre-Clear-Read completed                                DONE

= Step 1 of 10 - Copying zeros to first 2048k bytes            DONE

= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE

= Step 3 of 10 - Disk is now cleared from MBR onward.          DONE

= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4      DONE

= Step 5 of 10 - Clearing MBR code area                        DONE

= Step 6 of 10 - Setting MBR signature bytes                    DONE

= Step 7 of 10 - Setting partition 1 to precleared state        DONE

= Step 8 of 10 - Notifying kernel we changed the partitioning  DONE

= Step 9 of 10 - Creating the /dev/disk/by* entries            DONE

= Step 10 of 10 - Verifying if the MBR is cleared.              DONE

= Disk Post-Clear-Read completed                                DONE

Disk Temperature: 36C, Elapsed Time:  29:47:10

========================================================================1.15

== WDCWD30EFRX-68EUZN0  WD-WCC4N4ALL65Y

== Disk /dev/sdc has NOT been precleared successfully

== skip=200 count=200 bs=1000448 returned 01024 instead of 00000 skip=600 count=200 bs=1000448 returned 04096 instead of 00000 skip=800 count=200 bs=1000448 returned 02816 instead of 00000 skip=1200 count=200 bs=1000448 returned 00002 instead of 00000 skip=1600 count=200 bs=1000448 returned 02048 instead of 00000 skip=1800 count=200 bs=1000448 returned 04096 instead of 00000 skip=2200 count=200 bs=1000448 returned 00256 instead of 00000 skip=3000 count=200 bs=1000448 returned 01024 instead of 00000 skip=3800 count=200 bs=1000448 returned 00576 instead of 00000 skip=4800 count=200 bs=1000448 returned 00514 instead of 00000

============================================================================

** Changed attributes in files: /tmp/smart_start_sdc  /tmp/smart_finish_sdc

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      Temperature_Celsius =  114    118            0        ok          36

No SMART attributes are FAILING_NOW

 

0 sectors were pending re-allocation before the start of the preclear.

0 sectors were pending re-allocation after pre-read in cycle 1 of 1.

0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

0 sectors are pending re-allocation at the end of the preclear,

    the number of sectors pending re-allocation did not change.

0 sectors had been re-allocated before the start of the preclear.

0 sectors are re-allocated at the end of the preclear,

    the number of sectors re-allocated did not change.

root@unraid:/boot# cd /boot

 

4tb:

================================================================== 1.15

=                unRAID server Pre-Clear disk /dev/sdd

=              cycle 1 of 1, partition start on sector 1

= Disk Pre-Clear-Read completed                                DONE

= Step 1 of 10 - Copying zeros to first 2048k bytes            DONE

= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE

= Step 3 of 10 - Disk is now cleared from MBR onward.          DONE

= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4      DONE

= Step 5 of 10 - Clearing MBR code area                        DONE

= Step 6 of 10 - Setting MBR signature bytes                    DONE

= Step 7 of 10 - Setting partition 1 to precleared state        DONE

= Step 8 of 10 - Notifying kernel we changed the partitioning  DONE

= Step 9 of 10 - Creating the /dev/disk/by* entries            DONE

= Step 10 of 10 - Verifying if the MBR is cleared.              DONE

= Disk Post-Clear-Read completed                                DONE

Disk Temperature: 33C, Elapsed Time:  41:10:39

========================================================================1.15

== WDCWD40EFRX-68WT0N0  WD-WCC4E7JD0XCT

== Disk /dev/sdd has NOT been precleared successfully

== skip=1000 count=200 bs=1000448 returned 00512 instead of 00000 skip=1200 count=200 bs=1000448 returned 00512 instead of 00000 skip=1600 count=200 bs=1000448 returned 01024 instead of 00000 skip=2000 count=200 bs=1000448 returned 00512 instead of 00000 skip=3400 count=200 bs=1000448 returned 00004 instead of 00000 skip=3800 count=200 bs=1000448 returned 00002 instead of 00000 skip=4000 count=200 bs=1000448 returned 00002 instead of 00000 skip=4200 count=200 bs=1000448 returned 01024 instead of 00000 skip=4400 count=200 bs=1000448 returned 04096 instead of 00000 skip=5200 count=200 bs=1000448 returned 00002 instead of 00000

============================================================================

No SMART attributes are FAILING_NOW

 

0 sectors were pending re-allocation before the start of the preclear.

0 sectors were pending re-allocation after pre-read in cycle 1 of 1.

0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

0 sectors are pending re-allocation at the end of the preclear,

    the number of sectors pending re-allocation did not change.

root@unraid:/boot#

 

So the preclears are done, and they all show the same thing happening. This leads me to believe that the drives themselves are most likely fine, as the odds of all 4 drives having this issue are very small. *One 4tb drive is not in this system during this troubleshooting phase, hence only 3 reports total.
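To put a rough number on that reasoning, here is a back-of-the-envelope sketch. It assumes drive faults are independent of each other, and the 2% per-drive fault rate is a made-up figure purely for illustration, not a measured failure rate for these drives.

```python
def chance_all_faulty(p_single, drives):
    """Probability that every drive is faulty, assuming independent faults."""
    return p_single ** drives


if __name__ == "__main__":
    # Hypothetical 2% chance that any one new drive is bad, 3 drives tested.
    print(chance_all_faulty(0.02, 3))
```

Even with a generous per-drive fault rate, the chance of all three tested drives being bad at once comes out tiny, so a shared component is the much more likely explanation.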

 

Currently I am running memtest from the Unraid boot menu. I'll report back with those results.

 

Thanks for looking

 

-Dustin

Link to comment

Ran the memtest x86 from the unraid boot flash drive for 8 passes (20 hours) and zero errors were found.

 

At this point the problem is either my motherboard, my cpu or some inherent compatibility issue with my hardware and unraid.
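The elimination so far can be written down as a simple set difference. This is only a sketch of my own reasoning: it treats each component as fully cleared once it was swapped or tested and the fault persisted (or the test passed), which is a simplification. The component labels are my own wording.

```python
SUSPECTS = {
    "drives",
    "SATA cables",
    "onboard SATA ports",
    "power supply",
    "RAM",
    "Unraid install",
    "motherboard",
    "CPU",
}

# (what was done, which component that is taken to clear)
TESTS = [
    ("rotated three drives through a two-drive array", "drives"),
    ("replaced all SATA cables", "SATA cables"),
    ("moved the drives to an add-in SATA controller", "onboard SATA ports"),
    ("replaced the power supply", "power supply"),
    ("memtest, 8 passes with zero errors", "RAM"),
    ("set up a new copy of Unraid", "Unraid install"),
]


def remaining_suspects(suspects, tests):
    """Remove every component that some test is taken to have cleared."""
    cleared = {component for _, component in tests}
    return sorted(suspects - cleared)


if __name__ == "__main__":
    print(remaining_suspects(SUSPECTS, TESTS))
```

Only the motherboard and CPU are left on the list, which is why those are the next things to test.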

 

I found an Intel CPU diagnostic and stress test utility that I'll run on the CPU to see if I can eliminate it as the cause. If the CPU passes, I'll order a Z77 or Z68 based MB to replace this P67 board and try again. I'm pretty set on using Unraid, so I'll keep at it until this thing is up and running.

 

Any suggestions on how to proceed?

 

Thanks for all the help

-Dustin

 

 

Link to comment

If memtest passes then it seems to me like there is nothing between memory and CPU that is affecting the data so I wouldn't be surprised if CPU checks out OK too.

 

But since something is affecting the data when reading and/or writing the disks with both preclear and parity check I am thinking something with the motherboard. Do you have a separate SATA controller card you could plug in and test with?

Link to comment

I do have a SATA card (Marvell 88SE9705 chipset based) that I attempted to use, but I received the same errors when doing a parity check. That said, I am not sure whether I did a "new config", configured the array, rebuilt the parity drive and then ran a parity check, or just swapped the drives onto the card and ran a parity check.

 

I'll reinstall the SATA card and do a preclear on one of the drives to see what I get out of it. If it comes out ok it looks like a motherboard is in my future.

 

Then I need to find a good socket 1155 motherboard with a Z77 or Z68 chipset for a good price, unless there is a server-grade board I haven't been able to find that will work with my 2600K?

 

Thanks again for the help,

 

-Dustin

 

Link to comment

This is an inexperienced guy speaking, but I've been building some computers for years and from what I could gather, you've tested most of the fail points, and they all came out clear. RAM is OK, connections are OK, just that hard drives are reading out bad errors.

 

But you mentioned that three drives were affected. This is starting to concern me to the point where I don't recommend keeping these drives in the same build as the motherboard. I didn't see you mention any RAID cards or SATA controllers, so I'm assuming you're connecting them directly to the motherboard like me? Even if you did test them with a different SATA card, it's still possible the MB is writing garbage data to the drives.

 

If you can, pull out the drives, connect them to another build and try running HDDScan (http://hddscan.com/). If you don't have another build, then a laptop with a 3.5" enclosure will be fine.

 

I strongly suspect a fault in the motherboard's SATA controller.

 

 

 

*Fine print: This is a very very small chance (practically not possible) but it may be that your last computer or something fried these drives. I told you this is the remotest chance. That's why I'm asking you to test them in another computer.

Link to comment

Hi and thank you both for the responses!

 

trurl, thank you for the suggestion about updating the BIOS, but that was one of my first troubleshooting measures. I'll be running a full array and parity rebuild over the weekend, or re-preclearing a drive, after I stress test the CPU.

 

ideaman924, I'll pop one of the drives (the 4th one) into my other build and run HDDScan on it as you suggested. What tests should I run? Also, the drives have only ever been in the Unraid box up until this point. As for the mobo being the culprit, I think all of us who have posted are on or turning to that page now.

 

More on the SATA controller: I received this card from Amazon, and during initial setup I saw that one of my drives was not being detected. I ended up just switching ports, after which all drives were seen by my BIOS. As for my testing with that card, I'm not sure whether I rebuilt the array and parity disk before testing parity. The results were seemingly faster read times but about 4 million parity errors.

Link to comment

4 million parity errors is just bogus. I'm not saying I don't believe you. If you get 4 million parity errors then you get 4 million parity errors. But this is interesting, since the check fails even if there's no data on the drives.

 

You want to start up HDDScan (in administrator mode just in case) and add a Surface Test, and then use the Verify option. Add the task and it'll begin executing automatically. Grab a cup of coffee and sit down. This process might take some time. If the tests finish, double click on the task to show the results, take a  screenshot of all the tabs (Graph, Map, Report) and post it here so we can check what's wrong with the drive.

 

After that, add the S.M.A.R.T task and let it finish. Double click on the task and screenshot again. Show us the S.M.A.R.T screen. I know you already did this on UnRAID, but we're checking here just to make sure your server isn't fudging up somewhere.

Link to comment

Hey guys here are my results in the latest round of testing.

 

Test 1: Unraid box + SATA controller + preclear = fail. The drive could not be precleared. So this is the same result as before, where the drive reads back non-zero data where zeros are expected.

 

Test 2: main build + WD Red 3tb from NAS + HDDScan (surface scan + SMART): please see the attached PDF. As far as I can tell it passed.

 

Test 3: CPU diagnostics in Windows: I couldn't even get this test up and running! I attempted to install Windows 10 on the Unraid box with a known-good installer USB. I tried 4 times, each time with a different USB port (just in case), and each time got a different error message as to why Windows setup had failed, but basically every error message stated that a file was missing.

 

Thanks for looking.

wd_red_3tb.pdf

Link to comment
  • 2 weeks later...

Mark this one solved!

 

I ended up purchasing a second hand hp z420 with an 8-core cpu, and 16gb ECC memory. I added a 5 in 3 icydock and have had it up and running for about 5 days now (docker, plugins, ftp, VM's etc...)

 

Thanks for everyone's help! About the old hardware I'll wait for a crazy deal on a new mobo and see if i can make something out of that hardware.

 

-Dustin

Link to comment

Great, you were able to resolve the problem! Regarding the drives you posted with HDDScan, they don't have any immediately noticeable problems, so you should be fine. Run a preclear on them just to make sure your old configuration didn't corrupt/write garbage data on the platters.

 

Just a question, what was wrong with your last hardware? Was it CPU/RAM corruption or a SATA controller failing?

Link to comment

As far as the new box goes, I did end up preclearing all drives and they passed with no issues. :-)

 

When it comes to the old stuff, I won't know for sure what the issue really is until I get another mobo to test with. But based on all the testing that has been done, my best guess is that the problem is with the motherboard. What exactly the problem is, I couldn't say.

Link to comment
