Beta10a - Disk Errors



So with the latest unRAID beta I have been getting errors on all the data disks in my array, excluding the parity drive. After a fresh reboot this morning, unRAID is reporting a total of 1,648,277,160 errors across three data disks.

 

I am not sure what detail is useful to post, but I can confirm my hardware is fine: I recently downgraded to the latest stable release of v5, ran it for two weeks, and no errors appeared on any disk.

 

Could someone advise me on how to report useful information about this issue?

 

My drives are all WD and all formatted with ReiserFS.

 

Thanks in advance.

 

TL;DR - Root cause:

A faulty PSU was causing errors to be reported on the drives. Tom suspected a drive cage or PSU issue; I swapped both out, got the system working, then reintroduced the drive cage into the mix with success.

Link to comment

What always helps others with troubleshooting is an understanding of your hardware/plugins/add-ons arrangement. It is also useful to capture and post a log file covering startup through several hours of operation. At first blush it seems like it could be a memory or driver issue: the differences between v5 and v6 are the kernel (controller drivers) and 32- versus 64-bit operation. Memory can be checked using the memtest option at boot time; most people here recommend running it for 24 hours.
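
If it helps, a minimal way to capture the log for posting is to copy it to the flash drive from a console or telnet session; this assumes the stock unRAID locations of /var/log/syslog for the log and /boot for the flash:

cp /var/log/syslog /boot/syslog-$(date +%Y%m%d).txt    # copy the current log to the flash drive for attaching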

 

Good Luck!

Link to comment

I resolved an issue with errors after updating from Beta8 to Beta10/Beta10a. I don't believe the errors were caused by Beta10 or Beta10a but by the previously flawed Beta8 (it was Beta8, I believe, that had some serious issues).

 

Anyway, I removed my unRAID server from service for a couple of days and ran reiserfsck on all my drives: reiserfsck --check /dev/mdX, and if the drive required it, reiserfsck --fix-fixable /dev/mdX.

 

One drive required reiserfsck --rebuild-tree /dev/mdX

 

Once the drives completed the reiserfsck process I did a Parity-Check, and all was well.

 

FYI - I opened 4 shells and ran 4 drives at the same time. No problems.
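
For anyone wanting to repeat this, here is a minimal sketch of the sequence per drive, assuming the array is started in maintenance mode so the /dev/mdX devices exist but are not mounted (replace X with the disk number):

reiserfsck --check /dev/mdX          # read-only check; note what it recommends
reiserfsck --fix-fixable /dev/mdX    # only if --check recommends it
reiserfsck --rebuild-tree /dev/mdX   # last resort, only if --check says the tree needs rebuilding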

Link to comment

I resolved an issue with errors after updating from Beta8 to Beta10/Beta10a. I don't believe the errors were caused by Beta10 or Beta10a but by the previously flawed Beta8 (it was Beta8, I believe, that had some serious issues).

 

Anyway, I removed my unRAID server from service for a couple of days and ran reiserfsck on all my drives: reiserfsck --check /dev/mdX, and if the drive required it, reiserfsck --fix-fixable /dev/mdX.

 

One drive required reiserfsck --rebuild-tree /dev/mdX

 

Once the drives completed the reiserfsck process I did a Parity-Check, and all was well.

 

FYI - I opened 4 shells and ran 4 drives at the same time. No problems.

 

 

Were you able to validate that the data itself was not corrupted?

Link to comment

Sorry, my apologies. I didn't know what would or wouldn't be relevant, which is why I asked what needed to be posted. Here is the hardware I am running:

 

Motherboard: Supermicro - X10SLH-F

CPU: Intel® Xeon® CPU E3-1220 v3 @ 3.10GHz

Memory: Kingston KVR16E11/8i - 1600  DDR3 ECC Registered CL11 8192 Module

 

I haven't downgraded yet, but there seem to be two issues occurring.

 

The first issue is that unRAID will randomly mark any one of my data disks as failed. This actually led me to replace two of my three data drives before the fault recurred. To investigate, I ran full SMART tests on all the drives involved, and they all came back fine. The fix each time was to remove the drive from the array, start the array, then add the drive back, at which point unRAID would treat it as a new drive and do a full rebuild of that disk. A day or so later the same thing would happen to another drive. I eventually got fed up with doing this, downgraded to version 5.0.5, ran full SMART tests on the drives, and then ran successfully for about 2-3 weeks with no drive issues at all.
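
For reference, the SMART tests were along these lines; this is just an illustrative sketch, and sdX stands for whatever device letter the disk has on your system:

smartctl -t long /dev/sdX    # start an extended (long) self-test
smartctl -a /dev/sdX         # once the test finishes, review the result and the attribute table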

 

The second issue, which I have now pinpointed to invoking a parity check, is that I get millions of drive errors across only my data drives. I can restart unRAID and it will run fine for days, but the second a parity check is invoked the drive errors go crazy, the drives stay constantly spun up, and issue #1 usually pops up.

 

I am running a parity check now and will post the syslog when the errors start occurring.
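
As a side note, to confirm the drives really are spun down before I kick off the check, I can look at the spin state from the console; a sketch, with sdX being the device in question:

hdparm -C /dev/sdX    # reports "standby" when spun down, "active/idle" when spun up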

Link to comment

The second issue, which I have now pinpointed to invoking a parity check, is that I get millions of drive errors across only my data drives. I can restart unRAID and it will run fine for days, but the second a parity check is invoked the drive errors go crazy, the drives stay constantly spun up, and issue #1 usually pops up.

 

I am running a parity check now and will post the syslog when the errors start occurring.

 

 

Is this with unRAID 5.0.5 or unRAID 6 Beta 10?

Link to comment

Motherboard: Supermicro - X10SLH-F

CPU: Intel® Xeon® CPU E3-1220 v3 @ 3.10GHz

Memory: Kingston KVR16E11/8i - 1600  DDR3 ECC Registered CL11 8192 Module

It looks like good stuff! Do I read the memory spec correctly, that you're using a single 8GB stick?

 

I can't see much that could be wrong; about all I can think of is a cabling problem (not too likely) or perhaps a dodgy power supply.

 

You didn't mention any plugins/add-ons so are you just running stock?

 

Thanks!

Link to comment

Motherboard: Supermicro - X10SLH-F

CPU: Intel® Xeon® CPU E3-1220 v3 @ 3.10GHz

Memory: Kingston KVR16E11/8i - 1600  DDR3 ECC Registered CL11 8192 Module

It looks like good stuff! Do I read the memory spec correctly, that you're using a single 8GB stick?

 

I can't see much that could be wrong; about all I can think of is a cabling problem (not too likely) or perhaps a dodgy power supply.

 

You didn't mention any plugins/add-ons so are you just running stock?

 

Thanks!

 

Running stock. I am almost certain there is nothing wrong with the hardware, as I ran this configuration without issue for about 2-3 weeks when I downgraded last time to test it.

 

Yes, I am running a single 8GB stick. Is this an issue?

Link to comment

So I kicked off a parity check today and then went to work; when I came home the parity check had completed fine. I thought that was weird, so I tried to kick off the parity check again. This time I noticed the drives were spun down before I clicked it, and bingo, I immediately got drive errors. But when I tried to view the syslog on the Tools page I got the error below, which I think was caused by my huge syslog file full of disk errors.

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 132124678 bytes) in /usr/local/emhttp/plugins/webGui/include/myPage_content.php(36) : eval()'d code on line 2

 

I went and grabbed the syslog file and have attached it; I had to cut it down because it was 132MB of disk errors.
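
For anyone wondering, cutting it down can be done from the console with something along these lines (illustrative only; the size and file name are arbitrary):

head -c 5M /var/log/syslog > /boot/syslog-trimmed.txt    # keep the first 5MB, which covers boot and the first errors
gzip /boot/syslog-trimmed.txt                            # compress it for attaching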

syslog.zip

Link to comment

Yes, I am running a single 8GB stick. Is this an issue?

 

Since the motherboard is booting with the single module, it must be OK; I've just always loaded them in pairs, as some motherboards won't boot otherwise (old school). Your manual states:

Populating these DIMM modules with a pair of memory modules of the same type and same size will result in interleaved memory, which will improve memory performance

and I guess you can take that for what it's worth. I'm just not familiar with that level or generation of motherboard.

Link to comment

Looks like ATA errors are being logged, even in the first test. Can you try this in non-Xen mode? Is there an option in the syslinux config for that?

 

 

There could be some kind of timeout issue with the spin-ups or with the ATA/libata code.

Either related to the spin-ups and/or the Xen drivers.

 

I'll boot into non-Xen mode and try it. It looks like I can trigger the fault by initiating a parity check while the drives are spun down, so I'll be able to get an answer quite quickly.
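
For anyone else wanting to try the same thing, the boot selection lives in syslinux.cfg on the flash drive. From memory, the v6 beta entries look roughly like the sketch below, so treat the exact label names and paths as assumptions and check your own file; the plain entry (kernel /bzimage) boots without Xen:

label unRAID OS
  kernel /bzimage
  append initrd=/bzroot

label Xen/unRAID OS
  kernel /syslinux/mboot.c32
  append /xen --- /bzimage --- /bzroot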

Link to comment

Yes, I am running a single 8GB stick. Is this an issue?

 

Since the motherboard is booting with the single module, it must be OK; I've just always loaded them in pairs, as some motherboards won't boot otherwise (old school). Your manual states:

Populating these DIMM modules with a pair of memory modules of the same type and same size will result in interleaved memory, which will improve memory performance

and I guess you can take that for what it's worth. I'm just not familiar with that level or generation of motherboard.

 

I'd always intended to get another stick; I just haven't got around to it, and when I was buying the setup I'd already spent more than I wanted to. Thanks for the heads-up though.

Link to comment

Take a peek over here too.

http://lime-technology.com/forum/index.php?topic=35689.msg332378#msg332378

 

 

Perhaps provide a fuller picture of your hardware:

PSU, drives, etc.

 

Thanks, I will check that out.

 

So the full setup is:

 

SYSTEM:

MB: Supermicro - X10SLH-F

RAM: Kingston KVR16E11/8i - 1600  DDR3 ECC Registered CL11 8192 Module

CPU: Intel® Xeon® CPU E3-1220 v3 @ 3.10GHz

PSU: SilverStone SFX 450w Model ST455F-G

FAN CONTROL: None

HOT SWAP CAGES: IstarUSA BPU-340SATA

RAID CARD: None

STORAGE: 3 x WD Red 3TB drives and 1 x WD Green 750GB drive

VM DRIVE: Samsung 840 Series 250GB SSD

CACHE DRIVE: TOSHIBA 512GB

 

Link to comment

PSU seems to be adequate for the configuration.

 

So with the help of Tom I think I am close to a resolution on this. At this early stage it looks like I may have either a faulty drive cage or a faulty PSU. I have swapped both out, spun down my drives, and kicked off a parity check, which appears to be humming along nicely. After a week of running without errors I will introduce the drive cage back into the mix, then do the same with the PSU, to isolate the fault.
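
As a cross-check while swapping parts, the SMART interface CRC counter can help separate cabling/backplane/power problems from the drives themselves. A sketch, assuming sdX is the device letter and that the drive reports attribute 199 (UDMA CRC error count), which most SATA drives do:

smartctl -A /dev/sdX | grep -i crc    # a rising count points at cables, cage, or power rather than the platters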

 

At this stage it does appear to be faulty hardware and not related to unRAID. unRAID is just doing its job.

 

I will keep reporting back as I continue to investigate.

Link to comment

Vibration in one of my older rigs used to knock out one of the drives under heavy activity.

The PSU you selected is sized well enough from what I saw: 36A on a single 12V rail.
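
As a rough sanity check on that sizing (the per-device figures are typical datasheet numbers I'm assuming, not measurements):

36A x 12V  = 432W available on the 12V rail
4 drives   = roughly 8A during simultaneous spin-up (about 2A each at 12V, worst case)
E3-1220 v3 = 80W TDP, roughly 7A, plus a few amps for the board, fans, and SSDs

That leaves plenty of headroom, so capacity isn't the problem; a failing unit still can be.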

 

 

I didn't totally doubt the hardware since you said you downgraded to unRAID 5 and ran clean for two weeks.

Keep us updated.

 

Honestly, that has me baffled; the whole purpose of downgrading was to validate the hardware. When I was building the system I focused on a quality power supply and motherboard, and I believe I achieved that, but things fail. I'm leaning towards the drive cage though.

 

Either way, I'll let people know so this issue can be closed out. For my next PSU I will get the 600W version of the same unit for a bit more grunt.

 

Thank you for your help :-)

Link to comment

I'm now going to call this solved. I swapped my drive cage back in, left the new PSU in place, ran overnight parity checks, spun the array up and down multiple times, and ran the mover. No errors on any of the disks.

 

This is good enough for me. Thank you, everyone, for your suggestions and help.

 

 

 

Link to comment
