bennymundz Posted October 21, 2014

So with the latest unRAID beta I have been getting errors on all the data disks in my array, excluding the parity drive. Today, after a fresh reboot this morning, unRAID is reporting a total of 1,648,277,160 errors across three data disks. I am not sure where to go to post more detail, but I can confirm all my hardware is fine, as I recently downgraded to the latest stable release of v5 and ran for 2 weeks with no errors appearing on the disks. Could someone assist me with how to report useful information about this issue? My drives are all WD and all running ReiserFS. Thanks in advance.

TL;DR - Root cause: a faulty PSU causing errors to be reported on the drives. Tom suspected a drive cage or PSU issue; I swapped both out, got the system working, and reintroduced the drive cage back into the mix with success.
doorunrun Posted October 23, 2014

What always helps others with troubleshooting is an understanding of your hardware/plugins/addons arrangement. It also helps to capture and post a log file covering startup through several hours of operation. At first blush, it seems like it could be a memory or driver issue. The differences between v5 and v6 are the kernel (controllers/drivers) and 32- versus 64-bit operation. Memory can be checked using the memtest option at boot time; most people here recommend running it for 24 hours. Good luck!
mygoogoo Posted October 23, 2014

I resolved an issue with errors after updating from Beta8 to Beta10/Beta10a. I don't believe the errors were caused by Beta10 or Beta10a but by the previously flawed Beta8 (it was Beta8, I believe, that had some serious issues). Anyway, I removed my unRAID server from service for a couple of days and ran reiserfsck on all my drives (reiserfsck --check /dev/mdX, and if a drive required it I ran reiserfsck --fix-fixable /dev/mdX). One drive required reiserfsck --rebuild-tree /dev/mdX. Once the drives completed the reiserfsck process I did a parity check, and all was well. FYI - I opened 4 shells and ran 4 drives at the same time. No problems.
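For anyone wanting to repeat this, the sequence above can be sketched as a small script. This is a dry run (every command is echoed, not executed), the md1..md4 device numbers are placeholders for your own array, and reiserfsck should only be run for real with the array stopped or in maintenance mode:

```shell
#!/bin/sh
# Dry run of the reiserfsck sequence described above -- each command is
# echoed rather than executed. md1..md4 are placeholder device numbers;
# adjust to your array, and remove the 'echo' only with the array offline.
for n in 1 2 3 4; do
    echo "reiserfsck --check /dev/md$n"
    # if --check reports fixable corruption, follow up with:
    echo "reiserfsck --fix-fixable /dev/md$n"
done
# last resort, only if reiserfsck explicitly tells you to:
# reiserfsck --rebuild-tree /dev/mdX
```

As in the post above, each drive can be checked from its own shell to run them in parallel, since each reiserfsck instance works on a single device.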
WeeboTech Posted October 23, 2014

Quoting mygoogoo: "I removed my unraid server from service for a couple of days and ran reiserfsck on all my drives..."

Were you able to validate that the data itself did not have corruption?
bennymundz Posted October 23, 2014

Sorry, my apologies, I didn't know what would be relevant or not, which is why I was asking what needed to be posted. Here is the hardware I am running:

Motherboard: Supermicro X10SLH-F
CPU: Intel Xeon E3-1220 v3 @ 3.10GHz
Memory: Kingston KVR16E11/8i - 1600 DDR3 ECC Registered CL11 8192 module

I haven't downgraded yet, but what seems to be occurring is two issues.

The first issue is that unRAID will randomly select any one of my data disks and mark it as failed. This actually caused me to go and replace two of my three data drives before the fault reoccurred. To investigate this issue I ran full SMART tests on all the drives involved, which all came back fine. What was then required was to remove the drive from the array, start the array, and add the drive back, at which point unRAID would identify it as a new drive and undertake a full rebuild of that disk. Then about a day or so later it would switch to another drive. I eventually got the shits with doing this and downgraded to version 5.0.5, ran full SMART tests on the drives, and successfully ran for about 2-3 weeks with no drive issues at all.

The second issue, which I have now pinpointed to invoking the parity check, gives me millions of drive errors across only my data drives. I can restart unRAID and it will run fine for days, but the second a parity check is invoked the drive errors go crazy, the drives stay constantly spun up, and issue #1 usually pops up. I am running a parity check now and will post the syslog when the errors start occurring.
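For reference, the "full SMART tests on all drives" step can be driven from the console with smartctl (assuming smartmontools is available, as it is on unRAID). A dry-run sketch — the sdb..sdd device names are placeholders, and the echo must be removed to actually start the tests:

```shell
#!/bin/sh
# Dry run: print the commands to start a long (extended) SMART self-test
# on each data drive. sdb..sdd are placeholder device names -- remove the
# 'echo' to actually start the tests. The tests run inside the drives
# themselves and take several hours on 3TB disks.
for d in sdb sdc sdd; do
    echo "smartctl -t long /dev/$d"
done
# once finished, review the results and overall health per drive with:
# smartctl -a /dev/sdX
```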
WeeboTech Posted October 23, 2014

Quoting bennymundz: "...the second a parity check is invoked drive errors go crazy, drives stay constantly spun up and issue #1 usually pops up."

Is this with unRAID 5.0.5 or unRAID 6 Beta 10?
doorunrun Posted October 23, 2014

Quoting the hardware list above: Supermicro X10SLH-F, Xeon E3-1220 v3, Kingston KVR16E11/8i.

It looks like good stuff! Do I read the memory correctly - you're using one 8GB stick? I can't see much that could be wrong; about all I can think of would be a cabling problem (not too likely) or perhaps a dodgy power supply. You didn't mention any plugins/add-ons, so are you just running stock? Thanks!
WeeboTech Posted October 24, 2014

There have been some odd reports of people having varying degrees of parity errors on the late betas in XEN mode versus non-XEN mode. However, without enough detail for a proper defect report, it's hard to determine. See here: http://lime-technology.com/forum/index.php?topic=34456.0
bennymundz Posted October 24, 2014

Quoting doorunrun: "Do I read the memory correctly, you're using 1 8GB stick? ... You didn't mention any plugins/add-ons so are you just running stock?"

Running stock. I am almost certain that there is nothing wrong with the hardware, as I ran this configuration without issue for about 2-3 weeks when I downgraded last time to test my hardware. Yes, I am running one 8GB stick - is this an issue?
bennymundz Posted October 24, 2014

So I kicked off a parity check today and then went to work; upon coming home, the parity check had worked fine. I thought to myself, weird, so I tried to kick off the parity check again. This time I noticed the drives were spun down before I clicked it - bingo, I immediately got drive errors. But when I tried to view the syslog on the Tools page there was this error, which I think was caused by my crazy big syslog file filled with disk errors:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 132124678 bytes) in /usr/local/emhttp/plugins/webGui/include/myPage_content.php(36) : eval()'d code on line 2

I went and grabbed the syslog file and have attached it; I had to cut it down because it was 132MB of disk errors.

syslog.zip
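One way to cut a syslog like that down to a postable size is to filter out the repeated ATA error lines before attaching it. A small helper sketch — the 'ata[0-9]' pattern is an assumption based on the usual "ataN.NN: ..." kernel messages, and the output path is just an example:

```shell
#!/bin/sh
# trim_syslog: drop the endlessly repeated kernel ATA error lines from a
# syslog copy and cap the remainder, so the attachment stays small.
# The 'ata[0-9]' pattern is an assumption based on typical "ataN.NN:"
# kernel messages -- adjust it to match the errors you actually see.
trim_syslog() {
    grep -v 'ata[0-9]' "$1" | head -n 5000
}

# example (writes a trimmed copy to the flash drive):
# trim_syslog /var/log/syslog > /boot/syslog-trimmed.txt
```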
WeeboTech Posted October 24, 2014

Looks like ATA errors are being logged, even in the first test. Can you try this in non-XEN mode? Is there an option in syslinux for that? There could be some kind of timeout issue with the spin-ups or with the ATA/libata code, related either to the spin-ups and/or the XEN drivers.
doorunrun Posted October 24, 2014

Quoting bennymundz: "Yes i am running 1 8gb stick is this an issue ?"

Since the motherboard is booting with the single module, it must be OK; I've just always loaded them in pairs, as some motherboards won't boot otherwise (old school). Your manual states: "Populating these DIMM modules with a pair of memory modules of the same type and same size will result in interleaved memory, which will improve memory performance." I guess you can take that for what it's worth; I'm just not familiar with that level or generation of motherboard.
bennymundz Posted October 24, 2014

Quoting WeeboTech: "Can you try these in non XEN mode?"

I'll boot into non-XEN mode and try it. It looks like I can trigger the fault by initiating a parity check while the drives are spun down, so I'll be able to get an answer quite quickly.
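If it helps to script the reproduction, the spin-down-then-check sequence can be driven from the console. This sketch only echoes the commands: unRAID's /root/mdcmd helper and the exact "spindown N" / "check" subcommands are assumptions that may differ between releases, so verify them against your version before removing the echo:

```shell
#!/bin/sh
# Dry run of the reproduction: spin each data disk down, then start a
# parity check. The /root/mdcmd helper and its "spindown N" / "check"
# subcommands are assumptions (they may differ between unRAID releases),
# so every command is echoed rather than executed.
for n in 1 2 3; do
    echo "/root/mdcmd spindown $n"
done
echo "/root/mdcmd check"
```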
bennymundz Posted October 24, 2014

Quoting doorunrun: "I've just always loaded them in pairs as some m.b's won't boot (old school)."

I'd always intended to get another stick, I just CBF'd to date, and when I was buying the setup I'd already spent more than I wanted to. Thanks for the heads-up though.
WeeboTech Posted October 24, 2014

Take a peek over here too: http://lime-technology.com/forum/index.php?topic=35689.msg332378#msg332378 Perhaps provide a fuller picture of your hardware: PSU, drives, etc.
bennymundz Posted October 24, 2014

Thanks, I will check that out. So the full setup is:

SYSTEM:
MB: Supermicro X10SLH-F
RAM: Kingston KVR16E11/8i - 1600 DDR3 ECC Registered CL11 8192 module
CPU: Intel Xeon E3-1220 v3 @ 3.10GHz
PSU: SilverStone SFX 450W, Model ST455F-G
FAN CONTROL: None
HOT SWAP CAGES: iStarUSA BPU-340SATA
RAID CARD: None
STORAGE: 3 x WD Red 3TB drives and 1 x WD Green 750GB drive
VM DRIVE: Samsung 840 Series 250GB SSD
CACHE DRIVE: Toshiba 512GB
WeeboTech Posted October 24, 2014

PSU seems to be adequate for the configuration.
bennymundz Posted October 25, 2014

So with the help of Tom, I think I am close to a resolution on this. At this premature stage it looks like I may have either a faulty drive cage or a faulty PSU. I have swapped both out, spun down my drives, and kicked off a parity check, which appears to be humming along nicely. I will introduce the drive cage back into the mix after a week of running without errors, and do the same for the PSU, to isolate the fault. At this stage it does appear to be faulty hardware and not related to unRAID; unRAID is just doing its job. I will keep reporting back as I continue to investigate.
WeeboTech Posted October 25, 2014

Vibrations in one of my older rigs used to knock out one of the drives under heavy activity. The selected PSU is sized well enough from what I saw: 36A on a single rail. I didn't totally doubt the hardware, since you said you downgraded to unRAID 5 and ran clean for two weeks. Keep us updated.
bennymundz Posted October 25, 2014

Quoting WeeboTech: "I didn't totally doubt the hardware since you said you downgraded to unRAID 5 and ran clean for two weeks."

Honestly, that has me baffled - that was the whole purpose of downgrading, to validate the hardware. When I was building the system I focused on quality power and a quality motherboard, and I believe I achieved that, but things fail. I'm leaning towards the drive cage though. Either way, I'll let people know so this issue can be closed out. For my next PSU I will get the 600W version of the same thing for a bit more grunt. Thank you for your help :-)
dgaschk Posted October 26, 2014

See here: http://lime-technology.com/forum/index.php?topic=9880.msg94514#msg94514
bennymundz Posted October 26, 2014

I'm now going to call this solved. I swapped my drive cage back in, left the new PSU in, ran overnight parity checks, spun the array up and down multiple times, and ran the mover. No errors on any of the disks. This is good enough for me. Thank you everyone for your suggestions and help.