After Parity Sync, Parity Check crashes w/Disk 1 faulty and no Syslog



About 2 weeks ago, after upgrading to 6.0.1, I had my parity drive and one of my data drives (Disk 5) die at the same time.  Both were connected using the same Molex to SATA Y power adapter, so I assumed it was just a bad wire, even though the drives (6TB WD Reds) had been in use and connected that way since I built the server in March.  I RMA'd the drives, ran the replacements through 3 preclear cycles each, then added them to the array as the new parity and Disk 5 and ran the initial parity sync overnight.  Everything appeared to go fine, but when I ran a parity check immediately afterward, the check stopped less than an hour in with Disk 1 showing over 800k errors and marked faulty/disabled.

 

Thankfully the contents of Disk 1 are emulated so I am backing them up to another server now.  But what should I do after that?  I'm concerned that on the Dashboard tab, under Parity Status it says "Data is invalid" (even though on the Main tab there's a green dot by the parity drive showing normal operation).  On top of that, I can't access a log.  Clicking on System Log from Tools just shows a blank page, and downloading it comes up blank also.  At this point I'm afraid to take the array offline, power down, or do anything else until I've backed up all the other data drives too. 

 

I know there's not a ton of info to go on here without a log, but does anyone have any advice?  I've had nothing but problems with these 6TB Reds since I got them, but I had thought it was only when using them with Molex to SATA power adapters (in addition to the 2 I just RMA'd, there was another I RMA'd about a month earlier that also died without warning when I tried to move it to another server).  But this time I only used the PSU's SATA power connectors and double-checked there were no loose connections.  So now I don't know what's going on.  Disk 1 is connected via a SATA controller card (as are Disks 2, 3, and 4).  I haven't had any problems with that card, and all the drives that failed earlier were running off the motherboard.  The PSU is a Corsair RM650 which I was previously using in a different server, and I've never had any problems with it either.

 

Anyway, would really appreciate knowing what steps I should take from here, and let me know if there's any other info I can provide.  Thanks.

Link to comment

Thanks Jonp for moving this to General Support.  I've updated my signature to reflect my current hardware in case that helps.  The first drive failure happened when I was attempting to move all the 6TB drives from Server 2 to Server 1, with Server 1 failing to boot afterward.  The 3 failures since, including the current Disk 1, all happened in Server 2.  That's 4 of the 6 6TB Reds I bought in March dead within 4 months -- and worse, dropping with no warning whatsoever (all precleared 3 cycles with clean SMART reports, no reallocated sectors, etc., and with temps in the 27-36 C range).  Unless someone has any other theories, I'm going to assume I just lost the tech lotto and bought from a bad batch of drives?

 

I'm still at a loss for what to do going forward though.  Right now I'm not doing anything else with Server 2 until I make sure all the other data drives are backed up, since I now have zero confidence that the remaining drives won't fail at any moment.  After that I'll reboot and hopefully be able to post a log.

 

I've heard it said that if drives are going to fail they usually fail early, and if they don't, they last a long time.  I have 4 x 4TB WD Reds now in a Windows FlexRAID server in addition to the 6 x 3TB Reds in Server 1, and I've yet to have any problems at all with those drives.  These 6TBs have been a nightmare though.  Isn't it highly unusual for multiple drives to just die like this with no previous errors or hint of impending failure?  This server sits on a table right next to my desk and these drives aren't even making any odd noises before dying.  They just work normally, and then seemingly for no reason there's a red X by them in the GUI and they're gone.

Link to comment

Tools->Diagnostics. Post the zip file.

 

Thanks dgaschk.  Unfortunately the zip is too big to attach, and even extracting just the syslog, it's still too big (5.22 MB zipped).  Is there something else from the diagnostics I can post, or some portion of the syslog to key in on?  The server's been up since July 3, when I installed the RMA replacement parity and Disk 5 drives and started preclearing them.
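
For anyone in a similar spot, one way to get the log down to an attachable size is to keep just the window around the parity sync before zipping it.  This is only a rough sketch, assuming the syslog has been downloaded to another machine as syslog.txt and uses the usual "Mon DD HH:MM:SS" timestamps; the filenames and date prefixes are placeholders, not anything taken from the actual diagnostics.

```python
# Sketch: trim a downloaded unRAID syslog to the window around the parity sync
# (Jul 8 evening through Jul 9) so the attachment stays small.
# "syslog.txt" and the date prefixes below are assumptions -- adjust to match your log.
wanted_prefixes = ("Jul  8 23", "Jul  9")  # syslog pads single-digit days with an extra space

with open("syslog.txt", "r", errors="replace") as src, \
        open("syslog-parity-window.txt", "w") as dst:
    kept = 0
    for line in src:
        if line.startswith(wanted_prefixes):
            dst.write(line)
            kept += 1

print(f"kept {kept} lines covering the parity sync window")
```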

Link to comment

Hi dgaschk.  Could you glean anything from the syslog I posted?  The portion that jumps out to me is from Jul 8 23:08 through the next morning.  This is when I was running the parity sync after adding the 2 precleared replacement drives (Disk 5, then parity).  On the web GUI at that time there was no indication that the parity sync wasn't running normally.  I reloaded to check progress a couple of times in the first few minutes before I went to bed, and when I woke up progress was around 75%.  Then when I got back after it completed around noon on July 9, there were no errors reported.  But in the log there are read errors and error handling all over the place.  That doesn't make sense to me, because if parity didn't actually sync, then how am I able to back up all the emulated contents of Disk 1 after the parity check crashed and the drive was disabled as faulty?

 

I don't know what happened during this parity sync, but it looks to me like this is where the problem might have started: during the parity sync, before the parity check.  I'm just trying to understand what exactly happened.  Thanks again.
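
As an aside on how the emulation can still work here: with single parity, each parity byte is the XOR of the corresponding bytes on all the data disks, so a disabled disk's contents can be reconstructed on the fly from parity plus the remaining disks.  That's consistent with Disk 1 staying readable as long as parity and the other disks are good for the sectors being read, even if the sync logged errors elsewhere.  A toy sketch of the idea (purely illustrative, not the actual md driver):

```python
# Toy illustration of single-parity emulation: parity is the byte-wise XOR of
# all data disks, so a missing disk can be rebuilt from parity XOR the others.
disk1 = bytes([0x11, 0x22, 0x33])
disk2 = bytes([0x44, 0x55, 0x66])
disk3 = bytes([0x77, 0x88, 0x99])

parity = bytes(a ^ b ^ c for a, b, c in zip(disk1, disk2, disk3))

# Disk 1 drops out: "emulate" it by XOR-ing parity with the surviving disks.
emulated_disk1 = bytes(p ^ b ^ c for p, b, c in zip(parity, disk2, disk3))

assert emulated_disk1 == disk1  # reads of the disabled disk still return its data
```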

Link to comment

Server 1 does not have enough amperage. The PSU is insufficient.

 

Check for BIOS and firmware updates for server 2. Run MEMtest overnight.

 

Thanks.  I'll check for BIOS and firmware updates for Server 2 and run MEMtest (after I've finished backing up the other data drives, since I don't want to risk taking the array offline or power cycling until then).  Are you basing this advice on something you're seeing in the log, btw?

 

As for Server 1, the C-60 is a 9W TDP processor.  Powering that plus a single stick of RAM and 3 case fans, in addition to the 6 hard drives, it idles at 30W or less per my Kill A Watt, and I've never seen it spike even to 100W at boot.  So how would a 300W PSU be insufficient?  Not challenging you on that, since I trust you probably know more about these things than I do; just trying to understand what you're basing it on.  Thanks again.

Link to comment

Check for BIOS and firmware updates for server 2. Run MEMtest overnight.

 

Alright, I finished backing up everything on Server 2 and updated the BIOS, then ran MEMtest for 12 passes with no errors.  Disk 1 still shows as faulty, but I thought I'd be able to start the array unprotected and still access the emulated contents of Disk 1 as I'd been able to previously.  Good thing I backed it up, because once the array started, Disk 1 came up as unmountable with the option to format it, and I can no longer access its contents.

 

So now what happens when I get the RMA replacement drive for Disk 1?  Should I still be able to rebuild the failed drive's contents onto it from parity, or will I have to copy them from backup and then run another parity sync?  A quick step-by-step would be helpful if you could point me in the right direction.  Thanks.

 

I'm also still open to theories on why Disk 1 might have failed in the first place, and why the previous parity sync appeared from the web GUI to run normally and complete just fine while on the syslog it was throwing errors all over the place.  Updating BIOS is nice but I can't be confident I've addressed the root cause of the problem until I understand what it actually was.

Link to comment

See here: http://lime-technology.com/forum/index.php?topic=12219.0

 

The wattage is not important. The PSU does not supply enough amperage for the disk drives. The PSU allots 17 amps for the disks and motherboard and 17 amps for a graphics card. This is a fixed allocation, meaning that you only have access to less than 50% of the total power.
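
To put rough numbers on that, here is a back-of-the-envelope sketch using the 17 amp per-rail allocation quoted above.  The per-drive spin-up current (~2 A on the 12 V rail) and the board/fan draw are assumptions, not measured values, and real figures vary by drive model:

```python
# Back-of-the-envelope 12 V budget for Server 1's split-rail PSU.
rail_limit_amps  = 17.0  # 12 V amps allotted to disks + motherboard (figure quoted above)
drives           = 6
spinup_amps_each = 2.0   # typical 3.5" drive surge at spin-up (assumed)
board_fans_amps  = 2.0   # motherboard, CPU, and fans on the 12 V rail (assumed)

peak_amps = drives * spinup_amps_each + board_fans_amps
print(f"Estimated 12 V peak draw: {peak_amps:.1f} A of {rail_limit_amps:.1f} A available")
print(f"Headroom: {rail_limit_amps - peak_amps:.1f} A")
# ~14 A of 17 A: very little margin once cabling losses or a weak rail are
# factored in, even though the steady-state wattage looks tiny.
```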

Link to comment

Further advice requires the diagnostics file.

Link to comment
