Lost 1 drive, havoc ensued



Long story: I'm running 4.4.2 on a 6-disk array.  After an evening of using just disk5, I went to power down for the night.  disk1-4 were spun down while disk5 and parity were still up.  I hit the spin-up button and everything seemed fine.  But after I clicked the Stop button, the page wouldn't refresh.  The box wasn't pingable, nor could I telnet to it.  After a reboot, I still couldn't get to any of the shares, and the terminal showed some skewed text without really any error.  At this point I figured a disk was semi-failing.

 

So I rebooted it and went to the prompt this time.  All disks showed as mounted.  So I started doing touch /mnt/disk1/1, touch /mnt/disk2/2, and so on.  When I got to disk4, the box froze.  I thought I'd found the culprit and shut down the box to unplug disk4's power.  After powering up, I could get to the web page, and disk4 showed as missing.  Good.  I started the array from the page, but it froze yet again.  I went to the console, and df showed only /mnt/disk3 being there.  So I thought maybe it was disk3 that's bad.  Another power-down: plug in disk4 and unplug disk3.  Oops, I should have powered up with all 6 disks plugged in; now when it came up, the page said: "Too many wrong and/or missing disks!"

 

What should I do now?  Is my data still protected as long as I find the bad disk and replace it?  How can I find out which disk is bad?  How do I get unRAID to recognize my drives again and start rebuilding?  Currently disk3 shows as missing since I unplugged it, and disk4 has a blue dot next to it after I unplugged it and plugged it back in.

 

BTW, I couldn't post this in the 4.4 forum; the Post button doesn't do anything.

 

Link to comment

Your diagnostic methods remind me of trying to locate a bad spark plug!  I'm going to have to kid you here by saying we're in a more modern age now.  If we're wondering about a hard drive, we don't have to treat it like a dumb spark plug; we just ask it what is wrong!  We have SMART in our drives, and we have syslogs, both chock full of info they want to tell us!  On a more serious note, that really is the wrong way, a very bad way, to diagnose any RAID array.  If this were not an unRAID array, you would have lost ALL of the data on all of the drives.

 

And besides, I don't think there is anything seriously wrong with any of the drives.  It is extremely rare for drive issues to crash the system; they just cause a lot of errors, and sometimes slow the system way down, but don't actually freeze or crash it.  It is possible (but rare) to have corruption in the Reiser file system, after a bad shutdown or serious power issue, that could cause a crash, with very characteristic evidence of a kernel panic on the physical console.  But that would only require a single reboot, and the use of reiserfsck (like scandisk) to clean it up.
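If it ever does come to a file system check, the command is roughly the following.  This is only a sketch: the /dev/md1 device name is an assumption for disk1's array device, so confirm the correct device for your system (and that the disk is not otherwise in use) before running anything.

# Read-only check of the Reiser file system on one data disk.
# /dev/md1 is an assumption for disk1's array device (verify first!).
reiserfsck --check /dev/md1

# Only if the check reports fixable corruption would you follow up with:
# reiserfsck --fix-fixable /dev/md1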

 

What would have been most helpful is a syslog captured immediately after the apparent crashes, if it was still available from the console.  And even if you can't get a syslog, a digital camera pic of the console can help.  I do recommend the Troubleshooting page, below in my sig.  For now, we need to first recover your array, so please do re-hook up all drives, and see if a reboot will keep the system up long enough to do the Trust My Array procedure, which will restore the array and all drives back to normal.  Your syslog shows no issues at all with the system, besides the obvious status of the 2 drives, which is completely normal, just the way I would have expected them to be based on your story of actions taken.  Now we need a syslog captured right after the system fails again, or a good idea of what the last messages on the screen are, in order to figure out what the problem really is.
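If the console is still alive after one of these freezes, the syslog can be saved straight to the flash drive before rebooting, since it lives in RAM and is lost on reboot.  Something like this one-liner is enough (a minimal sketch, assuming the flash drive is mounted at /boot as usual):

# Copy the in-RAM syslog to the flash drive with a timestamped name.
cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M%S).txt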

 

Two other things you can try: first, run a 2 to 4 hour memory test from the unRAID boot screen, to eliminate a memory problem.  Memory problems are a big source of crashes.  Second, obtain SMART reports for ALL of the drives, including Disk 3 once it is re-connected.  Please see the Troubleshooting page, Obtaining a SMART report section.
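For reference, a SMART report is just one smartctl call per drive, saved somewhere that survives a reboot.  A rough sketch only; the device names below are examples, so substitute whatever your controllers actually use, and SATA drives on older kernels may need the -d ata flag:

# Full SMART report for an IDE drive and a SATA drive, saved to the flash drive.
smartctl -a /dev/hda > /boot/smart-hda.txt
smartctl -a -d ata /dev/sda > /boot/smart-sda.txt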

Link to comment

Thanks for all the suggestions; I will bring everything back and try the Trust My Array method.  The reason I just unplugged it is that every time I tried to access the share via XP, it locked up the system, or at least appeared to (share timing out, GUI timing out, can't ping or telnet, and the command prompt not responding to any keystrokes).  In my experience with bad drives in XP, it freezes up for a looong time until it finally times out.  That, and the touching of a file freezing up the box.

 

I did not realize the syslog wasn't being appended to until I copied the latest one over to the flash drive.  The syslog from right after the crash was actually on the flash drive before this.  I'll get another once I get the array restored tonight; hopefully it's still intact.

Link to comment

After I plugged in all the drives, booted up, and followed the Trust My Array steps, I got to the part where it says to click Start to bring all the drives to green.  The GUI froze after that, so I went to the console and tried to copy the syslog.  Before I finished typing the "g" in "log", the keyboard went unresponsive.  The screen went "skewed" again, which I am attaching to this post.  After a reboot, the GUI showed all drives as blue balls, and the Start button is labeled:

 

Start will record all disk information, bring the array on-line, and start Parity-Sync (if parity is present). The array is immediately available, but is unprotected until Parity-Sync completes.

 

I was able to run a SMART report on all drives, and they all said PASSED.  I don't know what to do next.  Since I have the case open now, I can hear some periodic sound from one of the drives.  It sounds like the motor shuts down and then starts up right away, in a split second.  But it's so random that I haven't been able to figure out which drive is doing it.  While I wait, I will run a memory test.

Link to comment

More SMART reports and a syslog, but I doubt the syslog is helpful since it didn't crash.

 

Quick overview of the SMART reports:

 

Reallocated_Sector_Ct is 0 for all drives

 

Current_Pending_Sector is 10 for the parity drive

 

The highest Temperature_Celsius was 30
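(For anyone following along, a quick way to pull those numbers out of saved reports is a loop like this; the /boot/smart-*.txt file names are just placeholders for wherever the reports are saved:)

# Show the key attributes from every saved SMART report at once.
for f in /boot/smart-*.txt; do
  echo "== $f"
  grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Temperature_Celsius' "$f"
done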

Link to comment

The parity disk has some problems ...

 

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       10

 

But this does not seem like the kind of thing that would crash the server.

 

Look here for a description of these (and other) SMART attributes.

Link to comment


Thanks for the link.  It appears I may have some sector errors on the parity drive, but like you said, it shouldn't be crashing the server.

Link to comment

Sorry, life got in the way.  Volunteer help can be very unreliable, especially from me!

 

This looks like a tough one.  There is a fair amount of evidence, but nothing conclusive.

* Clear corruption, in 2 different sessions, of the screen buffer of the current terminal session.  Appears to include an 8 byte slide of the various lines from the buffered console, plus an overwrite of the 8 bytes between each line buffer

* No apparent corruption elsewhere.  Both syslogs and all of the SMART reports are correctly structured, and contain no evidence whatsoever of any buggy modules or memory corruption.  So far, only the console display shows anything wrong.

* System misbehavior: apparent freezes, keyboard unresponsive, network offline.  So far, cannot tell if these freezes are crashes or panics, or just endless loops (100% CPU) or stuck in a race condition

* Possible power interruption, or motor control issue.  A motor that is dipping down then returning to full revs seems very significant.  Power supply issues?  Getting too hot?  Which motor, fan or power supply or drive?  Drive SMART reports show no evidence of this.

* Memory test, 1 pass only, shows no problems, but flaky memory bits need more passes, perhaps over night, to be identified.  Symptoms so far don't 'feel' like memory issues, not random enough, but that is not conclusive.

* No evidence at all of drive or cabling issues.  That leaves the motherboard, add-on controller, power supply, overheating chipsets, flaky memory under load or when heated up, kernel incompatibility with some hardware component, or some other module bug?

 

Some things to try, may or may not help, but may provide more info:

* Repeat the Trust My Array procedure a couple more times, to determine if it freezes in the same way at the same point.  Also repeat other tasks that had previously 'crashed', to tell whether they crash randomly or are completely repeatable.  It's really helpful to know what is repeatable, and what is not.

* Run the memory test overnight or when not doing anything else; we want to force any weak memory bits to fail.

* Try booting different unRAID versions, both forward and backward, v4.3.3 and v4.5-beta6.  We want to 'try' to eliminate software compatibility issues in a particular kernel release.  Get syslogs whenever possible; you never know which one will have the important clue.

* Try unhooking power to ALL drives, and booting, then listening for the dipping motor.  Especially check to see if it is a fan or the power supply.  You said that it was a 'periodic sound', but also that it was 'random'?

* Carefully feel for extremely hot components on the motherboard, but be careful to avoid any static discharge, and don't touch any exposed electrical traces.

* If you can find a substitute power supply, try it.

 

That should keep you busy for a bit!  :)

Link to comment

Thank you for the volunteer help. :)  I was hoping someone from Lime would take this on too.

 

For the Trust My Array method, can I still do that right now, when all my drives are blue and the Start button says it'll start a sync?  I'm guessing that after running the Trust My Array method, it'll go back to "normal"?

 

After I posted, the memory test finished 3 passes, all without error, so I'm leaning away from a memory error, since the motor noise is a bigger sign.  When I say periodic, I just mean I hear it randomly but repeatedly; no pattern so far.  I didn't want to leave the drives on for too long to test the memory, so I'll test the memory tonight with all the drives' power unhooked.

 

If it's a drive issue, I suppose that if I hook each drive up to an external power supply and power it up individually, I'd hear the bad drive's motor that way.  Maybe that's something I can try too.  It does not appear to be the power supply fan, etc., but I'll try unhooking power to all the drives.  I'll also create a script to copy the syslog with minimal keystrokes so I can grab it before the box crashes, and report back what I find.  Thanks again for the help.
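The script I have in mind is only a couple of lines, something like the sketch below (the /boot path is the usual flash mount; the file names are just placeholders):

#!/bin/bash
# Save as /boot/cs (a short name means fewer keystrokes before a freeze).
# Copies the in-RAM syslog to the flash drive with a timestamp, then syncs.
cp /var/log/syslog /boot/syslog-$(date +%H%M%S).txt
sync   # flush to the flash drive right away, in case the box locks up next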

 

Link to comment

The 'Trust My Array' procedure starts with the Restore button, so it resets the config back to zero anyway, whatever the current state is.  I can't think of any reason at all why you couldn't repeat it.  Once you Start the array, everything should return to normal (as normal as your system can currently be), with a parity check in progress, which you can abort.  Once everything is fixed, you will want to run a full parity check.

 

Memory sounds good.

Link to comment

Finally, I got something useful in the syslog.  I noticed that it would crash soon after I ran df.  So I just kept running df and copying the syslog, over and over.  Here's a snippet:

 

May 20 20:34:19 unRAID kernel: hda: dma_timer_expiry: dma status == 0x61
May 20 20:34:29 unRAID login[1311]: ROOT LOGIN  on `tty1'
May 20 20:34:29 unRAID kernel: hda: DMA timeout error
May 20 20:34:29 unRAID kernel: hda: dma timeout error: status=0x50 { DriveReady SeekComplete }
May 20 20:34:29 unRAID kernel: ide: failed opcode was: unknown
May 20 20:34:29 unRAID kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
May 20 20:34:29 unRAID kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
May 20 20:34:29 unRAID kernel: ide: failed opcode was: unknown
May 20 20:34:29 unRAID kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
May 20 20:34:29 unRAID kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
May 20 20:34:29 unRAID kernel: ide: failed opcode was: unknown
May 20 20:34:29 unRAID kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
May 20 20:34:29 unRAID kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
May 20 20:34:29 unRAID kernel: ide: failed opcode was: unknown
May 20 20:34:29 unRAID kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
May 20 20:34:29 unRAID kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
May 20 20:34:29 unRAID kernel: ide: failed opcode was: unknown
May 20 20:34:30 unRAID kernel: ide0: reset: success
May 20 20:34:50 unRAID kernel: hda: dma_timer_expiry: dma status == 0x61

 

Sounds like my hda is the one having issues.  I read a mention of DMA errors being caused by the cable itself?

Link to comment

Yes, DMA errors are often related to communications issues, which are very often cable related, but these are different.  They are 'DMA timeout errors', implying a non-responsive client on one end of a DMA transfer.  Plus, these are errors being reported and handled properly; they are NOT crashes.  They certainly are unusual, and demand explanation, because they may very well be related to whatever the true issue is, but disk errors themselves almost never cause crashes.

 

What IS interesting is what caused the DMA failures, and unfortunately we may be back to a failing motherboard or controller, an overheated northbridge chipset, or a power issue as the real problem.  Can you provide some detail about your hardware, especially the motherboard and PSU?

 

By the way, that syslog happens to be the same one you included earlier with the SMART reports.

Link to comment

Oops, I must have attached the wrong one.  I renamed this one to differentiate, in case it's the browser caching it.  The motherboard is an Asus M2NPV-VM running on an Enermax power supply, 365 watts if I remember right: 32A on the 5V rail and 17A on the 12V rail.  Plenty of power if you ask me (a multimeter showed 12V+ and 5V+ while all drives were operating).  It has three 200GB IDE drives, two on the onboard IDE controller and another on an UltraATA/100 PCI controller, I think, plus three 500GB SATA drives on the onboard controller.  All six drives have a fan blowing directly at them from the front, and I have a case fan and opened all the slot covers for the expansion ports, so the airflow should be front to back, "parallel".  I checked the base of the northbridge heatsink using an IR temp probe and it showed a max of 110 degrees F, which isn't too hot, and it isn't dusty.

 

After finding out it's hda (disk1), I again yanked the power off that drive and booted up (I know, caveman spark plug method).  The system came up without incident, no crash for some time.  Though I did not start the array with the missing disk, so I can't say for certain what would have happened if I had.  Just for the heck of it, I replaced the IDE cable with a brand new one (80 conductors) and booted it up again.  All drives report green, with the following message:

 

 

Check will start a Parity-Check.

(Last checked on 5/20/2009 8:54:03 PM, finding 2 errors.)

 

That was from the previous Trust My Array session, which kicked off an automatic check that did not finish.  So I started another check, and this is the output:

 

Total size: 488,386,552 KB
Current position: 85,956 (0.0%)
Estimated speed: 85,956 KB/sec
Estimated finish: 94.6 minutes
Sync errors: 2

 

The parity check is still running so far.  I'll let it finish.

Link to comment

Parity check finished:

 

Check will start a Parity-Check.

(Last checked on 5/20/2009 11:28:51 PM, finding 3396 errors.)

 

Temperature  Size         Free  Reads      Writes  Errors
26°C         488,386,552  -     1,224,509  100     951

 

 

Should I be worried about the errors?  I guess another parity check would tell me if there are sector errors on the drives.  But how can I be sure that these changes to parity aren't due to my data disks having corrupted data, perhaps from the many hard reboots?

Link to comment


There are no guarantees.  However, the RFS file system used by unRAID on data disks is robust and handles unexpected power failures quite well.  Experience on these forums has been overwhelmingly positive; I cannot remember a case of a user reporting data corruption after a hard reboot.  But PARITY has no file system, and is much more likely to have a problem when not properly shut down.  This is normally the reason for sync errors like you are seeing.  I have had anywhere from a couple of parity errors up to several hundred after one hard reset.

 

Given that RFS is well protected from power failures and parity is not, it is not surprising that unRAID will update parity to make it match the data disks when sync errors are found.  99.9% of the time that is what you want to happen.  A second parity check should result in 0 parity errors (if not, you have other problems).

 

More dangerous is a situation in which a data disk fails, causing a lockup and hence a hard reboot.  In that case you have to use the parity disk, which you know isn't perfect, to perform a data disk rebuild.  But even in these situations, experience has been very positive.

 

Despite the overwhelmingly positive anecdotal evidence, however, some of us would like a check to PROVE there was no corruption.  unRAID does not help here.  I use par2 to calculate recovery blocks on my disks that are full of static content.  Even a very small amount of space in par2 blocks can detect ANY AND ALL errors at the individual file level.  It can also correct errors, but correction is limited by the number and size of par2 blocks you build and save.  My analysis is that 1G would be more than enough to recover from something subtle after a rebuild or hard reboot.
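As a rough example of what that looks like in practice (a sketch only; the paths and the 5% redundancy figure are just illustrative, not a recommendation):

# Build PAR2 recovery blocks for a directory of static files, ~5% redundancy.
par2 create -r5 /mnt/disk1/archive/recovery.par2 /mnt/disk1/archive/*
# Later, after a rebuild or hard reboot, verify the files against the blocks:
par2 verify /mnt/disk1/archive/recovery.par2
# If anything is damaged and enough recovery blocks exist, repair it:
par2 repair /mnt/disk1/archive/recovery.par2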

Link to comment

Please forgive me for not reading your post more closely.

 

My prior post is 100% true, but is discussing parity SYNC errors (of which you have 3396) and not disk errors (of which you have 951 on your parity disk).

 

Disk errors can mean different things.  They can be caused by something subtle like a bad or loose cable/backplane, or they can mean more serious problems with the disk itself.

 

I suggest that you re-run the smart report on your parity disk.

 

If the number of reallocated sectors / pending sectors is increasing, the disk is going bad and needs to be replaced.

 

If not, cabling is called into question.  I'd recommend changing the data cable to the parity disk and bypassing backplanes if possible.

 

The good news is that these problems are affecting just the parity disk, and not your data disks.  Once you get the parity disk fixed/replaced you should be able to rebuild parity with little difficulty.  But if you were to have a data disk failure, I would have real concerns about your ability to rebuild it right now.
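To see whether those counts are climbing, save dated SMART reports and compare them over a day or two, roughly like this (the parity disk's device name and the file names are assumptions; substitute your own):

# Save a dated SMART report for the parity drive (device name is an example).
smartctl -a -d ata /dev/sda > /boot/smart-parity-$(date +%Y%m%d).txt
# Compare today's report against an earlier one to spot growing counts:
diff /boot/smart-parity-20090520.txt /boot/smart-parity-20090521.txt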

 

Link to comment

All very helpful information.  Thank you both for all your help.  Another parity check is planned for tonight to see if I get 0 sync errors.  I was wondering why there were disk errors; I thought it was just a different way of tallying the sync errors.  From the looks of it, it's the parity disk on its way out.  As suggested, another SMART report will hopefully tell me if that's true.  I've got no spare cables at the moment, so I'll have to settle for reseating the cable for now.

 

It's good to hear my data disks are probably OK, but it's surprising to hear the parity disk might be bad, since it's still relatively new if I remember correctly.  I guess that's why we have unRAID; my main reason for getting unRAID was the lack of reliable storage.

Link to comment


Run the smartctl report and post the results.  If the disk is unraveling, there is no value in re-running the parity check.  The smartctl report will tell us whether the cable or the drive is the likely problem.

Link to comment

Doh, I did not copy the syslog when I did the parity check.  That was dumb.  But here's the SMART report for the parity drive.  The bold entries are the interesting ones: Reported_Uncorrect is up to 56 from 0, while Current_Pending_Sector and Offline_Uncorrectable went from 10 to 0.

 

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   100   006    Pre-fail  Always       -       76607144
  3 Spin_Up_Time            0x0003   094   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       380
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   068   060   030    Pre-fail  Always       -       6830037
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       558
 10 Spin_Retry_Count        0x0013   100   099   097    Pre-fail  Always       -       26
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       242
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   044   044   000    Old_age   Always       -       56
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   051   045    Old_age   Always       -       26 (Lifetime Min/Max 24/26)
194 Temperature_Celsius     0x0022   026   049   000    Old_age   Always       -       26 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   051   032   000    Old_age   Always       -       76607144
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

Link to comment
