Advice on 2 failing disks



I have 2 disks failing.  One is full of data, the other is empty.  See screenshot.  I would like to remove the empty bad drive from the array without causing parity to need to be rebuilt - is that possible?  The goal is to be able to rebuild the drive that has data without the empty drive causing an issue that makes the rebuild fail.  Any suggestions?

I guess I should mention I had a power failure, and when unRAID booted back up and started a parity check, these two drives started racking up errors like mad.  The parity check is only at 30% and says it will take 72 days to complete... so I feel that at least one drive failing out of the array is imminent.

[Screenshot attached: Screen_Shot_2017-02-07_at_11_10.50_PM.png]


I have 4 failing disks - the disks are old junk disks and a few read failures don't concern me; unRAID handles that nicely.  But obviously 500K failures on a disk that's empty is no bueno - I want to eject that from the array, and replace the one that has data and has 30K errors.  The log leads me to believe it's a media failure and not a SATA cable.  I had to break up the diagnostics into 2 files because they were over the 320KB forum attachment limit.


Of the disks showing read errors, 3 appear to be clearly bad; disk18 may still be fully readable.

 

Disk12:

 

Device Model:     WDC WD20EADS-00S2B0
Serial Number:    WD-WCAVY0135415
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       314
197 Current_Pending_Sector  0x0032   198   196   000    Old_age   Always       -       729
198 Offline_Uncorrectable   0x0030   198   198   000    Old_age   Offline      -       737

 

Disk15:

 

Model Family:     Western Digital Caviar Green
Device Model:     WDC WD20EADS-55R6B0
Serial Number:    WD-WCAVY1357136
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       3

 

Disk18:

 

Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY0324118
  5 Reallocated_Sector_Ct   0x0033   195   195   140    Pre-fail  Always       -       33
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

 

Disk21:

 

Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY0772324
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       770
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       397
198 Offline_Uncorrectable   0x0030   200   196   000    Old_age   Offline      -       155

 

A good rebuild is impossible; your best bet would be to try to copy all the data you can from those disks.

 

PS: I haven't checked SMART for the other disks yet; it's possible there are more.
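
If you want to check the rest quickly, something along these lines should dump the relevant attributes for every drive the system sees (device names will differ on your box, and some controllers need an extra -d option for smartctl):

for d in /dev/sd[a-z]; do
  echo "== $d =="
  smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done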


I have 4 failing disks - the disks are old junk disks and a few read failures don't concern me; unRAID handles that nicely.

I'm not sure what you mean by read failures being handled. I guess you are OK with losing the data and have good backups?

 

After a bad experience early on, I no longer leave any questionable disks in the array. A known bad disk jeopardizes unRAID's ability to recover the data on any other failed disk.

 

 


I have 4 failing disks - the disks are old junk disks and a few read failures don't concern me; unRAID handles that nicely.

I'm not sure what you mean by read failures being handled. I guess you are OK with losing the data and have good backups?

 

After a bad experience early on, I no longer leave any questionable disks in the array. A known bad disk jeopardizes unRAID's ability to recover the data on any other failed disk.

 

I mean: From http://lime-technology.com/wiki/index.php/Troubleshooting

If your array has been running fine for days/weeks/months/years and suddenly you notice a non-zero value in the error column of the web interface, what does that mean? Should I be worried?

Occasionally unRAID will encounter a READ error (not a WRITE error) on a disk. When this happens, unRAID will read the corresponding sector contents of all the other disks + parity to compute the data it was unable to read from the source. It will then WRITE that data back to the source drive. Without going into the technical details, this allows the source drive to fix the bad sector so next time, a read of that sector will be fine. Although this will be reported as an "error", the error has actually been corrected already. This is one of the best and least understood features of unRAID!

 

Yes - I have offline offsite backups and I'm not terribly concerned with the data.  I use old disks until they are marked dead.  My critical data lives elsewhere.

But with the intent of kicking this array down the road another day and not resorting to backups, I can conclude the best course of action with the highest probability of success is to rsync the data on the failing full drive to one of the existing empty disks, then drop both the failing full and failing empty drives out of the array.
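
Something along these lines is what I have in mind - the disk numbers here are just placeholders for the failing full disk and the empty target disk, and the trailing slash on the source puts the contents directly into the target disk's root:

rsync -av --progress /mnt/diskX/ /mnt/diskY/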


Occasionally unRAID will encounter a READ error (not a WRITE error) on a disk. When this happens, unRAID will read the corresponding sector contents of all the other disks + parity to compute the data it was unable to read from the source.

This passage assumes ALL the other disks are OK at that sector. With multiple bad disks, there is a good chance the computed data will be wrong, corrupting the written data.
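
With single parity, the reconstruction is just an XOR of the corresponding bytes on every other data disk plus parity, so a wrong byte from any one of them silently poisons the result. A toy sketch with made-up byte values from three data disks:

# parity byte = XOR of the data bytes on all data disks
parity=$(( 0xA5 ^ 0x3C ^ 0xF0 ))
printf 'parity    = 0x%02X\n' "$parity"
# the unreadable byte (0x3C) is rebuilt by XORing the surviving data bytes with parity
printf 'recovered = 0x%02X\n' $(( 0xA5 ^ 0xF0 ^ parity ))
# if one of the surviving disks returns a corrupted byte instead of its real data,
# the "recovered" value comes out wrong and nothing in the math can detect it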

 

But with the intent of kicking this array down the road another day and not resorting to backups, I can conclude the best course of action with the highest probability of success is to rsync the data on the failing full drive to one of the existing empty disks, then drop both the failing full and failing empty drives out of the array.

Sounds reasonable. Removing all drives with known issues gives you the best chance of a successful recovery should one of your "good" drives die a sudden death.

This passage assumes ALL the other disks are OK at that sector. With multiple bad disks, there is a good chance the computed data will be wrong, corrupting the written data.

 

This is true, there is a good chance.  But I have a cron job that runs on the first of the month that does

find /mnt/user -type f -print0 | xargs -0 md5sum > "/mnt/user/scripts/md5sums.$(date +%F_%R)"

and I keep a file of hashes from my offsite backups - both as an index, and as a reference to compare against if I encounter a file I suspect may have become corrupt.
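
When I do suspect a file, something like this checks it against the saved list (the file path and the dated filename here are just made-up examples):

grep '/mnt/user/Movies/example.mkv' /mnt/user/scripts/md5sums.2017-02-01_00:00 | md5sum -c -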

