Advice on 2 failing disks



I have 2 disks failing.  One is full of data, the other is empty.  See screenshot.  I would like to remove the empty bad drive from the array without causing parity to need to be rebuilt - is that possible?  The goal is to be able to rebuild the drive that has data without the empty drive causing an issue that makes the rebuild fail.  Any suggestions?

I guess I should mention I had a power failure, and when unRAID booted back up and started a parity check, these two drives started racking up errors like mad.  The parity check is only at 30% and says it will take 72 days to complete... so I feel that at least one drive failing out of the array is imminent.

[Screenshot attached: Screen_Shot_2017-02-07_at_11_10.50_PM.png]


I have 4 failing disks - the disks are old junk disks and a few read failures don't concern me; unRAID handles that nicely.  But obviously 500K failures on a disk that's empty is no bueno - I want to eject that from the array, and replace the one that has data and has 30K errors.  The log leads me to believe it's a media failure and not a SATA cable.  I had to break up the diagnostics into 2 files because they were over the 320KB forum attachment limit.


Of the disks showing read errors, 3 appear to be clearly bad; disk18 may still be fully readable.

 

Disk12:

 

Device Model:     WDC WD20EADS-00S2B0
Serial Number:    WD-WCAVY0135415
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       314
197 Current_Pending_Sector  0x0032   198   196   000    Old_age   Always       -       729
198 Offline_Uncorrectable   0x0030   198   198   000    Old_age   Offline      -       737

 

Disk15:

 

Model Family:     Western Digital Caviar Green
Device Model:     WDC WD20EADS-55R6B0
Serial Number:    WD-WCAVY1357136
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       3

 

Disk18:

 

Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY0324118
  5 Reallocated_Sector_Ct   0x0033   195   195   140    Pre-fail  Always       -       33
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0

 

Disk21:

 

Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY0772324
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       770
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       397
198 Offline_Uncorrectable   0x0030   200   196   000    Old_age   Offline      -       155

 

A good rebuild is impossible; your best bet would be to try to copy all the data you can from those disks.

 

PS: I haven't checked SMART for the other disks yet; it's possible there are more.
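
If you want to check the rest quickly, something along these lines should dump the relevant attributes for every drive the system sees (device names will differ on your box, and some controllers need an extra -d option for smartctl):

for d in /dev/sd[a-z]; do
  echo "== $d =="
  smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
done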


I have 4 failing disks - the disks are old junk disks and a few read failures don't concern me; unRAID handles that nicely.

I'm not sure what you mean by read failures being handled. I guess you are OK with losing the data and have good backups?

 

After a bad experience early on, I no longer leave any questionable disks in the array. A known bad disk jeopardizes unRAID's ability to recover the data on any other failed disk.

 

 


I have 4 failing disks - the disks are old junk disks and a few read failures don't concern me; unRAID handles that nicely.

I'm not sure what you mean by read failures being handled. I guess you are OK with losing the data and have good backups?

 

After a bad experience early on, I no longer leave any questionable disks in the array. A known bad disk jeopardizes unRAID's ability to recover the data on any other failed disk.

 

I mean: From http://lime-technology.com/wiki/index.php/Troubleshooting

If your array has been running fine for days/weeks/months/years and suddenly you notice a non-zero value in the error column of the web interface, what does that mean? Should I be worried?

Occasionally unRAID will encounter a READ error (not a WRITE error) on a disk. When this happens, unRAID will read the corresponding sector contents of all the other disks + parity to compute the data it was unable to read from the source. It will then WRITE that data back to the source drive. Without going into the technical details, this allows the source drive to fix the bad sector so next time, a read of that sector will be fine. Although this will be reported as an "error", the error has actually been corrected already. This is one of the best and least understood features of unRAID!

 

Yes - I have offline offsite backups and I'm not terribly concerned with the data.  I use old disks until they are marked dead.  My critical data lives elsewhere.

But with the intent of kicking this array down the road another day and not resorting to backups, I can conclude the best course of action with the highest probability of success is to rsync the data on the failing full drive to one of the existing empty disks, then drop both the failing full and failing empty drives out of the array.
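
Something along these lines is what I have in mind - the disk numbers here are just placeholders for the failing full disk and the empty target disk, and the trailing slash on the source puts the contents directly into the target disk's root:

rsync -av --progress /mnt/diskX/ /mnt/diskY/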


Occasionally unRAID will encounter a READ error (not a WRITE error) on a disk. When this happens, unRAID will read the corresponding sector contents of all the other disks + parity to compute the data it was unable to read from the source.

This passage assumes ALL the other disks are OK at that sector. With multiple bad disks, there is a good chance the computed data will be wrong, corrupting the written data.
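
With single parity, the reconstruction is just an XOR of the corresponding bytes on every other data disk plus parity, so a wrong byte from any one of them silently poisons the result. A toy sketch with made-up byte values from three data disks:

# parity byte = XOR of the data bytes on all data disks
parity=$(( 0xA5 ^ 0x3C ^ 0xF0 ))
printf 'parity    = 0x%02X\n' "$parity"
# the unreadable byte (0x3C) is rebuilt by XORing the surviving data bytes with parity
printf 'recovered = 0x%02X\n' $(( 0xA5 ^ 0xF0 ^ parity ))
# if one of the surviving disks returns a corrupted byte instead of its real data,
# the "recovered" value comes out wrong and nothing in the math can detect it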

 

But with the intent of kicking this array down the road another day and not resorting to backups, I can conclude the best course of action with the highest probability of success is to rsync the data on the failing full drive to one of the existing empty disks, then drop both the failing full and failing empty drives out of the array.

Sounds reasonable. Removing all drives with known issues gives you the best chance of a successful recovery should one of your "good" drives die a sudden death.

This passage assumes ALL the other disks are OK at that sector. With multiple bad disks, there is a good chance the computed data will be wrong, corrupting the written data.

 

This is true, there is a good chance.  But I have a cron job that runs on the first of the month that does

find /mnt/user -type f -print0 | xargs -0 md5sum > "/mnt/user/scripts/md5sums.$(date +%F_%R)"

and I keep a file of hashes from my offsite backups - both as an index, and as a reference to compare against if I encounter a file I suspect may have become corrupt.
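
When I do suspect a file, something like this checks it against the saved list (the file path and the dated filename here are just made-up examples):

grep '/mnt/user/Movies/example.mkv' /mnt/user/scripts/md5sums.2017-02-01_00:00 | md5sum -c -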

