
Disk disabled / failed?


steve1977


I just upgraded and shortly thereafter one disk was disabled and now shows "red". The disk is relatively new. It is of course possible that this is just a hardware failure / faulty disk, but I have seen several topics on this forum about wrongly declared disk failures with the more recent Unraid releases.

 

Any chance someone can look at my diagnostic files and see whether anything indicates the disk is really broken? Thanks in advance!!!

tower-diagnostics-20150911-2312.zip


One more related question: I am now copying the content from the "faulty disk" to another disk in the array. Is this a smart thing to do, or am I asking for trouble? If I understand Unraid correctly, I am technically not copying from the faulty disk (though it looks that way to me), but from the parity disk (which emulates the faulty disk) to the array. If my understanding is correct, this should be ok.

 

Thoughts?


unRAID emulates the disk by reading ALL the other disks including parity and from that it is able to calculate the data for the missing disk. If you think about it, it is obvious that the parity disk cannot possibly contain all the data for the failed disk. How would it know which disk was going to fail? See the wiki for a better understanding of how parity actually works. It's pretty simple, and if you get that then a lot of things about unRAID will make sense.

 

So, you are not getting the data from parity, you are actually making unRAID read all of the drives at once. This is what it does when it rebuilds a disk. And that is the usual way to deal with your situation, rebuilding the disk. In fact, even if you manage to copy all the data from the emulated disk onto other disks in the array, you are still going to have to rebuild that disk, or else set a New Config without it and rebuild parity.
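To make the parity idea concrete, here is a minimal sketch (plain Python with example byte strings, nothing Unraid-specific) of how single-parity reconstruction works: parity is the byte-wise XOR of all the data disks, so an emulated disk is recovered by XORing parity together with every surviving data disk, which is why every drive has to be read.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Example: three data disks and the parity computed from them.
disk1 = b"\x01\x02\x03"
disk2 = b"\x10\x20\x30"
disk3 = b"\x0a\x0b\x0c"
parity = xor_blocks([disk1, disk2, disk3])

# If disk2 "fails", its contents are emulated by XORing parity with
# ALL the surviving data disks -- not by reading parity alone.
emulated_disk2 = xor_blocks([parity, disk1, disk3])
assert emulated_disk2 == disk2
```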


Got it, so copying from the "faulty" disk is doable and not worse than rebuilding it? Or is there anything destructive about it, since parity is changing due to the copy activity?

 

Also, any indication from my diagnostic files whether the disk is really faulty or "just" some other issue (as mentioned in some of the other threads)?


Disk is fine.  The SAS card or SAS card driver is not, but I don't know what is wrong.  This is rather coincidental, as this is the second time today I've dealt with this same problem!  However yours is on v6.1.0 and the other is on v5.0.5, hard to see a connection, apart from the mpt2sas driver and the card it's managing.  His thread is here, and you should read through it, especially my analysis of what actually happened.  Working from the syslog only at first, I had the wrong idea, but once I saw his screen pic, I had to come up with a different explanation.  Yours appears very similar, and if you look at your Disk 7, you should see that it too has changed drive symbols.  It started as sdd, but now is sds, attached (according to the syslog!) to the 9th SATA port (sd 1:0:8:0) on the card!

 

It didn't lose the drive quite the same way as Marcus.  His appeared very innocent, just a 'synchronization', but almost immediately it said 'removing handle' (which appears to be the way the SAS error handler indicates the drive being dropped).  Yours took much longer before that occurred, but it did occur, and then later it was re-discovered and hooked up to the 9th port and assigned sds as drive symbol.  I have never seen this kind of behavior before, so I have to classify it for now as a bug in the card or mpt2sas driver.


Thanks, we are indeed using the same card (M1015 flashed into IT mode). Do you suspect that the card is faulty and requires replacement? Or the cable to the drive?

 

In the Unraid UI, it still appears to be mounted as "sdd", so not sure about your reference to "sds".

 

I had some issues with the same drive two weeks ago. It showed "I/O errors" when accessed through the VM and then "disappeared". I didn't really change anything, but this was no longer an issue the next day.

 

Let me do the following. Copy all files to a new drive within the array. Then rebuild the whole array and upgrade to 6.1.2. Then resend diagnostic files.

 

Does this make sense?


Got it, so copying from the "faulty" disk is doable and not worse than rebuilding it?

Well, I would say that the more time you spend not rebuilding the disk the more at risk you are of another failure. Until the disk is rebuilt, your array doesn't have parity protection.
Or is there anything destructive about it, since parity is changing due to the copy activity?

Parity is changed by the copy activity, but that is not destructive. If it didn't keep updating parity, that would be destructive, because parity would be invalid and it would be impossible to rebuild the disk.
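On the parity-update point, here is a minimal sketch (generic single-parity arithmetic, not Unraid's actual code) of the read-modify-write rule: each write XORs the old data out of parity and the new data in, so a disk that is being emulated remains reconstructable afterwards.

```python
def update_parity(old_parity: int, old_data: int, new_data: int) -> int:
    # Read-modify-write: XOR the old data out of parity, XOR the new data in.
    return old_parity ^ old_data ^ new_data

# Two data "disks" (single example bytes) and their parity.
d1, d2 = 0x3C, 0x55
parity = d1 ^ d2

# Write new data to d1 while d2 is failed/emulated. Because parity is
# updated in step with the write, d2 can still be reconstructed afterwards,
# i.e. the ongoing parity updates are not destructive.
new_d1 = 0x5A
parity = update_parity(parity, d1, new_d1)
d1 = new_d1
assert parity ^ d1 == d2
```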

 

Copying all the data from an emulated disk to other disks in the array is not the usual course of action, rebuilding the failed disk is. I have seen people backup the data from an emulated disk to another system because they have some reason to not rebuild the disk, but continuing to write to the array with a failed disk is usually not recommended.

 


Thanks for your quick reply. You mentioned in your earlier post that the disk is actually not faulty, but is just wrongly seen that way due to a card or driver bug. So I would not need to change / rebuild the disk. I could just do a "new config", but this requires quite a lot of faith that the disk is really 100% ok (as "new config" would wipe the parity). So I was thinking that copying the data and only thereafter doing the "new config" would be "safer". And I cannot rebuild onto the existing disk anyway, can I?


Thanks, we are indeed using the same card (M1015 flashed into IT mode). Do you suspect that the card is faulty and requires replacement? Or the cable to the drive?

Nothing there tells me who's at fault, so could be the firmware on the card (might check for an update) or could be the mpt2sas driver module, or another lower level driver for the card.  It's not the cable, as it's not a communication problem.

 

In the Unraid UI, it still appears to be mounted as "sdd", so not sure about your reference to "sds".

Have you stopped the array yet?  Try that, and see what it says in the drop down for Disk 7.

 

Let me do the following. Copy all files to a new drive within the array. Then rebuild the whole array and upgrade to 6.1.2. Then resend diagnostic files.

You can do that, I suppose; it gives you another backup of the files on Disk 7, but the important thing is to rebuild Disk 7 in place.  Normal procedure would be to unassign Disk 7, start and stop the array, then re-assign Disk 7 and start the array, which will start the rebuild of Disk 7.


Ok, I still cannot see any reference to "sds" (in the pull-down, even after restarting). Also updated to 6.1.2. Attached new diagnostic files.

 

What is the mpt2sas driver module? HW or SW?

 

The copy process is taking forever (probably days), so maybe I should just go ahead and rebuild from parity. "Unassign Disk 7, start and stop the array, then re-assign Disk 7" also works with the same disk, right? Even if the rebuild fails, I could still rebuild to another disk, couldn't I? How long would you expect the rebuild to take? Shall I shut down the VM during this process?

tower-diagnostics-20150912-1100.zip

  • 4 weeks later...

Got a new drive and then the same thing happened. Yet again disk 7 (but a new disk).

 

The disk worked for a day or even a few days. Then it turned "red" and now I again need to rebuild.

 

Not impossible, but now very unlikely to be another faulty disk. Maybe the driver? Maybe my data card?

 

I'm OK with replacing the data card, but I thought the M1015 was already the best-supported one. Any advice?

  • 2 weeks later...

Any thoughts? Now going through the same problem for the third time. The sequence of events is always the same: I create an array, all works well for 3 days or so, then one disk goes bad (shows "red") and the content of this disk is emulated.

 

I changed the "faulty" disk and recreated the array. Unfortunately, the issue is always the same: working for 3 days and then one disk is marked faulty.

 

I am sending a diagnostic of the 3rd time later today, but I am sure that it is the same issue as the 2nd time (see my previous post).

 

This issue is really annoying as it basically prevents me from properly using Unraid...

 

Thanks in advance for any help or ideas you may have!!!


I mentioned before that everything works well for a few days after setting up the array and then one disk fails (while the disk is actually functional).

 

I now know why it takes a few days. It actually works until the first "parity check" kicks in.

 

How often is the parity check required, and any idea why the check leads to one disk turning "red"?


I promise not to hijack this thread, but I think I may be experiencing the same, or a related, issue. I have been running v5b11 smoothly since it was bleeding edge. Did a clean install to 6.1 and my disk 3 failed. I bought a new drive, precleared it for 3 cycles, and popped it in my disk 3 slot. The server rebuilt the drive as expected and then all of my other drives failed simultaneously... I rebooted everything and was able to get all the drives green for a while, but I cannot access my shares. I will probably create my own thread, but if you guys want any logs or hardware info from me for cross-referencing, just let me know!


You are experiencing an issue which seems to you to be similar to an issue another user has started a thread about. If that issue is not clearly related to a possible defect in unRAID itself, but might instead be something related to your particular hardware or configuration, then you should definitely start your own thread. It can only confuse things if we start trying to get more information from you in this other user's thread.