Problem with Parity Check automatically correcting!

PeterB · December 27, 2012

I'm running V5-rc8a.

I just noticed that one of my drives had been disabled with 24 errors reported on the unRAID interface.

I looked at the SMART report for that drive and there was nothing untoward.

I then tried to stop the array, intending to reboot and rebuild the data on Disk1. At this point the standard emhttp interface became unresponsive.

I then pushed the on/off button, which would normally invoke a graceful shutdown. This caused some disk activity, but didn't power down the system.

I then went to the telnet interface and invoked reboot.

When the machine came back up it automatically went into a parity check - by the time I gained control and cancelled the parity check, it had already found and, presumably, 'corrected' 3152 errors.

My understanding is that once the disk was disabled, the drive was in simulation mode: no further writes occurred on that drive, and any write intended for it simply updated the parity. Now the parity check will have destroyed the parity data which, I believe, would have been healthy, and I can no longer recover my disk1 data.
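The concern above can be illustrated with a minimal sketch, assuming plain single-parity XOR across one stripe of three data disks. The byte values and function names are made up for illustration and are not unRAID's actual code. The sketch shows a write to an emulated (disabled) disk landing only in parity, and a later correcting check that trusts the stale physical disk overwriting that parity.

```python
from functools import reduce

def xor_all(values):
    """XOR a sequence of integers together (single-parity calculation)."""
    return reduce(lambda a, b: a ^ b, values, 0)

def emulated_read(data, parity, disabled):
    """Reconstruct the disabled disk's byte from parity and the other disks."""
    others = [b for i, b in enumerate(data) if i != disabled]
    return parity ^ xor_all(others)

def emulated_write(data, disabled, new_byte):
    """Write to a disabled disk: return the new parity; the physical disk is untouched."""
    others = [b for i, b in enumerate(data) if i != disabled]
    return new_byte ^ xor_all(others)

# One stripe across three data disks; index 0 plays the role of disk1.
data = [0x11, 0x22, 0x33]
parity = xor_all(data)
disabled = 0

# While disk1 is disabled, a write of 0x99 to it changes parity only.
parity = emulated_write(data, disabled, 0x99)
before = emulated_read(data, parity, disabled)  # the newly written byte

# A correcting check that treats the stale physical disk as valid
# recomputes parity from the disks and overwrites the good value.
parity = xor_all(data)
after = emulated_read(data, parity, disabled)   # the stale on-disk byte
```

Under these assumptions, `before` holds the byte written while the disk was emulated and `after` holds the old on-disk byte, which is why the emulated writes cannot be recovered once parity has been "corrected".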

What can I recover from this situation, and how?

I believe that I have to rebuild the parity, which will abandon any changes on disk1 since unRAID disabled it. It just so happens that I have a new drive, already pre-cleared but not yet added to the array, so it would be possible to do a rebuild without destroying my disk1 contents - however, I believe that I have nothing to gain from doing that ... I'm certain that the parity data is now wrong, and I can no longer trust it.

Can I tell what data is lost?

What did I do wrong - should I have followed an alternative course of action?

And, finally, a plea to Tom - Please return to the functionality of one/some of the V5 betas, where the 'Correct' action was not the default for the parity check.

It cannot be assumed that the parity data is always at fault when the parity is wrong. Clearly, in this case, the parity data stood a much better chance of being correct than the disk1 data. I, as the intelligent user, want to be able to choose which data I trust.

Edit:

Just to add a little bit of information:

This is the status of the parity check, as reported by unMenu, at the point that I stopped it:

Total Size 1,953,514,552 KB

Current 1,150,052 (0.1%)

Speed 21,905 KB/sec

Finish 1480 minutes

Sync Errors 3,152 (corrected)

Johnm · December 27, 2012

And, finally, a plea to Tom - Please return to the functionality of one/some of the V5 betas, where the 'Correct' action was not the default for the parity check.

+1

I have mentioned this before.

I want unraid to TELL ME there is a problem. then ask what I want to do.

not to just start fixing what it thinks is wrong..

PS. you are the second or third person to have this issue in the last month.

PeterB · December 27, 2012

I want unraid to TELL ME there is a problem. then ask what I want to do.

not to just start fixing what it thinks is wrong.

The problem is that there is no basis for unRAID to 'think' - it just blithely assumes that it is the parity which is wrong.

I guess that when/if we get a second parity drive, it will be possible for unRAID to make a reasoned guess at which drive is incorrect, but only in the case of a single drive error.
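To show why a second parity drive would help, here is a minimal sketch assuming a RAID-6 style P+Q scheme over GF(2^8) with the commonly used polynomial 0x11d. This is an assumption for illustration only; nothing in this thread says how unRAID would implement dual parity. With P alone a mismatch only says "something is wrong"; with P and Q together, the ratio of the two syndromes points at the drive, provided exactly one drive is wrong.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def gf_pow2(i):
    """Return 2**i in GF(2^8)."""
    value = 1
    for _ in range(i):
        value = gf_mul(value, 2)
    return value

def parities(data):
    """Compute P (plain XOR) and Q (XOR weighted by 2**index) for one stripe."""
    p = q = 0
    for i, byte in enumerate(data):
        p ^= byte
        q ^= gf_mul(gf_pow2(i), byte)
    return p, q

def locate_single_error(data, p, q):
    """Return the bad data-disk index, 'P' or 'Q' for a bad parity byte,
    None if everything is consistent, or 'unknown' if no single drive explains it."""
    cur_p, cur_q = parities(data)
    sp, sq = p ^ cur_p, q ^ cur_q  # syndromes
    if sp == 0 and sq == 0:
        return None
    if sq == 0:
        return "P"
    if sp == 0:
        return "Q"
    for i in range(len(data)):
        if gf_mul(gf_pow2(i), sp) == sq:
            return i
    return "unknown"

good = [0x11, 0x22, 0x33, 0x44]
p, q = parities(good)

bad = list(good)
bad[2] ^= 0x5A  # silently corrupt the third data disk
found = locate_single_error(bad, p, q)
```

Here `found` is the index of the corrupted disk. If two drives were wrong at once, the syndromes would in general point at the wrong drive or at none, which matches the "only in the case of a single drive error" caveat.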

PS. you are the second or third person to have this issue in the last month.

There's no real comfort in knowing that!

Joe L. · December 27, 2012

And, finally, a plea to Tom - Please return to the functionality of one/some of the V5 betas, where the 'Correct' action was not the default for the parity check.

+1

I have mentioned this before.

I want unraid to TELL ME there is a problem. then ask what I want to do.

not to just start fixing what it thinks is wrong..

PS. you are the second or third person to have this issue in the last month.

+1

JonathanM · December 27, 2012

And, finally, a plea to Tom - Please return to the functionality of one/some of the V5 betas, where the 'Correct' action was not the default for the parity check.

+1

I have mentioned this before.

I want unraid to TELL ME there is a problem. then ask what I want to do.

not to just start fixing what it thinks is wrong..

PS. you are the second or third person to have this issue in the last month.

+1

+1, Been bitten by it a couple years ago.

dgaschk · December 27, 2012

I agree this issue needs to be fixed.

As a work-around would it work to disconnect the parity drive after a hard shutdown during a disk rebuild?

PeterB · January 3, 2013

Okay, apart from supporting my plea for a 'non-correcting' default, no one has actually commented on checking/recovering/securing my data.

I restarted the array, and it has been running perfectly well since December 27. There have been no further errors reported on disk1 and I have not become aware of any missing data, apart from one torrent download which I checked with Transmission/verify, before completing the download. Luckily, that drive is one of the fuller ones on my server, so data written to user shares tends to go to other drives.

Analysing the events, my understanding is this:

From the point that the drive errors caused unRAID to stop using disk1, all writes to that drive went only to the parity drive.

The drive may have been disabled part way through writing a set of linked blocks, so the structure is suspect (unless ReiserFS can actually protect against this).

The (probably good) parity data, which should have allowed me to fully recover, was destroyed by the parity check, and cannot be used.

So, any data written to disk1 between it being disabled and the system restart is lost.

The disk structure may be compromised and should be checked/repaired with Reiserfsck.

Parity should be rebuilt.

Any comments/precautions?

limetech · January 3, 2013

PeterB:

What made you think the disk was disabled? If it was truly disabled then it would have appeared 'disabled' when you rebooted the server.

The reasoning behind "correcting parity check upon restart after unclean shutdown" is this. This state of detecting a restart following an unclean shutdown only occurs when there's been, well, an unclean shutdown - that is, a crash, or a power failure, or some other case where the server is rebooted while the array is started and all the disks are mounted and there's possibly outstanding disk i/o.

Let's consider this case of outstanding writes at the time of a hard reset. In this case the data on at least one of the disks is "incomplete" - either file data or metadata or both, as well as possibly the parity data. So this particular disk will have some corruption which, 99% of the time, a subsequent reboot will fix because reiserfs replays its journaled transactions. But this does not fix the parity. So we want to start up a parity check, and write corrected parity, as soon as possible, because if some other disk fails before parity is corrected, we will not be able to completely rebuild it.

So, following an unclean shutdown, provided server comes up with no missing/disabled disks, I would say you always want to do a correcting parity check, and this is what the code does.
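The policy described above can be sketched roughly as follows. This is a minimal illustration, not unRAID's code: the state strings, function names, and the choice to return `None` when a disk is missing or disabled are all assumptions, since the post only says what happens when every disk comes up present and enabled.

```python
from functools import reduce

def stripe_parity(stripe):
    """XOR of the data bytes at one position across all data disks."""
    return reduce(lambda a, b: a ^ b, stripe, 0)

def parity_check(stripes, parity, correct):
    """Count sync errors; rewrite parity in place only when correct=True."""
    errors = 0
    for n, stripe in enumerate(stripes):
        expected = stripe_parity(stripe)
        if parity[n] != expected:
            errors += 1
            if correct:
                parity[n] = expected
    return errors

def check_mode_after_unclean_shutdown(disk_states):
    """Hypothetical policy: correct parity only if every disk is present and enabled."""
    if any(state != "ok" for state in disk_states.values()):
        return None  # assumption: do not touch parity while a disk is emulated
    return "correcting"

stripes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
parity = [0, 7, 0xFF]  # last entry is deliberately wrong (should be 6)

read_only_errors = parity_check(stripes, parity, correct=False)
parity_after_read_only = list(parity)

correcting_errors = parity_check(stripes, parity, correct=True)
recheck_errors = parity_check(stripes, parity, correct=False)

mode_all_ok = check_mode_after_unclean_shutdown({"disk1": "ok", "disk2": "ok"})
mode_disabled = check_mode_after_unclean_shutdown({"disk1": "disabled", "disk2": "ok"})
```

The important point in this sketch is that the decision depends entirely on the recorded disk states: if a disabled disk is wrongly reported as enabled at startup, the correcting path is taken and parity is rewritten from the stale disk.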

Also: as you know, the only time a disk can get 'disabled' is when a write to that disk fails. In this case, as part of pretty low level error recovery, the 'super.dat' file is updated to reflect the disabled disk right in-line with error recovery. This minimizes the amount of subsequent I/O to that disk. To implement this, the unRaid driver opens and writes the file on the flash. So if your flash is bad, I guess it's possible that the update might not happen, but there are other checks in place (like a crc check) to guard against this.
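As a rough illustration of how a CRC can guard a small state record, here is a sketch using a made-up layout: one status byte per disk followed by a CRC32 of those bytes. The real 'super.dat' format is not described in this thread, so the layout and function names here are purely hypothetical.

```python
import struct
import zlib

def pack_state(disk_disabled):
    """Serialize per-disk disabled flags and append a little-endian CRC32."""
    payload = bytes(1 if flag else 0 for flag in disk_disabled)
    return payload + struct.pack("<I", zlib.crc32(payload))

def unpack_state(record):
    """Verify the CRC32 and return the flags; raise ValueError on mismatch."""
    payload = record[:-4]
    (stored_crc,) = struct.unpack("<I", record[-4:])
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("state record failed CRC check")
    return [byte == 1 for byte in payload]

record = pack_state([False, True, False])  # second disk marked disabled
flags = unpack_state(record)

# Flip one bit in the first byte to simulate a bad write to the flash.
corrupted = bytes([record[0] ^ 0x01]) + record[1:]
try:
    unpack_state(corrupted)
    corruption_detected = False
except ValueError:
    corruption_detected = True
```

A check like this detects a record that was damaged on the flash. It cannot detect a record that was simply never updated, because the old record still carries a valid CRC.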

PeterB · January 5, 2013

Hi Tom,

PeterB:

What made you think the disk was disabled?

The drive was 'red-balled' in the emhttp interface, and showed 24 errors. In unMENU, it had the red text DISK_DSBL (I do not remember the exact text) against it. There were errors recorded against that drive in the log.

Ooops, I assumed that the failure to shut down cleanly meant that my syslog had not been preserved. However, I've just looked on the flash drive and it is there! I'll attach it to this post. The drive errors occurred on Dec 25 at around 05:58, and seem to be associated with me moving the new drive (which had finished preclearing overnight) between hot-swap bays.

At shutdown, the log shows that disk1 is disabled. The log also records the smart report and it can be seen that there is nothing untoward on that drive.

If it was truly disabled then it would have appeared 'disabled' when you rebooted the server.

Clearly, when the system restarted, the disk came up enabled, so either:

The disk was not disabled, despite the evidence on the web interfaces.

or:

The fact that the system locked up on stopping the array prevented the 'disable' status being recorded.

and/or:

There is a bug in rc8a.

I also attach the syslog for the subsequent session, which shows disk1 being mounted when the array comes up.

....

So, following an unclean shutdown, provided server comes up with no missing/disabled disks, I would say you always want to do a correcting parity check, and this is what the code does.

....

Okay, thanks for the explanation. I guess that is logical, but it would seem that, on this occasion, it did not work as expected.

It is clear, from the logs, that the disk was disabled when the system went down. It is also clear that on the restart, the disk was mounted as though nothing was wrong. We can also deduce, from the fact that the syslog was written as the system shut down, that the flash drive was still accessible and writable.

Edit to add:

Further up this thread, Johnm suggests that other people have had similar problems. Perhaps this does, indeed, point to some weakness in the current v5.0 code, which allows this to happen.

Syslogs.zip

limetech · January 10, 2013

PeterB:

Due to your fantastic explanation and logs, yes there is a bug which was introduced in -rc5 where recording a disk disabled event to the superblock could get missed if there's an unclean shutdown. Fixed in -rc9b.
