Errors (SOLVED)



Can anyone tell me what all these errors mean? I'm running 5.0 b14 with sabnzbd and sickbeard.

 

Jun  3 11:25:36 SERVER logger: #  * Extensive error-handling mechanism, mirroring OpenSSL's error codes (Errors)
Jun  3 11:37:35 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:37:35 SERVER kernel:          res 51/40:c7:84:c9:68/40:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:37:35 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:39:31 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:39:31 SERVER kernel:          res 51/40:df:2d:87:8e/40:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:39:31 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:41:15 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:41:15 SERVER kernel:          res 51/40:af:8c:e6:af/40:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:41:15 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:46:30 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:46:30 SERVER kernel:          res 51/40:9f:9b:b8:09/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:46:30 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:47:29 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:47:29 SERVER kernel:          res 51/40:4f:22:fd:0c/40:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:47:29 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:48:56 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:48:56 SERVER kernel:          res 51/40:b7:0d:54:20/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:48:56 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:49:43 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:49:43 SERVER kernel:          res 51/40:cf:9f:06:31/40:02:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:49:43 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:50:25 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:50:25 SERVER kernel:          res 51/40:00:83:2a:34/40:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:50:25 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:51:29 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:51:29 SERVER kernel:          res 51/40:6f:af:db:47/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:51:29 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:51:58 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:51:58 SERVER kernel:          res 51/40:4f:6a:c0:57/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:51:58 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:54:02 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:54:02 SERVER kernel:          res 51/40:4f:36:9f:6d/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:54:02 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:55:17 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:55:17 SERVER kernel:          res 51/40:00:03:b5:80/40:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:55:17 SERVER kernel: ata5.00: error: { UNC } (Errors)


ata5 is a WD 1 TB EADS:

Jun  3 11:25:19 SERVER kernel: ata5.00: ATA-8: WDC WD10EADS-00M2B0, 01.00A01, max UDMA/133

 

It looks like you have two in your system. They were designated as sdf and sdg in this boot-up:

Jun  3 11:25:19 SERVER kernel: scsi 4:0:0:0: Direct-Access    ATA      WDC WD10EADS-00M 01.0 PQ: 0 ANSI: 5

Jun  3 11:25:19 SERVER kernel: sd 4:0:0:0: [sdf] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)

Jun  3 11:25:19 SERVER kernel: scsi 5:0:0:0: Direct-Access    ATA      WDC WD10EADS-00M 01.0 PQ: 0 ANSI: 5

Jun  3 11:25:19 SERVER kernel: sd 5:0:0:0: [sdg] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)

 

I'm not sure if ata5 = scsi 5. If it does, then it's /dev/sdg.
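For what it's worth, on recent kernels you don't have to guess: the resolved sysfs path of a block device includes its libata port name, so ata5 can be matched to sdf or sdg directly. A rough Python sketch — the example path (including the PCI address) is purely hypothetical, for illustration:

```python
import re

def ata_port_from_path(sysfs_path):
    """Extract the libata port name ('ataN') from a resolved sysfs block path."""
    m = re.search(r"/(ata\d+)/", sysfs_path)
    return m.group(1) if m else None

# On a live system: ata_port_from_path(os.path.realpath("/sys/block/sdg"))
# Hypothetical resolved path, for illustration only:
example = ("/sys/devices/pci0000:00/0000:00:1f.2/ata5/host4/"
           "target4:0:0/4:0:0:0/block/sdg")
print(ata_port_from_path(example))  # -> ata5
```

Checking both sdf and sdg this way (or matching the serial number from smartctl -i) removes the ambiguity.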

 

 

 

 


Device Model:    WDC WD10EADS-00M2B0

Serial Number:    WD-WCAV50083273

 

5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

 

197 Current_Pending_Sector  0x0032  193  193  000    Old_age  Always      -      1219

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure      90%    15673        24018239

 

 

-----------------------------------------------

I'd cut my losses and replace the drive. 
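For anyone reading along: attribute 197 (Current_Pending_Sector) counts sectors the drive could not read and has queued for possible reallocation, while attribute 5 (Reallocated_Sector_Ct) counts sectors it has actually remapped. Here nothing has been reallocated yet, but 1219 sectors are pending — those are the UNC errors in the syslog. A quick sketch of pulling the raw values out of smartctl -A style text (assuming smartctl's usual ten-column attribute layout):

```python
def parse_smart_attrs(output):
    """Map SMART attribute names -> raw values from smartctl -A style text."""
    attrs = {}
    for line in output.splitlines():
        fields = line.split()
        # Attribute rows: numeric ID, name, flag, value, worst, thresh,
        # type, updated, when-failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

report = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   193   193   000    Old_age   Always       -       1219
"""
attrs = parse_smart_attrs(report)
print(attrs["Current_Pending_Sector"])  # -> 1219
```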



 

How bad is it? I'd like to transition to 3TB drives but my current parity is 2TB. Would this drive last long enough to rebuild parity on a new drive?


Likely not. If you ran a correcting parity check first <with the existing 2TB parity>, it would move those "pending" sectors to reallocated... but with that number of pending sectors, I wouldn't trust it.

 

If you attempted to replace your current parity drive with a 3TB parity drive - every sector that could not be read would be incorrect and you would lose the data.

 

In other words, if you attempted to rebuild parity off of that drive...it would fail.


I would just wait for the replacement. But you could try..

 

I never had patience for a parity check that runs at 4 kb/s.

 

While waiting for the new drive is when you want to be careful...ie I wouldn't "write" to the array.  If you get a failure/redball on another disk...ugh.

 


The failing drive is disk 1. A parity update will write incorrect information to parity, making it impossible to recover disk 1. You can try to rebuild disk 1, but it likely needs to be replaced.

 

That's not good news as I had already started a parity check although it has not found any errors. I guess I'll cancel it and wait for the new drive. I'm a bit surprised that things could be so bad without any warning in the GUI. All the drives still show green. The only obvious indication of trouble was a slow parity check and slow write speeds. Should I try and rebuild disk 1 and if so how would I do that?

 

Thanks for the help.


 

 

1. Stop array

2. "unassign" the drive on the Main page

3. Start array - server should come up with that drive showing as missing (but array will still start).

4. Stop array

5. "assign" the drive back

6. Start array - server should start a Reconstruct to that drive - let it run and see if any errors happen again

 

http://lime-technology.com/wiki/index.php/Troubleshooting#What_if_I_get_an_error.3F

What if I get an error?

 

If your array has been running fine for days/weeks/months/years and suddenly you notice a non-zero value in the error column of the web interface, what does that mean? Should I be worried?

Occasionally unRAID will encounter a READ error (not a WRITE error) on a disk. When this happens, unRAID will read the corresponding sector contents of all the other disks + parity to compute the data it was unable to read from the source. It will then WRITE that data back to the source drive. Without going into the technical details, this allows the source drive to fix the bad sector so next time, a read of that sector will be fine. Although this will be reported as an "error", the error has actually been corrected already. This is one of the best and least understood features of unRAID!

There may be OTHER types of errors than this one, so it is certainly worth your while to capture a syslog after an error is detected, but this is likely what has happened. Also, if you notice this happening more than once in a very great while, you might want to consider testing and replacing the disk in question. Remapped sectors have been linked with higher than normal drive failure.

After getting an error, run a parity check soon after, to make sure that all is well.
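The "compute the data it was unable to read" step described above is just XOR arithmetic: with single parity, any one missing block equals the XOR of the parity block and all the surviving data blocks. A toy sketch:

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d1, d2, d3 = b"\x01\x02", b"\xff\x00", b"\x10\x20"
parity = xor_blocks([d1, d2, d3])

# Pretend d2 returned a UNC error; rebuild it from parity + the other disks:
rebuilt = xor_blocks([parity, d1, d3])
print(rebuilt == d2)  # -> True
```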


btw, read this thread on parity checks etc.  From my <limited> understanding - a correcting parity check with a corresponding read failure on a data disk would first attempt to reallocate the sector that could not be read before updating the parity. 

 

See Joe L's post #14

http://lime-technology.com/forum/index.php?topic=20006.msg178551#msg178551

 

Maybe someone with a better understanding than me could explain what happens when there is a read failure during a correcting parity check.



That's the behavior I would expect. It would be odd if a parity check on a seemingly healthy array resulted in the loss of data. If it does try to reallocate I would think a parity check before swapping out the disk might be good.

 

 


What the drive firmware is supposed to do, and what it does, are two different things...

 

When a sector gets an unrecoverable read error, it can mark the sector "reallocation pending". This means that the disk could not read the data, and if the host eventually writes to the sector, the disk will reallocate the sector instead of trying to write over the same media location.

 

However, the disk firmware could also just let the re-write of the sector take place.  Maybe it depends on the nature of the error?

 

Since unRAID can rebuild the data of a failed sector, what it does is this: after rebuilding that data using parity reconstruction, in addition to returning the data to the host, it issues a write of the "corrected" data back to the same logical block address on the drive.  What the drive does with that is really unknown.

 

I know that these kinds of algorithms tend to be treated as "trade secrets" (probably more to provide some arm-waving in the case of bugs, than actually doing anything clever).



I have no idea if the re-construction comes first, or the parity calc...  Perhaps Tom @ lime-tech can respond to what happens if there is a read-failure of a data disk during a "correcting" parity check?  does parity think there are zeros on the data disk, or does parity use the re-constructed data???

 

Joe L.



 

In a "correcting" parity check, for a given block address, data is read from all data drives and the parity drive.  That is, a read I/O operation is sent to each drive.  When the read I/O's complete we look at the completion statuses.

 

Normally, each I/O completion will say "success".  In this case the xor engine xor's all the data blocks together and compares it with the content of the parity block.  If they are equal, then everything's fine, move on to the next block.  If they are not equal, then this is logged in the system log as "incorrect parity", and the block we generated by xor'ing all the data blocks is written to the parity drive. (This is the same thing that happens in a normal parity sync except the parity drive is not read and there's no comparison - it just is written with the calculated parity.)

 

If one of the read I/O's returns "failure" however, then what we do is xor together all the data from each drive except the one that failed.  The resultant block is the data that we should have read from the disk that failed, if the I/O to that disk didn't fail.  After creating this block, we then write it to the same Logical Block Address to the disk that previously returned an I/O error reading that block.

 

If that "write" of calculated data to the failed disk itself fails, well then we disable the data drive and terminate the parity check.

 

What the drive firmware should do is this: "Hey I just got sent data for a sector that I know I couldn't read before, so what the heck, let's write it back to the same sector and immediately re-read to see if it now works, because I know that sometimes re-writing the data fixes the magnetic structure on the disk and I won't have to use one of my limited reallocation sectors. Then if my verify fails, I'll write the data to a reallocated sector."  Or the firmware could say, "Screw it, let's just write the data to the sector again and forget about verifying."  Or the firmware could say, "Forget about writing to the same sector and doing a verify because that's slow, let's just reallocate the sector."  What a particular drive does, you will have to find out who the firmware designer is for the company and ask them.

 

If 2 or more of the read I/O operations sent to the data and parity disks for the same LBA fail, the parity check is immediately terminated.
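Tom's per-block logic above can be condensed into a small sketch. This is my own paraphrase, not unRAID source — the disks here are toy in-memory objects whose read can raise IOError, like a UNC media error:

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

class Disk:
    """Toy stand-in: a 'bad' LBA raises on read, like a pending/UNC sector."""
    def __init__(self, blocks, bad=()):
        self.blocks, self.bad = dict(blocks), set(bad)
    def read(self, lba):
        if lba in self.bad:
            raise IOError("media error (UNC)")
        return self.blocks[lba]
    def write(self, lba, data):
        self.blocks[lba] = data
        self.bad.discard(lba)   # a successful write 'heals' the pending sector

def check_block(data_disks, parity_disk, lba):
    """One block of a correcting parity check, per Tom's description."""
    reads, failed = [], []
    for disk in data_disks + [parity_disk]:
        try:
            reads.append(disk.read(lba))
        except IOError:
            reads.append(None)
            failed.append(disk)
    if len(failed) >= 2:
        raise RuntimeError("2+ read failures at one LBA: terminate check")
    if not failed:
        calc = xor_blocks(reads[:-1])        # xor of all data blocks
        if calc != reads[-1]:                # compare against stored parity
            parity_disk.write(lba, calc)     # "incorrect parity": correct it
        return
    # One read failed: xor everything that did read to get the missing block,
    # then write it back to the same LBA on the disk that failed.
    missing = xor_blocks([r for r in reads if r is not None])
    try:
        failed[0].write(lba, missing)
    except IOError:
        raise RuntimeError("write-back failed: disable disk, stop check")

# Demo: disk 2 has a pending sector at LBA 0; the check rebuilds and rewrites it.
disks = [Disk({0: b"\x01"}), Disk({0: b"\x02"}, bad=[0])]
parity = Disk({0: b"\x03"})
check_block(disks, parity, 0)
print(disks[1].read(0))  # -> b'\x02'
```

As Tom notes, what the real drive firmware then does with that write-back (rewrite in place, verify, or reallocate) is up to the drive.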


Thanks for all the help. I've got the new drive preclearing now. I think my data will be fine but I'm still a bit unclear on Tom's explanation. Is a parity check in this situation (disk with read errors):

1. beneficial

2. harmful

3. neither

4. or depends on how the disk behaves

 

Thanks



 

4 - it depends on how the disk behaves.

 

Under ideal circumstances, a parity check should result in corrected data being written back to the failing drive. However, you still have a failing drive, which needs to be replaced sooner rather than later. unRAID only tolerates one disk failure at a time, so prudence says don't play games with a bad drive if you can at all help it. The longer you play around, the better the chances you will end up with a second drive failure (Murphy's law).


Agreed.  Some will attempt to "heal" a drive that has a few pending sectors, and if during the corresponding parity check/write those pending sectors don't increase, they'll have a drive they feel comfortable with.

 

Here was Tom's reply to me:

"If you see media errors get corrected it’s time to shop for another drive."

 

So, while a correcting parity check can reallocate sectors, it cannot save a failing drive.  The number of pending sectors you had was too great a risk to try to save. :)

