Errors (SOLVED)



Can anyone tell me what all these errors mean? I'm running 5.0 b14 with sabnzbd and sickbeard.

 

Jun  3 11:25:36 SERVER logger: #  * Extensive error-handling mechanism, mirroring OpenSSL's error codes (Errors)
Jun  3 11:37:35 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:37:35 SERVER kernel:          res 51/40:c7:84:c9:68/40:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:37:35 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:39:31 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:39:31 SERVER kernel:          res 51/40:df:2d:87:8e/40:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:39:31 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:41:15 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:41:15 SERVER kernel:          res 51/40:af:8c:e6:af/40:00:00:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:41:15 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:46:30 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:46:30 SERVER kernel:          res 51/40:9f:9b:b8:09/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:46:30 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:47:29 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:47:29 SERVER kernel:          res 51/40:4f:22:fd:0c/40:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:47:29 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:48:56 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:48:56 SERVER kernel:          res 51/40:b7:0d:54:20/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:48:56 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:49:43 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:49:43 SERVER kernel:          res 51/40:cf:9f:06:31/40:02:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:49:43 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:50:25 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:50:25 SERVER kernel:          res 51/40:00:83:2a:34/40:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:50:25 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:51:29 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:51:29 SERVER kernel:          res 51/40:6f:af:db:47/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:51:29 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:51:58 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:51:58 SERVER kernel:          res 51/40:4f:6a:c0:57/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:51:58 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:54:02 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:54:02 SERVER kernel:          res 51/40:4f:36:9f:6d/40:01:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:54:02 SERVER kernel: ata5.00: error: { UNC } (Errors)
Jun  3 11:55:17 SERVER kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Jun  3 11:55:17 SERVER kernel:          res 51/40:00:03:b5:80/40:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
Jun  3 11:55:17 SERVER kernel: ata5.00: error: { UNC } (Errors)


ata5 is a WD 1 TB EADS:

Jun  3 11:25:19 SERVER kernel: ata5.00: ATA-8: WDC WD10EADS-00M2B0, 01.00A01, max UDMA/133

 

It looks like you have two in your system. They were designated as sdf and sdg in this boot-up:

Jun  3 11:25:19 SERVER kernel: scsi 4:0:0:0: Direct-Access    ATA      WDC WD10EADS-00M 01.0 PQ: 0 ANSI: 5

Jun  3 11:25:19 SERVER kernel: sd 4:0:0:0: [sdf] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)

Jun  3 11:25:19 SERVER kernel: scsi 5:0:0:0: Direct-Access    ATA      WDC WD10EADS-00M 01.0 PQ: 0 ANSI: 5

Jun  3 11:25:19 SERVER kernel: sd 5:0:0:0: [sdg] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)

 

I'm not sure if ata5 = scsi 5. If it does, then it's /dev/sdg.
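For what it's worth, on recent kernels you don't have to guess: the resolved sysfs path of a block device includes its libata port name, so ata5 can be matched to sdf or sdg directly. A rough Python sketch — the example path (including the PCI address) is purely hypothetical, for illustration:

```python
import re

def ata_port_from_path(sysfs_path):
    """Extract the libata port name ('ataN') from a resolved sysfs block path."""
    m = re.search(r"/(ata\d+)/", sysfs_path)
    return m.group(1) if m else None

# On a live system: ata_port_from_path(os.path.realpath("/sys/block/sdg"))
# Hypothetical resolved path, for illustration only:
example = ("/sys/devices/pci0000:00/0000:00:1f.2/ata5/host4/"
           "target4:0:0/4:0:0:0/block/sdg")
print(ata_port_from_path(example))  # -> ata5
```

Checking both sdf and sdg this way (or matching the serial number from smartctl -i) removes the ambiguity.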

 

 

 

 


Device Model:    WDC WD10EADS-00M2B0

Serial Number:    WD-WCAV50083273

 

5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0

 

197 Current_Pending_Sector  0x0032  193  193  000    Old_age  Always      -      1219

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed: read failure      90%    15673        24018239

 

 

-----------------------------------------------

I'd cut my losses and replace the drive. 
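For anyone reading along: attribute 197 (Current_Pending_Sector) counts sectors the drive could not read and has queued for possible reallocation, while attribute 5 (Reallocated_Sector_Ct) counts sectors it has actually remapped. Here nothing has been reallocated yet, but 1219 sectors are pending — those are the UNC errors in the syslog. A quick sketch of pulling the raw values out of smartctl -A style text (assuming smartctl's usual ten-column attribute layout):

```python
def parse_smart_attrs(output):
    """Map SMART attribute names -> raw values from smartctl -A style text."""
    attrs = {}
    for line in output.splitlines():
        fields = line.split()
        # Attribute rows: numeric ID, name, flag, value, worst, thresh,
        # type, updated, when-failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

report = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   193   193   000    Old_age   Always       -       1219
"""
attrs = parse_smart_attrs(report)
print(attrs["Current_Pending_Sector"])  # -> 1219
```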



 

How bad is it? I'd like to transition to 3TB drives but my current parity is 2TB. Would this drive last long enough to rebuild parity on a new drive?


Likely not. If you ran a correcting parity check first <with the existing 2TB parity>, it would move those "pending" sectors to reallocated... but with that number of pending sectors, I wouldn't trust it.

 

If you attempted to replace your current parity drive with a 3TB parity drive - every sector that could not be read would be incorrect and you would lose the data.

 

In other words, if you attempted to rebuild parity off of that drive...it would fail.


I would just wait for the replacement. But you could try..

 

I never had patience for a parity check that runs at 4 kb/s.

 

While waiting for the new drive is when you want to be careful...ie I wouldn't "write" to the array.  If you get a failure/redball on another disk...ugh.

 


The failing drive is disk 1. A parity update will write incorrect information to parity, making it impossible to recover disk 1. You can try to rebuild disk 1, but it likely needs to be replaced.

 

That's not good news as I had already started a parity check although it has not found any errors. I guess I'll cancel it and wait for the new drive. I'm a bit surprised that things could be so bad without any warning in the GUI. All the drives still show green. The only obvious indication of trouble was a slow parity check and slow write speeds. Should I try and rebuild disk 1 and if so how would I do that?

 

Thanks for the help.


 

 

1. Stop array

2. "unassign" the drive on the Main page

3. Start array - server should come up with that drive showing as missing (but array will still start).

4. Stop array

5. "assign" the drive back

6. Start array - server should start a Reconstruct to that drive - let it run and see if any errors happen again

 

http://lime-technology.com/wiki/index.php/Troubleshooting#What_if_I_get_an_error.3F

What if I get an error?

 

If your array has been running fine for days/weeks/months/years and suddenly you notice a non-zero value in the error column of the web interface, what does that mean? Should I be worried?

Occasionally unRAID will encounter a READ error (not a WRITE error) on a disk. When this happens, unRAID will read the corresponding sector contents of all the other disks + parity to compute the data it was unable to read from the source. It will then WRITE that data back to the source drive. Without going into the technical details, this allows the source drive to fix the bad sector so next time, a read of that sector will be fine. Although this will be reported as an "error", the error has actually been corrected already. This is one of the best and least understood features of unRAID!

There may be OTHER types of errors than this one, so it is certainly worth your while to capture a syslog after an error is detected, but this is likely what has happened. Also, if you notice this happening more than once in a very great while, you might want to consider testing and replacing the disk in question. Remapped sectors have been linked with higher than normal drive failure.

After getting an error, run a parity check soon after, to make sure that all is well.
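The "compute the data it was unable to read" step described above is just XOR arithmetic: with single parity, any one missing block equals the XOR of the parity block and all the surviving data blocks. A toy sketch:

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

d1, d2, d3 = b"\x01\x02", b"\xff\x00", b"\x10\x20"
parity = xor_blocks([d1, d2, d3])

# Pretend d2 returned a UNC error; rebuild it from parity + the other disks:
rebuilt = xor_blocks([parity, d1, d3])
print(rebuilt == d2)  # -> True
```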


btw, read this thread on parity checks etc.  From my <limited> understanding - a correcting parity check with a corresponding read failure on a data disk would first attempt to reallocate the sector that could not be read before updating the parity. 

 

See Joe L's post #14

http://lime-technology.com/forum/index.php?topic=20006.msg178551#msg178551

 

Maybe someone with a better understanding than me could explain what happens when there is a read failure during a correcting parity check.



That's the behavior I would expect. It would be odd if a parity check on a seemingly healthy array resulted in the loss of data. If it does try to reallocate I would think a parity check before swapping out the disk might be good.

 

 


What the drive firmware is supposed to do, and what it does, are two different things...

 

When a sector gets an unrecoverable read error, it can mark the sector "reallocation pending". This means that the disk could not read the data, and if the host eventually writes to the sector, the disk will reallocate the sector instead of trying to write over the same media location.

 

However, the disk firmware could also just let the re-write of the sector take place.  Maybe it depends on the nature of the error?

 

Since unRAID can rebuild the data of a failed sector, what it does is this: after rebuilding that data using parity reconstruction, in addition to returning the data to the host, it issues a write of the "corrected" data back to the same logical block address on the drive.  What the drive does with that is really unknown.

 

I know that these kinds of algorithms tend to be treated as "trade secrets" (probably more to provide some arm-waving in the case of bugs, than actually doing anything clever).



I have no idea if the re-construction comes first, or the parity calc...  Perhaps Tom @ lime-tech can respond to what happens if there is a read-failure of a data disk during a "correcting" parity check?  does parity think there are zeros on the data disk, or does parity use the re-constructed data???

 

Joe L.



 

In a "correcting" parity check, for a given block address, data is read from all data drives and the parity drive.  That is, a read I/O operation is sent to each drive.  When the read I/O's complete we look at the completion statuses.

 

Normally, each I/O completion will say "success".  In this case the xor engine xor's all the data blocks together and compares it with the content of the parity block.  If they are equal, then everything's fine, move on to the next block.  If they are not equal, then this is logged in the system log as "incorrect parity", and the block we generated by xor'ing all the data blocks is written to the parity drive. (This is the same thing that happens in a normal parity sync except the parity drive is not read and there's no comparison - it just is written with the calculated parity.)

 

If one of the read I/O's returns "failure" however, then what we do is xor together all the data from each drive except the one that failed.  The resultant block is the data that we should have read from the disk that failed, if the I/O to that disk didn't fail.  After creating this block, we then write it to the same Logical Block Address to the disk that previously returned an I/O error reading that block.

 

If that "write" of calculated data to the failed disk itself fails, well then we disable the data drive and terminate the parity check.

 

What the drive firmware should do is this: "Hey I just got sent data for a sector that I know I couldn't read before, so what the heck, let's write it back to the same sector and immediately re-read to see if it now works, because I know that sometimes re-writing the data fixes the magnetic structure on the disk and I won't have to use one of my limited reallocation sectors. Then if my verify fails, I'll write the data to a reallocated sector."  Or the firmware could say, "Screw it, let's just write the data to the sector again and forget about verifying."  Or the firmware could say, "Forget about writing to the same sector and doing a verify because that's slow, let's just reallocate the sector."  What a particular drive does, you will have to find out who the firmware designer is for the company and ask them.

 

If 2 or more of the read I/O operations sent to the data and parity disks for the same LBA fail, the parity check is immediately terminated.
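Tom's per-block logic above can be condensed into a small sketch. This is my own paraphrase, not unRAID source — the disks here are toy in-memory objects whose read can raise IOError, like a UNC media error:

```python
def xor_blocks(blocks):
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

class Disk:
    """Toy stand-in: a 'bad' LBA raises on read, like a pending/UNC sector."""
    def __init__(self, blocks, bad=()):
        self.blocks, self.bad = dict(blocks), set(bad)
    def read(self, lba):
        if lba in self.bad:
            raise IOError("media error (UNC)")
        return self.blocks[lba]
    def write(self, lba, data):
        self.blocks[lba] = data
        self.bad.discard(lba)   # a successful write 'heals' the pending sector

def check_block(data_disks, parity_disk, lba):
    """One block of a correcting parity check, per Tom's description."""
    reads, failed = [], []
    for disk in data_disks + [parity_disk]:
        try:
            reads.append(disk.read(lba))
        except IOError:
            reads.append(None)
            failed.append(disk)
    if len(failed) >= 2:
        raise RuntimeError("2+ read failures at one LBA: terminate check")
    if not failed:
        calc = xor_blocks(reads[:-1])        # xor of all data blocks
        if calc != reads[-1]:                # compare against stored parity
            parity_disk.write(lba, calc)     # "incorrect parity": correct it
        return
    # One read failed: xor everything that did read to get the missing block,
    # then write it back to the same LBA on the disk that failed.
    missing = xor_blocks([r for r in reads if r is not None])
    try:
        failed[0].write(lba, missing)
    except IOError:
        raise RuntimeError("write-back failed: disable disk, stop check")

# Demo: disk 2 has a pending sector at LBA 0; the check rebuilds and rewrites it.
disks = [Disk({0: b"\x01"}), Disk({0: b"\x02"}, bad=[0])]
parity = Disk({0: b"\x03"})
check_block(disks, parity, 0)
print(disks[1].read(0))  # -> b'\x02'
```

As Tom notes, what the real drive firmware then does with that write-back (rewrite in place, verify, or reallocate) is up to the drive.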


Thanks for all the help. I've got the new drive preclearing now. I think my data will be fine but I'm still a bit unclear on Tom's explanation. Is a parity check in this situation (disk with read errors):

1. beneficial

2. harmful

3. neither

4. or depends on how the disk behaves

 

Thanks



 

4 - it depends on how the disk behaves.

 

Under ideal circumstances, a parity check should result in corrected data being written back to the failing drive. However, you still have a failing drive, which needs to be replaced sooner rather than later. unRAID only tolerates one disk failure at a time, so prudence says don't play games with a bad drive if you can at all help it. The longer you play around, the better the chances you will end up with a second drive failure (Murphy's law).


Agreed.  Some will attempt to "heal" a drive that has a few pending sectors, and if during the corresponding parity check/write those pending sectors don't increase, they'll have a drive they feel comfortable with.

 

Here was Tom's reply to me:

"If you see media errors get corrected it’s time to shop for another drive."

 

So, while a correcting parity check can reallocate sectors, it cannot save a failing drive.  The number of pending sectors you had was too great a risk to try to save. :)

