
errors in syslog



I started getting these errors two days ago, and the server itself seems slower than usual since it started happening.

May 31 18:48:47 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
May 31 18:48:47 Tower kernel: hdparm: sending ioctl 2285 to a partition!
May 31 18:48:49 Tower last message repeated 5 times
May 31 18:48:49 Tower kernel: smartctl: sending ioctl 2285 to a partition!
May 31 18:48:49 Tower last message repeated 3 times
May 31 18:49:49 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
May 31 18:49:49 Tower kernel: hdparm: sending ioctl 2285 to a partition!
May 31 18:49:51 Tower last message repeated 5 times
May 31 18:49:51 Tower kernel: smartctl: sending ioctl 2285 to a partition!
May 31 18:49:51 Tower last message repeated 3 times

I have put it into maintenance mode and run reiserfsck checks; md1-5 were all good.  Anyone have any ideas?



Yes, your SimpleFeatures add-on email-notifications feature is written improperly.  Check with its author.

 

The hdparm command should only be used on a whole disk, not on a partition of a disk.  Other than that, the message is harmless.
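For illustration (a minimal sketch; the device names below are only examples), the difference looks like this:

hdparm -I /dev/sdd     # whole-disk device: fine
hdparm -I /dev/sdd1    # partition device: can trigger the "sending ioctl ... to a partition!" warning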


Wow, I am being really dumb today, because those aren't even the errors I wanted to ask about.  This is:

May 31 22:55:51 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 31 22:55:51 Tower kernel: ata3.00: irq_stat 0x40000001
May 31 22:55:51 Tower kernel: ata3.00: failed command: READ DMA EXT
May 31 22:55:51 Tower kernel: ata3.00: cmd 25/00:08:17:39:d7/00:00:25:00:00/e0 tag 0 dma 4096 in
May 31 22:55:51 Tower kernel: res 51/40:00:17:39:d7/00:00:25:00:00/e0 Emask 0x9 (media error)
May 31 22:55:51 Tower kernel: ata3.00: status: { DRDY ERR }
May 31 22:55:51 Tower kernel: ata3.00: error: { UNC }
May 31 22:55:51 Tower kernel: ata3.00: configured for UDMA/133
May 31 22:55:51 Tower kernel: ata3: EH complete


I am running v5.0-rc3, but it was happening on an earlier beta build as well, and I have attached the syslog from before my last reboot.

 

Here is the system overview from SimpleFeatures:

System Overview

unRAID Version: unRAID Server Plus, Version 5.0-rc3

Motherboard: - 132-BL-E758

Processor: Intel® Core™ i7 - 2.666 GHz

Cache: L1 = 32 kB  L2 = 32 kB  L3 = 1024 kB 

Memory: 6 GB - DIMM0 = 1033 MHz  DIMM2 = 1033 MHz  DIMM4 = 1033 MHz 

Network: 1000Mb/s - Full Duplex

Uptime: 0 days, 0 hrs, 21 mins, 57 secs

syslog.zip



Un-Correctable (UNC) media errors are ALWAYS unreadable sectors on the disk drive itself.  It is not unRAID-release specific, but disk specific.

 

Joe L.


What should I do to correct it, besides ordering a new disk drive?  Also, which disk is it?

Based on your syslog, it is disk4.

 

May 31 23:34:17 Tower kernel: md: import disk4: [8,48] (sdd) WDC_WD20EARS-00S8B1_WD-WCAVY5667002 size: 1953514552
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574224/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574232/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574240/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574248/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574256/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574264/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574272/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574280/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574288/1, count: 1
Jun  1 16:21:54 Tower kernel: md: disk4 read error
Jun  1 16:21:54 Tower kernel: handle_stripe read error: 3443574296/1, count: 1

 

Please get a SMART report and post the result.  It will let you know how to proceed.  Very likely you will need to replace the drive.

To get a SMART report, type:

smartctl -a /dev/sdd
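If you want to keep a copy on the flash drive so it can be attached to a post, redirecting the output should work (the file name is just an example):

smartctl -a /dev/sdd > /boot/smart_disk4.txt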

 

Joe L.

 


That disk shows:

 

  5 Reallocated_Sector_Ct  0x0033  191  191  140    Pre-fail  Always      -      71

 

and

 

197 Current_Pending_Sector  0x0032  197  196  000    Old_age  Always      -      1252

 

There are 71 sectors that have already been re-allocated, and another 1252 pending that will be re-allocated the next time you write to them.

 

Basically, your disk needs to be replaced.  (ASAP) 

DO NOT PERFORM A CORRECTING PARITY CHECK!!! DO NOT!!!    If you do, parity will be changed to reflect the zeros returned when a sector cannot be read.    (and you will lose the ability to re-construct that sector)

 

Replace the disk as soon as possible.  Consider it failed.

 

Joe L.



I wish I remembered who it was on these forums who was advocating that parity checks should always be correcting. This is a live example of how doing something intuitive based on the current unRAID interface (checking and correcting parity) could end up with permanently lost data. It's all too easy to think that checking parity to verify its ability to protect from failure would be a good thing to do right now. I would love to get whoever it was to weigh in here.

 

I'm pretty sure his argument would be that the read should fail, immediately triggering a reconstruction of the failed read followed by an attempt to write the correct data; presumably the drive will either fail the write and be taken offline, or reallocate the sector and complete the write successfully.

 

I could be all wet, but trusting a failing drive to behave predictably makes me nervous.


Of course I started a parity check before your response came... I stopped it as soon as I saw your post.  Hope I didn't lose anything.  I will try to get a HD to replace it tomorrow.  Now, I know sectors on a drive do fail, but how can you tell that the drive itself is dying?  And how many bad sectors on a 2TB drive are too many?


It's more the rate of failure than the absolute number. A perfectly stable high number, while not ideal, is much better than a steadily climbing lower number. The problem with rising bad sector counts is you don't know when or if it will level off.
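Since it is the trend that matters, one rough way to watch it (a sketch only, assuming the suspect drive is still /dev/sdd and that appending a daily sample to a file on the flash drive is acceptable) is a small loop like this:

while true; do
  date >> /boot/smart_trend.txt
  # record just the two counts that matter for the trend
  smartctl -A /dev/sdd | egrep "Reallocated_Sector_Ct|Current_Pending_Sector" >> /boot/smart_trend.txt
  sleep 86400    # one sample per day
done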

  • 3 weeks later...

DO NOT PERFORM A CORRECTING PARITY CHECK!!! DO NOT!!!    If you do, parity will be changed to reflect the zeros returned when a sector cannot be read.    (and you will lose the ability to re-construct that sector)

I wish I remembered who it was on these forums who was advocating that parity checks should always be correcting. This is a live example of how doing something intuitive based on the current unRAID interface (checking and correcting parity) could end up with permanently lost data. It's all too easy to think that checking parity to verify its ability to protect from failure would be a good thing to do right now. I would love to get whoever it was to weigh in here.

 

I'm pretty sure his argument would be that the read should fail, immediately triggering a reconstruction of the failed read followed by an attempt to write the correct data; presumably the drive will either fail the write and be taken offline, or reallocate the sector and complete the write successfully.

 

I could be all wet, but trusting a failing drive to behave predictably makes me nervous.

 

Just saw this, and pretty sure you are referring to me.

 

Joe, I fully agree with your analysis of that drive and its need for replacement, but I really, really hope you did not mean what you said about "parity will be changed to reflect the zeros returned when a sector cannot be read".  That is seriously wrong behavior, and I'm really hoping you were rushed, not thinking about what you said, etc?  Perhaps enjoying a good beverage?  ;)  Because if you have determined that that is what actually happens during a parity calculation when encountering a read failure, then I (we) strongly implore Tom to quickly correct that.  If parity cannot be calculated from the physical data from every disk, then it cannot be calculated and it cannot be changed.  The return from a read failure cannot result in a pseudo value, it must be a true failure, that either stops the process completely with appropriate error messages, or starts a remedial process.

 

Jonathan, your suggestion of reconstruction and rewrite sounds exactly right.  Force the drive to deal with it.  If the data is not rewritten successfully, how can you trust the drive?  Or your parity protection?

 

I never got back to that thread, except to catch some of the responses, some of which sounded valid and needed further thought (and possible mind changing!).  I'm afraid I tend to walk away from confrontation, not good at dealing with...



According to Tom,  if during a parity check/sync a read error occurs, the sector that could not be read will be re-constructed (from the other disks) and supplied to the program requesting it, AND written to the disk that failed the read.  This "should" cause the SMART firmware to re-allocate the sector.  Then the re-constructed data is used to verify parity.  (and it better match at that point!)    If the data cannot be re-constructed, the parity check/sync is aborted.
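As a toy illustration of the single-parity arithmetic behind that reconstruction (just bash arithmetic on one made-up byte per disk, not anything unRAID-specific):

echo $(( 0xA7 ^ 0x3C ^ 0x5F ))    # parity byte for data bytes 0xA7, 0x3C, 0x5F -> prints 196 (0xC4)
echo $(( 0xA7 ^ 0x5F ^ 0xC4 ))    # the "disk" holding 0x3C is unreadable: XOR of the others plus parity recovers 60 (0x3C)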

 

This would indicate to me that it is safe to perform a correcting parity check, even with the possibility of a read error from any disk.

 

HOWEVER..... I am guessing that if a disk sector was determined to be unreadable by some INTERNAL test on the disk (a long or short self-test perhaps, or possibly read-ahead of the cylinder into the disk's buffer cache), unRAID might not get a read error, since it had not issued any read request.  That still results in a sector pending re-allocation.

 

Under those conditions, what do you think we'll get when we attempt to read a sector that has already been marked for re-allocation on the next write?  My guess is we may well not get the original contents, but instead something else... (all zeros perhaps?  who knows)

 

In that case, it may clobber parity.

 

It is easily possible that corrective action cannot occur when a disk is initially read, since parity does not yet exist (it is not yet assigned), as I've seen plenty of disk reports where hundreds of sectors are marked as pending re-allocation.  If Tom's description is accurate, that simply cannot happen (or the SMART firmware is brain-damaged): any sector pending re-allocation should very quickly be re-allocated (or rather, as soon as all the other disks can be spun up) as it is re-written.

 

Something tells me that using the NOCORRECT option is prudent unless you specifically want to fix parity.  I feel more in control of my clobbering of my data that way.
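For reference, on unRAID of this vintage the non-correcting check is the one run with the NOCORRECT option; from the console it can reportedly be started roughly like this (treat the exact syntax as an assumption and prefer the web GUI if unsure):

/root/mdcmd check NOCORRECT    # parity check that reports mismatches but does not write corrections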

 

PS.  I don't frequently drink alcoholic beverages. I prefer Pepsi. (Think I had a glass of wine with friends on Easter...  Did my share of drinking as a teenager...  Drinking age was 18 back then.  That was a looooooooooooooooooooooooooong time ago.)

 

Joe L.


Very reassuring Joe, and thanks for your comments.  What Tom says is exactly what I would have expected, and the way it should be.

 

As to your concern about previously detected bad or suspicious sectors, I don't believe they are handled any differently, from our point of view.  As far as I know, and I cannot conceive of any other valid way of handling them, there is NO difference between a drive just discovering a bad sector, or the drive already knowing it was bad.  In both cases, it would return a read failure.  I strongly suspect that both cases are handled similarly internally, with additional retries and additional calibration of head alignment before retrying again.  I don't think there is any possibility of getting zeroes.  Part of my understanding and reasoning about bad sectors comes from the helpful smartmontools FAQ, especially these 2 questions.

 

As to what happens when there is no parity info available, it should be a read failure too, but without any way to recover!  I think this illustrates tremendously the importance of UnRAID's parity protection, in that the parity drive is not just about drive protection (the ability to reconstruct complete drives), but also data protection at the sector level (the ability to reconstruct and restore any lost sector).  We probably should emphasize that aspect more, because it is equally important.

 

Just to clarify for others, and explain why writes have a special importance in dealing with bad or suspicious sectors, and why there is a Current Pending Sector list:

 

    Drives try almost heroically to preserve the current data contents of any sector.  If a sector read fails, the drive never gives up on that sector until a write to that sector is performed.  Why?  Because there is one very important distinction between a read and a write.  With a read, the data contents are considered important, desired.  With a write, you have just indicated to the drive that you are going to overwrite the sector, and therefore the current contents HAVE NO VALUE!  So the drive can clobber them, write test patterns, fully evaluate the quality of the media under that sector, and make the definitive decision about whether it can be trusted with new data, or should be discarded and remapped.

    If it was a Pending sector, it is removed from that list, because it has been dealt with.  The write to that sector releases the drive from having to preserve the current contents, which allows it to test and reallocate if necessary.

    The Current Pending Sector list is a list of sectors that failed a read but that the drive does not want to give up on yet, and cannot fully test either.  If a read fails, the drive adds the sector to the Current Pending Sector list, to await future testing.  Just because a read fails one time, the sector is not doomed; it is known that under different conditions, perhaps a slightly cleaner electrical environment or a different temperature or a different head alignment, that sector may finally be successfully read.  So the Current Pending Sector list is a list of sectors known to be suspect, possibly bad, but which cannot be fully dealt with yet, because the drive does not want to take any chances damaging the current data contents.
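(As an aside, the smartmontools bad-blocks HOWTO describes forcing exactly this: once the contents of a pending sector are known to be expendable, a single-sector write to its LBA lets the drive test and remap it.  The line below is a heavily hedged sketch only, with a made-up LBA, and absolutely not something to run on a disk that is still an array member, since it overwrites live data.)

dd if=/dev/zero of=/dev/sdd bs=512 count=1 seek=12345678 oflag=direct    # DANGEROUS: overwrites one 512-byte sector at the placeholder LBA 12345678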

