ChronoStriker1 Posted May 31, 2012

I started getting these errors two days ago, and the server itself seems slower than usual since it began happening.

May 31 18:48:47 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
May 31 18:48:47 Tower kernel: hdparm: sending ioctl 2285 to a partition!
May 31 18:48:49 Tower last message repeated 5 times
May 31 18:48:49 Tower kernel: smartctl: sending ioctl 2285 to a partition!
May 31 18:48:49 Tower last message repeated 3 times
May 31 18:49:49 Tower kernel: scsi_verify_blk_ioctl: 36 callbacks suppressed
May 31 18:49:49 Tower kernel: hdparm: sending ioctl 2285 to a partition!
May 31 18:49:51 Tower last message repeated 5 times
May 31 18:49:51 Tower kernel: smartctl: sending ioctl 2285 to a partition!
May 31 18:49:51 Tower last message repeated 3 times

I have put the array into maintenance mode and ran reiserfsck checks; md1-5 were all good. Anyone have any ideas?
Joe L. Posted June 1, 2012

> I started getting these errors two days ago ... hdparm: sending ioctl 2285 to a partition! ... smartctl: sending ioctl 2285 to a partition! ... Anyone have any ideas?

Yes, your SimpleFeatures add-on's email-notifications feature is written improperly; check with its author. The hdparm command should only be used on a disk, not on a partition of a disk. Other than that, the message is harmless.
ChronoStriker1 Posted June 1, 2012

Is there any way I can find what's calling hdparm?
Joe L. Posted June 1, 2012

> Is there any way I can find what's calling hdparm?

It is the notify-email feature that is causing the errors. Do a search on the error message in these forums and you'll see the responses.
ChronoStriker1 Posted June 1, 2012

Sorry, I should have done that first; unfortunately I'm sick and didn't think that far ahead.
ChronoStriker1 Posted June 1, 2012

Wow, I am being really dumb today, because that's not even the error I wanted to ask about. This is:

May 31 22:55:51 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
May 31 22:55:51 Tower kernel: ata3.00: irq_stat 0x40000001
May 31 22:55:51 Tower kernel: ata3.00: failed command: READ DMA EXT
May 31 22:55:51 Tower kernel: ata3.00: cmd 25/00:08:17:39:d7/00:00:25:00:00/e0 tag 0 dma 4096 in
May 31 22:55:51 Tower kernel:          res 51/40:00:17:39:d7/00:00:25:00:00/e0 Emask 0x9 (media error)
May 31 22:55:51 Tower kernel: ata3.00: status: { DRDY ERR }
May 31 22:55:51 Tower kernel: ata3.00: error: { UNC }
May 31 22:55:51 Tower kernel: ata3.00: configured for UDMA/133
May 31 22:55:51 Tower kernel: ata3: EH complete
dgaschk Posted June 1, 2012

See here: http://lime-technology.com/forum/index.php?topic=9880.0
ChronoStriker1 Posted June 1, 2012

I am running v5.0-rc3, but it was happening on an earlier beta build as well. I have attached the syslog from before my last reboot. Here is the system overview from SimpleFeatures:

System Overview
unRAID Version: unRAID Server Plus, Version 5.0-rc3
Motherboard: 132-BL-E758
Processor: Intel® Core™ i7 - 2.666 GHz
Cache: L1 = 32 kB, L2 = 32 kB, L3 = 1024 kB
Memory: 6 GB - DIMM0 = 1033 MHz, DIMM2 = 1033 MHz, DIMM4 = 1033 MHz
Network: 1000Mb/s - Full Duplex
Uptime: 0 days, 0 hrs, 21 mins, 57 secs

syslog.zip
Joe L. Posted June 1, 2012

> Wow, I am being really dumb today, because that's not even the error I wanted to ask about. This is:
> May 31 22:55:51 Tower kernel: ata3.00: failed command: READ DMA EXT
> May 31 22:55:51 Tower kernel: ata3.00: cmd 25/00:08:17:39:d7/00:00:25:00:00/e0 tag 0 dma 4096 in
> May 31 22:55:51 Tower kernel:          res 51/40:00:17:39:d7/00:00:25:00:00/e0 Emask 0x9 (media error)
> May 31 22:55:51 Tower kernel: ata3.00: error: { UNC }
> ...

Un-Correctable (UNC) media errors are ALWAYS unreadable sectors on disk drives. This is not unRAID-release specific; it is disk specific.

Joe L.
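For anyone curious exactly which sector that READ DMA EXT tripped over, the taskfile bytes in the "res" line encode the failing LBA. Below is a hypothetical helper, not an unRAID or smartmontools tool, and it assumes the libata log layout (status / error:nsect:lbal:lbam:lbah / the HOB copies of the same registers / device):

```python
# Hypothetical helper, assuming the libata taskfile layout of a line such as
#   res 51/40:00:17:39:d7/00:00:25:00:00/e0
# i.e. status / error:nsect:lbal:lbam:lbah / hob copies of the same registers.
def failing_lba(taskfile: str) -> int:
    groups = taskfile.split("/")
    low = [int(b, 16) for b in groups[1].split(":")]  # error, nsect, lbal, lbam, lbah
    hob = [int(b, 16) for b in groups[2].split(":")]  # hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah
    # 48-bit LBA: hob_lbah:hob_lbam:hob_lbal:lbah:lbam:lbal
    return (hob[4] << 40) | (hob[3] << 32) | (hob[2] << 24) | (low[4] << 16) | (low[3] << 8) | low[2]

print(failing_lba("51/40:00:17:39:d7/00:00:25:00:00/e0"))  # 634861847 (0x25D73917)
```

If that layout assumption holds, the unreadable sector in the log above is LBA 634861847; a SMART self-test log on the drive would typically report the same LBA as its first error.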
ChronoStriker1 Posted June 1, 2012

What should I do to correct it, besides ordering a new disk drive? Also, which disk is it?
Joe L. Posted June 1, 2012

> What should I do to correct it, besides ordering a new disk drive? Also, which disk is it?

Based on your syslog, it is disk4:

May 31 23:34:17 Tower kernel: md: import disk4: [8,48] (sdd) WDC_WD20EARS-00S8B1_WD-WCAVY5667002 size: 1953514552
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574224/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574232/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574240/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574248/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574256/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574264/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574272/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574280/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574288/1, count: 1
Jun 1 16:21:54 Tower kernel: md: disk4 read error
Jun 1 16:21:54 Tower kernel: handle_stripe read error: 3443574296/1, count: 1

Please get a SMART report and post the result. It will let you know how to proceed; very likely you will need to replace the drive. To get a SMART report, type:

smartctl -a /dev/sdd

Joe L.
ChronoStriker1 Posted June 1, 2012

Attached smartctl.txt
Joe L. Posted June 1, 2012

That disk shows:

  5 Reallocated_Sector_Ct   0x0033   191   191   140    Pre-fail  Always       -       71
197 Current_Pending_Sector  0x0032   197   196   000    Old_age   Always       -       1252

There are 71 sectors that have been re-allocated, and another 1252 that will be re-allocated the next time you write to them. Basically, your disk needs to be replaced (ASAP).

DO NOT PERFORM A CORRECTING PARITY CHECK!!! DO NOT!!! If you do, parity will be changed to reflect the zeros returned when a sector cannot be read (and you will lose the ability to re-construct that sector).

Replace the disk as soon as possible. Consider it failed.

Joe L.
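If you keep old smartctl output around, pulling those two raw values out of the text is easy to script. This is a quick sketch, not a smartmontools feature, and it assumes the standard attribute-table layout where the raw value is the last column; the SAMPLE text is just the two lines quoted above:

```python
import re

# Sketch only: parse raw values out of saved "smartctl -a" text, assuming the
# standard attribute table where the raw value is the final column.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   191   191   140    Pre-fail  Always       -       71
197 Current_Pending_Sector  0x0032   197   196   000    Old_age   Always       -       1252
"""

def raw_values(report: str) -> dict:
    vals = {}
    for line in report.splitlines():
        # attribute id, attribute name, hex flags ... raw value at end of line
        m = re.match(r"\s*\d+\s+(\S+)\s+0x[0-9a-fA-F]+.*\s(\d+)\s*$", line)
        if m:
            vals[m.group(1)] = int(m.group(2))
    return vals

print(raw_values(SAMPLE))
# {'Reallocated_Sector_Ct': 71, 'Current_Pending_Sector': 1252}
```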
JonathanM Posted June 2, 2012

> DO NOT PERFORM A CORRECTING PARITY CHECK!!! DO NOT!!! If you do, parity will be changed to reflect the zeros returned when a sector cannot be read (and you will lose the ability to re-construct that sector).

I wish I remembered who it was on these forums that was advocating that parity checks should always be correcting. This is a live example of how doing something intuitive based on the current unRAID interface (checking and correcting parity) could end up with permanently lost data. It's all too easy to think that checking parity to verify its ability to protect from failure would be a good thing to do right now. I would love to get whoever it was to weigh in here. I'm pretty sure his argument would be that the read should fail, immediately triggering a reconstruction of the failed read, followed by an attempt to write the correct data; presumably the drive will either fail the write and get taken offline, or reallocate the sector and complete the write. I could be all wet, but trusting a failing drive to behave predictably makes me nervous.
ChronoStriker1 Posted June 2, 2012

Of course I started a parity check before your response came... I stopped it as soon as I saw your post. Hope I didn't lose anything. I will try to get a drive to replace it tomorrow. Now, I know sectors on a drive do fail, but how can you tell that the drive itself is dying? And how many bad sectors on a 2 TB drive are too many?
JonathanM Posted June 2, 2012

> Now, I know sectors on a drive do fail, but how can you tell that the drive itself is dying? And how many bad sectors on a 2 TB drive are too many?

It's more the rate of failure than the absolute number. A perfectly stable high number, while not ideal, is much better than a steadily climbing lower number. The problem with rising bad-sector counts is that you don't know when, or if, they will level off.
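One way to act on that advice is to log the raw Current_Pending_Sector value periodically and watch the slope rather than the count. The dates and counts below are invented for illustration (they are not from the attached SMART report):

```python
from datetime import date

# Illustrative sketch: invented history of the raw Current_Pending_Sector
# value. What matters is the growth rate, not the absolute number.
history = [
    (date(2012, 5, 25), 0),
    (date(2012, 5, 28), 410),
    (date(2012, 5, 31), 1252),
]

def pending_growth_per_day(history):
    (d0, c0), (d1, c1) = history[0], history[-1]
    return (c1 - c0) / max((d1 - d0).days, 1)

print(round(pending_growth_per_day(history), 1))  # 208.7
```

A flat slope on a nonzero count is a drive to watch; a slope like this invented one (hundreds of new pending sectors per day) is a drive to replace.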
RobJ Posted June 23, 2012

> DO NOT PERFORM A CORRECTING PARITY CHECK!!! DO NOT!!! If you do, parity will be changed to reflect the zeros returned when a sector cannot be read (and you will lose the ability to re-construct that sector).

> I wish I remembered who it was on these forums that was advocating that parity checks should always be correcting. ... I'm pretty sure his argument would be that the read should fail, immediately triggering a reconstruction of the failed read, followed by an attempt to write the correct data ... I could be all wet, but trusting a failing drive to behave predictably makes me nervous.

Just saw this, and I'm pretty sure you are referring to me.

Joe, I fully agree with your analysis of that drive and its need for replacement, but I really, really hope you did not mean what you said about "parity will be changed to reflect the zeros returned when a sector cannot be read". That is seriously wrong behavior, and I'm really hoping you were rushed and not thinking about what you said. Perhaps enjoying a good beverage? Because if you have determined that that is what actually happens during a parity calculation when a read failure is encountered, then I (we) strongly implore Tom to correct it quickly. If parity cannot be calculated from the physical data of every disk, then it cannot be calculated and it cannot be changed. The return from a read failure cannot result in a pseudo value; it must be a true failure that either stops the process completely with appropriate error messages, or starts a remedial process.

Jonathan, your suggestion of reconstruction and rewrite sounds exactly right. Force the drive to deal with it. If the data is not rewritten successfully, how can you trust the drive? Or your parity protection?

I never got back to that thread, except to catch some of the responses, some of which sounded valid and needed further thought (and possible mind changing!). I'm afraid I tend to walk away from confrontation; I'm not good at dealing with it...
Joe L. Posted June 23, 2012

> Joe, I fully agree with your analysis of that drive and its need for replacement, but I really, really hope you did not mean what you said about "parity will be changed to reflect the zeros returned when a sector cannot be read". ... The return from a read failure cannot result in a pseudo value; it must be a true failure that either stops the process completely with appropriate error messages, or starts a remedial process.

According to Tom, if a read error occurs during a parity check/sync, the sector that could not be read will be re-constructed (from the other disks), supplied to the program requesting it, AND written back to the disk that failed the read. This "should" cause the SMART firmware to re-allocate the sector. The re-constructed data is then used to verify parity (and it had better match at that point!). If the data cannot be re-constructed, the parity check/sync is aborted. This would indicate to me that it is safe to perform a correcting parity check, even with the possibility of a read error from any disk.

HOWEVER... I am guessing that if a disk sector was determined to be un-readable by some INTERNAL test on the disk (a long or short self-test request perhaps, or possibly a read-ahead of the cylinder into the disk's buffer cache), then unRAID might not get a read error, since it had not issued any read request. That results in a sector pending re-allocation. In those conditions, what do you think we'll get when we attempt to read a sector that has already been marked for re-allocation-on-next-write? My guess is we may well not get the original contents, but instead something else (all zeros perhaps? who knows). In that case, it may clobber parity.

It is also possible that corrective action cannot occur when a disk is initially read, since parity does not yet exist (not yet assigned). And I've seen plenty of disk reports where hundreds of sectors are marked as pending re-allocation. If Tom's description is accurate, that simply should not happen (or the SMART firmware is brain-damaged): any sector pending re-allocation should be re-allocated very quickly (or rather, as soon as all the other disks can be spun up) as it is re-written.

Something tells me that using the NOCORRECT option is prudent unless you specifically want to fix parity. I feel more in control of my clobbering of my data that way.

PS. I don't frequently drink alcoholic beverages; I prefer Pepsi. (I think I had a glass of wine with friends on Easter... I did my share of drinking as a teenager. The drinking age was 18 back then, and that was a looooooooooooooooooooooooooong time ago.)

Joe L.
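The recovery Tom describes rests on single-parity arithmetic: parity is the byte-wise XOR of the data disks, so any one unreadable sector can be rebuilt from parity plus the surviving disks. A toy model (obviously not unRAID's actual md driver):

```python
# Toy single-parity model, not unRAID's md driver: parity is the byte-wise
# XOR of the data disks, so any one missing sector can be rebuilt.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]  # one "sector" from each of 3 data disks
parity = xor_blocks(data)

# disk 1 returns a read error: rebuild its sector from parity + surviving disks
rebuilt = xor_blocks([parity, data[0], data[2]])
assert rebuilt == data[1]
# Per Tom's description, unRAID would now write the rebuilt sector back to the
# failing disk, prompting its firmware to reallocate the bad sector, and then
# use the rebuilt data to verify parity.
```

The same XOR also shows why two simultaneous failures are unrecoverable: with two sectors missing, the equation has two unknowns.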
RobJ Posted June 24, 2012

Very reassuring, Joe, and thanks for your comments. What Tom says is exactly what I would have expected, and the way it should be.

As to your concern about previously detected bad or suspicious sectors, I don't believe they are handled any differently, from our point of view. As far as I know (and I cannot conceive of any other valid way of handling them), there is NO difference between a drive just discovering a bad sector and a drive that already knew it was bad. In both cases, it would return a read failure. I strongly suspect that both cases are handled similarly internally, with additional retries and additional calibration of head alignment before retrying again. I don't think there is any possibility of getting zeroes. Part of my understanding and reasoning about bad sectors comes from the helpful smartmontools FAQ, especially these 2 questions.

As to what happens when there is no parity info available, it should be a read failure too, but without any way to recover! I think this illustrates tremendously the importance of unRAID's parity protection: the parity drive is not just about drive protection (the ability to reconstruct complete drives), but also data protection at the sector level (the ability to reconstruct and restore any lost sector). We probably should emphasize that aspect more, because it is equally important.

Just to clarify for others, and to explain why writes have a special importance in dealing with bad or suspicious sectors, and why there is a Current Pending Sector list: drives try almost heroically to preserve the current data contents of any sector. If a sector read fails, the drive never gives up on that sector until a write to that sector is performed. Why? Because there is one very important distinction between a read and a write. With a read, the data contents are considered important, desired. With a write, you have just indicated to the drive that you are going to overwrite the sector, and therefore the current contents HAVE NO VALUE! So the drive can clobber it, write test patterns to it, fully evaluate the quality of the media under that sector, and make the definitive decision about whether it can be trusted with new data or should be discarded and remapped. If it was a Pending sector, it is removed from that list, because it has been dealt with. The write to that sector releases the drive from having to preserve the current contents, which allows it to test and reallocate if necessary.

The Current Pending Sector list is a list of sectors that failed a read but that the drive doesn't want to give up on yet, and cannot fully test either. If a read fails, the drive adds the sector to the Current Pending Sector list, to await future testing. Just because a read fails one time, the sector is not doomed: it is known that under different conditions (perhaps a slightly cleaner electrical environment, a different temperature, or a different head alignment) the sector may finally be read successfully. So the Current Pending Sector list is a list of sectors known to be suspect, possibly bad, that cannot be fully dealt with yet, because the drive does not want to take any chances of damaging the current data contents.
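RobJ's read/write distinction can be sketched as a tiny state model. This is invented purely for illustration (it is obviously not real drive firmware): a failed read parks the LBA on the pending list, and a later write frees the drive to test the media and either reuse or remap the sector, clearing the pending entry either way:

```python
# Invented illustration of the pending/reallocated lists, not real firmware.
class ToyDrive:
    def __init__(self):
        self.pending = set()       # Current_Pending_Sector members
        self.reallocated = set()   # counted by Reallocated_Sector_Ct

    def read_failed(self, lba):
        # Contents are still wanted, so the drive only marks the sector suspect.
        self.pending.add(lba)

    def write(self, lba, media_ok):
        # New data means the old contents have no value: test the media, decide.
        self.pending.discard(lba)
        if not media_ok:
            self.reallocated.add(lba)  # media failed the write test: remap it

d = ToyDrive()
d.read_failed(12345)
assert 12345 in d.pending
d.write(12345, media_ok=False)
assert not d.pending and 12345 in d.reallocated
```

This also mirrors the thread's main point: a correcting parity check that rewrites reconstructed sectors is exactly the kind of write that drains the pending list, which is why the write-back behavior matters so much.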
Archived
This topic is now archived and is closed to further replies.