Haenchensd Posted February 1, 2012

Normal monthly parity check started last night, and after 12 hours it was still at 0.0% done and had found about 35,000 errors. The array seems to be running perfectly fine other than this and I've had no problems. However, after looking at my syslog, it seems the error may be only with the parity drive.

Syslog:

Feb 1 12:42:19 Media kernel: ata5.00: failed command: READ DMA EXT (Minor Issues)
Feb 1 12:42:19 Media kernel: ata5.00: cmd 25/00:00:e7:5e:34/00:04:6f:00:00/e0 tag 0 dma 524288 in (Drive related)
Feb 1 12:42:19 Media kernel: res 51/40:00:2c:60:34/40:00:6f:00:00/00 Emask 0x9 (media error) (Errors)
Feb 1 12:42:19 Media kernel: ata5.00: status: { DRDY ERR } (Drive related)
Feb 1 12:42:19 Media kernel: ata5.00: error: { UNC } (Errors)
Feb 1 12:42:19 Media kernel: ata5.00: configured for UDMA/33 (Drive related)
Feb 1 12:42:19 Media kernel: ata5.01: configured for UDMA/133 (Drive related)
Feb 1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] Unhandled sense code (Drive related)
Feb 1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08 (System)
Feb 1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] Sense Key : 0x3 [current] [descriptor] (Drive related)
Feb 1 12:42:19 Media kernel: Descriptor sense data with sense descriptors (in hex):
Feb 1 12:42:19 Media kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 1 12:42:19 Media kernel: 6f 34 60 2c
Feb 1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] ASC=0x11 ASCQ=0x4 (Drive related)
Feb 1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] CDB: cdb[0]=0x28: 28 00 6f 34 5e e7 00 04 00 00 (Drive related)
Feb 1 12:42:19 Media kernel: end_request: I/O error, dev sdf, sector 1865703468 (Errors)
Feb 1 12:42:19 Media kernel: ata5: EH complete (Drive related)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb 1 12:42:19 Media kernel: handle_stripe read error: 1865703400/0, count: 1 (Errors)
Feb 1 12:42:19 Media kernel: mdcmd (53): spindown 1 (Routine)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb 1 12:42:19 Media kernel: handle_stripe read error: 1865703408/0, count: 1 (Errors)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb 1 12:42:19 Media kernel: handle_stripe read error: 1865703416/0, count: 1 (Errors)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb 1 12:42:19 Media kernel: handle_stripe read error: 1865703424/0, count: 1 (Errors)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb 1 12:42:19 Media kernel: handle_stripe read error: 1865703432/0, count: 1 (Errors)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb 1 12:42:19 Media kernel: handle_stripe read error: 1865703440/0, count: 1 (Errors)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb 1 12:42:19 Media kernel: handle_stripe read error: 1865703448/0, count: 1 (Errors)
Feb 1 12:42:19 Media kernel: md: disk0 read error (Errors)

The first few lines repeat occasionally, but the others are repeating thousands of times over.

SMART report for parity:

smartctl -a -d ata /dev/sdf
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: ST32000542AS
Serial Number: 5XW1PS0X
Firmware Version: CC35
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Feb 1 15:24:56 2012 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status: (0x00)
    Offline data collection activity was never started.
    Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0)
    The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 633) seconds.
Offline data collection capabilities: (0x73)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    No Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   093   093   006    Pre-fail  Always   -           158171401
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always   -           5347
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           6
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always   -           35584178
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always   -           6750
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   097   097   020    Old_age   Always   -           3341
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always   -           0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always   -           1009
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   072   044   045    Old_age   Always   In_the_past 28 (0 1 34 24)
194 Temperature_Celsius     0x0022   028   056   000    Old_age   Always   -           28 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   044   035   000    Old_age   Always   -           158171401
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           27
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           27
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline  -           259184096451348
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline  -           1859306417
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline  -           98807890

SMART Error Log Version: 1
ATA Error Count: 1009 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1009 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ---------------  --------------------
25 00 40 1f 52 00 e0 00  00:01:10.499     READ DMA EXT
27 00 00 00 00 00 e0 00  00:01:10.478     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  00:01:10.438     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00  00:01:10.420     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00  00:01:10.388     READ NATIVE MAX ADDRESS EXT

Error 1008 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ---------------  --------------------
25 00 40 1f 52 00 e0 00  00:01:06.558     READ DMA EXT
27 00 00 00 00 00 e0 00  00:01:06.537     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  00:01:06.497     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00  00:01:06.483     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00  00:01:06.457     READ NATIVE MAX ADDRESS EXT

Error 1007 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ---------------  --------------------
25 00 40 1f 52 00 e0 00  00:01:02.617     READ DMA EXT
27 00 00 00 00 00 e0 00  00:01:02.597     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  00:01:02.576     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00  00:01:02.485     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00  00:01:02.477     READ NATIVE MAX ADDRESS EXT

Error 1006 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ---------------  --------------------
25 00 40 1f 52 00 e0 00  00:00:58.747     READ DMA EXT
27 00 00 00 00 00 e0 00  00:00:58.726     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  00:00:58.686     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00  00:00:58.671     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00  00:00:58.646     READ NATIVE MAX ADDRESS EXT

Error 1005 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ---------------  --------------------
25 00 40 1f 52 00 e0 00  00:00:54.846     READ DMA EXT
27 00 00 00 00 00 e0 00  00:00:54.825     READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00  00:00:54.785     IDENTIFY DEVICE
ef 03 42 00 00 00 a0 00  00:00:54.774     SET FEATURES [set transfer mode]
27 00 00 00 00 00 e0 00  00:00:54.745     READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
    After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

All the other drives appear to be in fine working order. Is this a fixable problem or do I just need to replace the drive?

Thanks
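For anyone wanting to cross-check the syslog above: the failing sector can be recovered from the `res` line itself and compared with the `end_request` line. The following is a minimal Python sketch, not something taken from the posts. It assumes the kernel prints the result taskfile as status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and the function name `decode_res_lba` is made up for illustration.

```python
import re

# Twelve hex bytes separated by "/" or ":" following the word "res".
# Assumed layout (libata error report):
#   status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
RES_PATTERN = re.compile(r"res\s+((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_res_lba(line):
    """Return the 48-bit LBA encoded in a 'res' syslog line, or None if absent."""
    match = RES_PATTERN.search(line)
    if match is None:
        return None
    fields = [int(byte, 16) for byte in re.split(r"[/:]", match.group(1))]
    (_status, _error, _nsect, lbal, lbam, lbah,
     _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = fields
    # The low three bytes come from the current registers,
    # the high three from the "hob" (high order byte) registers.
    return ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
            | (lbah << 16) | (lbam << 8) | lbal)


if __name__ == "__main__":
    sample = ("Feb 1 12:42:19 Media kernel: res "
              "51/40:00:2c:60:34/40:00:6f:00:00/00 Emask 0x9 (media error)")
    print(decode_res_lba(sample))
```

Run against the `res` line from the syslog, this gives 1865703468, the same sector that `end_request` reports for sdf, so the UNC error and the md read errors point at the same area of the parity disk.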
Rajahal Posted February 2, 2012

Those errors could indicate a bad drive, or they could be caused by a loose or bad cable. Reseat the drive's power and data connections and run another parity check. If you still see errors, then replace the drive.
Joe L. Posted February 2, 2012

Quote (Rajahal): "Those errors could indicate a bad drive, or they could be caused by a loose or bad cable. Reseat the drive's power and data connections and run another parity check. If you still see errors, then replace the drive."

UNC media errors are unreadable sectors on a physical disk. This is confirmed by the supplied SMART report, which shows 6 re-allocated sectors and 27 more pending re-allocation. The errors are not caused by a bad cable.

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           6
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           27

Now, there are several thousand spare sectors on the disk, so it is not yet "dead" but simply acting exactly as the manufacturer designed it to act.

Unfortunately, the disk will return zeros for the 27 sectors it has not yet re-allocated, so rebuilding a failed data disk will end up with some corruption.

Best bet is to stop the array, un-assign the parity disk, start the array with it un-assigned, stop the array once more, re-assign the parity disk, and let unRAID re-construct it. If it successfully re-allocates the sectors pending re-allocation, then just keep an eye on it.
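To "keep an eye on it" after the rebuild, the same attributes can be pulled out of a saved `smartctl -a` report and compared from one run to the next. This is only a rough Python sketch: the choice of attributes to watch (5, 197 and 198) and the name `sector_warnings` are assumptions for illustration, and it expects the ten-column attribute table shown in the report above.

```python
# Attribute IDs that count bad or suspect sectors (assumed watch list).
WATCHED_IDS = {5, 197, 198}


def sector_warnings(report):
    """Return {attribute_id: raw_value} for watched attributes with a raw value > 0."""
    warnings = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows have at least ten columns and start with a numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id = int(parts[0])
        if attr_id not in WATCHED_IDS:
            continue
        try:
            raw = int(parts[9])  # first token of the RAW_VALUE column
        except ValueError:
            continue  # some drives report non-numeric raw values; skip those
        if raw > 0:
            warnings[attr_id] = raw
    return warnings


SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 6
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 27
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 27
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

if __name__ == "__main__":
    for attr_id, raw in sorted(sector_warnings(SAMPLE).items()):
        print(attr_id, raw)
```

If the pending count (197) drops to zero after the parity rebuild while the reallocated count (5) stays roughly stable, the drive has done what it was designed to do. If either number keeps climbing between checks, that would be the point to think about replacing the disk.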