Is my parity drive dying?



My normal monthly parity check started last night, and after 12 hours it was still at 0.0% done and had found about 35,000 errors.  Other than this, the array seems to be running perfectly fine and I've had no problems; however, after looking at my syslog, it seems the errors may be limited to the parity drive.

 

Syslog:

Feb  1 12:42:19 Media kernel: ata5.00: failed command: READ DMA EXT (Minor Issues)
Feb  1 12:42:19 Media kernel: ata5.00: cmd 25/00:00:e7:5e:34/00:04:6f:00:00/e0 tag 0 dma 524288 in (Drive related)
Feb  1 12:42:19 Media kernel:          res 51/40:00:2c:60:34/40:00:6f:00:00/00 Emask 0x9 (media error) (Errors)
Feb  1 12:42:19 Media kernel: ata5.00: status: { DRDY ERR } (Drive related)
Feb  1 12:42:19 Media kernel: ata5.00: error: { UNC } (Errors)
Feb  1 12:42:19 Media kernel: ata5.00: configured for UDMA/33 (Drive related)
Feb  1 12:42:19 Media kernel: ata5.01: configured for UDMA/133 (Drive related)
Feb  1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] Unhandled sense code (Drive related)
Feb  1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] Result: hostbyte=0x00 driverbyte=0x08 (System)
Feb  1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] Sense Key : 0x3 [current] [descriptor] (Drive related)
Feb  1 12:42:19 Media kernel: Descriptor sense data with sense descriptors (in hex):
Feb  1 12:42:19 Media kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb  1 12:42:19 Media kernel:         6f 34 60 2c 
Feb  1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] ASC=0x11 ASCQ=0x4 (Drive related)
Feb  1 12:42:19 Media kernel: sd 2:0:0:0: [sdf] CDB: cdb[0]=0x28: 28 00 6f 34 5e e7 00 04 00 00 (Drive related)
Feb  1 12:42:19 Media kernel: end_request: I/O error, dev sdf, sector 1865703468 (Errors)
Feb  1 12:42:19 Media kernel: ata5: EH complete (Drive related)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703400/0, count: 1 (Errors)
Feb  1 12:42:19 Media kernel: mdcmd (53): spindown 1 (Routine)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703408/0, count: 1 (Errors)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703416/0, count: 1 (Errors)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703424/0, count: 1 (Errors)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703432/0, count: 1 (Errors)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703440/0, count: 1 (Errors)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)
Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703448/0, count: 1 (Errors)
Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)

 

The first few lines repeat occasionally, but the others are repeating thousands of times over.
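
To put numbers on that repetition, a rough Python sketch along these lines could tally the syslog. It assumes the lines look exactly like the excerpt above; `summarize_syslog` and `lba_from_hex_bytes` are illustrative names, not part of unRAID or any existing tool. It also checks that the last four sense bytes in the excerpt (`6f 34 60 2c`) decode to the sector reported in the `end_request` line.

```python
import re
from collections import Counter

# Patterns modelled on the syslog excerpt above (assumed format).
MD_READ_ERROR = re.compile(r"md: (disk\d+) read error")
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def summarize_syslog(lines):
    """Count md read errors per array slot and collect failing sectors per device."""
    md_errors = Counter()
    bad_sectors = {}
    for line in lines:
        match = MD_READ_ERROR.search(line)
        if match:
            md_errors[match.group(1)] += 1
            continue
        match = IO_ERROR.search(line)
        if match:
            device, sector = match.group(1), int(match.group(2))
            bad_sectors.setdefault(device, set()).add(sector)
    return md_errors, bad_sectors


def lba_from_hex_bytes(text):
    """Interpret space-separated hex bytes (most significant first) as an LBA."""
    return int("".join(text.split()), 16)


# A few lines copied from the excerpt above.
sample = [
    "Feb  1 12:42:19 Media kernel: end_request: I/O error, dev sdf, sector 1865703468 (Errors)",
    "Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)",
    "Feb  1 12:42:19 Media kernel: handle_stripe read error: 1865703400/0, count: 1 (Errors)",
    "Feb  1 12:42:19 Media kernel: md: disk0 read error (Errors)",
]

md_errors, bad_sectors = summarize_syslog(sample)
print(md_errors)                          # md read errors per array slot
print(bad_sectors)                        # failing sectors per block device
print(lba_from_hex_bytes("6f 34 60 2c"))  # 1865703468
```

Run against the full syslog, this should show whether every md read error is against disk0 and whether sdf is the only device reporting I/O errors.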

 

SMART report for parity:

smartctl -a -d ata /dev/sdf
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST32000542AS
Serial Number:    5XW1PS0X
Firmware Version: CC35
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Wed Feb  1 15:24:56 2012 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 ( 633) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				No Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   093   093   006    Pre-fail  Always       -       158171401
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always       -       5347
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       6
  7 Seek_Error_Rate         0x000f   075   060   030    Pre-fail  Always       -       35584178
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6750
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   020    Old_age   Always       -       3341
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1009
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   044   045    Old_age   Always   In_the_past 28 (0 1 34 24)
194 Temperature_Celsius     0x0022   028   056   000    Old_age   Always       -       28 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   044   035   000    Old_age   Always       -       158171401
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       27
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       27
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       259184096451348
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       1859306417
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       98807890

SMART Error Log Version: 1
ATA Error Count: 1009 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1009 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 1f 52 00 e0 00      00:01:10.499  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:01:10.478  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:10.438  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:01:10.420  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:01:10.388  READ NATIVE MAX ADDRESS EXT

Error 1008 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 1f 52 00 e0 00      00:01:06.558  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:01:06.537  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:06.497  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:01:06.483  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:01:06.457  READ NATIVE MAX ADDRESS EXT

Error 1007 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 1f 52 00 e0 00      00:01:02.617  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:01:02.597  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:02.576  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:01:02.485  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:01:02.477  READ NATIVE MAX ADDRESS EXT

Error 1006 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 1f 52 00 e0 00      00:00:58.747  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:00:58.726  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:00:58.686  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:00:58.671  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:58.646  READ NATIVE MAX ADDRESS EXT

Error 1005 occurred at disk power-on lifetime: 6748 hours (281 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7a 54 00 00  Error: UNC at LBA = 0x0000547a = 21626

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 40 1f 52 00 e0 00      00:00:54.846  READ DMA EXT
  27 00 00 00 00 00 e0 00      00:00:54.825  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:00:54.785  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      00:00:54.774  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 e0 00      00:00:54.745  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

All the other drives appear to be in fine working order.  Is this a fixable problem or do I just need to replace the drive?

 

Thanks


Those errors could indicate a bad drive, or they could be caused by a loose or bad cable.  Reseat the drive's power and data connections and run another parity check.  If you still see errors, then replace the drive.

UNC media errors are unreadable sectors on a physical disk.  The supplied SMART report confirms this: it shows 6 re-allocated sectors and 27 more pending re-allocation.  The errors are not caused by a bad cable; cabling problems normally show up as CRC errors, and UDMA_CRC_Error_Count in the report is 0.

 

5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      6

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      27
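
If you want to pull those counters out of a saved report rather than reading them by eye, here is a minimal Python sketch. It assumes the attribute table layout shown in the smartctl 5.38 output above; `parse_smart_attributes`, `media_warnings` and the `WATCH` tuple are hypothetical names, and adding attribute 198 to the watch list is my own assumption, not something the report requires.

```python
def parse_smart_attributes(report):
    """Map attribute ID -> (name, raw value text) from `smartctl -a` style output."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID, carry a 0x.... flag in the
        # third column, and have at least 10 columns. The flag check keeps
        # rows from the ATA error log (which also start with digits) out.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[int(fields[0])] = (fields[1], " ".join(fields[9:]))
    return attrs


# Attributes to flag when their raw count is non-zero (assumed watch list).
WATCH = (5, 197, 198)


def media_warnings(attrs):
    """Return (id, name, raw count) for watched attributes with a non-zero raw value."""
    warnings = []
    for attr_id in WATCH:
        if attr_id in attrs:
            name, raw = attrs[attr_id]
            count = int(raw.split()[0])
            if count > 0:
                warnings.append((attr_id, name, count))
    return warnings


# Rows copied from the report above, plus one error-log row that must be ignored.
sample_report = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       6
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       27
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
  25 00 40 1f 52 00 e0 00      00:01:10.499  READ DMA EXT
"""

attrs = parse_smart_attributes(sample_report)
for attr_id, name, count in media_warnings(attrs):
    print(attr_id, name, count)
```

Comparing the output before and after a parity rebuild is one way to see whether the pending count has dropped.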

 

Now, there are several thousand spare sectors on the disk, so it is not yet "dead" but simply acting exactly as the manufacturer designed it to act.

 

Unfortunately, the disk will return zeros for the 27 sectors it has not yet re-allocated, so rebuilding a failed data disk will end up with some corruption.

 

Your best bet is to stop the array, un-assign the parity disk, start the array with it un-assigned, stop the array once more, re-assign the parity disk, and let unRAID reconstruct it.  If it successfully re-allocates the sectors pending re-allocation, then just keep an eye on it.
