Advice on drive failure?


I'm running 5.0beta10.

 

Last night I had my server apart to install 5-in-3 cages.  When I went to bed (at 3am!) I left a non-correcting parity check running. I had watched it for some 30 minutes, monitoring drive temperatures, and nothing looked untoward at that point.

 

During the night we had a power cut.  Foolishly, I had restarted the server with the UPS USB cable unplugged, and only plugged it in afterwards.  I suspect this means that apcupsd was not active.  Certainly, there is no log file on my flash drive for the session in which the power cut occurred.

 

When the power was restored, the machine appeared to start up normally, but a few minutes later I discovered that disk1 had red-balled.

 

The main array status (in unMENU) was announcing "Parity updated 67 times to address sync errors".  This worries me, because the last thing I want is for parity to update in response to a drive failure!

 

I am currently rebuilding onto a new drive (isn't it wonderfully easy to install/swap drives with a trayless hot-swap drive cage?!), but I'm a little worried that 67 errors will have been introduced into my data.

 

The SMART report for the failed drive shows nothing untoward (and the fact that I could obtain the SMART report at all suggests that there was no connection problem):

smartctl -a /dev/sdc (--)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    WD-WCAV56040419
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Aug 11 08:12:09 2011 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (20100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 231) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3031)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   132   128   021    Pre-fail  Always       -       6366
 4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2327
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9727
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       271
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       95
193 Load_Cycle_Count        0x0032   187   187   000    Old_age   Always       -       39992
194 Temperature_Celsius     0x0022   111   095   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
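For anyone who would rather not eyeball the whole table, here is a minimal Python sketch that pulls out the raw values of a few attributes from `smartctl -a` output. The watch list (`WATCHED`) and the function name `watched_raw_values` are my own choices for illustration, not anything smartctl or unRAID defines; the sample rows are copied from the report above.

```python
# Attributes where a non-zero raw value is commonly treated as a warning sign.
# This watch list is an assumption for illustration, not an official threshold.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def watched_raw_values(report):
    """Return {attribute_name: raw_value} for watched rows of a smartctl -a table."""
    values = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows have at least 10 columns: ID, name, flag, value,
        # worst, thresh, type, updated, when_failed, raw value.
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in WATCHED:
            if parts[9].isdigit():
                values[parts[1]] = int(parts[9])
    return values


# Rows copied from the report in this post.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9727
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name, raw in watched_raw_values(REPORT).items():
        print(name, raw)
```

All four watched attributes are 0 in this report, which is why I say the drive itself looks healthy.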

 

I would be grateful if someone more experienced could peruse my log file and suggest why the drive red-balled and confirm whether the 67 parity updates may have compromised my data.

syslog-2011-08-11.txt.zip

Link to comment

"Parity updated 67 times to address sync errors" is most likely due to the ReiserFS journal being replayed when the disk came back after an unclean shutdown.

 

I doubt you would have lost data under the conditions you described.

 

Does the log contain messages about the journal?
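If it helps, a rough way to check is to filter the unzipped syslog for journal-related lines. This is only a sketch: the search terms in `KEYWORDS` are my guess at what the relevant messages would contain, and the function name is made up for the example.

```python
import sys

# Assumed search terms; adjust them to whatever your syslog actually prints.
KEYWORDS = ("journal", "reiserfs")


def journal_lines(lines, keywords=KEYWORDS):
    """Return the lines that contain any keyword, matched case-insensitively."""
    wanted = tuple(k.lower() for k in keywords)
    return [
        ln.rstrip("\n")
        for ln in lines
        if any(k in ln.lower() for k in wanted)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_journal.py syslog-2011-08-11.txt
    with open(sys.argv[1], errors="replace") as fh:
        for hit in journal_lines(fh):
            print(hit)
```

Run it against the extracted syslog-2011-08-11.txt and post whatever it prints; lines from around the time the array came back up after the power cut are the ones of interest.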
