PeterB Posted August 11, 2011

I'm running 5.0beta10. Last night I had my server apart to install 5in3 cages. When I went to bed (at 3am!) I left a non-correcting parity check running, which I had been observing for some 30 minutes, monitoring drive temperatures - there was nothing untoward at this time.

During the night we had a power cut. Now, foolishly, I had restarted the server with the UPS USB cable unplugged, but plugged it in subsequently. I suspect that this means that apcupsd was not active. Certainly, there is no logfile on my flash drive pertaining to the session when the power cut occurred.

When the power was restored, the machine appeared to start up normally, but a few minutes later I discovered that disk1 had red-balled. The main array status (in unMENU) was announcing "Parity updated 67 times to address sync errors". This worries me, because the last thing I want is for parity to update in response to a drive failure!

I am currently rebuilding onto a new drive (isn't it wonderfully easy to install/swap drives with a trayless hot-swap drive cage?!), but I'm a little worried that 67 errors will have been introduced into my data.

The SMART report for the failed drive is showing nothing untoward (and the fact that I could obtain the SMART report suggests that there was no connection problem):

smartctl -a /dev/sdc (--)
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    WD-WCAV56040419
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Aug 11 08:12:09 2011 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (20100) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 231) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3031) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 132   128   021    Pre-fail Always  -           6366
  4 Start_Stop_Count        0x0032 098   098   000    Old_age  Always  -           2327
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           9727
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           271
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           95
193 Load_Cycle_Count        0x0032 187   187   000    Old_age  Always  -           39992
194 Temperature_Celsius     0x0022 111   095   000    Old_age  Always  -           36
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I would be grateful if someone more experienced could peruse my log file and suggest why the drive red-balled, and confirm whether the 67 parity updates may have compromised my data.

Attachment: syslog-2011-08-11.txt.zip
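
For anyone wanting to check a report like the one above without reading every row by eye, here is a minimal Python sketch that pulls the attribute table out of pasted `smartctl -a` text and lists any watched attribute with a non-zero raw value. The choice of watched IDs (5, 197, 198, 199) is an assumption of this sketch, not a smartmontools rule, and the function names are made up for illustration.

```python
import re

# Attribute IDs this sketch treats as warning signs when their raw value is
# non-zero. This selection is an assumption, not something smartctl defines.
WATCH_IDS = {5, 197, 198, 199}

# One row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(report):
    """Return {attribute_id: {...}} for every table row found in the text."""
    attrs = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m.group(1))] = {
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            }
    return attrs

def suspicious(attrs):
    """IDs from WATCH_IDS that are present and have a non-zero raw value."""
    return sorted(i for i in WATCH_IDS if i in attrs and attrs[i]["raw"] > 0)

# A few rows copied from the report in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always  - 0
193 Load_Cycle_Count        0x0032 187 187 000 Old_age  Always  - 39992
197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always  - 0
198 Offline_Uncorrectable   0x0030 200 200 000 Old_age  Offline - 0
199 UDMA_CRC_Error_Count    0x0032 200 200 000 Old_age  Always  - 0
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))
```

On the rows from this report none of the watched attributes has a non-zero raw value, which is consistent with the drive showing "nothing untoward". A clean table does not by itself explain why the disk was red-balled.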
dgaschk Posted August 15, 2011

Data may be compromised. There is no way to know for sure from the syslog. If the content is video or music you'll probably never notice if it is. The data may also be fine...
mcs Posted August 16, 2011

"Parity updated 67 times to address sync errors" is most likely due to the ReiserFS journal being replayed when the disk came back after an unclean shutdown. I doubt you would have lost data under the conditions you described. Does the log contain messages about the journal?
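
To illustrate the mechanism being suggested here, a toy Python sketch, assuming plain single-disk XOR parity and using one byte to stand in for one block. All names are hypothetical and this is not unRAID's code. It shows a data disk changing after parity was last written (as a journal replay would do), a check counting the resulting mismatches, and a rebuild from corrected parity returning the changed data.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def count_sync_errors(data_disks, parity):
    """Count positions where stored parity disagrees with the data disks."""
    expected = xor_parity(data_disks)
    return sum(1 for e, p in zip(expected, parity) if e != p)

def rebuild(surviving_disks, parity):
    """Recover the one missing disk from the survivors plus parity."""
    return xor_parity(list(surviving_disks) + [parity])

if __name__ == "__main__":
    disk1 = b"\x01\x02\x03\x04"
    disk2 = b"\x10\x20\x30\x40"
    parity = xor_parity([disk1, disk2])

    # Simulate one block on disk1 changing without a matching parity write.
    disk1_replayed = b"\x01\xff\x03\x04"
    print(count_sync_errors([disk1_replayed, disk2], parity))

    # A correcting check rewrites parity to match the current data...
    parity = xor_parity([disk1_replayed, disk2])
    # ...so a later rebuild of disk1 yields the replayed contents.
    print(rebuild([disk2], parity) == disk1_replayed)
```

In this toy model the parity update follows the data rather than overwriting it, which is why the updates alone need not mean lost data. Whether that is what happened on this array can only be judged from the syslog.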