Multiple 3TB pre-clear failures



This is a new server with a fresh 5.0b12a install. It has a seven-disk array (including parity and cache), all 3TB Hitachi 5K3000s. The existing disks were all successfully pre-cleared in an older chassis running 4.7, using the 1.12 beta pre-clear script.

 

The present failures involve a further five disks of the same type and size that I attempted to pre-clear using the 1.13 script in the new (5.0b12a) chassis. All five failed to pre-clear successfully. Two of the five gave up relatively quickly, at the 10h36 mark: one was 97% of the way through the pre-read, and the other, at around the same time, was already into the zeroing phase. Unfortunately this happened late in the evening, so I was not around to see it happen, and the log subsequently filled with thousands of lines of 'garbage' (more on that below).
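
One way to make sure an overnight event like this isn't lost is to snapshot the syslog to the flash drive as soon as the problem is noticed; a minimal sketch, assuming the stock unRAID layout where the live log is /var/log/syslog and /boot is the flash drive:

mkdir -p /boot/logs
# copy the current (RAM-based) syslog to flash with a timestamp in the name
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt
# and eyeball the most recent entries
tail -n 100 /var/log/syslog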

 

The other three went on to complete the cycle in the expected 42 hours or so, but all reported an unsuccessful pre-clear due to "Post-read detected un-expected non-zero bytes on disk". The SMART reports do not (as far as I can see) indicate anything untoward: no reallocated sectors or the like. This is a screen capture of the final status for one of the disks; the other two produced identical reports:

 

================================================================== 1.13

=                unRAID server Pre-Clear disk /dev/sdm

=              cycle 1 of 1, partition start on sector 1

= Disk Pre-Clear-Read completed                                DONE

= Step 1 of 10 - Copying zeros to first 2048k bytes            DONE

= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE

= Step 3 of 10 - Disk is now cleared from MBR onward.          DONE

= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4      DONE

= Step 5 of 10 - Clearing MBR code area                        DONE

= Step 6 of 10 - Setting MBR signature bytes                    DONE

= Step 7 of 10 - Setting partition 1 to precleared state        DONE

= Step 8 of 10 - Notifying kernel we changed the partitioning  DONE

= Step 9 of 10 - Creating the /dev/disk/by* entries            DONE

= Step 10 of 10 - Verifying if the MBR is cleared.              DONE

= Disk Post-Clear-Read completed                                DONE

Disk Temperature: 35C, Elapsed Time:  40:59:04

========================================================================1.13

==  Hitachi HDS5C3030ALA630    MJ1311YNG1AVEA

== Disk /dev/sdm has NOT been precleared successfully

== skip=332600 count=200 bs=8225280 returned instead of 00000

============================================================================

** Changed attributes in files: /tmp/smart_start_sdm  /tmp/smart_finish_sdm

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      Temperature_Celsius =  171    157            0        ok          35

No SMART attributes are FAILING_NOW

 

0 sectors were pending re-allocation before the start of the preclear.

0 sectors were pending re-allocation after pre-read in cycle 1 of 1.

0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

0 sectors are pending re-allocation at the end of the preclear,

    the number of sectors pending re-allocation did not change.

0 sectors had been re-allocated before the start of the preclear.

0 sectors are re-allocated at the end of the preclear,

    the number of sectors re-allocated did not change.

root@Tower2:/boot#

 

=============================================================================
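
The failing line in the capture above names an exact region of the disk (skip=332600, count=200, bs=8225280). If the disk has not been touched since, one way to double-check whether those bytes really read back as non-zero is to re-read just that span with dd and scan it for a non-zero byte; a rough sketch, assuming the drive is still /dev/sdm and re-using the skip/count/bs values from the report (this reads roughly 1.5 GB, so it takes a minute or two):

# re-read the span the post-read complained about and report the first non-zero byte, if any
dd if=/dev/sdm bs=8225280 skip=332600 count=200 2>/dev/null \
  | od -An -tx1 -v | grep -n -m1 '[1-9a-f]'
# no output means the whole span now reads back as zeros;
# each od line is 16 bytes, so the line number hints at the offset within the span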

 

Here is the preclear report for the same disk, as posted in /boot/preclear_reports:

 

 

========================================================================1.13

== invoked as: ./preclear_disk.sh /dev/sdm

==

== Disk /dev/sdm has NOT been successfully precleared

== Postread detected un-expected non-zero bytes on disk==

== Ran 1 cycle

==

== Using :Read block size = 8225280 Bytes

== Last Cycle's Pre Read Time  : 10:06:48 (82 MB/s)

== Last Cycle's Zeroing time  : 10:43:47 (77 MB/s)

== Last Cycle's Post Read Time : 20:07:15 (41 MB/s)

== Last Cycle's Total Time    : 40:59:04

==

== Total Elapsed Time 40:59:04

==

== Disk Start Temperature: 38C

==

== Current Disk Temperature: 35C,

==

============================================================================

** Changed attributes in files: /tmp/smart_start_sdm  /tmp/smart_finish_sdm

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      Temperature_Celsius =  171    157            0        ok          35

No SMART attributes are FAILING_NOW

 

0 sectors were pending re-allocation before the start of the preclear.

0 sectors were pending re-allocation after pre-read in cycle 1 of 1.

0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

0 sectors are pending re-allocation at the end of the preclear,

    the number of sectors pending re-allocation did not change.

0 sectors had been re-allocated before the start of the preclear.

0 sectors are re-allocated at the end of the preclear,

    the number of sectors re-allocated did not change.

============================================================================

 

 

And here is the SMART report for the same disk:

 

 

===========================================================================

Disk: /dev/sdm

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    Hitachi HDS5C3030ALA630

Serial Number:    MJ1311YNG1AVEA

Firmware Version: MEAOA580

User Capacity:    3,000,592,982,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Wed Oct 12 15:22:07 2011 PDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity

was suspended by an interrupting command from host.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (36667) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

SCT capabilities:       (0x003d) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000b  100  100  016    Pre-fail  Always      -      0

  2 Throughput_Performance  0x0005  100  100  054    Pre-fail  Offline      -      0

  3 Spin_Up_Time            0x0007  100  100  024    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      4

  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0

  8 Seek_Time_Performance  0x0005  100  100  020    Pre-fail  Offline      -      0

  9 Power_On_Hours          0x0012  100  100  000    Old_age  Always      -      44

10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      4

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      5

193 Load_Cycle_Count        0x0012  100  100  000    Old_age  Always      -      5

194 Temperature_Celsius    0x0002  171  171  000    Old_age  Always      -      35 (Min/Max 25/41)

196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x000a  200  200  000    Old_age  Always      -      0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

============================================================================
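
The self-test log above also shows that no SMART self-tests have ever been run on this drive. A long (extended) self-test is another independent check of the surface that can be started from the console; a sketch using the same smartctl already on the system (the drive runs the test internally, and on a 3TB disk it can take considerably longer than the 255-minute polling figure reported above):

smartctl -t long /dev/sdm      # start an extended self-test; the drive runs it in the background
smartctl -l selftest /dev/sdm  # check the self-test log once it has had time to finish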

 

 

When I finally came to look at the log, after realising that two of the disks had prematurely stopped their pre-clear activity, it was filled with thousands of repetitions of the following pair of messages, recurring every thirty seconds or so. The log is still filling now.

 

Oct 12 17:59:02 Tower2 kernel: sas: command 0xef34c240, task 0xc0800280, not at initiator: BLK_EH_RESET_TIMER

Oct 12 17:59:02 Tower2 kernel: sas: command 0xefa2ab40, task 0xebbef000, not at initiator: BLK_EH_RESET_TIMER

Oct 12 17:59:33 Tower2 kernel: sas: command 0xefa2ab40, task 0xebbef000, not at initiator: BLK_EH_RESET_TIMER

Oct 12 17:59:33 Tower2 kernel: sas: command 0xef34c240, task 0xc0800280, not at initiator: BLK_EH_RESET_TIMER
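
To get a feel for whether these messages are still accumulating, and how fast, it is easy to count and watch them; a quick sketch against the live log:

grep -c 'BLK_EH_RESET_TIMER' /var/log/syslog          # how many so far
tail -f /var/log/syslog | grep 'BLK_EH_RESET_TIMER'   # watch new ones arrive (Ctrl-C to stop)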

 

It may or may not be relevant, but all five of these disks are connected to the same AOC-SASLP-MV8 controller.

 

Any suggestions and thoughts about where to go from here will be much appreciated.

 

 

 

Link to comment

Preclear won't succeed on that card.

Connect the 5 new 3TB HDs to the motherboard SATA ports and move the existing ones from there onto the controller. For the time being, remove the second, unnecessary card: you should be able to run 12 drives using only the MB ports and a single SASLP.
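
Before shuffling cables, it can help to confirm which controller each /dev/sdX actually hangs off; a quick sketch (the device name is just an example), since the by-path links and sysfs show the PCI device a disk is attached to:

ls -l /dev/disk/by-path/ | grep -v part   # whole-disk entries, grouped by controller path
readlink -f /sys/block/sdm                # full PCI path for one specific device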

 

That's the first time I've heard that; I wonder if Joe L might comment.

 

I've already started a new pre-clear with two different disks I was holding in reserve. So far it's been running for about 9 hours and is just about to finish the pre-read. So far, so good.

 

The difference this time is that the array has not been started.

 

It would be good if an experienced member could look at the log and see if they can determine what actually went wrong at around the 08:55 mark, which is when the first two original disks seem to have given up the ghost.

Link to comment

Preclear won't succeed on that card.

Connect the 5 new 3TB HDs to the motherboard SATA ports and move the existing ones from there onto the controller. For the time being, remove the second, unnecessary card: you should be able to run 12 drives using only the MB ports and a single SASLP.

 

I have concurrently pre-cleared four 3TB Hitachi drives on an AOC-SASLP-MV8 successfully. It was just for testing; the machine went into production running 4.7. This was on a Foxconn A74ML-K AM2+/AM3 AMD 740G Micro ATX motherboard. The card has the .21 firmware.

Link to comment

Update:

 

Well, that's curious. Disk 1 of 2 has successfully pre-cleared after 37 hours; Disk 2 is still going strong, about an hour behind the first, and is looking good too.

 

So, it is not true that 3TB disks can't be pre-cleared successfully when attached to an AOC-SASLP-MV8.

 

I'm still wondering, though, what caused the first batch to fail, and am coming round to the view that the system was simply overtaxed by five concurrent pre-clears and an active array. But then the other (4.7) server has 22 disks and will happily pre-clear a single 3TB disk with the array running.

 

Both servers are identical except for the chassis (Supermicro 846 running 5.0b12a and Norco 4224 running 4.7). Both have the X8SIL-F-O motherboard, AOC-SASLP-MV8, Intel i3, 4 GB RAM and an 850W Corsair PSU.

 

I'm now going to attempt to pre-clear a pair of the original failed disks again, with the array offline as before.

Link to comment

Preclear won't succeed on that card.

Connect the 5 new 3TB HDs to the motherboard SATA ports and move the existing ones from there onto the controller. For the time being, remove the second, unnecessary card: you should be able to run 12 drives using only the MB ports and a single SASLP.

 

Update:

 

Well, that's curious. Disk 1 of 2 has successfully pre-cleared after 37 hours; Disk 2 is still going strong, about an hour behind the first, and is looking good too.

 

So, it is not true that 3TB disks can't be pre-cleared successfully when attached to an AOC-SASLP-MV8.

 

I'm still wondering, though, what caused the first batch to fail, and am coming round to the view that the system was simply overtaxed by five concurrent pre-clears and an active array.

 

When I claimed that you won't succeed, I meant in the same configuration: running 5 preclear sessions sharing one controller on a live system with "cache directories" and the mover kicking off at midnight. There are a lot of documented cases on 5b12a where the SASLP will crash, and one of the theories is that it is due to intensive disk activity (some people claim it happens even with no activity, after about 2 days). As your system crashed, I suspect it will crash again, since a preclear on a 3TB disk takes over 40 hours and that is plenty of time for it to happen.

 

Had you moved the new disks to the motherboard ports, you could have repeated the procedure, and even if unRAID crashed at some point (which is not certain, since the heavy I/O activity would be on the motherboard ports) your preclear sessions would keep running, and after 40-45 hours you could have had 5 precleared disks and been done.

 

Now you have run only two sessions, on a stopped server (I did not know you had a second one), so in theory you could not enjoy your media files for over 40 hours.

And now you are going to run 2 sessions again for 40 hours on a stopped server, and perhaps the last one for another 40 hours on a stopped server.

 

At least, if you haven't started yet, run the three sessions at once.
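
If you do run them at once, one way to keep three (or five) concurrent sessions alive across a dropped telnet login is to start each preclear inside its own screen session; a sketch, assuming screen has been added to the system (it is not part of the stock install) and using example device names:

screen -S pre_sdk ./preclear_disk.sh /dev/sdk   # answer the confirmation, then detach with Ctrl-A d
screen -S pre_sdl ./preclear_disk.sh /dev/sdl
screen -S pre_sdm ./preclear_disk.sh /dev/sdm
screen -ls          # list the running sessions
screen -r pre_sdk   # re-attach to one to check its progress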

 

And BTW, the SASLP is an inferior card when fully loaded with 8 HDs (it has limited bandwidth per port).

Link to comment

The SASLP-MV8 can run and preclear 3TB Hitachi drives just fine.

 

It is true there is a bug in beta12a if you have two SASLP-MV8s in your server. It looks like a driver glitch.

 

Yes, you fully saturate the card after the 6th drive, but you will only feel this during parity checks/rebuilds. In my experience, the Hitachi drives slow down from about 112 MB/s to about 85 MB/s with 8 drives during a parity check. IMO that's not enough to panic over.
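
For a rough sense of why 8 drives is the pinch point: if the SASLP-MV8 is indeed a PCIe x4 Gen 1 card (an assumption worth checking against the spec sheet), the back-of-envelope numbers line up with the ~85 MB/s observation:

4 lanes x ~250 MB/s (PCIe Gen 1)   ~ 1000 MB/s raw
minus protocol overhead            ~  800 MB/s usable
shared across 8 drives             ~  100 MB/s per drive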

Link to comment

The SASLP-MV8 can run and preclear 3TB Hitachi drives just fine.

 

It is true there is a bug in beta12a if you have two SASLP-MV8s in your server. It looks like a driver glitch.

 

Yes, you fully saturate the card after the 6th drive, but you will only feel this during parity checks/rebuilds. In my experience, the Hitachi drives slow down from about 112 MB/s to about 85 MB/s with 8 drives during a parity check. IMO that's not enough to panic over.

 

Hmm. Now that's interesting. I did have a second (empty) SASLP in the chassis while the first (failed) batch of pre-clears was being processed, but not for the second (successful) batch. I thought I'd been paying fairly close attention to the dedicated beta thread, but I must have missed the part where the 'two SASLP' problem was discussed. Can you point me to it?

 

The second SASLP is on its way back to Supermicro to be re-flashed with the 'Non-RAID' firmware version.

 

When I do the third batch of three I'll try it with the array running.

 

BTW, it's not a major issue for me that the array is 'parked' while this is going on, since all the files on Server 2 are still available on Server 1, or somewhere else on my network.

Link to comment

I had interpreted it as needing two cards.

The last time I looked, the problem was in multiple-MV8 systems; reading the last batch of issues, it is happening with single controllers too, so that blew my theory out of the water. No one is sure of the cause yet.

I have SM mobos and 3 SASLP-MV8s spread across 2 unRAID beta12a boxes, and not a single error related to the MV8... till now.

 

Since I posted above, I blew up my server while doing a preclear and copying 3TB of data; it's the first hard failure I have had since day one on this box. I am pretty sure it is totally unrelated, but it's interesting that the only thing still working was the preclear. My error looks like some sort of memory error or kernel panic, though.

 

 

I see no reason for your array to be offline while doing a preclear, though. In the past I have precleared 4x 3TB drives at once while I had 4 more 3TB drives on the same MV8 in a running array.

That system was an i3-2100 with 8 GB of RAM, running probably beta9? I was intentionally stress testing the server and SASLP-MV8 with 3TB drives for this thread. I had zero issues.

 

It still sounds like you simply overtaxed your system.

 

 

EDIT:

I think I found my crash. I had removed my cache drive earlier in the day, and I had not yet assigned the new drives to my shares, so unRAID just went into some sort of error loop trying to figure out what to do with the data instead of reporting that I was out of drive space. A reboot looks to have fixed the issue.

 

 

 

Link to comment
