Multiple 3TB pre-clear failures



This is a new server with a fresh 5.0b12a install. It has a seven-disk array (including parity and cache), all 3TB Hitachi 5K3000s. The existing disks were all successfully pre-cleared in an older chassis running 4.7, using the 1.12 beta pre-clear script.

 

The present failures involve a further five disks of the same type and size that I attempted to pre-clear using the 1.13 script in the new (5.0b12a) chassis. All five failed to pre-clear successfully. Two of the five gave up relatively quickly, at the 10h36 mark: one was 97% of the way through the pre-read, and the other, at around the same time, was already into the zeroing phase. Unfortunately this happened late in the evening, so I was not around to see it happen, and the log subsequently filled with thousands of lines of 'garbage' (more on that below).
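
One way to make sure an overnight event like this isn't lost is to snapshot the syslog to the flash drive as soon as the problem is noticed; a minimal sketch, assuming the stock unRAID layout where the live log is /var/log/syslog and /boot is the flash drive:

mkdir -p /boot/logs
# copy the current (RAM-based) syslog to flash with a timestamp in the name
cp /var/log/syslog /boot/logs/syslog-$(date +%Y%m%d-%H%M).txt
# and eyeball the most recent entries
tail -n 100 /var/log/syslog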

 

The other three went on to complete the cycle in the expected 42 hours or so, but all reported an unsuccessful pre-clear due to "Post-read detected un-expected non-zero bytes on disk". The SMART reports do not (as far as I can see) indicate anything untoward: no reallocated sectors or the like. This is a screen capture of the final status for one of the disks; the other two produced identical reports:

 

================================================================== 1.13

=                unRAID server Pre-Clear disk /dev/sdm

=              cycle 1 of 1, partition start on sector 1

= Disk Pre-Clear-Read completed                                DONE

= Step 1 of 10 - Copying zeros to first 2048k bytes            DONE

= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE

= Step 3 of 10 - Disk is now cleared from MBR onward.          DONE

= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4      DONE

= Step 5 of 10 - Clearing MBR code area                        DONE

= Step 6 of 10 - Setting MBR signature bytes                    DONE

= Step 7 of 10 - Setting partition 1 to precleared state        DONE

= Step 8 of 10 - Notifying kernel we changed the partitioning  DONE

= Step 9 of 10 - Creating the /dev/disk/by* entries            DONE

= Step 10 of 10 - Verifying if the MBR is cleared.              DONE

= Disk Post-Clear-Read completed                                DONE

Disk Temperature: 35C, Elapsed Time:  40:59:04

========================================================================1.13

==  Hitachi HDS5C3030ALA630    MJ1311YNG1AVEA

== Disk /dev/sdm has NOT been precleared successfully

== skip=332600 count=200 bs=8225280 returned instead of 00000

============================================================================

** Changed attributes in files: /tmp/smart_start_sdm  /tmp/smart_finish_sdm

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      Temperature_Celsius =  171    157            0        ok          35

No SMART attributes are FAILING_NOW

 

0 sectors were pending re-allocation before the start of the preclear.

0 sectors were pending re-allocation after pre-read in cycle 1 of 1.

0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

0 sectors are pending re-allocation at the end of the preclear,

    the number of sectors pending re-allocation did not change.

0 sectors had been re-allocated before the start of the preclear.

0 sectors are re-allocated at the end of the preclear,

    the number of sectors re-allocated did not change.

root@Tower2:/boot#

 

=============================================================================
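
The failing line in the capture above names an exact region of the disk (skip=332600, count=200, bs=8225280). If the disk has not been touched since, one way to double-check whether those bytes really read back as non-zero is to re-read just that span with dd and scan it for a non-zero byte; a rough sketch, assuming the drive is still /dev/sdm and re-using the skip/count/bs values from the report (this reads roughly 1.5 GB, so it takes a minute or two):

# re-read the span the post-read complained about and report the first non-zero byte, if any
dd if=/dev/sdm bs=8225280 skip=332600 count=200 2>/dev/null \
  | od -An -tx1 -v | grep -n -m1 '[1-9a-f]'
# no output means the whole span now reads back as zeros;
# each od line is 16 bytes, so the line number hints at the offset within the span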

 

Here is the preclear report for the same disk, as posted in /boot/preclear_reports:

 

 

========================================================================1.13

== invoked as: ./preclear_disk.sh /dev/sdm

==

== Disk /dev/sdm has NOT been successfully precleared

== Postread detected un-expected non-zero bytes on disk==

== Ran 1 cycle

==

== Using :Read block size = 8225280 Bytes

== Last Cycle's Pre Read Time  : 10:06:48 (82 MB/s)

== Last Cycle's Zeroing time  : 10:43:47 (77 MB/s)

== Last Cycle's Post Read Time : 20:07:15 (41 MB/s)

== Last Cycle's Total Time    : 40:59:04

==

== Total Elapsed Time 40:59:04

==

== Disk Start Temperature: 38C

==

== Current Disk Temperature: 35C,

==

============================================================================

** Changed attributes in files: /tmp/smart_start_sdm  /tmp/smart_finish_sdm

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      Temperature_Celsius =  171    157            0        ok          35

No SMART attributes are FAILING_NOW

 

0 sectors were pending re-allocation before the start of the preclear.

0 sectors were pending re-allocation after pre-read in cycle 1 of 1.

0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.

0 sectors are pending re-allocation at the end of the preclear,

    the number of sectors pending re-allocation did not change.

0 sectors had been re-allocated before the start of the preclear.

0 sectors are re-allocated at the end of the preclear,

    the number of sectors re-allocated did not change.

============================================================================

 

 

And here is the SMART report for the same disk:

 

 

===========================================================================

Disk: /dev/sdm

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    Hitachi HDS5C3030ALA630

Serial Number:    MJ1311YNG1AVEA

Firmware Version: MEAOA580

User Capacity:    3,000,592,982,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Wed Oct 12 15:22:07 2011 PDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x84) Offline data collection activity

was suspended by an interrupting command from host.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (36667) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

SCT capabilities:       (0x003d) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000b  100  100  016    Pre-fail  Always      -      0

  2 Throughput_Performance  0x0005  100  100  054    Pre-fail  Offline      -      0

  3 Spin_Up_Time            0x0007  100  100  024    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      4

  5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0

  8 Seek_Time_Performance  0x0005  100  100  020    Pre-fail  Offline      -      0

  9 Power_On_Hours          0x0012  100  100  000    Old_age  Always      -      44

10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      4

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      5

193 Load_Cycle_Count        0x0012  100  100  000    Old_age  Always      -      5

194 Temperature_Celsius    0x0002  171  171  000    Old_age  Always      -      35 (Min/Max 25/41)

196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      0

197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x000a  200  200  000    Old_age  Always      -      0

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

============================================================================
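
The self-test log above also shows that no SMART self-tests have ever been run on this drive. A long (extended) self-test is another independent check of the surface that can be started from the console; a sketch using the same smartctl already on the system (the drive runs the test internally, and on a 3TB disk it can take considerably longer than the 255-minute polling figure reported above):

smartctl -t long /dev/sdm      # start an extended self-test; the drive runs it in the background
smartctl -l selftest /dev/sdm  # check the self-test log once it has had time to finish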

 

 

When I finally came to look at the log, after realising that two of the disks had prematurely stopped their pre-clear activity, it was filled with thousands of repetitions of the following pair of messages, recurring every thirty seconds or so. The log is still filling now.

 

Oct 12 17:59:02 Tower2 kernel: sas: command 0xef34c240, task 0xc0800280, not at initiator: BLK_EH_RESET_TIMER

Oct 12 17:59:02 Tower2 kernel: sas: command 0xefa2ab40, task 0xebbef000, not at initiator: BLK_EH_RESET_TIMER

Oct 12 17:59:33 Tower2 kernel: sas: command 0xefa2ab40, task 0xebbef000, not at initiator: BLK_EH_RESET_TIMER

Oct 12 17:59:33 Tower2 kernel: sas: command 0xef34c240, task 0xc0800280, not at initiator: BLK_EH_RESET_TIMER
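
To get a feel for whether these messages are still accumulating, and how fast, it is easy to count and watch them; a quick sketch against the live log:

grep -c 'BLK_EH_RESET_TIMER' /var/log/syslog          # how many so far
tail -f /var/log/syslog | grep 'BLK_EH_RESET_TIMER'   # watch new ones arrive (Ctrl-C to stop)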

 

It may or may not be relevant, but all five of these disks are connected to the same AOC-SASLP-MV8 controller.

 

Any suggestions and thoughts about where to go from here will be much appreciated.

 

 

 

Link to comment

Preclear won't succeed on that card.

Connect the 5 new 3TB HDs to the motherboard SATA ports and move the existing ones from there onto the controller. For the time being, remove the second, unnecessary card: you should be able to run 12 drives using only the MB ports and a single SASLP.
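
Before shuffling cables, it can help to confirm which controller each /dev/sdX actually hangs off; a quick sketch (the device name is just an example), since the by-path links and sysfs show the PCI device a disk is attached to:

ls -l /dev/disk/by-path/ | grep -v part   # whole-disk entries, grouped by controller path
readlink -f /sys/block/sdm                # full PCI path for one specific device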

 

That's the first time I've heard that; I wonder if Joe L might comment.

 

I've already started a new pre-clear with two different disks I was holding in reserve. So far it's been running for about 9 hours and is just about to finish the pre-read. So far, so good.

 

The difference this time is that the array has not been started.

 

It would be good if an experienced member could look at the log and see if they can determine what actually went wrong at around the 08:55 mark, which is when the first two original disks seem to have given up the ghost.

Link to comment

Preclear won't succeed on that card.

Connect the 5 new 3TB HDs to the motherboard SATA ports and move the existing ones from there onto the controller. For the time being, remove the second, unnecessary card: you should be able to run 12 drives using only the MB ports and a single SASLP.

 

I have concurrently pre-cleared four 3TB Hitachi drives on an AOC-SASLP-MV8 successfully. It was just for testing; the machine went into production running 4.7. This was on a Foxconn A74ML-K AM2+/AM3 AMD 740G Micro ATX motherboard. The card has the .21 firmware.

Link to comment

Update:

 

Well, that's curious. Disk 1 of 2 has successfully pre-cleared after 37 hours; Disk 2 is still going strong, about an hour behind the first, and is looking good too.

 

So, it is not true that 3TB disks can't be pre-cleared successfully when attached to an AOC-SASLP-MV8.

 

I'm still wondering, though, what caused the first batch to fail, and am coming round to the view that the system was simply overtaxed by five concurrent pre-clears and an active array. But then the other (4.7) server has 22 disks and will happily pre-clear a single 3TB disk with the array running.

 

Both servers are identical except for the chassis (Supermicro 846 running 5.0b12a and Norco 4224 running 4.7). Both have the X8SIL-F-O motherboard, AOC-SASLP-MV8, Intel i3, 4 GB RAM and an 850W Corsair PSU.

 

I'm now going to attempt to pre-clear a pair of the original failed disks again, with the array offline as before.

Link to comment

Preclear won't succeed on that card.

Connect the 5 new 3TB HDs to the motherboard SATA ports and move the existing ones from there onto the controller. For the time being, remove the second, unnecessary card: you should be able to run 12 drives using only the MB ports and a single SASLP.

 

Update:

 

Well, that's curious. Disk 1 of 2 has successfully pre-cleared after 37 hours; Disk 2 is still going strong, about an hour behind the first, and is looking good too.

 

So, it is not true that 3TB disks can't be pre-cleared successfully when attached to an AOC-SASLP-MV8.

 

I'm still wondering, though, what caused the first batch to fail, and am coming round to the view that the system was simply overtaxed by five concurrent pre-clears and an active array.

 

When I claimed that you won't succeed, I meant in the same configuration: running 5 preclear sessions sharing one controller on a live system with "cache directories" and the mover kicking off at midnight. There are a lot of documented cases on 5b12a where the SASLP will crash, and one of the theories is that it is due to intensive disk activity (some people claim it happens even with no activity, after about 2 days). As your system crashed, I suspect it will crash again, since a preclear on a 3TB disk takes over 40 hours and that is plenty of time for it to happen.

 

Had you moved the new disks to the motherboard ports, you could have repeated the procedure, and even if unRAID crashed at some point (which is not certain, since the heavy I/O activity would be on the motherboard ports) your preclear sessions would keep running, and after 40-45 hours you could have had 5 precleared disks and been done.

 

Now you have run only two sessions, on a stopped server (I did not know you had a second one), so in theory you could not enjoy your media files for over 40 hours.

And now you are going to run 2 sessions again for 40 hours on a stopped server, and perhaps the last one for another 40 hours on a stopped server.

 

At least, if you haven't started yet, run the three sessions at once.
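
If you do run them at once, one way to keep three (or five) concurrent sessions alive across a dropped telnet login is to start each preclear inside its own screen session; a sketch, assuming screen has been added to the system (it is not part of the stock install) and using example device names:

screen -S pre_sdk ./preclear_disk.sh /dev/sdk   # answer the confirmation, then detach with Ctrl-A d
screen -S pre_sdl ./preclear_disk.sh /dev/sdl
screen -S pre_sdm ./preclear_disk.sh /dev/sdm
screen -ls          # list the running sessions
screen -r pre_sdk   # re-attach to one to check its progress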

 

And BTW, the SASLP is an inferior card when fully loaded with 8 HDs (it has limited bandwidth per port).

Link to comment

The SASLP-MV8 can run and preclear 3TB Hitachi drives just fine.

 

It is true there is a bug in beta12a if you have two SASLP-MV8s in your server. It looks like a driver glitch.

 

Yes, you fully saturate the card after the 6th drive, but you will only feel this during parity checks/rebuilds. In my experience, the Hitachi drives slow down from about 112 MB/s to about 85 MB/s with 8 drives during a parity check. IMO that's not enough to panic over.
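
For a rough sense of why 8 drives is the pinch point: if the SASLP-MV8 is indeed a PCIe x4 Gen 1 card (an assumption worth checking against the spec sheet), the back-of-envelope numbers line up with the ~85 MB/s observation:

4 lanes x ~250 MB/s (PCIe Gen 1)   ~ 1000 MB/s raw
minus protocol overhead            ~  800 MB/s usable
shared across 8 drives             ~  100 MB/s per drive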

Link to comment

The SASLP-MV8 can run and preclear 3TB Hitachi drives just fine.

 

It is true there is a bug in beta12a if you have two SASLP-MV8s in your server. It looks like a driver glitch.

 

Yes, you fully saturate the card after the 6th drive, but you will only feel this during parity checks/rebuilds. In my experience, the Hitachi drives slow down from about 112 MB/s to about 85 MB/s with 8 drives during a parity check. IMO that's not enough to panic over.

 

Hmm. Now that's interesting. I did have a second (empty) SASLP in the chassis while the first (failed) batch of pre-clears was being processed, but not for the second (successful) batch. I thought I'd been paying fairly close attention to the dedicated beta thread, but I must have missed the part where the 'two SASLP' problem was discussed. Can you point me to it?

 

The second SASLP is on its way back to Supermicro to be re-flashed with the 'Non-RAID' firmware version.

 

When I do the third batch of three I'll try it with the array running.

 

BTW, it's not a major issue for me that the array is 'parked' while this is going on, since all the files on Server 2 are still available on Server 1, or somewhere else on my network.

Link to comment

I had interpreted it as needing two cards.

The last time I looked, the problem was in multiple-MV8 systems; reading the last batch of issues, it is happening with single controllers too, so that blew my theory out of the water. No one is sure of the cause yet.

I have SM mobos and 3 SASLP-MV8s spread across 2 unRAID beta12a boxes, and not a single error related to the MV8... till now.

 

Since I posted above, I blew up my server while doing a preclear and copying 3TB of data; it's the first hard failure I have had since day one on this box. I am pretty sure it is totally unrelated, but it's interesting that the only thing still working was the preclear. My error looks like some sort of memory error or kernel panic, though.

 

 

I see no reason for your array to be offline while doing a preclear, though. In the past I have precleared 4x 3TB drives at once while I had 4 more 3TB drives on the same MV8 in a running array.

That system was an i3-2100 with 8 GB of RAM, running probably beta9? I was intentionally stress testing the server and SASLP-MV8 with 3TB drives for this thread. I had zero issues.

 

It still sounds like you simply overtaxed your system.

 

 

EDIT:

I think I found my crash. I had removed my cache drive earlier in the day, and I had not yet assigned the new drives to my shares, so unRAID just went into some sort of error loop trying to figure out what to do with the data instead of reporting that I was out of drive space. A reboot looks to have fixed the issue.

 

 

 

Link to comment
