Re: preclear_disk.sh - a new utility to burn-in and pre-clear disks for quick add



Tom had to deal with this quite a while back; it's somewhere in the Release Notes. We used to get occasional posts about temperatures not showing for certain drives, and we would try to work out why SMART was not enabled on that drive. Tom decided unRAID might as well always enable SMART first, and I don't think we have seen problems like that since.
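A minimal sketch of the "always enable SMART first" idea, assuming smartmontools is installed (this is not unRAID's actual code): check `smartctl -i` for the "SMART support is: Enabled" line, the same wording that appears in the smartctl report later in this thread, and run `smartctl -s on` only when it is missing. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if `smartctl -i` output (read on stdin)
# says SMART is enabled on the drive.
smart_is_enabled() {
    grep -q 'SMART support is: *Enabled'
}

# Hypothetical helper: enable SMART on a drive only when it is currently off.
# Needs root and smartctl.
ensure_smart_on() {
    dev=$1
    if ! smartctl -i "$dev" | smart_is_enabled; then
        smartctl -s on "$dev"
    fi
}
```

For example, `ensure_smart_on /dev/sdg` (run as root) leaves a drive that already has SMART enabled alone.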

 

It baffles me why there is even an option to disable SMART on any drive. You don't have to use the SMART data. What advantage could there possibly be to having SMART disabled? (I'm not referring to unRAID, just drives in general.) And I find it incomprehensible that a product like Drobo (possibly the closest competitor to unRAID) would have SMART turned off on the drives that GoChris pulled! Does that make any sense at all? Can you imagine choosing to run something like unRAID without SMART data?


Hi,

 

I am running unRAID Pro 4.4 and ran into a problem tonight. I just purchased two 1TB hard disks: a Western Digital WD10EADS and a Samsung Spinpoint F1 HD103UJ. Yesterday I used the preclear_disk.sh script to clear the WD drive; it ran smoothly and finished two cycles today. Then, before I ran preclear_disk.sh on the Samsung, two disks started showing as unformatted. I searched the forum and installed the powerdown scripts. After rebooting the server, I started preclear_disk.sh on the Samsung and ran into the following problems:

 

1. The preclear_disk.sh script complained that some libraries needed by smartctl could not be found, so I re-installed the cxxlibs-6.0.8-i486-4.tgz package. It seems I need to re-install the package every time I reboot the server. Is that right?

 

2. There are tons of errors in the syslog, and they have made the unRAID system effectively non-functional: I can no longer copy or delete files. I think that is due to the continuous errors. Here are the messages that repeat like crazy in the syslog:

 

Mar 26 03:20:56 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Mar 26 03:20:56 Tower kernel: end_request: I/O error, dev sdg, sector 1953520064
Mar 26 03:20:56 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Mar 26 03:20:56 Tower kernel: end_request: I/O error, dev sdg, sector 1953520064

 

I put the new disk in an enclosure and connected it to the external SATA port on the machine, because I want to clear the disk before actually installing it in the system. Thinking it might be a problem with the physical drive, I connected it to my Windows XP laptop via USB, then partitioned and formatted it. That worked OK, and I also copied a few files to the disk.

 

I deleted many of the error messages from the syslog to reduce its size. The syslog is attached to this post. Please take a look at the log and let me know how I can fix the problem. Your help is very much appreciated.

 

--Tom


Correct, you need to re-install it each time you reboot. This is fixed in the 4.5-beta3 release (the library is no longer missing).
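Until then, one workaround is to reinstall the package from the go script at every boot. A minimal sketch, assuming the package is kept in /boot/packages (the directory in the prompt later in this thread) and that the go script is /boot/config/go; `append_once` is a hypothetical helper that adds the installpkg line only if it is not already there, so running it twice does not duplicate it.

```shell
#!/bin/sh
# Hypothetical helper: append a command line to a startup script only once.
append_once() {
    script=$1
    line=$2
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    grep -qxF -- "$line" "$script" 2>/dev/null || printf '%s\n' "$line" >> "$script"
}

# Example (assumed paths): reinstall the C++ runtime smartctl needs at each boot.
# append_once /boot/config/go "installpkg /boot/packages/cxxlibs-6.0.8-i486-4.tgz"
```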


It looks like communication with the drive stopped at some point; after that, the kernel complained about each sector in turn as it tried to access it.


Partitioning writes only to the first sector, so it tells you very little about the true health of the drive (other than that it can read and write the first 512 bytes).

Formatting a disk writes to only a small handful of its sectors. Formatting can succeed while the disk still has problems reading and writing the other sectors that formatting never touched.
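If you want to exercise every sector without any writes, a plain end-to-end read is one option. A minimal read-only sketch, assuming dd is available and the device name has been double-checked; it is not a substitute for preclear_disk.sh, which also writes the whole disk and compares SMART reports. `read_scan` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: read every block of a device (or file) and discard the data.
# Read-only: the target is only ever dd's input (if=), never its output (of=).
# Without conv=noerror, dd stops at the first read error and exits non-zero.
read_scan() {
    dd if="$1" of=/dev/null bs=1048576 2>/dev/null
}

# Example (assumed device name, run as root):
#   read_scan /dev/sdg && echo "full read OK" || echo "read error, check the syslog"
```

Any read errors it hits should also show up in the syslog.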

 

It sounds a lot like you had a bad connection to the drive when it was attached to the unRAID array: either a bad cable, a loose connection, or a bad drive-tray connection. Odds are the drive is OK. And yes, with a 1TB disk, logging a failure for every sector it tries to read or write will quickly fill the syslog and use up all memory.
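To put "every sector" in numbers: assuming 512-byte sectors, the 1,000,204,886,016-byte User Capacity that smartctl reports for this Samsung later in the thread works out to roughly 1.95 billion sectors, and the sector 1953520064 in the syslog excerpt falls inside that range. A small sketch of the arithmetic; `total_sectors` is a hypothetical name.

```shell
#!/bin/sh
# Assumption: 512-byte logical sectors.
SECTOR_SIZE=512

# Number of sectors on a disk, given its capacity in bytes.
total_sectors() {
    echo $(( $1 / $SECTOR_SIZE ))
}

# Example: the HD103UJ capacity reported by smartctl in this thread.
total_sectors 1000204886016
```

Running it prints 1953525168.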

 

You should run a smartctl report on the drive (or run it through another preclear_disk.sh cycle, since the script takes a smartctl report before and after).

 

Joe L.


Hi Joe,

 

Thanks for your reply. I re-installed the drive in the enclosure, but it behaves the same. Since I have only one SATA enclosure and cable, I might just install the drive in the system and run preclear_disk.sh again.

 

I ran smartctl -H against it and it failed:

--------------------

root@Tower:/boot/packages# smartctl -H /dev/sdg
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

------------------

 

Then I ran "smartctl --all", and here is the result:

----------------

root@Tower:/boot/packages# smartctl --all /dev/sdg
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ9DS302065
Firmware Version: 1AA01113
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Thu Mar 26 12:26:15 2009 Local time zone must be set--see zic m

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (11788) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 197) minutes.
Conveyance self-test routine
recommended polling time:        (  21) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   253   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   078   078   011    Pre-fail  Always       -       7590
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       0
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
 13 Read_Soft_Error_Rate    0x000e   253   253   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   078   077   000    Old_age   Always       -       22 (Lifetime Min/Max 22/22)
194 Temperature_Celsius     0x0022   078   077   000    Old_age   Always       -       22 (Lifetime Min/Max 22/22)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       405
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:/boot/packages#

-------------------------------

 

Thanks,

--Tom

It sounds a lot like you had a bad connection to the drive when it was attached to the unRAID array... either a bad cable, or a loose connection, or a bad drive tray connection.

 

Completely agree. Plus, the UDMA_CRC_Error_Count increased from 1 to 3, which also points to a cable or other interface issue. Most of those syslog errors occurred after the drive was disabled at 03:20:40, which is like 'pulling the plug'. That is generally fatal, and you can ignore all the errors that occur afterward.
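Tracking a single counter such as UDMA_CRC_Error_Count across reports is easy to script. A minimal sketch, assuming the `smartctl -A` attribute table layout shown above, where ATTRIBUTE_NAME is the second column and RAW_VALUE the tenth; `smart_raw` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the RAW_VALUE of one SMART attribute.
# Reads `smartctl -A` (or `smartctl -a`) output on stdin.
# Column 2 is ATTRIBUTE_NAME, column 10 is RAW_VALUE in smartctl's table.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Example (assumed device name, run as root):
#   smartctl -A /dev/sdg | smart_raw UDMA_CRC_Error_Count
```

Run it before and after reseating or replacing the cable; by the reasoning above, a value that keeps climbing points at the cable or interface.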

 

I would not bother with any further testing until you can replace that SATA cable, or discover something loose in the power cabling or connectors.  The drive itself looks fine.


Any ideas why the preclear script wouldn't run? I got a new WD 'Green' drive and tried to run preclear. After typing 'Yes', I got the preclear screen, but it was frozen and nothing happened. So I decided to go ahead and add the drive to the array to see what would happen. unRAID added the drive, but I had to wait about 4 hours while it cleared the disk. No problems there. After that, I removed the drive from the array and tried to run preclear again. It ran fine until the last step, the final read of the disk, where it froze 88% of the way through. So I stopped it and tried to run preclear again. This time it froze and wouldn't run. So I added the drive back to the array and waited another 4 hours while unRAID cleared it again, which surprised me, as I figured unRAID would see the disk as already cleared. After waiting the 4 hours, I tried to run preclear again. Again, no luck.

 

So, I decided to do a 'smartctl --test=long /dev/sdb' and so now I have to wait 255 minutes.  Any ideas what is going on?
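A long test runs inside the drive, so there is little to watch while you wait. A minimal polling sketch, assuming smartmontools: `smartctl -c` reports the "Self-test execution status" (as far as I know, with the text "in progress" while a test is running), and `smartctl -l selftest` prints the self-test log once it finishes. The function names and the 10-minute interval are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if `smartctl -c` output on stdin
# shows a self-test still running.
selftest_running() {
    grep 'Self-test execution status' | grep -q 'in progress'
}

# Hypothetical helper: wait for a running self-test to finish,
# then print the self-test log. Needs root and smartctl.
wait_for_selftest() {
    dev=$1
    while smartctl -c "$dev" | selftest_running; do
        sleep 600   # check every 10 minutes
    done
    smartctl -l selftest "$dev"
}

# Example (assumed device name): wait_for_selftest /dev/sdb
```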


If the preclear script is failing to complete it indicates some issue with reading or writing the drive.  I would first suspect the SATA cable.  I'd replace it.  You might see errors in the syslog corresponding to the times the freezes occur.

 

Another thing to check: make sure the memory voltage is set properly in your BIOS. Many motherboards do not set it correctly, and memory often needs very specific timings, so check your memory's specifications and the BIOS settings for those too. All kinds of strange errors occur when system memory is unable to store the correct values.

 

Joe L.


Ok. Thanks for the tip. It is a new drive, so I went into the guts of the thing, unplugged and re-plugged the SATA cable, and made sure I pushed it in extra hard at both the drive and the mobo. Then, for grins, I tried again, and preclear is now running. I'm 15GB into reading a 1TB drive, so I should know more by morning.

 

If I'm reading this right, either the cable is bad or I didn't have it plugged in 'well' the first time. For my own edification, I would appreciate it if anyone could answer a question or two. If either of these is correct (bad cable or bad connection), why did everything other than preclear seem to work? The drive got exported and I could add it to the array. unRAID would clear it (something that took several hours and involved many, many writes to the drive). I was also able to complete most of a preclear cycle on the disk the first time around. Shouldn't that preclear cycle have failed at the same place (at the beginning, instead of several hours in)?

 

Is this a function of quantity, not quality? Could I have read 1 byte from the drive every day forever, but crashed and burned as soon as I tried to read a 3 GB video file?

 

Thanks again for your help and the cool utility.

 

Chris


So, I bought a new SATA cable, but just for grins decided to give the old one a last try by plugging it into a different port on the mobo. I set preclear up to run for 4 cycles and it worked. Go figure. However, I got a message when the last cycle finished and I don't know how to interpret it. Does this make sense to anyone?

 

===========================================================================
=                unRAID server Pre-Clear disk /dev/sdc
=                       cycle 4 of 4
= Disk Pre-Clear-Read completed                                 DONE
= Step 1 of 10 - Copying zeros to first 2048k bytes             DONE
= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE
= Step 3 of 10 - Disk is now cleared from MBR onward.           DONE
= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4       DONE
= Step 5 of 10 - Clearing MBR code area                         DONE
= Step 6 of 10 - Setting MBR signature bytes                    DONE
= Step 7 of 10 - Setting partition 1 to precleared state        DONE
= Step 8 of 10 - Notifying kernel we changed the partitioning   DONE
= Step 9 of 10 - Creating the /dev/disk/by* entries             DONE
= Step 10 of 10 - Testing if the clear has been successful.     DONE
= Disk Post-Clear-Read completed                                DONE
Elapsed Time:  44:59:41
============================================================================
==
== Disk /dev/sdc has been successfully precleared
==
============================================================================
S.M.A.R.T. error count differences detected after pre-clear
note, some 'raw' values may change, but not be an indication of a problem
19,20c19,20
< Offline data collection status:  (0x82)       Offline data collection activity
<                                       was completed without error.
---
> Offline data collection status:  (0x84)       Offline data collection activity
>                                       was suspended by an interrupting command from host.
============================================================================

 

Thanks,

Chris


Basically...

The process took 45 hours for 4 pre-read/clear/post-read cycles, during which you kept the disk *very* busy...

The process did not increase any error count between the SMART report done prior to the first cycle and the one done after the last.

 

If you look in your syslog, you will find both SMART reports in their entirety. The power-on time and temperature changed between the two reports, but otherwise they are nearly identical.
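To make that comparison yourself, save a report before and after (for example `smartctl -a /dev/sdc > before.txt`, and later the same into after.txt) and list only the attributes whose raw value changed. A minimal sketch, assuming both files contain the attribute table in the layout shown earlier in this thread; rows are matched by attribute ID because several attributes share the name Unknown_Attribute. The function and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "NAME old -> new" for every SMART attribute whose
# RAW_VALUE (column 10) differs between two saved smartctl reports.
# Usage: smart_changes before.txt after.txt
smart_changes() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 fields.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            if (FNR == NR) { before[$1] = $10; next }   # rows from the first file
            if (($1 in before) && before[$1] != $10)    # rows from the second file
                print $2, before[$1], "->", $10
        }
    ' "$1" "$2"
}
```

Expected drifts such as Power_On_Hours show up here too, so read the output with the note from the preclear report in mind.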

 

Did you by chance have a "Long" or "Short" self-test queued up when you started the preclear process? (As far as I know, the "Offline data collection activity" refers to those two activities.)
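The two status codes in that diff can also be decoded by hand. In the offline data collection status byte, bit 7 (0x80) only says whether automatic offline collection is enabled, and the low seven bits give the state. So 0x82 and 0x84 differ only in the state: 0x02 is "completed without error" and 0x04 is "suspended by an interrupting command from host", matching the text smartctl printed. A minimal decoder sketch that names only the states seen in this thread (0x00, 0x02, 0x04); `decode_offline_status` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical decoder for the SMART "Offline data collection status" byte.
# Bit 7 (0x80): automatic offline data collection enabled.
# Low 7 bits: collection state (only the states seen in this thread are named).
decode_offline_status() {
    code=$(( $1 ))            # accepts hex such as 0x84
    state=$(( code & 0x7f ))
    auto="disabled"
    [ $(( code & 0x80 )) -ne 0 ] && auto="enabled"
    case $state in
        0) text="was never started" ;;
        2) text="was completed without error" ;;
        4) text="was suspended by an interrupting command from host" ;;
        *) text="state $state (not decoded here)" ;;
    esac
    echo "auto offline $auto; collection $text"
}
```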

 

In any case, it looks like a nicely working disk.  Your new SATA cable is working well.

 

Joe L.


That is nice to hear. I did set up at least one SMART status report. I don't remember exactly when, but I'm guessing that could be what you are referring to. Also, just to be clear (I'm anal and paranoid, a combination that bothers my wife to no end), I'm actually still using the old cable. I just moved it from one port on the mobo to a different one. I'm wondering if the port on the mobo could be bad?

 

Thanks again for your help.  How is it that free support on a board like this is better than paid support for so many products?

 

Chris


Maybe it's just because I'm not a Linux expert, but is there an easy way to set up a batch of preclear_disk.sh runs on multiple drives?

I know this is not normally needed, but since the script is useful for pre-screening drives for failure, it would be nice to run it on all the new drives I just received, to ensure a smooth server setup when I get back to work after the weekend.

 

Just a thought.

 

 


There are several ways to do this:

 

1. Use multiple "telnet" sessions to log onto unRAID. Run one preclear_disk.sh script in each session. (This is what I usually do.)

 

2. Log into the system console using multiple "consoles" (Control-Alt-F1 through Control-Alt-F6 switch between the six available system consoles). Run one preclear_disk.sh per console, and switch between them as needed to review their progress.

 

3. Install and run "screen", a program that lets you have as many virtual "screens" as you want and switch between them with a hot-key sequence. It is described in this post: http://lime-technology.com/forum/index.php?topic=2817.msg24825#msg24825

 

Once you have started "screen", you can start a preclear_disk.sh, then type "Control-A c" to get a new virtual screen, start another preclear_disk.sh, type "Control-A c" again to get another virtual screen, start a third preclear_disk.sh, and so on.

 

You can at any time type "Control-A n" or "Control-A p" to switch to the next or previous virtual screen to track their progress.  You can type "Control-A ?" to get a list of possible commands to manage the screen consoles.

 

A brief tutorial on how to use screen is here: http://www.rackaid.com/resources/linux-tutorials/general-tutorials/using-screen/

 

You can even detach from screen, which lets you close the telnet session and re-attach later. To detach, type "Control-A d". Then, at a later time, type "screen -r" to re-attach.

 

Another good article on "screen" can be found here:

http://www.linuxjournal.com/article/6340

 

It can do a lot more. You can "name" the screen sessions, and list them with Control-A " (Control-A followed by a double quote).
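Option 3 can also be scripted so that all drives start at once. A minimal sketch, assuming screen is installed and preclear_disk.sh lives at /boot/preclear_disk.sh (an assumed location; adjust it): it opens one detached, named screen session per device. preclear_disk.sh still asks you to confirm by typing Yes, so attach to each session (for example `screen -r preclear_sdc`) to answer it. Double-check every device name first, since clearing the wrong disk destroys its contents. The script structure and session names are hypothetical.

```shell
#!/bin/sh
PRECLEAR=/boot/preclear_disk.sh   # assumption: where the script is stored

# Session name for a device, e.g. /dev/sdc -> preclear_sdc
session_name() {
    echo "preclear_${1##*/}"
}

# Start one detached screen session (-d -m) named with -S for each device.
launch_all() {
    for dev in "$@"; do
        if [ ! -b "$dev" ]; then
            echo "skipping $dev: not a block device" >&2
            continue
        fi
        name=$(session_name "$dev")
        screen -dmS "$name" "$PRECLEAR" "$dev"
        echo "started $name; attach with: screen -r $name"
    done
}

# Only act when devices are passed on the command line.
if [ "$#" -gt 0 ]; then
    launch_all "$@"
fi
```

For example, `sh batch_preclear.sh /dev/sdc /dev/sdd`, where batch_preclear.sh is whatever name you saved it under.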

 

Joe L.


Ahh, why didn't I think of that?  Still a little overwhelmed, I guess  ;)

 

Thanks for the tips.

 

Byron


This script is great. I just received two 1.5TB Seagate drives. I ran 2 cycles on one drive in about 24 hours and am running one more for peace of mind. I didn't know about multiple consoles until reading this, lol, so now I have the second drive running 3 cycles. You guys have been great. This thread really cleared up some things I didn't understand about the SMART report.


How do you stop the pre-clear? When I rebooted my system I forgot to re-install the SMART tools, so it won't have the beginning and end comparisons.

Type Control-C (hold the Control key down and press the letter "C").

 

Thanks. One day I think I'll mess with my go script... :)


Well, I ran the preclear_disk.sh script on my new 1.5TB Seagate. It took 12:26.47 to complete successfully. Awesome program. None of the SMART changes were of the important variety, so I feel good about that. Thanks to Joe L. (and all the other people who do so much here); this forum is second to none.


I am running this on my new 1.5TB Maxtor Green.  It did one full pass that seemed to work.  On the second pass, it didn't finish.

I looked in /tmp for the SMART logs, but they appear to have been deleted. Where should I look to see what happened? I remember seeing something about not being able to do something with the MBR. My PuTTY session got killed when I rebooted. I'm going to try again and see if it was just some sort of fluke. My syslog file is 500MB, with a whole ton of these:

 

May 14 04:48:23 Tower kernel: end_request: I/O error, dev sdd, sector 2930137216
May 14 04:48:23 Tower kernel: sd 6:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00

 

And I see this

May 14 04:48:22 Tower kernel: sd 6:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
May 14 04:48:22 Tower kernel: end_request: I/O error, dev sdd, sector 2930043648
May 14 04:48:22 Tower kernel: __ratelimit: 78016 callbacks suppressed
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255456
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255457
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255458
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255459
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255460
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255461
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255462
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255463
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255464
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: Buffer I/O error on device sdd, logical block 366255465
May 14 04:48:22 Tower kernel: lost page write due to I/O error on sdd
May 14 04:48:22 Tower kernel: sd 6:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00
May 14 04:48:22 Tower kernel: end_request: I/O error, dev sdd, sector 2930044672

as well.
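The two kinds of messages in this log describe the same spot on the disk. Assuming the buffer layer counts 4096-byte blocks while end_request counts 512-byte sectors, the first logical block listed, 366255456, starts at sector 366255456 × 8 = 2930043648. That is exactly the sector in the end_request line just before it. Compared with the 2930277168 LBA48 sectors that hdparm -I reports for this drive further down the thread, it is about 233520 sectors (roughly 114 MiB) from the very end of the disk. A small sketch of that arithmetic (the function names are made up for illustration):

```shell
# Assumption: "Buffer I/O error ... logical block N" counts 4096-byte blocks,
# while "end_request ... sector S" counts 512-byte sectors.
block_to_sector() {
  echo $(( $1 * 4096 / 512 ))
}

# Integer percentage of the way through the disk for a given sector,
# given the drive's total sector count.
percent_through() {
  echo $(( $1 * 100 / $2 ))
}

block_to_sector 366255456               # first "logical block" in the log
# 2930277168 = LBA48 sector count from the hdparm -I output later in the thread
percent_through 2930043648 2930277168
```

The integer percentage prints 99; the exact figure is just over 99.99%, i.e. the failures are at the tail end of the drive.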

 

Is there anything I should look for?

 

Jim

Link to comment

These are just follow-on errors from the original, real error. Locate the first error sequences involving sdd or sd 6:0:0:0. Also determine which drive sdd is: whether it is your new Maxtor Green, or a different drive that has decided to fail now.
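One way to find those first sequences in a very large syslog is to search for both names and stop after the first few hits. This is a sketch that assumes the syslog is at /var/log/syslog; first_disk_errors is an illustrative name, not part of preclear_disk.sh:

```shell
# Print the earliest syslog lines (with line numbers) that mention a disk,
# either by device name (e.g. sdd) or by SCSI address (e.g. 6:0:0:0).
# Usage: first_disk_errors DEV ADDR [LOGFILE] [COUNT]
# Assumption: the log is /var/log/syslog unless another path is given.
first_disk_errors() {
  local dev="$1" addr="$2" log="${3:-/var/log/syslog}" count="${4:-20}"
  # -w keeps "sdd" from matching "sdd1"; -m stops reading the (possibly
  # huge) log as soon as COUNT matching lines have been found.
  grep -n -w -m "$count" -E "${dev}|sd ${addr}" "$log"
}

# Example: first_disk_errors sdd 6:0:0:0
```

The very first hits may be harmless boot-time messages about the disk; the interesting part is where the errors begin.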

Link to comment


Assuming that /dev/sdd is your new disk... it looks like it stopped responding. It might be a loose cable, either power or data... It is very easy for some SATA cables to come loose.

If not loose, then odds are the disk died an early death.
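The hostbyte=0x04 in the "Result:" lines of that log fits the "stopped responding" reading. The Linux SCSI layer names its host-byte codes DID_*, and 0x04 is DID_BAD_TARGET, which libata generally reports for commands sent to a drive it has already given up on. A sketch of a decoder for the first few codes (hostbyte_name is a hypothetical helper, and the table is deliberately incomplete):

```shell
# Decode the "hostbyte=0x.." field of a kernel "Result:" line into the
# symbolic DID_* name used by the Linux SCSI layer. Only codes 0-8 are
# listed; this is a sketch, not a complete decoder.
hostbyte_name() {
  case "$(( $1 ))" in
    0) echo DID_OK ;;
    1) echo DID_NO_CONNECT ;;
    2) echo DID_BUS_BUSY ;;
    3) echo DID_TIME_OUT ;;
    4) echo DID_BAD_TARGET ;;
    5) echo DID_ABORT ;;
    6) echo DID_PARITY ;;
    7) echo DID_ERROR ;;
    8) echo DID_RESET ;;
    *) echo "other ($1)" ;;
  esac
}

# Pull the hostbyte out of one of the log lines above and decode it.
line='May 14 04:48:22 Tower kernel: sd 6:0:0:0: [sdd] Result: hostbyte=0x04 driverbyte=0x00'
hb=$(printf '%s\n' "$line" | sed -n 's/.*hostbyte=\(0x[0-9a-fA-F]*\).*/\1/p')
echo "$hb -> $(hostbyte_name "$hb")"
```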

 

Can you do a

hdparm -I /dev/sdd

or

smartctl -a -d ata /dev/sdd

and get anything back at all?
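If hdparm does answer, its model and serial number also settle which physical drive sdd is. A small helper can pull those two fields out of the hdparm -I output; drive_identity is an illustrative name, and the sketch assumes the "Model Number:" / "Serial Number:" layout hdparm prints:

```shell
# Read "hdparm -I" output on stdin and print "MODEL / SERIAL", so that
# /dev/sdX can be matched to the label on the physical drive. Exits
# non-zero if no model line was seen (e.g. the drive did not answer).
drive_identity() {
  awk -F': *' '
    /Model Number:/  { model = $2 }
    /Serial Number:/ { serial = $2 }
    END {
      if (model == "") exit 1
      sub(/ +$/, "", model); sub(/ +$/, "", serial)
      print model " / " serial
    }
  '
}

# On the server (as root):
#   hdparm -I /dev/sdd | drive_identity || echo "no answer from /dev/sdd"
```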

 

If the disk did die an early death... sorry, but the script did exactly what it was designed to do... it helped identify an early failure.

Be happy it failed before you added it to your array... It takes a lot more time to replace it once it has data on it.

 

Joe L.

Link to comment

After the reboot it seemed to be running happily; I'm at 98% of the pre-read... OK, change that... I guess it isn't happy. I'm getting more of those errors during the zeroing.

 

Here is a snippet of the log. It starts close to the end of the pre-read and captures the start of the zeroing. I'm in the middle of a power cycle (remotely, so I may not get it back). I'll have to check whether the very first pass of this test behaved well. I'll post the SMART results when the computer reboots.

Link to comment

Here is my hdparm info and SMART test info. Interestingly, after the power cycle my disk 2 was "missing". I power cycled again and it came back?

Now I just have to see if disk 2 is on the same controller as my new disk.
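To check that, each sdX device can be traced through sysfs to the PCI controller it hangs off. A minimal sketch, assuming sysfs is mounted at /sys as on typical Linux systems (pci_of_path is an illustrative name); disks that print the same PCI address share a controller:

```shell
# Given the resolved sysfs path of a disk's device, print the PCI address
# of the controller it is attached to. Behind a PCI bridge the path holds
# several addresses; the last one is the controller itself.
pci_of_path() {
  printf '%s\n' "$1" | grep -o -E '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' | tail -n 1
}

# List every sd disk with its controller's PCI address.
for dev in /sys/block/sd*; do
  [ -e "$dev/device" ] || continue
  echo "$(basename "$dev") $(pci_of_path "$(readlink -f "$dev/device")")"
done
```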

 

/dev/sdd:

ATA device, with non-removable media
        Model Number:       WDC WD15EADS-00H7B0
        Serial Number:      WD-WCAUP0018631
        Firmware Revision:  05.00K05
        Transport:          Serial, SATA 1.0a, SATA II Extensions, SATA Rev 2.5
Standards:
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors: 2930277168
        device size with M = 1024*1024:     1430799 MBytes
        device size with M = 1000*1000:     1500301 MBytes (1500 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
                Power-Up In Standby feature set
           *    SET_FEATURES required to spinup after power up
                SET_MAX security extension
                Automatic Acoustic Management feature set
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    64-bit World wide name
           *    {READ,WRITE}_DMA_EXT_GPL commands
           *    Segmented DOWNLOAD_MICROCODE
           *    SATA-I signaling speed (1.5Gb/s)
           *    SATA-II signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Host-initiated interface power management
           *    Phy event counters
                DMA Setup Auto-Activate optimization
           *    Software settings preservation
           *    SMART Command Transport (SCT) feature set
           *    SCT Long Sector Access (AC1)
           *    SCT LBA Segment Access (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
                unknown 206[12] (vendor specific)
                unknown 206[13] (vendor specific)
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
                supported: enhanced erase
        412min for SECURITY ERASE UNIT. 412min for ENHANCED SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee2ad0035dd
        NAA             : 5
        IEEE OUI        : 14ee
        Unique ID       : 2ad0035dd
Checksum: correct
root@Tower:~#
root@Tower:~#
root@Tower:~# smartctl -a -d ata /dev/sdd
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD15EADS-00H7B0
Serial Number:    WD-WCAUP0018631
Firmware Version: 05.00K05
User Capacity:    1,500,301,910,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu May 14 13:33:22 2009 GMT+5
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (40500) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   139   139   051    Pre-fail  Always       -       14844
  3 Spin_Up_Time            0x0027   100   253   021    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       0
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   127   121   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   195   195   000    Old_age   Always       -       1311
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Now, do I have to be concerned about the Current_Pending_Sector number? It seems like that should be 0 for a good new drive.

Could a bad controller have any effect on that number?
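Current_Pending_Sector (197) is kept by the drive itself: it counts sectors the drive could not read and is waiting to reallocate or rewrite. Problems on the link between drive and controller more typically show up in UDMA_CRC_Error_Count (199), which reads 0 above. To follow these numbers from pass to pass, a helper like the one below could be used; smart_raw is a hypothetical name, and it assumes the attribute table layout shown above, with RAW_VALUE as the tenth column:

```shell
# Print the RAW_VALUE for one SMART attribute ID from "smartctl -a"
# output on stdin. Only attribute rows (10+ fields) are considered, so
# the selective self-test table is ignored. Exits non-zero if not found.
smart_raw() {
  awk -v id="$1" '$1 == id && NF >= 10 { print $10; found = 1 } END { exit !found }'
}

# On the server (as root), re-run after each pass and compare:
#   out=$(smartctl -a -d ata /dev/sdd)
#   for id in 5 196 197 198 199; do
#     printf '%s: ' "$id"; printf '%s\n' "$out" | smart_raw "$id"
#   done
```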

Link to comment
