Re: preclear_disk.sh - a new utility to burn-in and pre-clear disks for quick add



I started another preclear on my old parity drive (a 1TB Seagate) and it seems (or at least the screen is) frozen. The server is still up and running just fine, nothing looks to be overheated or anything like that, but the updating seems to have stopped. I set it on a cycle of 3 and it looks to have stopped at cycle one, on step 10, at 88%. I can telnet into the tower, type ps -ef, and see this:

root     28309  2657  0 07:42 tty1     00:00:57 /bin/bash ./preclear_disk.sh -c

and this

root      5740 28309 99 16:15 tty1     07:34:56 /bin/bash ./preclear_disk.sh -c

 

I am not sure if I have two preclears going or not. I don't think I started two, but I guess I could have accidentally.

 

The Seagate drive appears to still be spinning, so I don't think it has crashed and burned. But it would appear that preclear has stopped on the disk. Using unMenu and myMain, I can see that there are no longer any reads or writes going to the disk. This is kind of disturbing.

 

Any input is appreciated.

Link to comment


Since the one process is a child of the other, I don't think you started two... it is normal to see two processes while it is clearing the drive, as the clear is done in a background process while a foreground process updates the display.
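If you ever want to double-check that yourself, the PPID (third) column of ps ties the two together; a rough sketch (the grep pattern is just an example):

ps -ef | head -1                     # print the column headers: UID PID PPID ...
ps -ef | grep "[p]reclear_disk.sh"   # the third field of the child (5740 above) is the PID of its parent (28309)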

 

On the other hand, it sure looks as if the process has stopped (assuming no read or write activity is actually occurring).

 

You might just abort it by typing "Control-C" in the window where it was started and try once more after running a smartctl test on the drive. 
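For reference, the self-test can be started and checked along these lines (substitute your own device for /dev/sdc; add -d ata if smartctl complains it cannot identify the device):

smartctl -t short /dev/sdc      # quick test, takes a couple of minutes
smartctl -t long /dev/sdc       # full surface read test, several hours on a 1TB drive
smartctl -l selftest /dev/sdc   # view the self-test log once a test has finished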

 

I have seen drives stop and look like they locked up like this when other activity occurred concurrently. It is a matter of a "deadlock" where two processes both wait for the same resource to be freed, but each is really waiting for the other. If you have time, and if anything else is going on on the server, I'd just try waiting (overnight).

 

Then, I'd let it know who's boss.  ;)

 

Joe L.

Link to comment


Thanks for the input. I figured I would have to quit it and start it over.

 

I ended up killing the preclear and am now running a smartctl -t long test on the drive. When that finishes I will see what the output is and start a one-cycle preclear on the drive.

 

Thanks JoeL.

Link to comment

============================================================================
==
== Disk /dev/sdc has been successfully precleared
==
============================================================================
S.M.A.R.T. error count differences detected after pre-clear
note, some 'raw' values may change, but not be an indication of a problem
20,21c20,21
< Offline data collection status:  (0x82)       Offline data collection activity
<                                       was completed without error.
---
> Offline data collection status:  (0x84)       Offline data collection activity
>                                       was suspended by an interrupting command from host.
============================================================================

 

I've used this script a few times now with no problems... the last time, I got the SMART output copied above. Is this an issue/error? Should I re-run a SMART test?

 

Cheers,

Matt

Link to comment

OK, the smartctl -t long test finished on my 1TB Seagate.

 

this is the output of smartctl -l selftest /dev/sdc

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       993         -

 

this is the output of smartctl -A /dev/sdc

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       38731720
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       9300322
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1001
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       381
190 Airflow_Temperature_Cel 0x0022   066   053   045    Old_age   Always       -       34 (Lifetime Min/Max 26/36)
194 Temperature_Celsius     0x0022   034   047   000    Old_age   Always       -       34 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   036   025   000    Old_age   Always       -       38731720
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

 

Some of the above worries me, but I would like some input from the experts in here who are better at reading this than I am.

 

Thanks

Link to comment

 

I personally don't think there are any experts yet when it comes to interpreting SMART reports. It is too new a data analysis tool, plus they keep changing and adding attributes, *and* they are different for each drive vendor. All I can give is my impressions from what I have seen so far, and I learn from each new one I see.

 

The Raw_Read_Error_Rate and the corresponding Hardware_ECC_Recovered are quite normal for a Seagate, but would be very high for any other brand. The one attribute I would keep an eye on is the Seek_Error_Rate, which seems higher to me than it should be, and has taken a hit in its VALUE. The High_Fly_Writes attribute is rather new, and I don't think anyone really understands it yet. What troubles me about it is that the VALUE and WORST have already bottomed out, but perhaps Seagate themselves did not know how to appropriately scale it. Overall, it looks fine, but I would monitor it once a month for a while. I think you will learn from watching it what is OK and what bears continued monitoring.
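One low-effort way to do that monthly check is to append a dated snapshot of the attributes you care about to a file on the flash drive; a rough sketch (the attribute list and file name are just examples):

( date
  smartctl -A /dev/sdc | egrep "Seek_Error_Rate|High_Fly_Writes|Reallocated_Sector_Ct|Current_Pending_Sector"
) >> /boot/smart_history_sdc.txt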

Link to comment
> Offline data collection status:  (0x84)      Offline data collection activity

>                                      was suspended by an interrupting command from host.

 

I have seen that in 1 or 2 of the other pre_clear reports earlier.  It is completely harmless, just means a test was aborted, so no result from it.  I suspect that something in the pre_clear script is initiating an offline test that is aborted by a later command to the drive.  Possibly just a timing issue.
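If the leftover status line bothers you, you can re-run the offline data collection yourself and let it finish undisturbed; roughly (device name is an example):

smartctl -t offline /dev/sdc   # start the offline data collection again
smartctl -c /dev/sdc           # shows the offline collection status and how long it is expected to take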

Link to comment

Thanks for the input RobJ.

 

The stats I was concerned about were the ones that you basically outlined there. The High_Fly_Writes one is one that I am not sure about. I have been searching around trying to understand what exactly it means and I am not quite sure. All I know is that the 1.5TB drive I just ran a 3-cycle preclear on has about 15 High_Fly_Write "errors" already.

 

I will monitor the drive(s) like you suggested and see if anything changes.

Link to comment

> Offline data collection status:  (0x84)       Offline data collection activity

>                                       was suspended by an interrupting command from host.

 

I have seen that in 1 or 2 of the other pre_clear reports earlier.  It is completely harmless, just means a test was aborted, so no result from it.  I suspect that something in the pre_clear script is initiating an offline test that is aborted by a later command to the drive.  Possibly just a timing issue.

 

Thanks, I figured it was probably a timing issue but wanted to make 100% sure.  I had not seen this behaviour before.

 

Matt

Link to comment

Thanks for the input RobJ.

 

The stats I was concerned about were the ones that you basically outlined there. The High_Fly_Writes one is one that I am not sure about. I have been searching around trying to understand what exactly it means and I am not quite sure. All I know is that the 1.5TB drive I just ran a 3-cycle preclear on has about 15 High_Fly_Write "errors" already.

 

I will monitor the drive(s) like you suggested and see if anything changes.

 

According to the smart report, you have 381 high fly writes. From reading about it, high_fly_writes has to do with a condition the drive detects where the heads get too far above the disk surface during a write. When this is detected, the drive cancels and retries the write. If you had 381 of these out of the millions or billions of write operations, it is likely not too serious. That being said, it seems a pretty high value for this attribute. Nevertheless, I wouldn't be too concerned about it unless you start to see other attribute problems or errors indicating that the drive is malfunctioning. I have never seen a drive go bad because of high_fly_writes.
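If you want to watch just that raw counter over time, something like this pulls it out of the attribute table (device name is an example):

smartctl -A /dev/sdc | awk '$2 == "High_Fly_Writes" { print $10 }'   # prints the RAW_VALUE column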

Link to comment


Thanks for explaining that a little more. Now I know what to pay attention to when looking at some of this SMART information.

 

I just finished running preclear on another 1.5TB Seagate and all went well. I ran 2 cycles on it and it took about 25 hours all told.

Link to comment

After running the script, it reported error count differences detected after pre-clear. Is this anything to be worried about?

Results look similar to a previous person's post but I want to double check.

 

 

The drive is a new Seagate 1.5TB ST315005N1A1AS.

Took 12:12:24 to complete the preclear.

 

Edit: Seems like I can't get the image to show here. Here is a link

 

http://img10.imageshack.us/my.php?image=results.jpg

Link to comment

Looks good. Actually, it's claiming to be 'better than good'. If you check the Raw_Read_Error_Rate, the VALUE starts at an initialized value of 100 and rises to 118! It's like asking for maximum effort from someone and having them respond that they will give 110%, which is not possible but you know what they mean. In this case, their scientists have measured and calculated appropriate scales for these error rates, and your drive must have such a low error rate, relative to their statistical norms, that their algorithm determined a higher-than-100 value. The Seek_Error_Rate stayed at 100. The idea with these seems to be that as the rate of errors increases with wear and tear, the *_Error_Rate will drop from 100 down to the threshold value, at which point they have decided that the error rate is too high to trust the drive, and it will return a failing SMART grade.

 

With the Error_Rate attributes, you should probably ignore the RAW values. They may or may not be actual error counts, but what is being monitored with these attributes is not how many errors there are, but at what rate the errors are occurring, and how those numbers fit within the expected norms for that drive model.
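If you want to eyeball the remaining headroom yourself, comparing each attribute's normalized VALUE against its THRESH is the part that matters; a quick sketch (device name is an example):

smartctl -A /dev/sdc | awk '$1 ~ /^[0-9]+$/ { printf "%-25s VALUE=%s THRESH=%s\n", $2, $4, $6 }'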

Link to comment


Awesome! The parity rebuild has started. Thanks RobJ.

Link to comment

Hi, I've run preclear_disk.sh on a drive and I'm most of the way through. All steps say done, but I'm on the "Post-Read in progress: 88% complete" step. It appears as if the console has locked up. I cannot see any updates and it's been a little while (over 2 hours). I can see in top that the "./preclear_disk.sh /dev/sdi" process is spiking one of my processors with 99% CPU utilization.

 

Should I wait it out and hope for the best, or try to kill the process and start over? Attached is my syslog for your review.

Thank you.

Link to comment


One of your disks (/dev/sdi) is having lots of errors as seen in this excerpt below from your syslog. 

 

Either the drive died, or a cable came loose.  In either case, I doubt the pre-clear will finish on its own.

 

Joe L.

Feb 26 16:51:40 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:40 Tower kernel: ata8.00: irq_stat 0x00020002, device error via SDB FIS
Feb 26 16:51:40 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:40 Tower kernel:          res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:40 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:40 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:40 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:40 Tower kernel: ata8: EH complete
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:40 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:44 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:44 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:44 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:44 Tower kernel:          res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:44 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:44 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:44 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:44 Tower kernel: ata8: EH complete
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:44 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:48 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:48 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:48 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:48 Tower kernel:          res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:48 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:48 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:48 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:48 Tower kernel: ata8: EH complete
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:48 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:53 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:53 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:53 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:53 Tower kernel:          res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:53 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:53 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:53 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:53 Tower kernel: ata8: EH complete
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:53 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:51:57 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:51:57 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:51:57 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:51:57 Tower kernel:          res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:51:57 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:51:57 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:51:57 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:51:57 Tower kernel: ata8: EH complete
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:51:57 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:52:01 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 26 16:52:01 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:52:01 Tower kernel: ata8.00: cmd 60/00:00:e0:7e:37/02:00:05:00:00/40 tag 0 ncq 262144 in
Feb 26 16:52:01 Tower kernel:          res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:52:01 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:52:01 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:52:01 Tower kernel: ata8.00: configured for UDMA/100
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Result: hostbyte=0x00 driverbyte=0x08
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Sense Key : 0x3 [current] [descriptor]
Feb 26 16:52:01 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Feb 26 16:52:01 Tower kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb 26 16:52:01 Tower kernel:         05 37 80 05 
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] ASC=0x11 ASCQ=0x4
Feb 26 16:52:01 Tower kernel: end_request: I/O error, dev sdi, sector 87523333
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940416
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940417
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940418
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940419
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940420
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940421
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940422
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940423
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940424
Feb 26 16:52:01 Tower kernel: Buffer I/O error on device sdi, logical block 10940425
Feb 26 16:52:01 Tower kernel: ata8: EH complete
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Write Protect is off
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Mode Sense: 00 3a 00 00
Feb 26 16:52:01 Tower kernel: sd 8:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 26 16:52:06 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6
Feb 26 16:52:06 Tower kernel: ata8.00: irq_stat 0x00060002, device error via SDB FIS
Feb 26 16:52:06 Tower kernel: ata8.00: cmd 60/00:00:e0:80:37/01:00:05:00:00/40 tag 0 ncq 131072 in
Feb 26 16:52:06 Tower kernel:          res 68/02:00:00:00:00/00:00:00:00:68/00 Emask 0x2 (HSM violation)
Feb 26 16:52:06 Tower kernel: ata8.00: status: { DRDY DF DRQ }
Feb 26 16:52:06 Tower kernel: ata8.00: cmd 60/08:08:00:80:37/00:00:05:00:00/40 tag 1 ncq 4096 in
Feb 26 16:52:06 Tower kernel:          res 41/40:00:05:80:37/00:00:05:00:00/40 Emask 0x409 (media error) <F>
Feb 26 16:52:06 Tower kernel: ata8.00: status: { DRDY ERR }
Feb 26 16:52:06 Tower kernel: ata8.00: error: { UNC }
Feb 26 16:52:06 Tower kernel: ata8: hard resetting link
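A quick way to pull just those drive-error lines back out of the syslog (the device name and ata port here are taken from this particular log; substitute your own):

egrep "sdi|ata8" /var/log/syslog | egrep "media error|UNC|I/O error" | tail -n 40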

Link to comment
  • 2 weeks later...

I've used the preclear script on several of my drives with great results. I just used it on two new drives, a Seagate 1.5TB and a WD 1.5TB (wanted to see if the WD works well). Anyway, both ran through the single cycle, including pre- and post-reads. Everything finished fine on both as usual (though the WD took over 18 hours while the Seagate took the usual 12:12). When I shut down, moved them to their new drive locations, and started up, both showed as unformatted and I had to format them for them to be usable. In the past I seem to recall they simply became active... Did I do anything wrong, or is it that I brought both drives up at the same time? I think in the past I did one at a time...

 

G

Link to comment


They will always need to be formatted... They are partitioned in a special way, but not formatted. The pre-clearing erases almost everything on the entire drive... there is no file system, just an empty partition, defined in a special way that can be recognized.
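If you're curious, you can see this for yourself on a freshly precleared drive before adding it to the array; a sketch, assuming the drive is /dev/sdc:

fdisk -l /dev/sdc                                                  # lists the single empty partition spanning the drive
dd if=/dev/sdc bs=512 count=1 2>/dev/null | od -A d -t x1 | head   # dumps the MBR that carries the special signature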

 

The lengthy clearing step is skipped when adding them to the array. That step takes your array off-line for 4 to 5 hours for a 1.5TB drive. You did nothing wrong. Instead of the array being unavailable for many hours, it came on-line right away, and the formatting step takes a minute or two (for that size drive). Instead of your family being unhappy because they wanted to view a movie, they can still get access to the server... even while the new drive is being formatted.

 

Glad the script worked for you...  If nothing else, you know that the drive is a bit less likely to suffer a mechanical problem in its first few hours of life in the server.

 

The only time a drive does not need to be formatted explicitly is when it is being used to replace an existing drive.  In that situation the bytes from the original drive are written to the replacement and those bytes represent a formatted drive, so no additional formatting is necessary.  In effect you copied the formatting from the old drive to the replacement.

 

It is possible you just did not notice the formatting before, or forgot you had to press the "Format" button... but you probably did.

 

Joe L.

Link to comment

Joe,

 

Thanks for the quick reply. I'm sure I did format them before; I haven't added drives in a while, so it seemed new to me. But you point out the very reasons this script is so awesome: exercising the drives (especially to see if there are any oddities, which is especially important with these 1.5TB drives) and letting me keep the array live while it does its thing. I didn't realize you could run this on multiple drives at once until this time, when I did 2 simultaneously in different terminal sessions; way cool. Thanks for taking the time to make such a great tool for all of us!
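For anyone else doing two at once: separate telnet windows work fine, as above. If you happen to have the screen utility installed (it is not part of the stock distribution), you can also detach and re-attach rather than keeping both windows open; a sketch with example device names:

screen -dmS preclear_sdb ./preclear_disk.sh /dev/sdb
screen -dmS preclear_sdc ./preclear_disk.sh /dev/sdc
screen -r preclear_sdb   # attach to answer any confirmation prompt and watch progress; Ctrl-A then d detaches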

 

G

Link to comment


Make sure you check the firmware version on the Seagate 1.5TB drive. There is a whole series of Seagate drives with firmware versions that have a bug that will kill the drive, to the point where the BIOS does not even see it, and it takes a special jig to get it back alive once more... You can prevent the failure by upgrading the firmware before the drive dies.
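Checking the firmware revision is quick; either of these should show it (device name is an example):

smartctl -i /dev/sdc | egrep -i "device model|firmware"
hdparm -I /dev/sdc | grep -i firmware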

 

Joe L.

Link to comment

Yeah, all of my drives are either CC1H or CC1J. I bought the server direct from Lime-Tech and it came with 4 of the CC1H and they work fine (except one developed an unrecoverable sector or three and I had to RMA it with Seagate). I also always seem to get CC1Hs from NewEgg. I did get two of the CC1J from Amazon, and apparently there is no firmware to upgrade them to CC1H; you just have to buy 'em that way for now. I did run the latest firmware upgrade utility on all the drives and it always said the latest was installed... They've been running for quite a while now with no problems, which is why I appreciate your script.

 

Thanks again!

Link to comment

I've noticed that drives I pull from my Drobo have SMART turned off. Silly me, I keep forgetting to turn it on before running this, so usually I turn it on and then run it again.

 

Could you update your script to make sure SMART is turned on before it starts? Otherwise, it's working great, and I love that it saves so much downtime when adding drives to the array.

Link to comment


It is apparently enabled on all my drives... or at least on anything remotely recent. I have a few 8-gig drives where it is not available at all... they pre-date SMART.

 

If you want to add the enable command to your version of preclear_disk.sh, add the first smartctl line shown below (the "smartctl -s on" one) in the get_start_smart function:

 

get_start_smart() {
  smartctl -s on $1 >/dev/null 2>&1
  smartctl -d ata -a $1 2>&1 | egrep -v "Power_On_Minutes|Temperature_Celsius" >/tmp/smart_start$$
  cat /tmp/smart_start$$ | logger -tpreclear_disk-start -plocal7.info -is
}
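If you'd rather not edit the script, the same thing can be done by hand before each run (device name is an example):

smartctl -s on /dev/sdc      # enable SMART on the drive
./preclear_disk.sh /dev/sdc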

 

Link to comment
