Using SMART to predict hard drive failure



Below is a link to an article on using SMART data to predict hard drive failure:

 

    http://www.extremetech.com/computing/194059-using-smart-to-accurately-predict-when-a-hard-drive-is-about-to-die

 

I think, as many of us have already figured out, what SMART reports back is mostly useless... However, the author does point out the five most useful parameters for deciding whether a drive should be replaced.

Link to comment

I think, as many of us have already figured out, what SMART reports back is mostly useless... 

 

I've never thought this. In fact, over the course of the hundreds of hard drives I've owned, it's been useful for predicting a problem, or a potential problem in recovering. While I'm no Backblaze, I've had hard drives with SMART since its inception, and it has predicted failures and saved my butt a few times.

 

The 5 parameters mentioned in the article are what I look for, plus any FAILING NOW attributes.

Mostly that means an increasing or high count of reallocated sectors, pending sectors > 0, and uncorrectable sectors.

 

I also do SMART long tests periodically.

If I get an LBA error, I know there's a problem reading the hard drive.

Any logged problem reading any sector could equate to the inability to rebuild a failed drive on the array.

 

Each of these attributes should be reviewed periodically, together with a periodic SMART long test, to protect your array.
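
For anyone who wants to automate that review, here is a minimal sketch in Python. It assumes the usual attribute table layout printed by `smartctl -A` (ten columns, raw value last). The sample text is made up for illustration, and the function names and the choice of watched IDs (5, 197, 198) are my own assumptions, not anything from the article.

```python
# Attribute IDs singled out in this thread; the selection is an assumption.
WATCHED_IDS = {5, 197, 198}


def parse_attributes(report):
    """Parse `smartctl -A` table text into {id: (name, when_failed, raw_value)}."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        attrs[int(fields[0])] = (fields[1], fields[8], int(raw) if raw.isdigit() else None)
    return attrs


def warnings(attrs):
    """Return warnings for FAILING_NOW attributes and non-zero watched attributes."""
    found = []
    for attr_id, (name, when_failed, raw) in sorted(attrs.items()):
        if when_failed == "FAILING_NOW":
            found.append(f"{attr_id} {name}: FAILING_NOW")
        elif attr_id in WATCHED_IDS and raw:
            found.append(f"{attr_id} {name}: raw value {raw}")
    return found


# Made-up sample in the usual table layout, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   043   043   000    Old_age   Always       -       50123
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

for message in warnings(parse_attributes(SAMPLE)):
    print(message)
```

This only looks at a single report. Spotting an increasing reallocated count needs two reports taken at different times, so it pays to keep the old ones around.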

Link to comment

How often do you do the long tests?

 

I use Joe's script to save SMART reports regularly (I like how it sorts them into their own folders by drive and includes the date in the title), and I try to run short tests before the monthly parity check, but I tend to slack on long tests. I just ran them on all of my drives because some are getting on 5+ years old with 50k+ hours, and I'm adding new drives, so I've been watching them a bit more closely. I do wish SMART monitoring, logging and notifications were better integrated into Unraid, though, similar to MyMain.
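
To illustrate the per-drive, dated layout described above, here is a rough Python sketch. It is not Joe's script. The folder layout, file name pattern, and function names are hypothetical, and the report text is a placeholder for whatever your SMART tool prints.

```python
import datetime
import pathlib
import tempfile


def report_path(base_dir, drive_serial, when):
    """Build <base_dir>/<drive_serial>/smart_<YYYY-MM-DD>.txt (hypothetical layout)."""
    return pathlib.Path(base_dir) / drive_serial / f"smart_{when:%Y-%m-%d}.txt"


def save_report(base_dir, drive_serial, report_text, when=None):
    """Write one report into the drive's own folder and return the file path."""
    when = when or datetime.date.today()
    path = report_path(base_dir, drive_serial, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(report_text)
    return path


# Demo in a temporary directory; the serial and report text are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_report(tmp, "EXAMPLE_SERIAL", "placeholder report\n",
                        datetime.date(2014, 12, 1))
    print(saved.relative_to(tmp))
```

Keying the folder on the drive serial rather than the device name is a deliberate choice in this sketch, since device names can change between boots.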

Link to comment

After reading your post, I decided to see exactly what is reported back in the SMART test. I have three different brands of drives in my two servers. The number of parameters returned is 17, 17 and 24, and only 14 of these parameters are common across all three drives. Even more interesting, two of these drives don't even return two of the parameters (#187 and #188) that the author says are important to monitor!

 

Now, I have deduced over the past couple of years of following this forum that the most important ones from an unRAID standpoint are #5 (Reallocated Sectors Count), #197 (Current Pending Sector Count) and #198 (Uncorrectable Sector Count). Some of the others, like #9 (Power On Hours), provide a sense of how much 'life' might be expected from a drive, but that is hardly conclusive. As far as I know, how to apply the information from these three can only be learned by following the posts on this forum. (In general, numbers 197 and 198 should be zero and number 5 should remain constant.)
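
The two observations above (not every drive reports the same attributes, and the rule of thumb for #5, #197 and #198) can be sketched in a few lines of Python. The ID lists and raw values below are made up for illustration, and the function names are my own.

```python
def common_attributes(*drive_attr_ids):
    """Return the attribute IDs reported by every drive."""
    sets = [set(ids) for ids in drive_attr_ids]
    return set.intersection(*sets) if sets else set()


def check_snapshots(previous, current):
    """Compare two {attribute_id: raw_value} snapshots of the same drive.

    Rule of thumb from this thread: #5 should stay constant,
    #197 and #198 should be zero.
    """
    problems = []
    if current.get(5, 0) > previous.get(5, 0):
        problems.append("5: reallocated sector count increased")
    for attr_id in (197, 198):
        if current.get(attr_id, 0) != 0:
            problems.append(f"{attr_id}: non-zero ({current[attr_id]})")
    return problems


# Made-up ID lists and raw values, for illustration only.
drive_a = [1, 5, 9, 187, 188, 197, 198]
drive_b = [1, 5, 9, 197, 198]
print(sorted(common_attributes(drive_a, drive_b)))
print(check_snapshots({5: 0, 197: 0, 198: 0}, {5: 8, 197: 2, 198: 0}))
```

A missing attribute is treated as zero here, which keeps the check usable on drives that do not report everything. Whether that is the right default is a judgment call.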

 

How one addresses the issue when one of these parameters is out of bounds apparently depends on the temperament of the individual.  A cautious person would quickly replace the drive and the bold one would wait to see if the drive is actually going to continue on its path to failure...

Link to comment

I had spoken to Tom once about saving the smart logs he reads into /var/log so it could be archived.

He did not want to put much more code in the mainline.

I wrote a wrapper to do it, but with all the hoopla and unRAID6 I put it aside until I see what follows.

 

From the hints I've seen, there will be some visible smart data in unRAID 6.

 

Depending on that I'll do something with monit, cron or something else to do some kind of monthly testing on my drives.

 

I had a shell script a while back, but it was lost in the storm.

For each day of the month that matched a connected disk, I saved the SMART logs, ran a monthly badblocks in read mode, then saved a final log.

 

I'll resurrect that once I see what the new features are.
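
One way to read "each day of the month that matched a connected disk" is that disk 1 gets tested on the 1st, disk 2 on the 2nd, and so on. The Python sketch below assumes that reading; it is not the lost script. The device names are placeholders, and the actual SMART log and badblocks steps are left out.

```python
import datetime


def disk_for_day(disks, day_of_month):
    """Return the disk scheduled for this day, or None if no disk matches."""
    index = day_of_month - 1  # day 1 -> first disk, day 2 -> second, ...
    return disks[index] if 0 <= index < len(disks) else None


# Placeholder device names; a real list would come from the system.
DISKS = ["/dev/sdb", "/dev/sdc", "/dev/sdd"]

today = datetime.date.today()
target = disk_for_day(DISKS, today.day)
print(target or "no disk scheduled today")
```

Run once a day from cron, this spreads the tests across the month so only one disk is busy at a time.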

 

You can do a monthly read via badblocks, which is similar to the smart long test surface scan.

I prefer the smart long test since it writes a log entry in the drive.

 

smartd can also be programmed via the configuration to issue these tests periodically.

It also can alert based on specific attributes via email.

 

The smart tests and emails used to save us grief as we would have early warning of impending doom.
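
As a rough illustration of the smartd setup, the Python helper below renders one smartd.conf line that uses the `-a`, `-s` and `-m` directives. The device, address and default times are placeholders I picked, and the helper itself is hypothetical. Check the smartd.conf man page for your version before relying on it.

```python
def smartd_line(device, email, short_hour=2, long_day=1, long_hour=3):
    """Render one smartd.conf directive line for a device.

    -a  monitor all attributes with the default checks
    -s  self-test schedule regex in T/MM/DD/d/HH form
    -m  address that receives warning mail
    """
    short = f"S/../.././{short_hour:02d}"             # short test every day
    long_ = f"L/../{long_day:02d}/./{long_hour:02d}"  # long test once a month
    return f"{device} -a -s ({short}|{long_}) -m {email}"


# Placeholder device and address.
print(smartd_line("/dev/sdb", "admin@example.com"))
```

With the defaults, this schedules a short test daily at 02:00 and a long test on the 1st of each month at 03:00.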

Link to comment

How one addresses the issue when one of these parameters is out of bounds apparently depends on the temperament of the individual.  A cautious person would quickly replace the drive and the bold one would wait to see if the drive is actually going to continue on its path to failure...

 

Pending sectors can cause a rebuild to fail. The potential for a timeout, and for the drive to be kicked out of the array, is higher with pending sectors. I know this from experience, having had a double drive failure.

 

What was worse for me is I had just completed a parity check. The check did not reveal anything.

Link to comment

I use SMART extensively to determine drive replacement; however, its predictive ability is limited. For a successful prediction, I would expect a drive without errors to be accepted for replacement (RMA). I could then move the drive contents without error (or special effort).

 

Mostly, it says you're starting to have errors (not a prediction), think about replacing.

Link to comment

Sounds good. I checked out some of v6 and it does seem to solve most of my gripes with Unraid as far as disk monitoring and health go, but we'll see. If it isn't acceptable, I'd be up for starting a bounty for implementing something, because I don't think it's good enough as it is, especially for NAS software. Waiting for a drive to fail and using rebuilds to "fix" it doesn't seem right; that should be a last resort, not the standard.

 

I'm hoping it will at least track the common errors and display them in the gui similar to Unmenu, have a way to initiate (and schedule) smart tests/parity checks, and notifications by email at least, if not by growl or something similar.

 

Link to comment
