Seagate Archive 8TB ST8000AS0002 on Unraid 5 stable: kernel panic



Hello all,

 

When I use a newly bought Seagate Archive 8TB drive (ST8000AS0002) as the parity drive in an Unraid 5 stable array (I have not tried it as a data drive), I get a kernel panic after some time. Interestingly, the panic even blocks traffic going over the attached switch; I have no idea how that happens.

(When I replace the 8TB drive with a 3TB drive for parity, everything works smoothly again for days. The 8TB drive remains connected to the machine but is not assigned in Unraid.)

 

My questions:

  Is this a known problem?

  How might this be fixed?

      Move to Unraid 6?

      Drive firmware update?

 

Any ideas?

 

Thanks for caring,

JC

 

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    ST8000AS0002-1NA17Z

Serial Number:    Z8403NMN

Firmware Version: AR13

User Capacity:    8,001,563,222,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  9

ATA Standard is:  Not recognized. Minor revision code: 0x001f

Local Time is:    Wed May  6 00:25:19 2015 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (  0) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x30a5) SCT Status supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  118  099  006    Pre-fail  Always      -      194380720

  3 Spin_Up_Time            0x0003  090  090  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      13

  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  071  060  030    Pre-fail  Always      -      13681010

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      251

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      3

183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0

184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0

190 Airflow_Temperature_Cel 0x0022  065  057  045    Old_age  Always      -      35 (Min/Max 26/41)

191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      10

193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      37

194 Temperature_Celsius    0x0022  035  043  000    Old_age  Always      -      35 (0 25 0 0)

195 Hardware_ECC_Recovered  0x001a  118  099  000    Old_age  Always      -      194380720

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      182325656682555

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      28300614680

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      16244567367

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      251        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 


The S.M.A.R.T. data looks okay except perhaps for the seek error rates [but these can sometimes be high, so it's hard to say for sure whether that's an issue].

 

Did you do a thorough test on this drive before putting it in service?  [i.e. either a couple cycles of the pre-clear script or at least run the manufacturer's diagnostics]
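If it helps, here is a rough sketch of how I'd do that from the unRAID console (sdX is just a placeholder for however the 8TB drive shows up on your system, and the path assumes you keep Joe L.'s preclear script on the flash drive):

smartctl -t long /dev/sdX        # start a SMART extended self-test (runs in the background)
smartctl -l selftest /dev/sdX    # check the self-test log once it has finished
/boot/preclear_disk.sh /dev/sdX  # or run a full preclear cycle (pre-read, zero, post-read)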

 

Details on your configuration might provide some clue ... not only the hardware components, but also the specific disk drives you've got connected.

 

  • 2 weeks later...

The S.M.A.R.T. data looks okay except perhaps for the seek error rates [but these can sometimes be high, so it's hard to say for sure whether that's an issue].

 

Did you do a thorough test on this drive before putting it in service?  [i.e. either a couple cycles of the pre-clear script or at least run the manufacturer's diagnostics]

 

Details on your configuration might provide some clue ... not only the hardware components, but also the specific disk drives you've got connected.

 

I have now done one round of preclear; you will find the new SMART report attached at the end of this post.

 

4x Seagate ST3000DM001-1CH166 (one of them will become the cache drive)

2x WD WDC_WD30EZRX-00AZ6B0_WD

 

Plus the 8TB drives (2x)

 

What else would be of relevance?

SuperMicro Mainboard X9SCL-F-0, 8TB attached to SATA3

Digitus DS-30104-1 additional SATA controller

 

Here is the SMART report after a preclear of the 8TB Seagate drive:

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    ST8000AS0002-1NA17Z

Serial Number:    Z8403NMN

Firmware Version: AR13

User Capacity:    8,001,563,222,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  9

ATA Standard is:  Not recognized. Minor revision code: 0x001f

Local Time is:    Thu May 14 21:40:29 2015 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (  0) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x30a5) SCT Status supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      146594920

  3 Spin_Up_Time            0x0003  090  090  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      21

  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  076  060  030    Pre-fail  Always      -      44486206

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      465

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      3

183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0

184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0

190 Airflow_Temperature_Cel 0x0022  057  056  045    Old_age  Always      -      43 (Min/Max 26/44)

191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      21

193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      63

194 Temperature_Celsius    0x0022  043  044  000    Old_age  Always      -      43 (0 25 0 0)

195 Hardware_ECC_Recovered  0x001a  117  099  000    Old_age  Always      -      146594920

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      38190849196214

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      60384571496

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      63277694076

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed without error      00%      266        -

# 2  Short offline      Completed without error      00%      251        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 

 

 

 


Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\


Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\

 

It's totally normal for the 8TB Archive.  Both mine have similarly high numbers after 22 days.

 

Perhaps you should only give advice when you know what you're talking about, and not lead people on a wild goose chase?


Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\

 

With a few exceptions, the "raw" attribute values are not meaningful. Each manufacturer is free to use the raw number however they want, and frequently they will use bit positions to indicate certain values. Interpreting a bunch of status bits as a number can produce alarmingly high decimal numbers that are, as I said, meaningless. Even for a single manufacturer, the values can have different meanings for different models and even firmware versions.

 

Manufacturers do "normalize" the values into a scale from 1 to 255. Lower is worse. A nominal value is often 100. The "VALUE" column is the current normalized value, the "WORST" column is how low the value has gone in the past, and the "THRESH" is the value at which the attribute will be considered failed. So for attribute #1,

  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      146594920

 

the current normalized value is 117, the worst it has gotten is 99, and the drive will consider 6 and below a failure. You are not even close to failure. The raw value means nothing to you.
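As a purely illustrative aside (Seagate is commonly reported, though not officially documented, to pack an operation count into the low 32 bits of these raw values and an error count into the upper bits), you can split the number yourself at the console and see there are no actual errors hiding in it:

raw=146594920                                         # raw value of attribute #1 above
echo "errors: $((raw >> 32)) / ops: $((raw & 0xFFFFFFFF))"
# prints: errors: 0 / ops: 146594920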

 

The few attributes whose raw values we do look at carefully are reallocated sectors (#5), pending sectors (#197), CRC errors (#199) and temperature (#194). #5 and #197 are often indicators of drive failure long before the normalized values drop significantly. #199 is often a sign of a bad or loose cable. The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.


It's totally normal for the 8TB Archive.  Both mine have similarly high numbers after 22 days.

 

Perhaps you should only give advice when you know what you're talking about, and not lead people on a wild goose chase?

 

Yeah, right... IN FACT that disk makes his system go into a kernel panic.  ::)

 

SMART ID #7 is BAD even by the disk firmware's own reckoning: a worst value that has touched 60 (on a basis of 100), with the SMART failure threshold fixed at 30, on a 22-day-old disk is... BAD. IMHO. Full stop.

 

Or maybe this is the reason why I don't go for Seagates... ever.  ;)


The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.

 

From what I've read, HGST and Seagate now specify a max temperature of 60°C on these larger multi-platter drives.

 

i.e.

Operating (drive case max °C) 60

Nonoperating (ambient °C) –40 to 70

 

I remember reading a specific article from Seagate saying they raised the acceptable temperature.

 

At 50, or 52-54, I might be exploring a better cooling solution.

 

From what I've seen in my 6TB HGST 7200 RPM drives and 6TB Seagate drives, the highest values were in the 45°C range after a grueling 1-week badblocks/preclear burn-in.

 

I'm not disputing the temperature goals in the post above; I'm relaying what Seagate and HGST believe the acceptable high range is.

Frankly, 55 is too close to 60, so I might set alarms at 53-54 and stop work if it climbs.


It's totally normal for the 8TB Archive.  Both mine have similarly high numbers after 22 days.

 

Perhaps you should only give advice when you know what you're talking about, and not lead people on a wild goose chase?

 

Yeah, right... IN FACT that disk makes his system go into a kernel panic.  ::)

 

SMART ID #7 is BAD even by the disk firmware's own reckoning: a worst value that has touched 60 (on a basis of 100), with the SMART failure threshold fixed at 30, on a 22-day-old disk is... BAD. IMHO. Full stop.

 

Or maybe this is the reason why I don't go for Seagates... ever.  ;)

 

I would not look at a SMART report for a kernel panic. No activity (or inactivity) of the storage subsystem should panic the OS. The places to look would be the syslog and the core dump.
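On unRAID the live syslog sits in RAM at /var/log/syslog, so it has to be copied somewhere persistent (e.g. the flash at /boot) before a reboot; something as simple as this, run while the box is still responsive, is enough:

cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M).txt   # timestamped copy that survives the reboot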


I would not look at a SMART report for a kernel panic. No activity (or inactivity) of the storage subsystem should panic the OS. The places to look would be the syslog and the core dump.

 

In a "perfect world"  you would be surely right...

 

But we are in the real world and sometimes (read: often...) things are a bit different...  ::)

 

In a perfect world, hard disks would always work in a fast and reliable way.  :)

 

In a bit "less perfect" world, they could have issues and, e.g. could delay responding when they have internal reading issues... BUT in this world, well written hard disk & controller firmwares, well written drivers and a well written & rock solid OS kernel would be able to manage these delays flawlessly...

 

BUT we are in the real world... and - even if I don't know much about the unRAID OS, I know a lot of other, similar OSes' behaviour WELL... and I know hard drives and SMART attributes WELL, even if there is someone here not so convinced of it... - in the real world a slow-responding hard drive COULD BE a real BIG problem.

 

Moreover, when such a hard drive is in a RAID (or pseudo-RAID) array, this CAN BE a real issue for a "storage oriented OS" which HAS TO DIRECTLY MANAGE that array.

 

Just to give some examples: on Windows (even the latest releases), a badly working USB stick can hang the system very easily. And I have seen Linux distros not behave much better in the same situation...

 

Anyway, my approach is always to do things as simply as possible:

 

- jus7incase told us that his system was stable before installing the new hard disk - FACT;

- jus7incase told us that after installing the new disk he got some "kernel panics" - FACT;

- jus7incase posted some hard drive SMART values here (which I don't think are good for a 22-day-old disk, and which quite surely CAN cause read delays from the disk) - BUT this is *only* my OPINION;

 

If jus7incase removes the drive from the system and it then works fine, he had a "compatibility" and/or bad-drive issue with his system - FACT.

 

Even though I wouldn't trust that disk even to store my porn collection (;D), a good thing to try would be to connect it to the system through a better SATA controller, if he has one available: it might handle the (possible) hard disk delays better and no longer cause a kernel panic.

 

P.S.: just FYI... professional hard drive monitoring tools like Hard Disk Sentinel (on Windows), in the "server restricted configuration" (the safer one...), often report hard drives without a single bad sector as "well below 100% of life status". That's because by the time a single bad sector appears, we are sometimes waaay close to a complete drive failure (I personally saw a lot of WD Raptor and Green series drives die rapidly this way). Other professional tools like HDD Regenerator report even simple read delays during a surface scan as "not so good", since they (correctly... IMHO) reckon that a delay today could be a bad sector tomorrow...


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen. The drive report shows a functioning drive. Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen.

...

I totally agree.

 

...

The drive report shows a functioning drive.

...

I REALLY would like to see it under a HDD Regenerator scan...  ;)

 

 

...

Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

I agree.

BUT if you read my step list above, there is a 4th point to take into account (the system MUST work fine again with the new HDD removed...  ;))


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen.

...

I totally agree.

 

...

The drive report shows a functioning drive.

...

I REALLY would like to see it under a HDD Regenerator scan...  ;)

 

 

...

Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

I agree.

BUT if you read my step list above, there is a 4th point to take into account (the system MUST work fine again with the new HDD removed...  ;))

So, what if it is the controller firmware or driver having trouble with the 8TB drive? Or any other software component? Removing the drive does nothing toward uncovering these shortcomings or determining the actual failure.


So, what if it is the controller firmware or driver having trouble with the 8TB drive? Or any other software component? Removing the drive does nothing toward uncovering these shortcomings or determining the actual failure.

 

Quoting myself...

 

"If just7case will remove the drive from the system and then it would work fine, he had a "compatibility" and/or bad drive issue with his system - FACT."

 

It seems clear to me...  ;)


Gentlemen, don't worry about the reported temperature. The report was taken right after stopping the pre-clear. The drive is at around 34°C when spinning idle. Totally normal.

 

 

The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.

 

From what I've read, HGST and Seagate now specify a max temperature of 60°C on these larger multi-platter drives.

 

i.e.

Operating (drive case max °C) 60

Nonoperating (ambient °C) –40 to 70

 

I remember reading a specific article from Seagate saying they raised the acceptable temperature.

 

At 50, or 52-54, I might be exploring a better cooling solution.

 

From what I've seen in my 6TB HGST 7200 RPM drives and 6TB Seagate drives, the highest values were in the 45°C range after a grueling 1-week badblocks/preclear burn-in.

 

I'm not disputing the temperature goals in the post above; I'm relaying what Seagate and HGST believe the acceptable high range is.

Frankly, 55 is too close to 60, so I might set alarms at 53-54 and stop work if it climbs.


Guys, you are awesome. It is great to get that much response and to see how many people care and take the time to help!

 

Intermediate status:

I pre-cleared it in one run and put it back in as the parity drive. No kernel panic so far.

 

For the record:

The drive is connected to one of the SATA3 connectors on a SuperMicro X9SCL-F. There is also a Digitus DS-30104-1 controller in the system, hosting the cache drive and 2 other media drives (one of them also being a new 8TB Seagate Archive, currently pre-clearing).

Both Seagate Archive 8TB drives were bought from the same batch.

 

...

The drive report shows a functioning drive.

...

I REALLY would like to see it under a HDD Regenerator scan...  ;)

 

Good to know it looks OK so far from the SMART report.

I have no clue how to do the HDD Regenerator scan. Can it be done on the parity drive?

 

...

Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 

After rebooting from a kernel panic, will the syslog still be available? Doesn't Unraid delete this kind of stuff upon reboot?

I will make the syslog available when applicable, though not publicly.

Raise a hand if you are willing to help and take a look, and I may contact you the next time it kernel panics.

 

Regarding the following statement I have some disconcerting additional information: BUT if you read my step list above, there is a 4th point to take into account (the system MUST work fine again with the new HDD removed...  ;))

 

When I switched the system back to the old 3TB parity drive in order to pre-clear the 8TB drive, I experienced another kernel panic under the following circumstances: the 3TB drive's parity had been rebuilt successfully, and the 8TB drive was physically/electrically connected but not assigned to the array. The 8TB drive was idle (not being pre-cleared).

 

After rebooting and rebuilding parity in maintenance mode I started pre-clearing the 8TB drive. No errors on the console, no kernel panic.

Then I unassigned the 3TB parity drive, assigned the 8TB drive as parity and let it build the parity in maintenance mode. The 3TB drive is electrically connected but not assigned to the array, idling. Since then, no kernel panic.

 

It is a puzzle to me. I began suspecting firmware problems with the new drives, but that is not supported by the fact that the last panic occurred while the new drives were idle. I also suspected the Digitus controller at some point, but the 8TB parity drive is not connected to it. The other 8TB, currently pre-clearing, is connected to the Digitus, and still no kernel panic.

 

The kernel panic always occurs at night, probably when the mover does its work and some of the other non-OS services do their grunt work. Still, this doesn't help me understand what's happening.

 

 


...

When I switched the system back to the old 3TB parity drive in order to pre-clear the 8TB drive, I experienced another kernel panic under the following circumstances: the 3TB drive's parity had been rebuilt successfully, and the 8TB drive was physically/electrically connected but not assigned to the array. The 8TB drive was idle (not being pre-cleared).

 

After rebooting and rebuilding parity in maintenance mode I started pre-clearing the 8TB drive. No errors on the console, no kernel panic.

Then I unassigned the 3TB parity drive, assigned the 8TB drive as parity and let it build the parity in maintenance mode. The 3TB drive is electrically connected but not assigned to the array, idling. Since then, no kernel panic.

...

 

This makes me think the kernel panics might not be related to the new hard disk...

 

Anyway, if you want to test it properly, you need a bootable ISO of HDD Regenerator (Google for it...  ;)): test the disk connected on its own, booting from that ISO.

 

I'll try a "Scan only (without repair)" scan since if the disk will eventually show some defects, you wouldn't want the software to try to repair it (at least until you decide what to do with the drive).

 

A properly working new disk should show not only no bad sectors but no delays either.

 

If it does report some delays... (as I suspect it will, from the SMART values...), it will never be a well-working/reliable disk.

Then it will be up to you whether to use it anyway (and go for a "full regeneration" in that case...) or try to RMA it.

 

All of the operations can be done with data on the disk, since HDD Regenerator will not modify it (though a full regeneration shouldn't be done without a backup, because if the disk starts to fail during the operation, it could fail the hard way...  :-\).

 

 


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen. The drive report shows a functioning drive. Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 

I had the same kernel panic last night: "5 buffers handled - should be 1", then it shut down the CPUs, and the last thing shown was "32 buffers handled - should be 1".

 

Current setup: parity (8TB) connected to the SuperMicro board. Another 8TB media disk connected to the Digitus controller, as well as a 3TB drive (supposed to become the cache) and the current 250GB cache. It worked fine all day. During the night, some (cron) job or service sometimes seems to trigger the kernel panic.

 

After the reboot I looked into the syslog, but it is a fresh file created right after the reboot. How can you actually diagnose an unRAID system if the syslog gets removed on a reboot?

 

??puzzled??

 

I found a core dump:

-rw-------  1 root  root  397312 2015-05-25 03:43 core

 

Does this help in any way to get further information?

 

Any ideas how to find out more?

 


Try the powerdown plugin. It saves syslogs.
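If the panic hard-freezes the box before powerdown can do anything, a crude fallback (just my own workaround, nothing official) is to have cron copy the syslog to the flash every few minutes, so the last copy from before the crash survives the reset button. A cron line along these lines would do it (the /boot/logs folder is just an example, and it does wear the flash a little, so only use it while debugging):

*/5 * * * * mkdir -p /boot/logs && cp /var/log/syslog /boot/logs/syslog-latest.txt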

 

Just to make sure this is applicable in my case:

 

I have a kernel panic at night when the machine is unattended. I can see the panic on screen, but the keyboard does not react and the NIC is down. I cannot access the machine via keyboard or network.

 

does "powerdown" save syslog even when I have a kernel panic and push the reset button?

 


Just for the record, the other 8TB drive (same batch) comes with nearly identical SMART values. Either this is normal or the whole batch has a problem.

 

Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\

 

With a few exceptions, the "raw" attribute values are not meaningful. Each manufacturer is free to use the raw number however they want, and frequently they will use bit positions to indicate certain values. Interpreting a bunch of status bits as a number can produce alarmingly high decimal numbers that are, as I said, meaningless. Even for a single manufacturer, the values can have different meanings for different models and even firmware versions.

 

Manufacturers do "normalize" the values into a scale from 1 to 255. Lower is worse. A nominal value is often 100. The "VALUE" column is the current normalized value, the "WORST" column is how low the value has gone in the past, and the "THRESH" is the value at which the attribute will be considered failed. So for attribute #1,

  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      146594920

 

the current normalized value is 117, the worst it has gotten is 99, and the drive will consider 6 and below a failure. You are not even close to failure. The raw value means nothing to you.

 

The few attributes whose raw values we do look at carefully are reallocated sectors (#5), pending sectors (#197), CRC errors (#199) and temperature (#194). #5 and #197 are often indicators of drive failure long before the normalized values drop significantly. #199 is often a sign of a bad or loose cable. The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.


I think I have narrowed down when this happens by disabling all services that do nightly scans on the drives, except for the mover.

I then scheduled the mover to a different time and, boom, a kernel panic at that time.

The mover script looks quite inconspicuous - mainly an rsync. I do not think that rsync itself should pose a problem, especially since the same mover script has worked fine for the last few years.
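For anyone following along: stripped of its housekeeping, the mover essentially boils down to an rsync from the cache disk to the parity-protected array, something in the spirit of the line below (the flags, the share name and the /mnt/user0 target are only illustrative, not the literal script):

rsync -av --remove-source-files /mnt/cache/Movies/ /mnt/user0/Movies/   # move finished files for one share off the cache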

The cache drive is currently hooked to the Digitus controller. I will also attach that drive to the SuperMicro board to see what happens.

I will keep you posted.

 

In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen. The drive report shows a functioning drive. Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 

I had the same kernel panic last night: "5 buffers handled - should be 1", then it shut down the CPUs, and the last thing shown was "32 buffers handled - should be 1".

 

Current setup: parity (8TB) connected to the SuperMicro board. Another 8TB media disk connected to the Digitus controller, as well as a 3TB drive (supposed to become the cache) and the current 250GB cache. It worked fine all day. During the night, some (cron) job or service sometimes seems to trigger the kernel panic.

 

After the reboot I looked into the syslog, but it is a fresh file created right after the reboot. How can you actually diagnose an unRAID system if the syslog gets removed on a reboot?

 

??puzzled??

 

I found a core dump:

-rw-------  1 root  root  397312 2015-05-25 03:43 core

 

Does this help in any way to get further information?

 

Any ideas how to find out more?


Folks,

 

I took SMART reports of all drives. Except for one, all drives have logged no errors.

The only drive that has logged errors is the former parity drive. The errors seem to be old (at 44 days of operation) and I never had parity problems with that drive. However, attribute 184 End-to-End_Error reports "FAILING_NOW"!

 

I was intending to replace my old 250GB cache drive with this 3TB drive.

Could someone please take a look at whether this drive is OK to use as a cache drive? (I understand the cache content is not protected by parity if not run in a pool; I don't have unRAID 6 yet.)
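For reference, this is roughly how I would re-check it myself before deciding (sdc is simply how that drive shows up on my box; the smartctl switches are the standard ones):

smartctl -t long /dev/sdc        # extended self-test over the whole surface (takes hours)
smartctl -l selftest /dev/sdc    # read the self-test log once it is done
smartctl -A /dev/sdc | awk '$1==5 || $1==184 || $1==197 || $1==199'   # watch reallocated, end-to-end, pending and CRC counts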

 

Thank you so much!

 

The error report from the pre-clear:

 

** Changed attributes in files: /tmp/smart_start_sdc  /tmp/smart_finish_sdc

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      Raw_Read_Error_Rate =  109    114            6        ok          21345144

          Seek_Error_Rate =    51      51          30        near_thresh 1383022160179

        Spin_Retry_Count =  100    100          97        near_thresh 0

        End-to-End_Error =    93      93          99        FAILING_NOW 7

  Airflow_Temperature_Cel =    63      67          45        near_thresh 37

      Temperature_Celsius =    37      33            0        ok          37

 

*** Failing SMART Attributes in /tmp/smart_finish_sdc ***

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

184 End-to-End_Error        0x0032  093  093  099    Old_age  Always  FAILING_NOW 7

 

 

 

The following report was taken while the drive was pre-clearing - hence the high temperature. It is usually cooler.

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    ST3000DM001-1CH166

Firmware Version: CC26

User Capacity:    3,000,592,982,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Thu May 28 13:05:33 2015 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

See vendor-specific Attribute list for marginal Attributes.

 

General SMART Values:

Offline data collection status:  (0x00) Offline data collection activity

was never started.

Auto Offline Data Collection: Disabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 584) seconds.

Offline data collection

capabilities: (0x73) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

No Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x3085) SCT Status supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  113  099  006    Pre-fail  Always      -      55763472

  3 Spin_Up_Time            0x0003  093  093  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  099  099  020    Old_age  Always      -      1969

  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  051  047  030    Pre-fail  Always      -      1383022134666

  9 Power_On_Hours          0x0032  084  084  000    Old_age  Always      -      14866

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      62

183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0

184 End-to-End_Error        0x0032  093  093  099    Old_age  Always  FAILING_NOW 7

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  096  096  000    Old_age  Always      -      4

190 Airflow_Temperature_Cel 0x0022  062  055  045    Old_age  Always      -      38 (Min/Max 30/39)

191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      11

193 Load_Cycle_Count        0x0032  081  081  000    Old_age  Always      -      38294

194 Temperature_Celsius    0x0022  038  045  000    Old_age  Always      -      38 (0 17 0 0)

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      182291296946429

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      72536194288

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      155726243753

 

SMART Error Log Version: 1

ATA Error Count: 5

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 5 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.856  READ DMA EXT

  ef 10 02 00 00 00 a0 00  44d+20:48:56.856  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 00  44d+20:48:56.856  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00  44d+20:48:56.855  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00  44d+20:48:56.855  SET FEATURES [set transfer mode]

 

Error 4 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.731  READ DMA EXT

  ef 10 02 00 00 00 a0 00  44d+20:48:56.731  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 00  44d+20:48:56.731  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00  44d+20:48:56.730  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00  44d+20:48:56.730  SET FEATURES [set transfer mode]

 

Error 3 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.606  READ DMA EXT

  ef 10 02 00 00 00 a0 00  44d+20:48:56.606  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 00  44d+20:48:56.606  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00  44d+20:48:56.605  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00  44d+20:48:56.605  SET FEATURES [set transfer mode]

 

Error 2 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.451  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.450  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.443  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.437  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.430  READ DMA EXT

 

Error 1 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 08 10 b2 06 e0 00  44d+20:48:55.925  READ DMA

  ca 00 08 98 8f 00 e0 00  44d+20:48:21.660  WRITE DMA

  c8 00 08 98 8f 00 e0 00  44d+20:48:21.660  READ DMA

  ca 00 c0 d8 8e 00 e0 00  44d+20:48:21.659  WRITE DMA

  ca 00 08 d0 8e 00 e0 00  44d+20:48:21.659  WRITE DMA

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

