Seagate Archive 8TB ST8000AS0002 on Unraid 5 stable: kernel panic



Hello all,

 

When I use a newly bought Seagate Archive 8TB drive (ST8000AS0002) as the parity drive in an Unraid 5 stable array (I have not tried it as a data drive), I get a kernel panic after some time. Interestingly, the panic even blocks traffic going over the attached switch; I have no idea how that happens.

(When I replace the 8TB drive with a 3TB drive for parity, everything works smoothly again for days. The 8TB drive remains connected to the machine but is not assigned in Unraid.)

 

My questions:

  Is this a known problem?

  How might this be fixed?

      Move to Unraid 6?

      Drive firmware update?

 

Any ideas?

 

Thanks for caring,

JC

 

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    ST8000AS0002-1NA17Z

Serial Number:    Z8403NMN

Firmware Version: AR13

User Capacity:    8,001,563,222,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  9

ATA Standard is:  Not recognized. Minor revision code: 0x001f

Local Time is:    Wed May  6 00:25:19 2015 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (  0) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x30a5) SCT Status supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  118  099  006    Pre-fail  Always      -      194380720

  3 Spin_Up_Time            0x0003  090  090  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      13

  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  071  060  030    Pre-fail  Always      -      13681010

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      251

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      3

183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0

184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0

190 Airflow_Temperature_Cel 0x0022  065  057  045    Old_age  Always      -      35 (Min/Max 26/41)

191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      10

193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      37

194 Temperature_Celsius    0x0022  035  043  000    Old_age  Always      -      35 (0 25 0 0)

195 Hardware_ECC_Recovered  0x001a  118  099  000    Old_age  Always      -      194380720

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      182325656682555

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      28300614680

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      16244567367

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%      251        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 


The S.M.A.R.T. data looks okay except perhaps for the seek error rates [but these can sometimes be high, so it's hard to say for sure whether that's an issue].

 

Did you do a thorough test on this drive before putting it in service?  [i.e. either a couple cycles of the pre-clear script or at least run the manufacturer's diagnostics]
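If it helps, here is a rough sketch of how I'd do that from the unRAID console (sdX is just a placeholder for however the 8TB drive shows up on your system, and the path assumes you keep Joe L.'s preclear script on the flash drive):

smartctl -t long /dev/sdX        # start a SMART extended self-test (runs in the background)
smartctl -l selftest /dev/sdX    # check the self-test log once it has finished
/boot/preclear_disk.sh /dev/sdX  # or run a full preclear cycle (pre-read, zero, post-read)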

 

Details on your configuration might provide some clue ... not only the hardware components, but also the specific disk drives you've got connected.

 

  • 2 weeks later...

The S.M.A.R.T. data looks okay except perhaps for the seek error rates [but these can sometimes be high, so it's hard to say for sure whether that's an issue].

 

Did you do a thorough test on this drive before putting it in service?  [i.e. either a couple cycles of the pre-clear script or at least run the manufacturer's diagnostics]

 

Details on your configuration might provide some clue ... not only the hardware components, but also the specific disk drives you've got connected.

 

I have now done one round of preclear; you will find the new SMART report attached at the end of this post.

 

4x Seagate ST3000DM001-1CH166 (one of them will become the cache drive)

2x WD WDC_WD30EZRX-00AZ6B0_WD

 

Plus the 8TB drives (2x)

 

What else would be of relevance?

SuperMicro Mainboard X9SCL-F-0, 8TB attached to SATA3

Digitus DS-30104-1 additional SATA controller

 

Here is the SMART report after a preclear of the 8TB Seagate drive:

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    ST8000AS0002-1NA17Z

Serial Number:    Z8403NMN

Firmware Version: AR13

User Capacity:    8,001,563,222,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  9

ATA Standard is:  Not recognized. Minor revision code: 0x001f

Local Time is:    Thu May 14 21:40:29 2015 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (  0) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x30a5) SCT Status supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      146594920

  3 Spin_Up_Time            0x0003  090  090  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  100  100  020    Old_age  Always      -      21

  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  076  060  030    Pre-fail  Always      -      44486206

  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      465

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      3

183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0

184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  100  100  000    Old_age  Always      -      0

190 Airflow_Temperature_Cel 0x0022  057  056  045    Old_age  Always      -      43 (Min/Max 26/44)

191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      21

193 Load_Cycle_Count        0x0032  100  100  000    Old_age  Always      -      63

194 Temperature_Celsius    0x0022  043  044  000    Old_age  Always      -      43 (0 25 0 0)

195 Hardware_ECC_Recovered  0x001a  117  099  000    Old_age  Always      -      146594920

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      38190849196214

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      60384571496

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      63277694076

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Completed without error      00%      266        -

# 2  Short offline      Completed without error      00%      251        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

 

 

 

 


Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\


Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\

 

It's totally normal for the 8TB Archive.  Both mine have similarly high numbers after 22 days.

 

Perhaps you should only give advice when you know what you're talking about, and not lead people on a wild goose chase?


Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\

 

With a few exceptions, the "raw" attribute values are not meaningful. Each manufacturer is free to use the raw number however they want, and frequently they will use bit positions to indicate certain values. Interpreting a bunch of status bits as a number can produce alarmingly high decimal numbers that are, as I said, meaningless. Even for a single manufacturer, the values can have different meanings for different models and even firmware versions.

 

Manufacturers do "normalize" the values into a scale from 1 to 255. Lower is worse. A nominal value is often 100. The "VALUE" column is the current normalized value, the "WORST" column is how low the value has gone in the past, and the "THRESH" is the value at which the attribute will be considered failed. So for attribute #1,

  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      146594920

 

the current normalized value is 117, the worst it has gotten is 99, and the drive will consider 6 and below a failure. You are not even close to failure. The raw value means nothing to you.
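As a purely illustrative aside (Seagate is commonly reported, though not officially documented, to pack an operation count into the low 32 bits of these raw values and an error count into the upper bits), you can split the number yourself at the console and see there are no actual errors hiding in it:

raw=146594920                                         # raw value of attribute #1 above
echo "errors: $((raw >> 32)) / ops: $((raw & 0xFFFFFFFF))"
# prints: errors: 0 / ops: 146594920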

 

The few attributes whose raw values we do look at carefully are reallocated sectors (#5), pending sectors (#197), CRC errors (#199) and temperature (#194). #5 and #197 are often indicators of drive failure long before the normalized values drop significantly. #199 is often a sign of a bad or loose cable. The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.


It's totally normal for the 8TB Archive.  Both mine have similarly high numbers after 22 days.

 

Perhaps you should only give advice when you know what you're talking about, and not lead people on a wild goose chase?

 

Yeah, right... IN FACT that disk makes his system go into a kernel panic.  ::)

 

SMART ID #7 is BAD even by the disk firmware's own reckoning: a worst value that has touched 60 (on a basis of 100), with the SMART failure threshold fixed at 30, on a 22-day-old disk is... BAD. IMHO. Full stop.

 

Or maybe this is the reason why I don't go for Seagates... ever.  ;)


The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.

 

From what I've read, HGST and Seagate now specify a max temperature of 60°C on these larger multi-platter drives.

 

i.e.

Operating (drive case max °C) 60

Nonoperating (ambient °C) –40 to 70

 

I remember reading a specific article from Seagate saying they raised the acceptable temperature.

 

At 50, or 52-54, I might be exploring a better cooling solution.

 

From what I've seen in my 6TB HGST 7200 RPM drives and 6TB Seagate drives, the highest values were in the 45°C range after a grueling 1-week badblocks/preclear burn-in.

 

I'm not disputing the temperature goals in the post above; I'm relaying what Seagate and HGST believe the acceptable high range is.

Frankly, 55 is too close to 60, so I might set alarms at 53-54 and stop work if it climbs.


It's totally normal for the 8TB Archive.  Both mine have similarly high numbers after 22 days.

 

Perhaps you should only give advice when you know what you're talking about, and not lead people on a wild goose chase?

 

Yeah, right... IN FACT that disk makes his system go into a kernel panic.  ::)

 

SMART ID #7 is BAD even by the disk firmware's own reckoning: a worst value that has touched 60 (on a basis of 100), with the SMART failure threshold fixed at 30, on a 22-day-old disk is... BAD. IMHO. Full stop.

 

Or maybe this is the reason why I don't go for Seagates... ever.  ;)

 

I would not look at a SMART report for a kernel panic. No activity (or inactivity) of the storage subsystem should panic the OS. The places to look would be the syslog and the core dump.
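On unRAID the live syslog sits in RAM at /var/log/syslog, so it has to be copied somewhere persistent (e.g. the flash at /boot) before a reboot; something as simple as this, run while the box is still responsive, is enough:

cp /var/log/syslog /boot/syslog-$(date +%Y%m%d-%H%M).txt   # timestamped copy that survives the reboot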


I would not look at a SMART report for a kernel panic. No activity (or inactivity) of the storage subsystem should panic the OS. The places to look would be the syslog and the core dump.

 

In a "perfect world"  you would be surely right...

 

But we are in the real world and sometimes (read: often...) things are a bit different...  ::)

 

In a perfect world, hard disks would always work in a fast and reliable way.  :)

 

In a bit "less perfect" world, they could have issues and, e.g. could delay responding when they have internal reading issues... BUT in this world, well written hard disk & controller firmwares, well written drivers and a well written & rock solid OS kernel would be able to manage these delays flawlessly...

 

BUT we are in the real world... and - even if I don't know much about the unRAID OS, I know a lot of other, similar OSes' behaviour WELL... and I know hard drives and SMART attributes WELL, even if there is someone here not so convinced of it... - in the real world a slow-responding hard drive COULD BE a real BIG problem.

 

Moreover, when such a hard drive is in a RAID (or pseudo-RAID) array, this CAN BE a real issue for a "storage oriented OS" which HAS TO DIRECTLY MANAGE that array.

 

Just to give some examples: on Windows (even the latest releases), a badly working USB stick can hang the system very easily. And I have seen Linux distros not behave much better in the same situation...

 

Anyway, my approach is always to do things as simply as possible:

 

- jus7incase told us that his system was stable before installing the new hard disk - FACT;

- jus7incase told us that after installing the new disk he got some "kernel panics" - FACT;

- jus7incase posted some hard drive SMART values here (which I don't think are good for a 22-day-old disk, and which quite surely CAN cause read delays from the disk) - BUT this is *only* my OPINION;

 

If jus7incase removes the drive from the system and it then works fine, he had a "compatibility" and/or bad-drive issue with his system - FACT.

 

Even though I wouldn't trust that disk even to store my porn collection (;D), a good thing to try would be to connect it to the system through a better SATA controller, if he has one available: it might handle the (possible) hard disk delays better and no longer cause a kernel panic.

 

P.S.: just FYI... professional hard drive monitoring tools like Hard Disk Sentinel (on Windows), in the "server restricted configuration" (the safer one...), often report hard drives without a single bad sector as "well below 100% of life status". That's because by the time a single bad sector appears, we are sometimes waaay close to a complete drive failure (I personally saw a lot of WD Raptor and Green series drives die rapidly this way). Other professional tools like HDD Regenerator report even simple read delays during a surface scan as "not so good", since they (correctly... IMHO) reckon that a delay today could be a bad sector tomorrow...


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen. The drive report shows a functioning drive. Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen.

...

I totally agree.

 

...

The drive report shows a functioning drive.

...

I REALLY would like to see it under a HDD Regenerator scan...  ;)

 

 

...

Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

I agree.

BUT if you read my step list above, there is a 4th point to take into account (the system MUST work fine again with the new HDD removed...  ;))


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen.

...

I totally agree.

 

...

The drive report shows a functioning drive.

...

I REALLY would like to see it under a HDD Regenerator scan...  ;)

 

 

...

Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

I agree.

BUT if you read my step list above, there is a 4th point to take into account (the system MUST work fine again with the new HDD removed...  ;))

So, what if it is the controller firmware or driver having trouble with the 8TB drive? Or any other software component? Removing the drive does nothing toward uncovering these shortcomings or determining the actual failure.


So, what if it is the controller firmware or driver having trouble with the 8TB drive? Or any other software component? Removing the drive does nothing toward uncovering these shortcomings or determining the actual failure.

 

Quoting myself...

 

"If just7case will remove the drive from the system and then it would work fine, he had a "compatibility" and/or bad drive issue with his system - FACT."

 

It seems clear to me...  ;)


Gentlemen, don't worry about the reported temperature. The report was taken right after stopping the pre-clear. The drive is at around 34°C when spinning idle. Totally normal.

 

 

The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.

 

From what I've read, HGST and Seagate now specify a max temperature of 60°C on these larger multi-platter drives.

 

i.e.

Operating (drive case max °C) 60

Nonoperating (ambient °C) –40 to 70

 

I remember reading a specific article from Seagate saying they raised the acceptable temperature.

 

At 50, or 52-54, I might be exploring a better cooling solution.

 

From what I've seen in my 6TB HGST 7200 RPM drives and 6TB Seagate drives, the highest values were in the 45°C range after a grueling 1-week badblocks/preclear burn-in.

 

I'm not disputing the temperature goals in the post above; I'm relaying what Seagate and HGST believe the acceptable high range is.

Frankly, 55 is too close to 60, so I might set alarms at 53-54 and stop work if it climbs.


Guys, you are awesome. It is great to get that much response and to see how many people care and take the time to help!

 

Intermediate status:

I pre-cleared it in one run and put it back in as the parity drive. No kernel panic so far.

 

For the record:

The drive is connected to one of the SATA3 connectors on a SuperMicro X9SCL-F. There is also a Digitus DS-30104-1 controller in the system, hosting the cache drive and 2 other media drives (one of them also being a new 8TB Seagate Archive, currently pre-clearing).

Both Seagate Archive 8TB drives were bought from the same batch.

 

...

The drive report shows a functioning drive.

...

I REALLY would like to see it under a HDD Regenerator scan...  ;)

 

Good to know it looks OK so far from the SMART report.

I have no clue how to do the HDD Regenerator scan. Can it be done on the parity drive?

 

...

Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 

After rebooting from a kernel panic, will the syslog still be available? Doesn't Unraid delete this kind of stuff upon reboot?

I will make the syslog available when applicable, though not publicly.

Raise a hand if you are willing to help and take a look, and I may contact you the next time it kernel panics.

 

Regarding the following statement I have some disconcerting additional information: BUT if you read my step list above, there is a 4th point to take into account (the system MUST work fine again with the new HDD removed...  ;))

 

When I switched the system back to the old 3TB parity drive in order to pre-clear the 8TB drive, I experienced another kernel panic under the following circumstances: the 3TB drive's parity had been rebuilt successfully, and the 8TB drive was physically/electrically connected but not assigned to the array. The 8TB drive was idle (not being pre-cleared).

 

After rebooting and rebuilding parity in maintenance mode I started pre-clearing the 8TB drive. No errors on the console, no kernel panic.

Then I unassigned the 3TB parity drive, assigned the 8TB drive as parity and let it build the parity in maintenance mode. The 3TB drive is electrically connected but not assigned to the array, idling. Since then, no kernel panic.

 

It is a puzzle to me. I began suspecting firmware problems with the new drives, but that is not supported by the fact that the last panic occurred while the new drives were idle. I also suspected the Digitus controller at some point, but the 8TB parity drive is not connected to it. The other 8TB, currently pre-clearing, is connected to the Digitus, and still no kernel panic.

 

The kernel panic always occurs at night, probably when the mover does its work and some of the other non-OS services do their grunt work. Still, this doesn't help me understand what's happening.

 

 


...

When I switched the system back to the old 3TB parity drive in order to pre-clear the 8TB drive, I experienced another kernel panic under the following circumstances: the 3TB drive's parity had been rebuilt successfully, and the 8TB drive was physically/electrically connected but not assigned to the array. The 8TB drive was idle (not being pre-cleared).

 

After rebooting and rebuilding parity in maintenance mode I started pre-clearing the 8TB drive. No errors on the console, no kernel panic.

Then I unassigned the 3TB parity drive, assigned the 8TB drive as parity and let it build the parity in maintenance mode. The 3TB drive is electrically connected but not assigned to the array, idling. Since then, no kernel panic.

...

 

This makes me think the kernel panics might not be related to the new hard disk...

 

Anyway, if you want to test it properly, you need a bootable ISO of HDD Regenerator (Google for it...  ;)): test the disk connected on its own, booting from that ISO.

 

I'll try a "Scan only (without repair)" scan since if the disk will eventually show some defects, you wouldn't want the software to try to repair it (at least until you decide what to do with the drive).

 

A properly working new disk should show not only no bad sectors but no delays either.

 

If it does report some delays... (as I suspect it will, from the SMART values...), it will never be a well-working/reliable disk.

Then it will be up to you whether to use it anyway (and go for a "full regeneration" in that case...) or try to RMA it.

 

All of the operations can be done with data on the disk, since HDD Regenerator will not modify it (though a full regeneration shouldn't be done without a backup, because if the disk starts to fail during the operation, it could fail the hard way...  :-\).

 

 


In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen. The drive report shows a functioning drive. Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 

I had the same kernel panic last night: "5 buffers handled - should be 1", then it shut down the CPUs, and the last thing shown was "32 buffers handled - should be 1".

 

Current setup: parity (8TB) connected to the SuperMicro board. Another 8TB media disk connected to the Digitus controller, as well as a 3TB drive (supposed to become the cache) and the current 250GB cache. It worked fine all day. During the night, some (cron) job or service sometimes seems to trigger the kernel panic.

 

After the reboot I looked into the syslog, but it is a fresh file created right after the reboot. How can you actually diagnose an unRAID system if the syslog gets removed on a reboot?

 

??puzzled??

 

I found a core dump:

-rw-------  1 root  root  397312 2015-05-25 03:43 core

 

Does this help in any way to get further information?

 

Any ideas how to find out more?

 


Try the powerdown plugin. It saves syslogs.
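If the panic hard-freezes the box before powerdown can do anything, a crude fallback (just my own workaround, nothing official) is to have cron copy the syslog to the flash every few minutes, so the last copy from before the crash survives the reset button. A cron line along these lines would do it (the /boot/logs folder is just an example, and it does wear the flash a little, so only use it while debugging):

*/5 * * * * mkdir -p /boot/logs && cp /var/log/syslog /boot/logs/syslog-latest.txt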

 

Just to make sure this is applicable in my case:

 

I have a kernel panic at night when the machine is unattended. I can see the panic on screen, but the keyboard does not react and the NIC is down. I cannot access the machine via keyboard or network.

 

does "powerdown" save syslog even when I have a kernel panic and push the reset button?

 


Just for the record, the other 8TB drive (same batch) comes with nearly identical SMART values. Either this is normal or the whole batch has a problem.

 

Based on my experience, I don't like attribute IDs #1, #195 and, even more so, #7. At all.

 

They are very high, in my opinion, for a disk with less than 20 days of working time, and could be due to a bad magnetic surface...

 

If it were mine, I'd give it away ASAP...  :-\

 

With a few exceptions, the "raw" attribute values are not meaningful. Each manufacturer is free to use the raw number however they want, and frequently they will use bit positions to indicate certain values. Interpreting a bunch of status bits as a number can produce alarmingly high decimal numbers that are, as I said, meaningless. Even for a single manufacturer, the values can have different meanings for different models and even firmware versions.

 

Manufacturers do "normalize" the values into a scale from 1 to 255. Lower is worse. A nominal value is often 100. The "VALUE" column is the current normalized value, the "WORST" column is how low the value has gone in the past, and the "THRESH" is the value at which the attribute will be considered failed. So for attribute #1,

  1 Raw_Read_Error_Rate    0x000f  117  099  006    Pre-fail  Always      -      146594920

 

the current normalized value is 117, the worst it has gotten is 99, and the drive will consider 6 and below a failure. You are not even close to failure. The raw value means nothing to you.

 

The few attributes whose raw values we do look at carefully are reallocated sectors (#5), pending sectors (#197), CRC errors (#199) and temperature (#194). #5 and #197 are often indicators of drive failure long before the normalized values drop significantly. #199 is often a sign of a bad or loose cable. The temperature is a bit subjective, but I aim to keep them maxing out in the upper 30s or low 40s.

 

Looking at these values, the only one that troubles me is the temperature, which is 43, a little higher than I would prefer to see. It is still far from a serious problem: if that is the temp during a parity check, all is fine. But if that is the idle temp, the temp under load could be approaching 50, which is too hot IMO.


I think I have narrowed down when this happens by disabling all services that do nightly scans on the drives, except for the mover.

I then scheduled the mover to a different time and, boom, a kernel panic at that time.

The mover script looks quite inconspicuous - mainly an rsync. I do not think that rsync itself should pose a problem, especially since the same mover script has worked fine for the last few years.
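For anyone following along: stripped of its housekeeping, the mover essentially boils down to an rsync from the cache disk to the parity-protected array, something in the spirit of the line below (the flags, the share name and the /mnt/user0 target are only illustrative, not the literal script):

rsync -av --remove-source-files /mnt/cache/Movies/ /mnt/user0/Movies/   # move finished files for one share off the cache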

The cache drive is currently hooked to the Digitus controller. I will also attach that drive to the SuperMicro board to see what happens.

I will keep you posted.

 

In the real world, you should never expect a disk drive to work or be reliable. Failure is always going to happen. The drive report shows a functioning drive. Thus the syslog would be very helpful to determine the cause. The process of installing a disk drive can be fairly uneventful (i.e. just a swap of drives) or more like major surgery, uncabling and unscrewing a drive and replacing it. The latter can expose components to static discharge, or dislodge them. The failure vectors are not limited to the new disk drive, or even the related OS components.

 

Since this is a repeating panic, it should be pretty easy to get more details.

 

I had the same kernel panic last night: "5 buffers handled - should be 1", then it shut down the CPUs, and the last thing shown was "32 buffers handled - should be 1".

 

Current setup: parity (8TB) connected to the SuperMicro board. Another 8TB media disk connected to the Digitus controller, as well as a 3TB drive (supposed to become the cache) and the current 250GB cache. It worked fine all day. During the night, some (cron) job or service sometimes seems to trigger the kernel panic.

 

After the reboot I looked into the syslog, but it is a fresh file created right after the reboot. How can you actually diagnose an unRAID system if the syslog gets removed on a reboot?

 

??puzzled??

 

I found a core dump:

-rw-------  1 root  root  397312 2015-05-25 03:43 core

 

Does this help in any way to get further information?

 

Any ideas how to find out more?


Folks,

 

I took SMART reports of all drives. Except for one, all drives have logged no errors.

The only drive that has logged errors is the former parity drive. The errors seem to be old (at 44 days of operation) and I never had parity problems with that drive. However, attribute 184 End-to-End_Error reports "FAILING_NOW"!

 

I was intending to replace my old 250GB cache drive with this 3TB drive.

Could someone please take a look at whether this drive is OK to use as a cache drive? (I understand the cache content is not protected by parity if not run in a pool; I don't have unRAID 6 yet.)
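For reference, this is roughly how I would re-check it myself before deciding (sdc is simply how that drive shows up on my box; the smartctl switches are the standard ones):

smartctl -t long /dev/sdc        # extended self-test over the whole surface (takes hours)
smartctl -l selftest /dev/sdc    # read the self-test log once it is done
smartctl -A /dev/sdc | awk '$1==5 || $1==184 || $1==197 || $1==199'   # watch reallocated, end-to-end, pending and CRC counts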

 

Thank you so much!

 

The error report from the pre-clear:

 

** Changed attributes in files: /tmp/smart_start_sdc  /tmp/smart_finish_sdc

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      Raw_Read_Error_Rate =  109    114            6        ok          21345144

          Seek_Error_Rate =    51      51          30        near_thresh 1383022160179

        Spin_Retry_Count =  100    100          97        near_thresh 0

        End-to-End_Error =    93      93          99        FAILING_NOW 7

  Airflow_Temperature_Cel =    63      67          45        near_thresh 37

      Temperature_Celsius =    37      33            0        ok          37

 

*** Failing SMART Attributes in /tmp/smart_finish_sdc ***

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

184 End-to-End_Error        0x0032  093  093  099    Old_age  Always  FAILING_NOW 7

 

 

 

The following report was taken while the drive was pre-clearing - hence the high temperature. It is usually cooler.

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Device Model:    ST3000DM001-1CH166

Firmware Version: CC26

User Capacity:    3,000,592,982,016 bytes

Device is:        Not in smartctl database [for details use: -P showall]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Thu May 28 13:05:33 2015 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

See vendor-specific Attribute list for marginal Attributes.

 

General SMART Values:

Offline data collection status:  (0x00) Offline data collection activity

was never started.

Auto Offline Data Collection: Disabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: ( 584) seconds.

Offline data collection

capabilities: (0x73) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

No Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 255) minutes.

Conveyance self-test routine

recommended polling time: (  2) minutes.

SCT capabilities:       (0x3085) SCT Status supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  113  099  006    Pre-fail  Always      -      55763472

  3 Spin_Up_Time            0x0003  093  093  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  099  099  020    Old_age  Always      -      1969

  5 Reallocated_Sector_Ct  0x0033  100  100  010    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  051  047  030    Pre-fail  Always      -      1383022134666

  9 Power_On_Hours          0x0032  084  084  000    Old_age  Always      -      14866

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      0

12 Power_Cycle_Count      0x0032  100  100  020    Old_age  Always      -      62

183 Runtime_Bad_Block      0x0032  100  100  000    Old_age  Always      -      0

184 End-to-End_Error        0x0032  093  093  099    Old_age  Always  FAILING_NOW 7

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  100  100  000    Old_age  Always      -      0

189 High_Fly_Writes        0x003a  096  096  000    Old_age  Always      -      4

190 Airflow_Temperature_Cel 0x0022  062  055  045    Old_age  Always      -      38 (Min/Max 30/39)

191 G-Sense_Error_Rate      0x0032  100  100  000    Old_age  Always      -      0

192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      11

193 Load_Cycle_Count        0x0032  081  081  000    Old_age  Always      -      38294

194 Temperature_Celsius    0x0022  038  045  000    Old_age  Always      -      38 (0 17 0 0)

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  200  000    Old_age  Always      -      0

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      182291296946429

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      72536194288

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      155726243753

 

SMART Error Log Version: 1

ATA Error Count: 5

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 5 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.856  READ DMA EXT

  ef 10 02 00 00 00 a0 00  44d+20:48:56.856  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 00  44d+20:48:56.856  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00  44d+20:48:56.855  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00  44d+20:48:56.855  SET FEATURES [set transfer mode]

 

Error 4 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.731  READ DMA EXT

  ef 10 02 00 00 00 a0 00  44d+20:48:56.731  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 00  44d+20:48:56.731  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00  44d+20:48:56.730  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00  44d+20:48:56.730  SET FEATURES [set transfer mode]

 

Error 3 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.606  READ DMA EXT

  ef 10 02 00 00 00 a0 00  44d+20:48:56.606  SET FEATURES [Reserved for Serial ATA]

  27 00 00 00 00 00 e0 00  44d+20:48:56.606  READ NATIVE MAX ADDRESS EXT

  ec 00 00 00 00 00 a0 00  44d+20:48:56.605  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00  44d+20:48:56.605  SET FEATURES [set transfer mode]

 

Error 2 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00  44d+20:48:56.451  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.450  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.443  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.437  READ DMA EXT

  25 00 08 ff ff ff ef 00  44d+20:48:56.430  READ DMA EXT

 

Error 1 occurred at disk power-on lifetime: 1076 hours (44 days + 20 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 00 00 00 00  Error: UNC at LBA = 0x00000000 = 0

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  c8 00 08 10 b2 06 e0 00  44d+20:48:55.925  READ DMA

  ca 00 08 98 8f 00 e0 00  44d+20:48:21.660  WRITE DMA

  c8 00 08 98 8f 00 e0 00  44d+20:48:21.660  READ DMA

  ca 00 c0 d8 8e 00 e0 00  44d+20:48:21.659  WRITE DMA

  ca 00 08 d0 8e 00 e0 00  44d+20:48:21.659  WRITE DMA

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

