Preclear.sh results - Questions about your results? Post them here.

jbuszkie · July 24, 2009

In an effort to keep the Preclear script thread more about questions about the script itself, I've started another thread here to discuss the results. The preclear thread is peppered with result questions and questions about the script and is now 15 pages long! So I'm thinking that a separate thread was warranted. So I'll start it off...

After running 3 iterations on my new 1TB green disk I had

< 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0

---

> 5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 5

64c64

< 196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0

---

> 196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1

Are 5 reallocated sectors anything to worry about? I was hoping for 0!

This is still running on the old version of the script.. Maybe I should try the new version.. (I started my test the morning before Joe posted the new version!) I did start a cycle again on a different controller (one cycle this time - and still the old script)

Another thought... Should we start a new thread for preclear disk result questions and keep this thread for questions/comments about the functionality of preclear?

Jim

If it stays at 5, in my opinion, no problem. If it increases over time, then you might want to use the RMA process. Odds are good it will stabilize. I have one 250Gig drive that has had 100 reallocated sectors since the first time I ran smartctl on it. That number has never changed on that disk.

I'd say, download the new version of preclear_disk.sh and run another set of test cycles and see if it shows an increase in re-allocated sectors. (the new version stress-tests the drive more. The old one had a bug that prevented the random cylinders from being read in addition to the linear read that was properly occurring) If the number stays at 5, fine, if not another test cycle might be in order. At that point you have all the evidence you need if an RMA is warranted.

You might want to start a thread with your preclear experience. It will allow the questions about the output to all be in one spot.

Joe L.

Ok.. I ran one more full cycle with the new version of the script and I got no reallocated sector changes. Should I run once more or do you think I'm good now and can put the disk into service?

So... first 3 cycles. - 5 reallocated sectors

4th cycle - no more reallocated sectors.

Jim

SSD · July 24, 2009

Experience here has been that ANY reallocated sector count is a bad sign. I agree that if it holds stable (even at 100 or more) it is nothing to worry about, but experience here has shown that even a small number of reallocated sectors usually lead to more (and more and more ...). You might think of it like a string hanging from your favorite shirt. Pull on it and the entire shirt will unravel.

The fact that you've run several cycles and the number has held steady is comforting and not typical of the unraveling behavior. I'd still recommend diligence in making sure that the count doesn't increase further.

Joe L. · July 24, 2009

If you need the space, and need it now, go ahead and assign it to the array.

If not in a real rush, let it run another cycle or two, or overnight. Remember, you did 3 cycles to identify the first 5 sectors... you do not know if they all showed up in the first cycle, or the third.

It is good that no more bad sectors were identified.

Glad it is working for you. How long did it take to run a cycle on the 1TB drive in your server?

Joe L.

jbuszkie · July 24, 2009

Each cycle takes just about 12 hours. I'm in no immediate rush, so I just kicked off another cycle. Maybe an interesting addition to the script would be to save the SMART data after every cycle so we can see when the events happened. When I ran the first 3 cycles, I don't know if the events happened in the 1st, 2nd, or 3rd cycle.

Jim
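A minimal sketch of that idea, not a feature of preclear_disk.sh: assuming the output of `smartctl -A` is saved to a file after each cycle, the raw value of an attribute can be pulled out of each snapshot and compared. The helper names `smart_raw` and `smart_changed` and the snapshot file names are hypothetical; the sample lines are the two attributes quoted at the top of this thread.

```shell
# Hypothetical helpers, not part of preclear_disk.sh.

# Print the raw value (last column) of a SMART attribute, selected by ID,
# from a saved "smartctl -A" report.
smart_raw() {
    awk -v id="$1" '$1 == id { print $NF }' "$2"
}

# Report whether an attribute's raw value differs between two snapshots.
smart_changed() {
    sc_before=$(smart_raw "$1" "$2")
    sc_after=$(smart_raw "$1" "$3")
    if [ "$sc_before" = "$sc_after" ]; then
        echo "attribute $1 unchanged at $sc_after"
    else
        echo "attribute $1 changed: $sc_before -> $sc_after"
    fi
}

# Sample snapshots built from the attribute lines quoted in the first post.
snap_dir=$(mktemp -d)
cat > "$snap_dir/before.txt" <<'EOF'
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
EOF
cat > "$snap_dir/after_cycle3.txt" <<'EOF'
  5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 5
196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1
EOF

smart_changed 5 "$snap_dir/before.txt" "$snap_dir/after_cycle3.txt"
smart_changed 196 "$snap_dir/before.txt" "$snap_dir/after_cycle3.txt"
```

On a real system the snapshots would come from something like `smartctl -A /dev/sdX > after_cycleN.txt` run after each cycle. Taking the last column as the raw value works for simple counters like these two, but some attributes print extra text after the number, so treat this as a starting point only.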

Guzzi · July 24, 2009

Hi, I have successfully precleared a disk, but got SMART differences as below. Is this something I have to worry about, or can I use this disk? I noticed some interface errors in the log at the very beginning, but no errors in the script.

Thanks, Guzzi

============================================================================

==

== Disk /dev/sdq has been successfully precleared

==

============================================================================

S.M.A.R.T. error count differences detected after pre-clear

note, some 'raw' values may change, but not be an indication of a problem

62,63c62,63

< 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 31

< 193 Load_Cycle_Count 0x0032 192 192 000 Old_age Always - 25344

---

> 192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 32

> 193 Load_Cycle_Count 0x0032 192 192 000 Old_age Always - 25345

============================================================================

Joe L. · July 24, 2009

This is a new one to me... According to a "google" search on "Power-Off_Retract_Count", I got the following

[pre]

# Power-Off_Retract_Count = No of times drive was powered off in an emergency, called Emergency Unload.

# Load_Cycle_Count = This number is highly affected by your power management policies. For e.g. a too aggressive power management might put hard disk to sleep too often. This number is indicative of when your hard disk parks, unparks , spins up, spins down.

[/pre]

So, reading between the lines... unless you powered down the disk while it was being cleared, it either *thought* it had lost power, or it really did lose power.

It retracted the disk heads in an emergency-unload, thinking it had lost power, then loaded them again once it thought power had been restored.

I'd check the system log for any other errors while the drive was being cleared. I'd also check any power connectors or "Y" splitters. They can be intermittent.

Joe L.

Guzzi · July 24, 2009

Hi Joe,

Checking the power connectors is no problem - I can do that.

I checked the syslog several times during the preclear, and except for the very first minutes (some drive not ready) there was nothing special.

But it seems that a lot happened during the post-read, which I do not understand. Could you have a look at the log? It covers the complete preclear process from beginning to end.

Thanks, Guzzi

Joe L. · July 24, 2009

You have several drives with errors, not just the one you are trying to clear... and it looks like you are running out of memory too.

Are you running any add-on packages? (other than the pre-clear) The user-share file system is constantly reporting it cannot allocate memory.

How much RAM are you running?

I can't go into detail now... Perhaps RobJ can take a look and provide his input. Perhaps send him a PM and ask him to take a look.

Joe L.

Guzzi · July 24, 2009

I have 2 GB RAM in the box:

(from /usr/bin/top -b -n1)

top - 01:15:04 up 1:13, 0 users, load average: 3.94, 4.00, 3.73

Tasks: 73 total, 2 running, 71 sleeping, 0 stopped, 0 zombie

Cpu(s): 7.8%us, 60.5%sy, 0.0%ni, 22.3%id, 5.0%wa, 0.6%hi, 3.7%si, 0.0%st

Mem: 1943344k total, 1617648k used, 325696k free, 39868k buffers

Swap: 0k total, 0k used, 0k free, 1481180k cached

(Did a reboot after I saw those kernel things in syslog - never had that before, just during this specific preclear)

Add-ons: I have disabled cache_dirs to keep memory free while moving data to the box. Here is the go script:

#!/bin/bash

# Start the Management Utility

/usr/local/sbin/emhttp &

cd /boot/packages && find . -name '*.auto_install' -type f -print | sort | xargs -n1 sh -c

# Unraid_Notify (E-Mail Notification)

#installpkg /boot/packages/socat-1.7.0.0-i486-2bj.tgz

#installpkg /boot/packages/unraid_notify-2.30-noarch-unRAID.tgz

installpkg /boot/packages/acpitool-0.4.7-i486-1goa.tgz

#unraid_notify start

sleep 30

# enable wakeup

/usr/sbin/ethtool -s eth0 wol g

# Start UnMenu

/boot/unmenu/uu

I have to say that I was constantly moving data to the box while clearing the disk - maybe the problems with the disk blocked the copy process?

Do I need to upgrade the RAM to 4 GB?

RobJ · July 24, 2009

That syslog is a mess! And it's only the latter part too, it is missing the 600 to 900 odd lines of system setup at the beginning.

The drive with ID of sdn probably has a poor quality cable. I would replace it if at all possible.

And Joe is right, there were page allocation failures for many subsystems, including the share file system, Samba, and possibly involving the networking and Reiser file system modules, which is worrying. In this piece of the syslog, I don't see any kernel panics, so I don't think we can say for sure that there is any damage, such as evidence of flaky memory, or corrupted Reiser file systems, but I never fully trust a system that has crashed. Always better to restart fresh. I certainly would not try to run anything important, once I saw the first sign of suspicious system operation. Those 'Call Traces' definitely qualify as suspicious system operation. Grabbing the syslog and waiting for advice was the correct thing to do.

Even though I saw no 'panics' here, to be safe, I would reboot and run a full memory test first, then run reiserfsck on each of the data drives (see the Check Disk File systems page for instructions). I'm sorry, it is somewhat time-consuming, but it is better to be safe. The memory test is probably not needed, so you can postpone it if you wish, but I like to be thorough, and know whether a system is truly trustworthy, especially when I have just had extensive memory-related problems. I would like to say test only the data drives you were actually using, but it appears that there were numerous spin downs to many drives, and the mover ran at least twice, so it looks like all or most of your drives may have been written to.

2 GB of memory should have been more than enough. I can't see any reason so far for the problems, at least not from this syslog.

Joe L. · July 25, 2009

Part of the original cache_dirs script set the cache pressure to 0. I've since learned that this value does NOT free up RAM when other processes need it.

Even if you had stopped cache_dirs, the memory Linux allocated for cache would not have been freed.

You would need to type something like:

sysctl vm.vfs_cache_pressure=10

to allow it to use the memory it had put into cache. (The most recent version of cache_dirs fixed that and uses cache_pressure=5 by default.)
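As a rough illustration of what the different values mean, here is a small sketch based on the kernel's sysctl documentation for `vm.vfs_cache_pressure` (default 100; 0 means dentry and inode caches are never reclaimed under memory pressure). The helper name `describe_cache_pressure` is made up for this example, and it only reads the setting; it does not change anything.

```shell
# Hypothetical helper: describe a vfs_cache_pressure value in the terms used
# by the kernel's sysctl documentation (the default value is 100).
describe_cache_pressure() {
    if [ "$1" -eq 0 ]; then
        echo "never reclaim dentry/inode cache"
    elif [ "$1" -lt 100 ]; then
        echo "prefer to keep dentry/inode cache"
    elif [ "$1" -eq 100 ]; then
        echo "default reclaim rate"
    else
        echo "prefer to reclaim dentry/inode cache"
    fi
}

# Read the live value where /proc is available; otherwise assume the default.
current=$(cat /proc/sys/vm/vfs_cache_pressure 2>/dev/null || echo 100)
echo "vfs_cache_pressure=$current: $(describe_cache_pressure "$current")"

# The values mentioned in this thread, plus the kernel default.
for value in 0 5 10 100; do
    echo "$value: $(describe_cache_pressure "$value")"
done
```

Changing the value still needs the `sysctl` command shown above, run as root, and the setting does not survive a reboot unless it is added to the go script.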

More memory might help, but 2 Gig should be plenty. Your first priority should be the disk errors.

These errors are for /dev/sdn:

Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: BMDMA2 stat 0xd0009
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: cmd 25/00:00:30:8a:06/00:02:00:00:00/e0 tag 0 dma 262144 in
Jul 23 03:34:19 XMS-GMI-01 kernel:          res 51/04:7f:b1:8b:06/00:00:00:00:00/f0 Emask 0x1 (device error)
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: status: { DRDY ERR }
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: error: { ABRT }
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: configured for UDMA/100
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12: EH complete
Jul 23 03:34:19 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
Jul 23 03:34:19 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] Write Protect is off
Jul 23 03:34:19 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] Mode Sense: 00 3a 00 00
Jul 23 03:34:19 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12.00: BMDMA2 stat 0xd0009
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12.00: cmd 25/00:00:30:dc:12/00:02:00:00:00/e0 tag 0 dma 262144 in
Jul 23 03:34:32 XMS-GMI-01 kernel:          res 51/04:00:2f:de:12/00:00:00:00:00/f0 Emask 0x1 (device error)
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12.00: status: { DRDY ERR }
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12.00: error: { ABRT }
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12.00: configured for UDMA/100
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12: EH complete
Jul 23 03:34:32 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] 1953525168 512-byte hardware sectors: (1.00 TB/931 GiB)
Jul 23 03:34:32 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] Write Protect is off
Jul 23 03:34:32 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] Mode Sense: 00 3a 00 00
Jul 23 03:34:32 XMS-GMI-01 kernel: sd 12:0:0:0: [sdn] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
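To see at a glance which ports are affected in a long syslog, the exception lines can be counted per ATA device. This is only a sketch: the helper name `count_ata_exceptions` is hypothetical, it assumes the standard syslog layout shown above (device name in the sixth field), and the sample file contains just a few lines copied from this excerpt.

```shell
# Hypothetical helper: count "exception Emask" lines per ATA device in a
# syslog file. Assumes the device name (e.g. "ata12.00:") is the 6th field.
count_ata_exceptions() {
    awk '/kernel: ata[0-9.]+: exception Emask/ {
             dev = $6
             sub(/:$/, "", dev)      # strip the trailing colon
             count[dev]++
         }
         END { for (d in count) print d, count[d] }' "$1" | sort
}

# Sample input: a few lines copied from the excerpt above.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12.00: error: { ABRT }
Jul 23 03:34:19 XMS-GMI-01 kernel: ata12: EH complete
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Jul 23 03:34:32 XMS-GMI-01 kernel: ata12: EH complete
EOF

count_ata_exceptions "$sample_log"
```

Pointed at the real log (typically /var/log/syslog), a device that dominates the list is the first candidate for a cable or port swap.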

These are memory allocation errors:

Jul 24 01:37:13 XMS-GMI-01 kernel: shfs: page allocation failure. order:0, mode:0x4020
Jul 24 01:37:13 XMS-GMI-01 kernel: Pid: 5060, comm: shfs Not tainted 2.6.29.1-unRAID #2
Jul 24 01:37:13 XMS-GMI-01 kernel: Call Trace:
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c0146307>] __alloc_pages_internal+0x33f/0x352
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c015ec2c>] __slab_alloc+0x158/0x42b
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c015fce6>] __kmalloc_track_caller+0x75/0xbe
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c02d8535>] ? __netdev_alloc_skb+0x17/0x34
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c02d8535>] ? __netdev_alloc_skb+0x17/0x34
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c02d8217>] __alloc_skb+0x4a/0x102
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c02d8535>] __netdev_alloc_skb+0x17/0x34
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<f82512fa>] rtl8169_rx_fill+0x91/0x144 [r8169]
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<f82516cf>] rtl8169_rx_interrupt+0x322/0x379 [r8169]
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<f825276c>] rtl8169_poll+0x2f/0x124 [r8169]
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c02df24c>] net_rx_action+0x5d/0x119
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c0124a48>] __do_softirq+0x84/0x121
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c0124b1a>] do_softirq+0x35/0x3a
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c0124d97>] irq_exit+0x38/0x3a
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c0104a69>] do_IRQ+0x67/0x7e
Jul 24 01:37:13 XMS-GMI-01 kernel:  [<c01033a7>] common_interrupt+0x27/0x2c
Jul 24 01:37:13 XMS-GMI-01 kernel: Mem-Info:
Jul 24 01:37:13 XMS-GMI-01 kernel: DMA per-cpu:
Jul 24 01:37:13 XMS-GMI-01 kernel: CPU    0: hi:    0, btch:   1 usd:   0
Jul 24 01:37:13 XMS-GMI-01 kernel: Normal per-cpu:
Jul 24 01:37:13 XMS-GMI-01 kernel: CPU    0: hi:  186, btch:  31 usd: 180
Jul 24 01:37:13 XMS-GMI-01 kernel: HighMem per-cpu:
Jul 24 01:37:13 XMS-GMI-01 kernel: CPU    0: hi:  186, btch:  31 usd: 136
Jul 24 01:37:13 XMS-GMI-01 kernel: Active_anon:1704 active_file:6958 inactive_anon:1964
Jul 24 01:37:13 XMS-GMI-01 kernel:  inactive_file:416907 unevictable:31739 dirty:16436 writeback:1553 unstable:0
Jul 24 01:37:13 XMS-GMI-01 kernel:  free:1895 slab:11856 mapped:1835 pagetables:175 bounce:0
Jul 24 01:37:13 XMS-GMI-01 kernel: DMA free:3488kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:8676kB unevictable:0kB present:15852kB pages_scanned:0 all_unreclaimable? no
Jul 24 01:37:13 XMS-GMI-01 kernel: lowmem_reserve[]: 0 867 1887 1887
Jul 24 01:37:13 XMS-GMI-01 kernel: Normal free:1320kB min:3732kB low:4664kB high:5596kB active_anon:1888kB inactive_anon:2148kB active_file:16288kB inactive_file:772872kB unevictable:40kB present:887976kB pages_scanned:0 all_unreclaimable? no
Jul 24 01:37:13 XMS-GMI-01 kernel: lowmem_reserve[]: 0 0 8158 8158
Jul 24 01:37:13 XMS-GMI-01 kernel: HighMem free:2772kB min:512kB low:1608kB high:2704kB active_anon:4928kB inactive_anon:5708kB active_file:11544kB inactive_file:886080kB unevictable:126916kB present:1044328kB pages_scanned:0 all_unreclaimable? no
Jul 24 01:37:13 XMS-GMI-01 kernel: lowmem_reserve[]: 0 0 0 0
Jul 24 01:37:13 XMS-GMI-01 kernel: DMA: 0*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3488kB
Jul 24 01:37:13 XMS-GMI-01 kernel: Normal: 134*4kB 2*8kB 1*16kB 1*32kB 1*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1304kB
Jul 24 01:37:13 XMS-GMI-01 kernel: HighMem: 23*4kB 13*8kB 15*16kB 33*32kB 8*64kB 2*128kB 2*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2772kB
Jul 24 01:37:13 XMS-GMI-01 kernel: 455669 total pagecache pages
Jul 24 01:37:13 XMS-GMI-01 kernel: 0 pages in swap cache
Jul 24 01:37:13 XMS-GMI-01 kernel: Swap cache stats: add 0, delete 0, find 0/0
Jul 24 01:37:13 XMS-GMI-01 kernel: Free swap  = 0kB
Jul 24 01:37:13 XMS-GMI-01 kernel: Total swap = 0kB
Jul 24 01:37:13 XMS-GMI-01 kernel: 490976 pages RAM
Jul 24 01:37:13 XMS-GMI-01 kernel: 263138 pages HighMem
Jul 24 01:37:13 XMS-GMI-01 kernel: 5140 pages reserved
Jul 24 01:37:13 XMS-GMI-01 kernel: 318096 pages shared
Jul 24 01:37:13 XMS-GMI-01 kernel: 170962 pages non-shared
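One way to read the Mem-Info dump above is to compare each zone's free memory with its `min` watermark. The sketch below does that for the three zone lines, shortened here to their first four values. The helper name `zones_below_min` is hypothetical, and the interpretation is an assumption rather than a diagnosis: a zone sitting below its minimum is consistent with allocations that cannot wait (such as the network driver's receive buffers in the call trace) failing, even though plenty of memory is held in cache.

```shell
# Hypothetical helper: list memory zones whose free memory is below the
# "min" watermark in a kernel Mem-Info dump. Assumes syslog layout with the
# zone name in field 6, "free:NkB" in field 7 and "min:NkB" in field 8.
zones_below_min() {
    awk '$7 ~ /^free:/ && $8 ~ /^min:/ {
             free = $7; min = $8
             gsub(/[^0-9]/, "", free)   # keep only the digits
             gsub(/[^0-9]/, "", min)
             if (free + 0 < min + 0)
                 print $6, free "kB free <", min "kB min"
         }' "$1"
}

# Zone lines from the dump above, truncated after the "high" value.
meminfo_log=$(mktemp)
cat > "$meminfo_log" <<'EOF'
Jul 24 01:37:13 XMS-GMI-01 kernel: DMA free:3488kB min:64kB low:80kB high:96kB
Jul 24 01:37:13 XMS-GMI-01 kernel: Normal free:1320kB min:3732kB low:4664kB high:5596kB
Jul 24 01:37:13 XMS-GMI-01 kernel: HighMem free:2772kB min:512kB low:1608kB high:2704kB
EOF

zones_below_min "$meminfo_log"
```

Only the Normal zone is flagged for this dump, with 1320kB free against a 3732kB minimum.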

Joe L.

Guzzi · July 25, 2009

Thanks Rob, Joe for the feedback.

sdn and sdq are the two drives I do not yet have in the array, because they both showed those errors when I first tried setting up the empty array some weeks ago.

All other drives are in the array and were fine, showing no errors.

Because I didn't trust those 2 drives, I ran the preclear script to be safe - with the result above.

It was the very first time I encountered such memory-related errors; I never had them before. But you're right, I even had problems accessing Samba shares after this.

I restarted the box and everything is fine so far, no errors at all in the syslog (except this DMA-stuff on the IDE-port - " kernel: atiixp 0000:00:14.1: simplex device: DMA disabled").

BTW: starting preclear on either of those 2 unassigned drives gives me those above "ata12.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0" - errors in the log. They do NOT appear during startup.

cache_dirs was not started at all - I removed it from the go script and rebooted before I moved the files. So it definitely cannot be responsible for any memory-related stuff.

I too am worried when I see such things - I think I will remove both of the drives and test them separately to see if they need to be RMAed.

I will also run the memory test and chkdsk on all drives as recommended, to be sure everything is fine.

And yes, there is already data on almost all drives, since I have been moving data over the last few weeks.

Will post after running the tests.

Guzzi

Joe L. · July 25, 2009

The preclear_disk script is very good at exercising (thrashing) a disk. As already said, it is far easier to RMA the drives before they are loaded with your data if you find they do not test well. The errors you saw could be because of bad SATA cables or bad power cables/splitters, or even a bad disk controller. But...

Remember, your SMART report showed an emergency retraction of the heads to a safe landing spot when it thought the drive was losing power in the middle of the preclearing process. That is pretty drastic as it tries to save itself from a head crash.

Is your power supply being overloaded? Are you using a backplane for power distribution? Lots to check out, but at least you are more informed than most Windows users. They just blue-screen.

Joe L.

Guzzi · July 25, 2009

Maybe I wasn't completely clear: I have NO data on those 2 "suspicious" drives (they're unassigned, and I only mounted them temporarily to check that they're empty) - only the array is filled with data (where I haven't encountered problems with the drives so far).

The drives are not new - most of them come from my former Windows box, where they ran as RAID-5 for 1-2 years (I hope the warranty is not over yet ...)

I never got BSODs on the Windows box - but I remember that once or twice drives were showing "yellow" - which was probably the same CRC problem as now.

Nevertheless, I have to admit that unRAID and the Linux tools give much more transparency about what's "really" happening - Windows doesn't help you much with that (just "reactivate" the drive; errors are corrected by the RAID layer anyway).

BTW: I ran the memory test overnight - it passed 8 times without errors. I will chkdsk the drives when I find the time (currently working with my son on his motorcycle ;-))

The biggest hassle with those "many-disk machines" (regardless of Windows, Linux, or something else) is power and cabling - and it is very difficult to diagnose.

Power might be fine for all normal operations - but if you are accessing a disk and at the same time 20 other disks spin up, it might pull the voltage down. I have experienced in the past that HDs are VERY sensitive to voltages below 4.8 V on the 5 V rail - measured at the drive itself, not somewhere else, because you lose voltage along the cables.

Anyway, I thought I was safe, because I ran the Windows box, and now the unRAID box, with the same power supply but 8 fewer drives... so maybe I should check the cables again - the problem seems to be focused on those two ports...

So I don't think it's overloaded power-wise, but unRAID is in a different box with different cables and no power backplane, so there might be issues. I have no choice but to check and solve them, because I plan to add the remaining 4 disks from the Windows RAID to the unRAID array as soon as the 17+ bug is solved...

I hope to soon reach the stage where I can put the box back in the corner and forget it for the next few years ;-)

Guzzi · July 25, 2009

... I'm done... I ran the memory test overnight - it passed 8 times without errors - plus I ran reiserfsck on all data drives - all went through without any errors reported. I also checked the syslog: no errors, neither after boot nor after all those activities.

Anything else I can / should do? So it seems that those problems are all around those 2 drives? If so, I would probably prefer to dispose of them and order 2 new ones - much cheaper than the time it took me to check the whole server ... ;-)

Joe L. · July 26, 2009

That syslog is a mess! And it's only the latter part too, it is missing the 600 to 900 odd lines of system setup at the beginning.

The drive with ID of sdn probably has a poor quality cable. I would replace it if at all possible.

And Joe is right, there were page allocation failures for many subsystems, including the share file system, Samba, and possibly involving the networking and Reiser file system modules, which is worrying. In this piece of the syslog, I don't see any kernel panics, so I don't think we can say for sure that there is any damage, such as evidence of flaky memory, or corrupted Reiser file systems, but I never fully trust a system that has crashed. Always better to restart fresh. I certainly would not try to run anything important, once I saw the first sign of suspicious system operation. Those 'Call Traces' definitely qualify as suspicious system operation. Grabbing the syslog and waiting for advice was the correct thing to do.

Other than trying new cables to those two drives, I think you are on the right track. The preclear script showed you their true colors... If they are under warranty, it is a no-brainer for me. (although I'd try a new data cable anyway if you happen to have one first...) Certainly, if you see the errors starting once more, stop the pre-clear (press Control-C to interrupt the process) and get an RMA number or two.

Joe L.

RobJ · July 28, 2009

So it seems that those problems all revolve around those 2 drives? If so, I would probably prefer to dispose of them and order 2 new ones

I just want to be clear, I did not see any problems with sdq, the other drive, and only what appeared to be cable or connection related errors with sdn. At least from the info I had available, I don't see any problems with either of the drives themselves.

I would like to note that although the Power-Off_Retract_Count did increment, its VALUE did not budge from 200 (its peak or starting value), which indicates to me that this is well within expected values, and probably not a concern (at least as far as you can trust SMART data!).

Guzzi · July 28, 2009

... you couldn't see it, because I didn't access the other drive - as soon as I run e.g. preclear on it, I get the same messages in syslog. I do NOT get any of those errors with any of the other drives (I ran e.g. reiserfsck on all drives except parity).

Cabling is always a mess - I had those problems in the pre-unRAID era (Windows RAID-5 with the free Veritas solution) as well - changed SATA cables, changed power cabling, changed power supply, etc.

The worst problem is those splitters that you just touch and you hear the drive spin down and up again, just because the voltage dropped a bit. This also depends on the brand of the drives - some are more sensitive, some less. At that time I replaced my power cabling, going from the PC-standard stuff to a more solid power distribution - that helped a lot.

Anyway, regarding the current situation: I ordered a new drive yesterday; it will be delivered today and will replace those two drives in question.

I can then test those drives in another box when I have time, to decide whether I can continue using them. If they test OK, I will put them in my backup box later.

Currently my focus is on getting (or keeping) my main box stable and error-free so I can put it "in the corner and forget about it" ;-)

Joe L. · July 28, 2009

If you touch a power cable and can hear any disk spin down, it is NOT that the disk is power sensitive; the disk is losing power. It is doing an emergency head retraction, as it detected the power fluctuation. Yes, each connection introduces a tiny voltage drop, but the resistance of a properly made connection will not affect a disk.

It is not enough to just say that you tied the cables down. Those same loose connections will act up over time as they heat and cool, as they vibrate, even microscopically, as the case vibrates when disks are used.

The pre-clear script moves the disk heads back and forth across the platters more than most other operations, and probably vibrates them a bit more than simply watching a movie, which would read each cylinder in turn (assuming most are not fragmented). It might have vibrated the case enough to make the poor connection reveal itself.

So, to test, stop the array (so no data is being written), then listen for heads unloading while moving all the power cables slightly. Find and fix the poor connection. It might be a splitter, it might be a connector on the power supply, it could even be a connector on a disk drive... but if you do not fix it, it will come back to haunt you some time when it is not convenient... (when another disk really fails, and all of a sudden a single drive failure becomes a multiple drive failure)

It might be a poorly built plug on a power supply, with a poor crimp on the wire... or it could be a microscopic crack on the circuit board of a disk drive, where the power connector soldered connection has failed from being stressed too much. In my case it was a "Y" splitter, but it caused me a lot of hair loss... and I have very little to spare these days.

I did a bit of math on this in another post: http://lime-technology.com/forum/index.php?topic=3211.msg27129#msg27129

One thing I did not think of is a disk might be in the middle of a pre-clear cycle, all the other disks are sleeping, and then you spin up all the other drives. With a poorly configured arrangement of splitters, or a marginal power supply, the sudden voltage drop on the drive being cleared, caused by the huge current spike of all the other drives spinning up, could cause it to think it is losing power and retract its heads.

If you need multiple splitters, and many of us with large arrays do, configure them to use as few as you can. Or purchase 4-way splitters as I did, with fewer connectors to have poor connections.

I know the splitters are very low-tech, but the originals in my server were not very large gauge wire, and had very light-weight pins compared to the ones I replaced them with.

The new 4 way splitters are on the left in the picture below... they are much higher quality construction than the two way splitters I removed from the server on the right. I have a feeling the crimps on the pins of the old splitter's connectors are of poor quality, and intermittent...

Search here for what I purchased: http://www.intrex.com/parts/parts.aspx

This is the part number: ADA-POWYX2

Very decent for $2.99 each. There are two Intrex stores near me, so it is easy to stop by for parts... Besides working better, I'm thinking the new splitters will help to prevent "hair loss"

Joe L.

c234rmf · August 4, 2009

Hi everyone. I replaced my parity drive with a new one, then ran preclear on my old parity drive.

Here are the differences at the start and end of preclear:

=== START OF INFORMATION SECTION ===

Model Family: Seagate Barracuda 7200.11

Device Model: ST3750330AS

Firmware Version: SD15

User Capacity: 750,156,374,016 bytes

Device is: In smartctl database [for details use: -P show]

ATA Version is: 8

ATA Standard is: ATA-8-ACS revision 4

Local Time is: Tue Aug 4 05:02:36 2009 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 87686600

==================

1 Raw_Read_Error_Rate 0x000f 111 099 006 Pre-fail Always - 36177112

7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 16010428

==================

7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 4311566057

189 High_Fly_Writes 0x003a 006 006 000 Old_age Always - 94

==================

189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 120

190 Airflow_Temperature_Cel 0x0022 069 047 045 Old_age Always - 31 (Lifetime Min/Max 26/33)

==================

190 Airflow_Temperature_Cel 0x0022 067 047 045 Old_age Always - 33 (Lifetime Min/Max 26/34)

195 Hardware_ECC_Recovered 0x001a 038 032 000 Old_age Always - 87686600

==================

195 Hardware_ECC_Recovered 0x001a 060 032 000 Old_age Always - 36177112

Looking at the results I think it is time to send this drive back.

Any thoughts on this readout are appreciated!

For what it is worth my new Hitachi Deskstar made it through 1 preclear pass with no changes.

=== START OF INFORMATION SECTION ===

Device Model: Hitachi HDT721010SLA360

Serial Number: STN604MR3X2THP

Firmware Version: ST6OA3AA

User Capacity: 1,000,204,886,016 bytes

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: 8

ATA Standard is: ATA-8-ACS revision 4

Local Time is: Mon Aug 3 02:16:10 2009 EDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0

2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0

3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0

4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 4

5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0

7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0

8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0

9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 9

10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0

12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 4

192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 4

193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 4

196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0

197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0

198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0

199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

Joe L. · August 4, 2009

Looking at the results I think it is time to send this drive back.

Any thoughts on this readout are appreciated!

Here are the differences at the start and end of preclear:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 87686600

==================

1 Raw_Read_Error_Rate 0x000f 111 099 006 Pre-fail Always - 36177112

The read error rate "VALUE" went down... that is probably a good thing.

7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 16010428

==================

7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 4311566057

Although the raw value went up, the "VALUE" is un-changed. RAW_VALUES are used internally by the SMART firmware, no real conclusion can be made here.

189 High_Fly_Writes 0x003a 006 006 000 Old_age Always - 94

==================

189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 120

Ok, you had 26 more high-fly-writes...

190 Airflow_Temperature_Cel 0x0022 069 047 045 Old_age Always - 31 (Lifetime Min/Max 26/33)

==================

190 Airflow_Temperature_Cel 0x0022 067 047 045 Old_age Always - 33 (Lifetime Min/Max 26/34)

Temp went up two degrees.

195 Hardware_ECC_Recovered 0x001a 038 032 000 Old_age Always - 87686600

==================

195 Hardware_ECC_Recovered 0x001a 060 032 000 Old_age Always - 36177112

Hardware error recovery seems to have improved.

At first glance, looks OK to me. I see no compelling reason to RMA it.
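
For anyone who wants to do this start-versus-end comparison mechanically, here is a minimal Python sketch. It assumes the attribute lines look like the ones pasted above (whitespace-separated columns, with everything from the tenth column on treated as RAW_VALUE). The names SmartAttr, parse_attr and compare are made up for illustration; they are not part of preclear_disk.sh or smartctl.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int      # normalized VALUE
    worst: int      # normalized WORST
    thresh: int     # THRESH
    attr_type: str  # "Pre-fail" or "Old_age"
    raw: str        # RAW_VALUE, kept as text because some drives append notes


def parse_attr(line: str) -> SmartAttr:
    # Assumed column order:
    # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE...
    f = line.split()
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]),
                     f[6], " ".join(f[9:]))


def compare(before: str, after: str) -> str:
    # Summarize how one attribute changed between two snapshots.
    b, a = parse_attr(before), parse_attr(after)
    return (f"{b.name}: VALUE {b.value} -> {a.value} ({a.value - b.value:+d}), "
            f"THRESH {a.thresh}, RAW {b.raw} -> {a.raw}")


start = "189 High_Fly_Writes 0x003a 006 006 000 Old_age Always - 94"
end = "189 High_Fly_Writes 0x003a 001 001 000 Old_age Always - 120"
print(compare(start, end))
# prints: High_Fly_Writes: VALUE 6 -> 1 (-5), THRESH 0, RAW 94 -> 120
```

The sketch only reports the change; deciding whether a change matters is still a judgment call, as in the comments above.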

Joe L.

c234rmf · August 5, 2009

Thanks for your reply.

I was focusing on the high fly writes. I thought these were serious errors.

I also thought that since the value was 001 and the threshold was 000 that I was close to running out of "allowed" high fly writes.

RobJ · August 5, 2009

No, I don't know of any reason (for now) that High_Fly_Writes should be given any significance at all. They are not a Pre-fail item, so they aren't considered critical, and don't impact the Pass/Fail health test of the drive. I'm going to suggest to Joe and Brian that they consider dropping any checks of 189 High_Fly_Writes and all of the 240's, as there is no reason to unnecessarily alarm users.
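
As a rough illustration of why a VALUE of 001 against a THRESH of 000 is not a failure: the smartctl man page describes an attribute as failed when its normalized VALUE is less than or equal to THRESH. A small Python sketch of that rule follows. The function names are hypothetical, and treating only Pre-fail attributes as relevant to the health result is the assumption made in this post, not something read from drive firmware.

```python
def attribute_failed(value: int, thresh: int) -> bool:
    # smartctl's documented rule: failed when VALUE <= THRESH.
    return value <= thresh


def affects_health(attr_type: str, value: int, thresh: int) -> bool:
    # Assumption from this thread: only Pre-fail attributes count
    # toward the drive's overall Pass/Fail result.
    return attr_type == "Pre-fail" and attribute_failed(value, thresh)


# High_Fly_Writes after the preclear: Old_age, VALUE 001, THRESH 000
print(attribute_failed(1, 0))           # prints: False
print(affects_health("Old_age", 1, 0))  # prints: False
```

Since VALUE bottoms out at 001 on this drive, it can never reach a threshold of 000.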

In researching through SMART reports that I have seen, only the latest large Seagate drives actually use the High_Fly_Writes attribute. They appear to be counting something associated with High Fly Writes in the RAW_VALUE. VALUE and WORST are just 100 minus the count in RAW_VALUE, until RAW_VALUE reaches 99, and VALUE and WORST hit bottom at 001. The threshold THRESH is only used for items marked as Pre-fail, and is otherwise usually meaningless. It sometimes appears to have a good practical threshold value, which I assume means "it is preferable that WORST stay above this value", but has no other impact. In this case, since VALUE drops from 100 to 001, I have to assume that they haven't fully implemented this attribute, or haven't decided on a reasonable scale of values. Future drives will undoubtedly change how High_Fly_Writes is implemented and counted and scaled. Somewhere, Brian did some research on it, and had some comments, but I don't know where they are.
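
The scaling described above can be written out as a one-line formula. This is only a sketch of the pattern observed in these reports, not a documented Seagate algorithm, and high_fly_value is a hypothetical name.

```python
def high_fly_value(raw_count: int) -> int:
    # Observed pattern: VALUE = 100 - RAW_VALUE, bottoming out at 001.
    return max(1, 100 - raw_count)


# The two readings posted earlier in this thread:
print(high_fly_value(94))   # prints: 6
print(high_fly_value(120))  # prints: 1
```

Both results match the posted report (006 at a raw count of 94, 001 at a raw count of 120).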

japilch · August 9, 2009

OK, I have a 1TB EADS drive. The first preclear gave more error results, but I had a power cut before I could record them. Since then I have run it several more times, and the only error count that seems to be increasing is the cycle count. Is this something I should worry about?

screen cap from 7/30

screen cap from 8/09

below are the syslog links as well

RobJ · August 9, 2009

No problems that I can see. Neither VALUE nor WORST dropped from 200, so it is still considered essentially perfect.

At some point, it would be good for someone to prepare a table of these attributes, with expected ranges, and comments on the seriousness (or lack thereof) of particular results. Unfortunately, the table would have to take into account all of the inconsistencies between brands and models.
