[SOLVED] Parity disk "hangs" randomly while copying...

alikuenkano · January 24, 2011

Hello guys !!!

Three weeks ago I installed my new unraid server, and in those three weeks my parity drive has been marked as invalid three times.

It always happens while I'm migrating my movies from the old server to the new unraid server. I always copy the movies over my gigabit LAN.

After that, all I can do is reboot the server, and after the reboot unraid always detects the parity disk as a new parity disk and rebuilds parity. After that, the server runs perfectly, without any problem.

It happens randomly: sometimes several hours after I start copying the movies, other times only minutes after. Other times the error does not appear at all, and the copy finishes without problems.

My unraid server is made with:

Mobo: GIGABYTE EP45-UD3R (HPA disabled)

CPU: Intel Celeron E3400

RAM: 2Gb DDR2-800 OCZ

PSU: CORSAIR CMPSU-750TX 750W

Case: Sharkoon Rebel12

Controller Cards: 1 x Promise FASTTRAK TX2650 - PCI-E x1 (parity + cache)

3 x Promise SAT300 TX4 - PCI

Backplanes: 4 x ICYDOCK 5in3

Hard Drives: Parity disk: WD Caviar black 2Tb.

Cache disk: Seagate 500Gb.

Data disks: Migrating from old server. Currently 3 x WD EARS 2Tb + 6 x SEAGATE 1.5Tb + 2 x SEAGATE 1Tb

Unraid OS: Version 4.6

I'm a complete newbie with unraid and I don't understand the system log (I've attached the syslog of the last hang, from this morning).

Please, can you help me?

Thanks in advance...

P.S. Sorry for my poor English... :'(

syslog-2011-01-24.zip

Joe L. · January 24, 2011

If the parity disk is being marked as "INVALID" then "writes" to it are failing.

You have either a bad disk, or a bad cable to the disk, or a bad disk controller port, or a loose cable (data or power) to the disk.

Only physical inspection and/or replacement will determine if it is a cable.

Substitution will determine the others.

Joe L.

Joe L. · January 24, 2011

Your problems in the log start here with an ICRC error (a checksum error communicating with the drive):

Jan 24 11:59:39 HDSERVER kernel: ata7.00: exception Emask 0x0 SAct 0x7bebffff SErr 0x0 action 0x6
Jan 24 11:59:39 HDSERVER kernel: ata7.00: irq_stat 0x41000000
Jan 24 11:59:39 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED
Jan 24 11:59:39 HDSERVER kernel: ata7.00: cmd 60/90:e8:0f:00:71/00:00:6e:00:00/40 tag 29 ncq 73728 in
Jan 24 11:59:39 HDSERVER kernel: res 41/84:00:8f:fd:70/5e:00:6e:00:00/40 Emask 0x410 (ATA bus error) <F>
Jan 24 11:59:39 HDSERVER kernel: ata7.00: status: { DRDY ERR }
Jan 24 11:59:39 HDSERVER kernel: ata7.00: error: { ICRC ABRT }
Jan 24 11:59:39 HDSERVER kernel: ata7: hard resetting link
Jan 24 11:59:39 HDSERVER kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 24 11:59:39 HDSERVER kernel: ata7.00: configured for UDMA/133
Jan 24 11:59:39 HDSERVER kernel: ata7: EH complete
Jan 24 12:04:25 HDSERVER kernel: ata7.00: exception Emask 0x0 SAct 0x2b7fff SErr 0x0 action 0x6 frozen
Jan 24 12:04:25 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED
Jan 24 12:04:25 HDSERVER kernel: ata7.00: cmd 60/a8:00:e7:ef:f3/00:00:6e:00:00/40 tag 0 ncq 86016 in
Jan 24 12:04:25 HDSERVER kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }
Jan 24 12:04:25 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED
Jan 24 12:04:25 HDSERVER kernel: ata7.00: cmd 60/00:08:8f:f1:f3/02:00:6e:00:00/40 tag 1 ncq 262144 in
Jan 24 12:04:25 HDSERVER kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }

It is followed by many read and write errors to the drive.

Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544472/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544480/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544488/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544496/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544504/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544512/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544520/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544528/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489624/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489632/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489640/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489648/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489656/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: recovery thread woken up ...
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489664/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489672/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489680/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489688/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489696/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489704/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489712/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489720/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489728/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error

First thing to try... a different cable to the parity disk.
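For anyone who would rather not read a syslog like this by eye, here is a minimal Python sketch that tallies the two failure types seen above (ICRC errors and command timeouts) per ATA port. It is only an illustration: the function name `summarize` and the regular expressions are my own assumptions, written against the exact line shapes quoted in this post, not an unRAID tool. The `res ...` lines carry no port name, so the sketch attributes them to the most recently seen `ataN` line.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the syslog excerpt quoted above.
PORT = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: ")
ERROR_FLAGS = re.compile(r"error: \{ ([A-Z ]+) \}")

def summarize(lines):
    """Count ICRC errors and timeouts, keyed by (port, kind)."""
    counts = Counter()
    port = None  # last ATA port seen; "res ..." lines do not name one
    for line in lines:
        m = PORT.search(line)
        if m:
            port = m.group(1)
        flags = ERROR_FLAGS.search(line)
        if flags and "ICRC" in flags.group(1).split():
            counts[(port, "ICRC")] += 1
        elif "(timeout)" in line:
            counts[(port, "timeout")] += 1
    return counts

# A few lines copied from the excerpt above.
SAMPLE = [
    "Jan 24 11:59:39 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED",
    "Jan 24 11:59:39 HDSERVER kernel: res 41/84:00:8f:fd:70/5e:00:6e:00:00/40 Emask 0x410 (ATA bus error) <F>",
    "Jan 24 11:59:39 HDSERVER kernel: ata7.00: error: { ICRC ABRT }",
    "Jan 24 12:04:25 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED",
    "Jan 24 12:04:25 HDSERVER kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "Jan 24 12:04:25 HDSERVER kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error",
]

if __name__ == "__main__":
    for (port, kind), n in sorted(summarize(SAMPLE).items()):
        print(port, kind, n)
```

ICRC counts point towards the cable or link, as described above, while timeouts without ICRC suggest looking further at the drive, power, or controller.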

Joe L.

alikuenkano · January 24, 2011

Thanks for your quick response, Joe.L.

I will try what you say, and I will report the result.

Thanks again !!!

prostuff1 · January 24, 2011

JoeL. already pointed out the errors; the only thing I will add is that you should probably try a different slot in your ICYDock bays. If that still does not help, then you need to eliminate the docks altogether and hook the drive up to the motherboard directly.

alikuenkano · January 24, 2011

The parity drive and the cache drive are connected to a Promise SATA controller, and those disks are not in the ICY DOCK backplanes.

They are standalone disks, in separate bays at the bottom of the case. So the backplanes are not to blame in this case.

So I will try a different SATA cable for the parity drive.

Thanks, prostuff1 !!!

bcbgboy13 · January 24, 2011

Have you performed a thorough check of your "new" system before committing any data (especially running MEMTEST at least overnight)?

The reason: you are using a sort of "premium" motherboard (loaded with features) and perhaps "premium" brand memory from OCZ (they are out of the memory business now). Make sure you have the latest BIOS and that your memory modules are running at the recommended voltage (and then even a bit more). You should start from here and then go on to new locking SATA cables, check the power splitters, etc., as already recommended.

Good luck

alikuenkano · January 26, 2011

Hello again.

I proceeded to change the sata cable connected to the parity drive.

But... It happened again after half an hour or so...

Attached is the new syslog after the new hang...

Thanks again...

syslog-2011-01-25.zip

RobJ · January 26, 2011

No more CRC errors, so the new cable seems to have fixed that. But you are right, the drive is hanging, completely unresponsive at the higher levels. The lower level SATA link is fine, but there is no response to reads, writes, or even identity requests. In this latest syslog, it was disabled even more quickly than in the previous one. By the way, when you see "kernel: ata7.00: disabled", you can completely ignore every subsequent error related to that drive, which is over 99.9% of the rest of both syslogs.

Why it hangs is a mystery. It could be the drive itself, could be an incompatibility of that drive model with that disk controller, could be a power issue to that drive, or a problem with that SATA port or controller. As others have suggested, reconnect the drive to a completely different disk controller, and test again. And check the power connection to it, try a different one if possible. These tests should eliminate a few of the possibilities.
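The "ignore everything after disabled" rule of thumb above can be applied mechanically. The following is a rough Python sketch under stated assumptions: the helper name `up_to_disable` is hypothetical, and it simply cuts the log at the first "disabled" line for the given device rather than trying to work out which later lines belong to which drive. The sample lines use "..." in place of timestamps because they are illustrative, not copied from the attached syslogs.

```python
def up_to_disable(lines, device="ata7.00"):
    """Return the log lines up to and including the first
    'kernel: <device>: disabled' line. If the device is never
    disabled, all lines are returned unchanged."""
    marker = f"kernel: {device}: disabled"
    kept = []
    for line in lines:
        kept.append(line)
        if marker in line:
            break  # everything after this is noise for that drive
    return kept

# Illustrative lines only ("..." stands in for the timestamp and host).
EXAMPLE = [
    "... kernel: ata7.00: failed command: READ FPDMA QUEUED",
    "... kernel: ata7.00: disabled",
    "... kernel: md: disk0 read error",
    "... kernel: md: disk0 write error",
]

if __name__ == "__main__":
    for line in up_to_disable(EXAMPLE):
        print(line)
```

Because this cuts the whole log, it also drops later messages about other drives, so it is best used only to find the events leading up to the hang.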

alikuenkano · January 26, 2011

...reconnect the drive to a completely different disk controller, and test again. And check the power connection to it, try a different one if possible...

Thanks, RobJ.

I'll keep trying. I will check the power cables/splitters. I will connect the drive to another port in the controller, and will connect the drive to another controller. I will report the results and I will attach new syslogs...

Now the system is building parity again. The parity build always finishes OK. It is strange that the problem only appears when I'm copying movies and never when building parity... Don't you think so?

Thanks again !!!

prostuff1 · January 26, 2011

It is odd, but I have seen weirder.

The NIC might be going bad or have gone bad as well, though that does not explain the disk errors. You could try disabling the onboard NIC and putting in a cheap gigabit NIC (Intel preferably) to see what happens.

bcbgboy13 · January 26, 2011

1. You have a "premium" Gigabyte board (extra hardware features and extra BIOS options)

Make sure you have the latest BIOS, disable the unused hardware features (serial and parallel ports, audio, floppy, IDE controllers, FireWire, etc.) and connect the parity drive to one of the six primary SATA ports on the motherboard to ensure greater compatibility with your WD 2TB Black parity drive.

2. You also have possibly "premium"-grade OCZ memory. Make sure it runs at the designated voltage (some of the OCZ crap will require 2.2-2.3V to run, compared to the standard 1.8V) and then you can even add a small bump (+0.05V) to ensure greater stability.

Then perform a mandatory overnight "MEMTEST".

This is a possible reason why the errors only happen when you copy movies: Unraid will use all the available memory as a buffer at that time.

alikuenkano · February 4, 2011

Finally solved...

1. Changed the memory: The same problem.

2. Changed the data and power cables: no luck.

3. Connected the parity drive to the motherboard SATA controller: the CRC errors are gone !!!

The problem was the Promise TX2650 SATA controller, where the parity drive was connected.

So I replaced the Promise SATA controller with a SIL-based SATA controller (DAWICONTROL, a German brand) and bought a high-quality SATA cable.

The system now runs perfectly. No CRC errors and no hangs after 24h of transferring data from the old server to the new server.

So... I think the problem is finally solved !!!

Thank you all again !!!
