[SOLVED] Parity disk "hangs" randomly while copying...



Hello guys !!!

 

Three weeks ago I set up my new unRAID server, and in those three weeks my parity drive has been marked as invalid three times.

It always happens while I'm migrating my movies from the old server to the new unRAID server. I always copy the movies over my gigabit LAN.

 

After that, the only thing I can do is reboot the server. After the reboot, unRAID always detects the parity disk as a new parity disk (???) and rebuilds parity. Once the rebuild finishes, the server runs perfectly, without any problems.

The failure is random: sometimes the error appears several hours after I start copying the movies, sometimes only minutes after, and sometimes not at all, and the copy finishes without problems.

 

My unRAID server is built with:

 

Mobo: GIGABYTE EP45-UD3R (HPA disabled)

CPU: Intel Celeron E3400

RAM: 2Gb DDR2-800 OCZ

PSU: CORSAIR CMPSU-750TX 750W

Case: Sharkoon Rebel12

Controller Cards: 1 x Promise FASTTRAK TX2650 - PCI-E x1 (parity + cache)

                      3 x Promise SAT300 TX4 - PCI

Backplanes: 4 x ICYDOCK 5in3

Hard Drives: Parity disk: WD Caviar black 2Tb.

                Cache disk: Seagate 500Gb.

                Data disks: Migrating from old server. Currently 3 x WD EARS 2Tb + 6 x SEAGATE 1.5Tb + 2 x SEAGATE 1Tb

Unraid OS: Version 4.6

 

I'm a complete newbie with unRAID and I don't understand the system log (I've attached the syslog of the last hang, from this morning).

 

Please, can you help me?

 

Thanks in advance...

 

P.S. Sorry for my poor English... :'(

syslog-2011-01-24.zip


If the parity disk is being marked as "INVALID" then "writes" to it are failing.

 

You have either a bad disk, or a bad cable to the disk, or a bad disk controller port, or a loose cable (data or power) to the disk.

 

Only physical inspection and/or replacement will determine if it is a cable.

 

Substitution will determine the others.

 

Joe L.


Your problems in the log start here with an ICRC error (a checksum error communicating with the drive):

Jan 24 11:59:39 HDSERVER kernel: ata7.00: exception Emask 0x0 SAct 0x7bebffff SErr 0x0 action 0x6

Jan 24 11:59:39 HDSERVER kernel: ata7.00: irq_stat 0x41000000

Jan 24 11:59:39 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED

Jan 24 11:59:39 HDSERVER kernel: ata7.00: cmd 60/90:e8:0f:00:71/00:00:6e:00:00/40 tag 29 ncq 73728 in

Jan 24 11:59:39 HDSERVER kernel:          res 41/84:00:8f:fd:70/5e:00:6e:00:00/40 Emask 0x410 (ATA bus error) <F>

Jan 24 11:59:39 HDSERVER kernel: ata7.00: status: { DRDY ERR }

Jan 24 11:59:39 HDSERVER kernel: ata7.00: error: { ICRC ABRT }

Jan 24 11:59:39 HDSERVER kernel: ata7: hard resetting link

Jan 24 11:59:39 HDSERVER kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)

Jan 24 11:59:39 HDSERVER kernel: ata7.00: configured for UDMA/133

Jan 24 11:59:39 HDSERVER kernel: ata7: EH complete

Jan 24 12:04:25 HDSERVER kernel: ata7.00: exception Emask 0x0 SAct 0x2b7fff SErr 0x0 action 0x6 frozen

Jan 24 12:04:25 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED

Jan 24 12:04:25 HDSERVER kernel: ata7.00: cmd 60/a8:00:e7:ef:f3/00:00:6e:00:00/40 tag 0 ncq 86016 in

Jan 24 12:04:25 HDSERVER kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }

Jan 24 12:04:25 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED

Jan 24 12:04:25 HDSERVER kernel: ata7.00: cmd 60/00:08:8f:f1:f3/02:00:6e:00:00/40 tag 1 ncq 262144 in

Jan 24 12:04:25 HDSERVER kernel:          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }

 

It is followed by many read and write errors to the drive.

Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544472/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544480/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544488/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544496/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544504/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544512/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544520/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544528/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489624/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489632/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489640/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489648/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489656/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: recovery thread woken up ...
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489664/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489672/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489680/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489688/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489696/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489704/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489712/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489720/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489728/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error

 

First thing to try... a different cable to the parity disk.

 

Joe L.
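
For anyone who would rather not read a long syslog by eye, here is a minimal sketch of how the errors quoted above could be tallied. It assumes the log lines look exactly like the excerpt in this post (the `ataN.NN: error: { ... }` and `md: diskN read/write error` forms); other kernel versions may word them differently. The function name `summarize` and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Patterns modelled on the excerpt quoted in this thread (an assumption:
# other kernels or drivers may format these messages differently).
ATA_ERROR = re.compile(r"kernel: (ata\d+(?:\.\d+)?): error: \{ ([A-Z ]+) \}")
MD_ERROR = re.compile(r"kernel: md: (disk\d+) (read|write) error")


def summarize(lines):
    """Count ATA error flags and md read/write errors per device."""
    counts = Counter()
    for line in lines:
        ata = ATA_ERROR.search(line)
        if ata:
            # One line can carry several flags, e.g. "ICRC ABRT".
            for flag in ata.group(2).split():
                counts[(ata.group(1), flag)] += 1
            continue
        md = MD_ERROR.search(line)
        if md:
            counts[(md.group(1), md.group(2))] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Jan 24 11:59:39 HDSERVER kernel: ata7.00: error: { ICRC ABRT }",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error",
]

for (device, kind), n in sorted(summarize(sample).items()):
    print(device, kind, n)
```

To run it on a real log, pass an open file instead of the sample list, for example `summarize(open("syslog"))`, where the file name is whatever your extracted syslog is called.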


JoeL. already pointed out the errors, the only thing I will add is that you should probably try a different slot in your ICYDock bays.  If that still does not help then you need to eliminate the docks altogether and hook the drive up to the motherboard directly.

 

The parity drive and the cache drive are connected to a Promise SATA controller, and those disks are not in the ICY DOCK backplanes.

They are standalone disks in separate bays at the bottom of the case, so the backplanes are not to blame in this case.  ;)

 

So I will try a different SATA cable for the parity drive.

 

Thanks, prostuff1 !!!


Have you performed a thorough check on your "new" system before committing any data to it (especially running MEMTEST at least overnight)?

 

The reasons: you are using a sort of "premium" motherboard (loaded with features) and perhaps "premium" brand memory from OCZ (they are out of the memory business now). Make sure you have the latest BIOS and that your memory modules are running at the recommended voltage (or even slightly above it). Start there, and then move on to new locking SATA cables, checking the power splitters, etc., as already recommended.

 

Good luck


No more CRC errors, so the new cable seems to have fixed that.  But you are right, the drive is hanging, completely unresponsive at the higher levels.  The lower-level SATA link is fine, but there is no response to reads, writes, or even identify requests.  In this latest syslog, the drive was disabled even more quickly than in the previous one.  By the way, when you see "kernel: ata7.00: disabled", you can completely ignore every subsequent error related to that drive, which is over 99.9% of the rest of both syslogs.

 

Why it hangs is a mystery.  It could be the drive itself, could be an incompatibility of that drive model with that disk controller, could be a power issue to that drive, or a problem with that SATA port or controller.  As others have suggested, reconnect the drive to a completely different disk controller, and test again.  And check the power connection to it, try a different one if possible.  These tests should eliminate a few of the possibilities.
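
As a rough illustration of the "ignore everything after disabled" tip above, the sketch below drops every later `ataN` line once that port has been reported as disabled. It assumes the disabled message has the form quoted above (`kernel: ata7.00: disabled`). It only filters lines that name the ATA port; the `md: disk0 ...` errors are not filtered, because the mapping from md disk number to ATA port is not visible in the log line itself. The function name and the sample lines (including their timestamps) are invented for the example and are not taken from the attached syslogs.

```python
import re

# "ata7.00: disabled" -> port "ata7" (format assumed from the message quoted above)
DISABLED = re.compile(r"kernel: (ata\d+)\.\d+: disabled")
# Any kernel line that names an ATA port, e.g. "ata7:" or "ata7.00:"
ATA_PORT = re.compile(r"kernel: (ata\d+)[.:]")


def drop_after_disabled(lines):
    """Keep lines up to and including 'disabled'; skip later lines for that port."""
    disabled_ports = set()
    kept = []
    for line in lines:
        port = ATA_PORT.search(line)
        if port and port.group(1) in disabled_ports:
            continue  # noise: the port was already disabled
        kept.append(line)
        hit = DISABLED.search(line)
        if hit:
            disabled_ports.add(hit.group(1))
    return kept


# Illustrative lines only (hypothetical timestamps and ordering).
sample = [
    "Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }",
    "Jan 24 12:05:12 HDSERVER kernel: ata7.00: disabled",
    "Jan 24 12:05:12 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED",
    "Jan 24 12:05:13 HDSERVER kernel: ata8.00: status: { DRDY }",
]

for line in drop_after_disabled(sample):
    print(line)
```

What remains after filtering is the part of the log that actually explains why the drive was dropped, which is usually only the first few entries.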


...reconnect the drive to a completely different disk controller, and test again.  And check the power connection to it, try a different one if possible...

 

Thanks, RobJ.

 

I'll keep trying. I will check the power cables/splitters, connect the drive to another port on the controller, and then connect it to a different controller. I will report the results and attach new syslogs...

 

Now the system is building parity again. The parity build always finishes OK. It's strange that the problem only appears when I'm copying movies and never when building parity... Don't you think so ???

 

Thanks again !!!


Now the system is building parity again. The parity build always finishes OK. It's strange that the problem only appears when I'm copying movies and never when building parity... Don't you think so ???

 

1. You have a "premium" Gigabyte board (extra hardware features and extra BIOS options)

Make sure you have the latest BIOS, disable the unused hardware features (serial and parallel ports, audio, floppy, IDE controllers, FireWire, etc.) and connect the parity drive to one of the six primary SATA ports on the motherboard to ensure greater compatibility with your WD 2TB Black parity drive.

 

2. You also possibly have "premium"-grade OCZ memory. Make sure it runs at its designated voltage (some of the OCZ crap requires 2.2-2.3V to run, compared to the standard 1.8V); you can even add a small bump (+0.05V) to ensure greater stability.

Then perform a mandatory overnight "MEMTEST".

 

This is a possible reason why the errors only happen when you copy movies: unRAID will use all the available memory as a buffer at that time.

 

  • 2 weeks later...

Finally solved...

 

1. Changed the memory: same problem.

2. Changed the data and power cables: no luck.

3. Connected the parity drive to the motherboard SATA controller: CRC errors gone !!!

 

The problem was the Promise TX2650 SATA controller, where the parity drive was connected.

 

So I replaced the Promise SATA controller with a SIL-based SATA controller (DAWICONTROL, a German brand) and bought a high-quality SATA cable.

 

The system now runs perfectly. No CRC errors and no hangs after 24h of transferring data from the old server to the new server.

 

So... I think the problem is finally solved !!!

 

Thank you all again !!!

 

;)
