alikuenkano Posted January 24, 2011

Hello guys!

Three weeks ago I installed my new unRAID server, and in those three weeks my parity drive has been marked as invalid three times. It always happens while I'm migrating my movies from the old server to the new unRAID server; I always copy the movies over my gigabit LAN. When it happens, all I can do is reboot the server, and after the reboot unRAID always detects the parity disk as a new parity disk and rebuilds parity. After that, the server runs perfectly, without any problem.

The error is random. Sometimes it appears several hours after I start copying the movies, other times only minutes after, and other times it does not appear at all and the copy finishes without problems.

My unRAID server is made with:

Mobo: GIGABYTE EP45-UD3R (HPA disabled)
CPU: Intel Celeron E3400
RAM: 2 GB DDR2-800 OCZ
PSU: CORSAIR CMPSU-750TX 750W
Case: Sharkoon Rebel12
Controller cards: 1 x Promise FASTTRAK TX2650 - PCI-E x1 (parity + cache), 3 x Promise SAT300 TX4 - PCI
Backplanes: 4 x ICYDOCK 5in3
Hard drives:
Parity disk: WD Caviar Black 2 TB
Cache disk: Seagate 500 GB
Data disks: migrating from the old server. Currently 3 x WD EARS 2 TB + 6 x Seagate 1.5 TB + 2 x Seagate 1 TB
unRAID OS: Version 4.6

I'm a complete newbie with unRAID and I don't understand the system log (I've attached the syslog of the last hang, from this morning). Can you please help me?

Thanks in advance...

P.S. Sorry for my poor English... :'(

Attachment: syslog-2011-01-24.zip
Joe L. Posted January 24, 2011

If the parity disk is being marked as "INVALID" then "writes" to it are failing.

You have either a bad disk, or a bad cable to the disk, or a bad disk controller port, or a loose cable (data or power) to the disk. Only physical inspection and/or replacement will determine if it is a cable. Substitution will determine the others.

Joe L.
Joe L. Posted January 24, 2011

Your problems in the log start here with an ICRC error (a checksum error communicating with the drive):

Jan 24 11:59:39 HDSERVER kernel: ata7.00: exception Emask 0x0 SAct 0x7bebffff SErr 0x0 action 0x6
Jan 24 11:59:39 HDSERVER kernel: ata7.00: irq_stat 0x41000000
Jan 24 11:59:39 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED
Jan 24 11:59:39 HDSERVER kernel: ata7.00: cmd 60/90:e8:0f:00:71/00:00:6e:00:00/40 tag 29 ncq 73728 in
Jan 24 11:59:39 HDSERVER kernel: res 41/84:00:8f:fd:70/5e:00:6e:00:00/40 Emask 0x410 (ATA bus error) <F>
Jan 24 11:59:39 HDSERVER kernel: ata7.00: status: { DRDY ERR }
Jan 24 11:59:39 HDSERVER kernel: ata7.00: error: { ICRC ABRT }
Jan 24 11:59:39 HDSERVER kernel: ata7: hard resetting link
Jan 24 11:59:39 HDSERVER kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jan 24 11:59:39 HDSERVER kernel: ata7.00: configured for UDMA/133
Jan 24 11:59:39 HDSERVER kernel: ata7: EH complete
Jan 24 12:04:25 HDSERVER kernel: ata7.00: exception Emask 0x0 SAct 0x2b7fff SErr 0x0 action 0x6 frozen
Jan 24 12:04:25 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED
Jan 24 12:04:25 HDSERVER kernel: ata7.00: cmd 60/a8:00:e7:ef:f3/00:00:6e:00:00/40 tag 0 ncq 86016 in
Jan 24 12:04:25 HDSERVER kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }
Jan 24 12:04:25 HDSERVER kernel: ata7.00: failed command: READ FPDMA QUEUED
Jan 24 12:04:25 HDSERVER kernel: ata7.00: cmd 60/00:08:8f:f1:f3/02:00:6e:00:00/40 tag 1 ncq 262144 in
Jan 24 12:04:25 HDSERVER kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }

It is followed by many read and write errors to the drive.

Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544472/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544480/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544488/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544496/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544504/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544512/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544520/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe read error: 1861544528/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489624/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489632/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489640/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489648/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489656/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: recovery thread woken up ...
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489664/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489672/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489680/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489688/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489696/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489704/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489712/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489720/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error
Jan 24 12:05:12 HDSERVER kernel: handle_stripe write error: 1861489728/0, count: 1
Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error

First thing to try... a different cable to the parity disk.

Joe L.
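For anyone facing a syslog of this size, the three kinds of events picked out in the post above (ICRC errors, command timeouts, and md read/write errors) can be tallied with a short script instead of scrolling through the file. This is only a minimal sketch: the regular expressions are modeled on the lines quoted in this thread and are an assumption about the log format, the function name `summarize` is made up for illustration, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the syslog lines quoted in this thread (assumed format).
ATA_ERROR = re.compile(r"kernel: (ata\d+)\.\d+: error: \{ ([A-Z ]+) \}")
TIMEOUT = re.compile(r"Emask 0x[0-9a-f]+ \(timeout\)")
MD_ERROR = re.compile(r"kernel: md: (disk\d+) (read|write) error")


def summarize(lines):
    """Count ATA error flags, command timeouts, and md read/write errors."""
    counts = Counter()
    for line in lines:
        ata = ATA_ERROR.search(line)
        if ata:
            # "{ ICRC ABRT }" yields one count for ICRC and one for ABRT.
            for flag in ata.group(2).split():
                counts[f"{ata.group(1)} {flag}"] += 1
        elif TIMEOUT.search(line):
            counts["timeout"] += 1
        else:
            md = MD_ERROR.search(line)
            if md:
                counts[f"{md.group(1)} {md.group(2)} error"] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Jan 24 11:59:39 HDSERVER kernel: ata7.00: error: { ICRC ABRT }",
    "Jan 24 12:04:25 HDSERVER kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 write error",
]
for key, n in sorted(summarize(sample).items()):
    print(f"{key}: {n}")
```

To run it against a real log, pass an open file instead of the sample list, for example `summarize(open("syslog"))`, where the file name is whatever the unzipped attachment is called. A nonzero ICRC count points at the link (cable, port, controller), which is why a cable swap is the first thing to try.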
alikuenkano (Author) Posted January 24, 2011

Thanks for your quick response, Joe L. I will try what you suggest and report the result.

Thanks again!
prostuff1 Posted January 24, 2011

Joe L. already pointed out the errors; the only thing I will add is that you should probably try a different slot in your ICYDOCK bays. If that still does not help, then you need to eliminate the docks altogether and hook the drive up to the motherboard directly.
alikuenkano (Author) Posted January 24, 2011

prostuff1 wrote: "JoeL. already pointed out the errors, the only thing I will add is that you should probably try a different slot in your ICYDock bays. If that still does not help then you need to eliminate the docks altogether and hook the drive up to the motherboard directly."

The parity drive and the cache drive are connected to a Promise SATA controller, and they are not in the ICYDOCK backplanes. They are standalone disks, in separate bays at the bottom of the case. So the backplanes are not to blame in this case.

I will try a different SATA cable for the parity drive.

Thanks, prostuff1!
bcbgboy13 Posted January 24, 2011

Have you performed a good check on your "new" system before committing any data (especially running MEMTEST at least overnight)?

The reasons: you are using a sort of "premium" motherboard (loaded with features) and perhaps "premium" brand memory from OCZ (they are out of the memory business now). Make sure you have the latest BIOS and that your memory modules are running at the recommended voltage (and then even a little more).

You should start from here and then go on to new locking SATA cables, check the power splitters, etc., as recommended already.

Good luck
alikuenkano (Author) Posted January 26, 2011

"So I will try with a different sata cable for the parity drive."

Hello again. I changed the SATA cable connected to the parity drive. But... it happened again after half an hour or so...

Attached is the new syslog after the new hang...

Thanks again...

Attachment: syslog-2011-01-25.zip
RobJ Posted January 26, 2011

No more CRC errors, so the new cable seems to have fixed that. But you are right, the drive is hanging, completely unresponsive at the higher levels. The lower-level SATA link is fine, but there is no response to reads, writes, or even identity requests. In this latest syslog, the drive was disabled even more quickly than in the previous one. By the way, when you see "kernel: ata7.00: disabled", you can completely ignore every subsequent error related to that drive, which is over 99.9% of the rest of both syslogs.

Why it hangs is a mystery. It could be the drive itself, an incompatibility of that drive model with that disk controller, a power issue to that drive, or a problem with that SATA port or controller. As others have suggested, reconnect the drive to a completely different disk controller and test again. Also check the power connection to it, and try a different one if possible. These tests should eliminate a few of the possibilities.
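RobJ's rule of thumb about the "disabled" marker can be applied mechanically: cut the syslog at the first such line and read only what comes before it. The sketch below assumes the marker always looks like the one he quotes, with only the port number varying; the function name `split_at_disabled` and the timestamp on the sample marker line are made up for illustration.

```python
import re

# "kernel: ata7.00: disabled" is the marker quoted above; generalizing the
# port number is an assumption about how other systems log it.
DISABLED = re.compile(r"kernel: (ata\d+\.\d+): disabled")


def split_at_disabled(lines):
    """Return (device, lines up to and including the marker, lines after it)."""
    lines = list(lines)
    for i, line in enumerate(lines):
        match = DISABLED.search(line)
        if match:
            return match.group(1), lines[: i + 1], lines[i + 1:]
    # No marker found: everything is still worth reading.
    return None, lines, []


# The timestamp on the "disabled" line is invented for this example.
sample = [
    "Jan 24 12:04:25 HDSERVER kernel: ata7.00: status: { DRDY }",
    "Jan 24 12:05:00 HDSERVER kernel: ata7.00: disabled",
    "Jan 24 12:05:12 HDSERVER kernel: md: disk0 read error",
]
device, relevant, noise = split_at_disabled(sample)
print(device, len(relevant), len(noise))
```

Note that this cuts everything after the marker, including lines about other drives, so it is a first-pass filter for finding the root cause rather than a complete log reader.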
alikuenkano (Author) Posted January 26, 2011

"...reconnect the drive to a completely different disk controller, and test again. And check the power connection to it, try a different one if possible..."

Thanks, RobJ. I'll keep trying. I will check the power cables and splitters, connect the drive to another port on the controller, and then connect the drive to another controller. I will report the results and attach new syslogs...

Now the system is building parity again. The parity build always finishes OK. It is strange that the problem only appears when I'm copying movies and never when building parity... Don't you think so?

Thanks again!
prostuff1 Posted January 26, 2011

It is odd, but I have seen weirder.

The NIC might be going bad or have already gone bad, though that does not explain the disk errors. You could try disabling the onboard NIC and putting in a cheap gigabit NIC (Intel preferably) and see what happens.
bcbgboy13 Posted January 26, 2011

"Now the system is building parity again. The parity building always finishes OK. Is strange that the problem only appears when I'm copying movies and never when building parity... Don't you think so"

1. You have a "premium" Gigabyte board (extra hardware features and extra BIOS options). Make sure you have the latest BIOS, disable the unused hardware features (serial and parallel ports, audio, floppy, IDE controllers, FireWire, etc.), and connect the parity drive to one of the six primary SATA ports on the motherboard to ensure greater compatibility with your WD 2TB Black parity drive.

2. You also possibly have "premium" graded OCZ memory. Make sure it runs at the designated voltage (some of the OCZ crap will require 2.2-2.3V to run, compared to the standard 1.8V), and then you can even add a small bump (+0.05V) to ensure greater stability. Then perform a mandatory overnight "MEMTEST".

This is a possible reason why the errors only happen when you copy movies: unRAID will use all the available memory as a buffer at that time.
alikuenkano (Author) Posted February 4, 2011

Finally solved...

1. Changed the memory: the same problem.
2. Changed the data and power cables: no luck.
3. Connected the parity drive to the motherboard SATA controller: the CRC errors are gone!

The problem was the Promise TX2650 SATA controller, where the parity drive was connected.

So I replaced the Promise SATA controller with a SIL-based SATA controller (DAWICONTROL, a German brand) and bought a high-quality SATA cable. The system now runs perfectly. No CRC errors and no hangs after 24 hours of transferring data from the old server to the new server.

So... I think the problem is finally solved!

Thank you all again!