Parity checks on the Atom are painfully slow. Mine take eleven days...



Hi Everyone,

 

After being told that my eleven-day parity checks are not normal on an Atom, I was asked by Paul to create a topic to try and solve my issue.

 

Current build:

 

Fractal Design Array R2

Intel D510MO (Atom D510 CPU, Intel NM10 chipset)

4 GB DDR2 800 MHz

6x 2 TB HDD

 

What information should I post to help diagnose the issue? 

 

Regards,

Jon


Jun 22 04:40:01 Atlantis syslogd 1.4.1: restart.

Jun 22 21:41:51 Atlantis kernel: mdcmd (47): spindown 1

Jun 22 21:41:51 Atlantis kernel: mdcmd (48): spindown 2

Jun 23 02:53:06 Atlantis kernel: mdcmd (49): spindown 1

Jun 23 02:53:06 Atlantis kernel: mdcmd (50): spindown 2

Jun 25 01:39:58 Atlantis udevd-work[13941]: rename '/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:0:0.udev-tmp' '/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:0:0' failed: No such file or directory

Jun 25 01:41:53 Atlantis udevd-work[14863]: symlink '../../sdd' '/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:0:0.udev-tmp' failed: File exists

Jun 25 02:03:33 Atlantis emhttp: title not found

Jun 25 02:05:30 Atlantis udevd-work[21871]: symlink '../../sdb' '/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:0:0.udev-tmp' failed: File exists

Jun 25 02:06:17 Atlantis udevd-work[23661]: symlink '../../sde' '/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:0:0.udev-tmp' failed: File exists

Jun 25 02:07:25 Atlantis in.telnetd[24494]: connect from 192.168.0.3 (192.168.0.3)

Jun 25 02:07:36 Atlantis login[24495]: ROOT LOGIN  on '/dev/pts/0' from '192.168.0.3'

Jun 25 02:07:50 Atlantis udevd-work[24705]: symlink '../../sdb' '/dev/disk/by-path/pci-0000:05:00.0-scsi-0:0:0:0.udev-tmp' failed: File exists

 

That is what I get.


After being told that my eleven-day parity checks are not normal on an Atom ...

 

Wow ... that is a major understatement!!  An Atom has PLENTY of CPU power for parity computations ... the parity check speed is driven almost exclusively by your disks and disk interfaces.

 

My Atom build is slightly faster than yours (D525 vs D510) ... but that shouldn't really impact things.  And my parity checks with six 3 TB WD Reds take 7:41.

 

Hopefully your full syslog will have some clues as to what's going on.


An Atom has PLENTY of CPU power for parity computations ... the parity check speed is driven almost exclusively by your disks and disk interfaces.

My experience with an Atom D525 is that it has plenty of power... to do ONE thing. If you are running a parity check AND copying or streaming, performance plummets to an absolute minimum. Or if you are watching/streaming a DVD or video etc. and start a copy to or from the array, everything grinds to a complete stop.


One contributory factor might be that the motherboard you mention has a PCI slot (not PCI Express).  This will limit the transfer speed to the drives on the controller in that slot.  Also, it depends on the controller itself, of course.  I guess you have four drives on the PCI controller and only two on the motherboard.  BUT, even taking that into account, between one and two days should be sufficient, I would think.  Something else is going on.
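
To put rough numbers on that (back-of-the-envelope only, assuming a classic 32-bit/33 MHz PCI bus topping out around 133 MB/s shared across the four drives on the card):

# Back-of-the-envelope; the ~30 MB/s per-drive figure is an assumed
# worst case for a saturated PCI bus, not a measurement of this box.
awk 'BEGIN {
  bytes = 2 * 10^12                                    # one full pass over a 2 TB drive
  printf "at ~30 MB/s: ~%.0f hours\n", bytes / (30 * 10^6) / 3600
  printf "eleven days implies ~%.1f MB/s\n", bytes / (11 * 86400) / 10^6
}'

So even a fully PCI-bound check lands around 19 hours, i.e. the one-to-two-day range; eleven days works out to roughly 2 MB/s, which points at constant error recovery rather than bus bandwidth.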


My experience with an Atom D525 is that it has plenty of power... to do ONE thing. If you are running a parity check AND copying or streaming, performance plummets ...

 

Not sure what configuration you tried, but that's not at all true with my SuperMicro X7SPA-H-D525-O based system.  I run UnRAID with UnMenu, Cache_Dirs, and the APC & CleanPowerdown plugins, and can easily do multiple copies to/from it at once, stream a movie while doing copies, etc.  The performance is just fine.

 


One contributory factor might be that the motherboard you mention has a PCI slot (not PCI Express).

 

This is definitely a MAJOR factor in why parity checks run much slower than they would with an interface that supports the drives' native sustained transfer rates.  HOWEVER ...

 

 

Something else is going on.

 

Absolutely ... the PCI interface bottleneck does not explain eleven days!!

 

 

 


If you're willing to run "at risk" (i.e. without parity protection, so no fault tolerance) for a few days, try this ...

 

(1)  Be sure you know the serial number of your parity drive -- write it down; save an image of the Web GUI; etc. ... just be sure you know which drive it is.

 

(2)  Shut down and disconnect 3 of the drives connected to the PCI controller, so only ONE drive is connected to that controller card.    Be SURE the parity drive is still connected ... either to the motherboard (best) or if it was connected to the PCI card, be sure it's the one still connected.

 

(3)  Boot to UnRAID and create a new configuration with just the 3 drives now connected.  Be SURE you assign your parity drive to the parity slot.  NOTE: Since this config will only have 3 drives, you COULD set your current flash drive aside and do this test with a spare flash drive loaded with the latest version of UnRAID (you don't need a key with only 3 drives).

 

(4)  Let UnRAID do a parity sync with this new configuration (this will take a LONG time ... but it should be hours, not days).

 

(5)  When the parity sync has completed, run a parity check and see how long it takes.
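
Separately, a quick raw-read test on each drive can finger a slow disk or port directly. A minimal sketch from the console (assuming hdparm is present, as it normally is on unRAID's Slackware base; the sd[b-g] range is an example -- match it to your drives, and run it while the array is otherwise idle):

# Unbuffered sequential read, roughly 3 seconds per device.
# A healthy SATA drive should manage 80+ MB/s here; a link stuck
# in constant error recovery will read far lower.
for d in /dev/sd[b-g]; do
  echo "== $d =="
  hdparm -t "$d"
done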

 

 


Syslog is the same as what I posted above :)

Unfortunately, that is a partial log; the earlier entries were rotated out after the log grew too large.  It does not contain the needed information.

 

Look for syslog.1 or syslog.2 in /var/log/

 

What we'll need to see is how the disk is being initialized by the disk controller and any errors that might show themselves.

 

If needed, stop the array, reboot, and then capture the system log.
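
If it's easier, something along these lines from the console copies the current and rotated logs onto the flash drive so they can be attached (a sketch; /boot is where unRAID mounts the flash, visible over the network as the "flash" share):

# Show which syslogs exist, then copy them all to the flash drive.
ls -l /var/log/syslog*
cp /var/log/syslog* /boot/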

 

Joe L.


It's too big to post, so it is in a .txt file

 

https://www.dropbox.com/s/s0tc4megkygfim0/syslog.txt

I'm seeing a ton of ICRC errors.  These are errors communicating with the disk connected to ata2.  Each time the error occurs, the disk controller resets itself and tries again.  These slow you down a LOT.  Pretty sure ata2 = /dev/sdc.

 

Usually the cause of these is bundling SATA cables together (you try to make it look neat, and instead make it more likely for the cables to induce noise into each other).  Cut the tie-wraps.  Do NOT run the SATA cables near each other, and do not run them near the power cables.

Jun 21 19:02:00 Atlantis kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6

Jun 21 19:02:00 Atlantis kernel: ata2.00: irq_stat 0x00020002, device error via D2H FIS

Jun 21 19:02:00 Atlantis kernel: ata2.00: failed command: WRITE DMA

Jun 21 19:02:00 Atlantis kernel: ata2.00: cmd ca/00:10:3f:00:00/00:00:00:00:00/e0 tag 0 dma 8192 out

Jun 21 19:02:00 Atlantis kernel:          res 51/84:01:3f:00:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)

Jun 21 19:02:00 Atlantis kernel: ata2.00: status: { DRDY ERR }

Jun 21 19:02:00 Atlantis kernel: ata2.00: error: { ICRC ABRT }

Jun 21 19:02:00 Atlantis kernel: ata2: hard resetting link

Jun 21 19:02:01 Atlantis emhttp: shcmd (1423): :>/etc/samba/smb-shares.conf

Jun 21 19:02:01 Atlantis emhttp: Restart SMB...

Jun 21 19:02:01 Atlantis emhttp: shcmd (1424): killall -HUP smbd

Jun 21 19:02:01 Atlantis emhttp: shcmd (1425): ps axc | grep -q rpc.mountd

Jun 21 19:02:01 Atlantis emhttp: _shcmd: shcmd (1425): exit status: 1

Jun 21 19:02:01 Atlantis emhttp: shcmd (1426): /usr/local/sbin/emhttp_event svcs_restarted

Jun 21 19:02:01 Atlantis emhttp_event: svcs_restarted

Jun 21 19:02:02 Atlantis kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)

Jun 21 19:02:02 Atlantis kernel: ata2.00: configured for UDMA/100

Jun 21 19:02:02 Atlantis kernel: ata2: EH complete

Jun 21 19:02:02 Atlantis kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6

Jun 21 19:02:02 Atlantis kernel: ata2.00: irq_stat 0x00020002, device error via D2H FIS

Jun 21 19:02:02 Atlantis kernel: ata2.00: failed command: WRITE DMA EXT

Jun 21 19:02:02 Atlantis kernel: ata2.00: cmd 35/00:00:c7:04:00/00:04:00:00:00/e0 tag 0 dma 524288 out

Jun 21 19:02:02 Atlantis kernel:          res 51/84:e0:c7:04:00/00:02:00:00:00/e0 Emask 0x10 (ATA bus error)

Jun 21 19:02:02 Atlantis kernel: ata2.00: status: { DRDY ERR }

Jun 21 19:02:02 Atlantis kernel: ata2.00: error: { ICRC ABRT }

Jun 21 19:02:02 Atlantis kernel: ata2: hard resetting link

Jun 21 19:02:05 Atlantis kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 0)

Jun 21 19:02:05 Atlantis kernel: ata2.00: configured for UDMA/100

Jun 21 19:02:05 Atlantis kernel: ata2: EH complete

Jun 21 19:02:05 Atlantis kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6

Jun 21 19:02:05 Atlantis kernel: ata2.00: irq_stat 0x00020002, device error via D2H FIS

Jun 21 19:02:05 Atlantis kernel: ata2.00: failed command: WRITE DMA EXT

Jun 21 19:02:05 Atlantis kernel: ata2.00: cmd 35/00:88:c7:08:00/00:03:00:00:00/e0 tag 0 dma 462848 out

Jun 21 19:02:05 Atlantis kernel:          res 51/84:78:c7:08:00/00:03:00:00:00/e0 Emask 0x10 (ATA bus error)

Jun 21 19:02:05 Atlantis kernel: ata2.00: status: { DRDY ERR }

Jun 21 19:02:05 Atlantis kernel: ata2.00: error: { ICRC ABRT }

Jun 21 19:02:05 Atlantis kernel: ata2: hard resetting link

 

It's basically saying that the checksum across the SATA cable is failing.

Looks like it is /dev/sdc
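
To confirm the ata2 = /dev/sdc guess on a given boot, the sysfs path for each block device names its ATA port. A quick sketch (the exact path layout varies a little between kernel versions):

# Print each sdX device together with the ataN port it hangs off.
for d in /sys/block/sd?; do
  printf '%s -> %s\n' "${d##*/}" \
    "$(readlink -f "$d" | grep -o 'ata[0-9]\+' | head -n1)"
done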

 

Joe L.


Could be a cable issue like Joe L. mentioned, a failing drive, the Linux driver (sil24) for the Silicon Image controller in your system, or the Silicon Image controller itself.  Has it always been this way (long parity checks)?  Did it start acting up after an unRAID version change, or did it just occur one day with no other changes made to the system?

 

Check your cables, and try moving the drive and its data cable together to a different SATA port and see what happens.  If the errors move with the drive, then it is the drive/cable.  Swap out the data cable next.  If the errors stay with the port, then you have a bad card/port.  Please post a SMART report for the drives as well; it will shed further light on the issue.  A quick search of the forum for obtaining SMART reports will yield instructions if you are unfamiliar.
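
For the SMART reports, something like this from the console dumps one file per drive onto the flash share for attaching (a sketch; smartctl is included with unRAID -- substitute your actual device letters):

# Full SMART output for each drive; the sdb..sdg list is an example.
for d in sdb sdc sdd sde sdf sdg; do
  smartctl -a "/dev/$d" > "/boot/smart_$d.txt"
done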

 

 


The parity drive is having write failures:

 

Jun 21 19:04:44 Atlantis kernel: handle_stripe write error: 55576/0, count: 1
Jun 21 19:04:44 Atlantis kernel: md: disk0 write error

 

It should have a red dot.

Oops... I'm wrong... you are right...  That is an issue.

 

Joe L.

(I did not spot that in the syslog)

