Long array stop times, why?

mattkhan · January 6, 2016

I've noticed, since upgrading to unRAID 6, that stopping the array takes an awfully long time, and the sync command seems to be the offender. Checking /sys/block/sd[a-h]/stat gives me output like this:

# for dev in /sys/block/sd[a-h]; do echo "$dev : $(awk '{print $9}' "$dev/stat")"; done
/sys/block/sda : 0
/sys/block/sdb : 0
/sys/block/sdc : 0
/sys/block/sdd : 0
/sys/block/sde : 0
/sys/block/sdf : 0
/sys/block/sdg : 0
/sys/block/sdh : 141

(the 9th column in the output being the number of in-flight I/O requests, as per https://www.kernel.org/doc/Documentation/block/stat.txt)

This shows that sdh is the only device with any outstanding I/O, and that is the drive being precleared. The stats for the actual array drives are not moving.

Is preclear holding up the array stop, or is something else going on?

To give an example, I triggered an array stop at 08:17 today and it's still going 50 minutes later. The web UI is unresponsive throughout, but the system is up and running and I can ssh in and look at what is going on.

mattkhan · January 6, 2016

Diagnostics attached.

sync is still running and the UI is unresponsive.

The disk errors logged are from the preclear drive and are mentioned in another thread: https://lime-technology.com/forum/index.php?topic=45236.msg431867#msg431867

zalaga-unraid-diagnostics-20160106-0909.zip

itimpi · January 6, 2016

In my experience any process doing I/O can cause the sync command to take forever to complete regardless of what disk it is happening to. Killing the preclear process would probably allow the system to stop the array.
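One way to check which processes are actually stuck on I/O while sync runs is to look for tasks in uninterruptible sleep (state "D"). The following is only a rough sketch, assuming a procps-style `ps` is installed; the function name `list_blocked` is made up for illustration.

```shell
# Keep only "pid stat comm" lines whose state starts with D
# (uninterruptible sleep, which usually means blocked on disk I/O).
# Reads from stdin so the filter can also be tried on saved ps output.
list_blocked() {
    awk '$2 ~ /^D/'
}

# Usage on a live system:
#   ps -eo pid=,stat=,comm= | list_blocked
```

If sync and the preclear's writer both show up here for minutes at a time, that would be consistent with the explanation above, though it does not prove it.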

mattkhan · January 6, 2016

It seems unfortunate that preclear affects stopping the array (and then makes the web UI completely unresponsive to boot).

Is there any reason why preclear has to be run on the unRAID host as opposed to some random Linux box? I've read through the script and it seems to just make use of a few unRAID config files in a few places, but that would be easy enough to stub.

itimpi · January 6, 2016

> mattkhan wrote: is there any reason why preclear has to be run on the unraid host as opposed to some random linux box? I've read through the script and it seems to just make use of a few unraid config files in a few places but that would be easy enough to stub.

Preclear can be run on any system. It is common practise to boot a version of unRAID on another system for exactly this purpose. It can also be run on a vanilla Linux system as long as you make sure any dependencies of the script are present.

mattkhan · January 6, 2016

> itimpi wrote: Preclear can be run on any system. It is common practise to boot a version of unRAID on another system for exactly this purpose. It can also be run on a vanilla Linux system as long as you make sure any dependencies of the script are present.

OK, thanks, I'll go that route in future then.

SSD · January 6, 2016

> itimpi wrote: Preclear can be run on any system. It is common practise to boot a version of unRAID on another system for exactly this purpose. It can also be run on a vanilla Linux system as long as you make sure any dependencies of the script are present.

> mattkhan wrote: ok thanks, I'll go that route in future then.

Sync is a bit of a pig. If the array is spun down, it can take a minute or more to return, spinning up all drives in the process (the newperms script calls sync, and that has been my main experience with this irritating behavior). I've never had a preclear prevent the array from being stopped (maybe I never tried), and I'm a little skeptical that it is the reason. I have had open Windows Explorer sessions on array drives, and telnet sessions with the current directory set to an array disk location, hold up array shutdown. You get a pretty unhelpful stream of messages at the bottom of the web GUI screen, which are at least a prompt to go find what is holding up the shutdown. If you don't find it, the array will never stop. I expect you'd also not be able to bring up a new web GUI session, although an existing session would continue to be updated; if it were closed, it would likely just appear to hang with the symptoms you describe.
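Hunting for a shell or session whose current directory sits on an array disk can be done by reading the `cwd` links under /proc. This is a minimal sketch, not an unRAID tool: the function name `cwd_holders` is hypothetical, and the `/mnt/disk1` path in the usage line is an assumption about the local mount layout.

```shell
# Print "pid cwd" for every process whose current directory is at or below
# the given mount prefix. The second argument defaults to /proc; it is a
# parameter only so the function can be exercised against a fake tree.
cwd_holders() {
    prefix=$1
    proc_root=${2:-/proc}
    for p in "$proc_root"/[0-9]*; do
        # Skip entries we cannot read (process exited, no permission).
        cwd=$(readlink "$p/cwd" 2>/dev/null) || continue
        case $cwd in
            "$prefix"|"$prefix"/*) echo "${p##*/} $cwd" ;;
        esac
    done
}

# Usage (run as root so other users' cwd links are readable):
#   cwd_holders /mnt/disk1
```

This only covers current directories; files held open on the array would need a separate check.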

mattkhan · January 6, 2016

> SSD wrote: Never had a preclear prevent array being stopped (maybe never tried) and am a little skeptical that it is the reason.

FWIW, I checked the logs this evening and can see that zeroing the drive completed at 09:50 this morning:

# stat /tmp/zerosdh
  File: ‘/tmp/zerosdh’
  Size: 231873          Blocks: 456        IO Block: 4096   regular file
Device: 2h/2d   Inode: 123856      Links: 1
Access: (0666/-rw-rw-rw-)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-01-06 20:43:32.097103305 +0000
Modify: 2016-01-06 09:50:50.200307361 +0000
Change: 2016-01-06 09:50:50.200307361 +0000

and at the same time in /var/log/syslog we see

Jan  6 09:50:49 zalaga-unraid emhttp: shcmd (122): rm -f /boot/config/plugins/dynamix/mover.cron
Jan  6 09:50:49 zalaga-unraid emhttp: shcmd (123): /usr/local/sbin/update_cron &> /dev/null
Jan  6 09:50:49 zalaga-unraid emhttp: Unmounting disks...
Jan  6 09:50:49 zalaga-unraid kernel: mdcmd (131): stop
Jan  6 09:50:49 zalaga-unraid kernel: md1: stopping
Jan  6 09:50:49 zalaga-unraid kernel: md2: stopping
Jan  6 09:50:49 zalaga-unraid kernel: md3: stopping
Jan  6 09:50:49 zalaga-unraid kernel: md4: stopping
Jan  6 09:50:49 zalaga-unraid kernel: md5: stopping
Jan  6 09:50:49 zalaga-unraid emhttp: shcmd (124): rmmod md-mod |& logger
Jan  6 09:50:49 zalaga-unraid kernel: md: unRAID driver removed
Jan  6 09:50:49 zalaga-unraid emhttp: shcmd (125): modprobe md-mod super=/boot/config/super.dat slots=24 |& logger

This looks pretty conclusive: the sync at array shutdown covers all disks in the system, not just the array disks.

SSD · January 7, 2016

If the current directory of the preclear command was on an array disk, that would explain it too. I've had shutdowns hang because I had an old screen session whose directory was on the array. Sync is a Linux command. It works on all disks. I'm just not sure why it would hang on a disk under heavy I/O. May need a Linux expert to weigh in.

mattkhan · January 7, 2016

Fair point, I was thinking of it syncing a disk at a time, which, as you say, it doesn't. Well, that would explain it anyway: preclear zeroing is constantly reading from urandom to generate data to write to the disk, so attempting to sync is doomed to sit there forever, i.e. sync is trying to flush memory to disk while another process is constantly generating data in memory to write to disk.
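If the zeroing writes do go through the page cache, this theory can be checked by watching the kernel's Dirty and Writeback counters in /proc/meminfo while sync is running. A rough sketch follows; `dirty_pages` is a made-up helper name, and whether preclear's writes are buffered at all is an assumption here.

```shell
# Print the Dirty and Writeback lines (values in kB) from a meminfo-style
# file. Defaults to /proc/meminfo; the path is a parameter only for testing.
dirty_pages() {
    awk '/^(Dirty|Writeback):/ { print $1, $2, $3 }' "${1:-/proc/meminfo}"
}

# Usage: sample once a second while sync is running.
#   while sleep 1; do dirty_pages | tr '\n' ' '; echo; done
```

A Dirty figure that stays high instead of draining toward zero would suggest that new data is being queued as fast as sync can flush it.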

JorgeB · January 7, 2016

I do sometimes stop the array during a preclear on my test server. It takes longer than usual to stop if preclear is zeroing a disk, but it does stop, and preclear continues in the background. If I stop the array during the preclear post-read, it works normally.
