SMLLR

[Solved] Extremely Slow Parity Checks and High CPU usage


Posted (edited)

So I have been battling with this for a week now and am just about at my wit's end. I have had unRAID set up on my server for a few months, and everything was mostly fine until the most recent parity check on Aug 1st. When I awoke that morning, the parity check was only 5% complete and had run for over 8 hours; normally the check completes in about 20 to 24 hours with my 8TB drives. I have tried a lot of troubleshooting and have even gone so far as to roll everything back to a single 8TB storage drive and an 8TB parity drive, yet the issue persists. The issue was first observed while running the first parity check after adding an 8TB drive, and I am not sure whether I ran a parity check after upgrading to 6.5.3. The issue also persists with the VM manager and Docker services stopped, so zero disk activity outside of the parity check should be occurring.

 

System specifications:

HP Proliant ML350e G8 v2

2x Xeon E5-2440 v2 @ 1.90GHz (8 physical cores each, for a total of 16 cores / 32 threads)

96GB ECC Memory (12 x 8GB)

HP H220 HBA card in IT mode (currently at firmware 15, but may be able to upgrade it to 20)

 

A few things I have yet to try are:

Revert back to an older version of unRaid

Restart the server in no plugins mode

 

I have uploaded a diagnostics below, though the log is a bit small as I canceled out the parity check after only a few minutes.

prefect-diagnostics-20180808-1319.zip

Edited by SMLLR

Posted (edited)

There have been a couple of similar issues that were helped by changing the tunables to default or lower-than-default values. You're using very strange values:

 

Aug  8 12:46:53 Prefect kernel: mdcmd (31): set md_num_stripes 512
Aug  8 12:46:53 Prefect kernel: mdcmd (32): set md_sync_window 144
Aug  8 12:46:53 Prefect kernel: mdcmd (33): set md_sync_thresh 192

md_sync_thresh should be lower than md_sync_window; either change both back to default, or try, for example, 100 for md_sync_thresh.
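For reference, that relationship can be tried out from the console as a temporary test. This is only a sketch: it assumes unRAID's mdcmd helper (the same command visible in the log above) is on the PATH, and the values are examples, not recommendations:

```shell
# Temporary, in-memory change (reverts on reboot); example values only.
# Keep md_sync_thresh below md_sync_window -- the logged values (192 vs. 144)
# have that relationship inverted.
mdcmd set md_sync_thresh 100
mdcmd set md_sync_window 144
```

The lasting fix is to set the same values under Settings -> Disk Settings so they survive a reboot.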

Edited by johnnie.black


Those settings were still at the defaults when I first started experiencing this issue. I believe I then started playing around with the values and even put a config in place to boot up with those options. I changed them in the disk settings before running the most recent parity check, as shown below:

[screenshot: Disk Settings showing the modified tunable values]

 

I can kill that config and reboot the server to see if that helps at all.

Posted (edited)

Found the config and wiped it out. Rebooted to verify the settings actually stuck this time and re-ran a parity check with the same results. I have attached a new diagnostics report.

 

prefect-diagnostics-20180808-1855.zip

 

EDIT: It is worth mentioning that this is while in safe mode.

Edited by SMLLR


Try lowering all tunables, keeping sync_thresh lower than sync_window, but it could also be a hardware problem.


I took the opportunity to reinstall the OS after backing up the existing configuration. As of right now, my parity rebuild is running at around 120MB/s and is 25% complete (it was running upwards of 150MB/s at the beginning). The rebuild is being done with all four disks in place (the three previously existing ones and the new one). I fully believe all hardware is working as expected right now; however, I will not know whether a parity sync works as expected until the rebuild is done tomorrow. If it does, it may be worth digging into the configs to compare my old config against the near-stock config to see what may have caused the issues. I believe reinstalling the OS should return parity sync runtimes to normal, as the parity sync was working without issue until about two weeks ago.

 

At this rate, the rebuild will be completed probably around noon EST tomorrow. I will hopefully have a positive update at that time.


Finished up the rebuild, which averaged 130MB/s; however, the parity check still ran at 10MB/s with the CPU pegged at ~80%. I had to reduce the tunable settings to about a quarter of their original values to get back to where I was before the most recent parity check. It just seems odd to me that the rebuild is so fast without changing any settings, yet the parity check runs so slowly.
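To put those two speeds in perspective, a back-of-the-envelope calculation (assuming a nominal 8 TB = 8x10^12 bytes, read end to end) shows how the pass time scales:

```shell
# Rough time for one full pass over an 8 TB drive at a given average speed.
bytes=$((8 * 1000 * 1000 * 1000 * 1000))
for speed_mb in 130 10; do
  secs=$((bytes / (speed_mb * 1000 * 1000)))
  printf '%3d MB/s -> %d h %d m\n' "$speed_mb" $((secs / 3600)) $(((secs % 3600) / 60))
done
# 130 MB/s finishes in about 17 hours; 10 MB/s stretches past 9 days.
```

That matches the symptoms in the opening post: 5% done after 8 hours works out to roughly 160 hours for the whole check.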


You are the 3rd user recently with the same issue; all 3 are using similar Xeon CPUs. I have already pointed all 3 threads to LT; maybe there is something they can do, but leaving the md tunables at the maximum you can set them without triggering the problem should be a good enough workaround for now.


I'm curious what hdparm -I <drive> says about all the drives.

 

Having a disk drop down to PIO mode instead of UDMA would give these silly slow speeds at extreme CPU load, since PIO mode means there is no hardware acceleration of the data transfers. With DMA, the transfers just consume memory bandwidth, and at the end of a transfer the OS gets an interrupt informing it that the transfer is done.

 

Below is partial output of hdparm -I, where the star before "udma6" shows which mode the drive is currently using.

Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 0
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
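A quick way to confirm the active mode is to pull the starred entry out of that "DMA:" line. In this sketch the sample line from above stands in for live output; in practice you would feed it from hdparm -I /dev/sdX for each array drive:

```shell
# Extract the starred (currently active) transfer mode from an hdparm -I "DMA:" line.
# Sample line taken from the output above; replace with real hdparm output in practice.
line='DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6'
mode=$(printf '%s\n' "$line" | grep -oE '\*[[:alnum:]]+' | tr -d '*')
echo "active mode: ${mode:-none found (possible PIO fallback)}"
```

A healthy modern drive should report udma6 here; no starred udma entry at all would point at a PIO fallback.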

 


All drives show udma6 as the mode currently in use. The only difference between them is that the older 4TB drive does not have the "Advanced power management level" line.

Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 0
        Advanced power management level: 164
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns

 


So I believe this may finally be resolved. I did make a large number of changes, but I believe the BIOS update and switching the server's power management to OS-controlled made the biggest impact. I am even using the default tunable settings, which I am now going to work on tweaking in an effort to improve performance. I just find it odd that it was working perfectly fine until earlier this month...




Copyright © 2005-2018 Lime Technology, Inc.
unRAID® is a registered trademark of Lime Technology, Inc.