rcu_sched self-detected stall - Is it a watchdog?



Just had a rcu_sched self-detected stall.  This occurred 12 minutes after my scheduled parity check started.  Up to this point, my system has always been 100% rock solid.

 

Am I reading this https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt correctly?  To me, it sounds like a watchdog-style timer: if a CPU fails to check in before the timeout, the stall warning gets generated.
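
For reference, the watchdog window itself can be read straight out of sysfs (standard kernel behaviour, nothing unRAID-specific - assuming your build exposes the rcupdate module parameters):

cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

That value is the number of seconds a CPU has to report a quiescent state before the self-detected stall warning fires.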

 

I had a perfect storm of applications hogging the CPU at the time:

 

  • Parity check running - with unthrottled (unraid-tunables-tester) values
  • Sabnzbd docker running with a whack of small (200Meg) downloads
  • Nzbdrone docker copying the downloads to the secondary server
  • HTPC scanning both servers over and over again with each new d/l
  • MariaDB docker trying to keep up with it all
  • FoldingAtHome docker doing its thing

Because of all that, my sysload was up around 38 on my triple-core.

 

As soon as I managed to kill FoldingAtHome and Sab, the stalls stopped.

 

Just asking, because I think as a general rule on the forums self-detected stalls tend to get blamed on hardware (or lately reiser), but in my case I'm fairly confident that it's actually software and that the system is just missing its watchdog timer due to the obscenely high sysload.

 

Looks like I'm going to have to build a script to stop Sab & Folding@Home when a parity check is in progress.
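
Something along these lines is what I have in mind - just a rough sketch, assuming unRAID's mdcmd reports mdResync=0 when no check/sync is running; the container names are obviously mine:

#!/bin/bash
# Sketch only: stop the heavy dockers while a parity check/sync runs, start them again after.
# Assumptions: mdcmd lives at /usr/local/sbin/mdcmd and prints mdResync=0 when the array is idle;
# the container names below are examples.
if /usr/local/sbin/mdcmd status | grep -q "mdResync=0"; then
    docker start FoldingAtHome Sabnzbd >/dev/null 2>&1   # nothing running - bring them back up
else
    docker stop FoldingAtHome Sabnzbd >/dev/null 2>&1    # check in progress - shut them down
fi

Dropped into cron every few minutes, that should cover it.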

 

Any thoughts?  Is my analysis of the situation correct?  Anyone ever try playing with the timeout values on rcu_sched?

 

Mar 1 00:12:50 Server_A kernel: INFO: rcu_sched self-detected stall on CPU { 1} (t=6000 jiffies g=3181375 c=3181374 q=922256)
Mar 1 00:12:50 Server_A kernel: Task dump for CPU 1:
Mar 1 00:12:50 Server_A kernel: unraidd R running task 0 4443 2 0x00000008
Mar 1 00:12:50 Server_A kernel: 0000000000000000 ffff88021fc43da8 ffffffff8105e0b5 0000000000000001
Mar 1 00:12:50 Server_A kernel: 0000000000000001 ffff88021fc43dc8 ffffffff81060780 0000000000000002
Mar 1 00:12:50 Server_A kernel: ffffffff81834400 ffff88021fc43df8 ffffffff8107845f ffffffff81834400
Mar 1 00:12:50 Server_A kernel: Call Trace:
Mar 1 00:12:50 Server_A kernel: [] sched_show_task+0xbe/0xc3
Mar 1 00:12:50 Server_A kernel: [] dump_cpu_task+0x35/0x39
Mar 1 00:12:50 Server_A kernel: [] rcu_dump_cpu_stacks+0x6a/0x8c
Mar 1 00:12:50 Server_A kernel: [] rcu_check_callbacks+0x1db/0x4f9
Mar 1 00:12:50 Server_A kernel: [] ? do_IRQ+0xb6/0xd1
Mar 1 00:12:50 Server_A kernel: [] ? tick_sched_handle+0x34/0x34
Mar 1 00:12:50 Server_A kernel: [] update_process_times+0x3a/0x64
Mar 1 00:12:50 Server_A kernel: [] tick_sched_handle+0x32/0x34
Mar 1 00:12:50 Server_A kernel: [] tick_sched_timer+0x37/0x61
Mar 1 00:12:50 Server_A kernel: [] __run_hrtimer.isra.29+0x57/0xb0
Mar 1 00:12:50 Server_A kernel: [] hrtimer_interrupt+0xd9/0x1c0
Mar 1 00:12:50 Server_A kernel: [] local_apic_timer_interrupt+0x50/0x54
Mar 1 00:12:50 Server_A kernel: [] smp_apic_timer_interrupt+0x3c/0x4e
Mar 1 00:12:50 Server_A kernel: [] apic_timer_interrupt+0x6d/0x80
Mar 1 00:12:50 Server_A kernel: [] ? memcmp+0x6/0x22
Mar 1 00:12:50 Server_A kernel: [] ? xor_blocks+0x22/0x4b
Mar 1 00:12:50 Server_A kernel: [] handle_stripe+0xdeb/0x120f [md_mod]
Mar 1 00:12:50 Server_A kernel: [] unraidd+0xda/0x194 [md_mod]
Mar 1 00:12:50 Server_A kernel: [] md_thread+0xd2/0xe8 [md_mod]
Mar 1 00:12:50 Server_A kernel: [] ? __wake_up_sync+0xd/0xd
Mar 1 00:12:50 Server_A kernel: [] ? 0xffffffffa0089000
Mar 1 00:12:50 Server_A kernel: [] kthread+0xd6/0xde
Mar 1 00:12:50 Server_A kernel: [] ? kthread_create_on_node+0x172/0x172
Mar 1 00:12:50 Server_A kernel: [] ret_from_fork+0x7c/0xb0
Mar 1 00:12:50 Server_A kernel: [] ? kthread_create_on_node+0x172/0x172

Link to comment

I've determined that it's not strictly related to the high sysload (even with Docker completely turned off and my sysload <1, stalls were still happening).  I've now got it pinned down to the parity check itself, which means that it's more I/O related.  I've adjusted the disk tunables back down, and the stalls seem to have disappeared, even with all of the dockers back running at full tilt (sysload ~7).

 

Could the tunables be causing all of those recurring stalls people are reporting while the mover is running?

 

Link to comment

I've determined that it's not strictly related to the high sysload (even with Docker completely turned off and my sysload <1, stalls were still happening).  I've now got it pinned down to the parity check itself, which means that it's more I/O related.  I've adjusted the disk tunables back down, and the stalls seem to have disappeared, even with all of the dockers back running at full tilt (sysload ~7).

 

Could the tunables be causing all of those recurring stalls people are reporting while the mover is running?

 

It's not likely.  Some of the users who had the stalls never even ran the tunables script or adjusted those settings.

Link to comment

So does anyone know what it actually is?  I can now turn it on and off at will.  And whether or not that script was ever run on other users' systems (if they even remember - I don't recall anyone in those problem threads checking to see what the tunables were set to), maybe the defaults are simply too high under certain hardware combinations.

 

On the other hand, I suppose I could just say that it's caused by xfs.  After all, I don't have any reiserfs drives in the system any more, and one of the stalls did have an xfs entry in the stack dump.  (The exact same logic that is being used to blame Reiser based on stack dumps.)  Maybe we should just recommend that nobody use xfs anymore.  ;)

 

Link to comment

So does anyone know what it actually is?  I can now turn it on and off at will.  And whether or not that script was ever run on other users' systems (if they even remember - I don't recall anyone in those problem threads checking to see what the tunables were set to), maybe the defaults are simply too high under certain hardware combinations.

 

On the other hand, I suppose I could just say that it's caused by xfs.  After all, I don't have any reiserfs drives in the system any more, and one of the stalls did have an xfs entry in the stack dump.  (The exact same logic that is being used to blame Reiser based on stack dumps.)  Maybe we should just recommend that nobody use xfs anymore.  ;)

 

I had this happening in earlier betas with all reiserfs disks.  After beta10a I moved all disks to xfs and didn't see it happening anymore.  I never changed any tunables.

Link to comment

So does anyone know what it actually is?  I can now turn it on and off at will.  And whether or not that script was ever run on other users' systems (if they even remember - I don't recall anyone in those problem threads checking to see what the tunables were set to), maybe the defaults are simply too high under certain hardware combinations.

 

That is very interesting indeed.  The reason it was being associated with RFS was that it showed up in all the stack traces and dumps, and it was initially solved by the first few users by migrating to XFS.

 

Now that we have a little more information, it is definitely something else.  So you can make it happen at will without running your Folding@Home tasks?  If so, then you would be indispensable to the entire community and should definitely send a PM/email directly to all of the LimeTech staff and work with them to get this resolved.

Link to comment

So does anyone know what it actually is?  I can now turn it on and off at will.  And whether or not that script was ever run on other users' systems (if they even remember - I don't recall anyone in those problem threads checking to see what the tunables were set to), maybe the defaults are simply too high under certain hardware combinations.

 

That is very interesting indeed.  The reason it was being associated with RFS was that it showed up in all the stack traces and dumps, and it was initially solved by the first few users by migrating to XFS.

 

Now that we have a little more information, it is definitely something else.  So you can make it happen at will without running your Folding@Home tasks?  If so, then you would be indispensable to the entire community and should definitely send a PM/email directly to all of the LimeTech staff and work with them to get this resolved.

No.  That was my initial thought because it did get better after I lowered my sysload.

 

I can turn it off and on at will now by adjusting the tunables settings - i.e., adjusting the amount of hard drive I/O happening (albeit only during a parity check).
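
For the record, the same knobs can be poked from the console as well - I believe this is how the tunables tester applies them, but treat it as a sketch and double-check that your mdcmd accepts set for these before relying on it; the numbers are just my current test pair:

# assumption: mdcmd exposes these md driver tunables via 'set'
mdcmd set md_num_stripes 4480
mdcmd set md_sync_window 2048

Dialing md_sync_window back down is what makes the stalls stop again.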

 

Already sent a PM to JonP with a link, suggesting that this may be related to all of the other issues out there.  If nothing else, at least in my case I know exactly how to turn it on and off at will.  Not that reiser is completely blameless - if xfs solves the issue for those users, then it is reiser - however, if the problem can be solved for those users by just adjusting the tunables down, then it's a lot easier than converting the whole system over to xfs.

 

If nothing else, if I am reading the link in the OP correctly, these stalls are harmless.  Which is why I'm waiting for someone more knowledgeable than me to answer the basic question - what exactly is it?

Link to comment

If you can make this happen "at will" then try changing the stall timeout (the default is 60).  For example, to change it to 120 type:

 

echo 120 >/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

 

Please see if this makes any difference.

 

You can also turn off those warnings by typing:

 

echo 1 >/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress

Link to comment

If you can make this happen "at will" then try changing the stall timeout (the default is 60).  For example, to change it to 120 type:

 

echo 120 >/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

 

Please see if this makes any difference.

 

You can also turn off those warnings by typing:

 

echo 1 >/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress

Changed md_num_stripes to 4480

Changed md_sync_window to 2048

 

17:46 - successfully recreated the stalls

17:49 - stall

17:50 - stalls continuing.

17:52 - trying to recreate my steps last night - killed FoldingAtHome then restarted

18:02 - Sabnzbd extremely unresponsive - won't load UI

18:01 - Last stall posted

18:03 - change time out to 120

18:29 - still no further stalls posted - longest it's ever gone without a stall with those tunable settings
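
(For the record, the echo change is runtime-only and won't survive a reboot.  If the bigger timeout turns out to be the long-term answer, I'll bake it into the boot line instead - a sketch, assuming the usual unRAID syslinux.cfg layout on the flash drive:

# assumption: standard unRAID flash layout - add the parameter to the append line in /boot/syslinux/syslinux.cfg
append rcupdate.rcu_cpu_stall_timeout=120 initrd=/bzroot

The rcupdate.rcu_cpu_stall_timeout= boot parameter is documented in the kernel's kernel-parameters.txt.)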

 

syslog attached.  Will let parity check run and see what happens.  Hopefully this helps with the stall issues floating around on the forums. 

 

@JonP - I drink either Moosehead or Canadian  ;)

 

syslog_3.zip

Link to comment

Highly usable...  Only once during this test for Tom did it wind up going for a crap and becoming completely unresponsive.  Currently, sysload just spiked to 29, and I could still use the system.  I think when it was unresponsive, it was in the middle of a stall.

 

 

@Tom:  just checked my log again, and another couple of stalls:

 

18:51 & 18:56.  Definitely a huge improvement.  Going to try 180 for a timeout.

 

The interesting thing is that when there is a stall, the network also goes down.  In the syslog, after every stall there is a CIFS error because the system loses track of my network mount.  That's going to be why all the UIs become unresponsive.

syslog_4.zip

Link to comment

So this means that something is still stalling/stalled for over 2 minutes.

 

The last one was within the unRAID md driver's unraidd function, which invoked the handle_stripe function and then somehow made it into the check_xor macro, which seems to be invoked from various parts of the driver.

Link to comment

So this means that something is still stalling/stalled for over 2 minutes.

That's what I think.  I can't find anywhere on the web that outright says a self-detected stall is just a case of a missed watchdog timer.

 

If I'm right about it being a watchdog (and Tom's suggestion to change the timeouts seems to imply it too), then it is by itself a "harmless error".  Since I already have it pinned down on my hardware exactly what is causing it, I'm only running this test in the hope that maybe the other stall issues out there can be solved easily by LT, since thus far I'm the only one who has figured out exactly how to reproduce it.

 

Still another 5 1/2 hours of testing to go...

Link to comment

Right, the messages are likely harmless as they're just diagnosing the real issue(s).  In your latest test, what needs to be determined and worked out is why there's a CPU process that has been starved / stalled / deadlocked for over 2 minutes.

Which is why I've also got system stats up and running in real time, so that if/when there's another stall I can correlate it with processor / memory / network / hard drive usage and bandwidth.
Link to comment

I already knew that it was md_sync causing the stalls.  Using 640 I know that I don't get any at all.  The time difference to me is immaterial (8.88 hours vs an estimated 8 at the high values).  I usually run the settings at a lower value anyway to leave I/O room for other processes.

 

 

EDIT:  Will work on that now...

Link to comment

You should try lowering md_sync_window until you don't see any more stalls: how low do you have to go and what impact does it have on your parity check speed?

Reset the timeout back to the default of 60

 

md_num_stripes / md_sync_window

 

4480 / 2048 - stalls

4224 / 1920 - stalls

3968 / 1792 - stalls

3584 / 1664 - stalls

3328 / 1536 - looks good.  Will let it run for the night (currently bouncing between 100 and 120 MB/sec, maybe 3 MB/sec slower than the starting values).  Stall ~4am.

 

As far as parity check speed goes, a lower value of md_sync_window does not always equal a lower speed.  Check out the tunables thread.  As long as it completes in 8-9 hours I'm happy, and any of the values will do that.  But, with the resurgence of interest in the tunables, I re-ran the tester and selected new values.  Happy coincidence, since I wound up with a repeatable problem that is causing some issues around the threads.

 

Link to comment

You should try lowering md_sync_window until you don't see any more stalls: how low do you have to go and what impact does it have on your parity check speed?

Testing finally complete (my hydro company thanks you  ;))

 

Timeout was set back to the default of 60

 

md_num_stripes / md_sync_window

 

4480 / 2048 - stalls

4224 / 1920 - stalls

3968 / 1792 - stalls

3584 / 1664 - stalls

3328 / 1536 - stalls

3072 / 1408 - stalls

2816 / 1280 - stalls

2560 / 1152 - Good to go

 

As the md_sync_window was lowered, the frequency of the stalls decreased.  Minimal (if any) difference in parity check speed (@1152 it was a 98.1 MB/s average - 8 hours, 29 minutes).

 

Hope you can gather some info to solve some of the other stalls around here.

 

syslog_8.zip

Link to comment
