Call traces error



Recent 'Fix Common Problems' scans are reporting an issue with 'call traces' and, on checking the logs, this is what appears:

 

Jan 30 17:59:57 TechNAS kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.4.30-unRAID #2
Jan 30 17:59:57 TechNAS kernel: Hardware name: HP ProLiant MicroServer, BIOS O41     07/29/2011
Jan 30 17:59:57 TechNAS kernel: 0000000000000000 ffff88011fc83dd0 ffffffff8136f79f ffff88011fc83e18
Jan 30 17:59:57 TechNAS kernel: 0000000000000132 ffff88011fc83e08 ffffffff8104a4ab ffffffff8155871e
Jan 30 17:59:57 TechNAS kernel: ffff88011afba000 ffff8800c3ad6e00 ffff88011afba3a0 0000000000000001
Jan 30 17:59:57 TechNAS kernel: Call Trace:

 

Also attached are the diagnostics.

 

Does anyone have any idea what's causing it and how it can be resolved?

 

Thanks.

technas-diagnostics-20170201-1648.zip

Link to comment

You are not being ignored. I had a look but I have nothing to offer you other than that your call trace was thread pool related. This is complex kernel stuff and I don't even begin to understand it - I don't think there are many people here who do. All I can suggest is to see if there's a newer BIOS and to try a different kernel - maybe roll back a version of unRAID or wait for the next release.

Link to comment

The thing with call traces is that they may not always be fatal.  In your case, there doesn't appear to be any harm done, but as stated it takes a real Linux expert to determine all the actual causes.

 

As a side note, the FCP test is new and happened by coincidence to arrive at roughly the same time as the 6.3.0 release, so the traces do not necessarily mean it's a 6.3.0 issue; they may have been occurring under previous unRAID versions but simply not being caught by FCP.

Link to comment

Well, there you go! My advice worked for once  8)

Lol. I was holding off upgrading to 6.3, then I had a rush of confidence and did it anyway, and it went without any problems, which is good news. I've only been using unRAID for just over a month, so I'm new around here.

 

Seriously, though, if you want to post your new diagnostics we'll give them the once-over.

Great, thanks. Hopefully all is fine, but new diagnostics attached.

technas-diagnostics-20170208-1636.zip

Link to comment

This new feature of FCP has opened up a big can of worms. I wonder if it ought to come with a big friendly label saying "DON'T PANIC"?

I've actually mulled over the idea of removing the test  ???

 

But the net result is that call traces are NOT a normal occurrence and do tell you that something is wrong.  Whether that "wrong" is harmless or fatal is rather hard to tell without someone actually looking at the syslog and giving best guesses.

 

Hopefully as time goes on, the community here will get better at determining the cause of some of the harmless call traces, but as I've said, a perfectly working system should not ever have any call traces....
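
If you want to eyeball your own syslog, here's a rough sketch of the sort of thing to run from a terminal (assuming the standard unRAID log location of /var/log/syslog):

grep -c "Call Trace" /var/log/syslog      # how many traces have been logged since boot
grep -B5 "Call Trace" /var/log/syslog     # the lines leading into each one - usually the most useful part
cat /proc/sys/kernel/tainted              # 0 means the running kernel has not been flagged as tainted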

Link to comment

This new feature of FCP has opened up a big can of worms. I wonder if it ought to come with a big friendly label saying "DON'T PANIC"?

I've actually mulled over the idea of removing the test  ???

 

I think it would be a shame if you did remove it.

 

But the net result is that call traces are NOT a normal occurrence and do tell you that something is wrong.  Whether that "wrong" is harmless or fatal is rather hard to tell without someone actually looking at the syslog and giving best guesses.

 

Hopefully as time goes on, the community here will get better at determining the cause of some of the harmless call traces, but as I've said, a perfectly working system should not ever have any call traces....

 

I'm sure the support will improve with time. I enjoy studying syslogs but feel my level of understanding isn't good enough yet. Then I see a cry for help with zero replies dropping down the list and think, well I'll have a look and see if I can make a guess.

 

Link to comment

I reached out to some other users for their opinion on whether or not to move it to a "Can O' Worms" mode.

Link to comment

Here's my opinion, and I want it made clear it's just my opinion.  I too am not a Linux expert, especially when it comes to anything in this particular class of issues:

 

- kernel panic, BUG, or Oops (and any other issue resulting in a Call Trace)

- MCE (Machine Check Error)

- OOM (Out of Memory error)

- GPF (General Protection Fault)

- segfault

- IRQ not handled

 

Invariably, not one of the above gives any warning leading up to it, so checking the events and messages just before it is usually useless.  They tend to happen out of the blue.  However, the operations running at the time *may* provide an idea of a source of stress on the system, which *may* be a factor.

 

It's generally hard to say whether most of them are software or hardware related, which makes it hard to know which direction to point the user.  For hardware, it could be a faulty RAM stick, an overheating component, a failing motherboard, a power spike, or a failing or bad PSU.  That means Memtest, then checking cooling, then hardware troubleshooting/component isolation, including the PSU.  For software, it's a program bug in something - it could be in the kernel, BIOS, card firmware, or a driver.  That means checking for a motherboard BIOS update, firmware updates, a newer kernel, or possibly an older kernel.  Drivers are distributed with the kernel.
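
For the software side, the basic data points are quick to pull from a terminal - a sketch (dmidecode is normally available on unRAID; the output obviously varies by board):

uname -r                          # running kernel version
dmidecode -s bios-version         # motherboard BIOS version
dmidecode -s bios-release-date    # BIOS date - compare against the newest the vendor offers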

 

There are several things I look for in a Call Trace:

- the module in which this occurred (may or may not be the actual cause, but it's the first suspect)

- whether that module is 'Tainted' (means its code has been changed, which *may* mean memory corruption; but there's one exception, see below *)

- any recognizable functions in the functions traced (may or may not be the actual suspects though)

- the motherboard BIOS date

- and of course what kind of issue (GPF, kernel panic, OOM, etc)

In some cases, the issue occurs while *other* processes are running, so the reported module and functions have nothing to do with the actual issue, which can make Call Traces very misleading!
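
Most of those items sit right in the trace header lines, so a quick grep sketch can pull them out (assuming the syslog lives at /var/log/syslog):

grep "Tainted:" /var/log/syslog          # taint flags plus the running kernel version, from each trace header
grep "Hardware name:" /var/log/syslog    # board model and BIOS date, also from the trace header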

 

In some cases of multiple Call Traces, you will see that the module is not tainted at first, but later the same module *is* tainted!  This means the problem has gotten much more serious, something is corrupting code!  Do not continue!  Grab diagnostics and reboot!
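
On unRAID that's a single command from the console or SSH - a sketch (the zip should land on the flash drive, typically under /boot/logs):

diagnostics      # builds the dated diagnostics zip, same as Tools -> Diagnostics in the webGui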

 

* However, the VM logs will show 2 'taints' for each VM started - they are completely normal, ignore them.

 

OOM - This is the only one that's relatively obvious, but actually finding what is taking up too much memory, or leaking it, may not be easy (a couple of quick checks are sketched after this list)

GPF and kernel panic, Bug, or Oops - could be software or hardware ... hard to pin down ... examine the Call Trace, try Memtest, newer BIOS and firmware, change kernel ...

MCE - almost certainly hardware; start with Memtest, then install mcelog, which, when triggered by the next MCE, may provide a clue as to which hardware component is at fault; could also be overheating, I think

segfault - usually either bad RAM or dependency conflict issue; start with Memtest, then track down what installs or uses the package that's mentioned

IRQ not handled - check the syslog and lsirq and figure out what was using it ... the fix may be a new BIOS, new firmware, a newer kernel, or a new motherboard; this is the only one that's never fatal (once detected, the kernel disables that interrupt line), but if an important component like a device controller is disabled, then full server operation may not be possible
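
A couple of those quick checks, sketched as console commands (mcelog is not part of stock unRAID, so it would need to be installed separately):

grep -i "out of memory" /var/log/syslog    # OOM: finds the kernel's "Out of memory" kill messages
free -m                                    # OOM: current memory and cache usage in MB
mcelog                                     # MCE: run as root to decode any pending machine check records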

 

In my opinion, ALL of the above should be detected and alerted on, as critical issues.  While some are not immediately fatal, they *may* have altered code space, and I really don't think the user should continue; reboot as soon as possible!  I do not believe it is safe or wise to continue after any of the above.  I always grab diagnostics and any other helpful troubleshooting info, then reboot as soon as I can.  These are serious, and can result in a corrupted kernel, and therefore corrupted data.  There may not be any permanent damage yet, but it's not wise to continue.  These are not at all like exception handlers, where issues are detected *and* handled, and operation safely continues.  These are interruptions where system errors are detected but are TOO serious to handle.

 

This is also one very nice advantage we have as unRAID users, in that once we reboot, a brand new and uncorrupted kernel and server system is built, no matter what happened in the previous session!

 

Edited for Bug, Oops, and the 2 VM taints

Link to comment
