System becomes unresponsive, even with stock 5.0 stable (stock GUI)

Hawat · November 26, 2013

Sorry about the hijack. Wasn't aware I should start a new thread. Will do.

abhi.ko · November 26, 2013

Attach a new syslog. NFS does not support spaces in share names.

Here is the new syslog. The system has been running since the last reboot and I have been accessing shares/disks without issues till now.

SysLog_unRAID.txt

abhi.ko · November 26, 2013

Also, sometimes the tower stays unresponsive for a long time (the web GUI does not work, nor do the shares or the telnet session) and then all of a sudden it comes back. I keep refreshing the web GUI and after, say, 15 minutes of not responding it starts working. Absolutely no idea what is going on.

It just did that, and I can see absolutely nothing in the log that explains it.

Nov 25 17:47:37 Tower kernel: mdcmd (31): spindown 0
Nov 25 18:01:46 Tower mountd[1277]: authenticated mount request from MYIP:51255 for /mnt/user/Other Media (/mnt/user/Other Media)
Nov 25 18:01:51 Tower mountd[1277]: authenticated mount request from MYIP:51270 for /mnt/user/TV Shows (/mnt/user/TV Shows)
Nov 25 18:09:38 Tower kernel: mdcmd (32): spindown 3
Nov 25 18:16:39 Tower kernel: mdcmd (33): spindown 4
Nov 25 18:20:15 Tower mountd[1277]: authenticated mount request from MYIP:52819 for /mnt/user/Other Media (/mnt/user/Other Media)
Nov 25 18:37:50 Tower kernel: mdcmd (34): spindown 4
Nov 25 19:38:45 Tower mountd[1277]: authenticated mount request from MYIP:58755 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 19:40:35 Tower mountd[1277]: authenticated mount request from MYIP:58904 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 19:45:09 Tower mountd[1277]: authenticated mount request from MYIP:59262 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 19:52:02 Tower kernel: mdcmd (35): spindown 1
Nov 25 19:53:50 Tower mountd[1277]: authenticated mount request from MYIP:59912 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 19:55:13 Tower kernel: mdcmd (36): spindown 3
Nov 25 19:55:24 Tower kernel: mdcmd (37): spindown 4
Nov 25 20:02:15 Tower mountd[1277]: authenticated mount request from MYIP:60521 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 20:31:45 Tower mountd[1277]: authenticated mount request from MYIP:62721 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 20:55:46 Tower mountd[1277]: authenticated mount request from MYIP:64515 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 20:57:36 Tower mountd[1277]: authenticated mount request from MYIP:64649 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 20:59:05 Tower kernel: mdcmd (38): spindown 1
Nov 25 20:59:26 Tower kernel: mdcmd (39): spindown 3
Nov 25 21:01:57 Tower emhttp: shcmd (68): /usr/sbin/hdparm -y /dev/sdg &> /dev/null
Nov 25 21:02:01 Tower mountd[1277]: authenticated mount request from MYIP:64975 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 21:10:51 Tower mountd[1277]: authenticated mount request from MYIP:49274 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 21:19:16 Tower mountd[1277]: authenticated mount request from MYIP:49886 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 21:31:29 Tower kernel: mdcmd (40): spindown 4
Nov 25 21:35:21 Tower mountd[1277]: authenticated mount request from MYIP:51108 for /mnt/user/Movies (/mnt/user/Movies)
Nov 25 21:48:46 Tower mountd[1277]: authenticated mount request from MYIP:52126 for /mnt/user/Movies (/mnt/user/Movies)
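
A minimal sketch for pulling the interesting entries out of a log like the one above, so that spin-downs and NFS mount requests can be lined up against the time of a freeze. The function names `spindown_events` and `mounts_per_share` are made up for illustration, and the patterns only cover the message forms that appear in this excerpt; other unRAID versions may log differently.

```shell
# Print only the spin-down entries from a syslog read on stdin.
# Matches the two forms seen in the excerpt:
#   "mdcmd (NN): spindown N"   and   "hdparm -y /dev/sdX"
spindown_events() {
    grep -E 'mdcmd \([0-9]+\): spindown [0-9]+|hdparm -y /dev/sd[a-z]+'
}

# Count NFS mount requests per share path, busiest share first.
# The share path is the text between " for " and the trailing " (...)".
mounts_per_share() {
    grep 'authenticated mount request' |
        sed -E 's/.* for (.*) \(.*\)$/\1/' |
        sort | uniq -c | sort -rn
}
```

On the server this could be run as `spindown_events < /var/log/syslog` or `mounts_per_share < /var/log/syslog`; a saved copy of the log works the same way.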

doorunrun · November 26, 2013

I looked over your syslog and reviewed the thread; from the symptoms you describe it does sound like a memory issue. OK, I've brought this suggestion up in other threads and maybe I'm beating a dead horse....BUT here's another suggestion to try; see if it helps stabilize your system from going into unresponsive mode.

The code below is added to your GO file in the /boot/config folder. The theory here is to prevent an OOM (out of memory) condition from killing off important services. Since you don't use SMB you may not want to include the second line. I couldn't say if using this technique to prevent the NFS daemon from shutting down is a good thing since I don't use it.

pgrep -f "/usr/local/sbin/emhttp" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done
pgrep -f "/usr/local/sbin/smbd" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done

Are you using any disk controller cards in your system, or are all your disks plugged into the motherboard? BTW, what's the motherboard brand/model?

Thanks!
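
A minimal sketch for checking that the two go-file lines above actually took effect after a reboot. It only reads `/proc/<pid>/oom_score_adj`, the same file those lines write to. The function names `oom_adj_of` and `report_oom_adj` are hypothetical, and the optional second argument (an alternative /proc-like directory) is an assumption added purely so the function can be tried without touching real processes.

```shell
# Print "PID VALUE" for one process's oom_score_adj.
# $1 = PID, $2 = optional /proc-like root (defaults to /proc; assumption for testing).
oom_adj_of() {
    local pid=$1 root=${2:-/proc}
    local file="$root/$pid/oom_score_adj"
    if [ -r "$file" ]; then
        printf '%s %s\n' "$pid" "$(cat "$file")"
    else
        printf '%s unavailable\n' "$pid"
        return 1
    fi
}

# Report the value for every process whose command line matches a pattern,
# using the same pgrep -f matching as the go-file lines.
report_oom_adj() {
    pgrep -f "$1" | while read -r pid; do
        oom_adj_of "$pid"
    done
}
```

For example, `report_oom_adj /usr/local/sbin/emhttp` should print the emhttp PID followed by -1000 if the adjustment was applied. Note that `pgrep -f` can also match other processes whose command line happens to contain the pattern.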

abhi.ko · November 26, 2013

Thanks for the suggestion doorunrun. Will definitely try it. I have 4 GB of RAM and no addons installed right now, so if something is hogging my memory then it has to be unRAID itself, though I am not sure why it would. But it would definitely be worth a try. So thank you!

Meanwhile, just an update: I just upgraded to 5.0.2 and the system hasn't crashed yet. I am accessing shares and disks and trying to replicate the circumstances that used to cause the crash, and it is going great so far, no issues. That does not mean it won't crash; I only upgraded this morning, so not enough time has passed to tell one way or the other.

So I'm keeping my fingers crossed, and if everything goes fine for a couple of days then I will try installing a few addons (Plex Server and maybe upgrade the webGUI) and report back. It is a good thing that I am on vacation this week. I am still puzzled about what was causing the issue; I guess we will never know if 5.0.2 solved it.

Here is my config, same exact config that has been working without issues for over a year.

CPU: AMD A4-3400 2.7 GHz - http://www.newegg.com/Product/Product.aspx?Item=N82E16819103955

MoBo: ECS A75-FM2 (6 SATA 3 ports, USB 3.0) - http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=743866

RAM: PNY XLR8 4GB (2 x 2GB) 240-Pin DDR3 SDRAM DDR3 1600 (PC3 12800) - http://www.newegg.com/Product/Product.aspx?Item=N82E16820178265

PSU: Corsair AX 430W ATX - http://www.newegg.com/Product/Product.aspx?Item=N82E16817139026&Tpk=cx%20430

Case: NZXT Source 210 Black (8 x 3.5" Internal HDD drive bays) - http://www.newegg.com/Product/Product.aspx?Item=N82E16811146075

USB: Transcend 2 GB USB stick - for unRAID - http://www.amazon.com/Transcend-JetFlash-V30-Flash-TS8GJFV30E/dp/B00284AOSY/ref=sr_1_5?ie=UTF8&qid=1337112892&sr=8-5

I have an Intel NIC installed and use it instead of the onboard one. I also have a SATA expansion card (I am pretty sure I used this one for the unRAID box) - http://www.newegg.com/Product/Product.aspx?Item=N82E16816124045 - to which the cache drive is plugged in; all the data and parity drives are plugged into the onboard SATA 3 ports.

Thanks!

doorunrun · November 26, 2013

Thanks! I hope the update to 5.0.2 does the job. Good luck!

abhi.ko · November 26, 2013

Okay the server is still running but the webgui has crashed. I can still access shares and the system through telnet but tower/main is not accessible.

Saw this in the log corresponding to the time when the crash happened, not sure if this means anything (complete log attached below):

Nov 26 08:47:42 Tower emhttp: shcmd (59): /usr/sbin/hdparm -y /dev/sdg &> /dev/null

Is this related to the spaces in the share names ('TV Shows' and 'Other Media') for NFS that I had mentioned before?

syslog.txt

doorunrun · November 26, 2013

I believe that's the spindown command for one of the drives (sdg). In my case I started seeing those show up for an SSD cache drive, and there's no point in trying to spin it down, so I removed the option for that drive, FWIW.

abhi.ko · November 26, 2013

Okay thanks for that, sdg is my cache drive.

So the system did not crash, and I have been using it pretty much continuously since upgrading to 5.0.2. However, emhttp crashed and never came up again, so I had to reboot from telnet to get it running again. Everything else is working fine and I have had no issues. Wonder what is causing the webgui to go down?

Any ideas?

doorunrun · November 26, 2013

Now's a good time to try the OOM prevention line for emhttp in your go file.

dgaschk · November 27, 2013

The syslog does not indicate that emhttp is crashing. What does "ps x | grep emhttp" show?

abhi.ko · November 27, 2013

root@Tower:/# ps x | grep emhttp
1227 ?        Sl     0:03 /usr/local/sbin/emhttp
4348 pts/1    S+     0:00 grep emhttp

This is the output right now, while everything is running fine. I will try this again if it crashes later and let you know what it says.

But honestly doorunrun's 2 line script seems like it did something - haven't had any issues since I added that to the go file and rebooted. Installed uu and Plex and it all seems to be going great (so far - fingers crossed).

doorunrun · November 27, 2013

I appreciate you trying the script. I wish there was a better way to isolate the problem other than "cut and try."

With unRAID writing the syslog to RAM, capturing errors leading up to a crash is difficult. There is a script written by a forum user that redirected logging to the USB drive and was intended to be used on a temporary basis as a better way to analyze things after a crash. Here's a link to a thread with it attached: http://lime-technology.com/forum/index.php?topic=28316.msg251503#msg251503

Since the script writes frequently to your USB drive you shouldn't use it for very long. That action is said to shorten the life of the flash drive.

abhi.ko · November 27, 2013

I've got to thank you for the script doorunrun - I haven't had any issues so far since I tried it, and it looks like my assumption that 4GB was enough was wrong.

So far no system crashes after the 5.0.2 upgrade and no webgui crashes after your script was added to the go file. So it looks like I am back in business. I did install Plex and uu (and a few packages) yesterday and it has been running fine so far; I am planning to install crashplan and that would be it. Hopefully the 4GB will do for that much.

So thank you again. I will close out this thread after a couple more days if everything seems okay.

doorunrun · November 27, 2013

Yes, I agree with you 4GB should be enough. Another "urban myth" to point out is the idea of limiting your memory to 4095 MB. That has been reported to help with some motherboards. Limetech doesn't think much of this, but it's one of those things that just won't go away.

I see 5.0.3 has been posted, just in time for holiday tinkering

abhi.ko · December 13, 2013

Okay bad news is that the server is still crashing/freezing up and it is more frequent than before.

I have upgraded to 5.0.4 but still the problem is not solved. I haven't been able to access the server for over 2 weeks now, did not have time to post here till now. Getting to the point where I think I will have to ditch unRAID altogether and go for some other OS, but I am worried whether the migration would be painless and simple.

Trust me when I say that this is not what I wanted to do, but the issues I've had to deal with over the last couple of months have got me to the point where I would be happy just to be able to access the server. unRAID unfortunately is broken for me and I do not know how to fix it. Unless some kind-hearted soul here can help me, or Tom himself or his team can provide some customer service, I will have to migrate to other solutions. I am thinking FreeNAS or WHS - does anyone have any insights on what can be done here or on how to migrate to another server OS please?

All help is welcome.

I can try to capture the log by adding the script that doorunrun suggested to move the log writing to the USB drive, so that it will show us what is causing the freeze-up. Does anyone have any suggestions please?

limetech · December 13, 2013

Don't use that script. Just open a telnet session and type this command:

tail -f /var/log/syslog

Leave the window open. Any messages going into syslog will appear there. Alternatively you could click the 'Log' button on the left of the webGui menu bar and a browser window will open showing the same thing.

Were you able to run without NFS? If so, did it crash without running NFS?

These kinds of "complete freeze" problems are almost certainly hardware issues. Also memtest only tests the CPU<->MEM path, and not DMA<->MEM path. I've seen RAM pass memtest yet still cause failures.

abhi.ko · December 13, 2013

Okay Thank you for taking a look at it Tom.

I am doing what you suggested now on a completely stock 5.0.4 fresh install on my flash drive. I had to hard reboot the system, so it is running a parity check on reboot. I am letting the parity check run unless you think that I should stop it.

So far it is up and running. I have a copy of the syslog from the telnet session attached here for reference - the system is still running, so this log is from before the crash. Not sure if this information is useful but wanted to share.

No, I haven't tried stopping the NFS shares, especially since all my XBMC paths are defined using the NFS share, but I definitely can try that if that is what you suggest. Has "enabling NFS shares" been known to cause these issues?

I can get a new set of RAM sticks and try with those, but is there a way to test and be sure what part of the hardware is causing the issue? If not Memtest, then is there some other way we can tell?

Thanks for the help again.

syslog.txt

abhi.ko · December 13, 2013

Okay I spoke too soon. Crash/Freeze happened a few minutes after my last post. 3 more lines on the syslog, the complete syslog until the crash is posted here.

Now the system is running (meaning the fans are spinning and the system LEDs are blinking/on) but it is inaccessible or frozen in every other way. Not sure what is causing this; I hope the attached syslog gives us some clue as to what is going on.

syslog.txt

limetech · December 13, 2013

The next-to-last entry in the syslog shows a drive getting spun down. Maybe an I/O request is coming in (due to that mount in the last line of the syslog), which is trying to access that disk, and the PSU is buckling trying to spin it up. Can you tell if that drive (sdg) is in the spun up or spun down state?

Here are two tests to try. First is to disable spin-down. Just run with all disks spinning and see if the crash happens. You seem to be able to make it crash soon enough where this would be a clue.

The other test is, assuming you have 2 memory sticks plugged into your motherboard, remove one of the sticks and let it run. If it crashes, swap the still-plugged-in stick with the one you unplugged and try again. If it still crashes, probably not memory since you wouldn't expect both sticks to be bad (but stranger things have happened).

So I would suspect first the PSU, next the memory. Beyond that, maybe motherboard, controllers, etc. Probably not the s/w or else I would be getting lots more reports of this happening.

limetech · December 13, 2013

Something else to try. Without any I/O happening (maybe start in maintenance mode), click the Spin Down and Spin Up buttons a couple times to try and stress the PSU. You should not see any hangs, crashes, etc. just doing this.

abhi.ko · December 13, 2013

Tried the spinup and down in Maintenance mode at least 15 times without issues.

Now trying to run with spindown disabled for all the disks. So far no issues, but I noticed this at the end of the active syslog. Does it indicate something?

Dec 13 17:00:56 Tower emhttp: Restart NFS...
Dec 13 17:00:56 Tower emhttp: shcmd (127): exportfs -ra |& logger
Dec 13 17:00:56 Tower logger: exportfs: Warning: /mnt/user/TV Shows does not support NFS export.
Dec 13 17:00:56 Tower logger: exportfs: Warning: /mnt/user/Other Media does not support NFS export.
Dec 13 17:00:56 Tower emhttp: shcmd (128): /usr/local/sbin/emhttp_event svcs_restarted
Dec 13 17:00:56 Tower emhttp_event: svcs_restarted
Dec 13 17:00:56 Tower emhttp: shcmd (129): /usr/local/sbin/emhttp_event started
Dec 13 17:00:56 Tower emhttp_event: started
Dec 13 17:00:56 Tower avahi-daemon[1797]: Service "Tower" (/services/smb.service) successfully established.
Dec 13 17:01:07 Tower rpc.statd[1134]: nsm_parse_reply: can't decode RPC reply
Dec 13 17:01:45 Tower last message repeated 3 times
Dec 13 17:02:49 Tower last message repeated 5 times
Dec 13 17:03:53 Tower last message repeated 5 times
Dec 13 17:04:56 Tower last message repeated 5 times
Dec 13 17:06:00 Tower last message repeated 5 times
Dec 13 17:07:04 Tower last message repeated 5 times
Dec 13 17:08:08 Tower last message repeated 5 times
Dec 13 17:09:11 Tower last message repeated 5 times
Dec 13 17:10:15 Tower last message repeated 5 times
Dec 13 17:11:19 Tower last message repeated 5 times
Dec 13 17:12:22 Tower last message repeated 5 times

sdg is my cache drive and it was spun down I believe (blinking green light) when it crashed the last time.
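
The two exportfs warnings above both name shares with a space in them, which lines up with the earlier remark that NFS does not support spaces in share names. A minimal sketch for listing such shares, assuming user shares live as directories directly under /mnt/user as the syslog paths suggest (the function name `shares_with_spaces` is made up for illustration). Whether these warnings have anything to do with the freezes is not established in this thread.

```shell
# List directories directly under a share root whose names contain a space.
# $1 = share root (defaults to /mnt/user, the path seen in the syslog).
shares_with_spaces() {
    local root=${1:-/mnt/user}
    find "$root" -mindepth 1 -maxdepth 1 -type d -name '* *' | sort
}
```

Running `shares_with_spaces` on the server would be expected to print `/mnt/user/Other Media` and `/mnt/user/TV Shows` for the configuration described here, but not `/mnt/user/Movies`.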

limetech · December 13, 2013

Here's a similar redhat bug report for that:

https://bugzilla.redhat.com/show_bug.cgi?id=858793

Maybe something is crashing the s/w since half of NFS is implemented in the kernel - and if something goes wrong in there, well.... crash...

Something else to try then: Go to the Settings/NFS page, disable NFS, then reboot server. When server is back up, go to Settings/NFS page and enable NFS. Now see if it ever crashes. What this sequence does is guarantee that for sure, the NFS kernel modules are loaded before NFS is enabled. Maybe there's a race condition somewhere when NFS starts out enabled during server start up.

Edit: I really hate NFS. It's an antiquated protocol that makes a lot of things much more difficult. /rant

abhi.ko · December 14, 2013

Thanks a ton Tom!

I have no logical reason why I picked NFS over SMB to map my XBMC locations, but if that is all that needs to be fixed then it is just a matter of editing an xml file (for each XBMC installation - I have 5) to change to SMB. Now that I have heard your opinion on NFS I will try doing exactly that and switch it all to SMB.

I will try what you suggested for the disable-reboot-enable sequence first though and see if everything works out with NFS before switching to SMB.

For right now, no crashes after disabling spindown, but I noticed something interesting: there were a few instances of the web GUI (emhttp) freezing or not responding, and Telnet also returns a "host not found" message, but the media keeps playing on my HTPC while that is happening - any idea why that would be?

I am trying to access it from multiple locations but nothing has crashed it yet. Not sure if that is the end of that issue or whether it is just a matter of time before it comes back (sorry my previous experiences have made me a skeptic).
