Server Locking Up


I've been having difficulty with:

 

1) Server locking up when idle, for no apparent reason

2) Server locking up during a file copy

 

The issue goes away for some time and then crops up again, and when it does, the crash during a file copy is repeatable. This is one of those mornings. Normally, when it freezes I am unable to telnet in or reach the web interface, so I am unable to pull a syslog. This has happened tens of times, and every time I have to hit the power button to restart the server.

 

Well, this morning it happened 4 times while trying to copy a 4GB file to disk1. The 4th time it did something different: when I hit the power button, it beeped a bunch of times and did a controlled shutdown. Hence, I have a syslog to share! I was able to retrieve the syslog that it saved in the /flash/logs directory.
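
Since a hard freeze normally leaves no syslog behind, one option is to copy the log to the flash drive at intervals so that the last snapshot survives a power cycle. The following is only a minimal sketch, assuming a Python 3 interpreter is available on the server (a stock install may not provide one). The function name `snapshot_syslog` and both paths passed to it are hypothetical and would need to match the actual system.

```python
import shutil
import time
from pathlib import Path


def snapshot_syslog(src, dest_dir, keep=5):
    """Copy the log at `src` into `dest_dir` under a timestamped name,
    then delete all but the newest `keep` snapshots.

    Returns the path of the snapshot that was just written.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # A nanosecond timestamp keeps file names unique and sortable by age.
    target = dest_dir / f"syslog-{time.time_ns()}.txt"
    shutil.copyfile(src, target)

    # Oldest snapshots sort first; drop everything beyond the newest `keep`.
    snapshots = sorted(dest_dir.glob("syslog-*.txt"))
    excess = max(len(snapshots) - keep, 0)
    for old in snapshots[:excess]:
        old.unlink()
    return target
```

Calling something like `snapshot_syslog("/var/log/syslog", "/boot/logs")` once a minute from a loop or a scheduled job would keep a recent copy on flash. Both paths are assumptions, and frequent writes do add wear to the flash drive, so the interval is a trade-off.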

 

Could someone please look at this and tell me if you see any clues as to what might be happening?

Link to comment

The interesting lines are here:

[pre]
May 20 06:38:22 UNRAID kernel: mdcmd (19): nocheck
May 20 06:38:22 UNRAID kernel: md: md_do_sync: got signal, exit...
May 20 06:38:22 UNRAID kernel: md: recovery thread sync completion status: -4        <--- Boot process completed here
May 20 06:40:40 UNRAID emhttp: shcmd (38): beep -r 2                                <--- first sign of any problem
May 20 06:40:40 UNRAID kernel: TCP(wget:7651): Application bug, race in MSG_PEEK.
May 20 06:40:40 UNRAID emhttp: shcmd (39): /etc/rc.d/rc.samba stop | logger
May 20 06:40:40 UNRAID emhttp: shcmd (40): /etc/rc.d/rc.nfsd stop | logger
May 20 06:40:41 UNRAID emhttp: Spinning up all drives...
May 20 06:40:42 UNRAID emhttp: shcmd (41): sync
May 20 06:40:42 UNRAID emhttp: shcmd (42): umount /mnt/user
May 20 06:40:42 UNRAID emhttp: shcmd (43): rmdir /mnt/user
May 20 06:40:42 UNRAID emhttp: shcmd (44): umount /mnt/disk1
May 20 06:40:42 UNRAID emhttp: shcmd (44): umount /mnt/disk2
May 20 06:40:42 UNRAID emhttp: shcmd (44): umount /mnt/disk3
May 20 06:40:42 UNRAID emhttp: shcmd (44): umount /mnt/disk4
May 20 06:40:42 UNRAID emhttp: _shcmd: shcmd (44): exit status: 1
May 20 06:40:42 UNRAID emhttp: shcmd (45): rmdir /mnt/disk1
May 20 06:40:42 UNRAID emhttp: shcmd (45): umount /mnt/disk5
May 20 06:40:42 UNRAID emhttp: shcmd (45): umount /mnt/disk6
May 20 06:40:42 UNRAID emhttp: _shcmd: shcmd (45): exit status: 1
May 20 06:40:43 UNRAID emhttp: shcmd (47): rmdir /mnt/disk4
May 20 06:40:43 UNRAID emhttp: shcmd (49): rmdir /mnt/disk5
May 20 06:40:43 UNRAID emhttp: shcmd (50): rmdir /mnt/disk3
May 20 06:40:43 UNRAID emhttp: shcmd (52): rmdir /mnt/disk6
May 20 06:40:43 UNRAID emhttp: shcmd (55): rmdir /mnt/disk2
May 20 06:40:43 UNRAID kernel: mdcmd (28): stop
May 20 06:40:43 UNRAID kernel: md: 2 devices still in use.
May 20 06:40:43 UNRAID emhttp: shcmd (56): /etc/rc.d/rc.ntpd stop >/dev/null 2>&1
May 20 06:40:43 UNRAID ntpd[1469]: ntpd exiting on signal 15
May 20 06:40:44 UNRAID emhttp: _shcmd: shcmd (56): exit status: 1
May 20 06:40:44 UNRAID emhttp: shcmd (57): sync
May 20 06:40:46 UNRAID emhttp: shcmd (58): /sbin/poweroff
May 20 06:40:46 UNRAID shutdown[7721]: shutting down for system halt
May 20 06:40:46 UNRAID init: Switching to runlevel: 0
May 20 06:40:49 UNRAID rc.unRAID[7735]: Stopping unRAID.
[/pre]

 

I'm not sure... The server had just finished booting at 06:38:22. Then at 06:40:40 it issued the "beep" command and shut down. The only error is the "Application bug" message. It apparently is a kernel bug introduced in Linux 2.6.28.1, and it was probably the cause of the beep and subsequent shutdown.
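
To make the timing concrete, the gap between the end of boot and the first "beep" can be computed straight from the syslog timestamps. This is just an illustrative sketch: the helper name `seconds_between` is made up, and because classic syslog timestamps carry no year, the `year` argument is a placeholder assumption.

```python
from datetime import datetime


def seconds_between(first_line, second_line, year=2000):
    """Return the seconds elapsed between two classic syslog lines.

    Syslog timestamps omit the year, so `year` is a placeholder that only
    matters when the two lines straddle a leap day or a year boundary.
    """
    def stamp(line):
        # The first 15 characters look like "May 20 06:38:22".
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

    return (stamp(second_line) - stamp(first_line)).total_seconds()


boot_done = "May 20 06:38:22 UNRAID kernel: md: recovery thread sync completion status: -4"
first_beep = "May 20 06:40:40 UNRAID emhttp: shcmd (38): beep -r 2"
print(seconds_between(boot_done, first_beep))  # 138.0
```

So the shutdown sequence started 138 seconds after the last boot message, with nothing logged in between.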

 

Do a Google search on "kernel: TCP(wget): Application bug, race in MSG_PEEK" and you will see lots of complaints.

 

Joe L.

Link to comment

[pre]
May 20 06:38:22 UNRAID kernel: mdcmd (19): nocheck
May 20 06:38:22 UNRAID kernel: md: md_do_sync: got signal, exit...
May 20 06:38:22 UNRAID kernel: md: recovery thread sync completion status: -4        <--- Boot process completed here
May 20 06:40:40 UNRAID emhttp: shcmd (38): beep -r 2                                <--- first sign of any problem
May 20 06:40:40 UNRAID kernel: TCP(wget:7651): Application bug, race in MSG_PEEK.
May 20 06:40:40 UNRAID emhttp: shcmd (39): /etc/rc.d/rc.samba stop | logger
May 20 06:40:40 UNRAID emhttp: shcmd (40): /etc/rc.d/rc.nfsd stop | logger
[/pre]

Searching Google for "kernel: TCP(wget): Application bug, race in MSG_PEEK" does show that there is a kernel bug that they are trying to squash.

 

The beeps I heard, though, happened when I pressed the power button.  I thought that they were related to the powerdown script.  At the time the server actually locks up, I don't hear any beeps (it's a silent death).

 

Looking at this a second time... I do hear beeps at the end of the boot process, and that is where you marked "<--- first sign of any problem". Then immediately after that the "Application bug" is logged. So maybe something is happening at boot time that causes my server lock-ups later on (kind of a time-bomb scenario).

 

JT

Link to comment

If that is true, then it appears that when you pressed the power button, it tried to invoke the new script Tom recently added in /usr/local/sbin/powerdown, and that script invoked "wget". I'm guessing here... but that is the only use of "wget" I know of in the shutdown process. Since I did not see other obvious errors in the syslog, I really don't have a clue.
Link to comment

I don't see any other clues either.  You might try testing the standard causes of crashes: test the memory overnight, check for heat problems, try a different power supply.

 

In addition, you should try running from a completely stock configuration to make sure the problem is not related to one of your many addons. Just rename your go script, and use the original one.
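
As a rough illustration of that swap, the sketch below moves the customized go script aside and copies a stock one into its place. It assumes a Python 3 interpreter is available. The function name `use_stock_go`, the `.addons` backup suffix, and both paths are hypothetical. The same thing can of course be done by hand with a rename and a copy.

```python
import shutil
from pathlib import Path


def use_stock_go(go_path, stock_path, backup_suffix=".addons"):
    """Move the customized go script aside and install a stock copy.

    `go_path` is the script run at boot and `stock_path` is an unmodified
    original. Both locations are assumptions supplied by the caller.
    Returns the path of the backup so it can be restored later.
    """
    go_path = Path(go_path)
    backup = go_path.with_name(go_path.name + backup_suffix)
    if backup.exists():
        # Refuse to clobber an earlier backup of the customized script.
        raise FileExistsError(f"{backup} already exists; not overwriting it")
    go_path.rename(backup)
    shutil.copyfile(stock_path, go_path)
    return backup
```

Restoring the addons afterwards is just the reverse: copy the backup file over the go script and reboot.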

Link to comment

Well, really, I've been reporting on this problem since January:

 

http://lime-technology.com/forum/index.php?topic=3000.0

 

As you can see in that post, I've replaced the motherboard and the power supply (now a 610W single-rail unit), I've removed the problematic Seagate 1.5TB drive from the array entirely, I've done numerous memory checks (never a problem), I've tried removing add-ins, and I've installed new SATA controllers. Also, I have kept up with all unRAID releases (currently 4.5b6). I'm beginning to think that there is a ghost in my machine.

 

I'm only re-reporting the issue now because this is the first time since January that I've managed to record a syslog after a lock-up. It's also possible that this recent lock-up is a symptom of a different cause, and that is why I was able to get the syslog.

 

Who knows?

 

I really appreciate that you guys looked at that syslog for me.  Thank you.

Link to comment

Just read your other thread, and WOW! You have had a tough path here. Most or all of what you called 'parity syncs' were parity checks that could have been immediately and safely aborted, saving you a LOT of time. A parity check after a bad shutdown is a Good Thing, very necessary, but it only needs to run for the first few percent of the drive (in my opinion), unless it has not been run recently!

 

These kinds of problems are very hard to solve.  I hope you have been following a similar story here, as it is possible there will be something there that might spark an idea to be tested, that will resolve your issue.

 

You won't like this next recommendation, as it's natural to want your shiny new system to run well, with ALL of the fancy extras you paid for (went to the trouble of installing and configuring), but from my standpoint as a troubleshooter, all of your addons make me very uncomfortable.  They add a lot of complexity, and make it hard to 'point the finger'.  Since you have found a repeatable test, that usually crashes the machine, would you mind confirming for us one more time, that running with a stock go file (absolutely no addons or system tweaks) still causes a crash?

Link to comment
  • 1 year later...

Well, I'm having difficulty with my server locking up during file transfer again.  I haven't changed any hardware at all, other than adding a few drives.  I have changed the server version to 4.7, though.

 

What happens is I will begin a large file transfer (usually 4.5GB ISO files) and then everything just stops... The web interface, telnet, the transfer... everything.  After a hard power cycle everything comes back up. 

 

This time I managed to capture some info in a running telnet session window.  Can anyone make any sense of this info:

 

root@UNRAID:/mnt#
Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: EIP: [<c10798cc>] __d_lookup+0xb8/0xd5 SS:ESP 0068:def11ddc

Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: Stack:

Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: Code: e8 39 42 04 75 1c 8b 42 08 8b 4d e8 8b 55 f0 e8 65 b0 0b 00 85 c0 75 0a f0 ff 03 fe 43 08 89 d8 eb 1e fe 43 08 8b 3f 85 ff 74 13 <8b> 07 0f 18 00 90 8d 5f ec 8b 4d ec 39 4b 20 75 e9 eb 87 31 c0

Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:02:04.0/host2/target2:0:0/2:0:0:0/block/sdb/stat

Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: Call Trace:

Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: CR2: 0000000086ac00bf

Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: Process find (pid: 25025, ti=def10000 task=f6c9acb0 task.ti=def10000)

Message from syslogd@UNRAID at Wed Jan 26 07:21:15 2011 ...
UNRAID kernel: Oops: 0000 [#1] SMP
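
The broadcast messages above arrived out of order, so it can help to pull the key fields out of the capture before reading it. The sketch below is only an illustration: the function name `summarize_oops` is made up, and the regular expressions are written against the exact lines in this capture rather than against every oops format.

```python
import re


def summarize_oops(text):
    """Extract a few key fields from a kernel oops capture.

    Returns a dict with whichever of these keys could be found:
    'function' (symbol at EIP), 'process', 'pid', and 'fault_address'
    (the CR2 register, which holds the faulting address on x86).
    """
    summary = {}

    eip = re.search(r"EIP: \[<[0-9a-f]+>\] (\w+)\+0x", text)
    if eip:
        summary["function"] = eip.group(1)

    proc = re.search(r"Process (\S+) \(pid: (\d+)", text)
    if proc:
        summary["process"] = proc.group(1)
        summary["pid"] = int(proc.group(2))

    cr2 = re.search(r"CR2: ([0-9a-f]+)", text)
    if cr2:
        summary["fault_address"] = cr2.group(1)

    return summary
```

Run against the capture above, this reports the function `__d_lookup`, the process `find` with pid 25025, and the fault address `0000000086ac00bf`. In other words, the oops was hit while a `find` process was doing a directory-entry lookup.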

Link to comment
