Mover crashing server



Hi all,

 

I just wanted to raise this issue one last time in an rc thread, before starting a new one...

 

For those who haven't read my previous posts in the rc1 and rc2 threads: since upgrading to rc1, rc2 and rc3, mover has been completely crashing my unRAID system, to the point where the shares, GUI and telnet access stop working. Everything else seems to run normally (the shares, SAB, Sick Beard, Couch Potato and PLEX all behave as they should) as long as mover isn't started.

 

I posted a limited syslog in the rc2 thread, which was looked at, and the problem was considered to be potentially hardware related. However, I have since found out how to create a full syslog via unmenu (sorry, noob!) and just wanted to post again and get one last opinion as to whether it could be rc related before starting a separate thread, as it does seem to be a bit of a coincidence.

 

I have noticed this error in the log,

unRAID kernel: ata5: sas eh calling libata port error handler

repeatedly occurring, including for other ata port numbers. It has actually been brought up before in this thread... http://lime-technology.com/forum/index.php?topic=15049.285 and was potentially linked to beta 12a, but I can't seem to find an explanation for the problem.

 

I don't know if this syslog will help diagnose the problem, or if it is rc related, but i just wanted to check one last time with as much info as i could get.

 

Thank you,

 

Rich

syslog-2012-05-13.txt

Link to comment

Have you disabled/uninstalled all add-ons to see what happens? This needs to be your first step, to find out whether it is unRAID or an add-on that is causing the problem. If it still happens with all add-ons uninstalled, attach the syslog here. If it is fine, install the add-ons one at a time, and when the problem comes back, post in that add-on's thread.

Link to comment

The crash seems to be in the shared file-system when accessing libc.

May 13 19:11:04 unRAID kernel: shfs[2745]: segfault at 8000000 ip b750c68c sp b74640e8 error 4 in libc-2.11.1.so[b7496000+15c000]

May 13 19:11:05 unRAID logger: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]

From what I've read in this post:

http://lime-technology.com/forum/index.php?topic=2903.msg24380#msg24380

 

SAB installs a different version of libc with this command:

cp /boot/custom/usr/share/packages/libstdc++.so* /usr/lib

 

You've probably installed an incompatible version of libc.  (incompatible with the operating system version used in the rc2/3 series.)    Typically, this is not a library to be replaced unless you know exactly what you are doing.

 

In any case, expect little to no help from lime-tech unless you have the same issue with the mover/rsync with NO add-ons loaded, especially when the segmentation fault is in a library you installed with an add-on.  He has mentioned several times that he is interested in issues with the release he is distributing, not in those caused by an add-on.  (Those issues should be reported to their respective authors)

 

Joe L.
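To see at a glance which process crashed and in which library, the kernel segfault lines can be pulled out of the syslog. This is only a sketch: the function name `list_segfaults` is made up for illustration, and `/var/log/syslog` as the log location is an assumption about the system in question.

```shell
# Print "process -> object" for every kernel segfault line in a syslog file.
# list_segfaults is a hypothetical helper name, not part of unRAID.
list_segfaults() {
    # Kernel lines look like:
    #   ... kernel: NAME[PID]: segfault at ADDR ip ... error N in OBJECT[BASE+SIZE]
    # Capture NAME and OBJECT and drop everything else.
    sed -n 's/.*kernel: \([^[]*\)\[[0-9]*]: segfault at .* in \([^[]*\)\[.*/\1 -> \2/p' "$1"
}
```

For the line quoted above, `list_segfaults /var/log/syslog` would print `shfs -> libc-2.11.1.so`, i.e. the user-share file system process faulting inside libc.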

Link to comment

Sorry, i should have said in my previous post, this problem is happening with 0 add-ons running, i've just rebooted the system and created a new syslog with nothing running but unmenu, just to double check.

 

Based on what Joe L has said above, you should try with the add-ons (sabnzbd specifically) not installed at all. That means not installed, rather than installed but not running.

 

The problem may be in the installation phase of the plugin, so it will appear whether the plugin is actually running or not.

Link to comment

Exactly. Some of your files are still being overwritten by a version supplied with a plugin.

Link to comment

...

I have noticed this error in the log,

unRAID kernel: ata5: sas eh calling libata port error handler

repeatedly occurring, as well as for different sata numbers and it has actually previously been brought up in this thread... http://lime-technology.com/forum/index.php?topic=15049.285 and was potentially linked to beta 12a, however i can't seem to find an explanation for the problem?

 

Those are initialization debugs, not errors. They're just telling the reader where errors will be handled, should they occur.

Link to comment

The correct version of glibc for RC3 is 2.9, found here:

http://slackware.osuosl.org/slackware-13.0/slackware/l/glibc-2.9-i486-3.txz

 

Tom, are you still using Slackware 13 as the development environment?
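If the 2.9 figure above is right, one quick sanity check is to look at which libc files are actually present on the running system. The sketch below assumes the library lives in `/lib` and follows the `libc-<version>.so` naming seen in the segfault line; the helper names `libc_versions` and `libc_matches` are invented for this example.

```shell
# List the glibc versions found in a library directory, going only by the
# libc-<version>.so file names. The /lib default is an assumption.
libc_versions() {
    dir=${1:-/lib}
    for f in "$dir"/libc-*.so; do
        [ -e "$f" ] || continue        # the glob matched nothing
        base=${f##*/}                  # e.g. libc-2.9.so
        base=${base#libc-}             # e.g. 2.9.so
        printf '%s\n' "${base%.so}"    # e.g. 2.9
    done
}

# Succeed only if the expected version is the one and only version present.
# Usage: libc_matches 2.9 /lib
libc_matches() {
    [ "$(libc_versions "$2")" = "$1" ]
}
```

Something like `libc_matches 2.9 /lib || libc_versions /lib` would then print whatever versions are there when the expected one is not the only one found.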

Link to comment

I took everything out of the go script, so although the folder for sab is sat on my cache drive still, it is not involved in the system boot, at all. Is that not enough, or is there something else i need to do to isolate it?

 

Thanks

Link to comment

Look for ".plg" files and remove or rename them (.pl_), they do installations as well.

 

Ok, this is a copy of the extra lines in my go script...

sleep 10
/boot/unmenu/uu
cd /boot/packages && find . -name '*.auto_install' -type f -print | sort | xargs -n1 sh -c 
sleep 20
# wait for the cache drive to come online, checking up to 8 times (15 s apart)
for i in 0 1 2 3 4 5 6 7
do
    if [ ! -d /mnt/cache ]
    then
      sleep 15
    fi
done
# If Cache drive is online, start SABnzbd, Sickbeard, and CouchPotato
if [ -d /mnt/cache ]; then
  cd /mnt/cache/.usenet
  installpkg /boot/packages/SABnzbdDependencies-2.1-i486-unRAID.tgz
  python /mnt/cache/.usenet/sabnzbd/SABnzbd.py -d
  python /mnt/cache/.usenet/sickbeard/SickBeard.py --daemon
  python /mnt/cache/.usenet/couchpotato/CouchPotato.py -d
fi
sleep 10
/mnt/user/PLEX/0961/start.sh

 

Not only have I taken all the above out, but I've renamed the 'packages', '.usenet' and '0961' folders, so the paths would be incorrect anyway.

I can't find any .plg files, or think of anything more I can do. I hope that's enough to demonstrate an add-on free boot up?

 

Sadly though, I still have the mover problem  :(
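For anyone wanting to double check a flash drive for leftover installers, the manual search described in this thread can be scripted. This is a rough sketch under assumptions: the flash is mounted at `/boot`, and the only things looked for are the three mentioned here (`.plg` files, `.auto_install` scripts, and anything under an `extra` folder). The function name `find_addon_installers` is hypothetical.

```shell
# List files under the flash root that could install add-ons at boot.
# Patterns are taken from this thread; the /boot default is an assumption.
find_addon_installers() {
    root=${1:-/boot}
    find "$root" -type f \
        \( -name '*.plg' -o -name '*.auto_install' -o -path "$root/extra/*" \) \
        -print | sort
}
```

An empty result from `find_addon_installers /boot` only covers these three patterns; it does not check what the go script itself runs.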

Link to comment

 

Have you examined the cache drive lately? Permissions, etc? Might be the heat, but I couldn't find your original mover report/syslog.

 

Link to comment

Have you looked into the /boot/extra folder?

 

Yeah, renamed that as well  :-\

 

The safest way to test an RC is with a clean install. Rather than digging through configs and folders and renaming files to stop plugins, you should just copy everything off your flash drive (if you want to keep some configs), then format the flash and put a clean install of 5.0-rcX on it to test with. Just replacing bzimage/bzroot is not enough when bug testing an RC.

 

Just reformatted my flash drive and then freshly installed rc3 with absolutely nothing else and same problem  :'(

Although i think the syslog looks slightly different...

syslog.txt

Link to comment

That puts the ball back in lime-tech's court...

 

Same failure... in the user-share file system.

May 13 19:10:44 unRAID logger: mover started
May 13 19:10:44 unRAID logger: moving Files and Programs/
May 13 19:10:44 unRAID logger: ./Files and Programs/My Programs/Parallels Desktop v7.0.15055/Parallels Desktop v7.0.15055.dmg
May 13 19:10:44 unRAID logger: .d..t...... ./
May 13 19:10:45 unRAID logger: cd+++++++++ Files and Programs/My Programs/Parallels Desktop v7.0.15055/
May 13 19:10:45 unRAID logger: >f+++++++++ Files and Programs/My Programs/Parallels Desktop v7.0.15055/Parallels Desktop v7.0.15055.dmg
May 13 19:11:04 unRAID logger: moving Media/
May 13 19:11:04 unRAID logger: ./Media/TV/Fringe/Season04/Fringe - 04x22.mp4
May 13 19:11:04 unRAID logger: .d..t...... ./
May 13 19:11:04 unRAID logger: rsync: get_xattr_names: llistxattr("Media",1024) failed: Software caused connection abort (103)
May 13 19:11:04 unRAID logger: .d..t...... Media/
May 13 19:11:04 unRAID logger: rsync: get_acl: sys_acl_get_file(Media, ACL_TYPE_ACCESS): Transport endpoint is not connected (107)
May 13 19:11:04 unRAID kernel: shfs[2745]: segfault at 8000000 ip b750c68c sp b74640e8 error 4 in libc-2.11.1.so[b7496000+15c000]
May 13 19:11:05 unRAID logger: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1042) [sender=3.0.7]
May 13 19:11:05 unRAID logger: ./Media/TV/Bones/Bones - VII/Bones - 07x12.mp4
May 13 19:11:05 unRAID logger: rsync: get_acl: sys_acl_get_file(., ACL_TYPE_ACCESS): Transport endpoint is not connected (107)

 

Looks like Tom has some investigation to do...

 

Joe L.
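The errno values in the rsync lines tell the same story as the segfault: on Linux, 103 is ECONNABORTED and 107 is ENOTCONN, which is what a FUSE mount returns once the process behind it has gone away. A small sketch for tallying them from a syslog follows; `rsync_errno_counts` is a made-up helper name, and it only understands lines that end in the `(NNN)` errno form shown above.

```shell
# Tally rsync failure messages per errno, printing "count errno message".
# Only lines of the form "... logger: rsync: <call>: <message> (<errno>)" count.
rsync_errno_counts() {
    sed -n 's/.*logger: rsync: .*: \(.*\) (\([0-9][0-9]*\))$/\2 \1/p' "$1" |
        sort | uniq -c | sed 's/^ *//'
}
```

A run of 107 entries starting at the same second as an shfs segfault would suggest that rsync is a victim of the crash, not its cause.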

Link to comment

Thanks for your help and patience Joe, very much appreciated  :)

 

Tom/Lime-Tech, i'd really appreciate your help with this one please...

Link to comment

You might want to shoot over an email to him to ensure it doesn't get lost in the thread (if it keeps growing of course).

 

Trying to reproduce...

 

Thank you

 

Rich-

I don't see this issue on any test server.  I do see possible file system corruption in your system log, however.  Please run a 'file system check' on all your data drives, and then see if this issue persists.

 

Here's how to do this (running unRaid version 5.0):

 

Stop the array.  You should notice a checkbox under the Start button that says “Maintenance mode”.  Check this and then Start the array.  This will start the unRaid driver but not mount any of the hard disks.  Next open a telnet window and type this command for each data disk:

 

reiserfsck /dev/md1  <-- this will check disk1

reiserfsck /dev/md2  <-- this will check disk2

etc.

 

When you invoke ‘reiserfsck’ it will ask you to type ‘Yes’ to continue – type exactly like that, upper case Y, lower case e, lower case s.

 

The utility can take anywhere from a few minutes to as long as an hour depending mainly on how many files are on the disk.

 

If it finds errors, it will report them and say what action to take next in order to fix them.  Typically you re-run the 'reiserfsck' utility with a specific switch.  Follow whatever instructions it gives.

 

Once all disks have been checked, you can Stop the array and then Start it again normally.
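With several data disks it can help to generate the list of check commands up front. The sketch below only prints them and runs nothing, so each one can still be pasted into the telnet session and answered with 'Yes' by hand. The function name `fsck_commands` is hypothetical, and it assumes the data disks are numbered consecutively from md1.

```shell
# Print one check command per data disk for disks 1..N. Nothing is executed.
# Assumes the array devices are /dev/md1 .. /dev/mdN with no gaps.
fsck_commands() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        printf 'reiserfsck --check /dev/md%d\n' "$i"
        i=$((i + 1))
    done
}
```

For a five-disk array, `fsck_commands 5` prints the five lines to work through in Maintenance mode.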

Link to comment

Just finished running reiserfsck on all 5 data disks and they all came back 'No corruptions found'  :-\

I also managed to check the cache drive too, by un-assigning it, restarting maintenance mode and then running...

reiserfsck --check /dev/sdc1

 

But that came back 'No corruptions found' as well.

 

 

Link to comment
  • 3 months later...

Hi all,

 

It's been a few months (mainly because of work), but recently I've been playing around with some settings to see if they have any impact on my mover problem, and I just wanted to post my findings and ask the unRAID pros a few questions.

 

I started off by changing mover to only run once a month, which meant new files just sat on my cache drive, until i recently moved everything manually to the appropriate disk and share. This didn't solve the mover issue, the system still crashed, just only once a month, which made it more useable. After two monthly crashes, i started to notice that duplicate files were appearing in the syslog, which (although i didn't understand the mover syslog errors) suggested to me that mover was at least partially working, as it had started to create the file in the right place, albeit not the whole file.

 

After i finished manually ‘moving’ 250 gigs worth of files to the right places, I changed all my shares so they would no longer use the cache drive. I then invoked mover to see what would happen and for the first time in ages it ran, start to finish, without crashing the system... Admittedly there was nothing to ‘move’, as the cache drive was empty, but the process still completed with no visible problems.

 

A few days later i decided to test a theory and created a ‘TEST’ share (set to use the cache drive) and copied a file over. I then invoked mover and to my surprise it worked perfectly, although it did take around 5 mins for a 70MB file. My next step was to try a bigger file... 7GB, but unfortunately the file never ‘moved’ and i also noticed a duplicate file warning in the syslog (no crash though). I then noticed something i hadn't seen before in the syslog...

shfs/user0: shfs_write: write: (28) No space left on device

 

This got me thinking: all my shares are set to 'fill-up', so what if mover was always trying to 'move' files to the first numerical disk, i.e. Disk1, to 'fill it up', but was then running out of space and quitting straight away, without trying the next disk? That would explain the incomplete, duplicate files and why the 70MB file worked but the 7GB one didn't. So I changed my 'TEST' share to 'high-water' and re-invoked mover, which then worked perfectly (other than quite a large number of 'duplicate object' warnings), as it 'moved' the 7GB file to a disk with free space, instead of trying to fit it on the nearly full disk1 and/or disk2.
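A quick way to test this kind of theory is to check which disks actually have room for the file before invoking mover. The sketch below just filters `df -Pk` output by available space; it says nothing about how unRAID itself picks a disk under 'fill-up' or 'high-water'. The function name `mounts_with_room` is made up, and the `/mnt/disk*` paths in the usage line are an assumption based on the paths in the syslog.

```shell
# Read `df -Pk` output on stdin and print the mount points that have at
# least the requested number of kilobytes available.
# Column 4 is "Available" and column 6 is "Mounted on" in the -P format.
mounts_with_room() {
    awk -v need="$1" 'NR > 1 && $4 + 0 >= need + 0 { print $6 }'
}
```

For a 7GB file (7 x 1024 x 1024 = 7340032 KB) that would be `df -Pk /mnt/disk[0-9]* | mounts_with_room 7340032`. If disk1 is missing from the output while the share still targets it, a "No space left on device" error is the expected outcome.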

 

So the testing goes on! While i wasn't able to determine why mover was crashing the system, it seems to have stopped now. Next, I am going to set one of my media shares back to ‘high-water’ and enable the cache disk again and see what happens.

 

Two things i’m curious about though (pardon my ignorance)...

Does mover not have the facility to detect full drives? And, if the share is set to 'fill-up', does it by default try to 'move' to Disk1 first and then just stop if that disk is full?

Are the large number of ‘duplicate object’ reports, in the syslog (see below), for the successful mover cycle, normal?

 

Thanks in advance,

 

Rich

 

 

Syslogs...

 

TEST share set to Fill-up - Unsuccessful

Aug 17 15:17:45 unRAID logger: mover started
Aug 17 15:17:45 unRAID logger: skipping PLEX/
Aug 17 15:17:45 unRAID logger: moving TEST/
Aug 17 15:17:45 unRAID logger: ./TEST/TEST.mkv
Aug 17 15:17:45 unRAID logger: .d..t...... TEST/
Aug 17 15:17:45 unRAID logger: >f+++++++++ TEST/TEST.mkv
Aug 17 15:17:45 unRAID shfs/user0: shfs_write: write: (28) No space left on device
Aug 17 15:17:45 unRAID shfs/user: duplicate object: /mnt/disk1/TEST/TEST.mkv (Minor Issues)
Aug 17 15:17:45 unRAID logger: rsync: writefd_unbuffered failed to write 4092 bytes to socket [sender]: Broken pipe (32) (Minor Issues)
Aug 17 15:17:45 unRAID logger: rsync: write failed on "/mnt/user0/TEST/TEST.mkv": No space left on device (28) (Minor Issues)
Aug 17 15:17:45 unRAID logger: rsync error: error in file IO (code 11) at receiver.c(302) [receiver=3.0.7] (Errors)
Aug 17 15:17:45 unRAID logger: rsync: connection unexpectedly closed (31 bytes received so far) [sender]
Aug 17 15:17:45 unRAID logger: rsync error: error in rsync protocol data stream (code 12) at io.c(601) [sender=3.0.7] (Errors)
Aug 17 15:17:45 unRAID logger: mover finished

 

TEST share set to High-water - Successful

Aug 17 15:25:05 unRAID logger: mover started
Aug 17 15:25:05 unRAID logger: skipping PLEX/
Aug 17 15:25:05 unRAID logger: moving TEST/
Aug 17 15:25:05 unRAID logger: ./TEST/TEST.mkv
Aug 17 15:25:05 unRAID logger: .d..t...... TEST/
Aug 17 15:25:05 unRAID logger: >f+++++++++ TEST/TEST.mkv
Aug 17 15:25:06 unRAID shfs/user: duplicate object: /mnt/disk4/TEST/TEST.mkv (Minor Issues)
Aug 17 15:25:41 unRAID last message repeated 10 times
Aug 17 15:26:42 unRAID last message repeated 14 times
Aug 17 15:27:47 unRAID last message repeated 13 times
Aug 17 15:28:52 unRAID last message repeated 13 times
Aug 17 15:28:55 unRAID logger: ./TEST/
Aug 17 15:28:55 unRAID logger: .d..t...... ./
Aug 17 15:28:55 unRAID logger: .d..t...... TEST/
Aug 17 15:28:56 unRAID logger: mover finished
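To put a number on the 'duplicate object' reports, the syslog can be tallied, including the "last message repeated N times" lines that follow a duplicate-object message. This is a sketch that assumes those repeat lines refer to the line directly before them; `count_duplicate_objects` is a hypothetical helper name.

```shell
# Count shfs "duplicate object" messages in a syslog file, adding the counts
# from "last message repeated N times" lines that directly follow one.
count_duplicate_objects() {
    awk '
        /last message repeated [0-9]+ times/ {
            if (prev_dup) {
                # The number is the field right after the word "repeated".
                for (i = 1; i < NF; i++)
                    if ($i == "repeated") total += $(i + 1)
            }
            next
        }
        {
            prev_dup = ($0 ~ /shfs\/user: duplicate object:/)
            if (prev_dup) total++
        }
        END { print total + 0 }
    ' "$1"
}
```

Run against the high-water log above, it should count 51 (one logged line plus 10 + 14 + 13 + 13 repeats), all for the same file while it was being written.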

Link to comment

Following my above post, i set my media share back to using the cache drive and to 'high-water', then after routine use and files being placed on the cache drive, invoked 'mover', only to have my entire system crash again. However if i, again, place a file in my newly created 'TEST' share (on the cache drive), mover works as it should.

 

Could this be a permissions issue, with the shares i created before upgrading to rc3??
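One way to look into the permissions question is to list anything in the older share whose owner differs from what the new 'TEST' share uses. The sketch below is only a starting point: the function name `list_foreign_owned` is invented, and the default owner `nobody` is an assumption about unRAID 5 conventions that should be checked against a known-good share first.

```shell
# List entries under a directory that are NOT owned by the expected user.
# The "nobody" default is an assumption; pass the real owner as argument 2.
list_foreign_owned() {
    dir=$1
    owner=${2:-nobody}
    find "$dir" ! -user "$owner" -print
}
```

For example, `ls -ld /mnt/cache/TEST` shows the owner of the working share, and `list_foreign_owned /mnt/cache/Media <that owner>` then shows whatever in the older share differs from it.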

Link to comment
