
[SOLVED] Mover failing, allocation problems?



The box is a Dell SC440 running Unraid 4.6 with a 6-drive array (5 data + parity) and an 80GB cache drive.

 

The box has been running nicely for over a year. I noticed a backup to Unraid was crashing with disk-full errors, so I browsed to Tower. The cache drive was down to 2GB free. I manually started the mover from the Shares tab and watched as it filled one data drive completely (0 free space) and never touched another data drive. Reads continue to increment on the cache drive, and reads and writes increment on the full drive and the parity drive, but available space on the cache drive never changes.

 

I stopped the backup process so no files should be open. All my shares are set to use high-water allocation. Free space on the other data drives is: 269G, 973G, 1161G, 263G.

 

What am I seeing here? I'm happy to disable cache for the backup share but how do I recover from this as safely as possible?

 

Unraid has been up for weeks but syslog only shows output for part of today. I presume the traffic is wrapping its buffer.

 

--- Mover first gets angry about here ---

Mar 25 14:41:14 Tower logger: >f.stp..... Backups/FrankenPC-III/genie3/_Genie Timeline/0/C/Users/Dude/.VirtualBox/Machines/BasicXP/Snapshots/{c51a8c5e-f709-47a7-a968-ba748bf068ef}.10.vdi

Mar 25 14:53:25 Tower shfs0: shfs_write: write: (28) No space left on device

Mar 25 14:53:25 Tower logger: rsync: writefd_unbuffered failed to write 4 bytes [sender]: Broken pipe (32)

Mar 25 14:53:25 Tower logger: rsync: write failed on "/mnt/user0/Backups/FrankenPC-III/genie3/_Genie Timeline/0/C/Users/Dude/.VirtualBox/Machines/BasicXP/Snapshots/{c51a8c5e-f709-47a7-a968-ba748bf068ef}.10.vdi": No space left on device (28)

Mar 25 14:53:25 Tower logger: rsync error: error in file IO (code 11) at receiver.c(298) [receiver=3.0.2]

Mar 25 14:53:25 Tower logger: rsync: connection unexpectedly closed (31 bytes received so far) [sender]

Mar 25 14:53:25 Tower logger: rsync error: error in rsync protocol data stream (code 12) at io.c(635) [sender=3.0.2]

Mar 25 14:53:25 Tower logger: ./Backups/FrankenPC-III/genie3/_Genie Timeline/0/C/Users/Dude/.VirtualBox/Machines/BasicXP/Snapshots/{c51a8c5e-f709-47a7-a968-ba748bf068ef}.11.vdi

Mar 25 14:53:25 Tower logger: >f.stp..... Backups/FrankenPC-III/genie3/_Genie Timeline/0/C/Users/Dude/.VirtualBox/Machines/BasicXP/Snapshots/{c51a8c5e-f709-47a7-a968-ba748bf068ef}.11.vdi

Mar 25 14:53:25 Tower shfs0: shfs_write: write: (28) No space left on device

 

--- lots of dupe object errors ---

 

Mar 25 04:40:02 Tower syslogd 1.4.1: restart.

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.db.1

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.10

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.11

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.12

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.13

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.14

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.15

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.16

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.17

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.18

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.19

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.20

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.21

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.22

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.23

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.24

Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.25

...repeat the above 20+ times...

Mar 25 04:45:49 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/0/C/Users/Dude/.VirtualBox/Machines/BasicXP/Snapshots/{c51a8c5e-f709-47a7-a968-ba748bf068ef}.10.vdi

Mar 25 04:45:49 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/0/C/Users/Dude/.VirtualBox/Machines

...repeat above line 50+ times...

Mar 25 04:47:06 Tower shfs: shfs_write: write: (28) No space left on device

Mar 25 04:47:06 Tower last message repeated 2 times

Mar 25 04:47:16 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/0/C/Users/Dude/AppData/Local/Mozilla/Firefox/Profiles/w2dzaeps.default/Cache/3/A0/26ECBd01

...ad nauseam...
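
A log this repetitive is easier to read as a tally. The following is a minimal sketch, not part of the original post, that counts `duplicate object` reports per disk and out-of-space errors. The function name `summarize` and the regular expressions are assumptions based only on the line formats shown above.

```python
import re
from collections import Counter

# Matches lines such as:
#   "Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/..."
DUPLICATE = re.compile(r"shfs\d*: duplicate object: /mnt/(disk\d+)/")
# Matches both "(28) No space left on device" and "No space left on device (28)".
NO_SPACE = re.compile(r"No space left on device")


def summarize(lines):
    """Count duplicate-object reports per disk and out-of-space errors."""
    duplicates = Counter()
    no_space = 0
    for line in lines:
        match = DUPLICATE.search(line)
        if match:
            duplicates[match.group(1)] += 1
        elif NO_SPACE.search(line):
            no_space += 1
    return duplicates, no_space


# Three lines copied from the excerpt above.
sample = [
    "Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.db.1",
    "Mar 25 04:45:29 Tower shfs: duplicate object: /mnt/disk4/Backups/FrankenPC-III/genie3/_Genie Timeline/db_sys/Index.tdb.10",
    "Mar 25 04:47:06 Tower shfs: shfs_write: write: (28) No space left on device",
]
duplicates, no_space = summarize(sample)
print(dict(duplicates), no_space)
```

To run it against a saved copy of the syslog, pass an open file object to `summarize` instead of the sample list. Lines such as "last message repeated 2 times" are not expanded, so the out-of-space count is a lower bound.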

Link to comment

Most of the shares have the min setting blank. Default? I notice a couple not involved are set to 0. Not involved meaning, the two shares with min set to zero don't have any files on the cache drive. I'm not sure how that happened; zero seems highly stoopid. The cache's Min free space setting of 20GB is more my MO.

 

If it's significant, this system has been updated a few times. I think the first version was 4.1? It would have been whatever was current 14 months ago. The array has only been added to, never cleared.

Link to comment

What do you mean by "other data drives"?

 

The most-free setting (not high-water) comes close to obviating the min free space setting. But when a drive becomes too full to hold the file that is being written to it, the problem that you're experiencing will occur.

 

What are your share settings exactly?

Link to comment

What do you mean by "other data drives"?

 

Do you mean regarding free space? Those are the other four drives that aren't parity or cache. The fifth (#4 actually) has 0 free space. Am I misunderstanding the question?

 

The most-free setting (not high-water) comes close to obviating the min free space setting. But when a drive becomes too full to hold the file that is being written to it, the problem that you're experiencing will occur.

 

Thanks. I've been running on false assumptions re the interface.

 

What are your share settings exactly?

 

All user shares are set for high-water, and now 100000000 min free space. No special split levels or drive includes/excludes.

 

I'm currently running a non-correcting parity check. Since my last post, I'd upped all min free settings and restarted from the web interface. Once it came back up I removed some large files from the cache and drive 4, then started mover from the Shares tab. Activity looked the same as before, with drive 4 filling completely. Then the web interface became unresponsive. I forgot that a console shutdown -r was bad. When the system came back up it showed 4 parity errors at boot:

 

Mar 25 18:30:58 Tower kernel: md: recovery thread checking parity...
...devices...
Mar 25 18:30:59 Tower kernel: md: parity incorrect: 35384
Mar 25 18:30:59 Tower kernel: md: parity incorrect: 35392
Mar 25 18:30:59 Tower kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Mar 25 18:30:59 Tower kernel: tg3: eth0: Flow control is on for TX and on for RX.
Mar 25 18:30:59 Tower kernel: md: parity incorrect: 36256
Mar 25 18:30:59 Tower kernel: md: parity incorrect: 36264

 

Those looked real to me but since they're the first I've ever seen on unraid I wanted to make sure, so NOCORRECT. It's 28% complete. No sync errors yet.

Link to comment

I was referring to this:

I stopped the backup process so no files should be open. All my shares are set to use high-water allocation. Free space on the other data drives is: 269G, 973G, 1161G, 263G.

I was not sure how many data drives there were.

 

The parity check errors are a new problem for your array, caused by the unclean shutdown, which unRAID does not handle gracefully. unRAID informs you of parity errors, but it is difficult to determine on which drive the errors lie. Additional parity information is planned for versions after 5.0 that will help with this type of error. unRAID currently handles drive failures, upgrades, and additions very well. If your data is mostly audio or video content then 4 errors are nothing to worry about; you'll probably never notice them. Use a UPS and install the powerdown add-on via unmenu. (If you don't yet have unmenu then install it after all of these problems are resolved.)

 

Run parity checks until they complete without error at least twice. Then report back and we can work on the cache transfer problem.

 

Link to comment

61% now, knocking on wood.

 

Since the parity errors showed immediately during boot I guessed they were output from a dirty-bit workspace, and maybe only potential parity errors. But, the code is a mystery. I'll let this check complete and run another before moving forward.

Link to comment

OK, what is the split level setting?

 

Peter

 

Wow. I may have been thinking about split levels upside-down.

 

Backups share is set to 1. That could do it.

Software share is set to 0 but no files involved.

Top share is set to 2 but no files involved.

Other shares are blank.

 

I've switched Backups to use most-free allocation, but am waiting on the 2nd NOCORRECT check to complete before allowing any file activity. Currently 68% without error.

Link to comment

Okay, so two NOCORRECT passes with 0 errors. Am I on the right track assuming they were dirty bits and not actual parity errors?

 

Mar 26 07:00:45 Tower kernel: mdcmd (4661): check NOCORRECT

Mar 26 07:00:45 Tower kernel:

Mar 26 07:00:45 Tower kernel: md: recovery thread woken up ...

Mar 26 07:00:45 Tower kernel: md: recovery thread checking parity...

Mar 26 07:00:45 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.

Mar 26 08:30:19 Tower emhttp: shcmd (56): /usr/sbin/hdparm -y /dev/sda >/dev/null

Mar 26 14:25:10 Tower kernel: md: sync done. time=26297sec rate=74286K/sec

Mar 26 14:25:10 Tower kernel: md: recovery thread sync completion status: 0
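
As a sanity check on the log above, the reported rate can be reproduced from the block count and elapsed time. This is a small worked calculation, not part of the original post. It assumes one md block is 1 KiB, which is what makes blocks per second line up with the "K/sec" figure.

```python
# Figures copied from the syslog lines above.
total_blocks = 1953514552   # "over a total of 1953514552 blocks"
elapsed_sec = 26297         # "time=26297sec"

# Assumption: one md block is 1 KiB, so blocks/sec is comparable to "K/sec".
rate_k_per_sec = total_blocks // elapsed_sec
size_tb = total_blocks * 1024 / 1e12

# Break the elapsed time into hours, minutes, and seconds.
hours, rem = divmod(elapsed_sec, 3600)
minutes, seconds = divmod(rem, 60)

print(rate_k_per_sec)             # 74286, matching "rate=74286K/sec"
print(round(size_tb, 2))          # 2.0 (decimal terabytes checked)
print(hours, minutes, seconds)    # 7 18 17
```

Under that assumption the check covered about 2 TB in a little over seven hours, and the rate in the log is simply the block count divided by the elapsed seconds.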

 

All split levels set to zero.

 

"Is it safe?"

Link to comment

Split level 0 likely is your problem. Try reading this.

 

http://lime-technology.com/wiki/index.php?title=Un-Official_UnRAID_Manual#Split_level

 

Peter

 

Split level 0 is a special case. Split level 0 requires you to create the desired top level or parent folder structure. unRAID will unconditionally create an object on the disk that contains the parent folders. unRAID will choose which disk to use according to the allocation method if the parent folders exist on multiple disks.

 

Exception for unRAID versions 4.4.x and below: If you set the Split level to 0, then all directories/files created under that share will be on the same disk where the share was originally created. In other words, use level 0 to not allow the share to split.

 

I have the top level directories on all drives but things below vary. The description above doesn't really make it clear if "the parent" or "all parents" have to exist. So use blank or something huge?

Link to comment

For split level 0, if the parent exists on a particular disk, then that disk will be considered for writing based on the allocation method. I'm not sure what blank means. A very high split level will allocate to all included drives based on the allocation method.

Link to comment

The parent is considered to be all the directory levels that exist above the file being written.

 

Zero also means the ability to split becomes self-limiting. For example, say you have a "Backups" share. You write the backup for your "Workstation #3" to the Backups share by creating a "Workstation #3" subdirectory in the Backups share. When this directory is created it can go to any disk which contains Backups. However, after it is created, anything new written to the Workstation #3 backup must remain on the disk where it was first created. The root directory "\Backups\Workstation #3" now exists on one disk, so any new data goes to that disk.

 

Blank is the same thing as a high number, allowing files to be placed according to the allocation method only.

 

Peter
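
The self-limiting behavior described above can be illustrated with a toy model. This is only a sketch of the rule as explained in this thread, not unRAID's actual code. The function names, the directory layout, the mapping of free-space figures to disk numbers, and the file name `new.vdi` are all hypothetical, and most-free stands in for the allocation method.

```python
from pathlib import PurePosixPath


def candidate_disks(disks, file_path, split_level):
    """Return the disks allowed to receive file_path in this toy model.

    Only two settings are modeled: None (blank, no restriction) and 0.
    """
    if split_level is None:
        return sorted(disks)
    # Split level 0: walk up from the file's parent directory and use the
    # disks that already hold the deepest existing ancestor directory.
    for ancestor in PurePosixPath(file_path).parents:
        holders = sorted(
            name for name, disk in disks.items() if str(ancestor) in disk["dirs"]
        )
        if holders:
            return holders
    return []


def pick_disk(disks, file_path, split_level):
    """Apply most-free allocation among the candidate disks."""
    candidates = candidate_disks(disks, file_path, split_level)
    return max(candidates, key=lambda name: disks[name]["free"])


GB = 10**9
# Hypothetical layout: every disk has the share's top-level directory,
# but only disk4 holds the existing backup subdirectory.
disks = {
    "disk1": {"free": 269 * GB, "dirs": {"Backups"}},
    "disk2": {"free": 973 * GB, "dirs": {"Backups"}},
    "disk3": {"free": 1161 * GB, "dirs": {"Backups"}},
    "disk4": {"free": 0, "dirs": {"Backups", "Backups/FrankenPC-III"}},
    "disk5": {"free": 263 * GB, "dirs": {"Backups"}},
}
new_file = "Backups/FrankenPC-III/genie3/new.vdi"
pinned = pick_disk(disks, new_file, split_level=0)
unrestricted = pick_disk(disks, new_file, split_level=None)
print(pinned, unrestricted)
```

In this model, split level 0 sends the new file to the full disk4 because that is the only disk holding an existing parent directory, while a blank split level lets the allocation method choose disk3, the disk with the most free space.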

Link to comment

Thanks everyone, and thanks Peter for the clarifications.

 

I know it's poor form to criticize from the idiot's chair, but I could almost see split levels requiring more effort to enable. Good to know blank means disabled.

 

Minor problem in the scheme of things. It's a testament to unRAID's robustness that I've had to do so little to keep it going that I don't know what I'm doing. :)

 

Seriously though, some reasonable hardware, a good UPS, and I've only had to feed drives to the thing. Excellent product.

Link to comment

Yes, you need to have a fairly defined directory structure to be able to use split level. If the share is a real mix of directory levels and file placements then too low a split level can cause a share not to use the available disks. If you plan the directory levels then you can pick a matching split level and everything will be good.

 

Peter

Link to comment

Archived

This topic is now archived and is closed to further replies.
