Server no longer starting due to BTRFS read only issue

Helmonder · February 11, 2017

After doing some changes to my BTRFS volumes I experienced a BTRFS error. The volume got marked read-only. Now my unraid system refuses to start.

This is what I did and what happened:

http://lime-technology.com/forum/index.php?topic=56470.msg538907#msg538907

At the moment only disk10, disk11, disk3, disk6, disk7 and disk9 get mounted.

Disk8 does not get mounted and an error appears in the log when it tries to get mounted:

Feb 11 13:13:16 Tower emhttp: shcmd (47): mkdir -p /mnt/disk8
Feb 11 13:13:16 Tower emhttp: shcmd (48): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md8 /mnt/disk8 |& logger
Feb 11 13:13:16 Tower kernel: BTRFS info (device md8): disk space caching is enabled
Feb 11 13:13:16 Tower kernel: BTRFS info (device md8): has skinny extents
Feb 11 13:13:16 Tower vsftpd[13363]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:22 Tower vsftpd[13396]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:29 Tower vsftpd[13427]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:37 Tower vsftpd[13461]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:47 Tower vsftpd[13499]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:53 Tower in.telnetd[13525]: connect from 192.168.1.36 (192.168.1.36)
Feb 11 13:13:57 Tower vsftpd[13543]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:00 Tower login[13526]: ROOT LOGIN  on '/dev/pts/0' from '192.168.1.36'
Feb 11 13:14:07 Tower vsftpd[13596]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:17 Tower vsftpd[13638]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:27 Tower vsftpd[13688]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:37 Tower vsftpd[13734]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:47 Tower vsftpd[13777]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:57 Tower vsftpd[13821]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:07 Tower vsftpd[13865]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:17 Tower vsftpd[13909]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:27 Tower vsftpd[13952]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:28 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000035c
Feb 11 13:15:28 Tower kernel: IP: [<ffffffff812dd4af>] flush_space+0x44/0x472
Feb 11 13:15:28 Tower kernel: PGD 7cd245067
Feb 11 13:15:28 Tower kernel: PUD 7ce2be067
Feb 11 13:15:28 Tower kernel: PMD 0
Feb 11 13:15:28 Tower kernel:
Feb 11 13:15:28 Tower kernel: Oops: 0000 [#1] PREEMPT SMP
Feb 11 13:15:28 Tower kernel: Modules linked in: md_mod nct6775 hwmon_vid bonding e1000e ptp pps_core x86_pkg_temp_thermal coretemp i2c_i801 i2c_smbus mpt3sas kvm_intel ahci raid_class i2c_core libahci scsi_transport_sas kvm ipmi_si video backlight [last unloaded: pps_core]
Feb 11 13:15:28 Tower kernel: CPU: 2 PID: 13342 Comm: mount Not tainted 4.9.8-unRAID #1
Feb 11 13:15:28 Tower kernel: Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0b 09/17/2012
Feb 11 13:15:28 Tower kernel: task: ffff88080a7c5940 task.stack: ffffc9000b650000
Feb 11 13:15:28 Tower kernel: RIP: 0010:[<ffffffff812dd4af>]  [<ffffffff812dd4af>] flush_space+0x44/0x472
Feb 11 13:15:28 Tower kernel: RSP: 0018:ffffc9000b6537d8  EFLAGS: 00010246
Feb 11 13:15:28 Tower kernel: RAX: 0000000000020000 RBX: 0000000000000000 RCX: 0000000000020000
Feb 11 13:15:28 Tower kernel: RDX: 0000000000020000 RSI: ffff880807fb8400 RDI: 0000000000000000
Feb 11 13:15:28 Tower kernel: RBP: ffffc9000b653870 R08: 0000000000000001 R09: 0000000000000000
Feb 11 13:15:28 Tower kernel: R10: ffff88080a2eb418 R11: 0000000000000000 R12: 00000000ffffffff
Feb 11 13:15:28 Tower kernel: R13: ffff880807fb8400 R14: 0000000000000002 R15: ffff880807fb8400
Feb 11 13:15:28 Tower kernel: FS:  00002b170e849e40(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
Feb 11 13:15:28 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 11 13:15:28 Tower kernel: CR2: 000000000000035c CR3: 00000007cd18e000 CR4: 00000000000406e0
Feb 11 13:15:28 Tower kernel: Stack:
Feb 11 13:15:28 Tower kernel: 0000000000000000 0000000000000000 ffffc9000b653810 ffffffff812d4017
Feb 11 13:15:28 Tower kernel: 00000000ffffffe4 ffff880807fb8400 ffff8807ee8fa000 0000000000020000
Feb 11 13:15:28 Tower kernel: ffffffff812d7ff4 ffffc9000b653870 ffffffff812d8070 ffff8807ee8fa148
Feb 11 13:15:28 Tower kernel: Call Trace:
Feb 11 13:15:28 Tower kernel: [<ffffffff812d4017>] ? get_alloc_profile+0xd0/0x166
Feb 11 13:15:28 Tower kernel: [<ffffffff812d7ff4>] ? btrfs_get_alloc_profile+0x2b/0x2d
Feb 11 13:15:28 Tower kernel: [<ffffffff812d8070>] ? can_overcommit+0x7a/0x100
Feb 11 13:15:28 Tower kernel: [<ffffffff812de190>] reserve_metadata_bytes+0x569/0x651
Feb 11 13:15:28 Tower kernel: [<ffffffff813a807d>] ? __radix_tree_lookup+0x2b/0x86
Feb 11 13:15:28 Tower kernel: [<ffffffff812de87f>] btrfs_block_rsv_refill+0x6b/0x91
Feb 11 13:15:28 Tower kernel: [<ffffffff812f9c09>] btrfs_evict_inode+0x305/0x491
Feb 11 13:15:28 Tower kernel: [<ffffffff81136cc5>] evict+0xb8/0x16d
Feb 11 13:15:28 Tower kernel: [<ffffffff811373b6>] iput+0x163/0x170
Feb 11 13:15:28 Tower kernel: [<ffffffff812fa827>] btrfs_orphan_cleanup+0x326/0x394
Feb 11 13:15:28 Tower kernel: [<ffffffff813391ae>] btrfs_recover_relocation+0x3b6/0x3cc
Feb 11 13:15:28 Tower kernel: [<ffffffff812e8d6b>] ? btrfs_cleanup_fs_roots+0x12e/0x140
Feb 11 13:15:28 Tower kernel: [<ffffffff812eccc0>] open_ctree+0x1e1b/0x208e
Feb 11 13:15:28 Tower kernel: [<ffffffff812c82ef>] btrfs_mount+0xb37/0xd1e
Feb 11 13:15:28 Tower kernel: [<ffffffff810e2176>] ? pcpu_alloc+0x3d5/0x4c1
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff8113a78a>] ? alloc_vfsmnt+0x189/0x215
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] ? mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff8113a87b>] vfs_kern_mount+0x65/0xf7
Feb 11 13:15:28 Tower kernel: [<ffffffff812c7ae3>] btrfs_mount+0x32b/0xd1e
Feb 11 13:15:28 Tower kernel: [<ffffffff813b4903>] ? find_next_zero_bit+0x17/0x1d
Feb 11 13:15:28 Tower kernel: [<ffffffff810e2176>] ? pcpu_alloc+0x3d5/0x4c1
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] ? mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff8113a87b>] vfs_kern_mount+0x65/0xf7
Feb 11 13:15:28 Tower kernel: [<ffffffff8113d196>] do_mount+0x744/0xa23
Feb 11 13:15:28 Tower kernel: [<ffffffff810ddec4>] ? strndup_user+0x3a/0x6f
Feb 11 13:15:28 Tower kernel: [<ffffffff8113d66b>] SyS_mount+0x72/0x9a
Feb 11 13:15:28 Tower kernel: [<ffffffff8167d1b7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
Feb 11 13:15:28 Tower kernel: Code: ec 70 41 83 f9 05 0f 87 3b 04 00 00 48 89 4d a0 48 89 d0 49 89 f7 48 89 fb 42 ff 24 cd 40 be 83 81 41 83 cc ff 41 83 f8 01 75 18 <8b> 8f 5c 03 00 00 31 d2 c1 e1 04 48 f7 f1 85 c0 41 0f 44 c0 44
Feb 11 13:15:28 Tower kernel: RIP  [<ffffffff812dd4af>] flush_space+0x44/0x472
Feb 11 13:15:28 Tower kernel: RSP <ffffc9000b6537d8>
Feb 11 13:15:28 Tower kernel: CR2: 000000000000035c
Feb 11 13:15:28 Tower kernel: ---[ end trace 279c8d91daf3797c ]---
Feb 11 13:15:28 Tower emhttp: err: shcmd: shcmd (48): exit status: -119
Feb 11 13:15:28 Tower emhttp: mount error: No file system (-119)
Feb 11 13:15:28 Tower emhttp: shcmd (49): umount /mnt/disk8 |& logger
Feb 11 13:15:28 Tower root: umount: /mnt/disk8: not mounted
Feb 11 13:15:28 Tower emhttp: shcmd (50): rmdir /mnt/disk8
Feb 11 13:15:28 Tower emhttp: shcmd (51): mkdir -p /mnt/disk9
Feb 11 13:15:28 Tower emhttp: shcmd (52): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md9 /mnt/disk9 |& logger
Feb 11 13:15:28 Tower kernel: BTRFS info (device md9): disk space caching is enabled
Feb 11 13:15:28 Tower kernel: BTRFS info (device md9): has skinny extents
Feb 11 13:15:28 Tower kernel: BTRFS info (device md9): bdev /dev/md9 errs: wr 0, rd 0, flush 0, corrupt 3456, gen 0
Feb 11 13:15:41 Tower emhttp: shcmd (53): btrfs filesystem resize max /mnt/disk9 |& logger
Feb 11 13:15:41 Tower root: Resize '/mnt/disk9' of 'max'
Feb 11 13:15:41 Tower kernel: BTRFS info (device md9): new size for /dev/md9 is 6001175072768
Feb 11 13:15:41 Tower emhttp: shcmd (54): mkdir -p /mnt/disk10
Feb 11 13:15:41 Tower emhttp: shcmd (55): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md10 /mnt/disk10 |& logger
Feb 11 13:15:41 Tower kernel: BTRFS info (device md10): disk space caching is enabled
Feb 11 13:15:41 Tower kernel: BTRFS info (device md10): has skinny extents
Feb 11 13:15:50 Tower emhttp: shcmd (56): btrfs filesystem resize max /mnt/disk10 |& logger
Feb 11 13:15:50 Tower root: Resize '/mnt/disk10' of 'max'
Feb 11 13:15:50 Tower kernel: BTRFS info (device md10): new size for /dev/md10 is 6001175072768
Feb 11 13:15:50 Tower emhttp: shcmd (57): mkdir -p /mnt/disk11
Feb 11 13:15:50 Tower emhttp: shcmd (58): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md11 /mnt/disk11 |& logger
Feb 11 13:15:50 Tower kernel: BTRFS info (device md11): disk space caching is enabled
Feb 11 13:15:50 Tower kernel: BTRFS info (device md11): has skinny extents
Feb 11 13:16:02 Tower emhttp: shcmd (59): btrfs filesystem resize max /mnt/disk11 |& logger
Feb 11 13:16:02 Tower root: Resize '/mnt/disk11' of 'max'
Feb 11 13:16:02 Tower kernel: BTRFS info (device md11): new size for /dev/md11 is 8001563168768
Feb 11 13:16:02 Tower emhttp: shcmd (60): mkdir -p /mnt/cache
Feb 11 13:16:02 Tower emhttp: mount error: No file system (no btrfs UUID)
Feb 11 13:16:02 Tower emhttp: shcmd (61): umount /mnt/cache |& logger
Feb 11 13:16:02 Tower root: umount: /mnt/cache: not mounted
Feb 11 13:16:02 Tower emhttp: shcmd (62): rmdir /mnt/cache
Feb 11 13:16:02 Tower emhttp: shcmd (63): sync
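In a log this long it is easy to miss that disk9 also reports `corrupt 3456` while mounting. The following is a minimal sketch for pulling such counters out of a syslog; the function name `btrfs_corrupt_devices` is hypothetical, and it only understands the `bdev ... errs:` line format shown above.

```shell
#!/bin/sh
# btrfs_corrupt_devices
# Reads syslog lines on stdin and prints "device corrupt-count" for every
# BTRFS "errs:" line whose corrupt counter is nonzero.
btrfs_corrupt_devices() {
    awk '
        /BTRFS/ && / errs: / {
            dev = ""; count = ""
            for (i = 1; i < NF; i++) {
                if ($i == "bdev")    dev = $(i + 1)
                if ($i == "corrupt") count = $(i + 1)
            }
            sub(/,$/, "", count)   # drop the trailing comma after the number
            if (dev != "" && count + 0 > 0) print dev, count
        }'
}

# Usage (assumed log location):
#   btrfs_corrupt_devices < /var/log/syslog
```

On a mounted filesystem, `btrfs device stats /mnt/disk9` should report the same counters directly.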

Kind of worried now.. The server has been at this stage for more than 10 minutes and it does not appear to get any further..

I can telnet into the system but the webgui does not load and there are no shares..

On advice I am running a memtest, it has been running for 15 minutes now and no errors. I will keep it running for longer but would appreciate help in additional steps to take.

The only thing I can think of is pulling the physical drive, then rebooting the system, unraid will hopefully emulate the drive, then add the physical drive back in and copy the data from the emulated drive to the "new" drive.. I am expecting that rebuilding it on itself will not work as it will most likely just recreate the issue I am having now..

Hoping for a better idea by someone... I need some kind of BTRFS /read-only remove command...

JorgeB · February 11, 2017

If array auto-start is enabled, disable it, then try to mount disk8 read-only, e.g.:

mkdir /x
mount -o recovery,ro /dev/sdX1 /x

If successful copy everything to another disk/server and then format that disk.

After getting the data (or if you have backups) you can try repairing the filesystem with btrfs check --repair

Helmonder · February 11, 2017

I started the system again, how do I disable the auto-start ?

I have tried mounting using command:

 mount -o recovery,ro /dev/md8 /x

System appears to hang now. Did I give that command correctly ?

JorgeB · February 11, 2017

Disable autostart by editing disk.cfg on your flash drive (flash/config) and changing startArray="yes" to "no".

To mount the disk, first create a temp mountpoint:

mkdir /x

Then, since the array won't be started, you can't use the md device; use sdX instead.

mount -o recovery,ro /dev/sdX1 /x

If that fails you can try btrfs recovery.
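The steps above can be sketched as a small shell helper. This is a minimal sketch, not Unraid's own tooling: the function name `disable_autostart` is hypothetical, and the path `/boot/config/disk.cfg` in the usage comment assumes the flash drive is mounted at `/boot`. `sdX1` stays a placeholder for the partition of the affected disk.

```shell
#!/bin/sh
# disable_autostart CFG
# Rewrites startArray="..." to startArray="no" in the given disk.cfg,
# keeping a .bak copy of the original next to it.
disable_autostart() {
    cfg=$1
    [ -f "$cfg" ] || { echo "no such file: $cfg" >&2; return 1; }
    cp "$cfg" "$cfg.bak" || return 1
    # Only the startArray key is rewritten; other settings pass through.
    sed 's/^startArray="[^"]*"/startArray="no"/' "$cfg.bak" > "$cfg"
}

# Usage (assumed path; run as root on the server):
#   disable_autostart /boot/config/disk.cfg
#   mkdir -p /x
#   mount -o recovery,ro /dev/sdX1 /x
```

On newer kernels the `recovery` mount option has been superseded by `usebackuproot`, so the mount line may need adjusting there.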

Helmonder · February 11, 2017

Thanks. I found the config and changed the autostart. Shutdown from the command line still does not work, so I will have to force a hard reboot again.

Rebooting now.

Helmonder · February 11, 2017

How do I know what sdX1 to use ?

Helmonder · February 11, 2017

Just realised the webpage would be up :-) I found the drive name and can now mount /x
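If the web GUI is not reachable, the device name can usually also be worked out from the console. The sketch below assumes the standard udev directory `/dev/disk/by-id`, whose symlink names contain the drive model and serial number; the function name `list_disk_ids` is hypothetical.

```shell
#!/bin/sh
# list_disk_ids [DIR]
# Prints "device<TAB>id" for every symlink in DIR (default /dev/disk/by-id),
# so a model/serial string can be matched to its sdX name.
list_disk_ids() {
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue            # skip anything that is not a symlink
        target=$(readlink -f "$link") || continue
        printf '%s\t%s\n' "$target" "${link##*/}"
    done
}

# Usage: show only first partitions, then look for the serial of disk8
#   list_disk_ids | grep -- '-part1'
```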

Helmonder · February 11, 2017

Now I have /x available with disk8.. but where do I copy it to ?

Since the array is not up I cannot access anything under /mnt ..

JorgeB · February 11, 2017

Give me a few minutes, I'm having lunch.

Helmonder · February 11, 2017

Absolutely... I thoroughly appreciate the help.. I am out of the house for a bit.

Thinking of just setting disk8 to disabled and then starting the array.... then I could copy using mc ..

JorgeB · February 11, 2017

OK, try this:

In the same disk.cfg you edited before, change diskFsType.8="auto" or "btrfs" to "xfs". This should allow you to start the array without crashing. Then use mc or any other utility to copy from /x to /mnt/diskX or /mnt/user/sharename.
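As a rough illustration of that config change, the sketch below rewrites the filesystem type for one disk slot. The function name `set_disk_fstype` is hypothetical, and the `/boot/config/disk.cfg` path in the usage comment is an assumption about where the flash drive is mounted; `/mnt/diskX` is the same placeholder used in the post.

```shell
#!/bin/sh
# set_disk_fstype CFG SLOT TYPE
# Sets diskFsType.SLOT="TYPE" in the given disk.cfg, keeping a .bak copy.
set_disk_fstype() {
    cfg=$1 slot=$2 fstype=$3
    [ -f "$cfg" ] || { echo "no such file: $cfg" >&2; return 1; }
    cp "$cfg" "$cfg.bak" || return 1
    # The dot in the key is escaped so slot 8 does not also match slot 18.
    sed "s/^diskFsType\.$slot=\"[^\"]*\"/diskFsType.$slot=\"$fstype\"/" "$cfg.bak" > "$cfg"
}

# Usage (assumed paths):
#   set_disk_fstype /boot/config/disk.cfg 8 xfs
#   # after starting the array, copy from the read-only mount:
#   cp -a /x/. /mnt/diskX/
```

The point of the change is only to stop the array start from attempting the btrfs mount that crashes; nothing is formatted until a format is explicitly requested.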

Helmonder · February 11, 2017

It said btrfs, I have changed it to xfs

I can start the array. Both disk8 and one of my cache drives appear to be unmountable.

I am going to focus on disk8 first.

JorgeB · February 11, 2017

Disk8 is expected to be unmountable. Strange about the cache, but yes, deal with disk8 first, then post your diags.

Helmonder · February 11, 2017

Copies are running..

I do notice that the transfers are really fast.. I have like 80MB/s sustained... And there is no cache drive..

I checked in console though and files are getting written. I'll let the copies finish and then look further.

I have already attached diagnostics.

tower-diagnostics-20170211-1648.zip

JorgeB · February 11, 2017

All cache disks are being detected as new, which is strange if the pool was working correctly. Maybe disk8 being unmountable is causing some confusion, so let's wait until disk8 is mountable to see if the issue persists.

Completely unrelated but before I forget, this is not good for parity checks with LSI controllers:

Feb 11 16:38:29 Tower kernel: mdcmd (31): set md_num_stripes 4264
Feb 11 16:38:29 Tower kernel: mdcmd (32): set md_sync_window 1920
Feb 11 16:38:29 Tower kernel: mdcmd (33): set md_sync_thresh 192

md_sync_thresh needs to be much higher, close to md_sync_window. Change it to 1872 and parity check speed should improve considerably.
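To see how far apart the two values are on a given system, a sketch like the following can read them back from the syslog. The function name `sync_tunables` is hypothetical, and it assumes the `mdcmd ... set` line format quoted above.

```shell
#!/bin/sh
# sync_tunables
# Reads syslog lines on stdin and prints the last md_sync_window and
# md_sync_thresh values seen, plus the gap between them.
sync_tunables() {
    awk '
        / set md_sync_window / { window = $NF }
        / set md_sync_thresh / { thresh = $NF }
        END {
            if (window == "" || thresh == "") { print "tunables not found"; exit 1 }
            printf "window=%d thresh=%d gap=%d\n", window, thresh, window - thresh
        }'
}

# Usage (assumed log location):
#   sync_tunables < /var/log/syslog
```

With the values from the log above this reports a gap of 1728; with the suggested 1872 the gap shrinks to 48.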

Helmonder · February 11, 2017

As soon as all the data is copied over I will format disk8 as xfs (getting a bit nervous about btrfs ;-).

Disk8 is a full 3 TB so this will take some time.

JorgeB · February 11, 2017

Can't blame you, I like btrfs and some of its features, but it can be complicated when there's trouble.

It's possible that btrfs check --repair would fix your problem, but it's only recommended as a last resort because it could also make things worse. Copying the data off first is much more work, but it is safer since it's non-destructive.

Helmonder · February 11, 2017

Yup.. I have been reading up since this morning and I got the same impression.. and I do have backups, but still..

Helmonder · February 11, 2017

Copies are running..

I see three lines appearing on my console display, I do not see them in the syslog:

ERROR: system chunk array too small 34 < 97

ERROR: superblock checksum matches but it has invalid members

ERROR: cannot scan /dev/sdf1: Input/output error

sdf is my primary parity drive ..

Log is flooded with the following line:

Feb 11 21:24:24 Tower shfs/user: err: shfs_mkdir: assign_disk: system (123) No medium found
Feb 11 21:24:28 Tower shfs/user: err: shfs_mkdir: assign_disk: system (123) No medium found

JorgeB · February 11, 2017

Parity has no filesystem, but when there's an odd number of data disks with the same filesystem it will appear to have one. Those errors are harmless, and the "No medium found" errors look harmless too.

Helmonder · February 12, 2017

Alrighty then... disk8 has been fully copied to other parts of the array (I did not have enough space available on one individual drive so I had to copy the data in parts).

Since disk8 was mounted read-only on /x I suspect that formatting it now will not work, so I am rebooting. When the array is back up my plan is to format disk8.

Next thing is... How do I get my cache drive back, which seems to have mysteriously died on me in an unrelated event..

Helmonder · February 12, 2017

System is back up and disk8 is now getting formatted with XFS..

And.... My cache drive pool is back !! After reboot everything appears to function.. The whole btrfs thing must have messed something up in the btrfs logic..

Helmonder · February 12, 2017

Just changed md_sync_thresh as you suggested, thanks !

JorgeB · February 12, 2017

Good to hear the cache pool is back. That was my hope, as there was no other reason I could see for the problem.

Helmonder · February 12, 2017

Johny.. Seriously.. You have been a tremendous help in this whole ordeal.. Thanks thanks thanks ! Can I send you a bottle of Johny black ?
