Server no longer starting due to BTRFS read-only issue


After doing some changes to my BTRFS volumes I experienced a BTRFS error. The volume got marked read-only. Now my unraid system refuses to start.

 

This is what I did and what happened:

 

http://lime-technology.com/forum/index.php?topic=56470.msg538907#msg538907

 

At the moment only disk10, disk11, disk3, disk6, disk7 and disk9 get mounted.

 

Disk8 does not get mounted and an error appears in the log when it tries to get mounted:

 

Feb 11 13:13:16 Tower emhttp: shcmd (47): mkdir -p /mnt/disk8
Feb 11 13:13:16 Tower emhttp: shcmd (48): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md8 /mnt/disk8 |& logger
Feb 11 13:13:16 Tower kernel: BTRFS info (device md8): disk space caching is enabled
Feb 11 13:13:16 Tower kernel: BTRFS info (device md8): has skinny extents
Feb 11 13:13:16 Tower vsftpd[13363]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:22 Tower vsftpd[13396]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:29 Tower vsftpd[13427]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:37 Tower vsftpd[13461]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:47 Tower vsftpd[13499]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:13:53 Tower in.telnetd[13525]: connect from 192.168.1.36 (192.168.1.36)
Feb 11 13:13:57 Tower vsftpd[13543]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:00 Tower login[13526]: ROOT LOGIN  on '/dev/pts/0' from '192.168.1.36'
Feb 11 13:14:07 Tower vsftpd[13596]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:17 Tower vsftpd[13638]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:27 Tower vsftpd[13688]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:37 Tower vsftpd[13734]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:47 Tower vsftpd[13777]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:14:57 Tower vsftpd[13821]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:07 Tower vsftpd[13865]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:17 Tower vsftpd[13909]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:27 Tower vsftpd[13952]: connect from 127.0.0.1 (127.0.0.1)
Feb 11 13:15:28 Tower kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000035c
Feb 11 13:15:28 Tower kernel: IP: [<ffffffff812dd4af>] flush_space+0x44/0x472
Feb 11 13:15:28 Tower kernel: PGD 7cd245067
Feb 11 13:15:28 Tower kernel: PUD 7ce2be067
Feb 11 13:15:28 Tower kernel: PMD 0
Feb 11 13:15:28 Tower kernel:
Feb 11 13:15:28 Tower kernel: Oops: 0000 [#1] PREEMPT SMP
Feb 11 13:15:28 Tower kernel: Modules linked in: md_mod nct6775 hwmon_vid bonding e1000e ptp pps_core x86_pkg_temp_thermal coretemp i2c_i801 i2c_smbus mpt3sas kvm_intel ahci raid_class i2c_core libahci scsi_transport_sas kvm ipmi_si video backlight [last unloaded: pps_core]
Feb 11 13:15:28 Tower kernel: CPU: 2 PID: 13342 Comm: mount Not tainted 4.9.8-unRAID #1
Feb 11 13:15:28 Tower kernel: Hardware name: Supermicro X9SCL/X9SCM/X9SCL/X9SCM, BIOS 2.0b 09/17/2012
Feb 11 13:15:28 Tower kernel: task: ffff88080a7c5940 task.stack: ffffc9000b650000
Feb 11 13:15:28 Tower kernel: RIP: 0010:[<ffffffff812dd4af>]  [<ffffffff812dd4af>] flush_space+0x44/0x472
Feb 11 13:15:28 Tower kernel: RSP: 0018:ffffc9000b6537d8  EFLAGS: 00010246
Feb 11 13:15:28 Tower kernel: RAX: 0000000000020000 RBX: 0000000000000000 RCX: 0000000000020000
Feb 11 13:15:28 Tower kernel: RDX: 0000000000020000 RSI: ffff880807fb8400 RDI: 0000000000000000
Feb 11 13:15:28 Tower kernel: RBP: ffffc9000b653870 R08: 0000000000000001 R09: 0000000000000000
Feb 11 13:15:28 Tower kernel: R10: ffff88080a2eb418 R11: 0000000000000000 R12: 00000000ffffffff
Feb 11 13:15:28 Tower kernel: R13: ffff880807fb8400 R14: 0000000000000002 R15: ffff880807fb8400
Feb 11 13:15:28 Tower kernel: FS:  00002b170e849e40(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
Feb 11 13:15:28 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 11 13:15:28 Tower kernel: CR2: 000000000000035c CR3: 00000007cd18e000 CR4: 00000000000406e0
Feb 11 13:15:28 Tower kernel: Stack:
Feb 11 13:15:28 Tower kernel: 0000000000000000 0000000000000000 ffffc9000b653810 ffffffff812d4017
Feb 11 13:15:28 Tower kernel: 00000000ffffffe4 ffff880807fb8400 ffff8807ee8fa000 0000000000020000
Feb 11 13:15:28 Tower kernel: ffffffff812d7ff4 ffffc9000b653870 ffffffff812d8070 ffff8807ee8fa148
Feb 11 13:15:28 Tower kernel: Call Trace:
Feb 11 13:15:28 Tower kernel: [<ffffffff812d4017>] ? get_alloc_profile+0xd0/0x166
Feb 11 13:15:28 Tower kernel: [<ffffffff812d7ff4>] ? btrfs_get_alloc_profile+0x2b/0x2d
Feb 11 13:15:28 Tower kernel: [<ffffffff812d8070>] ? can_overcommit+0x7a/0x100
Feb 11 13:15:28 Tower kernel: [<ffffffff812de190>] reserve_metadata_bytes+0x569/0x651
Feb 11 13:15:28 Tower kernel: [<ffffffff813a807d>] ? __radix_tree_lookup+0x2b/0x86
Feb 11 13:15:28 Tower kernel: [<ffffffff812de87f>] btrfs_block_rsv_refill+0x6b/0x91
Feb 11 13:15:28 Tower kernel: [<ffffffff812f9c09>] btrfs_evict_inode+0x305/0x491
Feb 11 13:15:28 Tower kernel: [<ffffffff81136cc5>] evict+0xb8/0x16d
Feb 11 13:15:28 Tower kernel: [<ffffffff811373b6>] iput+0x163/0x170
Feb 11 13:15:28 Tower kernel: [<ffffffff812fa827>] btrfs_orphan_cleanup+0x326/0x394
Feb 11 13:15:28 Tower kernel: [<ffffffff813391ae>] btrfs_recover_relocation+0x3b6/0x3cc
Feb 11 13:15:28 Tower kernel: [<ffffffff812e8d6b>] ? btrfs_cleanup_fs_roots+0x12e/0x140
Feb 11 13:15:28 Tower kernel: [<ffffffff812eccc0>] open_ctree+0x1e1b/0x208e
Feb 11 13:15:28 Tower kernel: [<ffffffff812c82ef>] btrfs_mount+0xb37/0xd1e
Feb 11 13:15:28 Tower kernel: [<ffffffff810e2176>] ? pcpu_alloc+0x3d5/0x4c1
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff8113a78a>] ? alloc_vfsmnt+0x189/0x215
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] ? mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff8113a87b>] vfs_kern_mount+0x65/0xf7
Feb 11 13:15:28 Tower kernel: [<ffffffff812c7ae3>] btrfs_mount+0x32b/0xd1e
Feb 11 13:15:28 Tower kernel: [<ffffffff813b4903>] ? find_next_zero_bit+0x17/0x1d
Feb 11 13:15:28 Tower kernel: [<ffffffff810e2176>] ? pcpu_alloc+0x3d5/0x4c1
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff811248fd>] ? mount_fs+0xf/0x84
Feb 11 13:15:28 Tower kernel: [<ffffffff8113a87b>] vfs_kern_mount+0x65/0xf7
Feb 11 13:15:28 Tower kernel: [<ffffffff8113d196>] do_mount+0x744/0xa23
Feb 11 13:15:28 Tower kernel: [<ffffffff810ddec4>] ? strndup_user+0x3a/0x6f
Feb 11 13:15:28 Tower kernel: [<ffffffff8113d66b>] SyS_mount+0x72/0x9a
Feb 11 13:15:28 Tower kernel: [<ffffffff8167d1b7>] entry_SYSCALL_64_fastpath+0x1a/0xa9
Feb 11 13:15:28 Tower kernel: Code: ec 70 41 83 f9 05 0f 87 3b 04 00 00 48 89 4d a0 48 89 d0 49 89 f7 48 89 fb 42 ff 24 cd 40 be 83 81 41 83 cc ff 41 83 f8 01 75 18 <8b> 8f 5c 03 00 00 31 d2 c1 e1 04 48 f7 f1 85 c0 41 0f 44 c0 44
Feb 11 13:15:28 Tower kernel: RIP  [<ffffffff812dd4af>] flush_space+0x44/0x472
Feb 11 13:15:28 Tower kernel: RSP <ffffc9000b6537d8>
Feb 11 13:15:28 Tower kernel: CR2: 000000000000035c
Feb 11 13:15:28 Tower kernel: ---[ end trace 279c8d91daf3797c ]---
Feb 11 13:15:28 Tower emhttp: err: shcmd: shcmd (48): exit status: -119
Feb 11 13:15:28 Tower emhttp: mount error: No file system (-119)
Feb 11 13:15:28 Tower emhttp: shcmd (49): umount /mnt/disk8 |& logger
Feb 11 13:15:28 Tower root: umount: /mnt/disk8: not mounted
Feb 11 13:15:28 Tower emhttp: shcmd (50): rmdir /mnt/disk8
Feb 11 13:15:28 Tower emhttp: shcmd (51): mkdir -p /mnt/disk9
Feb 11 13:15:28 Tower emhttp: shcmd (52): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md9 /mnt/disk9 |& logger
Feb 11 13:15:28 Tower kernel: BTRFS info (device md9): disk space caching is enabled
Feb 11 13:15:28 Tower kernel: BTRFS info (device md9): has skinny extents
Feb 11 13:15:28 Tower kernel: BTRFS info (device md9): bdev /dev/md9 errs: wr 0, rd 0, flush 0, corrupt 3456, gen 0
Feb 11 13:15:41 Tower emhttp: shcmd (53): btrfs filesystem resize max /mnt/disk9 |& logger
Feb 11 13:15:41 Tower root: Resize '/mnt/disk9' of 'max'
Feb 11 13:15:41 Tower kernel: BTRFS info (device md9): new size for /dev/md9 is 6001175072768
Feb 11 13:15:41 Tower emhttp: shcmd (54): mkdir -p /mnt/disk10
Feb 11 13:15:41 Tower emhttp: shcmd (55): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md10 /mnt/disk10 |& logger
Feb 11 13:15:41 Tower kernel: BTRFS info (device md10): disk space caching is enabled
Feb 11 13:15:41 Tower kernel: BTRFS info (device md10): has skinny extents
Feb 11 13:15:50 Tower emhttp: shcmd (56): btrfs filesystem resize max /mnt/disk10 |& logger
Feb 11 13:15:50 Tower root: Resize '/mnt/disk10' of 'max'
Feb 11 13:15:50 Tower kernel: BTRFS info (device md10): new size for /dev/md10 is 6001175072768
Feb 11 13:15:50 Tower emhttp: shcmd (57): mkdir -p /mnt/disk11
Feb 11 13:15:50 Tower emhttp: shcmd (58): set -o pipefail ; mount -t btrfs -o noatime,nodiratime /dev/md11 /mnt/disk11 |& logger
Feb 11 13:15:50 Tower kernel: BTRFS info (device md11): disk space caching is enabled
Feb 11 13:15:50 Tower kernel: BTRFS info (device md11): has skinny extents
Feb 11 13:16:02 Tower emhttp: shcmd (59): btrfs filesystem resize max /mnt/disk11 |& logger
Feb 11 13:16:02 Tower root: Resize '/mnt/disk11' of 'max'
Feb 11 13:16:02 Tower kernel: BTRFS info (device md11): new size for /dev/md11 is 8001563168768
Feb 11 13:16:02 Tower emhttp: shcmd (60): mkdir -p /mnt/cache
Feb 11 13:16:02 Tower emhttp: mount error: No file system (no btrfs UUID)
Feb 11 13:16:02 Tower emhttp: shcmd (61): umount /mnt/cache |& logger
Feb 11 13:16:02 Tower root: umount: /mnt/cache: not mounted
Feb 11 13:16:02 Tower emhttp: shcmd (62): rmdir /mnt/cache
Feb 11 13:16:02 Tower emhttp: shcmd (63): sync

 

Kind of worried now. The server has been at this level for more than 10 minutes and it does not appear to get any further.

 

I can telnet into the system but the webgui does not load and there are no shares..

 

On advice I am running a memtest, it has been running for 15 minutes now and no errors. I will keep it running for longer but would appreciate help in additional steps to take.

 

The only thing I can think of is pulling the physical drive, then rebooting the system, unraid will hopefully emulate the drive, then add the physical drive back in and copy the data from the emulated drive to the "new" drive.. I am expecting that rebuilding it on itself will not work as it will most likely just recreate the issue I am having now..

 

Hoping for a better idea by someone... I need some kind of BTRFS /read-only remove command...

 

 

Link to comment

If enabled, disable array auto-start, then try to mount disk8 read-only, e.g.:

 

mkdir /x
mount -o recovery,ro /dev/sdX1 /x

 

If successful copy everything to another disk/server and then format that disk.

 

After getting the data (or if you have backups) you can try repairing the filesystem with btrfs check --repair

 

Link to comment


I started the system again, how do I disable the auto-start ?

 

I have tried mounting using command:

 

 mount -o recovery,ro /dev/md8 /x

 

System appears to hang now. Did I give that command correctly?

Link to comment

Disable autostart by editing disk.cfg on your flash drive (flash/config) and changing startArray="yes" to "no".
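The same edit can be made from a telnet session. A minimal sketch, assuming the flash drive is mounted at /boot (so the file would be /boot/config/disk.cfg) and that the setting sits on its own line as startArray="yes". The function name disable_autostart is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: set startArray="no" in an unRAID disk.cfg.
# Assumption: the setting appears on its own line as startArray="yes".
disable_autostart() {
    local cfg="$1"
    # Keep a backup copy next to the original before touching it.
    cp "$cfg" "$cfg.bak" || return 1
    # Rewrite only the startArray line; all other settings are left alone.
    sed -i 's/^startArray="yes"/startArray="no"/' "$cfg"
}

# Example call (the path is an assumption, adjust to where your flash is mounted):
# disable_autostart /boot/config/disk.cfg
```

The backup file makes it easy to switch auto-start back on once the array is healthy again.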

 

To mount the disk, first create a temp mountpoint:

 

mkdir /x

 

Then, since the array won't be started, you can't use the md device, so use sdX:

 

mount -o recovery,ro /dev/sdX1 /x

 

If that fails you can try btrfs recovery.
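Put together, the steps above might look like the sketch below. It only prints the commands unless DRY_RUN=0 is set, so it can be reviewed before anything touches the disk. /dev/sdX1 is the placeholder from this post, /mnt/disk1/disk8_copy is a hypothetical destination, and the fallback assumes that "btrfs recovery" refers to the btrfs restore tool from btrfs-progs, which copies files off a filesystem that will not mount:

```shell
#!/bin/bash
# Sketch only: prints each command by default (DRY_RUN=1).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_disk() {
    local dev="$1" mnt="$2" dest="$3"
    run mkdir -p "$mnt" "$dest"
    if run mount -o recovery,ro "$dev" "$mnt"; then
        # Read-only mount worked: copy everything off, then unmount.
        run cp -a "$mnt/." "$dest/"
        run umount "$mnt"
    else
        # Mount failed: try pulling files straight from the device.
        run btrfs restore -v "$dev" "$dest"
    fi
}

# Example (device and destination are placeholders):
# rescue_disk /dev/sdX1 /x /mnt/disk1/disk8_copy
```

In dry-run mode the mount step always "succeeds", so only the copy path is printed. Newer kernels name this mount option usebackuproot, with recovery kept as a deprecated alias for a while, so it is worth checking the btrfs documentation for the kernel in use.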

 

 

Link to comment

Thanks. I found the config and changed the autostart. Shutdown from command still does not work so I will have to force a hard reboot again.

 

Rebooting now.

Link to comment

How do I know what sdX1  to use ?

Link to comment

All cache disks are being detected as new, which is strange if the pool was working correctly. Maybe disk8 being unmountable is causing some confusion, so let's wait until disk8 is mountable to see if the issue persists.

 

Completely unrelated but before I forget, this is not good for parity checks with LSI controllers:

 

Feb 11 16:38:29 Tower kernel: mdcmd (31): set md_num_stripes 4264
Feb 11 16:38:29 Tower kernel: mdcmd (32): set md_sync_window 1920
Feb 11 16:38:29 Tower kernel: mdcmd (33): set md_sync_thresh 192

 

md_sync_thresh needs to be much higher, close to md_sync_window. Change it to 1872 and parity check speed should improve considerably.
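To put numbers on that: with md_sync_window at 1920, a threshold of 192 is only 10% of the window, while 1872 is about 97% of it. A small sketch of the arithmetic, with a made-up helper name:

```shell
#!/bin/bash
# Hypothetical helper: md_sync_thresh as a whole-number percentage of md_sync_window.
thresh_percent() {
    local window="$1" thresh="$2"
    echo $(( thresh * 100 / window ))   # integer division, rounds down
}

thresh_percent 1920 192    # value from the log above
thresh_percent 1920 1872   # suggested value
```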

Link to comment

(getting a bit nervous about btrfs ;-).

 

Can't blame you, I like btrfs and some of its features, but it can be complicated when there's trouble.

 

It's possible that btrfs check --repair would fix your problem, but it's only recommended as a last resort because it could also make things worse. Copying the data off first is much more work, but it's safer since it's non-destructive.

Link to comment

Copies are running..

 

I see three lines appearing on my console display, I do not see them in the syslog:

 

ERROR: system chunk array too small 34 < 97

ERROR: superblock checksum matches but it has invalid members

ERROR: cannot scan /dev/sdf1: Input/output error

 

sdf is my primary parity drive ..

 

Log is flooded with the following line:

 

Feb 11 21:24:24 Tower shfs/user: err: shfs_mkdir: assign_disk: system (123) No medium found
Feb 11 21:24:28 Tower shfs/user: err: shfs_mkdir: assign_disk: system (123) No medium found

Link to comment

Alrighty then... disk8 has been fully copied to other parts of the array (I did not have enough space available on one individual drive, so I had to copy the data in parts).
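For anyone in the same situation, copying in parts can be done one top-level folder at a time. A rough sketch, assuming disk8 is mounted read-only at /x as earlier in the thread. The helper name, the folder names and the destination disks in the example calls are all hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: copy one top-level folder from the rescue mount
# to a chosen array disk, preserving ownership, permissions and timestamps.
copy_part() {
    local src_root="$1" folder="$2" dest_root="$3"
    mkdir -p "$dest_root/$folder" || return 1
    # The trailing "/." copies the folder's contents, hidden files included.
    cp -a "$src_root/$folder/." "$dest_root/$folder/"
}

# Example (folder and disk names are placeholders):
# copy_part /x Movies /mnt/disk3
# copy_part /x Backups /mnt/disk6
```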

 

Since disk8 was read only mounted to /x I suspect that formatting it now will not work, so I am rebooting. When the array is back up it is my plan to format drive8.

 

Next thing is... How do I get my cache drive back, which seems to have mysteriously died on me in an unrelated event?

Link to comment

Just changed this, thanks !

Link to comment

System is back up and disk8 is now getting formatted with XFS..

 

And.... My cache drive pool is back !! After reboot everything appears to function.. The whole btrfs thing must have messed something up in the btrfs logic..

 

Good, that was my hope as there was no other reason I could see for the problem.

 

Johny.. Seriously.. You have been a tremendous help in this whole ordeal.. Thanks thanks thanks !  Can I send you a bottle of Johny black ?

Link to comment
