Can't start VMs, and can't stop them - Unraid 6.2 beta 19



So I decided to give the Unraid 6.2 beta a try because of the NVMe and Nvidia/Hyper-V support, and now I'm getting a really weird issue.  None of my VMs will load into Windows; one of them did once (after probably ten unsuccessful tries) but hasn't again since.

 

Basically I start the server, start the array, start a VM, and it just sits at the spinning Windows logo during boot indefinitely.  I've let it sit for an hour with no change.  I have tried stopping the VM - nothing.  I have tried force stopping the VM and it always comes back with the following error:

 

(obviously with a different process ID each time)

Failed to terminate process 13447 with SIGKILL: Device or resource busy

 

After that error appears the webUI locks up (I still get SSH access) and I cannot unmount any of the disks to safely shut the machine down, so I have to force restart it.
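
Side note: if it helps with diagnosis, I can run something like the following over SSH the next time it hangs (the PID and mount points are just examples from my setup) to see whether the qemu process is stuck in uninterruptible I/O wait and what is still holding the mounts busy:

ps -o pid,stat,wchan:30,cmd -p 13447    # a STAT of "D" means uninterruptible I/O wait, which SIGKILL cannot interrupt
cat /proc/13447/stack                   # kernel stack of the stuck process, if the kernel exposes it
fuser -vm /mnt/user                     # what is still holding the user share open (lsof works too if fuser isn't installed)
fuser -vm /mnt/cache                    # same for the cache mount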

 

The vdisk for the VM is stored on my array, though it does use the cache layer. 

 

Obviously, the VM operated fine in 6.1.9 - however, there is one major difference now: my cache layer is no longer a pair of 256GB SSDs, it is now a single 512GB NVMe SSD.  That being said, in 6.1.9 I had a different VM running off of the NVMe drive (which was mounted into Unraid via my go file) and it worked flawlessly - I no longer have that VM.

 

The VM logs from the webUI never load, and I'm not aware of where those logs are stored, so I don't have that info to post (I'd be happy to grab them if someone could point me to where they are located).
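
(From what I can tell, stock libvirt keeps per-VM QEMU logs under /var/log/libvirt/qemu/<vm name>.log, so assuming Unraid uses the default path, something like this over SSH should pull them:)

ls /var/log/libvirt/qemu/                    # one log file per defined VM
tail -n 100 /var/log/libvirt/qemu/Nyx.log    # last chunk of the log for the Nyx VM
cp /var/log/libvirt/qemu/Nyx.log /boot/      # copy to the flash drive so it survives a hard reboot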

 

Here's the XML config for one of the VMs:

<domain type='kvm' id='1'>
  <name>Nyx</name>
  <uuid>1583133c-98ee-3342-24da-45b22af1fbe4</uuid>
  <description>SQL Server</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='9'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='11'/>
    <vcpupin vcpu='4' cpuset='12'/>
    <vcpupin vcpu='5' cpuset='13'/>
    <vcpupin vcpu='6' cpuset='14'/>
    <vcpupin vcpu='7' cpuset='15'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor id='none'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='4' threads='2'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/user/ArrayVDisks/Nyx/vdisk1.img'/>
      <backingStore/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:18:e4:9d'/>
      <source bridge='br0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/8'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/8'>
      <source path='/dev/pts/8'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-Nyx/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes' websocket='5700' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
</domain>
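
(For reference, when the webUI is locked up the same state checks and force stop should be doable from SSH with virsh - roughly like this, using the VM name from the XML above:)

virsh list --all      # all defined VMs and their current state
virsh domstate Nyx    # state of this one domain
virsh destroy Nyx     # hard stop - as far as I can tell this is what the webUI "force stop" calls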

 

Of note, in the syslog I see a lot of this error:

 

Hydra kernel: BTRFS warning (device loop0): csum failed ino 3069 off 1785856 csum 2365913268 expected csum 1094680760

 

And regardless of whether a VM is running, I am now frequently seeing this error as well:

 

(usually this is repeated over and over, as if it's trying to complete a task that never finishes correctly)

Mar 22 10:11:03 Hydra kernel: ------------[ cut here ]------------
Mar 22 10:11:03 Hydra kernel: WARNING: CPU: 11 PID: 8076 at fs/btrfs/extent-tree.c:4180 btrfs_free_reserved_data_space_noquota+0x5b/0x7b()
Mar 22 10:11:03 Hydra kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables vhost_net vhost macvtap macvlan xt_nat veth iptable_filter ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 nf_nat ip_tables md_mod tun mxm_wmi x86_pkg_temp_thermal coretemp kvm_intel kvm i2c_i801 e1000e alx ptp mdio ahci pps_core nvme libahci wmi [last unloaded: md_mod]
Mar 22 10:11:03 Hydra kernel: CPU: 11 PID: 8076 Comm: kworker/u48:3 Tainted: G        W       4.4.5-unRAID #1
Mar 22 10:11:03 Hydra kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme6, BIOS P2.10 12/15/2015
Mar 22 10:11:03 Hydra kernel: Workqueue: writeback wb_workfn (flush-btrfs-6)
Mar 22 10:11:03 Hydra kernel: 0000000000000000 ffff88102d5f3600 ffffffff8136891e 0000000000000000
Mar 22 10:11:03 Hydra kernel: 0000000000001054 ffff88102d5f3638 ffffffff8104a28a ffffffff812ab3d1
Mar 22 10:11:03 Hydra kernel: 0000000000004000 ffff88105359ec00 ffff880fa9dc3be0 ffff88102d5f3734
Mar 22 10:11:03 Hydra kernel: Call Trace:
Mar 22 10:11:03 Hydra kernel: [<ffffffff8136891e>] dump_stack+0x61/0x7e
Mar 22 10:11:03 Hydra kernel: [<ffffffff8104a28a>] warn_slowpath_common+0x8f/0xa8
Mar 22 10:11:03 Hydra kernel: [<ffffffff812ab3d1>] ? btrfs_free_reserved_data_space_noquota+0x5b/0x7b
Mar 22 10:11:03 Hydra kernel: [<ffffffff8104a347>] warn_slowpath_null+0x15/0x17
Mar 22 10:11:03 Hydra kernel: [<ffffffff812ab3d1>] btrfs_free_reserved_data_space_noquota+0x5b/0x7b
Mar 22 10:11:03 Hydra kernel: [<ffffffff812c27d0>] btrfs_clear_bit_hook+0x143/0x272
Mar 22 10:11:03 Hydra kernel: [<ffffffff812d8f25>] clear_state_bit+0x8b/0x155
Mar 22 10:11:03 Hydra kernel: [<ffffffff812d9227>] __clear_extent_bit+0x238/0x2c3
Mar 22 10:11:03 Hydra kernel: [<ffffffff812d96e3>] clear_extent_bit+0x12/0x14
Mar 22 10:11:03 Hydra kernel: [<ffffffff812d9c76>] extent_clear_unlock_delalloc+0x46/0x18f
Mar 22 10:11:03 Hydra kernel: [<ffffffff8111df29>] ? igrab+0x32/0x46
Mar 22 10:11:03 Hydra kernel: [<ffffffff812d662d>] ? __btrfs_add_ordered_extent+0x288/0x2cf
Mar 22 10:11:03 Hydra kernel: [<ffffffff812c65cd>] cow_file_range+0x300/0x3bd
Mar 22 10:11:03 Hydra kernel: [<ffffffff812c7249>] run_delalloc_range+0x321/0x331
Mar 22 10:11:03 Hydra kernel: [<ffffffff812da2af>] writepage_delalloc.isra.14+0xaa/0x126
Mar 22 10:11:03 Hydra kernel: [<ffffffff812dc3d4>] __extent_writepage+0x150/0x1f7
Mar 22 10:11:03 Hydra kernel: [<ffffffff812dc6d1>] extent_write_cache_pages.isra.10.constprop.24+0x256/0x30c
Mar 22 10:11:03 Hydra kernel: [<ffffffff812dcbcf>] extent_writepages+0x46/0x57
Mar 22 10:11:03 Hydra kernel: [<ffffffff812c4384>] ? btrfs_direct_IO+0x28e/0x28e
Mar 22 10:11:03 Hydra kernel: [<ffffffff812c2f19>] btrfs_writepages+0x23/0x25
Mar 22 10:11:03 Hydra kernel: [<ffffffff810c2bbf>] do_writepages+0x1b/0x24
Mar 22 10:11:03 Hydra kernel: [<ffffffff8112945b>] __writeback_single_inode+0x3d/0x151
Mar 22 10:11:03 Hydra kernel: [<ffffffff81129a15>] writeback_sb_inodes+0x212/0x38e
Mar 22 10:11:03 Hydra kernel: [<ffffffff81129c02>] __writeback_inodes_wb+0x71/0xa9
Mar 22 10:11:03 Hydra kernel: [<ffffffff81129de8>] wb_writeback+0x10b/0x195
Mar 22 10:11:03 Hydra kernel: [<ffffffff8112a37f>] wb_workfn+0x157/0x22b
Mar 22 10:11:03 Hydra kernel: [<ffffffff8112a37f>] ? wb_workfn+0x157/0x22b
Mar 22 10:11:03 Hydra kernel: [<ffffffff8105ac40>] process_one_work+0x194/0x2a0
Mar 22 10:11:03 Hydra kernel: [<ffffffff8105b5f6>] worker_thread+0x26b/0x353
Mar 22 10:11:03 Hydra kernel: [<ffffffff8105b38b>] ? rescuer_thread+0x285/0x285
Mar 22 10:11:03 Hydra kernel: [<ffffffff8105f870>] kthread+0xcd/0xd5
Mar 22 10:11:03 Hydra kernel: [<ffffffff8105f7a3>] ? kthread_worker_fn+0x137/0x137
Mar 22 10:11:03 Hydra kernel: [<ffffffff8161a43f>] ret_from_fork+0x3f/0x70
Mar 22 10:11:03 Hydra kernel: [<ffffffff8105f7a3>] ? kthread_worker_fn+0x137/0x137
Mar 22 10:11:03 Hydra kernel: ---[ end trace a5e83c137feb7195 ]---

 

I have attached my diagnostics file with this post.  Here's my rig:

 

ASRock Extreme6 MoBo - X99 chipset

Intel Xeon 12-core processor

64GB DDR4 memory

10TB WD Red SATA array (5x2TB hard drives - one drive parity, four drives storage, usable 8TB)

512GB Samsung 950 NVMe SSD

Nvidia GeForce GTX 970 GFX card (x2 - not in SLI - used for VM passthrough)

Nvidia GeForce GTX 730 GFX card (used for Unraid video out, as X99 chipsets do not have onboard graphics)

 

All storage disks are formatted using BTRFS

 

This server isn't doing anything critical yet, so I'm not too worried about rolling it back, but obviously I would like to get it running again if I could, haha.

 

BTW, I know this is beta software and I don't expect it to be perfect.  However, I just read through the entire 6.2b18 and 6.2b19 release threads and didn't find anyone complaining about this issue, so I imagine it's something specific to my setup that might be resolvable - the only unusual thing in my setup is the NVMe drive, and others have reported that as working fine.

hydra-diagnostics-20160322-1034.zip

Link to comment

Okay, I seem to have alleviated my issues. I had been planning to replace my Unraid USB key with a new one anyway, since the original key was quite old and not one I would trust long term.  I set up a new key, transferred my Unraid license, and rebuilt the Unraid configuration from the ground up (fresh Unraid install, not from a backup) without touching the data on the array.  I am now able to start the VMs I care about without issue.  There is still one VM showing the symptoms described in the post above, but I had literally just built that VM and I don't mind destroying and rebuilding it.

 

That being said, this isn't fully resolved because of the BTRFS errors I described above.  I would like to sort those out, as they appear to be pretty significant.  Any ideas?

Link to comment

ehhh I spoke too soon - my VMs start but very quickly degrade

 

The issue very much seems to be that the underlying disks are getting locked up: the guest OSes are running, but as soon as they start to access any of the shares they immediately stop responding.

Link to comment

As an update - I ran a BTRFS scrub against all of my drives and none of them came back with any errors, so I'm really confused as to why I'm seeing csum errors in syslog  :(
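
(For reference, a scrub and the per-device error counters can be checked from SSH roughly like this - /mnt/cache here just stands in for whichever btrfs mount you're checking:)

btrfs scrub start /mnt/cache     # kick off a scrub on that filesystem
btrfs scrub status /mnt/cache    # progress and error counts once it finishes
btrfs dev stats /mnt/cache       # cumulative per-device error counters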

 

EDIT: From what I'm reading (I'm a software engineer, but still fairly new to Linux), the loop0 device that's throwing the csum warnings sounds like it actually represents the 'user' mount... so I suppose it makes sense that the drives are fine but something is wrong with the loop0 mount?
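
(One way to confirm what loop0 actually maps to, rather than guessing, is to list the loop devices and their backing files over SSH:)

losetup -a    # prints each /dev/loopN and the file backing it, e.g. a docker.img or libvirt.img path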

 

EDIT2: Looks like the loop0 errors were from carrying forward my old docker.img file; I rebuilt it and the loop0 errors appear to have disappeared.

 

Okay, so the issue still remains that the underlying disks are getting locked up while the VMs are trying to access them... if I try to shut down a VM it simply locks up the system, and if I try to force stop it from the webUI instead, it throws an error stating that the device is busy and then locks up the system.

Link to comment

Here's something interesting from my libvirt log:

 

2016-03-23 02:28:05.379+0000: 5808: info : libvirt version: 1.3.1
2016-03-23 02:28:05.379+0000: 5808: info : hostname: Hydra
2016-03-23 02:28:05.379+0000: 5808: warning : qemuDomainObjTaint:2223 : Domain id=1 name='Atlas' uuid=10486b00-4ece-cd7e-33b9-29bb551f8c89 is tainted: high-privileges
2016-03-23 02:28:05.379+0000: 5808: warning : qemuDomainObjTaint:2223 : Domain id=1 name='Atlas' uuid=10486b00-4ece-cd7e-33b9-29bb551f8c89 is tainted: host-cpu

 

Could this be the reason this VM won't start, and why it locks up the system on shutdown?

Link to comment

Failed to terminate process 13447 with SIGKILL: Device or resource busy

 

After that error appears the webUI locks up (I still get SSH access) and I cannot unmount any of the disks to safely shut the machine down, so I have to force restart it.

 

The vdisk for the VM is stored on my array, though it does use the cache layer.

 

I have the exact same issue. I moved my VMs to the array during the upgrade, in order to add the cache drive.

When I started a VM it showed the symptoms you described, even before adding the NVMe cache drive.

Some other programs (mc, htop) also froze and had to be restarted in another SSH session.

 

It may be related to NVMe, since you also have an NVMe disk, but at this point I don't think it is.

 

Once I added the new cache (NVMe) and moved the "system" share with libvirt.img and all VMs to the cache, they were working again.

 

To sum it up, there are 2 issues:

1) A VM (at least Windows) running a vDisk on the array boots infinitely (without freezing).

2) Destroying a VM in the state mentioned above will fail (resource busy) and lock up the WebGUI and other stuff.

 

I believe running VMs directly from the array was never recommended, but it definitely worked with 6.1.9.

Some of my VMs had a second disk on the array for backup reasons; not anymore.

 

Btw, some VMs (Win10) booted to the desktop, but I could not do anything apart from moving the mouse.

My Server 2012 R2 VM never made it that far.

 

Maybe it's a guest-agent issue that begins once the agent gets loaded, which may happen at different times for different VMs/OSes.
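
(If someone wants to test that theory, I believe the agent can be pinged from the host with virsh - the VM name here is just an example:)

virsh qemu-agent-command Nyx '{"execute":"guest-ping"}'    # should return {"return":{}} if the guest agent is up and responding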

Link to comment


Yessss - exact same issues

 

I can attest that running vdisks off of the array worked flawlessly for me in 6.1.8 and 6.1.9 as well.  And while I would like to blame the move to the NVMe cache drive, I'm technically not even involving the cache drive on my worst VM: its vdisk is on a share on the array (called ArrayVDisks), and that share is set to not use the cache drive.
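
(To double-check that, the physical location of the vdisk can be confirmed over SSH, since the user share is just a fused view of the array disks plus cache - paths taken from the XML above:)

ls -lh /mnt/user/ArrayVDisks/Nyx/vdisk1.img                                                      # the user-share view of the vdisk
ls -lh /mnt/disk*/ArrayVDisks/Nyx/vdisk1.img /mnt/cache/ArrayVDisks/Nyx/vdisk1.img 2>/dev/null   # whichever of these exists is where it physically lives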

 

Here's the state of my three VMs:

Atlas - array-only vdisk; boots infinitely; locks up the server when force stopping

Nyx - array-only vdisk; boots 75% of the time; unable to do anything but keyboard/mouse input once on the desktop; locks up the server when force stopping

Endeavour - NVMe cache vdisk for boot/OS, secondary array vdisk for permanent storage; boots 75% of the time; same desktop responsiveness and force-stop behavior as Nyx

 

Neither Nyx nor Endeavour can be shut down gracefully after a successful boot, as doing so locks up the server.

 

Endeavour is the most telling of the issue, as I reproduced this several times:

1. Launch the VM, boot into Windows, log in

2. Launch UPlay (essentially Ubisoft's version of Steam) and notice a patch is available for a game installed on my secondary drive (array vdisk)

3. The patch starts downloading, gets about halfway, then just stops and never finishes

4. Wait an hour, then shut down the VM; it tries to shut down for twenty minutes without success and requires a force stop - locking up the server and requiring a hard shutdown

5. Restart the server and repeat steps 1-4 with the same result

 

Hope these repro steps help narrow down the issue.

Link to comment

I did not report the error or try to investigate, because I have other issues with 6.2.

So my logs may show stuff that is unrelated. It's usually better to solve one problem at a time.

Unless you also have the issue of high host CPU load while watching a video (YouTube) in the guest (on the NVMe cache) - then it may be related :)

 

I could reproduce the error on my server and post the diagnostics if Lime-Tech thinks it helps.

Forcefully shutting down unRAID while the array is running is not something I want to do that often ^^

 

From the official wiki for 6.0 / 6.1:

Create User Shares for Virtualization

So I guess running a VM from the array was supported, but not recommended due to performance.

 

But 6.2 may look different...

Link to comment

Wow, that release came a lot quicker than I expected.  Unfortunately I've already started to explore options outside of Unraid for my setup, but almost all of them are falling on their faces with my particular hardware, so I may very well wind up back at Unraid this weekend, in which case I'll give this a go.

 

Thanks for following up though - it's good to hear that I may have this as a fall-back option!

Link to comment

The issue is not fixed - just not as bad, I guess, or maybe that one VM was not enough to trigger the problem; who knows.

 

I moved everything back to the array, and as soon as a Windows VM with a system disk on the array boots, or a secondary vDisk is used heavily during a backup, the system locks up.

But my other issue is fixed, so I can focus on this problem...

 

Reported it in the 6.2 beta20 announcement thread with some diagnostics.

Link to comment
