KVM Logging



I have been using KVM to run a Windows 10 machine for about 6 months now, and while it works, I find that the VM crashes 2 - 3 times a week.  By crashing I mean the VM is simply off: I can't RDP to it, and it shows the red box in the webGui.  In the VM 'Log' accessible through the webGui, this is all that shows:

 

Domain id=1 is tainted: high-privileges
Domain id=1 is tainted: custom-argv
Domain id=1 is tainted: host-cpu
char device redirected to /dev/pts/0 (label charserial0)
2016-04-08 18:15:39.513+0000: shutting down

 

I am assuming there are much more detailed logs for KVM that would point me toward why my VMs keep crashing, but I don't know where to look for them.  My guess is that it has something to do with memory, but that is just a hunch with no supporting information behind it.

 

Are there any experts out there that wouldn't mind helping me?

 

Here is my current VM XML in case that is helpful:

 

<domain type='kvm' id='1' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>Win 10 Main PC</name>
  <uuid>e43d49b3-00c0-a740-1250-341a6f1f11a4</uuid>
  <description>This is the main gaming rig in the office</description>
  <metadata>
    <vmtemplate name="Custom" icon="windows.png" os="windows"/>
  </metadata>
  <memory unit='KiB'>13107200</memory>
  <currentMemory unit='KiB'>13107200</currentMemory>
  <memoryBacking>
    <nosharepages/>
    <locked/>
  </memoryBacking>
  <vcpu placement='static'>7</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='4'/>
    <vcpupin vcpu='4' cpuset='5'/>
    <vcpupin vcpu='5' cpuset='6'/>
    <vcpupin vcpu='6' cpuset='7'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='7' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/user/VM_HDD_Lib/Win 10 Main PC/vdisk1.img'/>
      <backingStore/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:d9:23:a0'/>
      <source bridge='xenbr0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/Win 10 Main PC.org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <hostdev mode='subsystem' type='usb' managed='yes'>
      <source>
        <vendor id='0x1532'/>
        <product id='0x011a'/>
        <address bus='3' device='4'/>
      </source>
      <alias name='hostdev0'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='yes'>
      <source>
        <vendor id='0x046d'/>
        <product id='0xc24a'/>
        <address bus='3' device='3'/>
      </source>
      <alias name='hostdev1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <qemu:commandline>
    <qemu:arg value='-device'/>
    <qemu:arg value='ioh3420,bus=pci.0,addr=1c.0,multifunction=on,port=2,chassis=1,id=root.1'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,host=01:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='vfio-pci,host=01:00.1,bus=root.1,addr=00.1'/>
  </qemu:commandline>
</domain>

 

Here is the current hardware listing in case that is helpful:

00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x8 Controller (rev 06)
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5)
00:1c.5 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #6 (rev d5)
00:1c.6 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #7 (rev d5)
00:1c.7 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #8 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation Z87 Express LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
01:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)
02:00.0 USB controller: Fresco Logic Device 1100 (rev 10)
04:00.0 Ethernet controller: Qualcomm Atheros Killer E220x Gigabit Ethernet Controller (rev 13)
05:00.0 PCI bridge: Pericom Semiconductor PI7C9X111SL PCIe-to-PCI Reversible Bridge (rev 02)
06:04.0 RAID bus controller: Silicon Image, Inc. SiI 3124 PCI-X Serial ATA Controller (rev 01)
07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
08:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 01)

 

I am passing through the video card (01:00.0 & 01:00.1) and the PCIe USB controller (02:00.0).  I have the PCIe ACS Override setting enabled.  I have turned on the MSI patch.  All drivers and Win10 are fully updated with the latest patches and drivers.


Looking at your config, I would suspect 2 things, the first being the most likely.  A system log may shed some more light on what is going on.

 

1. Amount of assigned memory: if unRAID needs more, it will kill the VMs (unRAID did this to me many times until I backed off the memory I gave to my VMs).

2. Number of CPU cores assigned.  This is less likely, but if you take away some memory from the VM and it is still crashing, I would try fewer cores.
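 

For what it's worth, I believe the webGui 'Log' button only shows the per-domain QEMU log, which is why there is so little in it.  The detail you're after usually lands in the host's system log instead.  Assuming stock log paths, these are the places I would look first:

  # host system log - OOM kills and other host-side errors end up here
  tail -n 100 /var/log/syslog

  # libvirt's own daemon log, if your system keeps one
  tail -n 100 /var/log/libvirt/libvirtd.log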


Looking at your config, I would suspect 2 things, the first being the most likely.  A system log may shed some more light on what is going on.

 

1. Amount of assigned memory: if unRAID needs more, it will kill the VMs (unRAID did this to me many times until I backed off the memory I gave to my VMs).

2. Number of CPU cores assigned.  This is less likely, but if you take away some memory from the VM and it is still crashing, I would try fewer cores.

 

Thanks for the suggestion on number one.  Is there any way I can limit the amount of memory that unRAID can consume?  This box has 16GB total and I am giving 13GB to the VM, which leaves 3GB for unRAID.  My secondary unRAID box only has 1GB of RAM and runs without issue (no VMs on it, just straight-up unRAID and docker).

 

Seems odd that there is such a hard delineation of memory management with KVM.  All the other VM providers I have used have much better memory management and will not shut down a VM when the host needs more RAM.


Anyone else have any ideas here?

 

Can anyone tell me why the host needing memory is crashing my VM's?

Since unRAID normally has no swap file (it runs purely from RAM), if it really needs more memory to keep functioning it may well face a choice between killing the VMs or crashing the whole system.

 

I would think the easiest way forward is to make sure there is more RAM available to unRAID in the first place.  There is also a plugin for adding swap file support - I have no idea how effective that might be.

 

Not knowing what else might be running on your system, it may also be possible to use a setting that makes other processes more likely to be selected as candidates for killing in 'out-of-memory' scenarios.
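 

If anyone wants to experiment with that, the setting I have in mind is the kernel's per-process oom_score_adj.  A rough sketch (the pgrep pattern assumes the stock QEMU binary name and a single running VM):

  # make the OOM killer strongly prefer other victims over the VM's QEMU process
  # (-1000 would exempt it completely; -500 just makes it very unattractive)
  echo -500 > /proc/$(pgrep -f qemu-system-x86_64 | head -n1)/oom_score_adj

Bear in mind this only moves the problem: something else, possibly something unRAID itself needs, gets killed instead.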

 

 


If you're running docker containers, then since you are severely limiting the memory available to unRAID, you should also limit the memory available to each container, as by default those have access to your full amount of memory and can trigger the OOM killer.

 

Add this to your extra parameters for the containers

 

 --memory=8GB

(or whatever seems a sensible value to you)
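 

If you ever launch a container from the command line instead of the webGui, the same flag applies there.  Purely as an example (the container and image names are just placeholders):

  # cap this container at 2GB of RAM
  docker run -d --name=plex --memory=2g linuxserver/plex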

 

Note however that AFAIK you cannot limit the memory usage of docker as a whole.



 

Thanks for the suggestion on number one.  Is there any way I can limit the amount of memory that unRAID can consume?  This box has 16GB total and I am giving 13GB to the VM.  This means I am giving 3GB to unRAID.  My secondary unRAID only has 1GB of RAM and runs without issue (no VM on second unRAID, just straight up unRAID and docker running).

 

You are likely oversubscribing the memory.  What folks need to understand is that on top of the memory you allocate directly to the guest VM, QEMU needs a bit more for the emulated controllers it generates for storage and networking.  The exact amount it needs is hard to predict.  You can read more about memory reservations and tuning for QEMU/KVM in the libvirt documentation here:  http://libvirt.org/formatdomain.html

 

I would suggest dropping the amount of memory being allocated to the VM to solve your problem.
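 

For example (the numbers here are purely illustrative), dropping from the current 12.5GB to 8GB would mean changing these two lines in the domain XML, since 8GB = 8388608 KiB:

  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>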

 

Seems odd that there is such a hard delineation of memory management with KVM.  All the other VM providers I have used have much better memory management and will not shut down a VM when the host needs more RAM.

 

All the other VM providers you have used don't have Docker Containers or NAS functionality running as part of the host OS.  If you want a pure VM-only hypervisor solution, we are not the choice.  If you're looking for an "enterprise-grade" solution, we are not the choice.  We provide consolidated functionality in a server OS where VMs are one of the features, but not the only or defining feature.

 

You can also set a memory limit on the host OS (see here:  http://stackoverflow.com/questions/13484016/setting-limit-to-total-physical-memory-available-in-linux).


You are likely oversubscribing the memory.  What folks need to understand is that on top of the memory you allocate directly to the guest VM, QEMU needs a bit more for the emulated controllers it generates for storage and networking.  The exact amount it needs is hard to predict.  You can read more about memory reservations and tuning for QEMU/KVM in the libvirt documentation here:  http://libvirt.org/formatdomain.html

 

I would suggest dropping the amount of memory being allocated to the VM to solve your problem.

 

Since you say it is "likely", how do I determine 100% that this is the issue?  Are there no logs or dump files or crash reports that I can check to validate this assumption?

 

Seems odd that there is such a hard delineation of memory management with KVM.  All the other VM providers I have used have much better memory management and will not shut down a VM when the host needs more RAM.

 

All the other VM providers you have used don't have Docker Containers or NAS functionality running as part of the host OS.  If you want a pure VM-only hypervisor solution, we are not the choice.  If you're looking for an "enterprise-grade" solution, we are not the choice.  We provide consolidated functionality in a server OS where VMs are one of the features, but not the only or defining feature.

 

You can also set a memory limit on the host OS (see here:  http://stackoverflow.com/questions/13484016/setting-limit-to-total-physical-memory-available-in-linux).

 

I think you are being too flippant and black-and-white in your comment about what your software is and is not.  I am not looking for the things you mention; what I am looking for is a product that works well and is stable with all the advertised features enabled.

 

I am not doing anything fancy or custom; I have not enabled any third-party add-ons or mods to achieve the results I have.  I am simply using the OOTB functionality that is available, so I would expect the system to work without my VMs crashing every 2 - 5 days.

 

If I am truly oversubscribing the memory then there should be a way to (a) prove that and (b) tell me during normal use, so that I don't run into such a frustrating roadblock.



Since you say it is "likely", how do I determine 100% that this is the issue?  Are there no logs or dump files or crash reports that I can check to validate this assumption?

 

You can review the system log immediately after the event occurs.  You will see an OOM killer message (out of memory).  That confirms that you are running out of memory.  In fact, when posting any message for support here in the forum, it is advised to include your system diagnostics file which can be downloaded directly from within the webGui itself under the Tools > Diagnostics tab.
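 

For example, assuming the stock syslog path, this will show any OOM killer activity since boot (the patterns cover the usual kernel wording):

  grep -iE 'out of memory|oom-killer' /var/log/syslog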

 


 

I think you are being too flippant and black-and-white in your comment about what your software is and is not.  I am not looking for the things you mention; what I am looking for is a product that works well and is stable with all the advertised features enabled.

 

I am not doing anything fancy or custom; I have not enabled any third-party add-ons or mods to achieve the results I have.  I am simply using the OOTB functionality that is available, so I would expect the system to work without my VMs crashing every 2 - 5 days.

 

If I am truly oversubscribing the memory then there should be a way to (a) prove that and (b) tell me during normal use, so that I don't run into such a frustrating roadblock.

 

Not sure how I'm being too flippant.  The system will work without your VMs crashing if you either A) increase the memory on your system or B) lower the memory assigned to your VMs.  This isn't a software instability issue, but a memory oversubscription problem.

 

As far as proving to you that you're oversubscribing the memory, the OOM killer messages in the event logs can point to that.  As far as generating a visible alert for OOM events goes, that sounds like a good feature request, as it's not built into the software today.  Please post a message in the feature request board for that.


You can review the system log immediately after the event occurs.  You will see an OOM killer message (out of memory).  That confirms that you are running out of memory.  In fact, when posting any message for support here in the forum, it is advised to include your system diagnostics file which can be downloaded directly from within the webGui itself under the Tools > Diagnostics tab.

 

 

OK so I have had a crash today.  Here is the syslog.  Can we find out from this log what is eating up all my memory?

syslog.txt


Anyone able to look at the syslog and see what is causing the OOM situation?

 

The only thing that's clear from that excerpt is that the VM blew up, reaching 17GB before it was killed.

 

We recommend always providing the complete Diagnostics, because it has extra info that can usually help us help you.  (Tools -> Diagnostics)


Beyond the fact that you had an OOM, not much more to say.

 

I'd install Fix Common Problems plugin, then toss it into troubleshooting mode until the system crashes.

 

More relevant details will be put into the syslog every 10 minutes (along with a diagnostics capture every 30 minutes and a continuous tail on the syslog).


Beyond the fact that you had an OOM, not much more to say.

 

I'd install Fix Common Problems plugin, then toss it into troubleshooting mode until the system crashes.

 

More relevant details will be put into the syslog every 10 minutes (along with a diagnostics capture every 30 minutes and a continuous tail on the syslog).

 

OK thanks for the tip.

 

I am certain there is a memory leak in unRAID because my crashes get worse and worse as uptime progresses.  I eventually reboot unRAID and I am stable again for about 6 - 8 days.

 

Limetech, I am happy to work with you to figure out this problem.  I am certain getting to the root cause will in fact uncover a real bug that can then be fixed.



FCP will help out there because it will log a lot more useful items into the syslog that can help identify what's going on.

I am certain there is a memory leak in unRAID because my crashes get worse and worse as uptime progresses.  I eventually reboot unRAID and I am stable again for about 6 - 8 days.

 

Limetech, I am happy to work with you to figure out this problem.  I am certain getting to the root cause will in fact uncover a real bug that can then be fixed.

 

May I suggest it's too early to call it a "memory leak in unRAID" yet?  ;)

 

First, lots of other users are not having the problem.  I *think* you may be the first to report memory-leak behavior.  And second, you still have VMs, Docker containers, and plugins running (I assume).  Why not begin eliminating them one by one, or in batches, and see if the apparent leak stops with any of them?

 

In all 3 OOMs, total_vm (total virtual memory requested, I believe) reached 17GB, and actual memory usage reached about 12.5GB (I believe).  I'm not really sure why the OOMs occurred, and I found others online who had random OOMs that didn't appear 'justified'.  Your VM was always the largest memory user, but only using roughly a quarter of your 16GB.  Java was probably the next largest consumer, but still under 1GB.  I compared the 2 OOMs in your last syslog to see if I could detect a leak, a specific allocation growth, and I couldn't!  Nothing of any significance appears to have grown from the first to the second.  I'm curious what you are monitoring when you see a leak.

 

If you would like to *play*, I have a couple of parameters you can change.  Linux, unlike any other OS I'm aware of, likes to overcommit memory.  Anything that asks gets about as much as it wants, even if it never uses it.  These 2 options change it to act more conservatively, only allowing what can be committed against actual memory figures.  The reason I mention these is that users suffering from OOMs stopped having them after changing these settings.  That doesn't mean they *fixed* the problem, but it did stop failing as an OOM.  It's more likely to fail now as an in-program error or a segfault, both probably much nicer than an OOM that kills primary processes.  (References: #1, #2, #3)

  sysctl vm.overcommit_memory=2

  sysctl vm.overcommit_ratio=80
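 

These take effect immediately but will not survive a reboot.  If they help and you want them to stick, one option on unRAID (assuming the usual /boot/config/go startup script) is to add the same two lines to the end of that file:

  # in /boot/config/go, so they are re-applied at every boot
  sysctl vm.overcommit_memory=2
  sysctl vm.overcommit_ratio=80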

You can view their current values with these.

  sysctl vm.overcommit_memory

  sysctl vm.overcommit_ratio

You can view current memory numbers with this.

  cat /proc/meminfo

 

Default value of vm.overcommit_ratio is 50, as in half of your installed RAM.  Values for vm.overcommit_memory are:

  0 - allow overcommitting memory up to some 'reasonable' amount; this is the default

  1 - allow any overcommit; can allow wildly overcommitted amounts

  2 - commit from actual swap space plus vm.overcommit_ratio percentage of installed RAM

 

In meminfo, you can see the currently committed amount of RAM as Committed_AS.  If you change the ratio, you will see the CommitLimit rise from 50% of RAM to 80% of RAM.  It might be interesting to monitor meminfo repeatedly, and see if you see anything growing.  It could be especially interesting to view the one just before an OOM occurs.  Or not, I may be on the wrong track.
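 

If you want a rolling record to look back at after the next OOM, a simple loop like this would do it (the file name and interval are just my suggestion; note that /var/log lives in RAM on unRAID, so copy the file off before any reboot):

  while true; do
    date >> /var/log/meminfo-trace.log
    grep -E '^(MemFree|MemAvailable|Committed_AS|CommitLimit):' /proc/meminfo >> /var/log/meminfo-trace.log
    sleep 60
  done &

One caution before flipping vm.overcommit_memory to 2 on this particular box: with 16GB of RAM, no swap, and a ratio of 80, the CommitLimit works out to about 12.8GB (0.8 x 16GB), which is barely above the ~12.5GB assigned to the VM, so the VM plus QEMU's own overhead would probably be refused.  You would want to shrink the VM's allocation first.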

 

One big new variable here is VMs.  KVM and VMs are relatively new to Linux, and I hardly know anything about how they fit into the memory management circus here.  Disclaimer: I'm in no way an expert here, just learning.
