Author Topic: Sharing "page allocaion failure" errors experience - problem solved  (Read 3365 times)

Offline theone

  • Hero Member
  • *****
  • Posts: 929
Hello,

I have been experiencing "page allocaion failure" errors on my unRAID server when running a Virtual Machine on VirtualBox for unRAID (http://lime-technology.com/forum/index.php?topic=10978.0).
I know that my problem is a result of VirtualBox because when the Virtual Machine is disabled/stopped the errors don't reoccur.
The errors would be very frequent every few minutes.

After implementing the solution described further down the problem seems to have gone (running without errors for ~2 days).
I wanted to share my experience so others with such a problem (or similar not just because of VirtualBox) can get help.

The errors I was getting were formatted as below:
Code: [Select]
Oct 17 14:01:57 Tower kernel: swapper/2: page allocation failure: order:2, mode:0x4020
Oct 17 14:01:57 Tower kernel: Pid: 0, comm: swapper/2 Tainted: G        W  O 3.4.11-unRAID #1
Oct 17 14:01:57 Tower kernel: Call Trace:
Oct 17 14:01:57 Tower kernel:  [<c1062eb6>] warn_alloc_failed+0xbd/0xcf
Oct 17 14:01:57 Tower kernel:  [<c106372e>] __alloc_pages_nodemask+0x47c/0x4a5
Oct 17 14:01:57 Tower kernel:  [<c12a12e0>] ? skb_copy+0x2e/0x83
Oct 17 14:01:57 Tower kernel:  [<c10637ad>] __get_free_pages+0xf/0x21
Oct 17 14:01:57 Tower kernel:  [<c10831e8>] __kmalloc_track_caller+0x2e/0xf2
Oct 17 14:01:57 Tower kernel:  [<c12a0807>] ? __alloc_skb+0x2a/0xf5
Oct 17 14:01:57 Tower kernel:  [<c12a082d>] __alloc_skb+0x50/0xf5
Oct 17 14:01:57 Tower kernel:  [<c12a12e0>] skb_copy+0x2e/0x83
Oct 17 14:01:57 Tower kernel:  [<f85ec612>] vboxNetFltLinuxPacketHandler+0x60/0x4e8 [vboxnetflt]
Oct 17 14:01:57 Tower kernel:  [<c104530f>] ? getnstimeofday+0x4d/0xd0
Oct 17 14:01:57 Tower kernel:  [<c10453a4>] ? ktime_get_real+0x12/0x2e
Oct 17 14:01:57 Tower kernel:  [<c12a72f1>] dev_queue_xmit_nit+0x13b/0x144
Oct 17 14:01:57 Tower kernel:  [<c12a76d8>] dev_hard_start_xmit+0x5c/0x31b
Oct 17 14:01:57 Tower kernel:  [<c12a7c10>] ? dev_queue_xmit+0x279/0x286
Oct 17 14:01:57 Tower kernel:  [<c12b6184>] sch_direct_xmit+0x54/0x13f
Oct 17 14:01:57 Tower kernel:  [<c12a7a95>] dev_queue_xmit+0xfe/0x286
Oct 17 14:01:57 Tower kernel:  [<c12ae7a1>] neigh_resolve_output+0x135/0x15b
Oct 17 14:01:57 Tower kernel:  [<c1062a0d>] ? free_pages_prepare+0xb4/0xb6
Oct 17 14:01:57 Tower kernel:  [<c12b5685>] ? eth_rebuild_header+0x53/0x53
Oct 17 14:01:57 Tower kernel:  [<c12c3504>] ip_finish_output+0x22d/0x258
Oct 17 14:01:57 Tower kernel:  [<c12c35c2>] ip_output+0x93/0x9a
Oct 17 14:01:57 Tower kernel:  [<c12c2a5a>] ip_local_out+0x1b/0x1e
Oct 17 14:01:57 Tower kernel:  [<c12c2f3e>] ip_queue_xmit+0x2ad/0x2ee
Oct 17 14:01:57 Tower kernel:  [<c129e519>] ? __skb_clone+0x22/0xbf
Oct 17 14:01:57 Tower kernel:  [<c12d27b6>] tcp_transmit_skb+0x4d3/0x508
Oct 17 14:01:57 Tower kernel:  [<c12d4993>] tcp_write_xmit+0x2f2/0x3ec
Oct 17 14:01:57 Tower kernel:  [<c12d4ad1>] __tcp_push_pending_frames+0x18/0x6f
Oct 17 14:01:57 Tower kernel:  [<c12d1481>] tcp_rcv_established+0xfa/0x575
Oct 17 14:01:57 Tower kernel:  [<c12d6978>] tcp_v4_do_rcv+0x47/0x13a
Oct 17 14:01:57 Tower kernel:  [<c12d6e4e>] tcp_v4_rcv+0x3e3/0x665
Oct 17 14:01:57 Tower kernel:  [<c12bf240>] ip_local_deliver_finish+0xba/0x192
Oct 17 14:01:57 Tower kernel:  [<c12bf379>] ip_local_deliver+0x61/0x66
Oct 17 14:01:57 Tower kernel:  [<c12bef01>] ip_rcv_finish+0x23d/0x253
Oct 17 14:01:57 Tower kernel:  [<c12bf150>] ip_rcv+0x239/0x26f
Oct 17 14:01:57 Tower kernel:  [<c12a5189>] __netif_receive_skb+0x223/0x259
Oct 17 14:01:57 Tower kernel:  [<c12a6767>] netif_receive_skb+0x66/0x6c
Oct 17 14:01:57 Tower kernel:  [<c12a682b>] napi_skb_finish+0x1e/0x34
Oct 17 14:01:57 Tower kernel:  [<c12a6c98>] napi_gro_receive+0xe7/0xef
Oct 17 14:01:57 Tower kernel:  [<f84463ba>] e1000_receive_skb+0x3c/0x42 [e1000e]
Oct 17 14:01:57 Tower kernel:  [<f8446628>] e1000_clean_rx_irq+0x268/0x30e [e1000e]
Oct 17 14:01:57 Tower kernel:  [<f8448995>] e1000_clean+0x6a/0x211 [e1000e]
Oct 17 14:01:57 Tower kernel:  [<c12a6d6e>] net_rx_action+0x59/0x12c
Oct 17 14:01:57 Tower kernel:  [<c1027056>] __do_softirq+0x6b/0xe5
Oct 17 14:01:57 Tower kernel:  [<c1026feb>] ? irq_enter+0x41/0x41
Oct 17 14:01:57 Tower kernel:  <IRQ>  [<c1026e9f>] ? irq_exit+0x32/0x58
Oct 17 14:01:57 Tower kernel:  [<c1003506>] ? do_IRQ+0x7c/0x90
Oct 17 14:01:57 Tower kernel:  [<c13208e9>] ? common_interrupt+0x29/0x30
Oct 17 14:01:57 Tower kernel:  [<c10081ed>] ? mwait_idle+0x4a/0x50
Oct 17 14:01:57 Tower kernel:  [<c10083e8>] ? cpu_idle+0x4b/0x65
Oct 17 14:01:57 Tower kernel:  [<c131b5f9>] ? start_secondary+0x9a/0x9c
Oct 17 14:01:57 Tower kernel: Mem-Info:
Oct 17 14:01:57 Tower kernel: DMA per-cpu:
Oct 17 14:01:57 Tower kernel: CPU    0: hi:    0, btch:   1 usd:   0
Oct 17 14:01:57 Tower kernel: CPU    1: hi:    0, btch:   1 usd:   0
Oct 17 14:01:57 Tower kernel: CPU    2: hi:    0, btch:   1 usd:   0
Oct 17 14:01:57 Tower kernel: CPU    3: hi:    0, btch:   1 usd:   0
Oct 17 14:01:57 Tower kernel: Normal per-cpu:
Oct 17 14:01:57 Tower kernel: CPU    0: hi:  186, btch:  31 usd: 179
Oct 17 14:01:57 Tower kernel: CPU    1: hi:  186, btch:  31 usd: 171
Oct 17 14:01:57 Tower kernel: CPU    2: hi:  186, btch:  31 usd: 184
Oct 17 14:01:57 Tower kernel: CPU    3: hi:  186, btch:  31 usd: 143
Oct 17 14:01:57 Tower kernel: HighMem per-cpu:
Oct 17 14:01:57 Tower kernel: CPU    0: hi:  186, btch:  31 usd: 112
Oct 17 14:01:57 Tower kernel: CPU    1: hi:  186, btch:  31 usd:  92
Oct 17 14:01:57 Tower kernel: CPU    2: hi:  186, btch:  31 usd: 137
Oct 17 14:01:57 Tower kernel: CPU    3: hi:  186, btch:  31 usd: 145
Oct 17 14:01:57 Tower kernel: active_anon:61918 inactive_anon:38 isolated_anon:0
Oct 17 14:01:57 Tower kernel:  active_file:100718 inactive_file:184117 isolated_file:0
Oct 17 14:01:57 Tower kernel:  unevictable:234794 dirty:176 writeback:0 unstable:0
Oct 17 14:01:57 Tower kernel:  free:102653 slab_reclaimable:9203 slab_unreclaimable:115250
Oct 17 14:01:57 Tower kernel:  mapped:228658 shmem:54 pagetables:1054 bounce:0
Oct 17 14:01:57 Tower kernel: DMA free:3500kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15780kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:176kB slab_unreclaimable:12144kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Oct 17 14:01:57 Tower kernel: lowmem_reserve[]: 0 869 4039 4039
Oct 17 14:01:57 Tower kernel: Normal free:248680kB min:3736kB low:4668kB high:5604kB active_anon:30716kB inactive_anon:4kB active_file:19120kB inactive_file:19188kB unevictable:9908kB isolated(anon):0kB isolated(file):0kB present:890008kB mlocked:0kB dirty:16kB writeback:0kB mapped:2544kB shmem:4kB slab_reclaimable:36636kB slab_unreclaimable:448856kB kernel_stack:1808kB pagetables:172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:19174 all_unreclaimable? no
Oct 17 14:01:57 Tower kernel: lowmem_reserve[]: 0 0 25363 25363
Oct 17 14:01:57 Tower kernel: HighMem free:158432kB min:512kB low:3920kB high:7328kB active_anon:216956kB inactive_anon:148kB active_file:383752kB inactive_file:717280kB unevictable:929268kB isolated(anon):0kB isolated(file):0kB present:3246564kB mlocked:0kB dirty:688kB writeback:0kB mapped:912088kB shmem:212kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:4044kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Oct 17 14:01:57 Tower kernel: lowmem_reserve[]: 0 0 0 0
Oct 17 14:01:57 Tower kernel: DMA: 1*4kB 31*8kB 45*16kB 27*32kB 18*64kB 4*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3500kB
Oct 17 14:01:57 Tower kernel: Normal: 21762*4kB 20168*8kB 1*16kB 4*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 248664kB
Oct 17 14:01:57 Tower kernel: HighMem: 3568*4kB 1022*8kB 557*16kB 101*32kB 1689*64kB 113*128kB 5*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 158432kB
Oct 17 14:01:57 Tower kernel: 519691 total pagecache pages
Oct 17 14:01:57 Tower kernel: 0 pages in swap cache
Oct 17 14:01:57 Tower kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 17 14:01:57 Tower kernel: Free swap  = 0kB
Oct 17 14:01:57 Tower kernel: Total swap = 0kB
Oct 17 14:01:57 Tower kernel: 1310704 pages RAM
Oct 17 14:01:57 Tower kernel: 1082370 pages HighMem
Oct 17 14:01:57 Tower kernel: 490407 pages reserved
Oct 17 14:01:57 Tower kernel: 304658 pages shared
Oct 17 14:01:57 Tower kernel: 430686 pages non-shared

After some investigation on the internet I came across the following site that describes the "page allocation failure" log output:
http://www.linuxsmiths.com/blog/?p=527

Reading through it I found that the "trace" output is where correct information can be gathered about the source of the error, which in my case is "vboxNetFltLinuxPacketHandler" and the "e1000" intel ethernet driver.
Further in the article it is mentioned that a fast (1GIG ethernet connection) can exhaust quite fast the memory resources (TX/RX buffer) and cause this error.
The given answer was changes the minimum free memory that the OS always leaves free for fast allocation.
This can be done by increasing /proc/sys/vm/min_free_kbytes.

Again turning to the internet I found several mentions of "page allocation failure" related to "vboxNetFltLinuxPacketHandler":
https://www.virtualbox.org/ticket/5260
https://www.virtualbox.org/ticket/10108
There also I found reference to vm.min_free_kbytes

Searching the internet once again for min_free_kbytes I can accross the following link:
http://serverfault.com/questions/236170/page-allocation-failure-am-i-running-out-of-memory

Quote
#change value for this boot
sysctl -w vm.min_free_kbytes=65536

#change value for subsequent boots
echo "vm.min_free_kbytes=65536" >> /etc/sysctl.conf

and

Quote
This tells the kernel to try and keep 64MB of RAM free at all times. It’s useful in two main cases:
* Swap-less machines, where you don’t want incoming network traffic to overwhelm the kernel and force an OOM before it has time to flush any buffers.
* x86 machines, for the same reason: the x86 architecture only allows DMA transfers below approximately 900MB of RAM. So you can end up with the bizarre situation of an OOM error with tons of RAM free.

I do not have a swap file.

Looking at unRAID kernel settings I saw that the min_free_kbytes is only 3806 which is ~3.7MB - seems very low.
I added the following to my go script (after array is up) and reboot:
Code: [Select]
#Increase v.min_free_kbytes to 64MB (default was 3806=3.71MB) for better immunity to "page allocation failure"
sysctl -w vm.min_free_kbytes=65536

Since then (2 days now - October 17 2012) I have had no "page allocation errors" as of yet.

Hope this helps someone.
Case: "Two Towers" Custom Build | MB: Supermicro X7SPA-HF | Memory: 2x 2GB Kingston 667MHz | PSU: Corsair CX400W | Boot Device: Patriot 8GB Xporter XT Boost | unRAID version: 5.0.5 i386
Storage: 1x WD30EZRX (parity), 1x WD30EZRX, 2x WD20EARS, 1x WD10EADS, 1xWD5000AAKS, 1x ST3250318AS (cache)
My Plugins: VirtualBox Plugin          Swap File Plugin           YAMJ Plugin