OS-8005: bhyve memory pressure needs to target ARC better

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2019-09-30T14:09:54.922Z
Updated at: 2021-03-08T20:58:26.991Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2021-03-08T20:58:26.980Z)

Fix Versions

2021-03-11 modestly priced receptacle (Release Date: 2021-03-11)

Labels

bhyve

Description

When a bhyve instance is starting, it applies memory pressure in hopes that it will cause the ARC to shrink. Sometimes this works, sometimes it doesn't.

In bad cases, it causes a memory shortfall so severe that, before the ARC can shrink, the system starts swapping out LWPs. This is described in a smartos-discuss thread.

Currently, vm_reserve_pages() just calls page_needfree(), a generic means of putting pressure on the VM system; this increases needfree. Without tuning, if freemem - needfree falls below 1/128 of RAM (desfree), swapping can start. That is, rather than just paging out memory that hasn't been used recently, entire LWPs are swapped out.
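
For reference, a rough sketch of the thresholds involved (simplified from the illumos defaults; this is illustrative, not the actual kernel source):

/*
 * Default illumos paging thresholds, derived at boot:
 */
pgcnt_t lotsfree = physmem / 64;   /* pageout scanner wakes below this */
pgcnt_t desfree  = lotsfree / 2;   /* i.e. physmem / 128 */
pgcnt_t minfree  = desfree / 2;    /* severe shortfall */

/* What vm_reserve_pages() does today: apply generic VM pressure. */
page_needfree(npages);             /* bumps needfree */

/*
 * Once freemem - needfree stays below desfree, the memory scheduler
 * can begin swapping out whole LWPs rather than paging out cold pages.
 */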

At first blush, it would seem that a ZFS interface allowing a call to arc_reduce_target_size() would be sufficient. However, while we are waiting for the ARC to shrink, someone else can come along and allocate memory (normal memory allocation, page-in, etc.). Asking for slightly more than is needed may be one way to deal with this.
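
A hypothetical sketch of that approach (arc_reduce_target_size() is the real ZFS function; the overshoot factor and the bounded retry loop are invented for illustration):

int64_t want  = (int64_t)ptob(npages);
int64_t slack = want / 8;       /* overshoot to absorb racing allocations */

arc_reduce_target_size(want + slack);

/* Wait (bounded) for the ARC to actually give the pages back. */
for (int i = 0; i < max_retries && freemem < npages + throttlefree; i++)
        delay(hz / 10);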

The long term fix in this area is to come up with a better reservation scheme that allows pages to be reserved but not immediately allocated to the VM. This has its own set of challenges that is beyond the scope of this ticket.

Until this is fixed, it is likely that the best workaround is to limit the size of the ARC by setting zfs_arc_max in /etc/system.
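
For example (the cap shown here is hypothetical; the value is in bytes and should be sized to leave enough free memory for the VMs on the machine):

* Cap the ARC at 4 GiB (0x100000000 bytes).
set zfs:zfs_arc_max = 0x100000000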

Comments

Comment by Former user
Created at 2019-10-01T14:51:59.071Z
Updated at 2019-10-01T15:04:32.670Z

In reading through zfsonlinux changes to arc.c, I see that the ARC tunables are dynamic on Linux. If we had the same on SmartOS, vmm could leverage this to dynamically adjust arc_c_max as bhyve instances come and go. Specifically, as a bhyve instance is starting, it should reduce arc_c_max by the size of any locked memory that the instance will use. This should include the guest RAM, the page_t structures needed to track it, and perhaps other memory used by a bhyve instance.
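
A hypothetical sketch of that accounting (vmm_resv_bytes() and VMM_OTHER_OVERHEAD are invented for illustration; btop(), page_t, arc_c_min, and arc_c_max are real):

/* Locked memory a starting instance will pin, per the above. */
size_t
vmm_resv_bytes(size_t guest_ram)
{
        /* page_t structures needed to track the guest's pages */
        size_t paget_bytes = btop(guest_ram) * sizeof (page_t);

        return (guest_ram + paget_bytes + VMM_OTHER_OVERHEAD);
}

/* As the instance starts, lower the ARC ceiling accordingly: */
arc_c_max = MAX(arc_c_min, arc_c_max - vmm_resv_bytes(ram));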

The dynamic ARC tuning feature is something that would be of general use in illumos and is separable from the bhyve problem.  See illumos#11765.


Comment by Former user
Created at 2019-10-01T18:26:34.982Z
Updated at 2019-10-01T18:27:49.444Z

Presuming that we teach vmm how to update zfs_arc_max, we may also want:


Comment by Michael Zeller
Created at 2020-03-19T21:14:19.319Z

While reproducing this in a small 4 GB VM, I was able to use this program to fill the ARC and prevent a bhyve VM from booting:

use std::fs;
use std::io::prelude::*;
use std::io::{Seek, SeekFrom};
use std::path::Path;
use std::process::Command;

const FILE_PATH: &str = "/var/tmp/bigfile";

// Create a 2 GiB file of random data to read back through the ARC.
fn create_file() {
    Command::new("dd")
        .args(&[
            "if=/dev/urandom",
            &format!("of={}", FILE_PATH),
            "bs=1M",
            "count=2048",
        ])
        .output()
        .expect("failed to create file");
}

// Read the file in small (128-byte) chunks so each pass touches every
// block and keeps the file's data cached in the ARC.
fn read_file_chunks(file: &Path) -> std::io::Result<()> {
    let max = 128;
    let mut buf = Vec::with_capacity(max);
    let mut f = fs::File::open(file)?;
    let file_len = f.seek(SeekFrom::End(0))?;
    let mut pos = 0;

    while pos < file_len {
        buf.clear();
        f.seek(SeekFrom::Start(pos))?;
        f.by_ref().take(max as u64).read_to_end(&mut buf)?;
        pos += max as u64;
    }
    Ok(())
}

fn main() {
    let file = Path::new(FILE_PATH);
    if !file.is_file() {
        println!("creating file at: {}", FILE_PATH);
        create_file();
    }

    // Run until killed; releasing the pressure lets the VM start.
    loop {
        read_file_chunks(file).expect("failed to read file");
    }
}

In my case I gave bhyve a chunk of memory:

root - kraid ~ # vmadm get 3e570482-5ff4-c1f7-9c53-8398f05fc2c4 | json ram
1844

I let the arcfill program run for a while, until ::memstat showed "ZFS File Data" using a large amount of memory. Then I tried to start the VM and saw the following:

root - kraid ~ # mdb -ke ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     392116              1531   38%
Boot pages                  76071               297    7%
ZFS File Data              118101               461   11%
Anon                        14558                56    1%
Exec and libs                 539                 2    0%
Page cache                   3520                13    0%
Free (cachelist)            31806               124    3%
Free (freelist)            408399              1595   39%

Total                     1045110              4082
Physical                  1045109              4082
root - kraid ~ # arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
20:57:13  264M  1.1M      0  270K    0  869K   56   54K    1   454M  358M
20:57:14  127K   974      0     0    0   974   99     1    0   471M  358M
20:57:15  123K   937      0     0    0   937   99     1    0   470M  358M
20:57:16  131K   667      0     0    0   667   66     1    0   436M  358M

The VM's platform.log showed:

{"log":"Unable to allocate memory (1933574144), retrying in 1 second\n","stream":"stderr","time":"2020-03-19T20:58:18.443450000Z"}
{"log":"Unable to allocate memory (1933574144), retrying in 1 second\n","stream":"stderr","time":"2020-03-19T20:58:19.443620000Z"}
{"log":"Unable to allocate memory (1933574144), retrying in 1 second\n","stream":"stderr","time":"2020-03-19T20:58:20.443844000Z"}
{"log":"Unable to allocate memory (1933574144), retrying in 1 second\n","stream":"stderr","time":"2020-03-19T20:58:21.443973000Z"}

Finally, once I killed the arcfill program, the VM was able to start. This is what I saw once the VM was booting:

root - kraid ~ # mdb -ke ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     386634              1510   37%
Boot pages                  76071               297    7%
ZFS File Data               86311               337    8%
VMM Memory                 476416              1861   46%
Anon                         5431                21    1%
Exec and libs                 211                 0    0%
Page cache                   1327                 5    0%
Free (cachelist)             9007                35    1%
Free (freelist)              3702                14    0%

Total                     1045110              4082
Physical                  1045109              4082

root - kraid ~ # arcstat 1 3
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
21:03:09  273M  1.3M      0  321K    0  943K   58   58K    1   344M  358M
21:03:10   169     4      2     4    2     0    0     4    2   344M  358M
21:03:11     0     0      0     0    0     0    0     0    0   344M  358M

Comment by Former user
Created at 2020-03-27T22:43:17.962Z
Updated at 2020-03-27T22:43:42.285Z

Talking to Allan Jude, I learned of a tunable added to FreeBSD's ZFS implementation called arc_free_target. It is used by arc_available_memory(), which reports how much memory the ARC can consume on the system before it needs to reclaim buffers (a negative value indicates a reclaim is needed immediately). The implementation itself is pretty straightforward: it looks at various VM parameters and returns the smallest value.

This approach seems promising, so for testing we added a similar tunable (called arc_virt_machine_reserved for now) and a new vmm ioctl, VM_ARC_RESV. The idea is that before bhyve does its memory allocation loop, it issues the VM_ARC_RESV ioctl. This tracks the amount reserved per vmm instance (so it can be released when the instance is destroyed), increments arc_virt_machine_reserved by that amount, and kicks off the ZFS ARC reclaim thread (which uses arc_available_memory() to determine whether a reclaim should be done).
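
Roughly, the flow looks like this (a sketch only -- aside from arc_virt_machine_reserved, arc_available_memory(), and the VM_ARC_RESV name itself, the identifiers are invented for illustration):

volatile uint64_t arc_virt_machine_reserved;    /* pages reserved for VMM */

static int
vmm_arc_resv(vmm_softc_t *sc, uint64_t npages)
{
        /* Track per instance so the reservation can be undone on destroy. */
        sc->vmm_arc_resv += npages;
        atomic_add_64(&arc_virt_machine_reserved, npages);

        /*
         * Kick the ARC reclaim thread; arc_available_memory() now
         * subtracts the reservation, sees the deficit, and shrinks
         * the ARC asynchronously.
         */
        arc_kick_reclaim_thread();
        return (0);
}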

This work is being tracked in the arc-testing branch.

Initial testing shows promise:

On my home system, starting with:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    2705547             10568    8%
Boot pages                  75227               293    0%
ZFS File Data             7250662             28322   22%
VMM Memory                7344896             28691   22%
Anon                       281578              1099    1%
Exec and libs                6678                26    0%
Page cache                  41844               163    0%
Free (cachelist)           602954              2355    2%
Free (freelist)          15215600             59435   45%

Total                    33524986            130956
Physical                 33524985            130956

I created a bhyve instance with 70 GB of RAM -- enough to use the remaining free memory plus some of the ARC.
The instance was able to start, though it did take five passes through the allocation loop before that occurred. Once the VM was started, the ARC had shrunk accordingly:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    2663962             10406    8%
Boot pages                  75227               293    0%
ZFS File Data             1276298              4985    4%
VMM Memory               25699328            100388   77%
Anon                       251720               983    1%
Exec and libs                1554                 6    0%
Page cache                   6471                25    0%
Free (cachelist)           689647              2693    2%
Free (freelist)           2860779             11174    9%

Total                    33524986            130956
Physical                 33524985            130956

Since the ARC reclaim/reaping is done asynchronously, and given the amount of memory the ARC had to free, it doesn't seem unreasonable that it took some time before the memory was available. We would need to determine whether this is acceptable, or whether we need to be more aggressive.

The code also currently warns if a reservation would take the ARC below arc_c_min. The thought is that capping reservations at arc_c_min should still allow the system to function in the face of large numbers of VMs. This will need more testing, though -- in a mixed (VM and non-VM) workload it seems possible that VM memory usage could still push out memory used by non-VM instances (as well as GZ processes), causing poor performance for those instances.

Preventing this might be a bit more challenging -- ideally, if we had some notion of the actual working set size of all the non-VM instances, we could prevent a VM from starting if its size + the non-VM working set sizes + OS overhead exceeded physical RAM. I'm not sure there's currently any metric that could be examined for this (it also assumes the non-VM working set sizes are static, which may be a poor assumption). One idea is to examine any RAM resource limits on a zone and use those as part of determining whether a VM should be allowed to start.


Comment by Former user
Created at 2021-03-07T01:10:23.966Z

As a small refinement of the above workaround (and note that it is intended as an interim measure until the bhyve kernel memory mechanism can be improved), we track the total number of pages requested for VMM as well as the per-VM-instance amounts. A request that would push the total of arc_c_min + VMM reservations past physical memory is rejected (to preserve a minimum amount of ARC). As long as the request succeeds, arc_available_memory() considers the VMM requirements and shrinks the ARC as needed.
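
A hypothetical sketch of that admission check (arc_c_min, physmem, and ptob() are real; vmm_total_resv and the surrounding code are invented for illustration):

/* Reject reservations that would squeeze the ARC below arc_c_min. */
if (arc_c_min + ptob(vmm_total_resv + npages) > ptob(physmem))
        return (ENOMEM);

vmm_total_resv += npages;       /* arc_available_memory() accounts for this */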

As noted, since the shrinking is asynchronous, if the ARC has grown very large it can still take a few seconds to release sufficient RAM (especially for larger VMs), delaying the startup of the VM by that much.

This also does not address the need for VMM memory to be contiguous in the kernel. That can still cause excessive delays or prevent a VM from starting -- especially large VMs in the face of high kernel memory fragmentation. However, our experience so far suggests this is less frequent than the ARC hogging too much memory.


Comment by Former user
Created at 2021-03-07T01:16:59.959Z

To test, I started with an unmodified PI that had been running for some time, so its ARC utilization was rather high:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    6083886             23765   18%
Boot pages                  75234               293    0%
ZFS File Data            17147676             66983   51%
VMM Memory                5255168             20528   16%
Anon                       390880              1526    1%
Exec and libs                6596                25    0%
Page cache                  53352               208    0%
Free (cachelist)           737205              2879    2%
Free (freelist)           3776029             14750   11%

Total                    33526026            130961
Physical                 33526025            130961

I then tried to start a large VM (64 GB -- large enough that its RAM exceeded the freelist, but small enough that there was sufficient memory between the freelist and the ARC for it). As expected, it just sat in the 'cannot allocate memory' loop:

[2021-03-05T18:26:30.542830000Z]  INFO: zoneadmd/61341 on bob: (stream=stdout)
    nvlist version: 0
        bhyve_args = bhyve -H -U 5e63409e-2cc1-4886-d0d2-9c7efb89878c -B 1,manufacturer=Joyent,product=SmartDC HVM,version=7.20210106T005452Z,serial=5e63409e-2cc1-4886-d0d2-9c7efb89878c,sku=001,family=Virtual Machine -s 31,lpc -l com1,/dev/zconsole -l com2,socket,/tmp/vm.ttyb -l bootrom,/usr/share/bhyve/uefi-csm-rom.bin -s 0,hostbridge,model=i440fx -c 4 -m 65536 -s 0:4:0,virtio-blk,/dev/zvol/rdsk/zones/5e63409e-2cc1-4886-d0d2-9c7efb89878c/disk0 -s 6:0,virtio-net-viona,net0
[2021-03-05T18:26:30.542921000Z]  INFO: zoneadmd/61341 on bob: (stream=stdout)
    -s 30:0,fbuf,vga=off,unix=/tmp/vm.vnc -s 30:1,xhci,tablet SYSbhyve-39

[2021-03-05T18:27:30.549585000Z]  INFO: zoneadmd/61341 on bob: (stream=stderr)
    Unable to allocate memory (68719476736), retrying in 1 second

[2021-03-05T18:28:31.549547000Z]  INFO: zoneadmd/61341 on bob: (stream=stderr)
    Unable to allocate memory (68719476736), retrying in 1 second
...

Rebooting onto a PI with the interim fix, I first did a lot of disk activity to fill up the ARC, until it looked similar to how it was on the previous PI:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    4258635             16635   13%
Boot pages                  75234               293    0%
ZFS File Data            18077431             70614   54%
VMM Memory                5255168             20528   16%
Anon                       275276              1075    1%
Exec and libs                6593                25    0%
Page cache                  46697               182    0%
Free (cachelist)           414516              1619    1%
Free (freelist)           5116476             19986   15%

Total                    33526026            130961
Physical                 33526025            130961

Getting the numbers to match exactly would be impractical, but the state should be similar enough.
I then tried to start the same VM, which succeeded (note that it did take about 3 seconds for the ARC to shrink sufficiently before the VM could start):

[2021-03-07T00:57:41.699396000Z]  INFO: zoneadmd/3034 on bob: -H -U 5e63409e-2cc1-4886-d0d2-9c7efb89878c -B 1,manufacturer=Joyent,product=SmartDC HVM,version=7.20210304T234710Z,serial=5e63409e-2cc1-4886-d0d2-9c7efb89878c,sku=001,family=Virtual Machine -s 31,lpc -l com1,/dev/zconsole -l com2,socket,/tmp/vm.ttyb -l bootrom,/usr/share/bhyve/uefi-csm-rom.bin -s 0,hostbridge,model=i440fx -c 4 -m 65536 -s 0:4:0,virtio-blk,/dev/zvol/rdsk/zones/5e63409e-2cc1-4886-d0d2-9c7efb89878c/disk0 -s 6:0,virtio-net-viona,net0 -s 30:0,fbuf,vga=off,unix=/tmp/vm.vnc -s 30:1,xhci,tablet SYSbhyve-39 (stream=stdout)
[2021-03-07T00:57:41.699454000Z]  INFO: zoneadmd/3034 on bob: (stream=stdout)


[2021-03-07T00:58:41.705468000Z]  INFO: zoneadmd/3034 on bob: (stream=stderr)
    Unable to allocate memory (68719476736), retrying in 1 second

[2021-03-07T00:59:08.706009000Z]  INFO: zoneadmd/3034 on bob: (stream=stderr)
    Unable to allocate memory (68719476736), retrying in 1 second

[2021-03-07T00:59:09.706333000Z]  INFO: zoneadmd/3034 on bob: (stream=stderr)
    Unable to allocate memory (68719476736), retrying in 1 second

[2021-03-07T00:59:49.486936000Z]  INFO: zoneadmd/3034 on bob: (stream=stderr)
    rdmsr to register 0x140 on vcpu 0

And the memory usage was reflected in ::memstat:

> ::memstat
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                    4319110             16871   13%
Boot pages                  75234               293    0%
ZFS File Data             1994417              7790    6%
VMM Memory               22040576             86096   66%
Anon                       121777               475    0%
Exec and libs                1286                 5    0%
Page cache                   5335                20    0%
Free (cachelist)           620228              2422    2%
Free (freelist)           4348063             16984   13%

Total                    33526026            130961
Physical                 33526025            130961