OS-7447: formalize bhyve resource exclusion


Issue Type: Improvement
Priority: 4 - Normal
Created at: 2018-12-13T23:23:04.960Z
Updated at: 2019-07-05T15:48:14.576Z


Created by: Patrick Mooney [X]
Reported by: Patrick Mooney [X]
Assigned to: Patrick Mooney [X]


Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-07-05T15:48:14.564Z)

Fix Versions

2019-07-18 Werewolf Bar Mitzvah (Release Date: 2019-07-18)

Related Issues

Related Links




Currently, bhyve uses control of the instance vCPUs to serialize modifications to in-kernel instance state. Common actions such as running a vCPU or delivering an interrupt require locking that specific vCPU. Operations which have wider effects, such as altering the memory map of the VM, require locking all the vCPUs. This ensures that no vCPU will be executing when critical resources might be undergoing state changes. It also means that items such as the memory map can be accessed in a lockless manner during normal operation, since holding any of the vCPU locks ensures that they will remain read-only for that period.

This situation is acceptable for bhyve on FreeBSD, but for illumos bhyve it presents a challenge: how does one handle non-vCPU actors such as viona? Like a guest vCPU, viona needs to perform guest-physical to host-virtual translations and inject interrupts, yet its actions are not associated with any particular running vCPU. Its execution pattern -- long-running threads servicing the rings -- also differs from bhyve ioctls, which generally have a tightly bounded lifetime. (Even vCPU execution operations will be short if interrupted by a request to lock that vCPU for other actions.)

All of this suggests that the existing vCPU locking model should be augmented with additional protection for periods when VM-changing actions are occurring.


Comment by Patrick Mooney [X]
Created at 2019-02-26T19:47:31.668Z

Testing for this change began with a general smoke-test of the typical guest population (Linux, Windows, FreeBSD). The change is meant to be confined to the internals of locking and resource exclusion within bhyve, so it should produce no externally visible change in guest behavior. The initial tests confirm this, with guests running normally.

One aspect of the change, the concept of lease break/sign cycles when a VM write lock is requested, is likely the most delicate from an internals perspective. Requiring that viona flush any pending transmissions and temporarily clear its mac rx function is a potentially intrusive act, so it is prudent to verify that the lease mechanism functions as desired in this case.
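The break/sign cycle described above might be sketched as follows; all names here are hypothetical and the real vmm lease API certainly differs:

```c
/*
 * Sketch of the lease break/sign idea.  A long-running consumer such
 * as viona signs a lease; when a writer wants the VM lock, each
 * outstanding lease is broken, which asks the holder to quiesce
 * (flush pending tx, drop its mac rx function).  Once the holder
 * releases the broken lease the writer may proceed, and the holder
 * signs a fresh lease afterward.
 */
typedef struct lease {
	int	held;			/* currently signed by a consumer */
	int	broken;			/* writer has requested the VM lock */
	void	(*quiesce)(void *);	/* holder's cleanup callback */
	void	*arg;
} lease_t;

static void
lease_sign(lease_t *l, void (*quiesce)(void *), void *arg)
{
	l->held = 1;
	l->broken = 0;
	l->quiesce = quiesce;
	l->arg = arg;
}

/* Called on behalf of a write-lock acquirer. */
static void
lease_break(lease_t *l)
{
	if (l->held && !l->broken) {
		l->broken = 1;
		l->quiesce(l->arg);	/* e.g. viona flushes its rings */
	}
}

/* The holder notices the breakage and releases before the writer enters. */
static int
lease_check_and_release(lease_t *l)
{
	if (l->broken) {
		l->held = 0;
		return (1);		/* must re-sign after the write */
	}
	return (0);
}
```

The key property is that a writer never proceeds while a lease is outstanding: breaking invokes the holder's quiesce callback, the holder drops the lease, and it signs a new one once the write completes.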

Since there are not yet any runtime consumers which would incur VM write lock acquisitions, I added some temporary test scaffolding to exercise those code paths:

diff --git a/usr/src/cmd/bhyvectl/bhyvectl.c b/usr/src/cmd/bhyvectl/bhyvectl.c                                                                                                                                                                                                                                             [0/0]
index d7179d5874..5e2f1cd131 100644
--- a/usr/src/cmd/bhyvectl/bhyvectl.c
+++ b/usr/src/cmd/bhyvectl/bhyvectl.c
@@ -306,6 +306,7 @@ static int unassign_pptdev, bus, slot, func;
 static int run;
 static int get_cpu_topology;
+static int do_wrlock;

 /*
  * VMCB specific.
@@ -1479,6 +1480,7 @@ setup_options(bool cpu_intel)
                { "get-suspended-cpus", NO_ARG, &get_suspended_cpus,    1 },
                { "get-intinfo",        NO_ARG, &get_intinfo,           1 },
                { "get-cpu-topology",   NO_ARG, &get_cpu_topology,      1 },
+               { "wrlock",             NO_ARG, &do_wrlock,     1 },

        const struct option intel_opts[] = {
@@ -1903,6 +1905,12 @@ main(int argc, char *argv[])

+       if (!error && do_wrlock) {
+               error = vm_wrlock(ctx);
+               printf("wrlock: %d\n", error);
+               exit(0);
+       }
        if (!error && memsize)
                error = vm_setup_memory(ctx, memsize, VM_MMAP_ALL);

diff --git a/usr/src/lib/libvmmapi/common/mapfile-vers b/usr/src/lib/libvmmapi/common/mapfile-vers
index cecf22dd4c..f0a2a9d442 100644
--- a/usr/src/lib/libvmmapi/common/mapfile-vers
+++ b/usr/src/lib/libvmmapi/common/mapfile-vers
@@ -119,6 +119,7 @@ SYMBOL_VERSION ILLUMOSprivate {
+               vm_wrlock;

diff --git a/usr/src/lib/libvmmapi/common/vmmapi.c b/usr/src/lib/libvmmapi/common/vmmapi.c
index 9ef7c2eb20..9153107bf1 100644
--- a/usr/src/lib/libvmmapi/common/vmmapi.c
+++ b/usr/src/lib/libvmmapi/common/vmmapi.c
@@ -1786,6 +1786,15 @@ vm_get_device_fd(struct vmctx *ctx)
        return (ctx->fd);
 }
 
+int
+vm_wrlock(struct vmctx *ctx)
+{
+       int error;
+
+       error = ioctl(ctx->fd, VM_WRLOCK, 0);
+       return (error);
+}
+
 #ifdef __FreeBSD__
 const cap_ioctl_t *
 vm_get_ioctls(size_t *len)
diff --git a/usr/src/lib/libvmmapi/common/vmmapi.h b/usr/src/lib/libvmmapi/common/vmmapi.h
index 0c372c70d0..af1ab48f54 100644
--- a/usr/src/lib/libvmmapi/common/vmmapi.h
+++ b/usr/src/lib/libvmmapi/common/vmmapi.h
@@ -270,6 +270,7 @@ int vm_set_topology(struct vmctx *ctx, uint16_t sockets, uint16_t cores,
            uint16_t threads, uint16_t maxcpus);
 int    vm_get_topology(struct vmctx *ctx, uint16_t *sockets, uint16_t *cores,
            uint16_t *threads, uint16_t *maxcpus);
+int    vm_wrlock(struct vmctx *ctx);

 #ifdef __FreeBSD__
diff --git a/usr/src/uts/i86pc/io/vmm/vmm_sol_dev.c b/usr/src/uts/i86pc/io/vmm/vmm_sol_dev.c
index d14199fcb5..ebcd49870c 100644
--- a/usr/src/uts/i86pc/io/vmm/vmm_sol_dev.c
+++ b/usr/src/uts/i86pc/io/vmm/vmm_sol_dev.c
@@ -459,6 +459,7 @@ vmmdev_do_ioctl(vmm_softc_t *sc, int cmd, intptr_t arg, int md,
        case VM_MAP_PPTDEV_MMIO:
        case VM_ALLOC_MEMSEG:
        case VM_MMAP_MEMSEG:
+       case VM_WRLOCK:
                lock_type = LOCK_WRITE_HOLD;
@@ -1275,6 +1276,9 @@ vmmdev_do_ioctl(vmm_softc_t *sc, int cmd, intptr_t arg, int md,
+       case VM_WRLOCK:
+               error = 0;
+               break;
        default:
                error = ENOTTY;
diff --git a/usr/src/uts/i86pc/sys/vmm_dev.h b/usr/src/uts/i86pc/sys/vmm_dev.h
index 63ccc36dc6..1bcc3f52e9 100644
--- a/usr/src/uts/i86pc/sys/vmm_dev.h
+++ b/usr/src/uts/i86pc/sys/vmm_dev.h
@@ -387,6 +387,7 @@ enum {
 #ifndef __FreeBSD__
        /* illumos-custom ioctls */
+       IOCNUM_WRLOCK = 257,

@@ -504,6 +505,7 @@ enum {
 #ifndef __FreeBSD__
 #define        VM_DEVMEM_GETOFFSET \
        _IOW('v', IOCNUM_DEVMEM_GETOFFSET, struct vm_devmem_offset)
+#define        VM_WRLOCK _IO('v', IOCNUM_WRLOCK)

 /* ioctls used against ctl device for vm create/destroy */
 #define        VMM_IOC_BASE            (('V' << 16) | ('M' << 8))

This essentially gives bhyvectl the ability to force a VM write lock acquisition/release cycle. I booted a PI with this change and repeated the wrlock command while VMs were running a variety of workloads. Most interesting was doing so during an iperf3 test, where throughput did not significantly decline unless the wrlock command was hammered repeatedly. This is a comforting result: write lock acquisitions, which should be extremely rare at runtime, are not detrimental to guest performance.

Comment by Patrick Mooney [X]
Created at 2019-02-26T22:47:05.159Z

@john.levon's suggestion of testing under reboot conditions unearthed an avenue for deadlock inside viona. With that fixed, further testing revealed a troubling deadlock condition in the vioapic:

stack pointer for thread fffffe5af6d7abc0 (bhyve/44 [vcpu 3]): fffffe007bc161a0
[ fffffe007bc161a0 _resume_from_idle+0x12b() ]
  fffffe007bc161d0 swtch+0x141()
  fffffe007bc16270 cv_timedwait_sig_hires+0x39d(fffffe5960964120, fffffe5960964118, 3b9aca00, f4240, 0)
  fffffe007bc162a0 cv_reltimedwait_sig+0x4f(fffffe5960964120, fffffe5960964118, 3e8, 4)
  fffffe007bc16300 vm_handle_rendezvous+0x9b(fffffe5960964000, 3)
  fffffe007bc16360 vm_smp_rendezvous+0x10b()
  fffffe007bc16410 vioapic_write+0x153(fffffe5b032c67c0, 3, 1c, 1a036)
  fffffe007bc16490 vioapic_mmio_rw+0xe6(fffffe5b032c67c0, 3, fec00010, fffffe007bc164b8, 4, 0)
  fffffe007bc164f0 vioapic_mmio_write+0x4e(fffffe5960964000, 3, fec00010, 1a036, 4, fffffe007bc1671f)
  fffffe007bc16580 emulate_mov+0xb2(fffffe5960964000, 3, fec00010, fffffe5960964600, fffffffff8458e10, fffffffff8458e70, fffffe007bc1671f)
  fffffe007bc165d0 vmm_emulate_instruction+0x6c(fffffe5960964000, 3, fec00010, fffffe5960964600, fffffe59609645e8, fffffffff8458e10, fffffffff8458e70, fffffe007bc1671f)
  fffffe007bc16670 vm_handle_inst_emul+0x18c(fffffe5960964000, 3, fffffe007bc1671f)
  fffffe007bc16760 vm_run+0x3f7(fffffe5960964000, fffffe007bc16ae0)
  fffffe007bc16c20 vmmdev_do_ioctl+0xca6(fffffe5afc395040, c0907601, fffffc7fe7e01e00, 202003, fffffe5afe105640, fffffe007bc16ea8)
  fffffe007bc16cc0 vmm_ioctl+0x132(13100000002, c0907601, fffffc7fe7e01e00, 202003, fffffe5afe105640, fffffe007bc16ea8)
  fffffe007bc16d00 cdev_ioctl+0x39(13100000002, c0907601, fffffc7fe7e01e00, 202003, fffffe5afe105640, fffffe007bc16ea8)
  fffffe007bc16d50 spec_ioctl+0x60(fffffe5afc87d680, c0907601, fffffc7fe7e01e00, 202003, fffffe5afe105640, fffffe007bc16ea8, 0)
  fffffe007bc16de0 fop_ioctl+0x55(fffffe5afc87d680, c0907601, fffffc7fe7e01e00, 202003, fffffe5afe105640, fffffe007bc16ea8, 0)
  fffffe007bc16f00 ioctl+0x9b(3, c0907601, fffffc7fe7e01e00)
  fffffe007bc16f10 sys_syscall+0x19f()
> ::pgrep bhyvectl
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  17074   6395   6395   6334      0 0x4a004000 fffffe5af7f2a070 bhyvectl
> fffffe5af7f2a070::walk thread | ::findstack -v
stack pointer for thread fffffe5af7b884a0 (bhyvectl/1): fffffe00817125b0
[ fffffe00817125b0 _resume_from_idle+0x12b() ]
  fffffe00817125e0 swtch+0x141()
  fffffe0081712620 cv_wait+0x70(fffffe596096435e, fffffe5960964350)
  fffffe0081712690 vcpu_set_state_locked+0x8e(fffffe5960964000, 1, 1, 1)
  fffffe00817126f0 vcpu_set_state+0x5e(fffffe5960964000, 1, 1, 1)
  fffffe0081712710 vcpu_lock_one+0x23(fffffe5afc395040, 1)
  fffffe0081712760 vmm_write_lock+0x36(fffffe5afc395040)
  fffffe0081712c20 vmmdev_do_ioctl+0x186(fffffe5afc395040, 20007701, 0, 202003, fffffe5af80763f0, fffffe0081712ea8)
  fffffe0081712cc0 vmm_ioctl+0x132(13100000002, 20007701, 0, 202003, fffffe5af80763f0, fffffe0081712ea8)
  fffffe0081712d00 cdev_ioctl+0x39(13100000002, 20007701, 0, 202003, fffffe5af80763f0, fffffe0081712ea8)
  fffffe0081712d50 spec_ioctl+0x60(fffffe5af8088280, 20007701, 0, 202003, fffffe5af80763f0, fffffe0081712ea8, 0)
  fffffe0081712de0 fop_ioctl+0x55(fffffe5af8088280, 20007701, 0, 202003, fffffe5af80763f0, fffffe0081712ea8, 0)
  fffffe0081712f00 ioctl+0x9b(3, 20007701, 0)
  fffffe0081712f10 sys_syscall+0x19f()

Cleaning up the old vm_rendezvous logic from the vioapic will probably be necessary to fix this.
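One plausible reading of the two stacks above, modeled as a wait-for graph (node names are illustrative, not taken from the code): the wrlock thread locks vCPUs one at a time and is blocked waiting for vCPU 1 to go idle; vCPU 1 has checked in to the rendezvous initiated by vCPU 3 and waits for it to complete; the rendezvous cannot complete because another vCPU is already held by the wrlock thread. A cycle in the graph means deadlock:

```c
/*
 * Wait-for graph model of the stacks above (a sketch under stated
 * assumptions, not a description of the actual lock state).  Each
 * node waits on the node(s) its row marks; a cycle implies deadlock.
 */
#define	NNODE	4
enum { WRLOCK, VCPU1, RENDEZVOUS, VCPU0 };

static const int waits_for[NNODE][NNODE] = {
	[WRLOCK]	= { [VCPU1] = 1 },	/* wrlock waits on vCPU 1 */
	[VCPU1]		= { [RENDEZVOUS] = 1 },	/* vCPU 1 parked in rendezvous */
	[RENDEZVOUS]	= { [VCPU0] = 1 },	/* rendezvous needs vCPU 0 */
	[VCPU0]		= { [WRLOCK] = 1 },	/* vCPU 0 already held by wrlock */
};

/* DFS: state 0 = unvisited, 1 = in progress, 2 = done. */
static int
visit(int node, int *state)
{
	if (state[node] == 1)
		return (1);	/* back-edge: cycle found */
	if (state[node] == 2)
		return (0);
	state[node] = 1;
	for (int next = 0; next < NNODE; next++) {
		if (waits_for[node][next] && visit(next, state))
			return (1);
	}
	state[node] = 2;
	return (0);
}

static int
has_deadlock(void)
{
	int state[NNODE] = { 0 };

	for (int n = 0; n < NNODE; n++) {
		if (visit(n, state))
			return (1);
	}
	return (0);
}
```

Under this reading the cycle closes through the rendezvous, which is why removing the old vm_rendezvous usage from the vioapic breaks the circular wait.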

Comment by Patrick Mooney [X]
Created at 2019-06-05T20:51:45.440Z

As mentioned in OS-7622, I repeated the wrlock-cycle test with both that fix and this one. With both in place, I was unable to reproduce the deadlock conditions during repeated reboot tests.

Comment by Patrick Mooney [X]
Created at 2019-06-27T21:33:32.794Z

My Windows smoke test (running through Windows Update) completed successfully, and a Linux guest was able to complete a similar workload. As intended, there is no guest-perceptible change in behavior.

Comment by Jira Bot
Created at 2019-07-05T15:45:41.239Z

illumos-joyent commit b69647d3aacd223817db151e369b34ec2471cf28 (branch master, by Patrick Mooney)

OS-7447 formalize bhyve resource exclusion
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Reviewed by: John Levon <john.levon@joyent.com>
Approved by: John Levon <john.levon@joyent.com>