OS-6716: bhyve guest VCPUs pause intermittently

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2018-03-01T03:18:18.417Z
Updated at: 2018-03-16T17:00:28.794Z

People

Created by: Bryan Cantrill [X]
Reported by: Bryan Cantrill [X]
Assigned to: Patrick Mooney [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-03-14T21:51:04.027Z)

Fix Versions

2018-03-15 Nibelheim (Release Date: 2018-03-15)

Labels

bhyve

Description

@joe.landman reported (and others have also reported) that bhyve guests on SmartOS seem to pause intermittently. Running this simple invocation in a guest on his machine can demonstrate the problem:

$ while true; do date ; sleep 1 ; done

This will occasionally show very long pauses, e.g.:

Thu Mar  1 02:54:28 UTC 2018
Thu Mar  1 02:54:29 UTC 2018
Thu Mar  1 02:54:30 UTC 2018
Thu Mar  1 02:54:31 UTC 2018
Thu Mar  1 02:54:32 UTC 2018
Thu Mar  1 02:54:33 UTC 2018
Thu Mar  1 02:54:34 UTC 2018
Thu Mar  1 02:56:32 UTC 2018
Thu Mar  1 02:56:33 UTC 2018
Thu Mar  1 02:56:34 UTC 2018
Thu Mar  1 02:56:35 UTC 2018
Thu Mar  1 02:56:36 UTC 2018
Thu Mar  1 02:56:37 UTC 2018
Thu Mar  1 02:57:00 UTC 2018
Thu Mar  1 02:57:01 UTC 2018
Thu Mar  1 02:57:02 UTC 2018
Thu Mar  1 02:57:03 UTC 2018

I looked into this with the following DTrace invocation:

# dtrace -n vcpu_set_state_locked:entry'{@[arg1, arg2] = count()}' -x aggsortkey -n tick-1sec'{printa(@); clear(@)}'

During the paused periods, one will see vCPUs that no longer change state (the columns below are the vCPU ID, the new state, and the number of state transitions in the past second), e.g.:

                0                0               41
                0                1               87
                0                2               36
                0                3               10
                1                0                1
                1                1              500
                1                2              250
                1                3              249
                2                0                1
                2                1              502
                2                2              251
                2                3              250
                3                0                1
                3                1              500
                3                2              250
                3                3              249
                4                0                1
                4                1              502
                4                2              251
                4                3              250
                5                0                1
                5                1              500
                5                2              250
                5                3              249
                6                0                1
                6                1              502
                6                2              251
                6                3              250
                7                0                2
                7                1              498
                7                2              250
                7                3              246
                8                0                1
                8                1             1433
                8                2              935
                8                3              497
                9                0                1
                9                1              503
                9                2              252
                9                3              250
               10                0                0
               10                1                0
               10                2                0
               10                3                0
               11                0                1
               11                1              500
               11                2              250
               11                3              249
               12                0                1
               12                1              500
               12                2              250
               12                3              249
               13                0                1
               13                1              691
               13                2              346
               13                3              344
               14                0                0
               14                1                0
               14                2                0
               14                3                0
               15                0                1
               15                1            12986
               15                2             6493
               15                3             6492

In the above output, vCPUs 10 and 14 have seemingly stopped changing state. These vCPUs seem to always be (virtually) halted:

              genunix`cv_wait_sig+0x185
              vmm`vm_handle_hlt+0x17c
              vmm`vm_run+0x38c
              vmm`vmmdev_do_ioctl+0xc0e
              vmm`vmm_ioctl+0x12c
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`sys_syscall+0x19f

Also of note: running pstack on the stuck zhyve process always seems to unstick it. On the one hand, this would point to a missed wakeup -- but on the other, the vm_handle_hlt loop will check (and break) on the thread's AST, which will be set if made runnable via /proc control.

One experiment that might be worth trying is to replace the cv_wait_sig in vm_handle_hlt with a cv_timedwait_sig with (say) a 1ms timeout, to see if the pauses are still experienced. To be clear, this isn't a fix -- it would merely paper over the problem -- but it may provide an interesting data point (especially if the pauses persist).
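
As a concrete sketch of that experiment, the wait loop might be bounded roughly as follows. This is illustrative only: the structural names (vcpu->vcpu_cv, vcpu->lock, vcpu_sleep_done) are assumptions rather than the actual vmm source, and cv_reltimedwait_sig is used in place of cv_timedwait_sig simply because it takes a relative rather than an absolute timeout:

/*
 * Illustrative sketch only: the loop structure and names here are
 * hypothetical, not the actual vm_handle_hlt code.
 */
while (!vcpu_sleep_done(vm, vcpuid)) {
	/*
	 * Instead of an untimed cv_wait_sig(), bound each sleep at
	 * roughly 1ms so that a lost wakeup can stall the vCPU only
	 * briefly.  drv_usectohz() converts microseconds to clock
	 * ticks, rounding up to at least one tick.
	 */
	clock_t ret = cv_reltimedwait_sig(&vcpu->vcpu_cv, &vcpu->lock,
	    drv_usectohz(1000), TR_CLOCK_TICK);
	if (ret == 0)
		break;	/* interrupted by a signal; bail out to caller */
	/* ret == -1 is a timeout: fall through and re-check state */
}

If the pauses disappear under this change, that would strengthen the missed-wakeup theory.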

Comments

Comment by Patrick Mooney [X]
Created at 2018-03-01T05:03:18.753Z

If you look at the bhyve code, they do wake up at the system hz frequency while sleeping in halt in order to check for signals. When I fixed up that initial callout bug, I switched to the untimed cv_wait_sig, since we had proper signal sensitivity, but perhaps this is an upstream bug anyway. It seems rather unfortunate to do all those wake-ups during idle.
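
For reference, the upstream halt loop is schematically something like the following -- a paraphrase from memory, not a verbatim quote of the FreeBSD source:

for (;;) {
	if (vm_nmi_pending(vm, vcpuid))
		break;
	if (vcpu_should_yield(vm, vcpuid))
		break;
	/* (other wakeup conditions elided) */

	/*
	 * The sleep is bounded by a tick-based timeout, so the thread
	 * wakes periodically to re-run the checks above even if no
	 * explicit wakeup ever arrives.
	 */
	msleep_spin(vcpu, &vcpu->mtx, "vmidle", hz);
}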


Comment by Bryan Cantrill [X]
Created at 2018-03-01T22:26:43.422Z

In looking at this, I wrote a D script to trace VCPU manipulation, and then waited for a VCPU to become wedged. Here's relevant output:

10748211627 81597175930350 cpu=1 vmm_glue_callout_reset_sbt sbt=296448
10748214274 81597175932997 cpu=1 vlapic_set_intr_ready vec=253 lvl=0 svr=0x1ff
10748215174 81597175933897 cpu=1 vcpu_notify_event
10748286258 81597176004981 cpu=1 vlapic_callout_handler
10748289918 81597176008641 cpu=1 vlapic_fire_timer
10748290471 81597176009194 cpu=1 vlapic_set_intr_ready vec=253 lvl=0 svr=0x1ff
10748291227 81597176009950 cpu=1 vcpu_notify_event
10748292672 81597176011395 cpu=1 vlapic_set_intr_ready vec=238 lvl=0 svr=0x1ff
10748300533 81597176019256 cpu=1 -> vm_handle_hlt
10748303876 81597176022599 cpu=1 vlapic_pending_intr=0
22140421862 81608568140585 cpu=1 vlapic_pending_intr=0
22140425119 81608568143842 cpu=1 <- vm_handle_hlt

This is showing relative time followed by absolute time, both in nanoseconds, followed by details. As we can see, we set the timer effectively as we were halting -- the vCPU halted a mere 8 microseconds after the interrupt was delivered. The surprise was that vlapic_pending_intr was returning 0; there is clearly a pending interrupt! This calls into vmx_pending_intr -- which had this change made to it ~2 months ago:

commit 02cc877968bbcd57695035c67114a67427f54549
Author: tychon <tychon@FreeBSD.org>
Date:   Thu Dec 21 18:30:11 2017 +0000

    Recognize a pending virtual interrupt while emulating the halt instruction.
    
    Reviewed by:    grehan, rgrimes
    Sponsored by:   Dell EMC Isilon
    Differential Revision:  https://reviews.freebsd.org/D13573

diff --git a/sys/amd64/vmm/intel/vmx.c b/sys/amd64/vmm/intel/vmx.c
index e563bed6212..f803862079c 100644
--- a/sys/amd64/vmm/intel/vmx.c
+++ b/sys/amd64/vmm/intel/vmx.c
@@ -3174,8 +3174,28 @@ vmx_pending_intr(struct vlapic *vlapic, int *vecptr)
        pir_desc = vlapic_vtx->pir_desc;
 
        pending = atomic_load_acq_long(&pir_desc->pending);
-       if (!pending)
-               return (0);     /* common case */
+       if (!pending) {
+               /*
+                * While a virtual interrupt may have already been
+                * processed the actual delivery maybe pending the
+                * interruptibility of the guest.  Recognize a pending
+                * interrupt by reevaluating virtual interrupts
+                * following Section 29.2.1 in the Intel SDM Volume 3.
+                */
+               uint64_t val;
+               uint8_t rvi, ppr;
+
+               vmx_getreg(vlapic_vtx->vmx, vlapic->vcpuid,
+                   VMCS_IDENT(VMCS_GUEST_INTR_STATUS), &val);
+               rvi = val & APIC_TPR_INT;
+               lapic = vlapic->apic_page;
+               ppr = lapic->ppr & APIC_TPR_INT;
+               if (rvi > ppr) {
+                       return (1);
+               }
+
+               return (0);
+       }
 
        /*
         * If there is an interrupt pending then it will be recognized only
@@ -3185,7 +3205,7 @@ vmx_pending_intr(struct vlapic *vlapic, int *vecptr)
         * interrupt will be recognized.
         */
        lapic = vlapic->apic_page;

This seems relevant to our problem! We still need to determine if this can explain all of the behavior, but it's certainly related -- and it's an important fix to pull in. @patrick.mooney is pulling this in; @joe.landman will re-run this when we've got new bits.


Comment by Bryan Cantrill [X]
Created at 2018-03-02T01:55:30.602Z

To really get this nailed down, we ran this D script:

#pragma D option quiet
#pragma D option switchrate=1000hz
#pragma D option bufsize=32m
#pragma D option strsize=32

BEGIN
{
	start = timestamp;
}

vlapic_icrtmr_write_handler:entry
{
	printf("%d %d vcpu=%d %s icr_timer=%d\n", timestamp - start, timestamp,
	    args[0]->vcpuid, probefunc, args[0]->apic_page->icr_timer);
}

vmm_glue_callout_reset_sbt:entry
/arg3 == (uintptr_t)&`vlapic_callout_handler/
{
	this->v = (struct vlapic *)arg4;
	printf("%d %d vcpu=%d %s sbt=%d\n", timestamp - start, timestamp,
	    this->v->vcpuid, probefunc, arg1);
}

vlapic_fire_timer:entry,
vlapic_callout_handler:entry
{
	this->v = (struct vlapic *)arg0;
	printf("%d %d vcpu=%d %s\n", timestamp - start, timestamp,
	    this->v->vcpuid, probefunc);
}

vcpu_notify_event:entry
{
	printf("%d %d vcpu=%d %s\n", timestamp - start, timestamp,
	    arg1, probefunc);
}

vm_handle_hlt:entry
/!self->me/
{
	self->me = arg1 + 1;
}

vm_handle_hlt:entry,
vmx_run:entry,
vmx_inject_interrupts:entry,
vmx_enter_guest:entry
{
	printf("%d %d vcpu=%d -> %s\n", timestamp - start, timestamp,
	    self->me - 1, probefunc);
}

vm_handle_hlt:return,
vmx_run:return,
vmx_inject_interrupts:return,
vmx_enter_guest:return
{
	printf("%d %d vcpu=%d <- %s\n", timestamp - start, timestamp,
	    self->me - 1, probefunc);
}

vmx_exit_process:entry
{
	printf("%d %d vcpu=%d vmx_exit_process reason=%d qual=%d\n",
	    timestamp - start, timestamp,
	    self->me - 1, args[2]->u.vmx.exit_reason,
	    args[2]->u.vmx.exit_qualification);
}

vlapic_pending_intr:return
{
	printf("%d %d vcpu=%d %s=%d\n", timestamp - start, timestamp,
	    self->me - 1, probefunc, arg1);
}

vlapic_set_intr_ready:entry
{
	printf("%d %d vcpu=%d vlapic_set_intr_ready vec=%d lvl=%d svr=0x%x\n",
	    timestamp - start, timestamp, args[0]->vcpuid, arg1, arg2,
	    args[0]->apic_page->svr);
}

vmx_pending_intr:entry,
vmx_set_intr_ready:entry,
vmx_inject_pir:entry
{
	this->vtx = (struct vlapic_vtx *)arg0;
	printf("%d %d vcpu=%d pcpu=%d %s pending=%d pir=[%x,%x,%x,%x]\n",
	    timestamp - start, timestamp,
	    args[0]->vcpuid, cpu, probefunc,
	    this->vtx->pir_desc->pending,
	    this->vtx->pir_desc->pir[0],
	    this->vtx->pir_desc->pir[1],
	    this->vtx->pir_desc->pir[2],
	    this->vtx->pir_desc->pir[3]);
}

Here is the output from a window in which a vCPU was paused:

14530882925 6758922281886 vcpu=9 <- vmx_enter_guest
14530883776 6758922282737 vcpu=9 vmx_exit_process reason=1 qual=0
14530885154 6758922284115 vcpu=9 vlapic_callout_handler
14530886247 6758922285208 vcpu=9 vlapic_fire_timer
14530887308 6758922286269 vcpu=9 vlapic_set_intr_ready vec=238 lvl=0 svr=0x1ff
14530887963 6758922286924 vcpu=9 pcpu=20 vmx_set_intr_ready pending=0 pir=[0,0,0,0]
14530888794 6758922287755 vcpu=9 vlapic_set_intr_ready vec=253 lvl=0 svr=0x1ff
14530889175 6758922288136 vcpu=9 vcpu_notify_event
14530889513 6758922288474 vcpu=9 pcpu=16 vmx_set_intr_ready pending=1 pir=[0,0,0,400000000000]
14530890321 6758922289282 vcpu=9 -> vmx_inject_interrupts
14530891292 6758922290253 vcpu=9 pcpu=20 vmx_inject_pir pending=1 pir=[0,0,0,2000400000000000]
14530892419 6758922291380 vcpu=9 <- vmx_inject_interrupts
14530893061 6758922292022 vcpu=9 -> vmx_enter_guest
14530894549 6758922293510 vcpu=9 <- vmx_enter_guest
14530895237 6758922294198 vcpu=9 vmx_exit_process reason=12 qual=0
14530896599 6758922295560 vcpu=9 <- vmx_run
14530898048 6758922297009 vcpu=9 -> vm_handle_hlt
14530898726 6758922297687 vcpu=9 pcpu=20 vmx_pending_intr pending=0 pir=[0,0,0,0]
14530899901 6758922298862 vcpu=9 vlapic_pending_intr=0
24451694423 6768843093384 vcpu=9 vlapic_set_intr_ready vec=253 lvl=0 svr=0x1ff
24451695367 6768843094328 vcpu=9 pcpu=23 vmx_set_intr_ready pending=0 pir=[0,0,0,0]
24451696957 6768843095918 vcpu=9 vcpu_notify_event
24451705759 6768843104720 vcpu=9 pcpu=20 vmx_pending_intr pending=1 pir=[0,0,0,2000000000000000]
24451708978 6768843107939 vcpu=9 vlapic_pending_intr=1
24451711347 6768843110308 vcpu=9 <- vm_handle_hlt

Of note -- and a pattern that we have seen -- is that there are two interrupts pending. Note that we enter the guest at 14530893061 and then exit the guest about 1.5 microseconds later at 14530894549 with EXIT_REASON_HLT. @patrick.mooney notes that the pseudocode from section 29.2.2 in volume 3 of the Intel Software Developer's Manual implies pretty strongly that only a single interrupt is processed:

Vector ← RVI;
VISR[Vector] ← 1;
SVI ← Vector;
VPPR ← Vector & F0H;
VIRR[Vector] ← 0;
IF any bits set in VIRR
  THEN RVI ← highest index of bit set in VIRR
  ELSE RVI ← 0;
FI;
deliver interrupt with Vector through IDT;
cease recognition of any pending virtual interrupt;

So it seems reasonable that the guest would exit with the second (local APIC timer) interrupt pending -- and without the fix from upstream bhyve, we will not properly re-enter the guest.
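
To make the rvi > ppr comparison from the upstream fix concrete, here is a small worked example using the vectors from the trace above. The post-EOI PPR value of 0 is an assumption about the guest's state at the hlt, not something captured in the trace:

#include <stdio.h>
#include <stdint.h>

/* Priority-class mask, as in the APIC_TPR_INT usage in the fix. */
#define	PRIO_CLASS(v)	((v) & 0xf0)

int
main(void)
{
	/*
	 * The guest takes vector 253 (0xfd), EOIs it, and executes
	 * hlt with the timer interrupt, vector 238 (0xee), still
	 * latched in VIRR.  Assumption: after the EOI, PPR is 0.
	 */
	uint8_t rvi = PRIO_CLASS(238);	/* 0xe0: highest vector in VIRR */
	uint8_t ppr = PRIO_CLASS(0);	/* 0x00: nothing in service */

	/* The fixed vmx_pending_intr reports pending when rvi > ppr. */
	printf("rvi=0x%02x ppr=0x%02x pending=%d\n",
	    rvi, ppr, rvi > ppr);	/* rvi=0xe0 ppr=0x00 pending=1 */
	return (0);
}

With the pre-fix code, pir_desc->pending == 0 short-circuits this evaluation entirely, so the halted vCPU is never re-entered.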


Comment by Jira Bot
Created at 2018-03-14T21:51:17.002Z

illumos-joyent commit 9c78fec9bbf9d3887166ba5cb4678f16a7b5a47d (branch master, by Patrick Mooney)

OS-6716 bhyve guest VCPUs pause intermittently
Reviewed by: John Levon <john.levon@joyent.com>
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Approved by: Bryan Cantrill <bryan@joyent.com>