OS-6829: bhyve can miss PIR wake-ups

Issue Type:	Bug
Priority:	4 - Normal
Status:	Resolved
Created at:	2018-03-26T04:45:26.543Z
Updated at:	2018-03-28T22:59:17.683Z

Created by:	Former user
Reported by:	Former user
Assigned to:	Former user

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-03-28T22:59:17.668Z)

@accountid:6248927d45ece00069cb9db5 reported that an illumos guest running under bhyve was having trouble cloning a git repo over the network. I went to reproduce those circumstances and wasn't even able to install git thanks to the network transfer for the package manager getting hung up. After some digging, I found that a simple 'curl' of a local moderately large http resource (in this case, just a platform tarball) could reproduce this hang.

When the issue is reproduced, the curl transfer will cease, as if there is a network disruption. A look at ::cpuinfo in the debugger yields a common theme:

[3]> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc39b80  1b    1    6  -1   no    no t-633  ffffff000f405c40
 (idle)
                       |    |    |
            RUNNING <--+    |    +--> PIL THREAD
              READY         |           6 ffffff000f4e6c40
             EXISTS         |
             ENABLE         +-->  PRI THREAD           PROC
                                  105 ffffff000f4e6c40 sched
stack pointer for thread ffffff000f4e6c40: ffffff000f4e6600
[ ffffff000f4e6600 resume_from_intr+0xb7() ]
  ffffff000f4e6630 swtch+0x82()
  ffffff000f4e66d0 turnstile_block+0x21a()
  ffffff000f4e6740 mutex_vector_enter+0x3a3()
  ffffff000f4e67d0 mac`mac_rx_soft_ring_process+0x53()
  ffffff000f4e6900 mac`mac_rx_srs_proto_fanout+0x29a()
  ffffff000f4e6980 mac`mac_rx_srs_drain+0x363()
  ffffff000f4e6a10 mac`mac_rx_srs_process+0x3ce()
  ffffff000f4e6a70 mac`mac_rx_common+0x143()
  ffffff000f4e6ad0 mac`mac_rx+0xb6()
  ffffff000f4e6b40 vioif`vioif_process_rx+0x1fc()
  ffffff000f4e6b70 vioif`vioif_rx_handler+0x20()
  ffffff000f4e6be0 av_dispatch_autovect+0x91()
  ffffff000f4e6c20 dispatch_hardint+0x36()
  ffffff000f4ddaa0 switch_sp_and_call+0x13()
  30 [no mapping for address]
[3]> ffffff000f405c40::findstack
stack pointer for thread ffffff000f405c40: ffffff000f405ad0
[ ffffff000f405ad0 _resume_from_idle+0x112() ]
  ffffff000f405bc0 mach_cpu_idle+6()
  ffffff000f405bf0 cpu_idle+0x11a()
  ffffff000f405c00 cpu_idle_adaptive+0x13()
  ffffff000f405c20 idle+0xa7()
  ffffff000f405c30 thread_start+8()

The thread handling inbound virtio packets went to sleep on a mutex, resulting the idle thread coming back on-CPU. That said, the mutex desired by the virtio thread is available and the thread itself is runnable. NMIs delivered to the guest via bhyvectl will wake that CPU and return the system to a lively state. Why then, are we seemingly stuck in a hlt, unable to be woken by "normal" interrupts.

Casting a wide net with dtrace around the interrupt handling routines in bhyve allowed me to find an event pattern which appears to explain part of the problem:

ts:18853472203520 thr:ffffff1a82f20420 pcpu:33 vcpu:0 icrtmr_write:entry(val:000a2cc4, exp:0,183219319117709312)
ts:18853472207990 thr:ffffff1a82f20420 pcpu:33 vcpu:0 cyclic_reprogram(ffffff195ba31ad0, abs:18853482142486, rel:9934496)
ts:18853472209208 thr:ffffff1af64d7500 pcpu:32 vcpu:0 lapic_set_intr(tgt:0 vec:60)
ts:18853472211729 thr:ffffff1a82f20420 pcpu:33 vcpu:0 icrtmr_write:return(val:000a2cc4, exp:0,183294085908398080)
...
ts:18853475095943 thr:ffffff1af64d7500 pcpu:32 vcpu:0 lapic_set_intr(tgt:1 vec:61)
ts:18853475099094 thr:ffffff1af64d7500 pcpu:32 vcpu:0 vmx_set_intr_ready(tgt:1) = 1
ts:18853475101480 thr:ffffff1af64d7500 pcpu:32 vcpu:0 vcpu_notify_event(tgt:1)
ts:18853475108812 thr:ffffff1af64d7500 pcpu:32 vcpu:0 lapic_set_intr(tgt:0 vec:60)
ts:18853475112456 thr:ffffff1af64d7500 pcpu:32 vcpu:0 vmx_set_intr_ready(tgt:0) = 1
ts:18853475113060 thr:ffffff1a82f20420 pcpu:33 vcpu:0 vmx_exit_process(reason:12)
ts:18853475127369 thr:ffffff1a82f20420 pcpu:33 vcpu:0 vmx_exit_process(HLT, 0000)
ts:18853475129522 thr:ffffff1af64d7500 pcpu:32 vcpu:0 vcpu_notify_event(tgt:0)
ts:18853475131928 thr:ffffff1a82f20420 pcpu:33 vcpu:0 vm_handle_hlt:entry
ts:18853475137488 thr:ffffff1a82f20420 pcpu:33 vcpu:0 vmx_pending_intr(pend:1, ppr:70) = 0
ts:18853475159202 thr:ffffff1af64d7500 pcpu:32 vcpu:0 lapic_set_intr(tgt:0 vec:60)
ts:18853475162382 thr:ffffff1af64d7500 pcpu:32 vcpu:0 vmx_set_intr_ready(tgt:0) = 0
...
ts:18853482154669 thr:ffffff007befec40 pcpu:33 vcpu:0 vlapic_fire_lvt(vec:d3, mode:000)
ts:18853482163049 thr:ffffff007befec40 pcpu:33 vcpu:0 vmx_set_intr_ready(tgt:0) = 0

Here we see the normal cyclics logic wiring up the APIC timer for a 1000hz interval on vcpu 0. Some time later, inbound interrupts are processed and posted to vcpu0. In the middle of that processing (presumably when sleeping on the mutex in MAC), vcpu0 executes hlt with a PPR=70. It will ignore lower-level interrupts (including any more from virtio) at this priority. Eventually, enough time elapses for the APIC timer to expire, attempting to deliver that interrupt (of a high enough level, d0, to be serviced at PPR=70). Notably, it does not wake the hlt-ed vcpu for some reason. A closer examination of the code exposes why.

In vlapic_fire_lvt, on trigger of the APIC timer, the vector is set ready, but vcpu notification is dependent on if vlapic_set_intr_ready indicates such a need:

        vec = lvt & APIC_LVT_VECTOR;
        mode = lvt & APIC_LVT_DM;

        switch (mode) {
        case APIC_LVT_DM_FIXED:
                if (vec < 16) {
                        vlapic_set_error(vlapic, APIC_ESR_SEND_ILLEGAL_VECTOR);
                        return (0);
                }
                if (vlapic_set_intr_ready(vlapic, vec, false))
                        vcpu_notify_event(vlapic->vm, vlapic->vcpuid, true);

The body of vmx_set_intr_ready outlines the requirement for notification:

        notify = atomic_cmpset_long(&pir_desc->pending, 0, 1);

        VMX_CTR_PIR(vlapic->vm, vlapic->vcpuid, pir_desc, notify, vector,
            level, "vmx_set_intr_ready");
        return (notify);

Only when the PIR structure goes from "none pending" to "some pending" will it request a notification to the vCPU. If there were already interrupts pending, shouldn't the vcpu have been woken from the hlt?

In vmx_pending_intr, called from the vlapic_pending_intr check in vm_handle_hlt, the PPR plays an important role:

                rvi = vmexit->u.hlt.intr_status & APIC_TPR_INT;
                lapic = vlapic->apic_page;
                ppr = lapic->ppr & APIC_TPR_INT;
                if (rvi > ppr) {
                        return (1);
                }

                return (0);
...
                pirval = pir_desc->pir[i];
                if (pirval != 0) {
                        vpr = (i * 64 + flsl(pirval) - 1) & APIC_TPR_INT;
                        return (vpr > ppr);
                }

In both the new check which pays heed to APICv state stored in the VMCS and the old one walking the pending PIR data, any pending vectors are compared against the PPR. When the thread above slept in hlt, it possess a PPR of 70, which would preclude it from handling the virtio-related vectors. That would not, however, prevent such interrupts from being queued as "pending". When a later high level interrupt is asserted, the existence of pending interrupts suppresses the vcpu notification and results in the hlt proceeding for too long.

An outstanding question in this case: Why was the guest thread "waiting" until the APIC timer fired? It seems like a mutex_exit should have resulted in a cross-call to that vCPU in order to wake it. There's no evidence to suggest that any such call was attempted.

OS-6829: bhyve can miss PIR wake-ups

Details

People

Resolution

Fix Versions

Labels

Description

Comments