OS-7354: bhyve should avoid reserved PIR descriptor bits

Details

Issue Type:Improvement
Priority:4 - Normal
Status:Resolved
Created at:2018-11-05T21:41:18.531Z
Updated at:2019-01-24T23:08:21.503Z

People

Created by:Patrick Mooney [X]
Reported by:Patrick Mooney [X]
Assigned to:Patrick Mooney [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-01-24T23:08:21.472Z)

Fix Versions

2019-01-31 Kenneth Parcell (Release Date: 2019-01-31)

Related Links

Labels

bhyve

Description

As part of OS-6829/OS-6930, the APICv logic for queuing interrupts and sending vCPU notifications about them was changed to use reserved bits in the PIR descriptor. The Intel manual noted that "The processor does not modify these bits." and "These bits may be used by software and by other agents in the system". It seemed reasonable at the time to use them since it mean asserting the status of pending interrupts and recording the relative priority of queued interrupts could be done in a single atomic operation. After reading through the Intel IOMMU manual, I found those reserved bit with clearly defined purposes when an IOMMU is posting interrupts directly to a guest. While there's no indication that our current use of those reserved bits is harmful in any way, it would be nice to clear out of them in case IOMMU-posted interrupts become a priority in the future.

With all this in mind, a new approach to deciding when a queued interrupt should incur a vCPU notification is needed. As before, spurious notifications are acceptable, but it's preferred to keep them to a minimum.

Comments

Comment by Patrick Mooney [X]
Created at 2019-01-23T16:30:34.115Z
Updated at 2019-01-24T21:43:31.552Z

I used a shortened version of the hlt.d script from OS-6829 for testing this change:

#pragma D option bufpolicy=ring
#pragma D option bufsize=512k
#pragma D option quiet

BEGIN
{
        self->vcpu = -1;
        sts = timestamp;
}

vmm_drv_msi:entry
{
        self->vcpu = -1;
}

vcpu_lock_one:entry
{
        self->vcpu = arg1;
/*
        printf("ts:%lu thr:%p pcpu:%02d vcpu:%d vcpu_lock_one\n", (unsigned long)timestamp - sts, curthread, cpu, self->vcpu);
*/
}

vlapic_set_intr_ready:entry
{
        self->tgt_vcpu = args[0]->vcpuid;
        self->vec = arg1;
}

vlapic_set_intr_ready:return
/self->vec/
{
        printf("ts:%lu thr:%p pcpu:%02d vcpu:%d %s(tgt:%d vec:%02x) = %d\n", (unsigned long)timestamp - sts, curthread, cpu, self->vcpu, probefunc, self->tgt_vcpu, self->vec, arg1);
}

vcpu_notify_event:entry
{

        /* vcpu_notify_event(vm, cpu, true); */
        self->tgt_vcpu = arg1;
        printf("ts:%lu thr:%p pcpu:%02d vcpu:%d vcpu_notify_event(tgt:%d)\n", (unsigned long)timestamp - sts, curthread, cpu, self->vcpu, self->tgt_vcpu);
}

vm_handle_hlt:entry
{
        self->vcpu = arg1;
        self->in_hlt = 1;
        printf("ts:%lu thr:%p pcpu:%02d vcpu:%d vm_handle_hlt:entry\n", (unsigned long)timestamp - sts, curthread, cpu, self->vcpu);
}

vmx_pending_intr:entry
/self->in_hlt/
{
        this->vtx = (struct vlapic_vtx *)arg0;
        self->pird = this->vtx->pir_desc;
        self->lapic = (struct LAPIC *)args[0]->apic_page;
        self->vecp = args[1];
}

vmx_pending_intr:return
/self->in_hlt/
{
        this->ppr = self->lapic->ppr & 0xf0;
        this->pending = self->pird->pending;

        if (self->vecp != NULL) {
                this->vec = *self->vecp;
        } else {
                this->vec = 0;
        }

        printf("ts:%lu thr:%p pcpu:%02d vcpu:%d %s(ppr:%02x, vec:%02x, pend:%d) = %d\n", (unsigned long)timestamp - sts, curthread, cpu, self->vcpu, probefunc, this->ppr, this->vec, this->pending, arg1);
}

vm_handle_hlt:return
{
        self->in_hlt = 0;
        printf("ts:%lu thr:%p pcpu:%02d vcpu:%d vm_handle_hlt:return\n", (unsigned long)timestamp - sts, curthread, cpu, self->vcpu);
}

The intent was to capture the same high-PPR HLT conditions that triggered the undesirable conditions in OS-6829/OS-6930. At least on illumos, a git clone of a large repository was adequate to incur the racing conditions which result in a high-PPR HLT. This was captured by the tracing:

ts:13222124787 thr:fffffe59a553a0c0 pcpu:18 vcpu:2 vm_handle_hlt:entry
ts:13222144654 thr:fffffe59a553a0c0 pcpu:18 vcpu:2 vmx_pending_intr(ppr:10, vec:00, pend:1) = 1
ts:13222150283 thr:fffffe59962a6880 pcpu:21 vcpu:3 vlapic_set_intr_ready(tgt:0 vec:d3) = 1
ts:13222152634 thr:fffffe59a553a0c0 pcpu:18 vcpu:2 vm_handle_hlt:return
ts:13222154716 thr:fffffe59962a6880 pcpu:21 vcpu:3 vcpu_notify_event(tgt:0)
ts:13222183583 thr:fffffe5afdb91480 pcpu:19 vcpu:0 vmx_pending_intr(ppr:10, vec:00, pend:1) = 1
ts:13222191200 thr:fffffe5afdb91480 pcpu:19 vcpu:0 vm_handle_hlt:return
ts:13222193145 thr:fffffe59a553a0c0 pcpu:18 vcpu:2 vm_handle_hlt:entry
ts:13222213102 thr:fffffe59a553a0c0 pcpu:18 vcpu:2 vmx_pending_intr(ppr:70, vec:00, pend:1) = 0
ts:13222229037 thr:fffffe5afdb91480 pcpu:19 vcpu:0 vm_handle_hlt:entry
ts:13222230676 thr:fffffe59962a6880 pcpu:21 vcpu:3 vlapic_set_intr_ready(tgt:2 vec:a1) = 1
ts:13222234286 thr:fffffe59962a6880 pcpu:21 vcpu:3 vcpu_notify_event(tgt:2)
ts:13222244850 thr:fffffe59962a6880 pcpu:21 vcpu:3 vm_handle_hlt:entry
ts:13222248294 thr:fffffe5afdb91480 pcpu:19 vcpu:0 vmx_pending_intr(ppr:10, vec:00, pend:0) = 0
ts:13222255069 thr:fffffe59962a6880 pcpu:21 vcpu:3 vmx_pending_intr(ppr:10, vec:00, pend:0) = 0
ts:13222265323 thr:fffffe59a553a0c0 pcpu:30 vcpu:2 vmx_pending_intr(ppr:70, vec:00, pend:1) = 1
ts:13222275980 thr:fffffe59a553a0c0 pcpu:30 vcpu:2 vm_handle_hlt:return

I repeated a similar workload on different guest types including centos7 and freebsd. I also booted up a Win2016 guest to check for a regression there.


Comment by Jira Bot
Created at 2019-01-24T23:05:49.834Z

illumos-joyent commit ce6a7649078ae9f0ae246da1fffb1343093388f2 (branch master, by Patrick Mooney)

OS-7354 bhyve should avoid reserved PIR descriptor bits
Reviewed by: John Levon <john.levon@joyent.com>
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>