OS-7012: bhyve wedged in vlapic cyclics

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2018-06-12T17:11:46.737Z
Updated at:2019-05-10T19:17:30.726Z

People

Created by:Max Bruning
Reported by:Max Bruning
Assigned to:Patrick Mooney

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-10-02T01:40:51.027Z)

Fix Versions

2018-09-27 Buzz Aldrin (Release Date: 2018-09-27)

Related Issues

Related Links

Labels

bhyve

Description

There is a loop in cyclic_juggle_one_to that results in a system hang.
The thoth id will be added when the dump on RAA09033 (us-east-3b) is loaded into thoth.
The system did not respond to ping or console input; remote connections dropped and ssh was unavailable. Here is some analysis of the dump, starting with the status. The system was running 20180606T114846Z at the time of the NMI.

> ::status
debugging crash dump vmcore.0 (64-bit) from RAA09033
operating system: 5.11 joyent_20180606T114846Z (i86pc)
image uuid: (not set)
panic message: NMI received

dump content: kernel pages only
>

A look at what's on cpu:

::cpuinfo
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc4c000  1f    0    0 109   no    no t-103  fffffe00f40e6c40 sched
  1 fffffeb1cce9c000  1f    0    0  -1   no    no t-231  fffffe00f43fbc40 (idle)
  2 fffffeb1cce9e000  1f    0    0  -1   no    no t-122  fffffe00f447bc40 (idle)
  3 fffffeb1cce94000  1f    0    0  -1   no    no t-102  fffffe00f428ac40 (idle)
  4 fffffeb1cce86000  1f    0    0  -1   no    no t-5308 fffffe00f4668c40 (idle)
  5 fffffeb1cd1f1000  1f    0    0  -1   no    no t-117  fffffe00f46e3c40 (idle)
  6 fffffeb1cd1ed000  1f    0    0  -1   no    no t-913  fffffe00f475ec40 (idle)
  7 fffffffffbc51e00  1b    0    0  -1   no    no t-119  fffffe00f47d9c40 (idle)
  8 fffffeb1cd1e3000  1f    1    0  -1   no    no t-398  fffffe00f4854c40 (idle)
  9 fffffeb1cd1df000  1f    0    0  -1   no    no t-122  fffffe00f48cfc40 (idle)
 10 fffffeb1cd1db000  1f    1    0  -1   no    no t-116  fffffe00f4950c40 (idle)
 11 fffffeb1cd51b000  1f    0    0  -1   no    no t-1640 fffffe00f49cbc40 (idle)
 12 fffffeb1cd517000  1f    1    0  -1   no    no t-110  fffffe00f4a46c40 (idle)
 13 fffffeb1cd50f000  1f    0    0  -1   no    no t-1365 fffffe00f4ad9c40 (idle)
 14 fffffeb1cd509000  1f    0   10  -1   no    no t-5231 fffffe00f4b6cc40 (idle)
 15 fffffeb1cd505000  1f    0    0  -1   no    no t-123  fffffe00f4be7c40 (idle)
 16 fffffeb1cd4ff000  1f    1    0 101   no    no t-2286038 fffffe00f4caac40 sched
 17 fffffeb1cd7df000  1f    0    0  -1   no    no t-129  fffffe00f4ce8c40 (idle)
 18 fffffeb1cd7db000  1f    0    0  -1   no    no t-109  fffffe00f4d6ec40 (idle)
 19 fffffeb1cd7d5000  1f    0    0  -1   no    no t-105  fffffe00f4df4c40 (idle)
 20 fffffeb1cd7d1000  1f    0    0  -1   no    no t-117  fffffe00f4e7ac40 (idle)
 21 fffffeb1cd7cd000  1f    0    0  -1   no    no t-112  fffffe00f4f00c40 (idle)
 22 fffffeb1cd7c7000  1f    0    0  -1   no    no t-12   fffffe00f4f86c40 (idle)
 23 fffffeb1cd7c3000  1f    0    0  -1   no    no t-747  fffffe00f500cc40 (idle)
 24 fffffeb1cdb44000  1f    0    0  -1   no    no t-24   fffffe00f50aac40 (idle)
 25 fffffeb1cdc1b000  1f    0    0  -1   no    no t-115  fffffe00f52b6c40 (idle)
 26 fffffeb1cdc0d000  1f    0    0  -1   no    no t-112  fffffe00f545ac40 (idle)
 27 fffffeb1cdc09000  1f    0    0  -1   no    no t-3231 fffffe00f5538c40 (idle)
 28 fffffeb1cdc01000  1f    1    0  -1   no    no t-537  fffffe00f5616c40 (idle)
 29 fffffeb1cdc05000  1f    0    0  -1   no    no t-127  fffffe00f5778c40 (idle)
 30 fffffeb1cdc1f000  1f    1    0  -1   no    no t-348  fffffe00f5804c40 (idle)
 31 fffffeb1cdf72000  1f    0    0  -1   no    no t-1231 fffffe00f58aec40 (idle)

The thread running on CPU 0 is clock(). All other CPUs except CPU 16 are idle. The thread on CPU 16 has been running for a long time (t-2286038 ticks, roughly 38 minutes). Let's look at that thread's stack.

fffffe00f4caac40::findstack -v
stack pointer for thread fffffe00f4caac40: fffffe00f4ca8960
  fffffe00f4ca89b0 vm_mapping_find+0x67(fffffeb43257ea00, 3bb28000, 1000)
  fffffe00f4ca8a10 vmspace_find_kva+0x55(fffffeb43257ea00, 3bb28000, 1000)
  fffffe00f4ca8a50 vmm_drv_gpa2kva+0x39(fffffeb42baf1080, 3bb28000, 1000)
  fffffe00f4ca8a80 viona_gpa2kva+0x1c(fffffeb1ea965940, 3bb28000, 1000)
  fffffe00f4ca8b60 vq_popchain+0x37c(fffffeb1ea965950, 1, fffffe00f4ca8da0, 0)
  fffffe00f4ca8f00 viona_recv_merged+0x1f1(fffffeb266c58600, fffffeb266c58600, 0)
  fffffe00f4ca8fa0 viona_rx+0x140(fffffeb1ea965940, 0, fffffeb266c58600, 0)
  fffffe00f4ca8ff0 mac_rx_deliver+0x37(fffffeb414a498f0, 0, fffffeb266c58600, 0)
  fffffe00f4ca9080 mac_rx_soft_ring_process+0x99(fffffeb414a498f0, fffffeb434747680, fffffeb266c58600, fffffeb266c58600, 1, 0)
  fffffe00f4ca9930 mac_rx_srs_fanout+0x3b2(fffffeb43ac4c980, fffffeb266c58600)
  fffffe00f4ca99b0 mac_rx_srs_drain+0x25e(fffffeb43ac4c980, 800)
  fffffe00f4ca9a40 mac_rx_srs_process+0x109(fffffeb43ab42ca8, fffffeb43ac4c980, fffffeb266c58600, 0)
  fffffe00f4ca9aa0 mac_rx_common+0x230(fffffeb43ab42ca8, 0, fffffeb266c58600)
  fffffe00f4ca9b00 mac_rx+0xb6(fffffeb427e84840, 8, fffffffff85ba202)
  fffffe00f4ca9d20 overlay_mux_recv+0x32a(fffffeb1f3f0b7f0, fffffeb6ae8af280, 146, 0, fffffeb431374380)
  fffffe00f4ca9dc0 so_queue_msg_impl+0x160(fffffeb1f3f0b7f0, fffffeb6ae8af280, 146, 0, fffffe00f4ca9e2c, 0, 0)
  fffffe00f4ca9e10 so_queue_msg+0x30(fffffeb1f3f0b7f0, fffffeb6ae8af280, 146, 0, fffffe00f4ca9e2c, 0)
  fffffe00f4ca9e80 udp_ulp_recv+0xa3(fffffeb1f3f0aa40, fffffeb6ae8af280, 146, fffffe00f4caa310)
  fffffe00f4ca9fe0 udp_input+0x3c2(fffffeb1f3f0aa58, fffffeb266c58600, fffffeb26561c014, fffffeb1f3f0aa40)
  fffffe00f4caa0d0 ip_fanout_v4+0xd92(fffffeb266c58600, fffffeb26561c014, fffffe00f4caa310)
  fffffe00f4caa140 ip_input_local_v4+0x16e(fffffeb1f44e6430, fffffeb266c58600, fffffeb26561c014, fffffe00f4caa310)
  fffffe00f4caa1b0 ire_recv_local_v4+0x132(fffffeb1f44e6430, fffffeb266c58600, fffffeb26561c014, fffffe00f4caa310)
  fffffe00f4caa2d0 ill_input_short_v4+0x4d6(fffffeb1f44e6430, fffffeb26561c014, fffffeb2fc741728, fffffe00f4caa310, fffffffff7a54a11)
  fffffe00f4caa530 ip_input_common_v4+0x27a(fffffeb2fc741728, 0, fffffeb266c58600, 0, 0, 0, 0)
  fffffe00f4caa570 ip_input+0x2b(fffffeb2fc741728, 0, fffffeb266c58600, 0)
  fffffe00f4caa600 mac_rx_soft_ring_process+0x99(fffffeb414a4e390, fffffeb1f3eaec40, fffffeb266c58600, fffffeb266c58600, 1, 0)
  fffffe00f4caa730 mac_rx_srs_proto_fanout+0x29a(fffffeb1fa1fd680, fffffeb266c58600)
  fffffe00f4caa7b0 mac_rx_srs_drain+0x25e(f6, fffffe00f4caa890)
  fffffe00f4caa890 apix_hilevel_intr_prolog+0x3e(fffffffffb80094a, fffffe00f4caa890, fffffe00f4caa890, b)
  fffffe00f4caa9b0 mutex_delay_default+7()
  fffffe00f4caaa20 lock_set_spl_spin+0xb2(fffffeb43199bb29, b, fffffeb43199bb2c, 2)
  fffffe00f4caaa90 mutex_vector_enter+0x361(fffffeb43199bb28)
  fffffe00f4caaad0 vlapic_callout_handler+0x2a(fffffeb43199ba80)
  fffffe00f4caaaf0 vmm_glue_callout_handler+0x2f(fffffeb43199bad0)
  fffffe00f4caab90 cyclic_softint+0xfd(fffffeb1cd4ff000, 0)
  fffffe00f4caaba0 cbe_low_level+0x14()
  fffffe00f4caabf0 av_dispatch_softvect+0x78(2)
  fffffe00f4caac20 apix_dispatch_softint+0x35(0, 0)
  fffffe00f971c2b0 switch_sp_and_call+0x13()
  fffffe00f971c300 apix_do_softint+0x6c(fffffe00f971c370)
  fffffe00f971c360 apix_do_interrupt+0x362(fffffe00f971c370, f7)
  fffffe00f971c370 _interrupt+0xba()
  fffffe00f971c4a0 cbe_xcall+0x88(fffffeb1cd1e9000, fffffeb1cd1e9000, fffffffffb9ef5e0, fffffe00f971c4b0)
  fffffe00f971c520 cyclic_remove_here+0x57(fffffeb1ce1dc6c0, 15, fffffe00f971c560, 1)
  fffffe00f971c5b0 cyclic_juggle_one_to+0xf4(fffffeb4288c3cd0, fffffeb1ce1dc000)
  fffffe00f971c5f0 cyclic_move_here+0x88(fffffeb4288c3cd0)
  fffffe00f971c620 vmm_glue_callout_localize+0x27(fffffeb1f3a9f748)
  fffffe00f971c650 vhpet_localize_resources+0x40(fffffeb1f3a9f600)
  fffffe00f971c680 vm_localize_resources+0x9a(fffffeb4393d6000, fffffeb4393d6240)
  fffffe00f971c760 vm_run+0x18a(fffffeb4393d6000, fffffe00f971cae0)
  fffffe00f971cc20 vmmdev_do_ioctl+0x762(fffffeb420ebf600, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8)
  fffffe00f971ccc0 vmm_ioctl+0x12c(13100000003, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8)
  fffffe00f971cd00 cdev_ioctl+0x39(13100000003, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8)
  fffffe00f971cd50 spec_ioctl+0x60(fffffeb429d9b940, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8, 0)
  fffffe00f971cde0 fop_ioctl+0x55(fffffeb429d9b940, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8, 0)
  fffffe00f971cf00 ioctl+0x9b(3, c0907601, fffffc7fe9fffe10)
  fffffe00f971cf10 sys_syscall+0x19f()

For this thread to remain on the processor for such a long time, there is most likely a loop somewhere in the above stack trace. Starting at the beginning (the end of the output), a (bhyve) process makes an ioctl call to run the VM. During this, the code calls vmm_glue_callout_localize(), which (from examining the source) acquires the cpu_lock mutex and calls into the cyclic code, which in turn makes a cross call. While in cbe_xcall(), an interrupt occurs. The interrupt handling code gets back into the vmm code, which in turn enters the networking code. The networking code ends up in viona, which eventually lands in vm_mapping_find().
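
To make the cpu_lock part of that path concrete, here is a minimal sketch of what vmm_glue_callout_localize() does, based on the source reading described above (the struct field name and exact body are assumptions, not a verbatim copy):

/*
 * Sketch of the localize path; names come from the stack trace above,
 * the body is an approximation of the source.
 */
static void
vmm_glue_callout_localize(struct callout *c)
{
	/* The cyclic subsystem requires cpu_lock while juggling cyclics. */
	mutex_enter(&cpu_lock);

	/*
	 * Move the cyclic to the current CPU; internally this is
	 * cyclic_juggle_one_to() -> cyclic_remove_here() -> cbe_xcall(),
	 * as seen in the stack.
	 */
	cyclic_move_here(c->c_cyc_id);	/* c_cyc_id field name assumed */

	mutex_exit(&cpu_lock);
}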

If any routine in the above stack never returns, that would explain why the thread has been on CPU for so long. Start with a quick look at the function at the top of the stack, vm_mapping_find(), which searches a linked list for a mapping:

static vmspace_mapping_t *
vm_mapping_find(struct vmspace *vms, uintptr_t addr, size_t size)
{
	vmspace_mapping_t *vmsm;
	list_t *ml = &vms->vms_maplist;
	const uintptr_t range_end = addr + size;

	ASSERT(MUTEX_HELD(&vms->vms_lock));
	ASSERT(addr <= range_end);

	if (addr >= vms->vms_size) {
		return (NULL);
	}
	for (vmsm = list_head(ml); vmsm != NULL; vmsm = list_next(ml, vmsm)) {
		const uintptr_t seg_end = vmsm->vmsm_addr + vmsm->vmsm_len;

		if (addr >= vmsm->vmsm_addr && addr < seg_end) {
			if (range_end <= seg_end) {
				return (vmsm);
			} else {
				return (NULL);
			}
		}
	}
	return (NULL);
}

Looking at the arguments passed to vm_mapping_find:

fffffeb43257ea00::print -t -a struct vmspace
fffffeb43257ea00 struct vmspace {
    fffffeb43257ea00 struct vm_map vm_map = {
        fffffeb43257ea00 struct vmspace *vmm_space = 0
    }
    fffffeb43257ea08 kmutex_t vms_lock = {
        fffffeb43257ea08 void *[1] _opaque = [ 0 ]
    }
    fffffeb43257ea10 struct pmap vms_pmap = {
        fffffeb43257ea10 void *pm_pml4 = 0xfffffeb414e3b000
        fffffeb43257ea18 cpuset_t pm_active = {
            fffffeb43257ea18 ulong_t [4] cpub = [ 0, 0, 0, 0 ]
        }
        fffffeb43257ea38 long pm_eptgen = 0x1dec58
        fffffeb43257ea40 enum pmap_type pm_type = 0x1 (PT_EPT)
        fffffeb43257ea48 void *pm_map = 0xfffffeb42f0c82c0
    }
    fffffeb43257ea50 uintptr_t vms_size = 0x400000000000
    fffffeb43257ea58 list_t vms_maplist = {
        fffffeb43257ea58 size_t list_size = 0x38
        fffffeb43257ea60 size_t list_offset = 0
        fffffeb43257ea68 struct list_node list_head = {
            fffffeb43257ea68 struct list_node *list_next = 0xfffffeb42f40b350
            fffffeb43257ea70 struct list_node *list_prev = 0xfffffeb433912788
        }
    }
}

The code searches the vms_maplist.

fffffeb43257ea58::walk list | ::print -t vmspace_mapping_t
vmspace_mapping_t {
    list_node_t vmsm_node = {
        struct list_node *list_next = 0xfffffeb42f40b2e0
        struct list_node *list_prev = 0xfffffeb43257ea68
    }
    vm_object_t vmsm_object = 0xfffffeb4211dba58
    uintptr_t vmsm_addr = 0x100000000
    size_t vmsm_len = 0x140000000
    off_t vmsm_offset = 0x100000000
    uint_t vmsm_prot = 0x7
}
vmspace_mapping_t {
    list_node_t vmsm_node = {
        struct list_node *list_next = 0xfffffeb4315e0bf8
        struct list_node *list_prev = 0xfffffeb42f40b350
    }
    vm_object_t vmsm_object = 0xfffffeb4211dba58
    uintptr_t vmsm_addr = 0
    size_t vmsm_len = 0xc0000000
    off_t vmsm_offset = 0
    uint_t vmsm_prot = 0x7
}
vmspace_mapping_t {
    list_node_t vmsm_node = {
        struct list_node *list_next = 0xfffffeb433912788
        struct list_node *list_prev = 0xfffffeb42f40b2e0
    }
    vm_object_t vmsm_object = 0xfffffeb431580008
    uintptr_t vmsm_addr = 0xffe00000
    size_t vmsm_len = 0x200000
    off_t vmsm_offset = 0
    uint_t vmsm_prot = 0x5
}
vmspace_mapping_t {
    list_node_t vmsm_node = {
        struct list_node *list_next = 0xfffffeb43257ea68
        struct list_node *list_prev = 0xfffffeb4315e0bf8
    }
    vm_object_t vmsm_object = 0xfffffeb440498988
    uintptr_t vmsm_addr = 0xc1000000
    size_t vmsm_len = 0x1000000
    off_t vmsm_offset = 0
    uint_t vmsm_prot = 0x3
}

There are four vmspace_mapping_t entries on the linked list. Given the second and third arguments to vm_mapping_find (addr 0x3bb28000, size 0x1000), the code will match the second entry (vmsm_addr 0, vmsm_len 0xc0000000) and return it. So the thread is not hung in this routine.
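
As a sanity check, the range test from vm_mapping_find() can be replayed in isolation against the four mappings above. This is a standalone userland sketch using the values from the dump, not kernel code:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint64_t addr = 0x3bb28000, size = 0x1000;
	const uint64_t range_end = addr + size;
	/* vmsm_addr / vmsm_len of the four entries on vms_maplist */
	const uint64_t segs[4][2] = {
		{ 0x100000000ULL, 0x140000000ULL },
		{ 0x0,            0xc0000000ULL },
		{ 0xffe00000ULL,  0x200000ULL },
		{ 0xc1000000ULL,  0x1000000ULL },
	};

	for (int i = 0; i < 4; i++) {
		const uint64_t seg_end = segs[i][0] + segs[i][1];

		if (addr >= segs[i][0] && addr < seg_end) {
			printf("mapping %d matches; range_end %s seg_end\n",
			    i + 1, range_end <= seg_end ? "<=" : ">");
			return (0);
		}
	}
	printf("no mapping matches\n");
	return (1);
}

This prints "mapping 2 matches; range_end <= seg_end", confirming that the lookup succeeds and returns.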

Going further down the stack (and skipping the networking code), most of the other routines do not loop and so must return. But cyclic_juggle_one_to() does loop:

	/*
	 * Remove the cyclic from the source.  As mentioned above, we cannot
	 * block during this operation; if we cannot remove the cyclic
	 * without waiting, we spin for a time shorter than the interval, and
	 * reattempt the (non-blocking) removal.  If we continue to fail,
	 * we will exponentially back off (up to half of the interval).
	 * Note that the removal will ultimately succeed -- even if the
	 * cyclic handler is blocked on a resource held by a thread which we
	 * have preempted, priority inheritance assures that the preempted
	 * thread will preempt us and continue to progress.
	 */
	for (delay = NANOSEC / MICROSEC; ; delay <<= 1) {
		/*
		 * Before we begin this operation, disable kernel preemption.
		 */
		kpreempt_disable();
		if (cyclic_remove_here(src, ndx, &when, CY_NOWAIT))
			break;

		/*
		 * The operation failed; enable kernel preemption while
		 * spinning.
		 */
		kpreempt_enable();

		CYC_PTRACE("remove-retry", idp, src);

		if (delay > (cyclic->cy_interval >> 1))
			delay = cyclic->cy_interval >> 1;

		/*
		 * Drop the RW lock to avoid a deadlock with the cyclic
		 * handler (because it can potentially call cyclic_reprogram().
		 */
		rw_exit(&idp->cyi_lock);
		drv_usecwait((clock_t)(delay / (NANOSEC / MICROSEC)));
		rw_enter(&idp->cyi_lock, RW_WRITER);
	}

If cyclic_remove_here() never succeeds, this loops on CPU forever. During this time, the thread holds cpu_lock. A check of stacks on the system shows many threads waiting on cpu_lock, and at least one of them has been waiting ~38 minutes (the same amount of time the on-CPU thread has been running).

Why cyclic_remove_here() is not returning success remains to be determined.

Comments

Comment by Patrick Mooney
Created at 2018-06-12T17:59:52.221Z

Thoth ID: 6b8ffaeb09649ee383573e84e27be1c2


Comment by Patrick Mooney
Created at 2018-06-12T20:59:02.131Z
Updated at 2019-05-10T19:17:30.719Z

Assuming, for the purposes of debugging, that the cyclics themselves are not directly to blame here, let's take a closer look at the troublesome thread. There are three distinct parts to its stack trace.

The initial vm_run, which is in the process of localizing its cyclic resources:

  fffffe00f971c2b0 switch_sp_and_call+0x13()
  fffffe00f971c300 apix_do_softint+0x6c(fffffe00f971c370)
  fffffe00f971c360 apix_do_interrupt+0x362(fffffe00f971c370, f7)
  fffffe00f971c370 _interrupt+0xba()
  fffffe00f971c4a0 cbe_xcall+0x88(fffffeb1cd1e9000, fffffeb1cd1e9000, fffffffffb9ef5e0, fffffe00f971c4b0)
  fffffe00f971c520 cyclic_remove_here+0x57(fffffeb1ce1dc6c0, 15, fffffe00f971c560, 1)
  fffffe00f971c5b0 cyclic_juggle_one_to+0xf4(fffffeb4288c3cd0, fffffeb1ce1dc000)
  fffffe00f971c5f0 cyclic_move_here+0x88(fffffeb4288c3cd0)
  fffffe00f971c620 vmm_glue_callout_localize+0x27(fffffeb1f3a9f748)
  fffffe00f971c650 vhpet_localize_resources+0x40(fffffeb1f3a9f600)
  fffffe00f971c680 vm_localize_resources+0x9a(fffffeb4393d6000, fffffeb4393d6240)
  fffffe00f971c760 vm_run+0x18a(fffffeb4393d6000, fffffe00f971cae0)
  fffffe00f971cc20 vmmdev_do_ioctl+0x762(fffffeb420ebf600, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8)
  fffffe00f971ccc0 vmm_ioctl+0x12c(13100000003, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8)
  fffffe00f971cd00 cdev_ioctl+0x39(13100000003, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8)
  fffffe00f971cd50 spec_ioctl+0x60(fffffeb429d9b940, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8, 0)
  fffffe00f971cde0 fop_ioctl+0x55(fffffeb429d9b940, c0907601, fffffc7fe9fffe10, 202003, fffffeb1f34aae10, fffffe00f971cea8, 0)
  fffffe00f971cf00 ioctl+0x9b(3, c0907601, fffffc7fe9fffe10)
  fffffe00f971cf10 sys_syscall+0x19f()

The vlapic cyclic which interrupted it:

  fffffe00f4caa9b0 mutex_delay_default+7()
  fffffe00f4caaa20 lock_set_spl_spin+0xb2(fffffeb43199bb29, b, fffffeb43199bb2c, 2)
  fffffe00f4caaa90 mutex_vector_enter+0x361(fffffeb43199bb28)
  fffffe00f4caaad0 vlapic_callout_handler+0x2a(fffffeb43199ba80)
  fffffe00f4caaaf0 vmm_glue_callout_handler+0x2f(fffffeb43199bad0)
  fffffe00f4caab90 cyclic_softint+0xfd(fffffeb1cd4ff000, 0)
  fffffe00f4caaba0 cbe_low_level+0x14()
  fffffe00f4caabf0 av_dispatch_softvect+0x78(2)
  fffffe00f4caac20 apix_dispatch_softint+0x35(0, 0)

... and the network traffic, destined for viona, which interrupted that:

  fffffe00f4ca89b0 vm_mapping_find+0x67(fffffeb43257ea00, 3bb28000, 1000)
  fffffe00f4ca8a10 vmspace_find_kva+0x55(fffffeb43257ea00, 3bb28000, 1000)
  fffffe00f4ca8a50 vmm_drv_gpa2kva+0x39(fffffeb42baf1080, 3bb28000, 1000)
  fffffe00f4ca8a80 viona_gpa2kva+0x1c(fffffeb1ea965940, 3bb28000, 1000)
  fffffe00f4ca8b60 vq_popchain+0x37c(fffffeb1ea965950, 1, fffffe00f4ca8da0, 0)
  fffffe00f4ca8f00 viona_recv_merged+0x1f1(fffffeb266c58600, fffffeb266c58600, 0)
  fffffe00f4ca8fa0 viona_rx+0x140(fffffeb1ea965940, 0, fffffeb266c58600, 0)
  fffffe00f4ca8ff0 mac_rx_deliver+0x37(fffffeb414a498f0, 0, fffffeb266c58600, 0)
  fffffe00f4ca9080 mac_rx_soft_ring_process+0x99(fffffeb414a498f0, fffffeb434747680, fffffeb266c58600, fffffeb266c58600, 1, 0)
  fffffe00f4ca9930 mac_rx_srs_fanout+0x3b2(fffffeb43ac4c980, fffffeb266c58600)
  fffffe00f4ca99b0 mac_rx_srs_drain+0x25e(fffffeb43ac4c980, 800)
  fffffe00f4ca9a40 mac_rx_srs_process+0x109(fffffeb43ab42ca8, fffffeb43ac4c980, fffffeb266c58600, 0)
  fffffe00f4ca9aa0 mac_rx_common+0x230(fffffeb43ab42ca8, 0, fffffeb266c58600)
  fffffe00f4ca9b00 mac_rx+0xb6(fffffeb427e84840, 8, fffffffff85ba202)
  fffffe00f4ca9d20 overlay_mux_recv+0x32a(fffffeb1f3f0b7f0, fffffeb6ae8af280, 146, 0, fffffeb431374380)
  fffffe00f4ca9dc0 so_queue_msg_impl+0x160(fffffeb1f3f0b7f0, fffffeb6ae8af280, 146, 0, fffffe00f4ca9e2c, 0, 0)
  fffffe00f4ca9e10 so_queue_msg+0x30(fffffeb1f3f0b7f0, fffffeb6ae8af280, 146, 0, fffffe00f4ca9e2c, 0)
  fffffe00f4ca9e80 udp_ulp_recv+0xa3(fffffeb1f3f0aa40, fffffeb6ae8af280, 146, fffffe00f4caa310)
  fffffe00f4ca9fe0 udp_input+0x3c2(fffffeb1f3f0aa58, fffffeb266c58600, fffffeb26561c014, fffffeb1f3f0aa40)
  fffffe00f4caa0d0 ip_fanout_v4+0xd92(fffffeb266c58600, fffffeb26561c014, fffffe00f4caa310)
  fffffe00f4caa140 ip_input_local_v4+0x16e(fffffeb1f44e6430, fffffeb266c58600, fffffeb26561c014, fffffe00f4caa310)
  fffffe00f4caa1b0 ire_recv_local_v4+0x132(fffffeb1f44e6430, fffffeb266c58600, fffffeb26561c014, fffffe00f4caa310)
  fffffe00f4caa2d0 ill_input_short_v4+0x4d6(fffffeb1f44e6430, fffffeb26561c014, fffffeb2fc741728, fffffe00f4caa310, fffffffff7a54a11)
  fffffe00f4caa530 ip_input_common_v4+0x27a(fffffeb2fc741728, 0, fffffeb266c58600, 0, 0, 0, 0)
  fffffe00f4caa570 ip_input+0x2b(fffffeb2fc741728, 0, fffffeb266c58600, 0)
  fffffe00f4caa600 mac_rx_soft_ring_process+0x99(fffffeb414a4e390, fffffeb1f3eaec40, fffffeb266c58600, fffffeb266c58600, 1, 0)
  fffffe00f4caa730 mac_rx_srs_proto_fanout+0x29a(fffffeb1fa1fd680, fffffeb266c58600)
  fffffe00f4caa7b0 mac_rx_srs_drain+0x25e(f6, fffffe00f4caa890)
  fffffe00f4caa890 apix_hilevel_intr_prolog+0x3e(fffffffffb80094a, fffffe00f4caa890, fffffe00f4caa890, b)

As @max noted, the activity at the top of the stack is fine; no part of it indicates that it is deadlocked on some contended resource. The middle portion of the stack, however, is of special interest. Here we see the callout for a vlapic timer firing and attempting to grab the lock which protects its data. The lock is contended and, being a spinlock, makes finding the holder a little more difficult. That said, a quick survey of other threads immediately leads us in the right direction:

[ fffffe00f5b6b1a0 _resume_from_idle+0x126() ]
  fffffe00f5b6b1d0 swtch+0x141()
  fffffe00f5b6b270 turnstile_block+0x21a(0, 0, fffffffffbc28008, fffffffffbc204e0, 0, 0)
  fffffe00f5b6b2e0 mutex_vector_enter+0x3a3(fffffffffbc28008)
  fffffe00f5b6b330 cbe_xcall+0x5d(fffffeb1cd4ff000, fffffeb1cd4ff000, fffffffffb9ef980, fffffe00f5b6b340)
  fffffe00f5b6b3b0 cyclic_reprogram_here+0x46(fffffeb1ce1dc000, a, 66434a2f1d8a)
  fffffe00f5b6b3e0 cyclic_reprogram+0x60(fffffeb42dca7d20, 66434a2f1d8a)
  fffffe00f5b6b430 vmm_glue_callout_reset_sbt+0x6d(fffffeb43199bad0, 3da00, 0, fffffffff8452fc0, fffffeb43199ba80, 0)
  fffffe00f5b6b460 vlapic_icrtmr_write_handler+0x12c(fffffeb43199ba80)
  fffffe00f5b6b4a0 vlapic_write+0x111(fffffeb43199ba80, 1, 380, 1ed, fffffe00f5b6b71f)
  fffffe00f5b6b500 lapic_mmio_write+0x73(fffffeb1e83b8000, 1, fee00380, 1ed, 4, fffffe00f5b6b71f)
  fffffe00f5b6b590 emulate_mov+0xb2(fffffeb1e83b8000, 1, fee00380, fffffeb1e83b83f8, fffffffff844dc10, fffffffff844db90, fffffe00f5b6b71f)
  fffffe00f5b6b5e0 vmm_emulate_instruction+0x6c(fffffeb1e83b8000, 1, fee00380, fffffeb1e83b83f8, fffffeb1e83b83e0, fffffffff844dc10, fffffffff844db90, fffffe00f5b6b71f)
  fffffe00f5b6b680 vm_handle_inst_emul+0x18c(fffffeb1e83b8000, 1, fffffe00f5b6b71f)
  fffffe00f5b6b760 vm_run+0x3f7(fffffeb1e83b8000, fffffe00f5b6bae0)
  fffffe00f5b6bc20 vmmdev_do_ioctl+0x762(fffffeb42974d7c0, c0907601, fffffc7fe9e00e10, 202003, fffffeb427b5fef0, fffffe00f5b6bea8)
  fffffe00f5b6bcc0 vmm_ioctl+0x12c(13100000004, c0907601, fffffc7fe9e00e10, 202003, fffffeb427b5fef0, fffffe00f5b6bea8)
  fffffe00f5b6bd00 cdev_ioctl+0x39(13100000004, c0907601, fffffc7fe9e00e10, 202003, fffffeb427b5fef0, fffffe00f5b6bea8)
  fffffe00f5b6bd50 spec_ioctl+0x60(fffffeb1e9d21900, c0907601, fffffc7fe9e00e10, 202003, fffffeb427b5fef0, fffffe00f5b6bea8, 0)
  fffffe00f5b6bde0 fop_ioctl+0x55(fffffeb1e9d21900, c0907601, fffffc7fe9e00e10, 202003, fffffeb427b5fef0, fffffe00f5b6bea8, 0)
  fffffe00f5b6bf00 ioctl+0x9b(3, c0907601, fffffc7fe9e00e10)
  fffffe00f5b6bf10 sys_syscall+0x19f()

And what mutex is it blocked on? The cyclic xcall lock, held by our double-interrupted thread, of course.

> fffffffffbc28008::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
fffffffffbc28008 adapt fffffeb4278c20e0      -      -     yes
> cbe_xcall_lock=K
                fffffffffbc28008

So, given the lock ordering problem between the two threads in this condition (cbe_xcall_lock > interrupt > vlapic_timer_mtx on one side versus vlapic_timer_mtx > cbe_xcall_lock on the other), how do we avoid the deadlock? The problem lies with the type of lock used for vlapic_timer_mtx: a spin lock. Were it a normal adaptive mutex, the scheduler could have passivated the interrupt thread when it blocked on vlapic_timer_mtx, returning to the still-running cbe_xcall code which had been interrupted. When that completed, the thread processing the vlapic_write would have acquired cbe_xcall_lock, and the rest of the chain of contended resources would have been able to make progress in time.
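
Laid out side by side (a rough reconstruction from the two stacks above, not actual source), the acquisition orders look like this:

/*
 * Thread on CPU 16 (vhpet localize)      vCPU thread (icrtmr write)
 * ----------------------------------     ---------------------------------
 * mutex_enter(&cpu_lock)                 holds the vlapic timer spin lock
 * cyclic_juggle_one_to()                 cyclic_reprogram()
 *   cbe_xcall(): takes cbe_xcall_lock      cbe_xcall(): blocks on
 *   <interrupted>                            cbe_xcall_lock (held on left)
 *     vlapic_callout_handler()
 *       spins on the vlapic timer lock
 *       (held by the thread on the right)
 */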

Originally in the bhyve port, context switching while in the "critical section" of a vm_run was considered totally unacceptable, which precluded any accessed resource from voluntarily going off-CPU on a lock. As such, spin locks seemed suitable. Once it was clear that preventing voluntary context switches was basically impossible, and thread ctx guards were put in place, the requirement for spin locks evaporated.


Comment by Patrick Mooney
Created at 2018-06-13T15:22:36.578Z

My proposed fix for this is simply to ignore requests for MTX_SPIN typing on mutexes, providing all-adaptive ones instead. Initial testing with that approach unearthed OS-7016.
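
A minimal sketch of the idea, assuming an mtx_init() compat shim backed by a kmutex_t (the struct layout and function shape here are illustrative assumptions, not the actual change):

/*
 * Hypothetical compat shim: any MTX_SPIN request is ignored and every
 * bhyve mutex becomes a normal adaptive kmutex.
 */
void
mtx_init(struct mtx *m, char *name, const char *type_name, int opts)
{
	/*
	 * Previously (opts & MTX_SPIN) would select a spin lock held at
	 * raised SPL.  With an adaptive mutex, a contended acquirer can
	 * block and go off-CPU, so an interrupted cbe_xcall() like the
	 * one above can run to completion.
	 */
	mutex_init(&m->m, name, MUTEX_ADAPTIVE, NULL);	/* m->m assumed */
}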


Comment by Patrick Mooney
Created at 2018-08-23T20:14:11.538Z
Updated at 2018-08-23T20:15:44.006Z

Given the tight nature of this race, and all of the moving parts involved, it has proven difficult to reproduce. With the spinlocks replaced with standard adaptive mutexes, a PI with the updated wad was subjected to load by booting various guests. Collecting stacks of vmx_savectx and vmx_msr_guest_exit, we can look for instances where a VMX thread goes off-CPU rather than spinning on a mutex.

Here we see that a bhyve thread went through the expected context-saving while blocking on a contended lapic lock during event injection:

              vmm`vmm_savectx+0x3e
              genunix`savectx+0x3d
              unix`resume+0x70
              unix`swtch+0x141
              genunix`turnstile_block+0x21a
              unix`mutex_vector_enter+0x3a3
              vmm`vatpic_pending_intr+0x3b
              vmm`vmx_inject_interrupts+0x320
              vmm`vmx_run+0x264
              vmm`vm_run+0x21f
              vmm`vmmdev_do_ioctl+0x775
              vmm`vmm_ioctl+0x12c
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
              unix`sys_syscall+0x19f
                2

Similarly, the generic vcpu spinlocks now cause expected swtch()-ing:

              vmm`vmx_savectx+0x70
              vmm`vmm_savectx+0x3e
              genunix`savectx+0x3d
              unix`resume+0x70
              unix`swtch+0x141
              genunix`turnstile_block+0x21a
              unix`mutex_vector_enter+0x3a3
              vmm`vcpu_notify_event+0x36
              vmm`lapic_set_intr+0x8b
              vmm`vlapic_icrlo_write_handler+0x114
              vmm`vmx_handle_apic_write+0xd2
              vmm`vmx_exit_process+0x601
              vmm`vmx_run+0x6ef
              vmm`vm_run+0x21f
              vmm`vmmdev_do_ioctl+0x775
              vmm`vmm_ioctl+0x12c
              genunix`cdev_ioctl+0x39
              specfs`spec_ioctl+0x60
              genunix`fop_ioctl+0x55
              genunix`ioctl+0x9b
                9

Comment by Jira Bot
Created at 2018-09-13T15:02:14.166Z

illumos-joyent commit 88b7f8b0d2ce2d1d224879a20f7d36427c5b7707 (branch master, by Patrick Mooney)

OS-7012 bhyve wedged in vlapic cyclics
OS-7016 vmx ctx ops should inform VMRESUME
OS-6957 clean up unused mutex type from bhyve
Reviewed by: John Levon <john.levon@joyent.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Approved by: Hans Rosenfeld <hans.rosenfeld@joyent.com>