OS-7613: bhyve virtio needs barriers

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2019-02-25T22:44:38.459Z
Updated at: 2019-02-27T21:57:12.392Z

People

Created by: Patrick Mooney [X]
Reported by: Patrick Mooney [X]
Assigned to: Patrick Mooney [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-02-27T21:56:54.290Z)

Fix Versions

2019-02-28 Mind Grapes (Release Date: 2019-02-28)

Related Links

Labels

bhyve

Description

A customer has reported bhyve guests whose virtio-block devices appear to hang, manifesting as stuck kernel threads (and workloads going unresponsive):

kernel: INFO: task jbd2/vdb-8:2235 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: jbd2/vdb-8 D ffff956876502080 0 2235 2 0x00000000
kernel: Call Trace:
kernel: [<ffffffff85f67c49>] schedule+0x29/0x70
kernel: [<ffffffffc036b2bc>] jbd2_journal_commit_transaction+0x23c/0x19b0 [jbd2]
kernel: [<ffffffff8582a59e>] ? __switch_to+0xce/0x580
kernel: [<ffffffff858c2d00>] ? wake_up_atomic_t+0x30/0x30
kernel: [<ffffffff85f6778f>] ? __schedule+0x3ff/0x890
kernel: [<ffffffff858aa0de>] ? try_to_del_timer_sync+0x5e/0x90
kernel: [<ffffffffc0371e89>] kjournald2+0xc9/0x260 [jbd2]
kernel: [<ffffffff858c2d00>] ? wake_up_atomic_t+0x30/0x30
kernel: [<ffffffffc0371dc0>] ? commit_timeout+0x10/0x10 [jbd2]
kernel: [<ffffffff858c1c31>] kthread+0xd1/0xe0
kernel: [<ffffffff858c1b60>] ? insert_kthread_work+0x40/0x40
kernel: [<ffffffff85f74c37>] ret_from_fork_nospec_begin+0x21/0x21
kernel: [<ffffffff858c1b60>] ? insert_kthread_work+0x40/0x40

Further investigation yielded little to go on. The virtio-block device state seemed OK, and all of its blockif requests appeared to have completed successfully. No bhyve threads appeared stuck in any way. A DTrace script was concocted to capture virtio events on that second blockdev while the customer repeated the same fateful workload:

#!/usr/sbin/dtrace -s -C

#pragma D option bufpolicy=ring
#pragma D option bufsize=2048k
#pragma D option quiet

/* blockdev target, function=1 -> vdb */
#define TARG    1

pid$target:a.out:pci_vtblk_proc:entry
/args[0]->vbsc_vs.vs_pi->pi_func == TARG/
{
        /* void pci_vtblk_proc(struct pci_vtblk_softc *sc, struct vqueue_info *vq) */
        printf("%lu %s(%p)\n", (unsigned long)timestamp, probefunc, args[0]);
        self->proc = 1;
}

pid$target:a.out:vq_getchain:entry
/self->proc/
{
/* n = vq_getchain(vq, &idx, iov, BLOCKIF_IOV_MAX + 2, flags); */
        self->idx = args[1];
        self->iov = args[2];
}
pid$target:a.out:vq_getchain:return
/self->proc/
{
        this->hdr = (userland struct virtio_blk_hdr *)self->iov[0].iov_base;
        printf("%lu %s = { idx:%x n:%u type:%x prio:%x sect:%lx }\n",
            (unsigned long)timestamp, probefunc, *self->idx, arg1,
            this->hdr->vbh_type, this->hdr->vbh_ioprio, this->hdr->vbh_sector);
}
pid$target:a.out:blockif_read:entry,
pid$target:a.out:blockif_write:entry,
pid$target:a.out:blockif_flush:entry
/self->proc/
{
        /* blockif_*(sc->bc, &io->io_req); */
        printf("%lu %s(%p { niov:%x off:%lx resid:%lx })\n",
            (unsigned long)timestamp, probefunc, args[1],
            args[1]->br_iovcnt, args[1]->br_offset, args[1]->br_resid);
}


pid$target:a.out:pci_vtblk_proc:return
/self->proc/
{
        self->proc = 0;
}

pid$target:a.out:pci_vtblk_done_locked:entry
/args[0]->io_sc->vbsc_vs.vs_pi->pi_func == TARG/
{
        /* void pci_vtblk_done_locked(struct pci_vtblk_ioreq *io, int err) */
        printf("%lu %s:entry(%p, %d)\n", (unsigned long)timestamp, probefunc,
            args[0], args[1]);
        self->done = args[0];
}

pid$target:a.out:pci_generate_msix:entry,
pid$target:a.out:pci_generate_msi:entry
/self->done/
{
        printf("%lu %s\n", (unsigned long)timestamp, probefunc);
}

pid$target::vm_lapic_msi:entry
/self->done/
{
        /* vm_lapic_msi(struct vmctx *ctx, uint64_t addr, uint64_t msg) */
        printf("%lu %s(%lx, %lx)\n", (unsigned long)timestamp, probefunc,
            arg1, arg2);
}

pid$target:a.out:pci_vtblk_done_locked:return
/self->done/
{
        printf("%lu %s:return(%p)\n", (unsigned long)timestamp, probefunc,
            self->done);
        self->done = 0;
}

pid$target:a.out:pci_vtblk_notify:entry
/((userland struct pci_vtblk_softc *)arg0)->vbsc_vs.vs_pi->pi_func == TARG/
{
        /* void pci_vtblk_notify(void *vsc, struct vqueue_info *vq) */
        printf("%lu %s(%p)\n", (unsigned long)timestamp, probefunc, args[0]);
}

They ran the collection as the guest underwent the same hang, with iostat -x 1 reporting a "stuck" IO in the guest. The last few events in the capture were of some interest:

1346720612600184 pci_vtblk_notify(6c2a50)
1346720612607390 pci_vtblk_proc(6c2a50)
1346720612619473 vq_getchain = { idx:d n:6 type:1 prio:0 sect:4d3879a0 }
1346720612626816 blockif_write(6c9668 { niov:4 off:9a70f34000 resid:4000 })
1346720612642677 pci_vtblk_proc(6c2a50)
1346720612645786 vq_getchain = { idx:6a n:6 type:1 prio:4004 sect:3e85c350 }
1346720612648628 blockif_write(6f95a8 { niov:4 off:7d0b86a000 resid:4000 })
1346720612657608 pci_vtblk_notify(6c2a50)
1346720612702296 pci_vtblk_done_locked:entry(6c9668, 0)
1346720612709961 pci_generate_msix
1346720612713694 vm_lapic_msi(fee02000, 4062)
1346720612722376 pci_vtblk_done_locked:return(6c9668)
1346720612729369 pci_vtblk_done_locked:entry(6f95a8, 0)
1346720612739112 pci_vtblk_done_locked:return(6f95a8)

Two write operations are inserted into the virtqueue. When the first completes, we issue an interrupt to the guest. The second completes very shortly afterward, but does not result in an interrupt. This by itself is not a surprise: the Linux virtio drivers disable interrupts (via the avail ring flags) while processing ring data.

Re-enabling interrupts after processing the used ring is a careful dance, though, since it opens a window in which the host can add a new item to the used ring without ever observing that the guest has re-enabled interrupts. To avoid that missed interrupt, the guest uses a memory barrier to ensure that its store to the avail flags and its subsequent load of used_idx are not reordered:

unsigned virtqueue_enable_cb_prepare(struct virtqueue *_vq)
{
        struct vring_virtqueue *vq = to_vvq(_vq);
        u16 last_used_idx;

        START_USE(vq);

        /* We optimistically turn back on interrupts, then check if there was
         * more to do. */
        /* Depending on the VIRTIO_RING_F_EVENT_IDX feature, we need to
         * either clear the flags bit or point the event index at the next
         * entry. Always do both to keep code simple. */
        vq->vring.avail->flags &= ~VRING_AVAIL_F_NO_INTERRUPT;
        vring_used_event(&vq->vring) = last_used_idx = vq->last_used_idx;
        END_USE(vq);
        return last_used_idx;
}
...
bool virtqueue_poll(struct virtqueue *_vq, unsigned last_used_idx)
{
        struct vring_virtqueue *vq = to_vvq(_vq);

        virtio_mb(vq->weak_barriers);
        return (u16)last_used_idx != vq->vring.used->idx;
}
...
bool virtqueue_enable_cb(struct virtqueue *_vq)
{
        unsigned last_used_idx = virtqueue_enable_cb_prepare(_vq);
        return !virtqueue_poll(_vq, last_used_idx);
}

This care does seem necessary: the Intel 64 Architecture Memory Ordering guide notes in section 2.3 that "Loads may be reordered with older stores to different locations", which is exactly the pattern in play here. It follows that the host side of the virtio device needs an equivalent barrier, and qemu indeed does something similar:

static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
{
    uint16_t old, new;
    bool v;
    /* We need to expose used array entries before checking used event. */
    smp_mb();
    /* Always notify when queue is empty (when feature acknowledge) */
    if (virtio_vdev_has_feature(vdev, VIRTIO_F_NOTIFY_ON_EMPTY) &&
        !vq->inuse && virtio_queue_empty(vq)) {
        return true;
    }

    if (!virtio_vdev_has_feature(vdev, VIRTIO_RING_F_EVENT_IDX)) {
        return !(vring_avail_flags(vq) & VRING_AVAIL_F_NO_INTERRUPT);
    }
...
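
Stripped of the ring details, the hazard both sides are guarding against is the classic store-buffer pattern: each party performs a store and then a load of the other party's location, and x86 permits the load to be satisfied before the store becomes globally visible. A minimal sketch in C11 (the names and helpers here are invented for illustration and are not the actual virtio or bhyve code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_INTERRUPT    0x1

/* Simplified stand-ins for the guest-visible ring fields. */
_Atomic uint16_t avail_flags = NO_INTERRUPT;
_Atomic uint16_t used_idx;

/* Guest: re-enable interrupts, then check for completions it may have missed. */
bool
guest_enable_and_poll(uint16_t last_seen_used)
{
        uint16_t flags = atomic_load_explicit(&avail_flags, memory_order_relaxed);

        atomic_store_explicit(&avail_flags, flags & ~NO_INTERRUPT,
            memory_order_relaxed);
        /* Without this fence, the load below may complete before the store
         * above is visible to the host. */
        atomic_thread_fence(memory_order_seq_cst);
        return (atomic_load_explicit(&used_idx, memory_order_relaxed) !=
            last_seen_used);
}

/* Host: publish a completion, then decide whether to interrupt the guest. */
bool
host_publish_and_check(void)
{
        atomic_fetch_add_explicit(&used_idx, 1, memory_order_relaxed);
        /* Without this fence, the flags load may complete before the
         * used_idx store is visible to the guest. */
        atomic_thread_fence(memory_order_seq_cst);
        return ((atomic_load_explicit(&avail_flags, memory_order_relaxed) &
            NO_INTERRUPT) == 0);
}

If either fence is missing, both loads can return stale values in the same window: the host concludes that interrupts are still suppressed and skips the notification, while the guest concludes that nothing new has landed in the used ring and goes back to sleep. That is exactly the lost completion seen in the trace above.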

To provide further support for the hypothesis that a missed interrupt had wedged the device, I issued a direct read to it on the ailing system:

dd if=/dev/vdb of=/dev/null iflag=direct count=1

That IO succeeded, and so did the old "stuck" one, presumably because the interrupt for the new request prompted the guest to walk the used ring and pick up the older completion as well. This lends further credence to the theory that an interrupt was missed.

Although the race is very tight (this is the first case of it we've seen), the customer is able to replicate it with some regularity. Memory barriers should be added to bhyve's virtio handling where they appear to be needed, at which point the customer can test with an updated PI to gather more evidence.
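
The shape of that change on the bhyve side is a full barrier in the used-ring completion path, after the used index is published and before the guest's avail flags (or used_event index) are consulted to decide whether to inject an interrupt. A minimal sketch, assuming simplified ring structures and a hypothetical vq_notify_guest() helper in place of the real MSI/MSI-X or INTx injection:

#include <stdatomic.h>
#include <stdint.h>

#define VRING_AVAIL_F_NO_INTERRUPT      0x1

/* Simplified stand-ins for the guest-shared ring layout. */
struct used_elem  { uint32_t ue_id; uint32_t ue_len; };
struct used_ring  { uint16_t ur_flags; uint16_t ur_idx; struct used_elem ur_ring[]; };
struct avail_ring { uint16_t ar_flags; uint16_t ar_idx; uint16_t ar_ring[]; };

struct vq_sketch {
        uint16_t qsize;                         /* ring size */
        volatile struct used_ring *used;
        volatile struct avail_ring *avail;
};

void vq_notify_guest(struct vq_sketch *);       /* hypothetical interrupt injection */

/*
 * Complete one descriptor chain: publish it in the used ring, then decide
 * whether the guest wants an interrupt for it.
 */
void
vq_complete_and_notify(struct vq_sketch *vq, uint16_t didx, uint32_t len)
{
        uint16_t uidx = vq->used->ur_idx;

        vq->used->ur_ring[uidx % vq->qsize].ue_id = didx;
        vq->used->ur_ring[uidx % vq->qsize].ue_len = len;

        /* The ring entry must be visible before the index that covers it. */
        atomic_thread_fence(memory_order_release);
        vq->used->ur_idx = uidx + 1;

        /*
         * Full barrier: the store to ur_idx must be visible before we load
         * the guest's avail flags, otherwise we can miss the guest clearing
         * NO_INTERRUPT in the window described above.
         */
        atomic_thread_fence(memory_order_seq_cst);

        if ((vq->avail->ar_flags & VRING_AVAIL_F_NO_INTERRUPT) == 0)
                vq_notify_guest(vq);
}

The fence calls here stand in for whatever full-barrier primitive the platform provides (an mfence or locked instruction on x86); the actual placement in bhyve's virtio code is what the fix is expected to settle.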

Comments

Comment by Patrick Mooney [X]
Created at 2019-02-26T17:46:15.420Z

I've run a PI featuring this change through the normal set of guest smoke tests, including network and disk activity. No issues were observed.

A test PI has been passed to the customer who reported the issue. They will continue the workload which has been able to replicate the problem to check if the fix has the desired effect.


Comment by Patrick Mooney [X]
Created at 2019-02-27T20:01:28.109Z

The customer reports that after a night of testing, they've not seen the issue on the test PI.


Comment by Jira Bot
Created at 2019-02-27T21:57:12.392Z

illumos-joyent commit e393062f0aebf8081aed83fd67670d9094d2a2a3 (branch master, by Patrick Mooney)

OS-7613 bhyve virtio needs barriers
OS-7614 viona should use proper barriers
Reviewed by: John Levon <john.levon@joyent.com>
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>