OS-8000: viona ring reset can race with Rx packets


Issue Type: Bug
Priority: 4 - Normal
Created at: 2019-09-26T23:26:54.881Z
Updated at: 2019-10-01T17:41:49.285Z


Created by: Former user
Reported by: Former user
Assigned to: Former user


Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-10-01T17:41:49.272Z)

Fix Versions

2019-10-10 Close Talker (Release Date: 2019-10-10)


There was a report of a machine crashing on a NULL pointer dereference inside vq_popchain().

> ::status
debugging crash dump /manta/thoth/stor/thoth/e499adfd57dac42f68703c6824ab680d/vmcore.0 (64-bit) from ac-1f-6b-a5-ab-e2
operating system: 5.11 joyent_20190718T005708Z (i86pc)
git branch: release-20190718
git rev: d5273958d6f5c5953e70b91cc8eaac8b2c2d8d1f
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffb0450c026a0 addr=0 occurred in module "viona" due to a NULL pointer dereference
dump content: kernel pages only

> ::stack
vq_popchain+0x5f(fffffe02e7f3d710, fffffb0450c02890, 20, fffffb0450c02b9e)
viona_recv_merged+0x6b(fffffe02e7f3d710, ffffff97309ec040, 5d2)
viona_rx_common+0x9e(fffffe02e7f3d710, ffffff97309ec040, 0)
viona_rx_classified+0x3f(fffffe02e7f3d710, 0, ffffff97309ec040, 0)
mac_rx_deliver+0x37(fffffe37c3b999b0, 0, ffffff97309ec040, 0)
mac_rx_soft_ring_process+0x19a(fffffe37c3b999b0, fffffe34bd4d5340, ffffff97309ec040, ffffff97309ec040, 1, 0)
mac_rx_srs_fanout+0x3d2(fffffdd595005640, ffffff97309ec040)
mac_rx_srs_drain+0x256(fffffdd595005640, 800)
mac_rx_srs_process+0x428(fffffdc58b2b95d8, fffffdd595005640, ffffff97309ec040, 0)
mac_rx_classify+0x11b(fffffdc58b2b95d8, fffffdc6989d1110, ffffff97309ec040)
mac_rx_flow+0x63(fffffdc58b2b95d8, fffffdc6989d1110, fffffe01d721f440)
mac_rx_common+0x1e6(fffffdc58b2b95d8, fffffdc6989d1110, fffffe01d721f440)
mac_rx+0xb6(fffffdc58b2b95d8, fffffdc6989d1110, fffffe01d721f440)
mac_rx_ring+0x2b(fffffdc58b2b95d8, fffffdc6989d1110, fffffe01d721f440, 1)
aggr_mac_rx+0x25(fffffdc58b2b95d8, fffffdc58ce8dd00, fffffe01d721f440)
aggr_recv_path_cb+0x193(fffffdc5b1c86248, fffffdc58ce8dd00, fffffe01d721f440, 0)
aggr_recv_cb+0x1d(fffffdc5b1c86248, fffffdc58ce8dd00, fffffe01d721f440, 0)
mac_rx_common+0x253(fffffdc58b2c3730, fffffdc58b7b1dc8, fffffe01d721f440)
mac_rx+0xb6(fffffdc58b2c3730, fffffdc58b7b1dc8, fffffe01d721f440)
mac_rx_ring+0x2b(fffffdc58b2c3730, fffffdc58b7b1dc8, fffffe01d721f440, 1)
ixgbe_intr_msix+0x58(fffffdc58b33a7f8, 0)
apix_dispatch_lowlevel+0x25(23, 0)
apix_do_interrupt+0x3cf(fffffb0450bb2ab0, fffffdc51f954000)

The problem is that we are receiving a packet on a ring that has been reset (vr_state == VRS_RESET). Specifically, we dereference vr_avail_idx, which is NULL once the ring has been reset.

> fffffe02e7f3d710::print -t viona_vring_t
viona_vring_t {
    viona_link_t *vr_link = 0xfffffe02e7f3d700
    kmutex_t vr_lock = {
        void *[1] _opaque = [ 0 ]
    kcondvar_t vr_cv = {
        ushort_t _opaque = 0
    uint16_t vr_state = 0
    uint16_t vr_state_flags = 0
    uint_t vr_xfer_outstanding = 0
    kthread_t *vr_worker_thread = 0
    vmm_lease_t *vr_lease = 0
    viona_desb_t *vr_txdesb = 0
    struct iovec *vr_txiov = 0
    uint_t vr_intr_enabled = 0
    uint64_t vr_msi_addr = 0xfee00000
    uint64_t vr_msi_msg = 0x4161
    kmutex_t vr_a_mutex = {
        void *[1] _opaque = [ 0xfffffb0450c03c20 ]
    kmutex_t vr_u_mutex = {
        void *[1] _opaque = [ 0 ]
    uint64_t vr_pa = 0x2094c1000
    uint16_t vr_size = 0x400
    uint16_t vr_mask = 0x3ff
    uint16_t vr_cur_aidx = 0
    volatile struct virtio_desc *vr_descr = 0
    volatile uint16_t *vr_avail_flags = 0
    volatile uint16_t *vr_avail_idx = 0
    volatile uint16_t *vr_avail_ring = 0
    volatile uint16_t *vr_avail_used_event = 0
    volatile uint16_t *vr_used_flags = 0
    volatile uint16_t *vr_used_idx = 0
    volatile struct virtio_used *vr_used_ring = 0
    volatile uint16_t *vr_used_avail_event = 0
    struct viona_ring_stats vr_stats = {
        uint64_t rs_ndesc_too_high = 0
        uint64_t rs_bad_idx = 0
        uint64_t rs_indir_bad_len = 0
        uint64_t rs_indir_bad_nest = 0
        uint64_t rs_indir_bad_next = 0
        uint64_t rs_no_space = 0x74
        uint64_t rs_too_many_desc = 0
        uint64_t rs_desc_bad_len = 0
        uint64_t rs_bad_ring_addr = 0
        uint64_t rs_fail_hcksum = 0
        uint64_t rs_fail_hcksum6 = 0
        uint64_t rs_fail_hcksum_proto = 0
        uint64_t rs_bad_rx_frame = 0
        uint64_t rs_rx_merge_overrun = 0
        uint64_t rs_rx_merge_underrun = 0
        uint64_t rs_rx_pad_short = 0xc3
        uint64_t rs_rx_mcast_check = 0
        uint64_t rs_too_short = 0
        uint64_t rs_tx_absent = 0
        uint64_t rs_rx_hookdrop = 0
        uint64_t rs_tx_hookdrop = 0

At first we thought this might be caused by the ordering of mac_client_close() with respect to viona_ring_reset() inside of the viona_ioc_delete() ioctl -- but that wasn't the case. As shown by mdb, the link is still in a non-destroyed state, and the mac client still has its callback set to viona_rx_classified().

> 0xfffffe02e7f3d700::print -t viona_link_t l_destroyed
boolean_t l_destroyed = 0 (0)

> 0xfffffe02e7f3d700::print -t viona_link_t l_mch | ::print mac_client_impl_t mci_rx_fn
mci_rx_fn = viona_rx_classified

Then we thought it might be the device being reset by the guest, causing a call to viona_ioc_ring_reset() -> viona_ring_reset(), which would lead to the Rx worker exiting and resetting the ring. However, the original crash happened on a zone that was in the process of shutting down, and the guest only seems to reset the device as part of startup.

Then we realized the real cause: the bhyve process must have been killed, causing VRING_NEED_BAIL to return true and the worker thread to reset the ring. However, this reset logic performs no Rx barrier, so mac threads/interrupts can sneak past the VRS_RESET check in viona_rx_classified() but not reach vq_popchain() until after the ring reset is complete -- resulting in a NULL dereference.

The fix is to add a mac barrier to the Rx worker exit path and another check to viona_rx_classified(), similar to what we do for VRSF_RENEW. Rather than adding a third condition to the if expression, we could instead set VRS_RESET earlier in worker exit and call the Rx barrier before zeroing any ring state. It might also be worth adding yet another state for this scenario, or reusing the existing VRSF_REQ_STOP as part of worker exit.


Comment by Former user
Created at 2019-09-26T23:28:16.080Z

I was able to replicate this locally by running an iperf3 server in the guest and an iperf3 client on a remote host, using DTrace to chill() in viona_rx_common() to widen the race window, and then sending a SIGKILL to the bhyve process while all this was going on.

Comment by Former user
Created at 2019-10-01T17:29:07.480Z

To test this I ran the same test as mentioned in my previous comment:

On the current platform I was able to replicate the crash on 2/6 attempts. With the patched platform I was not able to replicate the crash after 10 attempts.

Comment by Jira Bot
Created at 2019-10-01T17:41:08.422Z

illumos-joyent commit e4b201171637339e9c582457e461fb99e6a4cc56 (branch master, by Ryan Zezeski)

OS-8000 viona ring reset can race with Rx packets
Reviewed by: Mike Zeller <mike.zeller@joyent.com>
Reviewed by: Patrick Mooney <pmooney@pfmooney.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Approved by: Mike Zeller <mike.zeller@joyent.com>