OS-5423: deadlock between netstack teardown and kstat read

Details

Issue Type:	Bug
Priority:	4 - Normal
Status:	Resolved
Created at:	2016-05-19T18:07:47.000Z
Updated at:	2018-02-07T06:15:49.510Z

People

Created by:	Former user
Reported by:	Former user
Assigned to:	Dan McDonald

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-12-08T17:04:26.150Z)

Fix Versions

2017-12-07 Gold Saucer (Release Date: 2017-12-07)

Description

We found a process hung on a shrimp today. It had the following stack:

> 0t86777::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread ffffff4b1c95cb20: ffffff01f303e5d0
[ ffffff01f303e5d0 _resume_from_idle+0x112() ]
  ffffff01f303e600 swtch+0x141()
  ffffff01f303e640 cv_wait+0x70(ffffff44bdbed0c8, fffffffffbc14f20)
  ffffff01f303e680 kstat_hold+0x4d(fffffffffbc24f60, ffffff01f303e690)
  ffffff01f303e7b0 kstat_hold_bykid+0x37(1081d80, 320)
  ffffff01f303e7f0 kstat_delete+0x4e(ffffff44bdbed000)
  ffffff01f303e820 kstat_delete_netstack+0x28(ffffff44bdbed000, 320)
  ffffff01f303e840 tcp_kstat2_fini+0x21(320, ffffff44bdbed000)
  ffffff01f303e880 tcp_stack_fini+0x1db(320, ffffff4b12464400)
  ffffff01f303e900 netstack_apply_destroy+0xeb(fffffffffbcb78e8, ffffff449786edc0, 9)
  ffffff01f303e940 apply_all_modules_reverse+0x49(ffffff449786edc0, fffffffffba59800)
  ffffff01f303e9a0 netstack_stack_inactive+0x12b(ffffff449786edc0)
  ffffff01f303e9e0 netstack_rele+0x7d(ffffff449786edc0)
  ffffff01f303ea40 tcp_kstat2_update+0xbe(ffffff44bdbed000, 0)
  ffffff01f303ec30 read_kstat_data+0xd0(ffffff01f303ee58, 840add0, 100001)
  ffffff01f303ec70 kstat_ioctl+0x83(200000000, 4b02, 840add0, 100001, ffffff46f85a1d58, ffffff01f303ee58)
  ffffff01f303ecb0 cdev_ioctl+0x39(200000000, 4b02, 840add0, 100001, ffffff46f85a1d58, ffffff01f303ee58)
  ffffff01f303ed00 spec_ioctl+0x60(ffffff4353bd9000, 4b02, 840add0, 100001, ffffff46f85a1d58, ffffff01f303ee58, 0)
  ffffff01f303ed90 fop_ioctl+0x55(ffffff4353bd9000, 4b02, 840add0, 100001, ffffff46f85a1d58, ffffff01f303ee58, 0)
  ffffff01f303eeb0 ioctl+0x9b(3, 4b02, 840add0)
  ffffff01f303ef10 _sys_sysenter_post_swapgs+0x153()

So, this is a bit suspicious and unfortunate. So, let's figure out why we're
blocked. We're likely trying to hold a kstat that someone else already
is, probably ourselves, but let's prove it. So, we know that the ks id
is 0x1081d80 and the zone is 0x320. So, let's find it:

> kstat_avl_bykid::walk avl | ::printf "%p %x %x\n" ekstat_t . e_ks.ks_kid e_zone.zoneid ! grep '1081d80 320'
ffffff44bdbed000 1081d80 320
> ffffff44bdbed000::print ekstat_t
{
    e_ks = {
        ks_crtime = 0x2449a7ca9c6f1
        ks_next = 0
        ks_kid = 0x1081d80
        ks_module = [ "tcp" ]
        ks_resv = 0
        ks_instance = 0x320
        ks_name = [ "tcpstat" ]
        ks_type = 0x1
        ks_class = [ "net" ]
        ks_flags = 0
        ks_data = 0xffffff44bdbed110
        ks_ndata = 0x2f
        ks_data_size = 0x8d0
        ks_snaptime = 0x247b361824cb7
        ks_update = tcp_kstat2_update
        ks_private = 0x320
        ks_snapshot = default_kstat_snapshot
        ks_lock = 0
    }
    e_size = 0x9e0
    e_owner = 0xffffff4b1c95cb20
    e_cv = {
        _opaque = 0x1
    }
    e_avl_bykid = {
        avl_child = [ 0, 0 ]
        avl_pcb = 0xffffffa83fd5c0d1
    }
    e_avl_byname = {
        avl_child = [ 0xffffff457ac7e0e8, 0xffffffc5da5650e8 ]
        avl_pcb = 0xffffff805521e0e8
    }
    e_zone = {
        zoneid = 0x320
        next = 0xffffff4de0730da0
    }
}
> ffffff44bdbed000::print ekstat_t e_owner | ::findstack
stack pointer for thread ffffff4b1c95cb20: ffffff01f303e5d0
[ ffffff01f303e5d0 _resume_from_idle+0x112() ]
  ffffff01f303e600 swtch+0x141()
  ffffff01f303e640 cv_wait+0x70()
  ffffff01f303e680 kstat_hold+0x4d()
  ffffff01f303e7b0 kstat_hold_bykid+0x37()
  ffffff01f303e7f0 kstat_delete+0x4e()
  ffffff01f303e820 kstat_delete_netstack+0x28()
  ffffff01f303e840 tcp_kstat2_fini+0x21()
  ffffff01f303e880 tcp_stack_fini+0x1db()
  ffffff01f303e900 netstack_apply_destroy+0xeb()
  ffffff01f303e940 apply_all_modules_reverse+0x49()
  ffffff01f303e9a0 netstack_stack_inactive+0x12b()
  ffffff01f303e9e0 netstack_rele+0x7d()
  ffffff01f303ea40 tcp_kstat2_update+0xbe()
  ffffff01f303ec30 read_kstat_data+0xd0()
  ffffff01f303ec70 kstat_ioctl+0x83()
  ffffff01f303ecb0 cdev_ioctl+0x39()
  ffffff01f303ed00 spec_ioctl+0x60()
  ffffff01f303ed90 fop_ioctl+0x55()
  ffffff01f303eeb0 ioctl+0x9b()
  ffffff01f303ef10 _sys_sysenter_post_swapgs+0x153()

And haha, the person that can't find it is indeed us. So the probably
here is that we're likely trying ot do the update function for this,
which is tcp_kstat2_update().

So, for us to have found the netstack originally, then the netstack must
have still been referenced. So, the first question I have is did we find
this and it had no tcp stack? Unlikely since it has one know while we're
tearing down. In addition, based on the disassembly we can see that
we've actually went and gathered all the stats.

So, this means that something else must have decremented the netstack's
existing reference, in particular the fact that there were zones
associated with it. Okay, so what must have happened then for our zone
count to go away?

netstack_zone_destroy() must have been called, which is part of the zone
key system. So this means the zone has come and gone during this entire
time, and that as part of doing the update, we have taken the very last
reference.

So basically, the problem is that we're holding the kstat and we're
trying to grab a hold recursively so we can exit again. Sigh, this isn't
good. It's not clear what, if anything we can do here to handle this
case.

Comments

Comment by Former user
Created at 2017-11-03T21:21:34.082Z

The fundamental problem here is the recursive hold. There are a few different ways that I can see for us to handle this.

The first way is to implement a notion of a recursive hold and if a deletion happens, basically tombstoning things until the final hold is finished. The kstat_hold() code already has logic where it keeps track of the thread that has the hold. It wouldn't be unreasonable for us to add a count and some other flags there. That said, the tombstoning likely will be complicated to get through the system and properly deal with the ramifications.

On the other hand, the real problem here is that when we do the final rele on the netstack, it happens in an arbitrary context which has no relationship to what it's trying to actual clean up. While we've hit this with kstats, there's no guarantee that we can't enter from some other subsystem's re-entrant fashion. What we may want to do is actually pass the netstack clean up into the general system taskq to clean up. The problem with this approach is that we may need to deal with a notion of backpressure. What we don't want to have happening is that we've enqueued all these taskq clean up functions that we can't actually deal with. One way to solve that backpressure problem is to have a limited number of outstanding requests here and allow other threads that want to enqueue to block until that has happened. While a thread may be blocked in this model, it shouldn't end up being deadlockable. Though the number of outstanding queries would probably be an arbitrary number (somewhere in the thousands maybe?).

I think I ultimately prefer the second approach over the first, but I'd like to get some other thoughts on that.

Comment by Dan McDonald
Created at 2017-11-06T19:18:02.538Z

The CR referenced above implements everything in the second approach:

System taskq offload of cleanup.
Counting semaphore enforces number-of-outstanding limit, blocking otherwise.
Only exceptions are taskq_dispatch() failure, and interrupt-context callers do NOT use the semaphore.

Comment by Dan McDonald
Created at 2017-11-08T16:50:39.316Z

There are two test scripts above.

zones.sh captures the running zones, and starts a shutdown/reboot loop with them.

kstat.sh loops endlessly capturing netstack kstats (ip, tcp, udp, etc.).

On a stock system running 4-VMware cores, the kstats hung per this bug in < 5mins. With this fix we are on 10mins so far, and I'll report back after a longer run time.

Comment by Dan McDonald
Created at 2017-11-08T18:57:13.583Z

One hour and 20 minutes of nonstop bringup/teardown with the fix in place.

And per reviewer suggestion, I altered the semaphore's initial value to be 1, to exercise the hit-the-limit path and DTrace probe. I saw <=20ms, with most <=1ms delays on the simple four-zone, four-VMware core test rig above.

Comment by Jira Bot
Created at 2017-11-10T20:15:56.448Z

illumos-joyent commit a5d2534e6a7c00d6d07b25a15157bae43148f887 (branch master, by Dan McDonald)

OS-5423 deadlock between netstack teardown and kstat read
Reviewed by: Jason King <jason.king@joyent.com>
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Ryan Zezeski <rpz@joyent.com>