OS-7410: vmm_zsd assert tripped on zone shutdown


Issue Type:Bug
Priority:4 - Normal
Created at:2018-11-29T16:40:31.807Z
Updated at:2018-12-05T20:07:55.138Z


Created by:Michael Zeller
Reported by:Michael Zeller
Assigned to:Mike Gerdts [X]


Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-12-05T20:07:49.543Z)

Fix Versions

2018-12-06 Grizz (Release Date: 2018-12-06)

Related Links




thoth id for the dump: f6102c05e55a79c28655192652f8b32d

I have seen this crash a few times in my coal setup which is running as a KVM guest on my linux desktop. Upon issuing the "reboot" command my VM will trip the following assert during zone shutdown.

assertion failed: !zsd->vz_active, file: ../../i86pc/io/vmm/vmm_zsd.c, line: 162

fffffe0021b5cbf0 genunix:process_type+16c0d0 ()
fffffe0021b5cc20 vmm:vmm_zsd_destroy+e8 ()
fffffe0021b5cca0 genunix:zsd_apply_destroy+165 ()
fffffe0021b5ccf0 genunix:zsd_apply_all_keys+5f ()
fffffe0021b5cd40 genunix:zone_zsd_callbacks+10e ()
fffffe0021b5cda0 genunix:zone_destroy+124 ()
fffffe0021b5cef0 genunix:zone+1e7 ()
fffffe0021b5cf00 unix:brand_sys_syscall+320 ()

dumping to /dev/zvol/dsk/zones/dump, offset 65536, content: kernel
> $C
fffffe0021b5cbc0 vpanic()
fffffe0021b5cbf0 0xfffffffffbe156b8()
fffffe0021b5cc20 vmm_zsd_destroy+0xe8(5, fffffe17a84c8200)
fffffe0021b5cca0 zsd_apply_destroy+0x165(0, 1, fffffe17da5898c0, 25)
fffffe0021b5ccf0 zsd_apply_all_keys+0x5f(fffffffffbf23d00, fffffe17da5898c0)
fffffe0021b5cd40 zone_zsd_callbacks+0x10e(fffffe17da5898c0, 2)
fffffe0021b5cda0 zone_destroy+0x124(5)
fffffe0021b5cef0 zone+0x1e7(1, 5, fffffc7fef2ba59a, 0, 0)
fffffe0021b5cf00 sys_syscall+0x2a5()

[3]> ::zone
            ADDR   ID STATUS        NAME                PATH
fffffffffc02be80    0 running       global              /
fffffe17da5898c0    5 dead          12675ef2-10c7-41... /zones/12675ef2-10c7-...
fffffe1716b06c80   33 shutting_down 7135c9e3-9f90-62... /zones/7135c9e3-9f90-...
[root@headnode (coal) /var/crash/volatile]# vmadm list | grep 12675ef2-10c7-41
12675ef2-10c7-41df-9476-c730972cb905  OS    2048     running           manatee0

In this case the zone has made it past the zsd_apply_shutdown phase and has moved to the zsd_apply_destroy phase. The zsd_entry that is causing us to die is:

> fffffe17da5899f0::walk list list | ::print struct zsd_entry zsd_key | ::grep .==25 | ::eval <list=K | ::print struct zsd_entry
    zsd_key = 0x25
    zsd_data = 0xfffffe17a84c8200
    zsd_create = vmm_zsd_create
    zsd_shutdown = vmm_zsd_shutdown
    zsd_destroy = vmm_zsd_destroy
    zsd_linkage = {
        list_next = 0xfffffe17da589a00
        list_prev = 0xfffffe17b6367da8
    zsd_flags = 0x204
    zsd_cv = {
        _opaque = 0

At first I thought it was odd that any bhyve related code would be showing up here. However, attaching the vmm module will result in zone_key_create() being called which will insert the zsd into the list for all active zones. When a non bhyve zone is shutting down it should basically be a no-op.

If we take a look at the vmm_zsd_t entry itself (aka zsd_data above) we see:

> fffffe17a84c8200::print vmm_zsd_t
    vz_vmms = {
        list_size = 0xb0
        list_offset = 0x90
        list_head = {
            list_next = 0xfffffe17a84c8210
            list_prev = 0xfffffe17a84c8210
    vz_linkage = {
        list_next = 0
        list_prev = 0
    vz_active = 0x1 (B_TRUE)
    vz_zoneid = 0x5
    vz_lock = {
        _opaque = [ 0xfffffe17158e4440 ]

This causes us to trip ASSERT(!zsd->vz_active). Digging in further it looks like vz_active is set to B_FALSE in vmm_zsd_shutdown() via zsd_apply_all_keys() during the zsd_apply_shutdown phase. It's odd that it would have been set back to B_TRUE between the shutdown and destroy phases.

Scanning the code for where vz_active gets set to true really only shows it happening during the vmm module initialization step (vmm_zsd_create). If we take a look at the modctl we see something interesting:

> fffffe171a710dd0::print modctl_t mod_loadcnt
mod_loadcnt = 0x4

It appears that the module has been loaded multiple times on this box. If I look at a fresh boot of coal as a comparison I see:

> fffffe1711449b98::print modctl_t mod_modname mod_loaded mod_loadcnt
mod_modname = 0xfffffe1718b86700 "vmm"
mod_loaded = '\0'
mod_loadcnt = 0x2

Finally, that made me wonder if somehow the module was being loaded when I issued the "reboot" command and we hit a race condition where the zsd shutdown phase for a zone has ran but the module then gets loaded and a new zsd is inserted into the list before the zsd destroy phase has ran. Sure enough the following dtrace shows us that something is causing the module to be loaded again:

[root@headnode (coal) ~]# dtrace -Zn 'vmm_zsd_init:entry {stack()}'
dtrace: description 'vmm_zsd_init:entry ' matched 0 probes
CPU     ID                    FUNCTION:NAME
  8  85827               vmm_zsd_init:entry


Comment by Michael Zeller
Created at 2018-11-29T16:46:06.451Z

Just another note.  Since the zsd is only added to zones in the zone_active list, I thought it was worth double checking that the zone that caused the panic was present.

> zone_active::walk list | ::print zone_t zone_name
zone_name = 0xfffffffffbf51fdb "global"
zone_name = 0xfffffe173f432980 "12675ef2-10c7-41df-9476-c730972cb905"
zone_name = 0xfffffe1715c27280 "7135c9e3-9f90-6280-d061-a86b02963bbf"

Comment by Michael Zeller
Created at 2018-11-29T17:07:53.469Z

Also worth noting that debug kernels will load/unload modules.  Which is likely what is happening here.

Comment by Mike Gerdts [X]
Created at 2018-12-03T14:32:19.350Z

As discussed in chat, it may be best to make it so that zone_key_create() is a bit more exclusive.

 713         mutex_enter(&zonehash_lock);    /* stop the world */
 714         for (zone = list_head(&zone_active); zone != NULL;
 715             zone = list_next(&zone_active, zone)) {
 716                 zone_status_t status;
 718                 mutex_enter(&zone->zone_lock);
 720                 /* Skip zones that are on the way down or not yet up */
 721                 status = zone_status_get(zone);
 722                 if (status >= ZONE_IS_DOWN ||
 723                     status == ZONE_IS_UNINITIALIZED) {
 724                         mutex_exit(&zone->zone_lock);
 725                         continue;
 726                 }

In particular, line 722 should use ZONE_IS_SHUTTING_DOWN rather than ZONE_IS_DOWN. Let's investigate this possibility.

As a zone is coming down, zone_shutdown is called at least once and zone_destroy is called once. At least one call to zone_shutdown will call zone_zsd_callbacks(zone, ZSD_SHUTDOWN) with zone_status >= ZONE_IS_SHUTTING_DOWN. So far so good.

Let's think about the purpose of the zsd_destroy and zsd_shutdown callbacks. They are generally there to clean up some kernel state that likely came from some userspace process doing something that caused some kernel state to be allocated. When the zone is at ZONE_IS_SHUTTING_DOWN, are we sure that there could be no new kernel state allocated?

During zone_shutdown, the state is set to at least ZONE_IS_SHUTTING_DOWN. While at ZONE_IS_SHUTTING_DOWN or greater, getproc() blocks new process creation. Likewise, zone_enter() will fail. Immediately after setting zone_shutdown ensures the state is ZONE_IS_SHUTTING_DOWN or greater, it drops some locks then calls killall().

There are other paths to zone_status_set(ZONE_IS_SHUTTING_DOWN). zone_kadmin() sets the state then makes a door call to zoneadmd which will perform a cleanup that includes zone_shutdown and zone_destroy. zone_shutdown_global is less complete: after setting the state, it calls killall() which will quickly nuke any zoneadmd that may be trying to perform cleanup. I think that's OK, as I don't see how any state that may exist could prevent the completion of the global zone halt or reboot.

I see a couple (unlikely) races that could be introduced by making the suggested change at line 722. They each involve some already running process that happens to make the right system call to perform some kernel state allocation between the time that the state is set to ZONE_STATE_SHUTTING_DOWN and the time that killall runs. This window is the longest when zone_kadmin() is involved.

Thus, the proper fix for this is to harden vmm_zsd_destroy() against the race that led to the panic.

Comment by Mike Gerdts [X]
Created at 2018-12-03T22:53:25.423Z


I'm not able to reproduce the original problem because modunload -i gets EBUSY while trying to unload the vmm module. I have done several passes of all tests in /usr/vm/tests, which leads to zones of each brand being booted and shutdown/halted several times.

Comment by Michael Zeller
Created at 2018-12-05T19:31:54.622Z

Note: The reason I was able to hit this edge case in the first place is because I am using an AMD CPU.  Therefore the sdev guard for unloading is not setup, allowing me to load/unload the vmm module.

To test I rebooted coal many times to make sure I no longer triggered the assert. The fact that this is a race condition makes it hard to be sure the fix is actually doing its job.

To be slightly more certain I ran the following while issuing "reboot"

[root@headnode (coal) ~]# dtrace -Zn 'vmm_zsd_init:entry {stack()}'
dtrace: description 'vmm_zsd_init:entry ' matched 0 probes

While executing reboot I saw the console message "reboot: Halting 33 zones." at roughly the same time as the dtrace probe firing that the vmm module was being loaded. (This doesn't happen every reboot).  I was successfully able to reboot every time, including the time when the dtrace probe fired.

Comment by Jira Bot
Created at 2018-12-05T20:07:55.138Z

illumos-joyent commit a454ac291ee3d52860e81c8b77efa408915b3aa3 (branch master, by Mike Gerdts)

OS-7410 vmm_zsd assert tripped on zone shutdown
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Mike Zeller <mike.zeller@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>