OS-8627

dlpi_open_zone() messes up DLS reference holds

Status:
In Progress
Created:
2025-02-24T10:24:54.561-0500
Updated:
2026-01-12T09:34:00.487-0500

Description

See 105c32e6a47285a85b802a0cf7bcac1c .

A simpler reproduction is:

1.) snoop -z ZONE -d net0 (Assuming net0 is a NIC in the zone)
2.) vmadm halt ZONE (which will hang)
3.) kill/exit the snoop process in #1.

Step 3 panics the system: closing the /dev/net/ZONE/net0 file descriptor fails the VERIFY() below:

void
dls_devnet_rele(dls_devnet_t *ddp)
{
        mutex_enter(&ddp->dd_mutex);
        VERIFY(ddp->dd_ref > 1);
        ddp->dd_ref--;

The first step is probably to see where the zone shutdown is getting held up. After that, either make appropriate indicators available to dls_devnet_rele() for GZ processes using dlpi_open_zone(), or make the GZ process better able to hold a `dd_ref`.

Comments (2)

Dan McDonald commented on 2025-03-11T17:01:25.663-0400:

The problem is that the cleanup of the dls_devnet_t at zone shutdown time CANNOT POSSIBLY KNOW how many of its dd_ref counts are tied to global-zone snoop -z <ZONE> -d <transient-zone-link> processes.

To expedite zone shutdown, illumos#15167 and friends, which in turn upstreamed various OS- tickets (notably OS-406), perform this shortcut:

        /*
         * Make sure downcalls into softmac_create or softmac_destroy from
         * devfs don't cv_wait on any devfs related condition for fear of
         * deadlock. Return EBUSY if the asynchronous thread started for
         * property loading as part of the post attach hasn't yet completed.
         */
        VERIFY(ddp->dd_ref != 0);
        if ((ddp->dd_ref != 1) || (!wait &&
            (ddp->dd_tref != 0 || ddp->dd_prop_taskid != 0))) {
                int zstatus = 0;

                /*
                 * There are a couple of alternatives that might be going on
                 * here; a) the zone is shutting down and it has a transient
                 * link assigned, in which case we want to clean it up instead
                 * of moving it back to the global zone, or b) its possible
                 * that we're trying to clean up an orphaned vnic that was
                 * delegated to a zone and which wasn't cleaned up properly
                 * when the zone went away.  Check for either of these cases
                 * before we simply return EBUSY.
                 *
                 * zstatus indicates which situation we are dealing with:
                 *       0 - means return EBUSY
                 *       1 - means case (a), cleanup transient link
                 *      -1 - means case (b), orphaned VNIC
                 */
                if (ddp->dd_ref > 1 && ddp->dd_zid != GLOBAL_ZONEID) {
                        zone_t  *zp;

                        if ((zp = zone_find_by_id(ddp->dd_zid)) == NULL) {
                                zstatus = -1;
                        } else {
                                if (ddp->dd_transient) {
                                        zone_status_t s = zone_status_get(zp);

                                        if (s >= ZONE_IS_SHUTTING_DOWN)
                                                zstatus = 1;
                                }
                                zone_rele(zp);
                        }
                }

                if (zstatus == 0) {
                        mutex_exit(&ddp->dd_mutex);
                        rw_exit(&i_dls_devnet_lock);
                        return (EBUSY);
                }

                /*
                 * We want to delete the link, reset ref to 1;
                 */
                if (zstatus == -1) {
                        /* Log a warning, but continue in this case */
                        cmn_err(CE_WARN, "clear orphaned datalink: %s\n",
                            ddp->dd_linkname);
                }
                ddp->dd_ref = 1; /* XXX KEBE ASKS HOW MANY DROPPED REFS? */
        }

Now, until OS-2782, the only refs that would be dropped this way were ones from in-zone processes, which were getting killed off anyway, or the TWO (2) references instantiated by `zoneadmd` at zone boot time.
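To make the failure mode concrete, here is a minimal userspace sketch of the sequence: holds are taken, the shortcut forces the count to 1, and a later release hits the same condition the VERIFY() checks. Everything here is a hypothetical model (the `model_*` names are invented for illustration and none of this is kernel code); it ignores locking and whatever else happens to the dls_devnet_t after the reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dd_ref field of dls_devnet_t. */
typedef struct model_devnet {
	int dd_ref;
} model_devnet_t;

/* Take one hold, as a global-zone open of /dev/net/ZONE/net0 would. */
static void
model_hold(model_devnet_t *ddp)
{
	ddp->dd_ref++;
}

/*
 * Models the "ddp->dd_ref = 1" shortcut. Returns how many holds were
 * forgotten, which is the number the XXX comment asks about.
 */
static int
model_reset_shortcut(model_devnet_t *ddp)
{
	int dropped = ddp->dd_ref - 1;

	ddp->dd_ref = 1;
	return (dropped);
}

/*
 * Models dls_devnet_rele(). Returns false where the kernel's
 * VERIFY(ddp->dd_ref > 1) would panic instead.
 */
static bool
model_rele(model_devnet_t *ddp)
{
	if (!(ddp->dd_ref > 1))
		return (false);
	ddp->dd_ref--;
	return (true);
}

/*
 * Replays the reproduction: start from refs_before_snoop, add gz_holds
 * global-zone holds, apply the shortcut (zone halt), then release each
 * global-zone hold (snoop exit). Returns true if any release would panic.
 */
static bool
model_halt_then_close(int refs_before_snoop, int gz_holds)
{
	model_devnet_t dd = { .dd_ref = refs_before_snoop };

	for (int i = 0; i < gz_holds; i++)
		model_hold(&dd);
	(void) model_reset_shortcut(&dd);
	for (int i = 0; i < gz_holds; i++) {
		if (!model_rele(&dd))
			return (true);
	}
	return (false);
}
```

In this model, any nonzero number of global-zone holds trips the check on the first release after the reset, regardless of the starting count, because the reset leaves no record of who still holds a reference.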

Determining the number 2 involved some DTrace of zone booting’s use of functions in dls_mgmt.c in the kernel, which is the only file that touches the dd_ref involved in the aforementioned panic. The DTrace script is attached, along with annotated dtrace(8) script output and an mdb -kw session that demonstrates how to escape the panic if you get into that situation before any fix for this bug is in place.

Dan McDonald commented on 2025-03-11T17:14:00.424-0400:

The next big question is: How do you solve this without sabotaging OS-406 and OS-2782, AND without requiring kernel mdb hacking?

A first guess might be to return EBUSY more readily, unless the zone has reached ZONE_IS_EMPTY, which means merely changing the ZONE_IS_SHUTTING_DOWN check to ZONE_IS_EMPTY (or making the check > ZONE_IS_SHUTTING_DOWN). Doing that means you can guarantee the dd_ref is 2, EXCEPT when there are global-zone processes using /dev/net/zone/... observability devices. The DTrace testing above was done with the ZONE_IS_EMPTY check in place (via a quick instruction hack in mdb -kw).
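The effect of moving the threshold can be sketched with a small userspace model of the zstatus decision quoted earlier. This is only an illustration under stated assumptions: the `mock_*` enum and function are hypothetical, the enum lists just a few states in what is assumed to be the same relative order as zone_status_t, and the zone lookup is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical, abbreviated mock of the zone states. Only the relative
 * order matters here and is assumed to match zone_status_t.
 */
typedef enum mock_zone_status {
	MOCK_ZONE_IS_RUNNING,
	MOCK_ZONE_IS_SHUTTING_DOWN,
	MOCK_ZONE_IS_EMPTY,
	MOCK_ZONE_IS_DOWN
} mock_zone_status_t;

/*
 * Models the zstatus computation from the quoted code:
 *	 0 - return EBUSY
 *	 1 - case (a), clean up the transient link
 *	-1 - case (b), orphaned VNIC
 * "threshold" is the state at or beyond which case (a) applies; the
 * quoted code uses SHUTTING_DOWN, the proposal would use EMPTY.
 */
static int
mock_zstatus(int dd_ref, bool in_global_zone, bool zone_found,
    bool transient, mock_zone_status_t s, mock_zone_status_t threshold)
{
	int zstatus = 0;

	if (dd_ref > 1 && !in_global_zone) {
		if (!zone_found) {
			zstatus = -1;
		} else if (transient && s >= threshold) {
			zstatus = 1;
		}
	}
	return (zstatus);
}
```

With the threshold at MOCK_ZONE_IS_EMPTY, a zone that is still shutting down yields 0 (EBUSY) instead of 1, so the reference reset is deferred until the in-zone processes are gone. It does nothing by itself for holds taken by global-zone processes.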

So IF checking for ZONE_IS_EMPTY helps and is safe (could it sabotage OS-406?), the question reduces to: Can we either kill snoop -z <zone> or safely wait for it?