OS-6274: links owned by NGZ erroneously marked as on loan

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-08-10T22:14:43.000Z)

Fix Versions

2017-08-17 YEREVAN OPTION (Release Date: 2017-08-17)

Related Links

Description

As reported by user brianewell in smartos-live#737, NGZ ip tunnels stopped persisting across zone reboot. This behavior appeared in the 20170202 PI and was not present in previous releases. After much spelunking I determined that this was caused by a regression introduced in commit 33df115 (part of the OS-5363 work). The regression was a one-line change to link_activate() which marks NGZ links as on loan when they are in fact not loaned because the NGZ created and owns the link.

 	} else if (linkp->ll_zoneid != GLOBAL_ZONEID) {
  		err = zone_add_datalink(linkp->ll_zoneid, linkp->ll_linkid);
 +		linkp->ll_onloan = B_TRUE;
  	}

This regression was easy to introduce because of the subtle nature of this code and lack of comments. I'm going to remove the regressive line, add clarifying comments, and also add some asserts.

Following is a detailed analysis of the issue, how I debugged it, and why my one-line change caused the regression.

To start I verified that PI 20170119 work as expected:

Then I booted the GZ into PI 20170202 and verified the iptun did not show up

At this point I thought I would recreate the iptun and see if I could monitor the zone halt/boot process for the culprit, but instead I received an error from dladm: "object already exists".

[root@sos-zone ~]# dladm create-iptun -T ipv4 -a local=172.16.21.130,remote=10.0.1.251 v4_sys76
dladm: could not create tunnel: object already exists

I didn't expect this. So I used mdb to inspect the dlmgmtd state. Sure enough the iptun exists in dlmgmtd.

> dlmgmt_name_avl::walk avl | ::print dlmgmt_link_t ll_linkid ll_link
ll_linkid = 0x3
ll_link = [ "aggr0" ]
ll_linkid = 0x1
ll_link = [ "e1000g0" ]
ll_linkid = 0x2
ll_link = [ "e1000g1" ]
ll_linkid = 0x4
ll_link = [ "net0" ]
ll_linkid = 0x5
ll_link = [ "v4_sys76" ]

Okay, so if the link already exists, why doesn't it show up (in either the GZ or the NGZ)?

> 806d448::print dlmgmt_link_t
{
    ll_head = 0x806cb10
    ll_link = [ "v4_sys76" ]
    ll_class = 0x80 (DATALINK_CLASS_IPTUN)
    ll_media = 0x80000001
    ll_linkid = 0x5
    ll_zoneid = 0x1
    ll_onloan = 0x1 (B_TRUE)
    ll_name_node = {
        avl_child = [ 0, 0 ]
        avl_parent = 0x806ca18
        avl_child_index = 0x1
        avl_balance = 0
    }
    ll_id_node = {
        avl_child = [ 0, 0 ]
        avl_parent = 0x806ca28
        avl_child_index = 0x1
        avl_balance = 0
    }
    ll_flags = 0x2
    ll_gen = 0
    ll_trans = 0 (0)
}

The important piece of state is ll_flags. There are two flags which can be set:

If a link is not marked as active then it won't show up when you query dladm. When booting the zone on 20170119 the ll_flags for the iptun contained the value 0x3. So the problem is the link is not marked as active on the 20170202 PI. I also noticed that the 20170202 PI set the ll_onloan field to true while the older PI didn't. Why?

The link_activate() function is responsible for marking a link as active. I used dtrace to verify this function was called on the 20170202 PI and that the dlmgmt_link_t had the correct ll_flags value.

[root@testsos ~]# dtrace -n 'pid$target::link_activate:entry { self->lp=args[0]; print(*args[0]); } pid$target::link_activate:return { print(*self->lp); } pid$target::link_activate:return,pid$target::zone_check_datalink:return,pid$target::zone_add_datalink:return { printf("%d", arg1); }' -p 19
dtrace: description 'pid$target::link_activate:entry ' matched 5 probes

... output elided for sake of clarity ...

  0  69115              link_activate:entry dlmgmt_link_t {
    dlmgmt_linkattr_t *ll_head = 0x806cb10
    char [32] ll_link = [ "v4_sys76" ]
    datalink_class_t ll_class = DATALINK_CLASS_IPTUN
    uint32_t ll_media = 0x80000001
    datalink_id_t ll_linkid = 0x9
    zoneid_t ll_zoneid = 0x3
    boolean_t ll_onloan = B_FALSE
    avl_node_t ll_name_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca18
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    avl_node_t ll_id_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca28
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    uint32_t ll_flags = 0x2
    uint32_t ll_gen = 0
    boolean_t ll_trans = B_FALSE
}
  0  69083       zone_check_datalink:return 4294967295
  0  69084         zone_add_datalink:return 0
  0  69075             link_activate:return dlmgmt_link_t {
    dlmgmt_linkattr_t *ll_head = 0x806cb10
    char [32] ll_link = [ "v4_sys76" ]
    datalink_class_t ll_class = DATALINK_CLASS_IPTUN
    uint32_t ll_media = 0x80000001
    datalink_id_t ll_linkid = 0x9
    zoneid_t ll_zoneid = 0x3
    boolean_t ll_onloan = B_TRUE
    avl_node_t ll_name_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca18
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    avl_node_t ll_id_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca28
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    uint32_t ll_flags = 0x3
    uint32_t ll_gen = 0
    boolean_t ll_trans = B_FALSE
}
  0  69075             link_activate:return 0

So the iptun link structure has the correct ll_flags when link_activate() returns but when I inspect the same structure with mdb afterwards the value has changed.

> dlmgmt_id_avl::walk avl | ::print -ta dlmgmt_link_t

... some output elided ...

806d448 dlmgmt_link_t {
    806d448 dlmgmt_linkattr_t *ll_head = 0x806cb10
    806d44c char [32] ll_link = [ "v4_sys76" ]
    806d46c datalink_class_t ll_class = 0x80 (DATALINK_CLASS_IPTUN)
    806d470 uint32_t ll_media = 0x80000001
    806d474 datalink_id_t ll_linkid = 0x9
    806d478 zoneid_t ll_zoneid = 0x3
    806d47c boolean_t ll_onloan = 0x1 (B_TRUE)
    806d480 avl_node_t ll_name_node = {
        806d480 struct avl_node *[2] avl_child = [ 0, 0 ]
        806d488 struct avl_node *avl_parent = 0x806ca18
        806d48c unsigned short avl_child_index = 0x1
        806d48e short avl_balance = 0
    }
    806d490 avl_node_t ll_id_node = {
        806d490 struct avl_node *[2] avl_child = [ 0, 0 ]
        806d498 struct avl_node *avl_parent = 0x806ca28
        806d49c unsigned short avl_child_index = 0x1
        806d49e short avl_balance = 0
    }
    806d4a0 uint32_t ll_flags = 0x2
    806d4a4 uint32_t ll_gen = 0
    806d4a8 boolean_t ll_trans = 0 (0)
}

Sometime after link_activate() completes some other process changed the ll_flags value. My next question was: where is link_activate() called and what comes after it that might affect the ll_flags? I did another trace and got this stack.

              dlmgmtd`link_activate
              dlmgmtd`dlmgmt_upid+0x5c
              dlmgmtd`dlmgmt_handler+0x96
              libc.so.1`__door_return+0x3d

The dlmgmt_upid() function calls dlmgmt_write_db_entry() after link_activate() and that can change the flags. But dtrace proved the ll_flags value was still 0x3 after returning from this function.

  0  69116     dlmgmt_write_db_entry:return dlmgmt_link_t {
    dlmgmt_linkattr_t *ll_head = 0x806cb10
    char [32] ll_link = [ "v4_sys76" ]
    datalink_class_t ll_class = DATALINK_CLASS_IPTUN
    uint32_t ll_media = 0x80000001
    datalink_id_t ll_linkid = 0xf
    zoneid_t ll_zoneid = 0x6
    boolean_t ll_onloan = B_TRUE
    avl_node_t ll_name_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca18
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    avl_node_t ll_id_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca28
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    uint32_t ll_flags = 0x3
    uint32_t ll_gen = 0
    boolean_t ll_trans = B_FALSE
}

With no obvious questions left I then asked cscope to show me all places where ll_flags is modified. As I walked through the list I used dtrace to eliminate candidates one at a time -- until I reached dlmgmt_destroy_common(). I would not have expected this function to show up during zone boot but sure enough it was being called somehow, and by someone. Who?

[root@testsos ~]# dtrace -n 'pid$target::dlmgmt_destroy_common:entry { print(*args[0]); printf("flags: %x\n", arg1); }' -p 19
dtrace: description 'pid$target::dlmgmt_destroy_common:entry ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0  78123      dlmgmt_destroy_common:entry dlmgmt_link_t {
    dlmgmt_linkattr_t *ll_head = 0x806b3e8
    char [32] ll_link = [ "v4_sys76" ]
    datalink_class_t ll_class = DATALINK_CLASS_IPTUN
    uint32_t ll_media = 0x80000001
    datalink_id_t ll_linkid = 0x21
    zoneid_t ll_zoneid = 0xf
    boolean_t ll_onloan = B_TRUE
    avl_node_t ll_name_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca18
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    avl_node_t ll_id_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806ca28
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    uint32_t ll_flags = 0x3
    uint32_t ll_gen = 0
    boolean_t ll_trans = B_FALSE
}flags: 1

It was at this point I got really frustrated. I tried to use my normal trick of truss -f -t \!all -u libdladm:: vmadm start to find the call to dladm_destroy_datalink_id() but it didn't work (dladm_destroy_datalink_id() is the libdladm function responsible for making the door call which results in dlmgmt_destroy_common()).

Since there is no easy way to track door calls it was at this point I decided to go nuclear and use the dtrace stop action to stop dlmgmtd when it hits dlmgmt_destroy_common(). Then I used mdb -k to inspect the door info for the dlmgmtd threads and look for my culprit.

> 0t27::pid2proc | ::walk thread | ::print kthread_t t_door | ::print door_data_t d_server.d_caller
mdb: failed to read d_server.d_caller pointer at 60: no mapping for address
d_server.d_caller = 0xffffff02f5eac840
d_server.d_caller = 0xffffff02f9250860
d_server.d_caller = 0
d_server.d_caller = 0
d_server.d_caller = 0
d_server.d_caller = 0
> 0xffffff02f5eac840::print kthread_t t_procp | ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  28976  28975  28491  28491      0 0x4a004000 ffffff030ea5a020 /sbin/dladm up-iptun
> 0xffffff02f9250860::print kthread_t t_procp | ::ps
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  28859  28851  28491  28491      0 0x4a004000 ffffff030a825018 ndd

[root@testsos ~]# mdb -p 28976
Loading modules: [ ld.so.1 libumem.so.1 libcmdutils.so.1 libc.so.1 ]
> $C
08047bf8 libc_hwcap1.so.1`__door_call+0x28(4, 8047c48, 8047bb0, 0, fedb0000, 8094fd8)
08047c18 libc_hwcap1.so.1`door_call+0xed(4, 8047c48, feffc540, fed9d3b6)
08047c78 libdladm.so.1`dladm_door_call+0x76(8094fd8, 8047ce8, 18, 8047cd0, 8047ccc, fedb0034)
08047d18 libdladm.so.1`dladm_walk_datalink_id+0x9a(fedab9c6, 8094fd8, 0, 80, 0, 1)
08047d58 libdladm.so.1`dladm_iptun_up+0x41(8094fd8, 0, 8047d88, fed8da8b)
08047d88 do_up_iptun+0x65(1)
08047dc8 main+0xb9(8047dbc, fef1c6e8, 8047df0, 8056ac7, 2, 8047dfc)
08047df0 _start+0x83(2, 8047ed8, 8047ee4, 0, 8047eed, 8047f02)

The culprit is do_up_iptun() caused by the dladm up-iptun call. Using ptree I then realized this was happening as part of the zone boot under the network/iptun svc startup. At this point it was a matter of doing a zlogin to sos-zone and running truss on dladm up-iptun to find the real reason why dladm_destroy_datalink_id() is called.

33485/1@1:            -> libdladm:i_iptun_get_dbparams(0x8094fd8, 0x8047360, 0x820, 0x0)
33485/1@1:              -> libdladm:dladm_datalink_id2info()
33485/1@1:                -> libdladm:dladm_door_call()
33485/1@1:                  -> libdladm:dladm_door_fd(0x8094fd8, 0x8047234, 0x3, 0xfed9d3b6)
33485/1@1:                  <- libdladm:dladm_door_fd() = 0
33485/1@1:                  -> libdladm:dladm_errno2status(0x0, 0x8047238, 0x3, 0xfed9d3b6)
33485/1@1:                  <- libdladm:dladm_errno2status() = 0
33485/1@1:                <- libdladm:dladm_door_call() = 0
33485/1@1:              <- libdladm:dladm_datalink_id2info() = 0
33485/1@1:              -> libdladm:dladm_getsnap_conf()
33485/1@1:                -> libdladm:dladm_door_call()
33485/1@1:                  -> libdladm:dladm_door_fd(0x8094fd8, 0x8047264, 0x8047288, 0xfed9d3b6)
33485/1@1:                  <- libdladm:dladm_door_fd() = 0
33485/1@1:                  -> libdladm:dladm_errno2status(0xd, 0x8047268, 0x8047288, 0xfed9d3b6)
33485/1@1:                  <- libdladm:dladm_errno2status() = 13
33485/1@1:                <- libdladm:dladm_door_call() = 13
33485/1@1:              <- libdladm:dladm_getsnap_conf() = 13
33485/1@1:            <- libdladm:i_iptun_get_dbparams() = 13

So the link is marked as inactive because dladm_getsnap_conf() fails with DLADM_STATUS_DENIED which is mapped to EACCESS. Looking at the dladm_getsnap_conf() code I see the following.

	if (linkp->ll_onloan && zoneid != GLOBAL_ZONEID) {
		/*
		 * The caller is in a non-global zone and the persistent
		 * configuration belongs to the global zone.
		 */
		err = EACCES;
		goto done;
	}

What this is saying is that if a link is marked "on loan" (meaning it's technically owned/created by the GZ but assigned/loaned to the NGZ) and the zone calling dladm_getsnap_conf() is an NGZ then return EACCESS because the configuration of the link is up to the GZ, not the NGZ. This code is correct and should be enforced, but why is it tripping in PI 20170202 and not 20170119? It comes back to my earlier observation that in the 20170202 PI we marked the iptun as "on loan" but not in the older one. Why?

Well as it turns out while fixing OS-5363 I fixed what I thought was a bug in link_activate().

 	} else if (linkp->ll_zoneid != GLOBAL_ZONEID) {
  		err = zone_add_datalink(linkp->ll_zoneid, linkp->ll_linkid);
 +		linkp->ll_onloan = B_TRUE;
  	}

When I first read this code it was my understanding that anytime we added a link to a zone's datalink list, by calling zone_add_datalink(), that link was then considered "on loan". My understanding was incorrect. The link_activate() code has a subtleness that eluded me. There are two cases in link_activate():

1. The link is under an NGZ's datalink list but it's ll_linkid doesn't reflect that (e.g., the link is found under zoneid 3 but ll_linkid is 0). In this case the link is owned by the GZ but is being loaned to an NGZ and the link state should be updated accordingly. We get in this situation when dlmgmtd is restated for some reason (it must resync it's in-memory state with the state of the system).

2. The link is NOT under any NGZ's (zone_check_datalink() is only concerned with NGZs) datalink list but its ll_zoneid holds the value of an NGZ. This indicates that the link is owned by an NGZ but for whatever reason is not currently under the NGZ's datalink list (e.g., because we are booting the zone and we now need to assign the link to its list).

So the fix is to revert that one line change as well as add some clarifying comments and also some asserts to prevent further confusion in the future.

Comments

Comment by Ryan Zezeski
Created at 2017-08-10T21:10:33.000Z
I tested my patch in the following ways.

[root@testsos ~]# dtrace -n 'pid$target::link_activate:entry /strstr(copyinstr((uintptr_t)&(args[0]->ll_link[0])), "v4_sys76") != NULL/ { self->lp=args[0]; print(*args[0]); } pid$target::link_activate:return { print(*self->lp); self->lp=0; }' -p $(pgrep dlmgmtd)
dtrace: description 'pid$target::link_activate:entry ' matched 2 probes
dtrace: error on enabled probe ID 2 (ID 71232: pid22:dlmgmtd:link_activate:return): invalid address (0x0) in action #1
CPU     ID                    FUNCTION:NAME
  1  71231              link_activate:entry dlmgmt_link_t {
    dlmgmt_linkattr_t *ll_head = 0x806c3e8
    char [32] ll_link = [ "v4_sys76" ]
    datalink_class_t ll_class = DATALINK_CLASS_IPTUN
    uint32_t ll_media = 0x80000001
    datalink_id_t ll_linkid = 0x9
    zoneid_t ll_zoneid = 0x3
    boolean_t ll_onloan = B_FALSE
    avl_node_t ll_name_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806db48
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    avl_node_t ll_id_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806db58
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    uint32_t ll_flags = 0x2
    uint32_t ll_gen = 0
    boolean_t ll_trans = B_FALSE
}
  1  71232             link_activate:return dlmgmt_link_t {
    dlmgmt_linkattr_t *ll_head = 0x806c3e8
    char [32] ll_link = [ "v4_sys76" ]
    datalink_class_t ll_class = DATALINK_CLASS_IPTUN
    uint32_t ll_media = 0x80000001
    datalink_id_t ll_linkid = 0x9
    zoneid_t ll_zoneid = 0x3
    boolean_t ll_onloan = B_FALSE
    avl_node_t ll_name_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806db48
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    avl_node_t ll_id_node = {
        struct avl_node *[2] avl_child = [ 0, 0 ]
        struct avl_node *avl_parent = 0x806db58
        unsigned short avl_child_index = 0x1
        short avl_balance = 0
    }
    uint32_t ll_flags = 0x3
    uint32_t ll_gen = 0
    boolean_t ll_trans = B_FALSE
}

[root@testsos ~]# mdb -p $(pgrep dlmgmtd) -e 'dlmgmt_name_avl::walk avl | ::print -ta dlmgmt_link_t'
...
806e448 dlmgmt_link_t {
    806e448 dlmgmt_linkattr_t *ll_head = 0x806c3e8
    806e44c char [32] ll_link = [ "v4_sys76" ]
    806e46c datalink_class_t ll_class = 0x80 (DATALINK_CLASS_IPTUN)
    806e470 uint32_t ll_media = 0x80000001
    806e474 datalink_id_t ll_linkid = 0x9
    806e478 zoneid_t ll_zoneid = 0x3
    806e47c boolean_t ll_onloan = 0 (0)
    806e480 avl_node_t ll_name_node = {
        806e480 struct avl_node *[2] avl_child = [ 0, 0 ]
        806e488 struct avl_node *avl_parent = 0x806db48
        806e48c unsigned short avl_child_index = 0x1
        806e48e short avl_balance = 0
    }
    806e490 avl_node_t ll_id_node = {
        806e490 struct avl_node *[2] avl_child = [ 0, 0 ]
        806e498 struct avl_node *avl_parent = 0x806db58
        806e49c unsigned short avl_child_index = 0x1
        806e49e short avl_balance = 0
    }
    806e4a0 uint32_t ll_flags = 0x3
    806e4a4 uint32_t ll_gen = 0
    806e4a8 boolean_t ll_trans = 0 (0)
}

[root@testsos ~]# pgrep dlmgmtd
22
[root@testsos ~]# pkill dlmgmtd
[root@testsos ~]# pgrep dlmgmtd
6548


[root@testsos ~]# mdb -p $(pgrep dlmgmtd) -e 'dlmgmt_name_avl::walk avl | ::print -ta dlmgmt_link_t'
80707c8 dlmgmt_link_t {
    80707c8 dlmgmt_linkattr_t *ll_head = 0x80708b8
    80707cc char [32] ll_link = [ "v4_sys76" ]
    80707ec datalink_class_t ll_class = 0x80 (DATALINK_CLASS_IPTUN)
    80707f0 uint32_t ll_media = 0x80000001
    80707f4 datalink_id_t ll_linkid = 0x9
    80707f8 zoneid_t ll_zoneid = 0x3
    80707fc boolean_t ll_onloan = 0 (0)
    8070800 avl_node_t ll_name_node = {
        8070800 struct avl_node *[2] avl_child = [ 0, 0 ]
        8070808 struct avl_node *avl_parent = 0x806db88
        807080c unsigned short avl_child_index = 0x1
        807080e short avl_balance = 0
    }
    8070810 avl_node_t ll_id_node = {
        8070810 struct avl_node *[2] avl_child = [ 0, 0 ]
        8070818 struct avl_node *avl_parent = 0x806db98
        807081c unsigned short avl_child_index = 0x1
        807081e short avl_balance = 0
    }
    8070820 uint32_t ll_flags = 0x3
    8070824 uint32_t ll_gen = 0
    8070828 boolean_t ll_trans = 0 (0)
}

Comment by Bot Bot
Created at 2017-08-10T22:13:29.000Z

illumos-joyent commit c08af99 (branch master, by Ryan Zezeski)

OS-6274 links owned by NGZ erroneously marked as on loan
    Reviewed by: Robert Mustacchi <rm@joyent.com>
    Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
    Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>