OS-6805: packet flow over a defaulted LACP port

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2018-03-22T17:50:14.359Z
Updated at:2019-09-11T20:15:02.676Z

People

Created by:Jeff Emershaw [X]
Reported by:Jeff Emershaw [X]
Assigned to:Ryan Zezeski [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-09-11T20:15:02.659Z)

Fix Versions

2019-09-12 Art Vandelay (Release Date: 2019-09-12)

Related Issues

Related Links

Description

Dladm show's that ixgbe3 and ixgbe0 are in the defaulted state, this means it shouldn't be send traffic out over these interfaces until they negotiated into sync state. In the snoop capture it looks as though there are arp packets sending out ixgbe2 and when that happens it seems like the icmp packets stop responding. Attached is the snoop captures on the nics.

These CN's aggrs were configured incorrectly, ixgbe4 and ixgbe5 should be part of aggr1 and aggr2.

[jemershaw@HBA08638 (ap-southeast-1a) /var/tmp]$ dladm show-phys -m
LINK         SLOT     ADDRESS            INUSE CLIENT
ixgbe4       primary  90:e2:ba:d1:97:30  yes  ixgbe4
ixgbe0       primary  c:c4:7a:de:88:8a   yes  aggr1-ixgbe0
ixgbe2       primary  c:c4:7a:de:88:8a   yes  aggr1-ixgbe2
ixgbe1       primary  c:c4:7a:de:88:8b   yes  aggr2-ixgbe1
ixgbe5       primary  90:e2:ba:d1:97:31  yes  ixgbe5
ixgbe3       primary  c:c4:7a:de:88:8b   yes  aggr2-ixgbe3
[jemershaw@HBA08638 (ap-southeast-1a) /var/tmp]$ dladm show-phys
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
ixgbe4       Ethernet             up         10000  full      ixgbe4
ixgbe0       Ethernet             up         10000  full      ixgbe0
ixgbe2       Ethernet             down       0      unknown   ixgbe2
ixgbe1       Ethernet             up         10000  full      ixgbe1
ixgbe5       Ethernet             up         10000  full      ixgbe5
ixgbe3       Ethernet             down       0      unknown   ixgbe3
[jemershaw@HBA08638 (ap-southeast-1a) /var/tmp]$ dladm show-aggr -L
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr1       ixgbe2       yes          no   no   no   yes       no
--          ixgbe0       yes          yes  yes  yes  no        no
aggr2       ixgbe3       yes          no   no   no   yes       no
--          ixgbe1       yes          yes  yes  yes  no        no
[jemershaw@HBA08638 (ap-southeast-1a) /var/tmp]$ dladm show-aggr -x
LINK        PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr1       --             10000Mb full   up        c:c4:7a:de:88:8a   --
            ixgbe2         0Mb  unknown   down      c:c4:7a:de:88:8a   standby
            ixgbe0         10000Mb full   up        90:e2:ba:d1:97:34  attached
aggr2       --             10000Mb full   up        c:c4:7a:de:88:8b   --
            ixgbe3         0Mb  unknown   down      c:c4:7a:de:88:8b   standby
            ixgbe1         10000Mb full   up        90:e2:ba:d1:97:35  attached
[jemershaw@HBA08638 (ap-southeast-1a) /var/tmp]$ uname -a
SunOS HBA08638 5.11 joyent_20171008T114043Z i86pc i386 i86pc

Comments

Comment by Ryan Zezeski [X]
Created at 2019-09-10T17:52:53.110Z

This was seen again in SWSUP-1515. The problem is hit only under a confluence of conditions:

1. The aggr must have more then one client. Otherwise mac_tx_send will avoid calling mac_bcast_send() and instead head directly to mac_provider_tx(), passing the ring supplied as argument which was properly selected via aggr's mac_tx_aggr_mode().

2. One of the aggr ports must be down and it must be the first one added to the aggr, causing mi_default_tx_ring() to be set to the first pseudo ring of the downed port (this initiation is part of i_mac_group_add_ring()).

3. LACP active mode will recognize that the other ports are up and call aggr_set_coll_dist() with B_TRUE in order to call aggr_send_port_enable() to enable the port. This will place the port on lg_tx_ports which will allow the port's pseudo rings to be selected as part of aggr_find_tx_ring() (part of step 1). It will also make a call to aggr_grp_update_default() to make sure the mi_defaut_tx_ring is updated to the first ring or the first port in lg_tx_ports. However, this will only happen if the pseudo ring is marked MR_INUSE, otherwise the default ring will remain unchanged, leaving us at the mercy of the luck of the draw from step 2. And sure enough, when this LACP event fires the ring is not marked as in-use because it has no clients at the moment.

Here's an example of mac_hwring_set_default() being called while the ring state is not MR_INUSE (that's the 0x0 value at the end of the mac_hwring_set_default() line).

13619438082931 aggr_send_port_enable 0xfffffe5967ffc9e8

              aggr`aggr_set_coll_dist+0x7c
              aggr`lacp_mux_sm+0x1e0
              aggr`lacp_receive_sm+0x309
              aggr`aggr_lacp_rx+0xc1
              aggr`aggr_lacp_rx_thread+0xd2
              unix`thread_start+0xb
13619438093784 aggr_grp_update_default 0xfffffe595dd158f8

              aggr`aggr_send_port_enable+0x86
              aggr`aggr_set_coll_dist+0x7c
              aggr`lacp_mux_sm+0x1e0
              aggr`lacp_receive_sm+0x309
              aggr`aggr_lacp_rx+0xc1
              aggr`aggr_lacp_rx_thread+0xd2
              unix`thread_start+0xb
13619438099257 mac_hwring_set_default 0xfffffe59513b4008 0xfffffe5c7a74eca8 0x0

The end result of these three things is that the aggr's mac will be put in a state where its default Tx ring is pointing to a downed port instance and it has no way to notice this fact and recover. You might think that we need a call to aggr_send_port_disable() to rectify the situation but the downed port is never enabled in the first place, so it's never part of the lg_tx_ports list. This is why, if you only have one aggr client (i.e., you place IP directly on top of the primary client), the broadcast traffic will work just fine, because it will select a ring from the working lg_tx_ports, but if you have two or more clients it will fallback to mac_bcast_send() and try to use the bad mi_default_tx_ring.

I think the real issue here is that mac_hwring_set_default() shouldn't concern itself with ring state. It's sole purpose is to make sure that the mi_default_tx_ring() is the first ring of the first port in lg_tx_ports.


Comment by Ryan Zezeski [X]
Created at 2019-09-11T16:16:48.542Z
Updated at 2019-09-11T17:47:11.124Z

I tested this by setting up two X540-T2s back-to-back and creating an aggr over each one. These aggrs used an L4 policy with active LACP. Then I pulled a a single cable rebooted the host, which would cause a port to be down during aggr creation before their are any clients (as opposed to pulling a cable later which is appropriately dealt with).

Patched Platform

# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
ixgbe0      phys      9000   up       --         --
ixgbe1      phys      9000   down     --         --
igb0        phys      1500   up       --         --
igb1        phys      1500   up       --         --
ixgbe2      phys      9000   down     --         --
ixgbe3      phys      9000   up       --         --
aggr0       aggr      9000   up       --         ixgbe0,ixgbe1
aggr1       aggr      9000   up       --         ixgbe2,ixgbe3

# dladm show-aggr
LINK            POLICY   ADDRPOLICY           LACPACTIVITY  LACPTIMER   FLAGS
aggr0           L4       auto                 active        short       -----
aggr1           L4       auto                 active        short       -----

# dladm show-aggr -L
LINK        PORT         AGGREGATABLE SYNC COLL DIST DEFAULTED EXPIRED
aggr0       ixgbe0       yes          yes  yes  yes  no        no
--          ixgbe1       yes          no   no   no   yes       no
aggr1       ixgbe2       yes          no   no   no   yes       no
--          ixgbe3       yes          yes  yes  yes  no        no

<GZ> root@gaia [~]
# dladm show-aggr -x
LINK        PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr0       --             10000Mb full   up        a0:36:9f:c2:3b:9c  --
            ixgbe0         10000Mb full   up        a0:36:9f:c2:3b:9c  attached
            ixgbe1         0Mb  unknown   down      a0:36:9f:c2:3b:9e  standby
aggr1       --             10000Mb full   up        b4:96:91:5:48:38   --
            ixgbe2         0Mb  unknown   down      b4:96:91:5:48:38   standby
            ixgbe3         10000Mb full   up        b4:96:91:5:48:3a   attached

On the current PI this would result in aggr1 still having it's mi_default_tx_ring set to an aggr pseudo ring ultimately pointing to ixbge2. On the patched platform, lx_tx_ports[0] and mi_default_tx_ring currently points to ixgbe3 (even though the first port in lg_ports is ixgbe2).

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs cpc ip hook neti sockfs arp usba stmf_sbd stmf zfs mm sd lofs idm sata random logindmux ptm sppp nfs mr_sas ]
> ::walk mac_client_impl_cache | ::printf "0x%p %s 0x%p %s 0x%p 0x%p\n" mac_client_impl_t . mci_name mci_mip mci_mip->mi_name mci_mip->mi_driver mci_mip->mi_default_tx_ring ! grep -w aggr1
0xfffffe5971b83ab0 aggr1 0xfffffe59513c4008 aggr1001 0xfffffe595dbe28f8 0xfffffe5c7af78ca8
0xfffffe5971b84720 aggr1-ixgbe3 0xfffffe59513cab98 ixgbe3 0xfffffe5951d6e000 0xfffffe5951dc78d8
0xfffffe5971b85390 aggr1-ixgbe2 0xfffffe59513ce160 ixgbe2 0xfffffe593cac3000 0xfffffe5951d411d8


> 0xfffffe595dbe28f8::print aggr_grp_t lg_ports[0].lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe2" ]


> 0xfffffe595dbe28f8::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring
mi_default_tx_ring = 0xfffffe5c7af78ca8
> 0xfffffe5c7af78ca8::print mac_ring_t mr_info.mri_driver | ::print aggr_pseudo_tx_ring_t atr_hw_rh | ::print mac_ring_t mr_info.mri_driver | ::print ixgbe_tx_ring_t ixgbe | ::print ixgbe_t instance
instance = 0x3


> 0xfffffe595dbe28f8::print aggr_grp_t lg_tx_ports[0]->lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe3" ]

I also verified aggr0.

> ::walk mac_client_impl_cache | ::printf "0x%p %s 0x%p %s 0x%p 0x%p\n" mac_client_impl_t . mci_name mci_mip mci_mip->mi_name mci_mip->mi_driver mci_mip->mi_default_tx_ring ! grep -w aggr0
0xfffffe595142f008 aggr0 0xfffffe59513c75d0 aggr1000 0xfffffe595dca6128 0xfffffe5951dc7800
0xfffffe595142fc78 aggr0-ixgbe1 0xfffffe59513d82b8 ixgbe1 0xfffffe5951410000 0xfffffe594e92ee00
0xfffffe59514308e8 aggr0-ixgbe0 0xfffffe59513db880 ixgbe0 0xfffffe594eec8000 0xfffffe5951400700


> 0xfffffe595dca6128::print aggr_grp_t lg_ports[0].lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe0" ]


> 0xfffffe595dca6128::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring
mi_default_tx_ring = 0xfffffe5951dc7800
> 0xfffffe5951dc7800::print mac_ring_t mr_info.mri_driver | ::print aggr_pseudo_tx_ring_t atr_hw_rh | ::print mac_ring_t mr_info.mri_driver | ::print ixgbe_tx_ring_t ixgbe | ::print ixgbe_t instance
instance = 0


> 0xfffffe595dca6128::print aggr_grp_t lg_tx_ports[0]->lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe0" ]

I also verified that no broadcast packets were hitting the downed ports via dtrace. In order to test this you assign IP to the primary aggr clients and also create a VNIC over each aggr (the later is important because it changes the path in mac_tx_send(), ensuring that we call mac_bcast_send()).

<GZ> root@gaia [~]
# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
igb0/_a           static   ok           192.168.2.3/24
aggr0/v4          static   ok           192.168.5.2/24
aggr1/v4          static   ok           192.168.5.3/24
lo0/v6            static   ok           ::1/128

# dladm show-vnic
LINK         OVER       SPEED MACADDRESS        MACADDRTYPE VID  ZONE
unused0      aggr0      10000 2:8:20:3d:77:60   random      0    --
unused1      aggr1      10000 2:8:20:52:13:fc   random      0    --

# ping -i 192.168.5.3 192.168.5.99
^C

# dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'
mac_bcast_send 0xfffffe5b7d432ab0 aggr1 0xfffffe59e1eb4698 0xfffffe5951374008 aggr1001
mac_bcast_send 0xfffffe5b7d432ab0 aggr1 0xfffffe59e1eb4698 0xfffffe5951374008 aggr1001
mac_bcast_send 0xfffffe5b7d432ab0 aggr1 0xfffffe59e1eb4698 0xfffffe5951374008 aggr1001
mac_bcast_send 0xfffffe5b7d432ab0 aggr1 0xfffffe59e1eb4698 0xfffffe5951374008 aggr1001
mac_bcast_send 0xfffffe5b7d432ab0 aggr1 0xfffffe59e1eb4698 0xfffffe5951374008 aggr1001
mac_bcast_send 0xfffffe5b7d432ab0 aggr1 0xfffffe59e1eb4698 0xfffffe5951374008 aggr1001
^C

# ping -i 192.168.5.2 192.168.5.99
^C

# dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'
mac_bcast_send 0xfffffe59513b5008 aggr0 0xfffffe59e1eb49b8 0xfffffe59513775d0 aggr1000
mac_bcast_send 0xfffffe59513b5008 aggr0 0xfffffe59e1eb49b8 0xfffffe59513775d0 aggr1000
mac_bcast_send 0xfffffe59513b5008 aggr0 0xfffffe59e1eb49b8 0xfffffe59513775d0 aggr1000

Broken PI

Just for comparison and completeness, here's the same test above but on the bad platform. This first output shows that the downed port on aggr1 is indeed the first port added, that the mi_default_tx_ring points to a ring on the downed port, and that lg_tx_ports[0] ports to the up port (ixgbe3).

# dladm show-aggr -x
LINK        PORT           SPEED DUPLEX   STATE     ADDRESS            PORTSTATE
aggr0       --             10000Mb full   up        a0:36:9f:c2:3b:9c  --
            ixgbe0         10000Mb full   up        a0:36:9f:c2:3b:9c  attached
            ixgbe1         0Mb  unknown   down      a0:36:9f:c2:3b:9e  standby
aggr1       --             10000Mb full   up        b4:96:91:5:48:38   --
            ixgbe2         0Mb  unknown   down      b4:96:91:5:48:38   standby
            ixgbe3         10000Mb full   up        b4:96:91:5:48:3a   attached

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs cpc ip hook neti sockfs arp usba stmf_sbd stmf zfs mm sd lofs idm sata random logindmux ptm sppp nfs mr_sas ]
> ::walk mac_client_impl_cache | ::printf "0x%p %s 0x%p %s 0x%p 0x%p\n" mac_client_impl_t . mci_name mci_mip mci_mip->mi_name mci_mip->mi_driver mci_mip->mi_default_tx_ring ! grep -w aggr1
0xfffffe595d085ab0 aggr1 0xfffffe5951306008 aggr1001 0xfffffe595dfb68f8 0xfffffe5a65b22638
0xfffffe595d086720 aggr1-ixgbe3 0xfffffe595130cb98 ixgbe3 0xfffffe5951e67000 0xfffffe5951df18d8
0xfffffe595d087390 aggr1-ixgbe2 0xfffffe5951310160 ixgbe2 0xfffffe59513f2000 0xfffffe5951dfa1d8


> 0xfffffe595dfb68f8::print aggr_grp_t lg_ports[0].lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe2" ]


> 0xfffffe595dfb68f8::print aggr_grp_t lg_mh | ::print mac_impl_t mi_default_tx_ring
mi_default_tx_ring = 0xfffffe5a65b22638

> 0xfffffe5a65b22638::print mac_ring_t mr_info.mri_driver | ::print aggr_pseudo_tx_ring_t atr_hw_rh | ::print mac_ring_t mr_info.mri_driver | ::print ixgbe_tx_ring_t ixgbe | ::print ixgbe_t instance
instance = 0x2


> 0xfffffe595dfb68f8::print aggr_grp_t lg_tx_ports[0]->lp_mh | ::print mac_impl_t mi_name
mi_name = [ "ixgbe3" ]

And here is confirmation from dtrace that broadcast traffic is being sent down the wrong port on aggr1.

# ping -i 192.168.5.3 192.168.5.99
^C

# dtrace -qn 'mac_bcast_send:entry /arg1 != NULL/ { this->grp = (mac_bcast_grp_t *)arg0; this->smcip = (mac_client_impl_t *)arg1; printf("%s 0x%p %s 0x%p 0x%p %s\n", probefunc, this->smcip, stringof(this->smcip->mci_name), this->grp, this->grp->mbg_mac_impl, stringof(this->grp->mbg_mac_impl->mi_name)); } ixgbe_ring_tx:entry { this->ring = (ixgbe_tx_ring_t *)arg0; this->ixgbe = this->ring->ixgbe; if ((this->ixgbe->link_state & 0x1) == 0) { printf("sending on down interface: ixgbe%d\n", this->ring->ixgbe->instance); stack(); } }'
mac_bcast_send 0xfffffe595d085ab0 aggr1 0xfffffe59e1eea648 0xfffffe5951306008 aggr1001
sending on down interface: ixgbe2

              mac`mac_hwring_tx+0x1c
              mac`mac_ring_tx+0x1b
              mac`mac_provider_tx+0x85
              mac`mac_hwring_send_priv+0x1a
              aggr`aggr_ring_tx+0x1a
              mac`mac_hwring_tx+0x1c
              mac`mac_ring_tx+0x1b
              mac`mac_provider_tx+0x85
              mac`mac_bcast_send+0x28a
              mac`mac_tx_send+0x150
              mac`mac_tx_soft_ring_process+0x89
              mac`mac_tx_aggr_mode+0x8c
              mac`mac_tx+0x1a9
              dld`proto_unitdata_req+0x273
              dld`dld_wput+0xa9
              unix`putnext+0x233
              ip`arp_output+0x1fd
              ip`arp_request+0x123
              ip`ip_ndp_resolve+0x84
              ip`ip_xmit+0xab9

Comment by Mohamed Khalfella [X]
Created at 2019-09-11T19:46:38.130Z

I tested a PI with @ryan.zezeski's changes and I can confirm the CN can boot with these changes.


Comment by Jira Bot
Created at 2019-09-11T20:10:56.385Z

illumos-joyent commit f575b548814883ebc2eb65fe406ec62d0d6b5bbb (branch master, by Ryan Zezeski)

OS-6805 packet flow over a defaulted LACP port
Reviewed by: Dan McDonald <danmcd@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>