OS-7924: OS-7520 regressed some instances of IP forwarding

Details

Issue Type:	Bug
Priority:	4 - Normal
Status:	Resolved
Created at:	2019-08-06T19:53:55.373Z
Updated at:	2019-08-20T22:32:32.922Z

People

Created by:	Former user
Reported by:	Former user
Assigned to:	Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-08-20T22:32:32.905Z)

Fix Versions

2019-08-29 Zoo York (Release Date: 2019-08-29)

Description

As part of the OS-7520 work, the IP forwarding logic stopped adjusting the checksum (because of TTL decrement) in some scenarios. However, this change was overly aggressive and stopped adjusting the checksum in cases where it should have.

Summary

Brian and Dave were seeing packets dropped when trying to reach hosts across a physical host boundary from their VPN client. I.e., they would VPN into physical host A and be able to reach all zones on host A but not reach any zones on physical host B (on the same LAN/subnet as host A). They did not have this issue before the introduction of OS-7520 (PI 20190605 and before).
Brian and Dave were both using e1000g parts.
OS-7520 changed the checksum adjustment logic in the IP forwarding path, assuming that the HCK_IPV4_HDRCKSUM_OK flag would always be cleared by the time the mblk reached it.
If a mblk reached the IP forwarding logic with that flag set, and it wasn't a mac-loopback packet, then we would fail to update its checksum to reflect the TTL adjustment.
The e1000g driver will typically attach with no Tx checksum offload. But it will provide Rx offload, thus attaching the HCK_IPV4_HDRCKSUM_OK flag.
The IP ILL_HCKSUM_CAPABLE(ill) value is based purely on Tx checksum offload. If this value is false then IP input will NOT clear the HCK_IPV4_HDRCKSUM_OK flag.
The two previous points causes the IP forwarding assumption to be broken, causing a bad checksum.
This same issue can manifest itself for other drivers if you disable their Tx checksum offload.
The real problem is that HCK_IPV4_HDRCKSUM_OK and HCK_IPV4_HDRCKSUM should have two DIFFERENT values, but they don't. Thus you can't distinguish Rx from Tx just by the flag. This matters in our new world where the more efficient mac-loopback path has blurred the lines between Rx/Tx.
For now IP forwarding can distinguish between Rx/Tx by checking for the HW_LOCAL_MAC flag, which indicates a local Tx.

Detail

The original problem report came from @accountid:557058:b557d86e-5fbb-44af-a61e-7b0a1055e444 and @accountid:62431a42fd5e45007042f374. Both of them reported that after the introduction of OS-7520 their VPN/NAT setups stopped working. In particular, any zones on the same physical host as the VPN/NAT could be reached from the VPN client; but any zones on a different physical host could not. I.e., let's say you have two physical hosts, A and B, behind a firewall on the same LAN and IP subnet. And on host A you setup a VPN zone to allow tunneling in from the internet. You would be able to send and receive traffic from all zones on physical host A, but not from physical host B. Furthermore, if you snoop/tcpdump the traffic from the perspective of the tunnel on the client side, you'll see that the IP header checksum is always off by one. Here's an example from my own macbook VPN client:

$ sudo tcpdump -vv -i utun10
Password:
tcpdump: listening on utun10, link-type NULL (BSD loopback), capture size 262144 bytes
13:54:48.876733 IP (tos 0x0, ttl 64, id 33799, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.6 > 10.0.99.99: ICMP echo request, id 56405, seq 0, length 64
13:54:48.877502 IP (tos 0x0, ttl 254, id 28904, offset 0, flags [DF], proto ICMP (1), length 84, bad cksum dbae (->dcae)!)
    10.0.99.99 > 192.168.1.6: ICMP echo reply, id 56405, seq 0, length 64
13:54:49.881341 IP (tos 0x0, ttl 64, id 9677, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.6 > 10.0.99.99: ICMP echo request, id 56405, seq 1, length 64
13:54:49.882211 IP (tos 0x0, ttl 254, id 28905, offset 0, flags [DF], proto ICMP (1), length 84, bad cksum dbad (->dcad)!)
    10.0.99.99 > 192.168.1.6: ICMP echo reply, id 56405, seq 1, length 64
13:54:50.882548 IP (tos 0x0, ttl 64, id 37044, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.1.6 > 10.0.99.99: ICMP echo request, id 56405, seq 2, length 64
13:54:50.883322 IP (tos 0x0, ttl 254, id 28906, offset 0, flags [DF], proto ICMP (1), length 84, bad cksum dbac (->dcac)!)
    10.0.99.99 > 192.168.1.6: ICMP echo reply, id 56405, seq 2, length 64

The problem, like many of the networking bugs, is a culmination of several things. The bug was introduced by these two changes I made to ire_recv_forward_v4() and ip_forward_xmit_v4(), respectively.

diff --git a/usr/src/uts/common/inet/ip/ip_input.c b/usr/src/uts/common/inet/ip/ip_input.c
index e2e7dca22c..7cfffbc84b 100644
--- a/usr/src/uts/common/inet/ip/ip_input.c
+++ b/usr/src/uts/common/inet/ip/ip_input.c
@@ -1003,8 +1003,26 @@ ire_recv_forward_v4(ire_t *ire, mblk_t *mp, void *iph_arg, ip_recv_attr_t *ira)
                ira->ira_pktlen = ntohs(ipha->ipha_length);
        }
 
-       /* Packet is being forwarded. Turning off hwcksum flag. */
-       DB_CKSUMFLAGS(mp) = 0;
+       /*
+        * The packet came here via MAC-loopback so we must reinstate
+        * its hardware IP header checksum request.
+        *
+        * IP input clears the HCK_IPV4_HDRCKSUM_OK flag. Since
+        * HCK_IPV4_HDRCKSUM_OK and HCK_IPV4_HDRCKSUM use the same
+        * value, we end up clearing the client's request to use
+        * hardware IP header checksum offload. We check for
+        * HW_LOCAL_MAC and an IP header checksum value of zero as an
+        * indicator that this packet requires reinstatement of the
+        * HCK_IPV4_HDRCKSUM flag. That said, zero is a valid
+        * checksum, and given hardware that doesn't support IP header
+        * checksum offload, this could result in the checksum being
+        * computed twice. This is fine because the checksum value is
+        * zero, and thus will retain its value on recalculation.
+        */
+       if (((DB_CKSUMFLAGS(mp) & HW_LOCAL_MAC) != 0) &&
+           ipha->ipha_hdr_checksum == 0) {
+               DB_CKSUMFLAGS(mp) |= HCK_IPV4_HDRCKSUM;
+       }
 
        /*
         * Martian Address Filtering [RFC 1812, Section 5.3.7]
@@ -1138,9 +1156,17 @@ ip_forward_xmit_v4(nce_t *nce, ill_t *ill, mblk_t *mp, ipha_t *ipha,
                return;
        }
        ipha->ipha_ttl--;
-       /* Adjust the checksum to reflect the ttl decrement. */
-       sum = (int)ipha->ipha_hdr_checksum + IP_HDR_CSUM_TTL_ADJUST;
-       ipha->ipha_hdr_checksum = (uint16_t)(sum + (sum >> 16));
+       /*
+        * Adjust the checksum to reflect the TTL decrement unless
+        * the packet expects IP header checksum offload; in which
+        * case we delay its calculation until later. Such a packet
+        * occurs when it travels via MAC-loopback over a link
+        * exposing HCKSUM_IPHDRCKSUM.
+        */
+       if ((DB_CKSUMFLAGS(mp) & HCK_IPV4_HDRCKSUM) == 0) {
+               sum = (int)ipha->ipha_hdr_checksum + IP_HDR_CSUM_TTL_ADJUST;
+               ipha->ipha_hdr_checksum = (uint16_t)(sum + (sum >> 16));
+       }
 
        /* Check if there are options to update */
        if (iraflags & IRAF_IPV4_OPTIONS) {

Basically, if we see a packet coming from mac-loopback with a zero-value IP header checksum, we assume that the sender requested IP header checksum offload and we reinstate the flag. We have to reinstate the flag because HCK_IPV4_HDRCKSUM and HCK_IPV4_HDRCKSUM_OK have the same value; and the IP input logic clears the HCK_IPV4_HDRCKSUM_OK flag when it sees it (or so I thought, it turns out it only does in some cases, and that's another aspect to this bug). The fact that these two macros have the same value has been a constant source of frustration and bugs -- we really really really REALLY need to bite the bullet and fix this. Anyways, the second part of the logic change, in ip_forward_xmit_v4(), only adjusts the checksum if we haven't requested IP header checksum offload. Which makes sense, as the hardware expects the value to be zero in order to calculate a correct checksum. However, this logic relies on one assumption: that ire_recv_forward_v4() NEVER sees an mblk with the HCK_IPV4_HDRCKSUM_OK flag set. In most cases, this is true. But it wasn't in Brian and Dave's case, why?

I assumed ire_recv_forward_v4() could never see HCK_IPV4_HDRCKSUM_OK because of this logic in ill_input_short_v4(). (Note that this function should really by using HCK_IPV4_HDRCKSUM_OK and not HCK_IPV4_HDRCKSUM, but since they currently have the same value it doesn't matter at runtime).

	/*
	 * If the packet originated from a same-machine sender or
	 * there is a good HW IP header checksum, we clear the need
	 * look at the IP header checksum.
	 */
	if ((DB_CKSUMFLAGS(mp) & HW_LOCAL_MAC) ||
	    ((DB_CKSUMFLAGS(mp) & HCK_IPV4_HDRCKSUM) &&
	    ILL_HCKSUM_CAPABLE(ill) && dohwcksum)) {
		/* Header checksum was ok. Clear the flag */
		DB_CKSUMFLAGS(mp) &= ~HCK_IPV4_HDRCKSUM;
		ira->ira_flags &= ~IRAF_VERIFY_IP_CKSUM;
	}

There's no way to reach the IP forwarding path without first hitting this logic. The dohwcksum variable is just a static kernel tunable and we can ignore it. Based on this logic you would expect any NIC providing the HCK_IPV4_HDRCKSUM_OK flag would end up having the flag cleared here. But it turns out for some drivers, like e1000g, or some configuration of drivers, like igb with Tx checksum disabled, that's not the case.

This first dtrace snippet shows the HCK_IPV4_HDRCKSUM_OK flag reaching ire_recv_forward_v4() from an e1000g part.

<GZ> root@sys76 [~]
# dtrace -qn 'ill_input_short_v4:entry { self->spec = speculation(); speculate(self->spec); printf("flags: 0x%x\n", args[0]->b_datap->db_struioun.cksum.flags); printf("ILL capabs: 0x%x\n", args[3]->ira_ill->ill_capabilities); } ire_recv_forward_v4:entry { this->flags = args[1]->b_datap->db_struioun.cksum.flags } ire_recv_forward_v4:entry /(this->flags & 0x1) != 0 && self->spec/ { commit(self->spec); self->spec = 0; } ire_recv_forward_v4:entry /(this->flags & 0x1) != 0/ { stack(); } ill_input_short_v4:return /self->spec/ { discard(self->spec); self->spec = 0; }'
flags: 0x1
ILL capabs: 0x1e0

              ip`ill_input_short_v4+0x4ee
              ip`ip_input_common_v4+0x3a7
              ip`ip_input+0x2b
              dls`i_dls_link_rx+0x211
              mac`mac_rx_deliver+0x37
              mac`mac_rx_soft_ring_process+0x19a
              mac`mac_rx_srs_proto_fanout+0x29a
              mac`mac_rx_srs_drain+0x363
              mac`mac_rx_srs_process+0x428
              mac`mac_rx_classify+0x11b
              mac`mac_rx_flow+0x63
              mac`mac_rx_common+0x1e6
              mac`mac_rx+0xb6
              mac`mac_rx_ring+0x2b
              e1000g`e1000g_intr_work+0x21a
              e1000g`e1000g_intr_pciexpress+0xb3
              apix`apix_dispatch_by_vector+0x8c
              apix`apix_dispatch_lowlevel+0x25
              unix`switch_sp_and_call+0x13

This second snippet shows the same but for an igb part with tx_hcksum_enable = 0; added to ibg.conf.

<GZ> root@gaia [~]
# dtrace -qn 'ill_input_short_v4:entry { self->spec = speculation(); speculate(self->spec); printf("flags: 0x%x\n", args[0]->b_datap->db_struioun.cksum.flags); printf("ILL capabs: 0x%x\n", args[3]->ira_ill->ill_capabilities); } ire_recv_forward_v4:entry { this->flags = args[1]->b_datap->db_struioun.cksum.flags } ire_recv_forward_v4:entry /(this->flags & 0x1) != 0 && self->spec/ { commit(self->spec); self->spec = 0; } ire_recv_forward_v4:entry /(this->flags & 0x1) != 0/ { stack(); } ill_input_short_v4:return /self->spec/ { discard(self->spec); self->spec = 0; }'
flags: 0x1
ILL capabs: 0x1e0

              ip`ill_input_short_v4+0x4ee
              ip`ip_input_common_v4+0x3a7
              ip`ip_input+0x2b
              dls`i_dls_link_rx+0x211
              mac`mac_rx_deliver+0x37
              mac`mac_rx_soft_ring_process+0x19a
              mac`mac_rx_srs_proto_fanout+0x29a
              mac`mac_rx_srs_drain+0x363
              mac`mac_rx_srs_process+0x428
              mac`mac_rx_classify+0x11b
              mac`mac_rx_flow+0x63
              mac`mac_rx_common+0x1e6
              mac`mac_rx+0xb6
              mac`mac_rx_ring+0x2b
              igb`igb_intr_rx_work+0x5c
              igb`igb_intr_rx+0x15
              apix`apix_dispatch_by_vector+0x8c
              apix`apix_dispatch_lowlevel+0x25
              unix`switch_sp_and_call+0x13

So why does this happen? It has to do with the ILL_HCKSUM_CAPABLE(ill) macro. The ILL checksum capab is determined by MAC_CAPAB_HCKSUM. The MAC capab, in turn, only concerns itself with Tx offload. The NICs/drivers consider Rx IP header checksum offload a separate thing and will set the flag regardless of the Tx offload setting. This leads to the case where an mblk has the HCK_IPV4_HDRCKSUM_OK flag set, but reports false for ILL_HCKSUM_CAPABLE(ill): thus allowing the flag to make it's way to the IP forwarding logic without being cleared. And it turns out this is the default behavior for e1000g parts, which both Brian and Dave were using in their deployments.

Comments

Comment by Former user
Created at 2019-08-20T17:03:28.142Z

I updated the IP forwarding test suite to include many more scenarios. Under the new suite, ip_fwd_006 tests the bug described in this ticket. I verified that it failed on the current platform, but passed on the patched one.

# /opt/net-tests/bin/nettest 
Test: /opt/net-tests/tests/forwarding/ip_fwd_001 (run as root)    [00:31] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_002 (run as root)    [00:31] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_003 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_004 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_005 (run as root)    [00:37] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_006 (run as root)    [00:33] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_007 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_008 (run as root)    [00:38] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_009 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_010 (run as root)    [00:35] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_011 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_012 (run as root)    [00:31] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_013 (run as root)    [00:34] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_014 (run as root)    [00:31] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_015 (run as root)    [00:36] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_016 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_017 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_018 (run as root)    [00:32] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_019 (run as root)    [00:33] [PASS]
Test: /opt/net-tests/tests/forwarding/ip_fwd_020 (run as root)    [00:32] [PASS]

Results Summary
PASS      20

Running Time:   00:11:08
Percent passed: 100.0%
Log directory:  /var/tmp/test_results/20190819T204042

I also setup my own VPN server and verified that I hit the same issue as Dave/Brian on the current platform and that it is fixed on the patched platform.

Comment by Jira Bot
Created at 2019-08-20T22:30:55.947Z

illumos-joyent commit 0678e39e27ea5dfce7fc85dcc38c837f3c8f6f6d (branch master, by Ryan Zezeski)

OS-7924 OS-7520 regressed some instances of IP forwarding
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>