OS-7520: OS-6778 broke IPv4 forwarding

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2019-01-17T15:58:45.293Z
Updated at:2019-08-20T22:30:59.301Z

People

Created by:Former user
Reported by:Former user
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2019-06-20T12:12:00.287Z)

Fix Versions

2019-07-04 Verdukianism (Release Date: 2019-07-04)

Description

Issue originally reported by wiedi in illumos-joyent#184.

ATTENTION

To further clarify, this doesn't ALWAYS break IP forwarding. Two requirements must be met: 1) the underlying MAC must provide HW checksum offload, and 2) the Tx path taken from sender to router MUST go over MAC-loopback. E.g., in the scenario described below, if edge's net0 interface were on a different MAC from junon's net1 (VNICs on different NICs), then the packets would hit the wire and have their checksums calculated, and everything would work in both directions.

The new MAC-loopback datapath, introduced in OS-6778, broke IPv4 forwarding. The old datapath ALWAYS calculated the IP/ULP checksum during MAC-loopback -- and IPv4 forwarding relies on this fact. The new datapath removes the checksum calculation from MAC-loopback (specifically mac_tx_send()), leaving the decision up to the client. E.g., DLS will emulate the NIC's hardware offloads, but IP + DLS bypass will not. For the latter, we know we are sending packets that never leave local DRAM, so why bother with checksumming? This gives us a nice performance boost, and CPU usage reduction, for MAC-loopback traffic.
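To make the conditions concrete, here is a minimal sketch of the decision described above. It is a model of the reasoning only, not the kernel code: hop_t, its fields, and cksums_filled_at_ip() are hypothetical names invented for illustration, and the first branch (no offload means the sender computes checksums in software) is my reading of requirement 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of the hop from the sender to the router. */
typedef struct hop {
	bool hw_cksum_offload;	/* underlying MAC provides checksum offload */
	bool mac_loopback;	/* packet never leaves the MAC (no wire) */
	bool dls_bypass;	/* IP receives straight from MAC, skipping DLS */
	bool post_os6778;	/* new MAC-loopback datapath is in use */
} hop_t;

/*
 * Returns true if the IP/ULP checksums are expected to be filled in by
 * the time the packet reaches the IP forwarding code.
 */
static bool
cksums_filled_at_ip(const hop_t *h)
{
	assert(h != NULL);

	if (!h->hw_cksum_offload)
		return (true);	/* assumed: sender computed them in software */
	if (!h->mac_loopback)
		return (true);	/* packet hit the wire; the NIC filled them in */
	if (!h->post_os6778)
		return (true);	/* old datapath always calculated them */

	/* New datapath: DLS emulates the offloads, IP + DLS bypass does not. */
	return (!h->dls_bypass);
}
```

Only the combination of offload, MAC-loopback, DLS bypass, and the new datapath returns false, which is the failing case this issue describes.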

The IP forwarding code strips the dblk of all hardware checksum flags.

ire_recv_forward_v4()
	/* Packet is being forwarded. Turning off hwcksum flag. */
	DB_CKSUMFLAGS(mp) = 0;

It assumes the sender (or MAC-loopback) already calculated the IP/ULP checksums. It clears the flags to prevent the hardware from recalculating a new checksum on top of the existing one (causing an incorrect checksum). After OS-6778, when the sender expects NIC HW offloads but the packet travels over MAC-loopback, we end up passing a packet to IP forwarding with a 0x0 IP checksum and a partial TCP checksum, with the HW flags set. IP forwarding clears the flags, adjusts the IP checksum by TTL, and then passes it to a NIC. The NIC, seeing no HW flags, passes the packet as-is. The destination rejects the packet as it contains bogus checksums.
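The effect of the TTL adjustment on an unfilled checksum can be reproduced in user space. The following is a sketch under stated assumptions, not the illumos forwarding code: the helper names are hypothetical, the header is a plain 20-byte IPv4 header with an assumed TTL of 64, and the adjustment is the usual incremental update (add 0x0100 in network order with end-around carry when the TTL drops by one).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of a header read as big-endian 16-bit words. */
static uint16_t
hdr_sum(const uint8_t *h, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += (uint32_t)((h[i] << 8) | h[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/* What the sender (or its NIC) does: fill in bytes 10-11. */
static void
hdr_fill_cksum(uint8_t *h)
{
	uint16_t c;

	h[10] = h[11] = 0;
	c = (uint16_t)~hdr_sum(h, 20);
	h[10] = (uint8_t)(c >> 8);
	h[11] = (uint8_t)(c & 0xff);
}

/* A header verifies when the sum over all words, checksum included, is 0xffff. */
static int
hdr_cksum_ok(const uint8_t *h)
{
	return (hdr_sum(h, 20) == 0xffff);
}

/*
 * What a router does when forwarding: decrement the TTL (byte 8) and
 * patch the existing checksum instead of recomputing it. This is only
 * correct if the existing checksum was valid to begin with.
 */
static void
fwd_adjust(uint8_t *h)
{
	uint32_t c = (uint32_t)((h[10] << 8) | h[11]);

	h[8]--;
	c += 0x0100;
	c = (c & 0xffff) + (c >> 16);
	h[10] = (uint8_t)(c >> 8);
	h[11] = (uint8_t)(c & 0xff);
}
```

A header whose checksum was filled in still verifies after fwd_adjust(). A header that arrives with a zero checksum leaves with bytes 01 00 in the checksum field and fails verification. The traces show the same step as 0x0 becoming 0x1, presumably because the script prints the field without byte-swapping on a little-endian host.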

Here's what the problem looks like in detail.

[Attachment: mac-loopback-ipv4-fwd.svg]

The issue is that traffic can flow fine from client to server, but not from server to client. The original reporter used netcat as an example. If you run nc -l 5000 on the server, then nc 192.168.99.2 5000 will fail to establish a connection from the client. If you snoop various interfaces you'll see traffic makes it to the server but not back to the client (if you snoop junon/net1 then it actually causes everything to work, but more on that later). To understand why and where the traffic was falling down I employed a dtrace script to print out the salient parts of dblk/ether/IP/TCP as the mblk reached the MAC and IP forwarding boundaries.

=== <direction> <client name> <probefunc> <mblk> [<dblk flags>] ===
<L2 src> -> <L2 dst> VID: <vid>
<IP src> -> <IP dst> (<IP len>) IPsum: <ip csum> ULPsum: <TCP csum> [<TCP flags>]


<GZ> root@gaia [~]
# dtrace -Cqs /data/d/follow_ips.d

=== RX z3_net0 mac_rx_srs_process 0xfffffe5ddb73b320 [|FULL_OK|IPV4_OK|IPV4] ===
70:54:d2:44:c5:54 -> 62:7b:59:43:64:7 VID: 0
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc929 ULPsum: 0xf33e [|SYN]

=== RX z3_net0 ire_recv_forward_v4 0xfffffe5ddb73b320 [|FULL_OK] ===
70:54:d2:44:c5:54 -> 62:7b:59:43:64:7 VID: 0
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc929 ULPsum: 0xf33e [|SYN]

=== RX z3_net0 ip_forward_xmit_v4 0xfffffe5ddb73b320 [] ===
70:54:d2:44:c5:54 -> 62:7b:59:43:64:7 VID: 0
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc929 ULPsum: 0xf33e [|SYN]

=== TX z3_net1 mac_tx 0xfffffe5ddb73b320 [] ===
f2:4a:df:eb:e2:58 -> 92:92:6f:27:cc:23 VID: 5
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc92a ULPsum: 0xf33e [|SYN]

=== RX z4_net0 mac_rx_srs_process 0xfffffe5ddb73b320 [] ===
f2:4a:df:eb:e2:58 -> 92:92:6f:27:cc:23 VID: 5
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc92a ULPsum: 0xf33e [|SYN]

=== TX z4_net0 mac_tx 0xfffffe5aaaa46b60 [|PARTIAL|IPV4_OK|IPV4] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 5
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

=== RX z3_net1 mac_rx_srs_process 0xfffffe5aaaa46b60 [LOCAL_MAC|PARTIAL|IPV4_OK|IPV4] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 5
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

=== RX  ire_recv_forward_v4 0xfffffe5aaaa46b60 [LOCAL_MAC|PARTIAL] ===
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

=== RX  ip_forward_xmit_v4 0xfffffe5aaaa46b60 [] ===
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

=== TX z3_net0 mac_tx 0xfffffe5aaaa46b60 [] ===
62:7b:59:43:64:7 -> 70:54:d2:44:c5:54 VID: 0
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x1 ULPsum: 0x85e6 [ACK|SYN]

I'm going to break this down, line-by-line.

First, the client sends a SYN to establish the nc connection. Since we are only tracing gaia (the GZ with the router and server zones), we first see the packet arrive on the RX side of MAC. It has dblk flags to indicate that the hardware verified the checksums (for reasons well beyond the scope of this report, the IPV4_HDRCKSUM{,_OK} flags have the same value, which is why both show up every time).

=== RX z3_net0 mac_rx_srs_process 0xfffffe5ddb73b320 [|FULL_OK|IPV4_OK|IPV4] ===
70:54:d2:44:c5:54 -> 62:7b:59:43:64:7 VID: 0
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc929 ULPsum: 0xf33e [|SYN]

Next this hits the Rx side of the IPv4 forwarding logic. Notice the IPV4_OK flag is stripped as part of the IP processing. This is as designed.

=== RX z3_net0 ire_recv_forward_v4 0xfffffe5ddb73b320 [|FULL_OK] ===
70:54:d2:44:c5:54 -> 62:7b:59:43:64:7 VID: 0
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc929 ULPsum: 0xf33e [|SYN]

Next the SYN packet is sent on to the forwarding code with all dblk flags stripped. But this is okay: the packet traveled over the wire to get from sys76 to gaia, so the sys76 NIC populated the IP/ULP checksums.

=== RX z3_net0 ip_forward_xmit_v4 0xfffffe5ddb73b320 [] ===
70:54:d2:44:c5:54 -> 62:7b:59:43:64:7 VID: 0
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc929 ULPsum: 0xf33e [|SYN]

Next the SYN hits the MAC TX of the router's interface which is on the same VLAN/subnet as the server's interface.

=== TX z3_net1 mac_tx 0xfffffe5ddb73b320 [] ===
f2:4a:df:eb:e2:58 -> 92:92:6f:27:cc:23 VID: 5
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc92a ULPsum: 0xf33e [|SYN]

The SYN finally hits the MAC RX of the server's interface, with its checksums appropriately calculated (as performed by the client's NIC).

=== RX z4_net0 mac_rx_srs_process 0xfffffe5ddb73b320 [] ===
f2:4a:df:eb:e2:58 -> 92:92:6f:27:cc:23 VID: 5
192.168.2.4:57649 -> 192.168.99.2:5000 (60) IPsum: 0xc92a ULPsum: 0xf33e [|SYN]

The server responds with a SYN ACK. Since the underlying hardware of the VNIC provides checksum offloads, we add the flags to request them.

=== TX z4_net0 mac_tx 0xfffffe5aaaa46b60 [|PARTIAL|IPV4_OK|IPV4] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 5
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

The SYN ACK lands on the router's interface with the hardware checksum flags and the LOCAL_MAC flag. The latter tells the RX side that this packet never left the underlying MAC (i.e., it never hit the wire). This is where post OS-6778 code diverges from the past: in the new world MAC no longer calculates the checksum; it's up to the client. Notice that the IP checksum is 0x0 (and while you can't tell just by looking at it, the TCP checksum is only partial).

=== RX z3_net1 mac_rx_srs_process 0xfffffe5aaaa46b60 [LOCAL_MAC|PARTIAL|IPV4_OK|IPV4] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 5
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

The SYN ACK lands on the forwarding code. IP has stripped the IPv4 hardware checksum flag since the packet is LOCAL_MAC. Our IP/ULP checksums are still incorrect.

=== RX  ire_recv_forward_v4 0xfffffe5aaaa46b60 [LOCAL_MAC|PARTIAL] ===
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

The SYN ACK is forwarded to the router's external interface. All dblk flags are stripped. The checksums are still incorrect.

=== RX  ip_forward_xmit_v4 0xfffffe5aaaa46b60 [] ===
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x0 ULPsum: 0x85e6 [ACK|SYN]

Finally, the SYN ACK hits the router's external interface. The hardware checksum flags are gone, and the IP/ULP checksums are incorrect (the IP checksum changed due to TTL adjustment). The client will reject this packet and this process will repeat indefinitely.

=== TX z3_net0 mac_tx 0xfffffe5aaaa46b60 [] ===
62:7b:59:43:64:7 -> 70:54:d2:44:c5:54 VID: 0
192.168.99.2:5000 -> 192.168.2.4:57649 (60) IPsum: 0x1 ULPsum: 0x85e6 [ACK|SYN]

Comments

Comment by Former user
Created at 2019-01-17T17:12:51.009Z
Updated at 2019-01-17T17:14:40.460Z

Why does snooping junon's net1 cause traffic to flow? Because it disables DLS bypass. This causes all traffic from edge net0 -> junon net1 to travel through DLS before reaching IP. DLS, in turn, calls mac_hw_emul() to emulate the hardware checksum offloads.

=== TX z4_net0 mac_tx 0xfffffe5ddc0a9420 [|PARTIAL|IPV4_OK|IPV4] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 5
192.168.99.2:5000 -> 192.168.2.4:48706 (60) IPsum: 0x0 TCPsum: 0x85e6 [ACK|SYN]

=== RX z3_net1 mac_rx_srs_process 0xfffffe5ddc0a9420 [LOCAL_MAC|PARTIAL|IPV4_OK|IPV4] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 5
192.168.99.2:5000 -> 192.168.2.4:48706 (60) IPsum: 0x0 TCPsum: 0x85e6 [ACK|SYN]

=== RX  i_dls_link_rx 0xfffffe5ddc0a9420 [LOCAL_MAC|PARTIAL|IPV4_OK|IPV4] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 0
192.168.99.2:5000 -> 192.168.2.4:48706 (60) IPsum: 0x0 TCPsum: 0x85e6 [ACK|SYN]

=== RX  ire_recv_forward_v4 0xfffffe5ddc0a9420 [LOCAL_MAC|FULL_OK] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 0
192.168.99.2:5000 -> 192.168.2.4:48706 (60) IPsum: 0x9156 TCPsum: 0x6809 [ACK|SYN]

=== RX  ip_forward_xmit_v4 0xfffffe5ddc0a9420 [] ===
92:92:6f:27:cc:23 -> f2:4a:df:eb:e2:58 VID: 0
192.168.99.2:5000 -> 192.168.2.4:48706 (60) IPsum: 0x9156 TCPsum: 0x6809 [ACK|SYN]

=== TX z3_net0 mac_tx 0xfffffe5ddc0a9420 [] ===
62:7b:59:43:64:7 -> 70:54:d2:44:c5:54 VID: 0
192.168.99.2:5000 -> 192.168.2.4:48706 (60) IPsum: 0x9157 TCPsum: 0x6809 [ACK|SYN]
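For readers unfamiliar with what emulating a partial checksum offload involves, here is a rough user-space sketch. It is not mac_hw_emul(); all function names are hypothetical. It assumes the common convention for partial offload: the sender seeds the ULP checksum field with the pseudo-header sum, and whoever finishes the job sums the whole segment (seed included) and stores the complement. It also assumes TCP; UDP's special handling of a zero result is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t
fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)sum);
}

/* Add bytes as big-endian 16-bit words; an odd tail is zero-padded. */
static uint32_t
sum_bytes(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((p[i] << 8) | p[i + 1]);
	if (i < len)
		sum += (uint32_t)(p[i] << 8);
	return (sum);
}

/* Sum of the IPv4 pseudo-header: src, dst, protocol, ULP length. */
static uint16_t
pseudo_sum(const uint8_t src[4], const uint8_t dst[4], uint8_t proto,
    uint16_t ulp_len)
{
	uint32_t sum = sum_bytes(dst, 4, sum_bytes(src, 4, 0));

	return (fold(sum + proto + ulp_len));
}

/* Sender side: leave only the pseudo-header sum in the checksum field. */
static void
seed_partial(uint8_t *ulp, size_t stuff, uint16_t psum)
{
	ulp[stuff] = (uint8_t)(psum >> 8);
	ulp[stuff + 1] = (uint8_t)(psum & 0xff);
}

/* Emulation: finish the checksum the hardware would have computed. */
static void
complete_partial(uint8_t *ulp, size_t len, size_t stuff)
{
	uint16_t c;

	assert(stuff + 1 < len);
	c = (uint16_t)~fold(sum_bytes(ulp, len, 0));
	ulp[stuff] = (uint8_t)(c >> 8);
	ulp[stuff + 1] = (uint8_t)(c & 0xff);
}

/* Receiver-side check: pseudo-header plus segment must sum to 0xffff. */
static int
ulp_cksum_ok(const uint8_t src[4], const uint8_t dst[4], uint8_t proto,
    const uint8_t *ulp, size_t len)
{
	uint32_t sum = pseudo_sum(src, dst, proto, (uint16_t)len);

	return (fold(sum_bytes(ulp, len, sum)) == 0xffff);
}
```

A segment that has only been seeded fails ulp_cksum_ok(); after complete_partial() it passes. That matches the pattern in the trace above, where the TCP checksum changes between i_dls_link_rx and ire_recv_forward_v4 and the flags go from PARTIAL to FULL_OK.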

Comment by Former user
Created at 2019-05-01T21:00:24.785Z

Certusoft reported a similar problem in SWSUP-1421. They are running 20181206T011455Z, which, according to the statement by @accountid:70121:6490ccfd-5932-4e7a-936d-554bdd3dc0d3, is affected by this issue. I was able to reproduce this issue in the support demo05 environment. I was also able to confirm that when running snoop on the external interface, the issue goes away.


Comment by Former user
Created at 2019-06-18T21:25:56.050Z

Test notes:


Comment by Jira Bot
Created at 2019-06-20T12:08:41.732Z

illumos-joyent commit 87738edeea3a17bfc0f19c6e1c3a597f3970e943 (branch master, by Ryan Zezeski)

OS-7520 OS-6778 broke IPv4 forwarding
OS-6878 mac_fix_cksum is incomplete
OS-7806 cannot move link from NGZ to GZ
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>


Comment by Jira Bot
Created at 2019-08-20T22:30:59.301Z

illumos-joyent commit 0678e39e27ea5dfce7fc85dcc38c837f3c8f6f6d (branch master, by Ryan Zezeski)

OS-7924 OS-7520 regressed some instances of IP forwarding
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>