OS-7098: bufmod sends corrupted LSO packets

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2018-07-28T03:10:32.511Z
Updated at:2018-08-20T19:58:58.836Z

People

Created by:Ryan Zezeski [X]
Reported by:Ryan Zezeski [X]
Assigned to:Ryan Zezeski [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-08-20T19:58:58.819Z)

Fix Versions

2018-08-30 Zolom Swamp (Release Date: 2018-08-30)

Related Links

Description

I first discovered this during my LSO work, but I've also produced this behavior on 20180621T003454Z. In fact, this problem has existed for as long as it's been possible to send LSO packets.

The problem presents itself when you try to snoop a link sending LSO packets. The traffic itself is fine (e.g. iperf doens't complain), but data sent to snoop is being corrupted in some way.

<GZ> root@gaia [~]
# snoop -r -vvvv -d igb0 host 192.168.2.60
...
(warning) bad packet header in buffer offset 15080: length=314881378
(warning) bad packet header in buffer offset 15084: length=-975845868
(warning) bad packet header in buffer offset 15088: length=503537050
(warning) bad packet header in buffer offset 15092: length=-730215749
(warning) bad packet header in buffer offset 15096: length=-411264108
(warning) bad packet header in buffer offset 15100: length=-1127292324
(warning) bad packet header in buffer offset 15104: length=1283861262
(warning) bad packet header in buffer offset 15108: length=82423970
(warning) bad packet header in buffer offset 15112: length=2048372326
(warning) bad packet header in buffer offset 15116: length=239070324
(warning) bad packet header in buffer offset 15120: length=86852170

When you snoop a device the packets are delivered through a STREAMS stream. The bottom is the interface itself, the top is essentially DLPI (snoop uses dlpi_recv() to passively read packets), and the middle is made up of the packet filter and buffer modules.

> 0xfffffe5b7a18bbf8::stream

+-----------------------+-----------------------+
| 0xfffffe5b722d0900    | 0xfffffe5b722d0808    | 
| strwhead              | strrhead              | 
|                       |                       |
| cnt = 0t0             | cnt = 0t288106        | 
| flg = 0x00004022      | flg = 0x0004403c      | 
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xffffff3a36f27b70    | 0xffffff3a36f27a78    | 
| bufmod                | bufmod                | 
|                       |                       |
| cnt = 0t0             | cnt = 0t0             | 
| flg = 0x00000822      | flg = 0x00000832      | 
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xfffffe5d33e36920    | 0xfffffe5d33e36828    | 
| pfmod                 | pfmod                 | 
|                       |                       |
| cnt = 0t0             | cnt = 0t0             | 
| flg = 0x00000822      | flg = 0x00000832      | 
+-----------------------+-----------------------+
            |                       ^
            v                       |
+-----------------------+-----------------------+
| 0xfffffe761d5a30f8    | 0xfffffe761d5a3000    | 
| igb                   | igb                   | 
|                       |                       |
| cnt = 0t0             | cnt = 0t0             | 
| flg = 0x00244062      | flg = 0x00204032      | 
+-----------------------+-----------------------+

pfmod filters packets in the kernel (based on the client filter expression) and bufmod buffers them up into chunks to send up to userspace. These "chunks" are the interface between bufmod and snoop. A chunk is just an array of bytes containing mblks/packets. In order to help snoop find the bounds of each packet bufmod adds a header to each mblk. This header is a struct sb_hdr and describes both the original length of the packet as well as the new total length after adding the header and padding out the mblk to an 8 byte boundary. You can see all of this go down in bufmod.c:sbaddmsg().

In userspace, snoop is reading these chunks and then using the sb_hdr to determine mblk boundaries. It does this in the snoop_caputure.c:scan() function. Before attempting to parse the incoming mblk and print it out, snoop first performs some sanity checks to make sure it's dealing with a legit buffer.

		if ((nhdrp->sbh_totlen == 0) ||
		    (bp + nhdrp->sbh_totlen) < bp ||
		    (bp + nhdrp->sbh_totlen) > bufstop ||
		    (nhdrp->sbh_origlen == 0) ||
		    (bp + nhdrp->sbh_origlen) < bp ||
		    (nhdrp->sbh_msglen == 0) ||
		    (bp + nhdrp->sbh_msglen) < bp ||
		    (bp + nhdrp->sbh_msglen) > bufstop ||
		    (nhdrp->sbh_msglen > nhdrp->sbh_origlen) ||
		    (nhdrp->sbh_totlen < nhdrp->sbh_msglen) ||
		    (nhdrp->sbh_timestamp.tv_sec == 0)) {
			if (cap) {
				(void) fprintf(stderr, "(warning) bad packet "
				    "header in capture file");
			} else {
				(void) fprintf(stderr, "(warning) bad packet "
				    "header in buffer");
			}
			(void) fprintf(stderr, " offset %d: length=%d\n",
			    bp - buf, nhdrp->sbh_totlen);
			goto err;
		}

If any of these checks fail, snoop prints the "bad packet" message and then starts scanning the rest of the chunk bytes in hopes of finding a legit mblk. If you read the snoop messages closely you'll see the offset always increase by four – this is snoop trying to find a good packet in the chunk. So one bad packet from bufmod can cause a whole ton of messages in snoop (especially when it's near the LSO limit).

With dtrace in hand I was able to discover that the sbh_totlen (the length of the packet + sb_hdr + padding) was greater than the chunk length. The chunk length is based off the length of the read returned by dlpi_receive(). Why are they different?

If you look back at sbaddmsg() there is one case that seems wrong.

		pad = Align(hp.sbh_totlen);
		hp.sbh_totlen += sizeof (hp);
		hp.sbh_totlen += pad;
...
		/*
		 * Join message to the chunk
		 */
		linkb(sbp->sb_head, mp);

		sbp->sb_mcount++;
		sbp->sb_mlen += hp.sbh_totlen;

		/*
		 * If the first message alone is too big for the chunk close
		 * the chunk now.
		 * If the next message would immediately cause the chunk to
		 * overflow we may as well close the chunk now. The next
		 * message is certain to be at least SMALLEST_MESSAGE size.
		 *
		 * RPZ: So at this point we've added the message to
		 * the chunk but we haven't added the padding. This
		 * feels like the bug to me.
		 */
		if (hp.sbh_totlen + SMALLEST_MESSAGE > sbp->sb_chunk) {
			sbclosechunk(sbp);
			return;
		}
...
		/* RPZ: Code that follows adds the padding. */

So the packet sizes that were giving me trouble were all 65505 (without the padding, 65512 with). That is, dlpi_recv was returning 65505 as msglen, but sbh_totlen was 65512. In this case sb_chunk is 64K (65536). The difference between 64K and sbh_totlen is 24 – which also happens to be the size of sb_hdr. So what's the value of SMALLEST_MESSAGE?

#define	SMALLEST_MESSAGE	sizeof (struct sb_hdr) + _POINTER_ALIGNMENT

usr/src/uts/common/sys/isa_defs.h:260: (<global>) #define _POINTER_ALIGNMENT 8
usr/src/uts/common/sys/isa_defs.h:326: (<global>) #define _POINTER_ALIGNMENT 4
usr/src/uts/common/sys/isa_defs.h:435: (<global>) #define _POINTER_ALIGNMENT 4
usr/src/uts/common/sys/isa_defs.h:463: (<global>) #define _POINTER_ALIGNMENT 8

Bingo! This means that the if expression will be true and we will close the chunk before padding the mblk. I confirmed this theory with dtrace. Below you see that the packets of length 65481 (65505 minus the 24 byte sb_hdr) are linked to the chunk but memset() is never called (to add the padding).

# dtrace -qn '::sbaddmsg:entry /msgdsize(args[1]) > 60000/ { self->sz = msgdsize(args[1]); self->chunk = ((struct sb *)(args[0]->q_ptr))->sb_chunk; self->t = 1; } ::linkb:entry /self->t/ { self->link = 1; } ::memset:entry /self->link/ { self->memset = 1; } ::sbaddmsg:return /self->t/ { @[self->chunk, self->sz, self->link, self->memset] = count(); self->sz = 0; self->chunk = 0; self->t = 0; self->link = 0; self->memset = 0; }'
^C

    65536            64874        1        1                1
    65536            64930        1        1                1
    65536            61402        1        1               65
    65536            65481        1        0            33274

Comments

Comment by Ryan Zezeski [X]
Created at 2018-08-18T03:00:34.297Z

Testing notes:

Verified snoop corrupt packet errors on the current platform.

<GZ> root@gaia [~]
# uname -v
joyent_20180802T002654Z

<GZ> root@gaia [~]
# snoop -r -vvvv -d ixgbe0 host 192.168.2.60 1490 1> /dev/null 2> /var/tmp/snoop-err.txt
^C
<GZ> root@gaia [~]
# wc -l /var/tmp/snoop-err.txt 
  409426 /var/tmp/snoop-err.txt

<GZ> root@gaia [~]
# head /var/tmp/snoop-err.txt 
Using device ixgbe0 (promiscuous mode)
(warning) bad packet header in buffer offset 0: length=65512
(warning) bad packet header in buffer offset 4: length=574
(warning) bad packet header in buffer offset 8: length=1534559331
(warning) bad packet header in buffer offset 12: length=808944
(warning) bad packet header in buffer offset 16: length=-1029753184
(warning) bad packet header in buffer offset 20: length=916495931
(warning) bad packet header in buffer offset 24: length=-1673805153
(warning) bad packet header in buffer offset 28: length=4521992
(warning) bad packet header in buffer offset 32: length=1963703295

Then verified there were no corrupted packets on a patched platform.

<GZ> root@gaia [~]
# uname -v
joyent_20180815T192218Z

<GZ> root@gaia [~]
# snoop -r -vvvv -d ixgbe0 host 192.168.2.60 1490 1> /dev/null 2> /var/tmp/snoop-err.txt
^C
<GZ> root@gaia [~]
# wc -l /var/tmp/snoop-err.txt 
       1 /var/tmp/snoop-err.txt

<GZ> root@gaia [~]
# head /var/tmp/snoop-err.txt 
Using device ixgbe0 (promiscuous mode)

Comment by Jira Bot
Created at 2018-08-20T19:53:01.156Z

illumos-joyent commit 9edb67c309d2e2712a09267be4dea367e37dc545 (branch master, by Ryan Zezeski)

OS-7098 bufmod sends corrupted LSO packets
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Approved by: Jason King <jbk@joyent.com>