OS-6151: mac_tx_srs_quiesce() can kill a maxbw link

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2017-05-29T01:17:04.000Z
Updated at:2017-08-17T01:14:54.000Z

People

Created by:Former user
Reported by:Former user
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-06-30T21:35:36.000Z)

Fix Versions

2017-07-06 UGANDA MAXIMUM (Release Date: 2017-07-06)

Description

Setting maxbw, especially while a link is actively processing data, can kill a link's Tx SRS -- preventing the userspace client from sending data.

When a link is in bandwidth mode (via setting the maxbw property) it makes use of a Tx SRS and worker thread. As the bandwidth limit is reached for a given tick, the MAC module will queue data on the Tx SRS before completely dropping packets. As bandwidth becomes available, the Tx worker will remove data from this queue and send it across the wire. This is a delicate dance performed between client threads, which call mac_tx_srs_enqueue(), and MAC Tx SRS worker threads, which pull data off the queue via mac_srs_worker(). Since bandwidth is enforced on a per-tick basis, the worker thread uses a continuous 1-tick timeout to reset its bandwidth counters. This heartbeat is vital and involves the mac_srs_fire() function, the srs_async CV, timeout(9F), untimeout(9F), and the MAC quiesce functionality (i.e. SRS_PAUSE).

If data is queued in the SRS then srs_first != NULL. That's why mac_tx_srs_enqueue() has the logic copied below. If data is already queued it doesn't wake the worker, because the worker was already woken the first time this function was called.

	if (mac_srs->srs_first != NULL)
		wakeup_worker = B_FALSE;
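
For context, the other half of that decision is the wakeup itself. Here is a minimal sketch of what the tail of mac_tx_srs_enqueue() presumably does -- signal the srs_async CV the worker waits on, but only when wakeup_worker survived the check above. This is an illustration; the exact code may differ:

	/*
	 * Sketch: after the packet chain has been appended to the SRS
	 * queue, only poke the worker if it wasn't already woken by an
	 * earlier enqueue.
	 */
	if (wakeup_worker)
		cv_signal(&mac_srs->srs_async);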

The worker spins in a loop checking whether (a) there is no data to process or (b) there is data to process but we've reached the bandwidth limit for this tick. In case (b) it sets a timeout to execute mac_srs_fire() on the next tick and reset the bandwidth counters. Then it waits for that timeout (other threads/functions can signal this CV, but at this point the timeout is our only hope).

			if (bw_ctl_flag && mac_srs->srs_tid == NULL) {
				/*
				 * We need to ensure that a timer is already
				 * scheduled or we force schedule one for
				 * later so that we can continue processing
				 * after this quanta is over.
				 */
				mac_srs->srs_tid = timeout(mac_srs_fire,
				    mac_srs, 1);
			}
wait:
			CALLB_CPR_SAFE_BEGIN(&cprinfo);
			cv_wait(async, lock);
			CALLB_CPR_SAFE_END(&cprinfo, lock);

Once the worker is signaled it will recheck the bandwidth numbers and then call the drain function with the data it is allowed to process that tick.
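
That signal normally comes from the mac_srs_fire() callback armed above. A minimal conceptual sketch of the callback, assuming its essential job is just to wake the worker under srs_lock once the tick has elapsed (the real function does more bookkeeping):

        /*
         * Conceptual sketch of the 1-tick heartbeat callback. When the
         * timeout fires, wake the worker blocked in cv_wait() so it can
         * re-check its bandwidth budget and drain the queue.
         */
        static void
        mac_srs_fire(void *arg)
        {
                mac_soft_ring_set_t *mac_srs = arg;

                mutex_enter(&mac_srs->srs_lock);
                cv_signal(&mac_srs->srs_async);
                mutex_exit(&mac_srs->srs_lock);
        }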

This all works fine until you change the maxbw property while the worker is waiting. The key to this puzzle lies in the lines following the cv_wait():

			if (mac_srs->srs_state & SRS_PAUSE)
				goto done;
			if (mac_srs->srs_state & SRS_PROC)
				goto wait;

Changing maxbw on an active link calls mac_tx_srs_update_bwlimit(), which calls mac_tx_client_quiesce(), which sets the SRS_PAUSE flag on the SRS. That flag trips the first if statement above, jumping the worker past the srs_drain_func call and leading it directly to the mac_srs_worker_quiesce() call. That in turn leads to mac_srs_soft_rings_quiesce(), which calls untimeout(srs_tid).

        (void) untimeout(mac_srs->srs_tid);
        /*
         * RPZ: since we are calling untimeout shouldn't we set
         * srs_tid=NULL?
         */

However, as my comment points out, this function doesn't reset srs_tid even though we just canceled the timeout. This is a problem because the only other function that resets srs_tid is the drain function, but we jumped past that in mac_srs_worker() because of the special SRS_PAUSE logic.

So after the SRS has quiesced and we restart the workers we now have a situation where the worker thinks it has a timeout registered for mac_srs_fire() when it has no such thing. At this point the clients can queue all the data they want, but the worker thread, like Snow White, has eaten a poisoned apple and there is no prince of MAC to save her.
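
A minimal sketch of the kind of change the RPZ comment in the snippet above is asking for -- cancel the timeout and clear srs_tid in the quiesce path so the worker will arm a fresh heartbeat after restart. This illustrates the idea, not the committed fix:

        /* Sketch: cancel the pending heartbeat and forget its id. */
        if (mac_srs->srs_tid != NULL) {
                (void) untimeout(mac_srs->srs_tid);
                /*
                 * Without this, the worker believes a mac_srs_fire()
                 * timeout is still outstanding and never schedules a
                 * new one after the SRS is restarted.
                 */
                mac_srs->srs_tid = NULL;
        }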

Comments

Comment by Former user
Created at 2017-05-30T18:13:31.000Z
Updated at 2017-05-30T22:39:11.000Z

This bug is very easy to reproduce.

To remove any doubt that my diagnosis is correct I wrote a dtrace script to track the flow of Tx bytes over a link and detect when cv_wait() is called without a corresponding mac_srs_fire() timeout.

Here's the script:

#pragma D option quiet

#define SRS_PROC 0x00000010

/*
 * Pull the MAC client name from the SRS. We hook into the worker
 * thread via mac_srs_worker_restart() because mac_srs_worker() is
 * just a for loop. This way we hook into the thread anytime maxbw is
 * set, as opposed to starting the script before the link is created.
 */
fbt::mac_srs_worker_restart:entry {
	this->name = stringof(args[0]->srs_mcip->mci_name);
}

/*
 * If this worker thread is running on behalf of the MAC client we are
 * interested in then stash the SRS pointer in a thread local variable
 * (which doubles as a flag indicating we are interested in this
 * thread).
 */
fbt::mac_srs_worker_restart:entry /this->name == $1/ {
	self->srs = args[0];
	printf("Hooking into %s mac_srs_worker (0x%p)\n", this->name,
	    curthread);
}

fbt::untimeout:entry /self->srs && self->srs->srs_tid == args[0]/ {
	printf("Canceling SRS timeout: 0x%p (worker 0x%p)\n",
	    args[0], curthread);
	self->untimeout = args[0];
}

/*
 * The cv_wait(async, lock) call in mac_srs_worker() relies on the
 * mac_srs_fire() timeout to occur when in bandwidth mode. If this
 * timeout is canceled but the srs_tid isn't cleared then
 * mac_srs_worker() will think there is an outstanding timeout waiting
 * to fire when there isn't -- thus waiting on the async signal
 * forever.
 *
 * In this case you'll see the MAC client repeatedly call
 * mac_tx_srs_enqueue() -- enqueuing data well past the BW queue limit
 * and waiting for the worker thread to drain it.
 */
fbt::cv_wait:entry /self->srs && !(self->srs->srs_state & SRS_PROC) &&
    self->srs->srs_tid != NULL && self->untimeout == self->srs->srs_tid/ {
	printf("OOPS! %s waiting on timeout that will never come!\n",
		stringof(self->srs->srs_mcip->mci_name));
	stack();
	wait_start = timestamp;
	wait_thread = curthread;
}

/*
 * Remember, bytes are queued by the client on a different thread than
 * worker. We need to check the name here again.
 */
fbt::mac_tx_srs_enqueue:entry /stringof(args[0]->srs_mcip->mci_name) == $1/ {
	self->srs = args[0];
	self->q_before = self->srs->srs_size;
	enqueue_count++;
}

fbt::mac_tx_srs_enqueue:entry /self->srs && self->srs->srs_size > 0 &&
    wait_start != 0 && ((timestamp - wait_start) > 5000000000)/ {
	printf("ERROR: mac_srs_worker 0x%p has been waiting for 5 seconds.\n",
	    wait_thread);
	printf("There is data to deliver -- it should have been signaled!\n");
	printf("Use mdb -k to inspect the mac_soft_ring_set_t: 0x%p\n",
	    self->srs);
	exit(1);
}

fbt::mac_tx_srs_enqueue:return /self->srs/ {
	this->queued = self->srs->srs_size - self->q_before;
	printf("%u bytes queued on %s (%u total)\n", this->queued,
	    self->srs->srs_mcip->mci_name, self->srs->srs_size);
	self->q_before = 0;
	self->srs = 0;
}

/*
 * The drain function is called from the SRS worker thread so we just
 * check self->srs which was set via the probe on
 * mac_srs_worker_restart().
 */
fbt::mac_tx_srs_drain:entry /self->srs/ {
	self->q_before = self->srs->srs_size;
	enqueue_count = 0;
}

fbt::mac_tx_srs_drain:return /self->srs/ {
	this->drained = self->q_before - self->srs->srs_size;
	printf("%u bytes drained on %s (worker 0x%p) (%u remain)\n",
	    this->drained, self->srs->srs_mcip->mci_name, curthread,
	    self->srs->srs_size);
	self->q_before = 0;
}

To reproduce I run the following steps:

# dladm create-vnic -t -l e1000g0 tvn1
# ipadm create-addr -t -T static -a 10.0.1.253/16 tvn1/v4
# dtrace -Cs maxbw-stuck.d '"tvn1"'
# dladm set-linkprop -p maxbw=20 tvn1

In another terminal on the same host:
# iperf -c 10.0.0.61 -B 10.0.1.253 -p 5007 -i 1 -t 30

Then back to the first terminal:
# dladm set-linkprop -p maxbw=19 tvn1

When you change maxbw you'll notice the iperf output stops (because it can't send data) and the dtrace script output changes from an interleaving of queued/drained messages to purely queued messages:

Script output when things are running fine:

1514 bytes queued on tvn1 (1080120 total)
1514 bytes queued on tvn1 (1081634 total)
1514 bytes queued on tvn1 (1083148 total)
1514 bytes queued on tvn1 (1084662 total)
1514 bytes queued on tvn1 (1086176 total)
3028 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1083148 remain)
3028 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1080120 remain)
3846 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1076274 remain)
3028 bytes drained on tvn1 (worker 0xffffff0040dedc40) (1073246 remain)

Script output after changing maxbw under load and things go sideways:

Canceling SRS timeout: 0x17ff4654d (worker 0xffffff0040dedc40)
Hooking into tvn1 mac_srs_worker (0xffffff0040dedc40)
OOPS! tvn1 waiting on timeout that will never come!

              mac`mac_srs_worker+0x141
              unix`thread_start+0x8
Hooking into tvn1 mac_srs_worker (0xffffff00421aac40)
1514 bytes queued on tvn1 (1081634 total)
1514 bytes queued on tvn1 (1083148 total)
1514 bytes queued on tvn1 (1084662 total)
1514 bytes queued on tvn1 (1086176 total)
1514 bytes queued on tvn1 (1087690 total)
1514 bytes queued on tvn1 (1089204 total)
1514 bytes queued on tvn1 (1090718 total)
1514 bytes queued on tvn1 (1092232 total)
1514 bytes queued on tvn1 (1093746 total)
1514 bytes queued on tvn1 (1095260 total)
1514 bytes queued on tvn1 (1096774 total)
1514 bytes queued on tvn1 (1098288 total)
150 bytes queued on tvn1 (1098438 total)
54 bytes queued on tvn1 (1098492 total)
1514 bytes queued on tvn1 (1100006 total)
150 bytes queued on tvn1 (1100156 total)
ERROR: mac_srs_worker 0xffffff0040dedc40 has been waiting for 5 seconds.
There is data to deliver -- it should have been signaled!
Use mdb -k to inspect the mac_soft_ring_set_t: 0xffffff0e4ee5d040

Comment by Former user
Created at 2017-06-23T00:30:14.000Z
Updated at 2017-06-23T00:32:40.000Z

It turns out that setting maxbw is just a specific instance of a more general problem. The real issue is that any time you quiesce the Tx SRS worker you run the risk of killing the link. Today I tripped this same issue by just running a maxbw test program. It turns out that the worker is also quiesced on stream detach (dld_str_detach()). This induces the same problem as described above: we jump out of the mac_srs_worker() for (;;) loop and untimeout() the srs_tid without setting it to NULL. Then we end up in a situation where the worker waits on the async CV forever and mac_tx_bw_mode() just keeps piling data up in the SRS queue.

For example, I ran this dtrace script:

# dtrace -n 'fbt::mac_tx_srs_quiesce:entry { printf("srs_tid: 0x%p\n", args[0]->srs_tid); stack(); } fbt::mac_srs_soft_rings_quiesce:entry { self->t=1; } fbt::untimeout:entry /self->t/ { printf("untimeout: 0x%p\n", arg0); stack(); self->t=0; }'

And then I ran my new maxbw test which uses DLPI (thus STREAMS) to send custom ethertype frames across two vnics on an etherstub. When the test finishes, the stream is closed and detached, causing the following output from the dtrace script:

  9  28760         mac_tx_srs_quiesce:entry srs_tid: 0x17fff706b

              mac`i_mac_tx_client_quiesce+0x43
              mac`mac_tx_client_quiesce+0x16
              mac`mac_tx_client_flush+0x18
              mac`mac_unicast_remove+0x218
              dls`dls_mac_active_clear+0x38
              dls`dls_active_clear+0x46
              dls`dls_close+0x96
              dld`dld_str_detach+0x5f
              dld`dld_str_close+0x6a
              dld`dld_close+0x20
              genunix`qdetach+0x9b
              genunix`strclose+0x2c1
              specfs`device_close+0xb9
              specfs`spec_close+0x17b
              genunix`fop_close+0x61
              genunix`closef+0x5e
              genunix`closeall+0x67
              genunix`proc_exit+0x45c
              genunix`exit+0x15
              genunix`rexit+0x18

  9  24212                  untimeout:entry untimeout: 0x17fff706b

              mac`mac_srs_soft_rings_quiesce+0x83
              mac`mac_srs_worker_quiesce+0x9c
              mac`mac_srs_worker+0x1f1
              unix`thread_start+0x8

Then if I try to run the maxbw test a second time it hangs because it can't get any data across the link. If I inspect the Tx SRS in mdb I find the following:

> ::mac_srs ! grep vnic1
ffffff0e307029c0 maxbw_vnic1          ffffff0e306d0bb8                0 TX
ffffff0e30703680 maxbw_vnic1          ffffff0e306d0bb8                0 RX
> ffffff0e307029c0::print mac_soft_ring_set_t srs_tid
srs_tid = 0x17fff706b

So the timer was canceled (proven by the dtrace output), yet the SRS still thinks that srs_tid is valid and thus will never start a new timer -- effectively killing this link.


Comment by Former user
Created at 2017-06-28T05:58:29.000Z

I tested this by creating two vnics over an etherstub, plumbing them with IP addresses, and then changing maxbw while an iperf benchmark ran.

First, setup the interfaces:

[root@testsos /data/bin]# ./os_6151_setup.ksh
+ dladm create-etherstub -t maxbw_eth1
+ dladm create-vnic -t -l maxbw_eth1 -p mtu=9000 -p maxbw=73 maxbw_vnic1
+ dladm create-vnic -t -l maxbw_eth1 -p mtu=9000 maxbw_vnic2
+ ipadm create-addr -t -T static -a 172.16.21.130/24 maxbw_vnic1/v4
+ ipadm create-addr -t -T static -a 172.16.21.131/24 maxbw_vnic2/v4
[root@testsos /data/bin]# dladm show-link
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   up       --         --
e1000g1     phys      1500   up       --         --
aggr0       aggr      1500   up       --         e1000g0,e1000g1
net0        vnic      1500   ?        --         aggr0
maxbw_eth1  etherstub 9000   up       --         --
maxbw_vnic1 vnic      9000   up       --         maxbw_eth1
maxbw_vnic2 vnic      9000   up       --         maxbw_eth1
[root@testsos /data/bin]# ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
aggr0/?           dhcp     ok           172.16.21.128/24
maxbw_vnic1/v4    static   ok           172.16.21.130/24
maxbw_vnic2/v4    static   ok           172.16.21.131/24
lo0/v6            static   ok           ::1/128
[root@testsos /data/bin]# dladm show-linkprop -p maxbw maxbw_vnic1
LINK         PROPERTY        PERM VALUE          DEFAULT        POSSIBLE
maxbw_vnic1  maxbw           rw      73          --             --

Second, run the benchmark:

# iperf -B 172.16.21.131 -s -i 1

# iperf -B 172.16.21.130 -c 172.16.21.131 -i 1 -t 30

Third, change maxbw while running the benchmark and make sure data keeps flowing:

# iperf -B 172.16.21.130 -c 172.16.21.131 -i 1 -t 30
------------------------------------------------------------
Client connecting to 172.16.21.131, TCP port 5001
Binding to local address 172.16.21.130
TCP window size:  125 KByte (default)
------------------------------------------------------------
[  3] local 172.16.21.130 port 5001 connected with 172.16.21.131 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  17.6 MBytes   148 Mbits/sec
[  3]  1.0- 2.0 sec  16.8 MBytes   141 Mbits/sec
[  3]  2.0- 3.0 sec  16.6 MBytes   139 Mbits/sec
[  3]  3.0- 4.0 sec  16.5 MBytes   138 Mbits/sec
[  3]  4.0- 5.0 sec  16.6 MBytes   139 Mbits/sec
[  3]  5.0- 6.0 sec  16.6 MBytes   139 Mbits/sec
[  3]  6.0- 7.0 sec  16.8 MBytes   141 Mbits/sec
[  3]  7.0- 8.0 sec  16.5 MBytes   138 Mbits/sec
[  3]  8.0- 9.0 sec  16.6 MBytes   139 Mbits/sec
[  3]  9.0-10.0 sec  16.5 MBytes   138 Mbits/sec

# dladm set-linkprop -t -p maxbw=145 maxbw_vnic1

[  3] 10.0-11.0 sec  19.0 MBytes   159 Mbits/sec
[  3] 11.0-12.0 sec  24.9 MBytes   209 Mbits/sec
[  3] 12.0-13.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 13.0-14.0 sec  24.8 MBytes   208 Mbits/sec
[  3] 14.0-15.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 15.0-16.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 16.0-17.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 17.0-18.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 18.0-19.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 19.0-20.0 sec  24.9 MBytes   209 Mbits/sec
[  3] 20.0-21.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 21.0-22.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 22.0-23.0 sec  24.8 MBytes   208 Mbits/sec
[  3] 23.0-24.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 24.0-25.0 sec  24.8 MBytes   208 Mbits/sec
[  3] 25.0-26.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 26.0-27.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 27.0-28.0 sec  25.0 MBytes   210 Mbits/sec
[  3] 28.0-29.0 sec  25.0 MBytes   210 Mbits/sec
[  3]  0.0-30.0 sec   660 MBytes   185 Mbits/sec

The reason iperf reports more bandwidth than the maxbw setting is a bug on the Tx side of maxbw, reported in OS-6205.


Comment by Former user
Created at 2017-06-30T21:34:41.000Z

illumos-joyent commit 1a8b5e8 (branch master, by Ryan Zezeski)

OS-6151 mac_tx_srs_quiesce() can kill a maxbw link
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Jason King <jason.king@joyent.com>