OS-7344: i40e Tx freeze caused by off-by-one DMA

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2018-11-02T15:47:01.296Z
Updated at: 2018-11-14T21:27:28.639Z

People

Created by: Ryan Zezeski [X]
Reported by: Ryan Zezeski [X]
Assigned to: Ryan Zezeski [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-11-14T21:27:28.625Z)

Fix Versions

2018-11-22 Funcooker (Release Date: 2018-11-22)

Description

In short: The i40e dma_attr_count_max allows us to DMA bind a 16K buffer but the controller will only allow a max of 16K - 1.

After implementing LSO on i40e (OS-5225) there were several reports of strange networking behavior. To the user it manifested as erratic behavior before the machine eventually seemed to stop responding (e.g. see SWSUP-1261). But the real issue was that one of the i40e Tx rings would stop sending data and prevent any packets from getting out on it. I was able to reproduce this easily enough by running an iperf3 benchmark on the primary MAC client (e.g. i40e0).

# ipadm create-if -t i40e0
# ipadm create-if -t i40e1
# ipadm create-addr -t -T static -a 192.168.6.10/24 i40e0/v4
# ipadm create-addr -t -T static -a 192.168.6.11/24 i40e1/v4
# iperf3 -s -B 192.168.6.11
# iperf3 -c 192.168.6.11 -B 192.168.6.10 -t 60 -P 16

Once the freeze event occurs it presents itself as an ever-decreasing itrq_desc_free count. For example, below I'm printing various information for all the Tx rings on the i40e0 MAC. The number in parentheses is the number of free Tx descriptors, which are required to send data on the NIC. Notice how the free count of the 3rd ring (index [2]) drops while all the other rings have all their descriptors freed. As I run more benchmarks, and more data is sent on this ring, it will continue to eat descriptors.

> 0xfffffe052a408288::print mac_impl_t mi_driver | ::print i40e_t i40e_trqpairs | ::array i40e_trqpair_t 7 | ::printf "0x%p [%u] %u %u %u %u (%u) %u %u\n" i40e_trqpair_t . itrq_index itrq_txstat.itxs_packets.value.ui64 itrq_txstat.itxs_descriptors.value.ui64 itrq_txstat.itxs_recycled.value.ui64 itrq_txstat.itxs_err_nodescs.value.ui64 itrq_desc_free itrq_tx_intrvec itrq_desc_head
0xfffffe0528027000 [0] 19807311 65999978 65999978 2052 (1024) 1 106
0xfffffe0528027710 [1] 18530819 65149538 65149538 13951 (1024) 2 610
0xfffffe0528027e20 [2] 19759141 66903019 66902231 8172 (236) 3 215
0xfffffe0528028530 [3] 22053192 75337746 75337746 9644 (1024) 4 18
0xfffffe0528028c40 [4] 16405613 57798643 57798643 2525 (1024) 5 1011
0xfffffe0528029350 [5] 12490918 42933186 42933186 2733 (1024) 6 962
0xfffffe0528029a60 [6] 19600256 67540929 67540929 4659 (1024) 7 961

> 0xfffffe052a408288::print mac_impl_t mi_driver | ::print i40e_t i40e_trqpairs | ::array i40e_trqpair_t 7 | ::printf "0x%p [%u] %u %u %u %u (%u) %u %u\n" i40e_trqpair_t . itrq_index itrq_txstat.itxs_packets.value.ui64 itrq_txstat.itxs_descriptors.value.ui64 itrq_txstat.itxs_recycled.value.ui64 itrq_txstat.itxs_err_nodescs.value.ui64 itrq_desc_free itrq_tx_intrvec itrq_desc_head
0xfffffe0528027000 [0] 19832094 66024761 66024761 2052 (1024) 1 313
0xfffffe0528027710 [1] 18530819 65149538 65149538 13951 (1024) 2 610
0xfffffe0528027e20 [2] 19759148 66903026 66902231 8172 (229) 3 215
0xfffffe0528028530 [3] 22053192 75337746 75337746 9644 (1024) 4 18
0xfffffe0528028c40 [4] 16405613 57798643 57798643 2525 (1024) 5 1011
0xfffffe0528029350 [5] 12490918 42933186 42933186 2733 (1024) 6 962
0xfffffe0528029a60 [6] 19600256 67540929 67540929 4659 (1024) 7 961

In order to reclaim these descriptors we set a special bit that tells the controller to fire an interrupt when it's done with them so that we can recycle completed descriptors. My first thought was that we were either somehow losing these interrupts or the recycle code (i40e_tx_recycle_ring()) was broken. But the dtrace below showed that we were receiving recycle events but that our itrq_desc_wbhead and itrq_desc_head were always the same! This indicates that the NIC wasn't actually processing these descriptors. I further corroborated this by snooping the other side, and sure enough, the packets never made it across.

# dtrace -qn '::i40e_ring_tx:entry,::i40e_tx_set_data_desc:entry,::i40e_tx_cleanup_ring:entry,::i40e_tx_recycle_ring:entry { this->itrq = (i40e_trqpair_t *)arg0; printf("0x%p 0x%p %d %s %s (%d/%d)\n", curthread, this->itrq->itrq_i40e, this->itrq->itrq_index, execname, probefunc, this->itrq->itrq_desc_free, this->itrq->itrq_tx_ring_size); stack(); } ::i40e_tx_recycle_ring:entry { this->itrq = (i40e_trqpair_t *)arg0; this->wbhead = *this->itrq->itrq_desc_wbhead; this->toclean = this->itrq->itrq_desc_head; printf("wbhead = 0x%p\ntoclean = 0x%p\ntcbhead = 0x%p\n", this->wbhead, this->toclean, this->itrq->itrq_tcb_work_list[this->itrq->itrq_desc_head]); } ::i40e_intr_msix:entry { printf("0x%p %s %d\n", arg0, probefunc, (int)(uintptr_t)arg1); }'

0xfffffe0006562c20 0xfffffe0527d6b000 2 sched i40e_tx_recycle_ring (224/1024)

              i40e`i40e_intr_tx_work+0x29
              i40e`i40e_intr_msix+0x3f
              apix`apix_dispatch_pending_autovect+0x101
              apix`apix_dispatch_pending_hardint+0x34
              unix`switch_sp_and_call+0x13
wbhead = 0xd7
toclean = 0xd7
tcbhead = 0xfffffe053b25de78

So the question became: why did the ring stop processing descriptors? To answer that I wrote a one-off program to take the contents of the Tx descriptor ring (itrq_desc_ring) and print out the descriptors in a human-readable form.

0xfffffe0528027e20::print i40e_trqpair_t itrq_desc_ring | ::array i40e_tx_desc_t 0t1024 | ::printf "0x%x\n" i40e_tx_desc_t cmd_type_offset_bsz ! xargs -L1 echo > /var/tmp/i40e_desc_ring_0xfffffe0528027e20.out

$ ./i40e_tx_dsec ~/Downloads/i40e_desc_ring_0xfffffe0528027e20.out 
IDX  VALUE              TYPE CMD                      MAC  IP   L4   BUFSZ 
0    0x0                DATA NONIP|UNK|               0    0    0    0     
1    0x0                DATA NONIP|UNK|               0    0    0    0     
...   
213  0x0                DATA NONIP|UNK|               0    0    0    0     
214  0x0                DATA NONIP|UNK|               0    0    0    0     
215  0x16a00fef80000011 CTX  TSO|NOTAG|               16318 1448
216  0x1000202871670    DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   0     
217  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
218  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
219  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
220  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
221  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
222  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
223  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
224  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
225  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
226  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
227  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
228  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
229  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
230  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
231  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
232  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
233  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
234  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
235  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
236  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
237  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
238  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
239  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
240  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
241  0x16a002d400000011 CTX  TSO|NOTAG|               2896 1448
242  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
243  0x2d4202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2896  
...
291  0x16a03df840000011 CTX  TSO|NOTAG|               63457 1448
292  0x10a02871640      DATA IPV4CSUM|TCP|            14   20   32   66    
293  0xc00602871640     DATA IPV4CSUM|TCP|            14   20   32   12289 
294  0xc00202871640     DATA IPV4CSUM|TCP|            14   20   32   12288 
295  0xc00202871640     DATA IPV4CSUM|TCP|            14   20   32   12288 
296  0xc00202871640     DATA IPV4CSUM|TCP|            14   20   32   12288 
297  0xc00202871640     DATA IPV4CSUM|TCP|            14   20   32   12288 
298  0x1f8202871670     DATA EOP|RS|IPV4CSUM|TCP|     14   20   32   2016  

The 215/216 entries represent the oldest packet on the ring/queue, so it's reasonable to assume that the problem is caused by it. Sure enough, the data descriptor at 216 is peculiar. Its buffer size (BUFSZ) is 0, which is bogus for two reasons: 1) the LSO CTX descriptor indicates it's a 16K packet, and 2) the X710 manual, in section 8.4.2.1.1, clearly indicates the buffer size must be between 1 and 16K - 1. So how is it zero?

Let's track the code that sets the buffer size. We set the value in i40e_tx_set_data_desc() like so.

	txdesc->cmd_type_offset_bsz =
	    LE_64(((uint64_t)I40E_TX_DESC_DTYPE_DATA |
	    ((uint64_t)tctx->itc_data_offsets << I40E_TXD_QW1_OFFSET_SHIFT) |
	    ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) |
	    ((uint64_t)dbi->dbi_len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT)));

The dbi_len comes from the i40e_tx_bind_fragment() function.

		tcb->tcb_bind_info[i++].dbi_len = dma_cookie.dmac_size;

The dmac_size has an upper bound based on the ddi_dma_attr_t we set at initialization, controlled by the dma_attr_count_max value. We've set ours to 0x3FFF (16K - 1), which matches perfectly with the X710 manual's specification found in section 8.4.2.1.1 (that a buffer size cannot exceed 16K - 1).

> 0xfffffe0528027e20::print i40e_trqpair_t itrq_i40e->i40e_txbind_lso_dma_attr
itrq_i40e->i40e_txbind_lso_dma_attr = {
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_version = 0
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_addr_lo = 0
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_addr_hi = 0xffffffffffffffff
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_count_max = 0x3fff
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_align = 0x1000
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_burstsizes = 0xfff
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_minxfer = 0x1
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_maxxfer = 0xffffffff
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_seg = 0xffffffffffffffff
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_sgllen = 0x20
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_granular = 0x1
    itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_flags = 0x200
}

But if we look more closely at the DMA binding for our zero-length descriptor we find something odd. It's not actually zero, but 0x4000 (or 16K).

> 0xfffffe0528027e20::print i40e_trqpair_t itrq_tcb_work_list[0t216] | ::print i40e_tx_control_block_t tcb_bind_info[0]
tcb_bind_info = {
    tcb_bind_info->dbi_paddr = 0x27be1036
    tcb_bind_info->dbi_len = 0x4000
}
> 0x4000=D
                16384           

It turns out that dma_attr_count_max is actually supposed to be ONE LESS than the real max. We can confirm this by looking at the rootnex implementation: count_max is what controls our max DMA size, and we always add 1 to it, rolling it back only if the addition would overflow.

	/*
	 * Figure out our maximum segment size. If the segment size is greater
	 * than 4G, we will limit it to (4G - 1) since the max size of a dma
	 * object (ddi_dma_obj_t.dmao_size) is 32 bits. dma_attr_seg and
	 * dma_attr_count_max are size-1 type values.
	 *
	 * Maximum segment size is the largest physically contiguous chunk of
	 * memory that we can return from a bind (i.e. the maximum size of a
	 * single cookie).
	 */

	/* handle the rollover cases */
	seg = attr->dma_attr_seg + 1;
	if (seg < attr->dma_attr_seg) {
		seg = attr->dma_attr_seg;
	}
	count_max = attr->dma_attr_count_max + 1;
	if (count_max < attr->dma_attr_count_max) {
		count_max = attr->dma_attr_count_max;
	}

This is all a long way of saying that i40e has an off-by-one error in its DMA initialization.

Part of the issue is that we don't document this in ddi_dma_attr_t(9S), but it's clearly stated in the original Writing Device Drivers manual, where it says:

Specifies the maximum transfer count that the DMA engine can handle in one cookie. The limit is expressed as the maximum count minus one. This count is used as a bit mask, so the count must also be one less than a power of two.

Though I'm not sure about the bitmask comment. If it held, the value would have to be 0x1FFF (the largest value one less than a power of two that stays under the limit) versus 0x3FFE.

Comments

Comment by Ryan Zezeski [X]
Created at 2018-11-05T23:35:44.287Z

It looks like this is also the culprit behind illumos-joyent#180.


Comment by Ryan Zezeski [X]
Created at 2018-11-13T16:18:37.753Z

I tested this by running the same iperf producer for an entire day and verified with dtrace that we never see a DMA bind of 0x4000 and that the rings never freeze. With this patch applied, youzhongyang still sees a potential freeze issue (illumos-joyent#180), but there are other non-committed patches at play in that issue. And we know that the DMA bind size is in fact an issue. So I see no reason to hold this patch based on the report in illumos-joyent#180.


Comment by Jira Bot
Created at 2018-11-14T21:27:11.873Z

illumos-joyent commit 6f6fae1b433b461a7b014f48ad94fc7f4927c6ed (branch master, by Ryan Zezeski)

OS-7344 i40e Tx freeze caused by off-by-one DMA
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>