In short: The i40e dma_attr_count_max allows us to DMA bind a 16K buffer but the controller will only allow a max of 16K - 1.
After implementing LSO on i40e (OS-5225) there were several reports of weird networking behavior. To the user it would manifest as erratic network behavior before the machine eventually seemed to stop responding (e.g. see SWSUP-1261). But the real issue is that one of the i40e Tx rings would stop sending data and prevent any packets getting out on it. I was able to reproduce this easily enough by running an iperf3 benchmark on the primary MAC client (e.g. i40e0).
# ipadm create-if -t i40e0
# ipadm create-if -t i40e1
# ipadm create-addr -t -T static -a 192.168.6.10/24 i40e0/v4
# ipadm create-addr -t -T static -a 192.168.6.11/24 i40e1/v4
# iperf3 -s -B 192.168.6.11
# iperf3 -c 192.168.6.11 -B 192.168.6.10 -t 60 -P 16
Once the freeze event occurs it will present itself as an ever decreasing itrq_desc_free count. For example below I'm printing various information for all the Tx rings on the i40e0 MAC. The number in parentheses is the number of free Tx descriptors, which are required to send data on the NIC. Notice how the 3rd ring (index [2]) drops while all the others have all their descriptors freed. As I try to run more benchmarks, and more data is sent on this ring, we will continue to eat descriptors.
> 0xfffffe052a408288::print mac_impl_t mi_driver | ::print i40e_t i40e_trqpairs | ::array i40e_trqpair_t 7 | ::printf "0x%p [%u] %u %u %u %u (%u) %u %u\n" i40e_trqpair_t . itrq_index itrq_txstat.itxs_packets.value.ui64 itrq_txstat.itxs_descriptors.value.ui64 itrq_txstat.itxs_recycled.value.ui64 itrq_txstat.itxs_err_nodescs.value.ui64 itrq_desc_free itrq_tx_intrvec itrq_desc_head
0xfffffe0528027000 [0] 19807311 65999978 65999978 2052 (1024) 1 106
0xfffffe0528027710 [1] 18530819 65149538 65149538 13951 (1024) 2 610
0xfffffe0528027e20 [2] 19759141 66903019 66902231 8172 (236) 3 215
0xfffffe0528028530 [3] 22053192 75337746 75337746 9644 (1024) 4 18
0xfffffe0528028c40 [4] 16405613 57798643 57798643 2525 (1024) 5 1011
0xfffffe0528029350 [5] 12490918 42933186 42933186 2733 (1024) 6 962
0xfffffe0528029a60 [6] 19600256 67540929 67540929 4659 (1024) 7 961
> 0xfffffe052a408288::print mac_impl_t mi_driver | ::print i40e_t i40e_trqpairs | ::array i40e_trqpair_t 7 | ::printf "0x%p [%u] %u %u %u %u (%u) %u %u\n" i40e_trqpair_t . itrq_index itrq_txstat.itxs_packets.value.ui64 itrq_txstat.itxs_descriptors.value.ui64 itrq_txstat.itxs_recycled.value.ui64 itrq_txstat.itxs_err_nodescs.value.ui64 itrq_desc_free itrq_tx_intrvec itrq_desc_head
0xfffffe0528027000 [0] 19832094 66024761 66024761 2052 (1024) 1 313
0xfffffe0528027710 [1] 18530819 65149538 65149538 13951 (1024) 2 610
0xfffffe0528027e20 [2] 19759148 66903026 66902231 8172 (229) 3 215
0xfffffe0528028530 [3] 22053192 75337746 75337746 9644 (1024) 4 18
0xfffffe0528028c40 [4] 16405613 57798643 57798643 2525 (1024) 5 1011
0xfffffe0528029350 [5] 12490918 42933186 42933186 2733 (1024) 6 962
0xfffffe0528029a60 [6] 19600256 67540929 67540929 4659 (1024) 7 961
In order to reclaim these descriptors we set a special bit that tells the controller to fire an interrupt when its done with them so that we can recycle completed descriptors. My first thought was that we were either somehow losing these interrupts or the recycle code (i40e_tx_recycle_ring()) was broken. But the dtrace below showed that we were receiving recycle events but that our itrq_desc_wbhead and itrq_desc_head were always the same! This indicates that the NIC wasn't actually processing these descriptors. I further corroborated this by snooping the other side and sure enough the packets never make it across.
# dtrace -qn '::i40e_ring_tx:entry,::i40e_tx_set_data_desc:entry,::i40e_tx_cleanup_ring:entry,::i40e_tx_recycle_ring:entry { this->itrq = (i40e_trqpair_t *)arg0; printf("0x%p 0x%p %d %s %s (%d/%d)\n", curthread, this->itrq->itrq_i40e, this->itrq->itrq_index, execname, probefunc, this->itrq->itrq_desc_free, this->itrq->itrq_tx_ring_size); stack(); } ::i40e_tx_recycle_ring:entry { this->itrq = (i40e_trqpair_t *)arg0; this->wbhead = *this->itrq->itrq_desc_wbhead; this->toclean = this->itrq->itrq_desc_head; printf("wbhead = 0x%p\ntoclean = 0x%p\ntcbhead = 0x%p\n", this->wbhead, this->toclean, this->itrq->itrq_tcb_work_list[this->itrq->itrq_desc_head]); } ::i40e_intr_msix:entry { printf("0x%p %s %d\n", arg0, probefunc, (int)(uintptr_t)arg1); }'
0xfffffe0006562c20 0xfffffe0527d6b000 2 sched i40e_tx_recycle_ring (224/1024)
i40e`i40e_intr_tx_work+0x29
i40e`i40e_intr_msix+0x3f
apix`apix_dispatch_pending_autovect+0x101
apix`apix_dispatch_pending_hardint+0x34
unix`switch_sp_and_call+0x13
wbhead = 0xd7
toclean = 0xd7
tcbhead = 0xfffffe053b25de78
So the question became: why did the ring stop processing descriptors? To answer that I wrote a one-off program to take the contents of the Tx descriptor ring (itrq_desc_ring) and print out the descriptors in a human readable form.
0xfffffe0528027e20::print i40e_trqpair_t itrq_desc_ring | ::array i40e_tx_desc_t 0t1024 | ::printf "0x%x\n" i40e_tx_desc_t cmd_type_offset_bsz ! xargs -L1 echo > /var/tmp/i40e_desc_ring_0xfffffe0528027e20.out
$ ./i40e_tx_dsec ~/Downloads/i40e_desc_ring_0xfffffe0528027e20.out
IDX VALUE TYPE CMD MAC IP L4 BUFSZ
0 0x0 DATA NONIP|UNK| 0 0 0 0
1 0x0 DATA NONIP|UNK| 0 0 0 0
...
213 0x0 DATA NONIP|UNK| 0 0 0 0
214 0x0 DATA NONIP|UNK| 0 0 0 0
215 0x16a00fef80000011 CTX TSO|NOTAG| 16318 1448
216 0x1000202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 0
217 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
218 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
219 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
220 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
221 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
222 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
223 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
224 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
225 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
226 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
227 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
228 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
229 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
230 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
231 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
232 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
233 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
234 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
235 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
236 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
237 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
238 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
239 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
240 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
241 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448
242 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
243 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896
...
291 0x16a03df840000011 CTX TSO|NOTAG| 63457 1448
292 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66
293 0xc00602871640 DATA IPV4CSUM|TCP| 14 20 32 12289
294 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288
295 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288
296 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288
297 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288
298 0x1f8202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2016
The 215/216 entries represent the oldest packet on the ring/queue so it's reasonable to assume that the problem is caused by it. Sure enough, the data descriptor at 216 is peculiar. It's buffer size (BUFSZ) is 0, which is bogus for two reasons: 1) the LSO CTX descriptor indicates it's a 16K packet, and 2) the X710 manual, in section 8.4.2.1.1, clearly indicates the buffer size must be between 1 and 16K - 1. So how is it zero?
Let's track the code that sets the buffer size. We set the value in i40e_tx_set_data_desc() like so.
txdesc->cmd_type_offset_bsz =
LE_64(((uint64_t)I40E_TX_DESC_DTYPE_DATA |
((uint64_t)tctx->itc_data_offsets << I40E_TXD_QW1_OFFSET_SHIFT) |
((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) |
((uint64_t)dbi->dbi_len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT)));
The dbi_len comes from the i40e_tx_bind_fragment() function.
tcb->tcb_bind_info[i++].dbi_len = dma_cookie.dmac_size;
The dmac_size has an upper bound based on the ddi_dma_attr_t we set an initialization; controlled by the dma_attr_count_max value. We've set ours to 0x3FFF, 16K - 1, which matches perfectly with the X710 manual's specification found in section 8.4.2.1.1 (that a buffer size cannot exceed 16K - 1).
> 0xfffffe0528027e20::print i40e_trqpair_t itrq_i40e->i40e_txbind_lso_dma_attr
itrq_i40e->i40e_txbind_lso_dma_attr = {
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_version = 0
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_addr_lo = 0
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_addr_hi = 0xffffffffffffffff
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_count_max = 0x3fff
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_align = 0x1000
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_burstsizes = 0xfff
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_minxfer = 0x1
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_maxxfer = 0xffffffff
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_seg = 0xffffffffffffffff
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_sgllen = 0x20
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_granular = 0x1
itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_flags = 0x200
}
But if we look more closely at the DMA binding for our zero-length descriptor we find something odd. It's not actually zero, but 0x4000 (or 16K).
> 0xfffffe0528027e20::print i40e_trqpair_t itrq_tcb_work_list[0t216] | ::print i40e_tx_control_block_t tcb_bind_info[0]
tcb_bind_info = {
tcb_bind_info->dbi_paddr = 0x27be1036
tcb_bind_info->dbi_len = 0x4000
}
> 0x4000=D
16384
It turns out that the max size is actually supposed to be ONE LESS than the real max. We can confirm this by looking at the rootnex implementation. In this case count_max is what controls out max DMA size, and we always add 1 to it. We only roll it back if the addition was deemed to cause an overflow.
/*
* Figure out our maximum segment size. If the segment size is greater
* than 4G, we will limit it to (4G - 1) since the max size of a dma
* object (ddi_dma_obj_t.dmao_size) is 32 bits. dma_attr_seg and
* dma_attr_count_max are size-1 type values.
*
* Maximum segment size is the largest physically contiguous chunk of
* memory that we can return from a bind (i.e. the maximum size of a
* single cookie).
*/
/* handle the rollover cases */
seg = attr->dma_attr_seg + 1;
if (seg < attr->dma_attr_seg) {
seg = attr->dma_attr_seg;
}
count_max = attr->dma_attr_count_max + 1;
if (count_max < attr->dma_attr_count_max) {
count_max = attr->dma_attr_count_max;
}
This is all a long way of saying that i40e has an off-by-one error in its DMA initialization.
Part of the issue is that we don't document this in ddi_dma_attr_t(9S); but it's clearly stated in the original Writing Device Drivers manual, where it says:
Specifies the maximum transfer count that the DMA engine can handle in one cookie. The limit is expressed as the maximum count minus one. This count is used as a bit mask, so the count must also be one less than a power of two.
Though I'm not sure about the bitmask comment. In this case the value would have to be 0x1FFF versus 0x3FFE.
Former user commented on 2018-11-05T18:35:44.287-0500:
It looks like this is also the culprit behind illumos-joyent#180.
Former user commented on 2018-11-13T11:18:37.753-0500:
I tested this by running the same iperf producer for an entire day and verified with dtrace that we never see a DMA bind of 0x4000 and that the rings never freeze. With this patch applied, youzhongyang still sees a potential freeze issue (illumos-joyent#180), but there are other non-committed patches at play in that issue. And we know that the DMA bind size is in fact an issue. So I see no reason to hold this patch based on the report in illumos-joyent#180.
Jira Bot commented on 2018-11-14T16:27:11.873-0500:
illumos-joyent commit 6f6fae1b433b461a7b014f48ad94fc7f4927c6ed (branch master, by Ryan Zezeski)
OS-7344#icft=OS-7344 i40e Tx freeze caused by off-by-one DMA
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>