Issue Type: | Bug |
---|---|
Priority: | 4 - Normal |
Status: | Resolved |
Created at: | 2018-11-02T15:47:01.296Z |
Updated at: | 2018-11-14T21:27:28.639Z |
Created by: | Former user |
---|---|
Reported by: | Former user |
Assigned to: | Former user |
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-11-14T21:27:28.625Z)
2018-11-22 Funcooker (Release Date: 2018-11-22)
In short: The i40e dma_attr_count_max
allows us to DMA bind a 16K buffer but the controller will only allow a max of 16K - 1.
After implementing LSO on i40e (OS-5225) there were several reports of weird networking behavior. To the user it would manifest as erratic network behavior before the machine eventually seemed to stop responding (e.g. see SWSUP-1261). But the real issue is that one of the i40e Tx rings would stop sending data and prevent any packets getting out on it. I was able to reproduce this easily enough by running an iperf3 benchmark on the primary MAC client (e.g. i40e0).
# ipadm create-if -t i40e0 # ipadm create-if -t i40e1 # ipadm create-addr -t -T static -a 192.168.6.10/24 i40e0/v4 # ipadm create-addr -t -T static -a 192.168.6.11/24 i40e1/v4 # iperf3 -s -B 192.168.6.11 # iperf3 -c 192.168.6.11 -B 192.168.6.10 -t 60 -P 16
Once the freeze event occurs it will present itself as an ever decreasing itrq_desc_free
count. For example below I'm printing various information for all the Tx rings on the i40e0 MAC. The number in parentheses is the number of free Tx descriptors, which are required to send data on the NIC. Notice how the 3rd ring (index [2]
) drops while all the others have all their descriptors freed. As I try to run more benchmarks, and more data is sent on this ring, we will continue to eat descriptors.
> 0xfffffe052a408288::print mac_impl_t mi_driver | ::print i40e_t i40e_trqpairs | ::array i40e_trqpair_t 7 | ::printf "0x%p [%u] %u %u %u %u (%u) %u %u\n" i40e_trqpair_t . itrq_index itrq_txstat.itxs_packets.value.ui64 itrq_txstat.itxs_descriptors.value.ui64 itrq_txstat.itxs_recycled.value.ui64 itrq_txstat.itxs_err_nodescs.value.ui64 itrq_desc_free itrq_tx_intrvec itrq_desc_head 0xfffffe0528027000 [0] 19807311 65999978 65999978 2052 (1024) 1 106 0xfffffe0528027710 [1] 18530819 65149538 65149538 13951 (1024) 2 610 0xfffffe0528027e20 [2] 19759141 66903019 66902231 8172 (236) 3 215 0xfffffe0528028530 [3] 22053192 75337746 75337746 9644 (1024) 4 18 0xfffffe0528028c40 [4] 16405613 57798643 57798643 2525 (1024) 5 1011 0xfffffe0528029350 [5] 12490918 42933186 42933186 2733 (1024) 6 962 0xfffffe0528029a60 [6] 19600256 67540929 67540929 4659 (1024) 7 961 > 0xfffffe052a408288::print mac_impl_t mi_driver | ::print i40e_t i40e_trqpairs | ::array i40e_trqpair_t 7 | ::printf "0x%p [%u] %u %u %u %u (%u) %u %u\n" i40e_trqpair_t . itrq_index itrq_txstat.itxs_packets.value.ui64 itrq_txstat.itxs_descriptors.value.ui64 itrq_txstat.itxs_recycled.value.ui64 itrq_txstat.itxs_err_nodescs.value.ui64 itrq_desc_free itrq_tx_intrvec itrq_desc_head 0xfffffe0528027000 [0] 19832094 66024761 66024761 2052 (1024) 1 313 0xfffffe0528027710 [1] 18530819 65149538 65149538 13951 (1024) 2 610 0xfffffe0528027e20 [2] 19759148 66903026 66902231 8172 (229) 3 215 0xfffffe0528028530 [3] 22053192 75337746 75337746 9644 (1024) 4 18 0xfffffe0528028c40 [4] 16405613 57798643 57798643 2525 (1024) 5 1011 0xfffffe0528029350 [5] 12490918 42933186 42933186 2733 (1024) 6 962 0xfffffe0528029a60 [6] 19600256 67540929 67540929 4659 (1024) 7 961
In order to reclaim these descriptors we set a special bit that tells the controller to fire an interrupt when its done with them so that we can recycle completed descriptors. My first thought was that we were either somehow losing these interrupts or the recycle code (i40e_tx_recycle_ring()
) was broken. But the dtrace below showed that we were receiving recycle events but that our itrq_desc_wbhead
and itrq_desc_head
were always the same! This indicates that the NIC wasn't actually processing these descriptors. I further corroborated this by snooping the other side and sure enough the packets never make it across.
# dtrace -qn '::i40e_ring_tx:entry,::i40e_tx_set_data_desc:entry,::i40e_tx_cleanup_ring:entry,::i40e_tx_recycle_ring:entry { this->itrq = (i40e_trqpair_t *)arg0; printf("0x%p 0x%p %d %s %s (%d/%d)\n", curthread, this->itrq->itrq_i40e, this->itrq->itrq_index, execname, probefunc, this->itrq->itrq_desc_free, this->itrq->itrq_tx_ring_size); stack(); } ::i40e_tx_recycle_ring:entry { this->itrq = (i40e_trqpair_t *)arg0; this->wbhead = *this->itrq->itrq_desc_wbhead; this->toclean = this->itrq->itrq_desc_head; printf("wbhead = 0x%p\ntoclean = 0x%p\ntcbhead = 0x%p\n", this->wbhead, this->toclean, this->itrq->itrq_tcb_work_list[this->itrq->itrq_desc_head]); } ::i40e_intr_msix:entry { printf("0x%p %s %d\n", arg0, probefunc, (int)(uintptr_t)arg1); }' 0xfffffe0006562c20 0xfffffe0527d6b000 2 sched i40e_tx_recycle_ring (224/1024) i40e`i40e_intr_tx_work+0x29 i40e`i40e_intr_msix+0x3f apix`apix_dispatch_pending_autovect+0x101 apix`apix_dispatch_pending_hardint+0x34 unix`switch_sp_and_call+0x13 wbhead = 0xd7 toclean = 0xd7 tcbhead = 0xfffffe053b25de78
So the question became: why did the ring stop processing descriptors? To answer that I wrote a one-off program to take the contents of the Tx descriptor ring (itrq_desc_ring
) and print out the descriptors in a human readable form.
0xfffffe0528027e20::print i40e_trqpair_t itrq_desc_ring | ::array i40e_tx_desc_t 0t1024 | ::printf "0x%x\n" i40e_tx_desc_t cmd_type_offset_bsz ! xargs -L1 echo > /var/tmp/i40e_desc_ring_0xfffffe0528027e20.out $ ./i40e_tx_dsec ~/Downloads/i40e_desc_ring_0xfffffe0528027e20.out IDX VALUE TYPE CMD MAC IP L4 BUFSZ 0 0x0 DATA NONIP|UNK| 0 0 0 0 1 0x0 DATA NONIP|UNK| 0 0 0 0 ... 213 0x0 DATA NONIP|UNK| 0 0 0 0 214 0x0 DATA NONIP|UNK| 0 0 0 0 215 0x16a00fef80000011 CTX TSO|NOTAG| 16318 1448 216 0x1000202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 0 217 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 218 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 219 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 220 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 221 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 222 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 223 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 224 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 225 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 226 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 227 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 228 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 229 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 230 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 231 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 232 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 233 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 234 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 235 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 236 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 237 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 238 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 239 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 240 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 241 0x16a002d400000011 CTX TSO|NOTAG| 2896 1448 242 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 243 0x2d4202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2896 ... 291 0x16a03df840000011 CTX TSO|NOTAG| 63457 1448 292 0x10a02871640 DATA IPV4CSUM|TCP| 14 20 32 66 293 0xc00602871640 DATA IPV4CSUM|TCP| 14 20 32 12289 294 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288 295 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288 296 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288 297 0xc00202871640 DATA IPV4CSUM|TCP| 14 20 32 12288 298 0x1f8202871670 DATA EOP|RS|IPV4CSUM|TCP| 14 20 32 2016
The 215/216 entries represent the oldest packet on the ring/queue so it's reasonable to assume that the problem is caused by it. Sure enough, the data descriptor at 216 is peculiar. It's buffer size (BUFSZ
) is 0, which is bogus for two reasons: 1) the LSO CTX descriptor indicates it's a 16K packet, and 2) the X710 manual, in section 8.4.2.1.1, clearly indicates the buffer size must be between 1 and 16K - 1. So how is it zero?
Let's track the code that sets the buffer size. We set the value in i40e_tx_set_data_desc()
like so.
txdesc->cmd_type_offset_bsz = LE_64(((uint64_t)I40E_TX_DESC_DTYPE_DATA | ((uint64_t)tctx->itc_data_offsets << I40E_TXD_QW1_OFFSET_SHIFT) | ((uint64_t)cmd << I40E_TXD_QW1_CMD_SHIFT) | ((uint64_t)dbi->dbi_len << I40E_TXD_QW1_TX_BUF_SZ_SHIFT)));
The dbi_len
comes from the i40e_tx_bind_fragment()
function.
tcb->tcb_bind_info[i++].dbi_len = dma_cookie.dmac_size;
The dmac_size
has an upper bound based on the ddi_dma_attr_t
we set an initialization; controlled by the dma_attr_count_max
value. We've set ours to 0x3FFF
, 16K - 1, which matches perfectly with the X710 manual's specification found in section 8.4.2.1.1 (that a buffer size cannot exceed 16K - 1).
> 0xfffffe0528027e20::print i40e_trqpair_t itrq_i40e->i40e_txbind_lso_dma_attr itrq_i40e->i40e_txbind_lso_dma_attr = { itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_version = 0 itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_addr_lo = 0 itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_addr_hi = 0xffffffffffffffff itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_count_max = 0x3fff itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_align = 0x1000 itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_burstsizes = 0xfff itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_minxfer = 0x1 itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_maxxfer = 0xffffffff itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_seg = 0xffffffffffffffff itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_sgllen = 0x20 itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_granular = 0x1 itrq_i40e->i40e_txbind_lso_dma_attr.dma_attr_flags = 0x200 }
But if we look more closely at the DMA binding for our zero-length descriptor we find something odd. It's not actually zero, but 0x4000
(or 16K).
> 0xfffffe0528027e20::print i40e_trqpair_t itrq_tcb_work_list[0t216] | ::print i40e_tx_control_block_t tcb_bind_info[0] tcb_bind_info = { tcb_bind_info->dbi_paddr = 0x27be1036 tcb_bind_info->dbi_len = 0x4000 } > 0x4000=D 16384
It turns out that the max size is actually supposed to be ONE LESS than the real max. We can confirm this by looking at the rootnex implementation. In this case count_max
is what controls out max DMA size, and we always add 1 to it. We only roll it back if the addition was deemed to cause an overflow.
/* * Figure out our maximum segment size. If the segment size is greater * than 4G, we will limit it to (4G - 1) since the max size of a dma * object (ddi_dma_obj_t.dmao_size) is 32 bits. dma_attr_seg and * dma_attr_count_max are size-1 type values. * * Maximum segment size is the largest physically contiguous chunk of * memory that we can return from a bind (i.e. the maximum size of a * single cookie). */ /* handle the rollover cases */ seg = attr->dma_attr_seg + 1; if (seg < attr->dma_attr_seg) { seg = attr->dma_attr_seg; } count_max = attr->dma_attr_count_max + 1; if (count_max < attr->dma_attr_count_max) { count_max = attr->dma_attr_count_max; }
This is all a long way of saying that i40e has an off-by-one error in its DMA initialization.
Part of the issue is that we don't document this in ddi_dma_attr_t(9S)
; but it's clearly stated in the original Writing Device Drivers manual, where it says:
Specifies the maximum transfer count that the DMA engine can handle in one cookie. The limit is expressed as the maximum count minus one. This count is used as a bit mask, so the count must also be one less than a power of two.
Though I'm not sure about the bitmask comment. In this case the value would have to be 0x1FFF
versus 0x3FFE
.
It looks like this is also the culprit behind illumos-joyent#180.
I tested this by running the same iperf producer for an entire day and verified with dtrace that we never see a DMA bind of 0x4000
and that the rings never freeze. With this patch applied, youzhongyang still sees a potential freeze issue (illumos-joyent#180), but there are other non-committed patches at play in that issue. And we know that the DMA bind size is in fact an issue. So I see no reason to hold this patch based on the report in illumos-joyent#180.
illumos-joyent commit 6f6fae1b433b461a7b014f48ad94fc7f4927c6ed (branch master, by Ryan Zezeski)
OS-7344 i40e Tx freeze caused by off-by-one DMA
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Rob Johnston <rob.johnston@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>