| Issue Type: | Task |
| --- | --- |
| Priority: | 4 - Normal |
| Status: | Resolved |
| Created at: | 2020-04-21T11:41:53.497Z |
| Updated at: | 2020-06-09T09:26:42.618Z |
| Created by: | Former user |
| Reported by: | Former user |
| Assigned to: | Former user |
Filing Error: This issue was not filed correctly.
(Resolution Date: 2020-06-09T09:26:42.604Z)
From investigating MANTA-4968, there appears to be a relatively rare race between shutdown(SHUT_WR) and writes across a UNIX socket.
We're running
https://raw.githubusercontent.com/jlevon/grot/master/haproxy-socket.c
against haproxy in a loadbalancer zone.
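For orientation, the client pattern the test program exercises looks roughly like this (a minimal sketch only; the socket path and command here are placeholders, and the real haproxy-socket.c linked above differs in detail):

```c
#include <sys/socket.h>
#include <sys/un.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	char buf[1024];
	ssize_t n;
	int fd;

	/* Placeholder path: wherever haproxy's stats socket lives. */
	(void) strlcpy(addr.sun_path, "/tmp/haproxy.sock",
	    sizeof (addr.sun_path));

	if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0 ||
	    connect(fd, (struct sockaddr *)&addr, sizeof (addr)) != 0) {
		perror("connect");
		return (1);
	}

	(void) write(fd, "show stat\n", 10);	/* send the command */
	(void) shutdown(fd, SHUT_WR);		/* half-close the write side */

	/*
	 * Block waiting for the reply. Hitting the race means read()
	 * returns 0 (EOF) here even though haproxy wrote a response.
	 */
	while ((n = read(fd, buf, sizeof (buf))) > 0)
		(void) write(STDOUT_FILENO, buf, n);

	(void) close(fd);
	return (0);
}
```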
We're dtracing with variants of:
https://github.com/jlevon/grot/blob/master/d/recvmsg.d
The client by this point has sent the command out over the socket, and done a shutdown(SHUT_WR).
Now it's waiting for some data from haproxy:
```
37  76382  kstrgetmsg:entry ------------------ T1604397695044427 vnode fffffe27a2f52000 stp:fffffe282c511350
37  75699  strget:entry T1604397695049725 stp:fffffe282c511350 q:fffffe2719643848 (next:0) sd_flag: 2800
36 115409  tl_wput:entry T1604397695049815 wq:fffffe277761b138 mp:fffffe6d7344b2c0 tep:fffffe25d2e2cd80 type 1 mtype 28
36 115493  tl_wput_common_ser:entry T1604397695052687 mp:fffffe6d7344b2c0 tep:fffffe25d2e2cd80 ->te_wq:fffffe277761b138
37  75700  strget:return T1604397695053836 ret:0
37  78194  strwaitq:entry T1604397695056500 sd_flag: 2800
```
It goes to sleep.
Now haproxy sends some data:
```
36  78747  strwrite_common:entry T1604397695666377 stp:fffffe274c2dc800
36  62143  putnext:entry T1604397695672796 wqp:ffffff18098043e8 mp:fffffe27058e6660 stp:fffffe274c2dc800 putproc:tl`tl_wput
36 115409  tl_wput:entry T1604397695676466 wq:fffffe2719646418 mp:fffffe27058e6660 tep:fffffebc7f1908c8 type 0 mtype 2020614179
36 115409  tl_wput:entry T1604397695677777 wq:fffffe2719646418 mp:fffffe27058e6660
36  68062  serializer_enqueue:entry T1604397695680837 proc:tl`tl_wput_data_ser mp:fffffe27058e6660
36  78748  strwrite_common:return T1604397695684977
```
Note that often we'd be able to strrput() the data directly to the peer (haproxy-socket) and all would be well. This time, however, the serializer is busy, so we enqueue ourselves on it to run tl_wput_data_ser() later.
Next, haproxy does a shutdown(SHUT_WR) as is its occasional wont:
```
36  62143  putnext:entry T1604397695706917 T_ORDREL_REQ for q:ffffff18098043e8 q_next:fffffe2719646418 mp:fffffe30b0735000 stp:fffffe274c2dc800 putproc:tl`tl_wput exec haproxy
              genunix`strput+0x133
              genunix`kstrputmsg+0x233
              sockfs`sotpi_shutdown+0x279
              sockfs`socket_shutdown+0x18
              sockfs`shutdown+0x4f
              unix`sys_syscall+0x1a8
              libc.so.1`_so_shutdown+0xa
              haproxy`mux_pt_shutw+0xff
              haproxy`stream_int_shutw_conn+0x208
              haproxy`process_stream+0x1bf2
              haproxy`process_runnable_tasks+0x57f
              haproxy`run_thread_poll_loop+0x1cc
              haproxy`main+0xdbf
              haproxy`_start_crt+0x83
              haproxy`_start+0x18
36 115409  tl_wput:entry T1604397695715599 wq:fffffe2719646418 mp:fffffe30b0735000 tep:fffffebc7f1908c8 type 1 mtype 10
```
This gets queued into the serializer via a tl_putq_ser() callback.
Next, haproxy calls close:
```
36  78152  strclose:entry T1604397695739902 stp:fffffe274c2dc800
36  75705  qdetach:entry T1604397695743477 stp:fffffe274c2dc800 close:tl`tl_close
```
And we run through the serializer. First, we have our pending data write, tl_wput_data_ser():
```
46 115475  tl_wput_data_ser:entry T1604397695747558 mp:fffffe27058e6660 tep:fffffebc7f1908c8 tep fffffebc7f1908c8 te_state 11 peer fffffebc7eefdbd0 te_state 10
              genunix`serializer_exec+0x22
              genunix`serializer_drain+0x84
              genunix`serializer_enter+0x77
              tl`tl_serializer_enter+0x62
              tl`tl_wsrv+0x71
              genunix`runservice+0x49
              genunix`queue_service+0x41
              genunix`stream_service+0x40
              genunix`taskq_d_thread+0xbc
              unix`thread_start+0xb
46  71082  putq:entry T1604397695752711 q:fffffe2719646418 mp:fffffe27058e6660
46 115476  tl_wput_data_ser:return T1604397695763091
```
We don't take the "fast path" there, however: the peer state is TS_WIND_ORDREL, so we place the data back onto haproxy's write queue (the fast-path test is visible in the diff further down). Presumably the idea is that we're later going to process it via the service routine again (tl_wput_ser()->tl_wput_common_ser()), but it's not clear.
Next on the serializer queue we have the tl_putq_ser() callback: our T_ORDREL_REQ from above. But this routine does this:
```c
	if (tep->te_closing) {
		/* The endpoint is closing: process the message right away. */
		tl_wput_ser(mp, tep);
	} else {
		/* Otherwise, queue it behind anything already pending. */
		TL_PUTQ(tep, mp);
	}
```
And as we can see:
```
46 115469  tl_putq_ser:entry T1604397695766517 mp:fffffe30b0735000 tep:fffffebc7f1908c8 te_closing 1 type 1 mtype 10
```
qdetach()->tl_close() ran enough to set ->te_closing. This now gets processed immediately:
```
46 115495  tl_do_proto:entry T1604397695784218 handling T_ORDREL_REQ for mp:fffffe30b0735000 tep:fffffebc7f1908c8 ->wq:fffffe2719646418 next:0 stp:fffffe274c2dc800
46 115501  tl_ordrel:entry T1604397695788117 mp:fffffe30b0735000 tep:fffffebc7f1908c8 te_wq:fffffe2719646418 peer tep:fffffebc7eefdbd0 peer_rq:fffffe271403bad0 to get T_ORDREL_IND
46  62143  putnext:entry T1604397695799793 T_ORDREL_IND for q:fffffe271403bad0 q_next:fffffe2719643848 mp:fffffe30b0735000 stp:fffffe282c511350
46  77566  strrput:entry T1604397695803390 handling T_ORDREL_IND for q:fffffe2719643848 next:0 stp:fffffe282c511350 mp:fffffe30b0735000
46  77566  strrput:entry T1604397695804446:q:fffffe2719643848, mp:fffffe30b0735000 type 1 stp:fffffe282c511350 sd_flag 10242 mtype 23
46  99006  strsock_proto:entry T1604397695809474 T_ORDREL_IND
46  76788  strseteof:entry T1604397695812531 vnode fffffe27a2f52000 stp fffffe282c511350
```
Since the handling of T_ORDREL_REQ effectively tells the other side that no more writes are coming, this translates into strseteof() on the client/read side. Now we wake up from strget() in haproxy-socket, see that STREOF is set, and return a read of zero.
The short version of this race is that the shutdown(SHUT_WR) manages to jump the queue in front of the last data mblk, because ->te_closing happens to get set before the queued T_ORDREL_REQ callback runs, so the ordrel is processed immediately while the data is still parked on the write queue.
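To make the ordering concrete, here is a user-space model of the two serializer callbacks (a sketch of the logic only, not the kernel code; the struct and function names just mirror the driver's):

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy endpoint: just the bits of tl_endpt_t that the race touches. */
struct endpt {
	bool te_closing;	/* set by qdetach()->tl_close() */
	bool data_on_wq;	/* data mblk parked on the write queue */
	bool peer_saw_eof;	/* peer got T_ORDREL_IND (strseteof()) */
	bool peer_saw_data;	/* peer got the data mblk */
};

/*
 * Model of tl_wput_data_ser(): the peer is in TS_WIND_ORDREL, so the
 * fast path is skipped and the data is putq()ed back for the service
 * routine to deliver later.
 */
static void
wput_data_ser(struct endpt *tep)
{
	tep->data_on_wq = true;		/* putq(): deferred, not delivered */
}

/*
 * Model of tl_putq_ser(): with te_closing set, the T_ORDREL_REQ is
 * handled immediately instead of queueing behind the parked data.
 */
static void
putq_ser(struct endpt *tep)
{
	if (tep->te_closing)
		tep->peer_saw_eof = true;	/* tl_ordrel(): EOF to peer */
	else
		tep->data_on_wq = true;		/* TL_PUTQ(): keeps ordering */
}

int
main(void)
{
	struct endpt ep = { 0 };

	/* close() has run far enough to set te_closing before the drain. */
	ep.te_closing = true;

	/* The serializer drains FIFO: data write first, then the ordrel. */
	wput_data_ser(&ep);
	putq_ser(&ep);

	/* EOF arrives while the data is still sitting on the write queue. */
	printf("eof=%d data=%d still queued=%d\n",
	    ep.peer_saw_eof, ep.peer_saw_data, ep.data_on_wq);
	return (0);
}
```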
As to an actual fix, no real idea. Our read side is in state TS_WIND_ORDREL due to having done its own shutdown on its write queue (ti_statetbl[TE_ORDREL_REQ][TS_DATA_XFER]).
I idly wonder if tl_wput_data_ser() should allow the peer_tep to have that state as well for the fast path. Logically it doesn't seem to make any sense that a client that has done shutdown(SHUT_WR)/tl_ordrel() wouldn't still accept data on its read side.
I actually tried that fix and my test case has been running for a very long time now.
Specifically this fix:
```diff
diff --git a/usr/src/uts/common/io/tl.c b/usr/src/uts/common/io/tl.c
index e6b3d41625..6f1d5db19c 100644
--- a/usr/src/uts/common/io/tl.c
+++ b/usr/src/uts/common/io/tl.c
@@ -1963,6 +1963,7 @@ tl_wput_data_ser(mblk_t *mp, tl_endpt_t *tep)
 	    (tep->te_wq != NULL) &&
 	    (tep->te_wq->q_first == NULL) &&
 	    ((peer_tep->te_state == TS_DATA_XFER) ||
+	    (peer_tep->te_state == TS_WIND_ORDREL) ||
 	    (peer_tep->te_state == TS_WREQ_ORDREL)) &&
 	    ((peer_rq = peer_tep->te_rq) != NULL) &&
 	    (canputnext(peer_rq) || tep->te_closing)) {
```
Change is pushed to https://github.com/joyent/illumos-joyent/tree/OS-8159
One open question with this change is the state transition nextstate[TE_ERROR_ACK][TS_WACK_DREQ10], which moves to state TS_WIND_ORDREL. This state is reached from receiving a TE_DISCON_REQ while in TS_WIND_ORDREL. I think this is never committed to ->te_state in tl_discon_req() or elsewhere, so it's fine, but I haven't looked really closely.
muppet commit 200f1f59f179ab2705243932b9dae037a556d526 (branch master, by John Levon)
MANTA-5018 muppet crashed after hitting FSM error (#28)
MANTA-5195 switch to 'uuid' package for UUID generation
MANTA-4968 need workaround for OS-8159
Reviewed by: Kelly McLaughlin <kelly.mclaughlin@joyent.com>