OS-8159: race between write() and shutdown() for unix sockets

Details

Issue Type: Task
Priority: 4 - Normal
Status: Resolved
Created at: 2020-04-21T11:41:53.497Z
Updated at: 2020-06-09T09:26:42.618Z

People

Created by: Former user
Reported by: Former user
Assigned to: Former user

Resolution

Filing Error: This issue was not filed correctly.
(Resolution Date: 2020-06-09T09:26:42.604Z)

Description

From investigating MANTA-4968, there appears to be a relatively rare race between shutdown(SHUT_WR) and writes across a UNIX socket.
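For orientation, the failing pattern looks roughly like the sketch below. This is not the actual reproducer (that is haproxy-socket.c, linked in the first comment); the socket path and command string are placeholders, and it assumes the usual stats-socket style exchange: the client writes a command, half-closes with shutdown(SHUT_WR), and then reads the response until EOF. The race is on the haproxy side, between its response write() and its own subsequent shutdown(SHUT_WR).

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <err.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            struct sockaddr_un sun = { .sun_family = AF_UNIX };
            char buf[8192];
            ssize_t n;
            size_t total = 0;
            int fd;

            /* Placeholder path/command; the real test drives haproxy's stats socket. */
            (void) strlcpy(sun.sun_path, "/tmp/haproxy.sock", sizeof (sun.sun_path));

            if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
                    err(1, "socket");
            if (connect(fd, (struct sockaddr *)&sun, sizeof (sun)) < 0)
                    err(1, "connect");

            /* Send the command, then half-close our write side. */
            if (write(fd, "show info\n", 10) != 10)
                    err(1, "write");
            if (shutdown(fd, SHUT_WR) < 0)
                    err(1, "shutdown");

            /*
             * haproxy writes its response and then does its own
             * shutdown(SHUT_WR)/close().  When the race hits, the first
             * read() here returns 0 (EOF) even though the response was
             * written: total stays 0 and the sketch flags a failure.
             */
            while ((n = read(fd, buf, sizeof (buf))) > 0)
                    total += n;
            if (n < 0)
                    err(1, "read");

            return (total == 0 ? 1 : 0);
    }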

Comments

Comment by Former user
Created at 2020-04-21T13:14:58.666Z

Running

https://raw.githubusercontent.com/jlevon/grot/master/haproxy-socket.c

against haproxy in a loadbalancer zone.
We're dtracing with variants of:

https://github.com/jlevon/grot/blob/master/d/recvmsg.d

The client by this point has sent the command out over the socket, and done a shutdown(SHUT_WR).

Now it's waiting for some data from haproxy:

 37  76382                 kstrgetmsg:entry ------------------ T1604397695044427 vnode fffffe27a2f52000 stp:fffffe282c511350
 37  75699                     strget:entry T1604397695049725 stp:fffffe282c511350 q:fffffe2719643848 (next:0) sd_flag: 2800
 36 115409                    tl_wput:entry T1604397695049815 wq:fffffe277761b138 mp:fffffe6d7344b2c0 tep:fffffe25d2e2cd80 type 1 mtype 28
 36 115493         tl_wput_common_ser:entry T1604397695052687 mp:fffffe6d7344b2c0 tep:fffffe25d2e2cd80 ->te_wq:fffffe277761b138
 37  75700                    strget:return T1604397695053836 ret:0
 37  78194                   strwaitq:entry T1604397695056500 sd_flag: 2800

It goes to sleep.

Now haproxy sends some data:

 36  78747            strwrite_common:entry T1604397695666377 stp:fffffe274c2dc800
 36  62143                    putnext:entry T1604397695672796 wqp:ffffff18098043e8 mp:fffffe27058e6660 stp:fffffe274c2dc800 putproc:tl`tl_wput
 36 115409                    tl_wput:entry T1604397695676466 wq:fffffe2719646418 mp:fffffe27058e6660 tep:fffffebc7f1908c8 type 0 mtype 2020614179
 36 115409                    tl_wput:entry T1604397695677777 wq:fffffe2719646418 mp:fffffe27058e6660
 36  68062         serializer_enqueue:entry T1604397695680837 proc:tl`tl_wput_data_ser mp:fffffe27058e6660
 36  78748           strwrite_common:return T1604397695684977 

Note that often we'd be able to strrput() directly to the peer (haproxy-socket) and all would be well. This time, however, the serializer is busy, so we place ourselves on it to run tl_wput_data_ser() later.
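In rough pseudo-code (illustrative only: serializer_idle() and te_ser are made-up names, and the serializer_enqueue() argument order is an assumption; tl_wput_data_ser() and serializer_enqueue() themselves are real and visible in the trace), the choice is:

    if (serializer_idle(tep->te_ser)) {             /* hypothetical check */
            /*
             * Idle: process the data inline; strrput() delivers it to the
             * peer's stream head right away and ordering is preserved.
             */
            tl_wput_data_ser(mp, tep);
    } else {
            /*
             * Busy (this case): defer the work.  tl_wput_data_ser() runs
             * whenever the serializer is next drained, after anything else
             * that gets enqueued in the meantime.
             */
            serializer_enqueue(tep->te_ser, tl_wput_data_ser, mp, tep);
    }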

Next, haproxy does a shutdown(SHUT_WR) as is its occasional wont:

 36  62143                    putnext:entry T1604397695706917 T_ORDREL_REQ for q:ffffff18098043e8 q_next:fffffe2719646418 mp:fffffe30b0735000 stp:fffffe274c2dc800 putproc:tl`tl_wput exec haproxy
              genunix`strput+0x133
              genunix`kstrputmsg+0x233
              sockfs`sotpi_shutdown+0x279
              sockfs`socket_shutdown+0x18
              sockfs`shutdown+0x4f
              unix`sys_syscall+0x1a8

              libc.so.1`_so_shutdown+0xa
              haproxy`mux_pt_shutw+0xff
              haproxy`stream_int_shutw_conn+0x208
              haproxy`process_stream+0x1bf2
              haproxy`process_runnable_tasks+0x57f
              haproxy`run_thread_poll_loop+0x1cc
              haproxy`main+0xdbf
              haproxy`_start_crt+0x83
              haproxy`_start+0x18

 36 115409                    tl_wput:entry T1604397695715599 wq:fffffe2719646418 mp:fffffe30b0735000 tep:fffffebc7f1908c8 type 1 mtype 10

This gets queued into the serializer via a tl_putq_ser() callback.
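In rough pseudo-code again (the exact dispatch in tl_wput() and the tl_serializer_enter() argument order are assumptions, not quoted source):

    case M_PROTO:           /* the T_ORDREL_REQ is mtype 10 on an M_PROTO mblk */
            /*
             * Not handled in the caller's context: it is handed to the
             * endpoint's serializer with tl_putq_ser() as the callback, to
             * run when the serializer is next drained, i.e. behind the data
             * write that was enqueued at T1604397695680837 above.
             */
            tl_serializer_enter(tep, tl_putq_ser, mp);      /* arg order assumed */
            break;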


Comment by Former user
Created at 2020-04-21T13:15:24.394Z

Next, haproxy calls close:

 36  78152                   strclose:entry T1604397695739902 stp:fffffe274c2dc800
 36  75705                    qdetach:entry T1604397695743477 stp:fffffe274c2dc800 close:tl`tl_close

And we run through the serializer. First, we have our pending data write, tl_wput_data_ser():

 46 115475           tl_wput_data_ser:entry T1604397695747558 mp:fffffe27058e6660 tep:fffffebc7f1908c8
tep fffffebc7f1908c8 te_state 11
peer fffffebc7eefdbd0 te_state 10

              genunix`serializer_exec+0x22
              genunix`serializer_drain+0x84
              genunix`serializer_enter+0x77
              tl`tl_serializer_enter+0x62
              tl`tl_wsrv+0x71
              genunix`runservice+0x49
              genunix`queue_service+0x41
              genunix`stream_service+0x40
              genunix`taskq_d_thread+0xbc
              unix`thread_start+0xb

 46  71082                       putq:entry T1604397695752711 q:fffffe2719646418 mp:fffffe27058e6660
 46 115476          tl_wput_data_ser:return T1604397695763091

We don't take the "fast path" there, however: the peer's state is TS_WIND_ORDREL, so we putq() the mblk back onto haproxy's write queue. Presumably the idea is that we'll process it again later via the service routine (tl_wput_ser()->tl_wput_common_ser()), but it's not clear.
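For reference, the test that fails here is the fast-path condition in tl_wput_data_ser(); condensed from the context lines of the diff quoted in a later comment (earlier clauses elided), it is roughly:

    if (... &&                                          /* earlier clauses elided */
        (tep->te_wq != NULL) &&
        (tep->te_wq->q_first == NULL) &&                /* nothing already queued behind us */
        ((peer_tep->te_state == TS_DATA_XFER) ||
        (peer_tep->te_state == TS_WREQ_ORDREL)) &&      /* peer still accepting data */
        ((peer_rq = peer_tep->te_rq) != NULL) &&
        (canputnext(peer_rq) || tep->te_closing)) {
            /* fast path: putnext() the data straight to the peer's read queue */
    } else {
            /* this case: putq() the mblk back onto our own write queue */
    }

TS_WIND_ORDREL is not in that list, which is exactly what the proposed fix later in this ticket adds.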


Comment by Former user
Created at 2020-04-21T13:15:37.916Z
Updated at 2020-04-21T13:15:49.429Z

Next on the serializer queue we have the tl_putq_ser() callback - our T_ORDREL_REQ from
above. But this routine does this:

        if (tep->te_closing) {
                tl_wput_ser(mp, tep);
        } else {
                TL_PUTQ(tep, mp);

And as we can see:

 46 115469                tl_putq_ser:entry T1604397695766517 mp:fffffe30b0735000 tep:fffffebc7f1908c8 te_closing 1 type 1 mtype 10

qdetach()->tl_close() ran enough to set ->te_closing. This now gets processed immediately:

 46 115495                tl_do_proto:entry T1604397695784218 handling T_ORDREL_REQ for mp:fffffe30b0735000 tep:fffffebc7f1908c8  ->wq:fffffe2719646418 next:0 stp:fffffe274c2dc800
 46 115501                  tl_ordrel:entry T1604397695788117 
 mp:fffffe30b0735000 tep:fffffebc7f1908c8 te_wq:fffffe2719646418
 peer tep:fffffebc7eefdbd0 peer_rq:fffffe271403bad0 to get T_ORDREL_IND
 46  62143                    putnext:entry T1604397695799793 T_ORDREL_IND for q:fffffe271403bad0 q_next:fffffe2719643848 mp:fffffe30b0735000 stp:fffffe282c511350
 46  77566                    strrput:entry T1604397695803390 handling T_ORDREL_IND for q:fffffe2719643848 next:0 stp:fffffe282c511350 mp:fffffe30b0735000
 46  77566                    strrput:entry T1604397695804446:q:fffffe2719643848, mp:fffffe30b0735000 type 1 stp:fffffe282c511350 sd_flag 10242 mtype 23
 46  99006              strsock_proto:entry T1604397695809474  T_ORDREL_IND
 46  76788                  strseteof:entry T1604397695812531 vnode fffffe27a2f52000 stp fffffe282c511350

Since handling the T_ORDREL_REQ effectively tells the other side that no more writes are coming, this translates into strseteof() on the client/read side. Now we wake up from strget() in haproxy-socket, see that STREOF is set, and return a read of zero.


Comment by Former user
Created at 2020-04-21T13:15:57.861Z

The short version of this race is that the shutdown(SHUT_WR) manages to jump the queue in front of the last data mblk, because ->te_closing happens to get set by the in-progress close before tl_putq_ser() runs, so the T_ORDREL_REQ is processed immediately while the data is still sitting on the write queue.

As to an actual fix, no real idea. Our read side is in state TS_WIND_ORDREL due to having done its own shutdown on its write queue (ti_statetbl[TE_ORDREL_REQ][TS_DATA_XFER]). I idly wonder if tl_wput_data_ser() should allow the peer_tep to have that state as well for the fast path. Logically it doesn't seem to make any sense that a client side that has done shutdown(SHUT_WR)/tl_ordrel() wouldn't still accept data on its read side.


Comment by Former user
Created at 2020-04-21T15:49:28.634Z

I actually tried that fix and my test case has been running for a very long time now.


Comment by Former user
Created at 2020-04-21T17:50:40.089Z

Specifically this fix:

diff --git a/usr/src/uts/common/io/tl.c b/usr/src/uts/common/io/tl.c
index e6b3d41625..6f1d5db19c 100644
--- a/usr/src/uts/common/io/tl.c
+++ b/usr/src/uts/common/io/tl.c
@@ -1963,6 +1963,7 @@ tl_wput_data_ser(mblk_t *mp, tl_endpt_t *tep)
        (tep->te_wq != NULL) &&
        (tep->te_wq->q_first == NULL) &&
        ((peer_tep->te_state == TS_DATA_XFER) ||
+       (peer_tep->te_state == TS_WIND_ORDREL) ||
        (peer_tep->te_state == TS_WREQ_ORDREL)) &&
        ((peer_rq = peer_tep->te_rq) != NULL) &&
        (canputnext(peer_rq) || tep->te_closing)) {

Comment by Former user
Created at 2020-04-22T11:50:40.844Z

Change is pushed to https://github.com/joyent/illumos-joyent/tree/OS-8159


Comment by Former user
Created at 2020-04-22T11:59:45.321Z

One open question with this change is this state transition:

    nextstate[TE_ERROR_ACK][TS_WACK_DREQ10]

which moves to state TS_WIND_ORDREL. That state is reached by receiving a TE_DISCON_REQ while in TS_WIND_ORDREL.

I think this is never committed to ->te_state in tl_discon_req() or elsewhere, so it's fine, but I haven't looked really closely.


Comment by Jira Bot
Created at 2020-04-27T11:19:06.704Z

muppet commit 200f1f59f179ab2705243932b9dae037a556d526 (branch master, by John Levon)

MANTA-5018 muppet crashed after hitting FSM error (#28)

MANTA-5195 switch to 'uuid' package for UUID generation
MANTA-4968 need workaround for OS-8159
Reviewed by: Kelly McLaughlin <kelly.mclaughlin@joyent.com>