OS-7121: lx vfork and signal handling still racey

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-10-02T22:06:10.560Z)

Fix Versions

2018-10-11 Criss Chros (Release Date: 2018-10-11)

Related Links

Description

As reported on github, smartos-live#794, golang will periodically abort due to bad state in the lx signal emulation code. This same issue was fixed in OS-6467, but it seems that it only made it harder to hit. I've managed to replicate the exact issue report in OS-6467 just by trying to compile the gitaly project (I don't think the project matters much, but rather how golang compile is implemented). My hunch is that we just need to mark l_block_all_signals as volatile to prevent concurrent threads from racing the lx_block_all_signals() call.

Comments

Comment by Ryan Zezeski
Created at 2018-08-20T15:45:57.162Z
Updated at 2018-09-18T19:05:41.328Z
I can still reproduce the panic with l_block_all_signals declared volatile. Now I think that perhaps the signal has made it past the check in issig_forreal()/lx_ptrace_issig_stop() but has not yet been delivered at the time that vfork blocks all signals. Something like

1. SIGCHLD is generated, thread T1 makes it past l_block_all_signals but doesn't get to the point of actual delivery.
2. Thread T2 calls vfork, makes brand syscall to block all signals, l_block_all_signals is set, and changes disposition os SIGCHLD.
3. SIGCHLD is delivered to userspace handler on T1 and we abort.

This is just a guess. I think the thing to do now is actually dtrace this mess and try to catch the bad behavior in action. But I'm not sure when I'll have time to get back to this.

Comment by Bryan Cantrill
Created at 2018-09-18T19:04:02.010Z
Updated at 2018-09-18T19:07:15.321Z
I ran Ryan's script for reproducing this. I wasn't able to do it on a two core virtualized environment, but was able to reproduce it on an eight core physical machine. I ran the following script:

#!/usr/sbin/dtrace -Cs

#pragma D option quiet
#pragma D option bufpolicy=ring

#define B_BLOCK_ALL_SIGS        132
#define B_UNBLOCK_ALL_SIGS      133

#define T_VFPARENT      0x4000  /* thread is vfork parent, must call vfwait */
#define SVFORK     0x00040000   /* child of vfork that has not yet exec'd */

BEGIN
{
        start = timestamp;
}

lx_brandsys:entry
/progenyof($1) && arg0 == B_BLOCK_ALL_SIGS/
{
        this->p = curthread->t_procp;
        this->lxpd = (struct lx_proc_data *)this->p->p_brand_data;
        printf("%d pid=%d tid=%d -> B_BLOCK_ALL_SIGS block=%d\n",
            timestamp - start, pid, tid, this->lxpd->l_block_all_signals);
        self->block = 1;
}

lx_brandsys:return
/self->block/
{
        this->p = curthread->t_procp;
        this->lxpd = (struct lx_proc_data *)this->p->p_brand_data;
        printf("%d pid=%d tid=%d <- B_BLOCK_ALL_SIGS block=%d\n",
            timestamp - start, pid, tid, this->lxpd->l_block_all_signals);
        self->block = 0;
}

syscall:::entry
/progenyof($1)/
{
        printf("%d pid=%d tid=%d ~> %s arg0=%p arg1=%p arg2=%p\n",
            timestamp - start, pid, tid, probefunc, arg0, arg1, arg2);
}

syscall:::return
/progenyof($1)/
{
        printf("%d pid=%d tid=%d <~ %s arg0=%p\n",
            timestamp - start, pid, tid, probefunc, arg0);
}

lx-syscall:::entry
/progenyof($1)/
{
        printf("%d pid=%d tid=%d -> %s arg0=%p arg1=%p arg2=%p\n",
            timestamp - start, pid, tid, probefunc, arg0, arg1, arg2);
}

lx-syscall:::return
/progenyof($1)/
{
        printf("%d pid=%d tid=%d <- %s arg0=%p\n",
            timestamp - start, pid, tid, probefunc, arg0);
}

lx_brandsys:entry
/progenyof($1) && arg0 == B_UNBLOCK_ALL_SIGS/
{
        this->p = curthread->t_procp;
        this->lxpd = (struct lx_proc_data *)this->p->p_brand_data;
        printf("%d pid=%d tid=%d -> B_UNBLOCK_ALL_SIGS block=%d\n",
            timestamp - start, pid, tid, this->lxpd->l_block_all_signals);
        self->unblock = 1;
}

lx_brandsys:return
/self->unblock/
{
        this->p = curthread->t_procp;
        this->lxpd = (struct lx_proc_data *)this->p->p_brand_data;
        printf("%d pid=%d tid=%d <- B_UNBLOCK_ALL_SIGS block=%d\n",
            timestamp - start, pid, tid, this->lxpd->l_block_all_signals);
        self->unblock = 0;
}

lx_sendsig:entry
/progenyof($1)/
{
        this->p = curthread->t_procp;
        this->lxpd = (struct lx_proc_data *)this->p->p_brand_data;
        printf("%d pid=%d tid=%d lx_sendsig block=%d\n", timestamp - start,
            pid, tid, this->lxpd->l_block_all_signals);
}

signal-handle
/progenyof($1)/
{
        this->p = curthread->t_procp;
        this->lxpd = (struct lx_proc_data *)this->p->p_brand_data;
        printf("%d pid=%d tid=%d handle sig=%d block=%d\n", timestamp - start,
            pid, tid, arg0, this->lxpd->l_block_all_signals);
}

proc:::create
/progenyof($1) && (curthread->t_flag & T_VFPARENT)/
{
        printf("%d pid=%d tid=%d vfork child=%d\n", timestamp - start,
            pid, tid, args[0]->pr_pid);
}

proc:::create
/progenyof($1) && !(curthread->t_flag & T_VFPARENT)/
{
        printf("%d pid=%d tid=%d fork child=%d\n", timestamp - start,
            pid, tid, args[0]->pr_pid);
}

exit:entry
/progenyof($1)/
{
        self->exit = 1;
}

lx_post_exit_sig:entry
/self->exit/
{
        printf("%d pid=%d tid=%d parent=%d exit\n", timestamp - start,
            args[0]->p_pidp->pid_id, tid, args[0]->p_parent->p_pidp->pid_id);
        self->exit = 0;
}

relvm:entry
/progenyof($1) && (curthread->t_procp->p_flag & SVFORK)/
{
        printf("%d pid=%d tid=%d parent=%d vfork relvm\n", timestamp - start,
            pid, tid, curthread->t_procp->p_parent->p_pidp->pid_id);
}

exec-success
/progenyof($1)/
{
        printf("%d pid=%d tid=%d psargs=%s\n", timestamp - start,
            pid, tid, curpsinfo->pr_psargs);
}

lx_ptrace_issig_stop:entry
/progenyof($1)/
{
        this->p = curthread->t_procp;
        this->lxpd = (struct lx_proc_data *)this->p->p_brand_data;
        printf("%d pid=%d tid=%d issig block=%d\n",
            timestamp - start, pid, tid, this->lxpd->l_block_all_signals);
}

Heavily trimmed output:

...
42881150163 pid=76623 tid=10 -> B_BLOCK_ALL_SIGS block=0
42881150765 pid=76623 tid=12 issig block=0
42881152987 pid=76623 tid=12 issig block=0
42881154308 pid=76623 tid=10 <- B_BLOCK_ALL_SIGS block=1
42881155265 pid=76623 tid=12 <- waitid arg0=fffffffffffffffc
42881156858 pid=76623 tid=12 issig block=1
42881157676 pid=76623 tid=10 <~ brandsys arg0=0
42881158436 pid=76623 tid=14 <- write arg0=50
42881158485 pid=76623 tid=12 handle sig=18 block=1
42881160248 pid=76623 tid=4 -> openat arg0=ffffffffffffff9c arg1=c420494400 arg2=80000
42881160344 pid=76623 tid=10 ~> lwp_sigmask arg0=3 arg1=ffbffeff arg2=ffffffff
42881161973 pid=76623 tid=10 <~ lwp_sigmask arg0=fffffe007fbffeff
42881162078 pid=76623 tid=12 lx_sendsig block=1
42881163625 pid=76623 tid=10 ~> forksys arg0=2 arg1=0 arg2=fffffe00
42881165870 pid=76623 tid=12 ~> lwp_sigmask arg0=3 arg1=7fbffeff arg2=fffffe00
42881166549 pid=76623 tid=4 <- openat arg0=b
42881168750 pid=76623 tid=14 -> fstatat64 arg0=ffffffffffffff9c arg1=c4204101e0 arg2=c42030a788
42881176704 pid=76623 tid=14 <- fstatat64 arg0=0
42881194422 pid=76623 tid=12 <~ lwp_sigmask arg0=ffffffffffbffeff
42881234978 pid=76623 tid=10 vfork child=76691
42881239070 pid=76623 tid=10 <~ forksys arg0=12b93
42881241727 pid=76691 tid=10 <~ forksys arg0=0
42881244065 pid=76691 tid=10 ~> lwp_sigmask arg0=3 arg1=7fbffeff arg2=fffffe00
42881245121 pid=76691 tid=10 <~ lwp_sigmask arg0=ffffffffffbffeff
42881246281 pid=76691 tid=10 ~> brandsys arg0=8f arg1=4111 arg2=0
42881249387 pid=76691 tid=10 <~ brandsys arg0=12b93
42881250763 pid=76691 tid=10 ~> brandsys arg0=85 arg1=7fffecc9df5f arg2=0
42881251663 pid=76691 tid=10 -> B_UNBLOCK_ALL_SIGS block=1
42881253977 pid=76691 tid=10 <- B_UNBLOCK_ALL_SIGS block=0
...
42882359324 pid=76623 tid=12 issig block=1
42882360728 pid=76623 tid=13 -> epoll_ctl arg0=4 arg1=2 arg2=b
42882361156 pid=76623 tid=4 -> epoll_ctl arg0=4 arg1=2 arg2=8
42882362392 pid=76623 tid=13 <- epoll_ctl arg0=fffffffffffffffe
42882363081 pid=76623 tid=4 <- epoll_ctl arg0=fffffffffffffffe
42882368385 pid=76623 tid=4 -> read arg0=8 arg1=c4207f4000 arg2=b0
42882380693 pid=76623 tid=4 <- read arg0=af
42882384289 pid=76623 tid=4 -> read arg0=8 arg1=c4207f40af arg2=1
42882384337 pid=76623 tid=12 ~> lwp_sigmask arg0=3 arg1=ffbffeff arg2=fffffff7
42882386082 pid=76623 tid=4 <- read arg0=0
42882386291 pid=76623 tid=12 <~ lwp_sigmask arg0=ffffffffffbffeff
42882389099 pid=76623 tid=12 ~> sigaction arg0=6 arg1=7fffec0db7b0 arg2=0
42882390719 pid=76623 tid=12 <~ sigaction arg0=0
42882394505 pid=76623 tid=12 ~> lwp_sigmask arg0=3 arg1=ffbffedf arg2=fffffff7
42882396016 pid=76623 tid=12 <~ lwp_sigmask arg0=ffffffffffbffeff
42882396196 pid=76623 tid=4 -> write arg0=3 arg1=c4204be480 arg2=50
42882396863 pid=76623 tid=8 -> read arg0=b arg1=c4207f6000 arg2=8000
42882398693 pid=76623 tid=12 ~> lwp_kill arg0=c arg1=6 arg2=0
...

This shows more or less the race that Ryan described: the SIGCHLD generation is racing with a vfork on another CPU, the SIGCHLD being generated as the vfork is trying to block all signals.

After noodling various approaches, we elected to take a dissatisfying but effective and low-risk approach: we can check for this condition in the signal handler (namely, on recognition that we have a signal with default or ignore dispostion, see if we are in the B_BLOCK_ALL_SIGS condition), spinning until the condition is not true.

Here is a diff for this fix:

diff --git a/usr/src/lib/brand/lx/lx_brand/common/signal.c b/usr/src/lib/brand/lx/lx_brand/common/signal.c
index a8e3601..2b88522 100644
--- a/usr/src/lib/brand/lx/lx_brand/common/signal.c
+++ b/usr/src/lib/brand/lx/lx_brand/common/signal.c
@@ -1450,7 +1450,7 @@ lx_call_user_handler(int sig, siginfo_t *sip, void *p)
 {
 	void (*user_handler)();
 	void (*stk_builder)();
-	struct lx_sigaction *lxsap;
+	volatile struct lx_sigaction *lxsap;
 	ucontext_t *ucp = (ucontext_t *)p;
 	size_t stksize;
 	int lx_sig;
@@ -1480,9 +1480,27 @@ lx_call_user_handler(int sig, siginfo_t *sip, void *p)
 		assert(0);
 	}
 
-	if (lxsap->lxsa_handler == SIG_DFL || lxsap->lxsa_handler == SIG_IGN)
+	while (lxsap->lxsa_handler == SIG_DFL ||
+	    lxsap->lxsa_handler == SIG_IGN) {
+		/*
+		 * This normally shouldn't be possible, but there is a window
+		 * in which a vfork()'d process can have its signal disposition
+		 * corrupted by its child.  While this window has been narrowed
+		 * by blocking all signals in the brand, there remains a
+		 * (smaller) window whereby a signal in flight will be
+		 * delivered before the brand has blocked them.  To catch
+		 * this case, we explicitly check for this condition here.
+		 */
+		if (lx_all_signals_blocked())
+			continue;
+
+		if (lxsap->lxsa_handler != SIG_DFL &&
+		    lxsap->lxsa_handler != SIG_IGN)
+			break;
+
 		lx_err_fatal("lxsa_handler set to %s?  How?!?!?",
 		    (lxsap->lxsa_handler == SIG_DFL) ? "SIG_DFL" : "SIG_IGN");
+	}
 
 #if defined(_LP64)
 	stksize = sizeof (struct lx_sigstack);
@@ -1506,7 +1524,7 @@ lx_call_user_handler(int sig, siginfo_t *sip, void *p)
 		lxsap->lxsa_handler = SIG_DFL;
 
 	lx_sigdeliver(lx_sig, sip, ucp, stksize, stk_builder, user_handler,
-	    lxsap);
+	    (struct lx_sigaction *)lxsap);
 
 	/*
 	 * We need to handle restarting system calls if requested by the
@@ -2390,3 +2408,9 @@ lx_unblock_all_signals()
 {
 	(void) syscall(SYS_brand, B_UNBLOCK_ALL_SIGS);
 }
+
+int
+lx_all_signals_blocked()
+{
+	return (syscall(SYS_brand, B_ALL_SIGS_BLOCKED));
+}
diff --git a/usr/src/lib/brand/lx/lx_brand/sys/lx_misc.h b/usr/src/lib/brand/lx/lx_brand/sys/lx_misc.h
index ce241db..e01166b 100644
--- a/usr/src/lib/brand/lx/lx_brand/sys/lx_misc.h
+++ b/usr/src/lib/brand/lx/lx_brand/sys/lx_misc.h
@@ -149,6 +149,7 @@ extern void lx_stack_postfork(void);
 
 extern void lx_block_all_signals();
 extern void lx_unblock_all_signals();
+extern int lx_all_signals_blocked();
 
 /*
  * NO_UUCOPY disables calls to the uucopy* system calls to help with
diff --git a/usr/src/uts/common/brand/lx/os/lx_brand.c b/usr/src/uts/common/brand/lx/os/lx_brand.c
index 974c860..866b026 100644
--- a/usr/src/uts/common/brand/lx/os/lx_brand.c
+++ b/usr/src/uts/common/brand/lx/os/lx_brand.c
@@ -1892,6 +1892,16 @@ lx_brandsys(int cmd, int64_t *rval, uintptr_t arg1, uintptr_t arg2,
 		mutex_exit(&p->p_lock);
 		return (result);
 	}
+
+	case B_ALL_SIGS_BLOCKED: {
+		uint_t result;
+
+		mutex_enter(&p->p_lock);
+		pd = ptolxproc(p);
+		result = pd->l_block_all_signals;
+		mutex_exit(&p->p_lock);
+		return (result);
+	}
 	}
 
 	return (EINVAL);
diff --git a/usr/src/uts/common/brand/lx/sys/lx_brand.h b/usr/src/uts/common/brand/lx/sys/lx_brand.h
index e305680..9c1579c 100644
--- a/usr/src/uts/common/brand/lx/sys/lx_brand.h
+++ b/usr/src/uts/common/brand/lx/sys/lx_brand.h
@@ -103,7 +103,7 @@ extern "C" {
 #define	B_STORE_ARGS		137
 #define	B_GETPID		138
 #define	B_JUMP_TO_LINUX		139
-/* formerly B_SET_THUNK_PID	140 */
+#define	B_ALL_SIGS_BLOCKED	140
 #define	B_EXIT_AS_SIG		141
 /* formerly B_HELPER_WAITID	142 */
 #define	B_HELPER_CLONE		143

Running with this fix made it impossible for me or Ryan (on his larger machine) to hit the condition -- and we verified with DTrace that we hit the B_ALL_SIGS_BLOCKED case.


Comment by James Hsieh
Created at 2018-09-20T01:59:18.876Z
@bryan - customer installed the new PI but we appear to have run into (guessing) an unrelated issue.  Per the customer, PIs after 20180525 do not work for them and cause issues with their ability to deploy docker instances.  So we cannot get a good test if your PI resolves the race condition issue.  We have other application core dumps but the stacks look quite a bit different (see SWSUP-1273)

SWSUP-1261 is where the other debug effort is occurring and when that issues is resolved, and they are able to deploy docker containers again, we will then be able to figure out next steps with this issue.

Comment by Ryan Zezeski
Created at 2018-09-20T03:58:08.156Z
Updated at 2018-09-20T03:58:30.799Z
I did some initial (read cursory) debugging on one of the new cores for SWSUP-1273, and I can't see how it is related at all. The new issue is in cmon-agent/node-kstat, but I'm not quite sure what's happening. I would get this issue (vfork/signal) merged sooner rather than later as others have reported similar cores in other applications besides golang. We shouldn't hold up this fix because of the new cores.

Someone more familiar with debugging JS should look into the new problem and create a separate ticket.

Comment by Bryan Cantrill
Created at 2018-10-02T21:45:36.652Z
For posterity (and because I'm not sure if it's elsewhere), here is the script I was running to reproduce/test this:

export GOPATH=~/gitaly/_build/
export PATH=$GOPATH/bin:$PATH

for i in {1..1000}; do go install gitlab.com/gitlab-org/gitaly/cmd/gitaly gitlab.com/gitlab-org/gitaly/cmd/gitaly-ssh; if [ "$?" -ne 0 ]; then echo "Run #$i aborted"; exit; fi; done

And here's the golang I was using:

root@51e4dd92-7a01-e75f-993e-c3beb1fe1506:~# which go ; go version
/usr/local/bin/go
go version go1.10.3 linux/amd64


Comment by Jira Bot
Created at 2018-10-02T21:50:24.677Z
illumos-joyent commit fe10e0740dd87c89538183dcb46f8787d84d1c54 (branch master, by Bryan Cantrill)

OS-7121 lx vfork and signal handling still racey
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Ryan Zezeski <rpz@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>