OS-6467

LX go 1.9 aborts on unexpected signal interpositioning

Status:
Resolved
Resolution:
Fixed
Created:
2017-11-21T19:31:16.325-0500
Updated:
2018-05-30T14:19:29.199-0400

Description

I ran into this while looking at why TOOLS-1933 was failing with recent Go.  In looking through the stack trace, it's not clear to me what broke. My concern is that some portion of binaries compiled with Go 1.9.2 are broken on LX atm.

The reproducer for this is, on an LX zone:

$ go version
go version go1.9.2 linux/amd64
$ export GOPATH=`go env GOPATH`
$ go get -v -d -u github.com/docker/docker
$ cd $GOPATH/docker/docker
$ git checkout v1.11.1
$ go get -v -u -d github.com/docker/distribution
$ go get -v -u github.com/Sirupsen/logrus
$ cd integration-cli
$ go test -v -test.timeout 1h -check.f TestAttachDetach

 

THREAD           STATE    SOBJ                COUNT
5                UNPARKED <NONE>                  2
                 runtime.notesleep+0x9b
                 runtime.stopm+0xe5
                 runtime.findrunnable+0x4d2
                 runtime.schedule+0x12c
                 runtime.park_m+0xb6
                 runtime.mcall+0x5b
                 runtime.freedefer+0xb8
                 runtime.deferreturn+0xb6
                 internal/poll.(*FD).Read+0x1c1
                 0xc4204b3590
                 runtime.reentersyscall+0x84
                 0xc4204b36b0
                 0xc4204b3800
                 2
                 3
                 0x313032207468676f
                 6
                 0x313032207468677b
                 0x6e72757465722076

2                UNPARKED <NONE>                  1
                 LMfd`lx_brand.so.1`abort+0x8a
                 LMfd`lx_brand.so.1`lx_err_fatal+0xbf
                 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b
                 LMfd`libc.so.1`__sighndlr+6
                 LMfd`libc.so.1`call_user_handler+0x1db
                 LMfd`libc.so.1`sigacthandler+0x116
                 0xffffffffffffffff
                 runtime.sysmon+0xa8
                 runtime.mstart1+0x11e
                 runtime.mstart+0x64
                 0x4018f3
                 0x7fffee7ffeb0

1                UNPARKED <NONE>                  1
                 LMfd`lx_brand.so.1`lx_emulate+0xbf

4                UNPARKED <NONE>                  1
                 runtime.findrunnable+0x5fb
                 runtime.schedule+0x12c
                 runtime.goexit0+0x236
                 runtime.mcall+0x5b
                 runtime.goexit+6

3                UNPARKED <NONE>                  1
                 runtime.notetsleep_internal+0x86
                 runtime.notetsleepg+0x5a
                 os/signal.signal_recv+0xa6
                 os/signal.loop+0x22
                 runtime.goexit+1

> 2::findstack -v
stack pointer for thread 2: 7fffee99e840
[ 00007fffee99e840 LMfd`libc.so.1`_lwp_kill+0xa() ]
  00007fffee99e890 LMfd`lx_brand.so.1`abort+0x8a()
  00007fffee99e990 LMfd`lx_brand.so.1`lx_err_fatal+0xbf(7fffef4e6dc8)
  00007fffee99ea10 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b(12, 7fffee99ee80, 7fffee99eb20)
  00007fffee99ea20 LMfd`libc.so.1`__sighndlr+6()
  00007fffee99eab0 LMfd`libc.so.1`call_user_handler+0x1db(12, 7fffee99ee80, 7fffee99eb20)
  00007fffee99eb00 LMfd`libc.so.1`sigacthandler+0x116(12, 7fffee99ee80, 7fffee99eb20)
  00007fffee7ffdf0 runtime.usleep+0x41()
  00007fffee7ffe78 runtime.sysmon+0xa8()
  00007fffee7ffea0 runtime.mstart1+0x11e()
  00007fffee7ffeb8 runtime.mstart+0x64()
  000000c420000900 0x4018f3()
  00007fffede002b0 0x7fffee7ffeb0()

> 2::regs
%rax = 0x0000000000000000       %r8  = 0x0000000000000000
%rbx = 0x00007fffee99e850       %r9  = 0x00007fffee9004b0
%rcx = 0x00007fffef3760db       %r10 = 0x00007fffef3760db
%rdx = 0xffffff0402041ba0       %r11 = 0x0000000000000202
%rsi = 0x0000000000000006       %r12 = 0x0000000000000011
%rdi = 0x0000000000000002       %r13 = 0x0000000000000012
                                %r14 = 0x0000000000000000
                                %r15 = 0x00007fffee99eb20

%cs = 0x0053    %fs = 0x0000    %gs = 0x0000
%ds = 0x004b    %es = 0x004b    %ss = 0x004b

%rip = 0x00007fffef3757ba LMfd`libc.so.1`_lwp_kill+0xa
%rbp = 0x00007fffee99e840
%rsp = 0x00007fffee99e828

%rflags = 0x00000286
  id=0 vip=0 vif=0 ac=0 vm=0 rf=0 nt=0 iopl=0x0
  status=<of,df,IF,tf,SF,zf,af,PF,cf>

%gsbase = 0x0000000000000000
%fsbase = 0x00007fffee900240
%trapno = 0xe
   %err = 0x4

> 7fffef4e6dc8/S           
0x7fffef4e6dc8: lxsa_handler set to %s?  How?!?!?

CC @Former user

Comments (9)

Former user commented on 2017-11-21T19:41:54.627-0500:

A bit more context: it looks like Go can't be built under LX, either. Using 1.9.2 as the bootstrap compiler:

$ git clone https://github.com/golang/go go-devel
$ cd go-devel/src
$ bash make.bash
Building Go cmd/dist using /usr/local/go.
Building Go toolchain1 using /usr/local/go.
go tool dist: FAILED: /usr/local/go/bin/go install -gcflags=-l -tags=math_big_pure_go compiler_bootstrap bootstrap/cmd/...: signal: aborted (core dumped)
$ mdb ../pkg/bootstrap/core
THREAD           STATE    SOBJ                COUNT
6                UNPARKED <NONE>                  1

1                UNPARKED <NONE>                  1
                 LMfd`lx_brand.so.1`abort+0x8a
                 LMfd`lx_brand.so.1`lx_err_fatal+0xbf
                 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b
                 LMfd`libc.so.1`__sighndlr+6
                 LMfd`libc.so.1`call_user_handler+0x1db
                 LMfd`libc.so.1`sigacthandler+0x116
                 0xffffffffffffffff
                 runtime.notesleep+0x9b
                 runtime.stopm+0xe5
                 runtime.findrunnable+0x4d2
                 runtime.schedule+0x12c
                 runtime.park_m+0xb6
                 runtime.mcall+0x5b
                 runtime.freedefer+0xb8
                 runtime.deferreturn+0xb6
                 internal/poll.(*FD).Read+0x1c1
                 0xc420023590
                 runtime.reentersyscall+0x84
                 0xc4200236b0
                 0xc420023800

7                UNPARKED <NONE>                  1
                 runtime.findrunnable+0x5fb
                 runtime.schedule+0x12c
                 runtime.goexit0+0x236
                 runtime.mcall+0x5b
                 runtime.goexit+6

3                UNPARKED <NONE>                  1
                 runtime.notetsleep_internal+0x86
                 runtime.notetsleepg+0x5a
                 os/signal.signal_recv+0xa6
                 os/signal.loop+0x22
                 runtime.goexit+1

2                UNPARKED <NONE>                  1
                 runtime.sysmon+0xa8
                 runtime.mstart1+0x11e
                 runtime.mstart+0x64
                 0x4018f3
                 0x7fffee7ffeb0

5                UNPARKED <NONE>                  1
                 syscall.Rmdir+0x47
                 os.Remove+0x6e
                 os.RemoveAll+0x4d
                 cmd/go/internal/work.BuildInstallFunc+0x182
                 cmd/go/internal/work.(*Builder).Do.func1+0x72
                 cmd/go/internal/work.(*Builder).Do.func2+0xbb
                 runtime.goexit+1

4                UNPARKED <NONE>                  1
                 syscall.forkExec+0x39c
                 syscall.StartProcess+0x64
                 os.startProcess+0x1f2
                 os.StartProcess+0x64
                 os/exec.(*Cmd).Start+0x53e
                 os/exec.(*Cmd).Run+0x2b
                 cmd/go/internal/work.(*Builder).runOut+0x43f
                 cmd/go/internal/work.gcToolchain.gc+0x100d
                 cmd/go/internal/work.(*gcToolchain).gc+0xf9
                 cmd/go/internal/work.(*Builder).build+0x1755
                 cmd/go/internal/work.(*Builder).Do.func1+0x72
                 cmd/go/internal/work.(*Builder).Do.func2+0xbb
                 runtime.goexit+1
> 1::findstack -v
stack pointer for thread 1: 7fffef0be840
[ 00007fffef0be840 LMfd`libc.so.1`_lwp_kill+0xa() ]
  00007fffef0be890 LMfd`lx_brand.so.1`abort+0x8a()
  00007fffef0be990 LMfd`lx_brand.so.1`lx_err_fatal+0xbf(7fffef4e6dc8)
  00007fffef0bea10 LMfd`lx_brand.so.1`lx_call_user_handler+0x11b(12, 7fffef0bee80, 7fffef0beb20)
  00007fffef0bea20 LMfd`libc.so.1`__sighndlr+6()
  00007fffef0beab0 LMfd`libc.so.1`call_user_handler+0x1db(12, 7fffef0bee80, 7fffef0beb20)
  00007fffef0beb00 LMfd`libc.so.1`sigacthandler+0x116(12, 7fffef0bee80, 7fffef0beb20)
  00007fffffeff7f0 runtime.futex+0x23()
  00007fffffeff828 runtime.notesleep+0x9b()
  00007fffffeff850 runtime.stopm+0xe5()
  00007fffffeff8e8 runtime.findrunnable+0x4d2()
  00007fffffeff930 runtime.schedule+0x12c()
  00007fffffeff968 runtime.park_m+0xb6()
  000000c4200233f8 runtime.mcall+0x5b()
  000000c420023488 runtime.freedefer+0xb8()
  000000c4200234d8 runtime.deferreturn+0xb6()
  000000c420023528 internal/poll.(*FD).Read+0x1c1()
  000000c420023588 0xc420023590()
  000000c4200235c0 runtime.reentersyscall+0x84()
  000000c420382180 0xc4200236b0()
  000000c420023000 0xc420023800()
  000000c420023800 0()
  000000c420024000 0()
  000000c420024800 0()
  000000c420025000 0()
  000000c420025800 0()
  000000c420028800 0()
> 1::regs
%rax = 0x0000000000000000       %r8  = 0x0000000000000000
%rbx = 0x00007fffef0be850       %r9  = 0x00007fffef0d2cb0
%rcx = 0x00007fffef3760db       %r10 = 0x00007fffef3760db
%rdx = 0xffffff0401ad6840       %r11 = 0x0000000000000001
%rsi = 0x0000000000000006       %r12 = 0x0000000000000011
%rdi = 0x0000000000000001       %r13 = 0x0000000000000012
                                %r14 = 0x0000000000000000
                                %r15 = 0x00007fffef0beb20

%cs = 0x0053    %fs = 0x0000    %gs = 0x0000
%ds = 0x004b    %es = 0x004b    %ss = 0x004b

%rip = 0x00007fffef3757ba LMfd`libc.so.1`_lwp_kill+0xa
%rbp = 0x00007fffef0be840
%rsp = 0x00007fffef0be828

%rflags = 0x00000286
  id=0 vip=0 vif=0 ac=0 vm=0 rf=0 nt=0 iopl=0x0
  status=<of,df,IF,tf,SF,zf,af,PF,cf>

%gsbase = 0x0000000000000000
%fsbase = 0x00007fffef0d2a40
%trapno = 0xef4e6dde
   %err = 0xef381c20

I can dig in more tomorrow and see if I can bisect the problem down to the version of Go used.

Former user commented on 2017-11-27T20:12:21.302-0500:

I have created a packer template that can be used to reproduce this pretty quickly and with 100% reliability:

$ packer build -var="go_bootstrap_version=1.9.2"' os-6467.json
$ packer build -on-error=abort -var="go_bootstrap_version=1.9.2"' os-6467.json

The following is os-4647.json:

{
    "builders": [
        {
            "image_name": "{{ user `image_name` }}",
            "image_version": "{{ user `image_version` }}",
            "source_machine_image": "{{ user `source_machine_image` }}",
            "source_machine_image_filter": {
                "most_recent": true,
                "name": "{{ user `source_machine_name` }}",
                "type": "lx-dataset"
            },
            "source_machine_name": "{{ user `image_name` }}-image-builder",
            "source_machine_package": "{{ user `source_machine_package` }}",
            "ssh_username": "root",
            "triton_account": "{{ user `triton_account` }}",
            "triton_key_id": "{{ user `triton_key_id` }}",
            "triton_url": "{{ user `triton_url` }}",
            "type": "triton"
        }
    ],
    "provisioners": [
        {
            "inline": [
                "add-apt-repository -y ppa:ubuntu-toolchain-r/test",
                "apt-get update",
                "apt-get install -y gcc gcc-4.4 git mosh tmux",
                "update-alternatives --remove-all gcc || true",
                "update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.4 20",
                "curl -OL https://redirector.gvt1.com/edgedl/go/go{{ user `go_bootstrap_version` }}.linux-amd64.tar.gz",
                "tar -C /usr/local -xzf go{{ user `go_bootstrap_version` }}.linux-amd64.tar.gz",
                "echo 'export GOROOT=/usr/local/go' | tee -a /etc/profile",
                "echo 'export PATH=$GOROOT/bin:$PATH' | tee -a /etc/profile",
                "echo 'export GOPATH=`go env GOPATH`' | tee -a /etc/profile",
                "echo 'export PATH=$GOPATH/bin:$PATH' | tee -a /etc/profile"
            ],
            "type": "shell"
        },
        {
            "inline": [
                "git clone https://github.com/golang/go.git go-master",
                "cd go-master/src",
                "git checkout {{ user `go_target_version` }}",
                "env GOROOT_BOOTSTRAP=/usr/local/go bash make.bash"
            ],
            "type": "shell"
        }
    ],
    "variables": {
        "go_bootstrap_version": "1.9.2",
        "go_target_version": "go1.9",
        "image_name": "os-6467",
        "image_version": "2017.11.27",
        "source_machine_name": "ubuntu-16.04",
        "source_machine_package": "g4-highcpu-1G",
        "triton_account": "{{ env `SDC_ACCOUNT` }}",
        "triton_key_id": "{{ env `SDC_KEY_ID` }}",
        "triton_url": "{{ env `SDC_URL` }}"
    }
}

The problem isn't the version of the bootstrap compiler or GCC, it's a problem when building Go after the bootstrap phase using Go > 1.9.0. Using gcc-4.4, gcc-5.4, Go 1.5.1, Go 1.8.5, everything "just works." The problem arises with something in the Go 1.9 release. Tomorrow I will bisect the changes between the 1.9.X branch and Go 1.8.X.

Former user commented on 2017-12-01T17:43:35.063-0500 (edited 2017-12-01T17:51:19.861-0500):

I reproduced the failure locally during a build of Go 1.9.

ub16% cd go-devel/src
ub16% bash make.bash
Building Go cmd/dist using /usr/lib/go-1.6.
Building Go toolchain1 using /usr/lib/go-1.6.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
go tool dist: FAILED: /home/jerry/go-devel/pkg/tool/linux_amd64/go_bootstrap install -gcflags=all= -ldflags=all= -i cmd/asm cmd/cgo cmd/compile cmd/link: signal: aborted (core dumped)

All of the failures are the same problem. We are aborting ourself because the user-level brand library's signal interpositioning is in an unexpected state. Within lx_call_user_handler this is the check we're aborting on.

if (lxsap->lxsa_handler == SIG_DFL || lxsap->lxsa_handler == SIG_IGN)
                lx_err_fatal("lxsa_handler set to %s?  How?!?!?",
                    (lxsap->lxsa_handler == SIG_DFL) ? "SIG_DFL" : "SIG_IGN");

Former user commented on 2017-12-01T18:04:52.289-0500 (edited 2017-12-01T18:11:14.548-0500):

Here is the relevant part of a trace when thread 6 aborts:

go_bootstrap 25343 6 -> waitid(1, 25350, 0xc420070ae0, 0x1000004)
...
go_bootstrap 25343 1 -> clone(0x4111, 0x0 0x0, 0x0 0x0)
compile 25350 -> rexit(0) ppid: 25343
zsched 6763 1 send signal: 18 to: go_bootstrap pid 25343, tid 1
go_bootstrap 25343 6 <- waitid=-4 siginfo: sig 0 errno 0 code 0 pid 0 status 0x0
go_bootstrap 25343 6 handle signal: 18
go_bootstrap 25343 6 send signal: 6 to: go_bootstrap pid 25343, tid 6
go_bootstrap 25343 6 handle signal: 6
...
go_bootstrap 25343 6 core

That is, thread 6 is waiting for pid 25350 to exit. Pid 25350 exits and a SIGCHLD is sent to the parent. This causes thread 6 to return from waitid with EINTR. Thread 6 then handles the SIGCHLD but something unexpected in the way we have it's signal handlers setup is causing it to abort (SIGABRT is signal 6).

Former user commented on 2017-12-04T17:11:40.731-0500 (edited 2017-12-04T17:13:19.348-0500):

Here is lx_sighandlers from one of the core dumps. SIG_DFL is 0 and the Linux SIGCHLD is signal 17. I've annoted some of the members with their index and stopped at 17.

> lx_sighandlers::print lx_sighandlers_t
{
    lx_sa = [
        {                                                           // 0
            lxsa_handler = 0
            lxsa_flags = 0
            lxsa_restorer = 0
            lxsa_mask = {
                __bits = [ 0 ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {                                                           // 5
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0
            lxsa_restorer = 0
            lxsa_mask = {
                __bits = [ 0 ]        
            }
        },
        {                                                            // 10
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {                                                            // 15
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
        {                                                             // 17
            lxsa_handler = 0
            lxsa_flags = 0x1c000004
            lxsa_restorer = 0x456200
            lxsa_mask = {
                __bits = [ 0xffffffffffffffff ]
            }
        },
...

Former user commented on 2017-12-15T11:15:10.867-0500 (edited 2017-12-15T11:17:21.848-0500):

After a bunch of tracing/debugging, I now understand the problem.

1) go_bootstrap has several different threads running
2) one of the threads calls clone to do a vfork
3) during a vfork we save the state of all of the current signal handlers (lx_sighandlers_save) and also turn off signals (_sigoff), but only for that thread
4) while the new child is running, it calls rt_sigaction to set the SIGCHLD signal handler to SIG_DFL
5) after this, before the new child has exec-ed or exited, another child exits and SIGCHLD is sent
6) the signal is delivered to a different thread in the original go_bootstrap parent
7) because of the shared address space that has been modified by the vforking child, the original signal handler for the parent is gone (it will be restored by lx_sighandlers_restore after the vforking child execs or exits), so we blow the assert in the signal hander that is handling the SIGCHLD in the parent

Former user commented on 2017-12-15T18:04:09.680-0500:

This bug is a bad interaction between vfork and signals. We've had this bug for years. Note that the Linux vfork man page states:

Signals sent to the parent arrive after the child releases the parent's memory (i.e., after the child terminates or calls execve(2)).

We make no such promise on illumos. I've worked up an initial fix which allows me to successfully build Go 1.9, whereas previously it would consistently abort.

ub16% bash make.bash
Building Go cmd/dist using /usr/lib/go-1.6.
Building Go toolchain1 using /usr/lib/go-1.6.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
---
Installed Go for linux/amd64 in /home/jerry/go-devel
Installed commands in /home/jerry/go-devel/bin

Next, I need to finish cleaning up this fix and do more testing.

Former user commented on 2017-12-18T15:06:25.555-0500:

Testing notes:

Prior to my fix, using Go 1.6 to build Go 1.9, one of the builder processes would abort every time at some point during the build sequence. After my fix, I ran 10 builds in a row without any failure.

I booted Ubuntu 16.04, Ubuntu 14.04 and Alpine (which had some busybox/vfork weirdness in the past). Cassandra is running fine on 16.04

I also ran LTP.

Jira Bot commented on 2017-12-18T16:45:28.778-0500:

illumos-joyent commit b036e0fd770de1dd266db82dad196891746e92e1 (branch master, by Jerry Jelinek)

OS-6467#icft=OS-6467 LX go 1.9 aborts on unexpected signal interpositioning
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Approved by: Patrick Mooney <patrick.mooney@joyent.com>