OS-6580: LX fails to handle futex_requeue correctly

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2018-01-30T20:33:30.341Z
Updated at: 2020-09-30T17:32:02.032Z

People

Created by: Todd Whiteman
Reported by: Todd Whiteman
Assigned to: Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2020-09-30T17:32:02.019Z)

Fix Versions

2020-10-08 calmer than you are (Release Date: 2020-10-08)

Related Links

Labels

lxbrand

Description

Originally reported in SWSUP-1041.

This simple Docker container works correctly in a regular Docker environment, but hangs under Triton:

$ docker run -ti node:8-alpine sh
echo "if (1) return" > play.js
node play.js
# Process will not exit - it is hung

Using a different node.js image (based on Ubuntu instead of Alpine) works successfully under Triton:

$ docker run -ti node:8 sh
echo "if (1) return" > play.js
node play.js
# Process exits cleanly

Reproducible in both COAL and TPC (us-sw-1) environments.

Truss output of the executable is included in SWSUP-1041.

Comments

Comment by Former user
Created at 2018-03-07T22:57:15.285Z

Preliminary investigation (unfortunately, my kernel debugging skills are almost non-existent):

Running:

a198024f7f1c:~# uname -a
Linux a198024f7f1c 3.13.0 BrandZ virtual linux x86_64 Linux
a198024f7f1c:~# node -e ''
(it hangs here)

On the following platform:

[root@headnode (coal) ~]# uname -a
SunOS headnode 5.11 joyent_20180209T180849Z i86pc i386 i86pc
[root@headnode (coal) ~]# 

In the container running `node -e`:

a198024f7f1c:~# /native/usr/bin/ptree
92416 /bin/sh
  93149 <defunct>
  93307 <defunct>
92475 /bin/login -h zone:global -f root
  92478 -ash
    12841 node -e 
93188 /bin/login -h zone:global -f root
  93190 -ash
    12903 /native/usr/bin/ptree
12897 /bin/login -h zone:global -f root
  12899 -ash

Attaching to the node process with `gdb`:

a198024f7f1c:~# gdb -p 12841
GNU gdb (GDB) 7.12.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-alpine-linux-musl".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 12841
[New LWP 12843]
[New LWP 12844]
[New LWP 12845]
[New LWP 12847]
0x00007fffed053a50 in __clone () from /lib/ld-musl-x86_64.so.1
(gdb) thread apply all bt

Thread 5 (LWP 12847):
#0  0x00007fffed053a50 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007fffed0511c3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fffed192b30 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 4 (LWP 12845):
#0  0x00007fffed053a50 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007fffed0511c3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fffe9c02b30 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 3 (LWP 12844):
#0  0x00007fffed053a50 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007fffed0511c3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fffeaa02b30 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 2 (LWP 12843):
#0  0x00007fffed053a50 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007fffed0511c3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fffeb802b30 in ?? ()
#3  0x0000000000000000 in ?? ()

Thread 1 (LWP 12841):
#0  0x00007fffed053a50 in __clone () from /lib/ld-musl-x86_64.so.1
#1  0x00007fffed0511c3 in ?? () from /lib/ld-musl-x86_64.so.1
#2  0x00007fffed28bb90 in ?? () from /lib/ld-musl-x86_64.so.1
#3  0x0000000000000000 in ?? ()
(gdb) info registers
rax            0xca     202
rbx            0x7fffed28bb48   140737172257608
rcx            0x0      0
rdx            0x322b   12843
rsi            0x0      0
rdi            0x7fffeb802b20   140737144433440
rbp            0xca     0xca
rsp            0x7fffffeffad8   0x7fffffeffad8
r8             0x0      0
r9             0x0      0
r10            0x0      0
r11            0x246    582
r12            0x0      0
r13            0x0      0
r14            0x7fffeb802b20   140737144433440
r15            0x4      4
rip            0x7fffed053a50   0x7fffed053a50 <__clone+98>
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x4b     75
ds             0x4b     75
es             0x4b     75
fs             0x0      0
gs             0x0      0
(gdb) thread 1
[Switching to thread 1 (LWP 12841)]
#0  0x00007fffed053a50 in __clone () from /lib/ld-musl-x86_64.so.1
(gdb) disassemble
Dump of assembler code for function __clone:
   0x00007fffed0539ee <+0>:     xor    %eax,%eax
   0x00007fffed0539f0 <+2>:     mov    $0x38,%al
   0x00007fffed0539f2 <+4>:     mov    %rdi,%r11
   0x00007fffed0539f5 <+7>:     mov    %rdx,%rdi
   0x00007fffed0539f8 <+10>:    mov    %r8,%rdx
   0x00007fffed0539fb <+13>:    mov    %r9,%r8
   0x00007fffed0539fe <+16>:    mov    0x8(%rsp),%r10
   0x00007fffed053a03 <+21>:    mov    %r11,%r9
   0x00007fffed053a06 <+24>:    and    $0xfffffffffffffff0,%rsi
   0x00007fffed053a0a <+28>:    sub    $0x8,%rsi
   0x00007fffed053a0e <+32>:    mov    %rcx,(%rsi)
   0x00007fffed053a11 <+35>:    syscall 
   0x00007fffed053a13 <+37>:    test   %eax,%eax
   0x00007fffed053a15 <+39>:    jne    0x7fffed053a26 <__clone+56>
   0x00007fffed053a17 <+41>:    xor    %ebp,%ebp
   0x00007fffed053a19 <+43>:    pop    %rdi
   0x00007fffed053a1a <+44>:    callq  *%r9
   0x00007fffed053a1d <+47>:    mov    %eax,%edi
   0x00007fffed053a1f <+49>:    xor    %eax,%eax
   0x00007fffed053a21 <+51>:    mov    $0x3c,%al
   0x00007fffed053a23 <+53>:    syscall 
   0x00007fffed053a25 <+55>:    hlt    
   0x00007fffed053a26 <+56>:    retq   
   0x00007fffed053a27 <+57>:    mov    (%rdi),%eax
   0x00007fffed053a29 <+59>:    test   %eax,%eax
   0x00007fffed053a2b <+61>:    jne    0x7fffed053a51 <__clone+99>
   0x00007fffed053a2d <+63>:    mov    %rdi,%r11
   0x00007fffed053a30 <+66>:    mov    %rsi,%rax
   0x00007fffed053a33 <+69>:    mov    %rdx,%rdi
   0x00007fffed053a36 <+72>:    mov    %rcx,%rsi
   0x00007fffed053a39 <+75>:    mov    %r8,%rdx
   0x00007fffed053a3c <+78>:    mov    %r9,%r10
   0x00007fffed053a3f <+81>:    mov    0x8(%rsp),%r8
   0x00007fffed053a44 <+86>:    mov    0x10(%rsp),%r9
   0x00007fffed053a49 <+91>:    mov    %r11,0x8(%rsp)
   0x00007fffed053a4e <+96>:    syscall 
=> 0x00007fffed053a50 <+98>:    retq   
   0x00007fffed053a51 <+99>:    jmpq   0x7fffed051133
End of assembler dump.
(gdb)

Looking at what all threads are spending their time on:

a198024f7f1c:~# /native/usr/bin/prstat -mLc 1
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
 12940 root      18  82 0.0 0.0 0.0 0.0 0.0 0.0   0   0 25K   0 prstat/1
 93307 root     7.0  58 0.0 0.0 0.0 0.0 0.0  24   0   0  9K   0 /0
 12841 root     0.1 0.1 0.0 0.1 0.1 100 0.0 0.0   1   0   9   0 node/1
 93149 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   1   0 /0
 12899 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0  12   0 busybox/1
 12897 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   5   0 busybox/1
 92342 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 zsched/1
 93188 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 busybox/1
 12841 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/6
 12841 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/5
 12841 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/4
 12841 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0   0   0   0   0 node/3
 92416 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 busybox/1
 92475 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 busybox/1
 93190 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   0   0   0   0 busybox/1
Total: 12 processes, 14 lwps, load averages: 0.00, 0.00, 0.00

All node processes/threads are spending 100% of their time on user locks.

[root@headnode (coal) ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba fctl stmf_sbd stmf zfs mm lofs mpt idm fcp sd crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> 0t12841::pid2proc | ::walk thread | ::findstack -v
stack pointer for thread ffffff03e7300c40: ffffff001a53bb60
[ ffffff001a53bb60 _resume_from_idle+0x112() ]
  ffffff001a53bb90 swtch+0x141()
  ffffff001a53bc20 cv_wait_sig_swap_core+0x1b9(ffffff03dfc72be0, ffffffffc0167e60, 0)
  ffffff001a53bc40 cv_wait_sig_swap+0x17(ffffff03dfc72be0, ffffffffc0167e60)
  ffffff001a53bcd0 cv_waituntil_sig+0xbd(ffffff03dfc72be0, ffffffffc0167e60, 0, 6)
  ffffff001a53bd70 futex_wait+0x19d(ffffff001a53bdd0, 7fffeb802b20, 322b, 0, ffffffff)
  ffffff001a53be70 lx_futex+0x2c1(7fffeb802b20, 0, 322b, 0, 0, 0)
  ffffff001a53bef0 lx_syscall_enter+0x16f()
  ffffff001a53bf10 sys_syscall+0x142()
stack pointer for thread ffffff042dad2420: ffffff001a70fb60
[ ffffff001a70fb60 _resume_from_idle+0x112() ]
  ffffff001a70fb90 swtch+0x141()
  ffffff001a70fc20 cv_wait_sig_swap_core+0x1b9(ffffff03dfc72a20, ffffffffc0167be0, 0)
  ffffff001a70fc40 cv_wait_sig_swap+0x17(ffffff03dfc72a20, ffffffffc0167be0)
  ffffff001a70fcd0 cv_waituntil_sig+0xbd(ffffff03dfc72a20, ffffffffc0167be0, 0, 6)
  ffffff001a70fd70 futex_wait+0x19d(ffffff001a70fdd0, 7fffeb8029e4, 2, 0, ffffffff)
  ffffff001a70fe70 lx_futex+0x2c1(7fffeb8029e4, 80, 2, 0, 0, 0)
  ffffff001a70fef0 lx_syscall_enter+0x16f()
  ffffff001a70ff10 sys_syscall+0x142()
stack pointer for thread ffffff0465954c00: ffffff0019fe5b60
[ ffffff0019fe5b60 _resume_from_idle+0x112() ]
  ffffff0019fe5b90 swtch+0x141()
  ffffff0019fe5c20 cv_wait_sig_swap_core+0x1b9(ffffff04112a7a20, ffffffffc0165fe0, 0)
  ffffff0019fe5c40 cv_wait_sig_swap+0x17(ffffff04112a7a20, ffffffffc0165fe0)
  ffffff0019fe5cd0 cv_waituntil_sig+0xbd(ffffff04112a7a20, ffffffffc0165fe0, 0, 6)
  ffffff0019fe5d70 futex_wait+0x19d(ffffff0019fe5dd0, 7fffeaa029e4, 2, 0, ffffffff)
  ffffff0019fe5e70 lx_futex+0x2c1(7fffeaa029e4, 80, 2, 0, 0, 0)
  ffffff0019fe5ef0 lx_syscall_enter+0x16f()
  ffffff0019fe5f10 sys_syscall+0x142()
stack pointer for thread ffffff0408b457e0: ffffff0019323b60
[ ffffff0019323b60 _resume_from_idle+0x112() ]
  ffffff0019323b90 swtch+0x141()
  ffffff0019323c20 cv_wait_sig_swap_core+0x1b9(ffffff04112a7be0, ffffffffc01643e0, 0)
  ffffff0019323c40 cv_wait_sig_swap+0x17(ffffff04112a7be0, ffffffffc01643e0)
  ffffff0019323cd0 cv_waituntil_sig+0xbd(ffffff04112a7be0, ffffffffc01643e0, 0, 6)
  ffffff0019323d70 futex_wait+0x19d(ffffff0019323dd0, 7fffe9c029e4, 2, 0, ffffffff)
  ffffff0019323e70 lx_futex+0x2c1(7fffe9c029e4, 80, 2, 0, 0, 0)
  ffffff0019323ef0 lx_syscall_enter+0x16f()
  ffffff0019323f10 sys_syscall+0x142()
stack pointer for thread ffffff049e2e2b40: ffffff0019047b60
[ ffffff0019047b60 _resume_from_idle+0x112() ]
  ffffff0019047b90 swtch+0x141()
  ffffff0019047c20 cv_wait_sig_swap_core+0x1b9(ffffff04112a76a0, ffffffffc0161840, 0)
  ffffff0019047c40 cv_wait_sig_swap+0x17(ffffff04112a76a0, ffffffffc0161840)
  ffffff0019047cd0 cv_waituntil_sig+0xbd(ffffff04112a76a0, ffffffffc0161840, 0, 6)
  ffffff0019047d70 futex_wait+0x19d(ffffff0019047dd0, 7fffef373ca0, ffffffff, 0, ffffffff)
  ffffff0019047e70 lx_futex+0x2c1(7fffef373ca0, 80, ffffffffffffffff, 0, 0, 0)
  ffffff0019047ef0 lx_syscall_enter+0x16f()
  ffffff0019047f10 sys_syscall+0x142()
> 0t12841::pid2proc | ::walk thread | ::thread -b
            ADDR            WCHAN               TS             PITS    SOBJ OPS
ffffff03e7300c40 ffffff03dfc72be0 ffffff05647b10d0                0 cv_sobj_ops
ffffff042dad2420 ffffff03dfc72a20 ffffff03d7f21118                0 cv_sobj_ops
ffffff0465954c00 ffffff04112a7a20 ffffff0465b0a5a8                0 cv_sobj_ops
ffffff0408b457e0 ffffff04112a7be0 ffffff049d3cd5b0                0 cv_sobj_ops
ffffff049e2e2b40 ffffff04112a76a0 ffffff04062d91d0                0 cv_sobj_ops

So it looks like all threads are waiting on a condition variable to be woken up before they can exit, and thus there is no thread left to do the waking?


Comment by Former user
Created at 2018-10-05T20:42:40.536Z

While testing a fix for OS-7288, I hit this with these three commands running in different windows:

while true; do strace -f iperf -s -B 10.88.88.43; done

ln /usr/bin/iperf /usr/bin/myperf; while true; do myperf -c 10.88.88.43 -t 1; done

while true; do sleep 0.5; pkill iperf; done

PI was gate bits from 2018-10-05 and the image was Ubuntu 14.04 7b5981c4-1889-11e7-b4c5-3f3bdfc9b88b.


Comment by Former user
Created at 2018-10-08T14:30:59.661Z

This bug is also reproducible with an Ubuntu 14.04 lx zone on a PI built from master bits as of 2018-10-04.

> fffffe4c5d419ff8::walk thread | ::findstack -v
stack pointer for thread fffffe0bee9f3c40: fffffe0010367d20
[ fffffe0010367d20 _resume_from_idle+0x12b() ]
  fffffe0010367d50 swtch+0x18a()
  fffffe0010367dd0 stop+0x3d6(9, 1)
  fffffe0010367e20 lx_ptrace_stop_common+0x71(fffffe4c5d419ff8, fffffe181afca800, 1)
  fffffe0010367e60 lx_ptrace_stop+0x92(1)
  fffffe0010367ee0 lx_syscall_enter+0x8e()
  fffffe0010367f00 sys_syscall+0x242()

To reproduce with an Ubuntu 14.04 lx image:

apt install -y iperf strace ksh
ln /usr/bin/iperf /usr/bin/myperf

Three windows, each with one of:

while true; do strace -f iperf -s 10.88.88.202; done

while true; do myperf -c 10.88.88.202 -t 1; done

./killit

killit is the following; ksh is used for floating-point math.

#! /bin/ksh

while true; do
        pkill strace &
        pkill iperf
        time=$(( $RANDOM % 1100 / 1000.0 ))
        echo $time
        sleep $time
done

Comment by Former user
Created at 2019-11-08T17:39:57.386Z

Hello Mike,

It appears that another customer has recently run into this issue (ZD-229434). Is there a workaround I could present to them? I opened NSUP-3 to address it, though perhaps in error, or as a duplicate?

If you could advise that would be great.

Jeff W


Comment by Former user
Created at 2019-11-11T15:43:13.838Z

@accountid:62562f1ed2839e006bb074c9 The issue here is an lx issue, not a node issue. If a node bug exists when running on a Linux kernel, then it may be considered a node support issue. The NSUP issue should probably be converted to a SWSUP issue.

The only workaround that I know of at this point would be to use bhyve running a real Linux kernel.


Comment by Former user
Created at 2020-09-28T22:47:38.935Z
Updated at 2020-09-28T22:55:00.753Z

During a debugging session, we ran node, then called 'exit', and compared the system call behavior of a working Linux VM to that of the LX zone where it would hang (both using Alpine).

The hangs appear to be related to the POSIX condvar implementation in musl, which differs from glibc's implementation, though both are built on futexes.

In the LX zone, we noticed this error:

[pid 3333] futex(0x7fffe98029b4, FUTEX_REQUEUE_PRIVATE, 0, 1, 0x12ecc) = -1 EAGAIN (Resource temporarily unavailable)

While in the working VM, we instead saw:

[pid  3333] futex(0x7f621e9809b4, FUTEX_REQUEUE_PRIVATE, 0, 1, 0x555e6cf7aecc) = 1

The futex(2) man page does not list EAGAIN as a valid error for FUTEX_REQUEUE (FUTEX_REQUEUE_PRIVATE is merely FUTEX_REQUEUE|FUTEX_PRIVATE_FLAG); however, for FUTEX_CMP_REQUEUE it states that EAGAIN is returned when "The value pointed to by uaddr is not equal to the expected value val3".

Looking at the implementation in lx_futex.c, we see:

1403  	case FUTEX_CMP_REQUEUE:
1404  	case FUTEX_REQUEUE:
1405  		rval = futex_requeue(&memid, &memid2, val,
1406  		    val2, (void *)addr2, &val3);
1407  
1408  		break;

(This is from lx_futex(), where val3 is the last argument to the function (int val3).) Note that this means the last parameter to futex_requeue() (int *cmpval) will never be NULL. In futex_requeue(), we see:

745  	if (cmpval != NULL) {
746  		if (fuword32(addr, (uint32_t *)&curval)) {
747  			ret = -EFAULT;
748  			goto out;
749  		}
750  		if (curval != *cmpval) {
751  			ret = -EAGAIN;
752  			goto out;
753  		}
754  	}

and this is the only location that returns EAGAIN. From this it's fairly clear that the fix is to pass NULL for cmpval in the FUTEX_REQUEUE case, but &val3 in the FUTEX_CMP_REQUEUE case.


Comment by Michael Zeller
Created at 2020-09-30T17:29:36.918Z

I ran LTP, following the instructions we ship in illumos-joyent, on a platform with the fix.

-----------------------------------------------
Total Tests: 1184
Total Skipped Tests: 336
Total Failures: 11
Kernel Version: 4.3.0
Machine Architecture: x86_64
Hostname: 1d9eda80-a149-45de-9681-45f9f65440af
root@1d9eda80-a149-45de-9681-45f9f65440af:/opt/ltp# grep FAIL /opt/ltp/results/LTP_RUN_ON-2020_09_30-16h_38m_56s.log
fcntl34                        FAIL       2
fcntl34_64                     FAIL       2
fcntl36                        FAIL       2
fcntl36_64                     FAIL       2
io_setup01                     FAIL       1
msgctl09                       FAIL       1
msgctl10                       FAIL       1
msgctl11                       FAIL       2
recv01                         FAIL       1
recvfrom01                     FAIL       1
recvmsg01                      FAIL       1

The test results look as expected. Another user confirmed that node.js now exits properly on musl-based distros.


Comment by Jira Bot
Created at 2020-09-30T17:31:46.572Z

illumos-joyent commit a68418414cd381cf2cb7f9464533e83dce09beb9 (branch master, by Jason King)

OS-6580 LX fails to handle futex_requeue correctly (#333)

Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Mike Zeller <mike.zeller@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>