OS-8141: lx futex called with NULL timeout and FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG set causes panic

Details

Issue Type: Bug
Priority: 4 - Normal
Status: Resolved
Created at: 2020-03-19T16:56:09.175Z
Updated at: 2020-03-24T02:06:36.512Z

People

Created by: Brian Bennett
Reported by: Brian Bennett
Assigned to: Brian Bennett

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2020-03-24T02:06:36.504Z)

Fix Versions

2020-03-26 Oh Henry (Release Date: 2020-03-26)

Description

This was reported by a user on IRC who provided a crash dump. Based on the dump and a bit of experimentation with a futex demo program, we now have a test case that reliably panics the system.

A futex demo program that does not panic can be found here: http://faculty.washington.edu/wlloyd/courses/tcss422/examples/Chapter28/futex_demo.c

The following patch applied to the above program will induce a kernel panic.

--- futex_demo.orig.c   2020-03-19 15:59:08.207375564 +0000
+++ futex_panic.c       2020-03-19 16:53:57.525412392 +0000
@@ -61,7 +61,7 @@
 
                /* Futex is not available; wait */
 
-               s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0);
+               s = futex(futexp, FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG, 0, NULL, NULL, 0);
                if (s == -1 && errno != EAGAIN)
                    errExit("futex-FUTEX_WAIT");
            }
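
For reference, the patched line can be reduced to a single call. The following is a minimal sketch, not the test program used for this ticket; the helper name wait_bitset_null_timeout is hypothetical. On a Linux kernel the call is expected to fail with EINVAL instead of blocking (the futex(2) man page lists a zero val3 bit mask as one EINVAL condition for FUTEX_WAIT_BITSET).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* glibc provides no futex() wrapper, so go through syscall(2). */
static long
futex(uint32_t *uaddr, int op, uint32_t val, const struct timespec *timeout,
    uint32_t *uaddr2, uint32_t val3)
{
	return (syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3));
}

/*
 * Hypothetical helper: issue the call from the patched line once and
 * return 0 on success or the errno value on failure.
 */
static int
wait_bitset_null_timeout(void)
{
	uint32_t word = 0;
	long s;

	/* Same arguments as the patch: NULL timeout, val3 of 0. */
	s = futex(&word, FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG, 0,
	    NULL, NULL, 0);
	return (s == -1 ? errno : 0);
}
```

Do not run this inside an lx zone on a platform image that predates the fix; per this ticket, the equivalent call panics the host.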

Here is the ::status and ::stack output from the panic:

> ::status
debugging crash dump vmcore.0 (64-bit) from smartos
operating system: 5.11 joyent_20200228T001732Z (i86pc)
git branch: release-20200227
git rev: 5cbb72a848d670186cc8a950e564d70c7208f889
image uuid: (not set)
panic message: BAD TRAP: type=e (#pf Page fault) rp=fffffe0021b75be0 addr=0 occurred in module "genunix" due to a NULL pointer dereference
dump content: kernel pages only
> ::stack
ts2hrt+9(0)
futex_wait+0x193(fffffe0021b75dc0, 5e36f78, 84, 0, c, 1)
lx_futex+0x31e(5e36f78, 89, 84, 0, 0, c)
lx_syscall_enter+0x1aa()
sys_syscall+0x145()
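
The fault address of 0 and the ts2hrt+9(0) frame are consistent with the timespec conversion reading through a NULL pointer. The sketch below is a simplified userland stand-in, assuming the conversion is seconds times 10^9 plus nanoseconds. The name sketch_ts2hrt is hypothetical and this is not the kernel's ts2hrt() source.

```c
#include <stdint.h>
#include <time.h>

#define	SKETCH_NANOSEC	1000000000LL

/*
 * Simplified userland stand-in for ts2hrt(); not the kernel source.
 * Both field reads dereference tsp, so a NULL tsp faults at address 0.
 */
static int64_t
sketch_ts2hrt(const struct timespec *tsp)
{
	return ((int64_t)tsp->tv_sec * SKETCH_NANOSEC + tsp->tv_nsec);
}
```

Any caller of a function shaped like this has to rule out a NULL timeout before the call, which is what the stack shows futex_wait did not do.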

Comments

Comment by Former user
Created at 2020-03-19T17:23:22.643Z

Do we know if the semantics in the example imply an unbounded wait?
If so, I think the fix would be something like:

--- a/usr/src/uts/common/brand/lx/syscall/lx_futex.c
+++ b/usr/src/uts/common/brand/lx/syscall/lx_futex.c
@@ -421,8 +421,14 @@ futex_wait(memid_t *memid, caddr_t addr,
                 * timeout to be CLOCK_MONOTONIC-based but limited by system
                 * clock interval; we treat these semantics as effectively
                 * CLOCK_REALTIME.)
+                *
+                * For an indefinite timeout (NULL), the choice of clock is
+                * irrelevant, and we just wait.
                 */
-               if (hrtime) {
+               if (timeout == NULL) {
+                       ret = cv_wait_sig_swap(&fwp->fw_cv,
+                           &futex_hash[index].fh_lock);
+               } else if (hrtime) {
                        ret = cv_timedwait_sig_hrtime(&fwp->fw_cv,
                            &futex_hash[index].fh_lock, ts2hrt(timeout));
                } else {

Comment by Brian Bennett
Created at 2020-03-23T23:18:10.458Z

Following up on your question: it turns out it's not an unbounded wait. Using CLOCK_REALTIME requires an absolute time value, and what's being passed in from the application appears to be a relative value of 0. Linux returns EINVAL in this case. The fix is almost as straightforward, though.
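
The decision described here can be modeled in userland roughly as follows. This is a sketch under assumptions, not the committed change: plan_wait and the wait_plan enum are hypothetical names, and the hrtime parameter stands in for the flag of the same name in futex_wait that selects the absolute-time path.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical labels for the three possible outcomes. */
enum wait_plan {
	WAIT_ABS_HRTIME,	/* absolute deadline via ts2hrt(timeout) */
	WAIT_UNTIL,		/* existing path; NULL means wait forever */
	WAIT_REJECT		/* fail the call, *errp holds the errno */
};

/*
 * Decide how to wait. The absolute-time path needs a real timespec,
 * so a NULL timeout there is rejected with EINVAL instead of being
 * passed on to the conversion routine.
 */
static enum wait_plan
plan_wait(const struct timespec *timeout, bool hrtime, int *errp)
{
	*errp = 0;
	if (hrtime) {
		if (timeout == NULL) {
			*errp = EINVAL;
			return (WAIT_REJECT);
		}
		return (WAIT_ABS_HRTIME);
	}
	return (WAIT_UNTIL);
}
```

Rejecting the call differs from the earlier proposed diff, which would have turned the same input into an indefinite cv_wait_sig_swap().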


Comment by Brian Bennett
Created at 2020-03-24T01:56:00.362Z
Updated at 2020-03-24T01:56:24.920Z

This has been tested with a sample program that previously triggered a kernel panic and now gets Invalid Argument (EINVAL), which is consistent with behavior on Linux. It has also been tested by running the platform image with Plex and Grafana, both of which behave normally.

Test PI is here: https://us-east.manta.joyent.com/bbennett/public/OS-8141/platform-20200321T200110Z.tgz

mail_msg.txt


Comment by Jira Bot
Created at 2020-03-24T02:04:34.141Z

illumos-joyent commit 71d3d1fb721d3893ca04b652d1b175ee2f54ed05 (branch master, by Brian Bennett)

OS-8141 lx futex called with NULL timeout and FUTEX_WAIT_BITSET|FUTEX_PRIVATE_FLAG set causes panic (#272)