OS-5834: procfs t_hold updates can race against target thread

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2016-12-05T23:43:28.000Z
Updated at:2020-04-09T00:43:36.565Z

People

Created by:Former user
Reported by:Former user
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2016-12-08T21:04:52.000Z)

Fix Versions

2016-12-22 FAR EAST STRATEGY (Release Date: 2016-12-22)

Related Links

Description

Richard Reingruber reported in illumos#7637 that certain actions could lead to the signal mask for a thread becoming corrupt:

Symptom: process gets killed upon SIGSEGV even though a handler is registered.

Dtrace analysis: after return from a signal handler restorecontext(ucontext_t *ucp) leaves all maskable signals blocked in curthread->t_hold. So upon the next SIGSEGV the process gets killed.

I tracked down this bug in Solaris 11 which caused my server process to crash in very rare occasions. I used the Illumos/OpenIndiana sources and dtrace to do the analysis. This is a while ago, and Oracle has fixed the bug in Solaris 11 already, but only today I used an OpenIndiana Hipster 2016.10 live system on VirtualBox to run my test and found that there the bug is still present.

The issue can be reproduced very easily with the small test attached.

A detailed description is included in solaris11_restorecontext_bug.c

Important: to trigger the bug, the test must run at least on 2 physical cpus with 2 threads.

I tweaked his examples to build a little more easily with GCC:

main.c:

#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <signal.h>
#include <ucontext.h>
#include <sys/types.h>
#include <procfs.h>
#include <errno.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

const char* lstatus_file = "/proc/self/lstatus";
int fourty2;

/*
 * gensigsegv is implemented in solaris11_restorecontext_bug_gensigsegv_amd64.s
 * and solaris11_restorecontext_bug_gensigsegv_sparc.s
 *
 * The *.s files where compiled from
 *
 *    int gensigsegv(int* ptr) {
 *        return *ptr;
 *    }
 *
 * sigsegv_handler below requires that ptr is allocated in register o0 on SPARC
 * and in register rdi on x86. That's why the simple function is implemented in
 * assembler.
 */
extern int gensigsegv(int* ptr);

/*
 * Called when gensigsegv() generates a SIGSEGV because it was called with ptr == NULL.
 * sigsegv_handler() replaces the NULL with &fourty2. Then gensigsegv() can successfully
 * dereference ptr.
 * IMPORTANT: ptr must be allocated in register o0 on SPARC and in register rdi on x86
 */
void sigsegv_handler(int sig, siginfo_t* info, void* ucVoid) {
  ucontext_t* uc = (ucontext_t*) ucVoid;
#ifdef SPARC
#define BASEREG REG_O0
#else
#define BASEREG REG_RDI
#endif
  uc->uc_mcontext.gregs[BASEREG] = (uintptr_t)&fourty2;
}

void* thread_entry_func(void *arg) {
    int thr_no = (intptr_t) arg;
    printf("Thread %d is up and running!\n", thr_no);
    for(;;) {
        int rv = gensigsegv((int*) 0);
        if (0) printf("T%: gensigsegv((int*) 0)->%d\n", rv);
        if (rv != fourty2) {
            printf("\nERROR: gensigsegv(0) returned %d but expected %d \n\n", rv, fourty2);
            exit(2);
        }
    }
}

void install_sigsegv_handler() {
    int rc;
    struct sigaction sigAct;
    sigfillset(&(sigAct.sa_mask));
    sigdelset(&(sigAct.sa_mask), SIGCONT); /* remove SIGCONT as an eye catcher */
    sigAct.sa_sigaction = sigsegv_handler;
    sigAct.sa_flags = SA_SIGINFO | SA_RESTART;

    rc = sigaction(SIGSEGV, &sigAct, 0);
    if (rc != 0) {
        printf("\nERROR: sigaction() failed with return value %d\n\n", rc);
        exit(3);
    }
}

void create_threads(int no_threads) {
    pthread_t thread1, thread2;  /* thread variables */

    for (int ii=1; ii <= no_threads; ii++) {
        int rc = pthread_create(&thread1, NULL, &thread_entry_func, (void *)(long)ii);
        if (rc != 0) {
            printf("\nERROR: pthread_create() failed with return value %d\n\n", rc);
            exit(2);
        }
    }
}

void read_proc_lstatus() {
    int fd;
    int count;

    fd = open(lstatus_file, O_RDONLY);
    if (fd == -1) {
        printf("\nERROR: opening %s failed.\n");
        exit(4);
    }

    struct stat statb;
    if (fstat(fd, &statb) != 0) {
        printf("\nERROR: fstat on %s failed.\n");
        close(fd);
        exit(5);
    }

    prheader_t *prh = (prheader_t *)malloc(statb.st_size);
    if (prh == NULL) {
        printf("\nERROR: malloc() failed\n");
        close(fd);
        return;
    }

    do {
        count = read(fd,
                       (void *)prh,
                       statb.st_size);
    } while (count < 0 && errno == EINTR);
    if (0) printf("Read %d bytes from %s\n", count, lstatus_file);
    close(fd);
    free(prh);
}

void read_lstatus_in_endless_loop() {
    printf("Polling %s\n", lstatus_file);
    for(;;) {
        read_proc_lstatus();
    }
}

void print_usage_and_exit(char* argv[]) {
    printf("\nusage: %s <number of threads>\n\n", argv[0]);
    exit(1);
}

int main(int argc, char* argv[]) {
    int no_threads;

    fourty2 = 42;

    if (argc != 2) {
        printf("\n\nERROR: wrong number of arguments.\n");
        print_usage_and_exit(argv);
    }

    no_threads = atoi(argv[1]);
    if (no_threads <= 0) {
        printf("\n\nERROR: number of threads should be greater than 0.\n");
        print_usage_and_exit(argv);
    }

    install_sigsegv_handler();
    create_threads(no_threads);

    read_lstatus_in_endless_loop();

    printf("\nStarting test with %d threads", no_threads);

    printf("\n\nReturning from main()\n\n");
}

subr.S:

.file "subr.S"


.code64
.globl gensigsegv
.type gensigsegv, @function
.section .text
.align 16

gensigsegv:
        push    %rbp
        movq    %rsp,%rbp
        movl    (%rdi),%eax
        leave
        ret
        .size gensigsegv, . - gensigsegv

Compiled with:

gcc -m64 -std=c99 -D_LARGEFILE64_SOURCE -D_XOPEN_SOURCE=600 -pthread main.c subr.S -o a.out

Ran in this manner:

./a.out 1

As expected, it would reliably bail on SIGSEGV.

Comments

Comment by Former user
Created at 2016-12-05T23:44:12.000Z

After tracing this out for a while, I believe I've found the crux of the issue. The comment for schedctl_finish_sigblock notes that it should either be called by curthread or, if in an "external" thread, while holding p_lock. While the p_lock requirement does synchronize external callers, it does not address the possibility that signal-related code in curthread will race when updating t_hold. The test program achieves this by hammering schedctl_finish_sigblock via the procfs facilities for querying LWP status.


Comment by Former user
Created at 2016-12-08T18:51:28.000Z

illumos-joyent commit 060157c (branch master, by Patrick Mooney)

OS-5834 procfs t_hold updates can race against target thread
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>


Comment by Former user
Created at 2020-04-04T23:48:26.792Z

Upstream issue.