OS-732: zfs_read()/zfs_delmap() deadlock


Issue Type:Bug
Priority:4 - Normal
Created at:2011-11-09T04:22:50.000Z
Updated at:2020-09-10T05:04:57.655Z


Created by:Linda Derezinski
Reported by:Linda Derezinski
Assigned to:Former user


Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2011-11-15T05:37:59.000Z)

Fix Versions

6.5.1 (Release Date: 2011-11-17)

2011-11-17 Apu (Release Date: 2011-11-17)


was alerted at 1902 Eastern Standard Time that G9FY9R1 was unpingable

was able to ssh in on the admin network with out issues. At this time prstat -Z was working. At some point later (2 hours later ben saw) that ps hung and he was not able to debug any further.

I issued reboot -d and have core file:

[root@G9FY9R1 (us-sw-1) /var/crash/volatile]# ls -lrth
total 27723635
-rw-r--r--   1 root     root        2.6G Nov  9 03:59 vmdump.0
-rw-r--r--   1 root     root           2 Nov  9 03:59 bounds
-rw-r--r--   1 root     root        2.0M Nov  9 04:05 unix.0
-rw-r--r--   1 root     root         11G Nov  9 04:05 vmcore.0
[root@G9FY9R1 (us-sw-1) /var/crash/volatile]# 


Comment by Former user
Created at 2011-11-11T07:38:29.000Z

The ps processes are all waiting on the P_PR_LOCK for pid 19128
(beam.smp). This lock is held by this thread:

stack pointer for thread ffffff19adfd64a0: ffffff00ba11ec30
[ ffffff00ba11ec30 _resume_from_idle+0xf1() ]
  ffffff00ba11ec60 swtch+0x145()
  ffffff00ba11ed10 turnstile_block+0x760()
  ffffff00ba11ed80 rw_enter_sleep+0x205()
  ffffff00ba11edd0 vmu_calculate_proc+0x23a()
  ffffff00ba11ee00 vmu_calculate+0xd2()
  ffffff00ba11ee80 vm_getusage+0x23a()
  ffffff00ba11eec0 rusagesys+0x64()
  ffffff00ba11ef10 _sys_sysenter_post_swapgs+0x149()

It is blocked trying to get the as_lock as reader for beam.smp:

> ffffff647fe95a30::rwlock
            ADDR      OWNER/COUNT FLAGS          WAITERS
ffffff647fe95a30 ffffff59ad269c20  B101 ffffff19adfd64a0 (R)
                                    | | ffffff19adeedae0 (R)
                 WRITE_LOCKED ------+ | ffffff19bda3bb60 (R)
                  HAS_WAITERS --------+

Here is the stack trace of the owner:

> ffffff59ad269c20::findstack
stack pointer for thread ffffff59ad269c20: ffffff00b81139f0
[ ffffff00b81139f0 _resume_from_idle+0xf1() ]
  ffffff00b8113a20 swtch+0x145()
  ffffff00b8113a50 cv_wait+0x61()
  ffffff00b8113ac0 zfs_range_lock_writer+0x142()
  ffffff00b8113b20 zfs_range_lock+0xc0()
  ffffff00b8113be0 zfs_putpage+0xf7()
  ffffff00b8113c50 fop_putpage+0x74()
  ffffff00b8113cb0 zfs_delmap+0x91()
  ffffff00b8113d60 fop_delmap+0x9a()
  ffffff00b8113e20 segvn_unmap+0x17b()
  ffffff00b8113ec0 as_unmap+0x163()
  ffffff00b8113f00 munmap+0x80()
  ffffff00b8113f10 sys_syscall+0x17a()

So the owner is blocked trying to get a range lock on a znode_t. The
owner of the lock isn't terribly hard to find:

> ::stacks -m zfs
ffffff19bda3bb60 SLEEP    RWLOCK                  1

This, then is the deadlock: we are reading from a file into a mapped
but unfaulted page, while at the same time another thread is unmapping
the same file with dirty pages. The first thread is acquiring the
range lock followed the address space lock; the second has the address
space lock and is trying to acquire the range lock. Indeed, knowing
this, the problem is disconcertingly easy to reproduce:

#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>
#include <sys/varargs.h>
#include <sys/mman.h>
#include <thread.h>

char *g_cmd = "readmmap";
size_t g_size = 100 * 1024 * 1024;
int g_fd;

static void
fatal(char *fmt, ...)
        va_list ap;
        int error = errno;

        va_start(ap, fmt);

        (void) fprintf(stderr, "%s: ", g_cmd);
        (void) vfprintf(stderr, fmt, ap);

        if (fmt[strlen(fmt) - 1] != '\n')
                (void) fprintf(stderr, ": %s\n", strerror(error));


        int ps = sysconf(_SC_PAGESIZE);
        size_t i;

        for (;;) {
                char *buf = mmap(NULL, g_size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANON, -1, 0);

                if (buf == NULL)
                        fatal("couldn't mmap anonymous");

                for (i = 0; i < g_size; i += ps) {
                        if (pread(g_fd, &buf[i], ps, i) != ps)
                                fatal("short read at offset %d", i);

                munmap(buf, g_size);

        int ps = sysconf(_SC_PAGESIZE);
        size_t i;

        for (;;) {
                char *buf = mmap(NULL, g_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, g_fd, 0);

                for (i = 0; i < g_size; i += ps)
                        (*((int *)&buf[i]))++;

                munmap(buf, g_size);

main(int argc, char **argv)
        char *path = tmpnam(NULL);
        char c = '!';

        if ((path = tmpnam(NULL)) == NULL)
                fatal("couldn't get temporary name");

        if ((g_fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0666)) == -1)
                fatal("couldn't open %s", path);

        fprintf(stderr, "%s: file is %s\n", g_cmd, path);

        if (pwrite(g_fd, &c, 1, g_size) != 1)
                fatal("couldn't write to %s", path);

        if (unlink(path) != 0)
                fatal("couldn't unlink %s", path);

        if (thr_create(NULL, 0,
            (void *(*)(void *))reader, NULL, THR_BOUND, NULL) != 0)
                fatal("couldn't create reader");

        if (thr_create(NULL, 0,
            (void *(*)(void *))writer, NULL, THR_BOUND, NULL) != 0)
                fatal("couldn't create writer");

        while (thr_join(0, NULL, NULL) == 0)

        return (0);

The above reproduces the deadlock within hundreds of millseconds.

In terms of fixing this, it seems that the easier side to examine is the
fop_putpage() from zfs_delmap(); here's the block comment at the beginning
of the function:

 * The reason we push dirty pages as part of zfs_delmap() is so that we get a
 * more accurate mtime for the associated file.  Since we don't have a way of
 * detecting when the data was actually modified, we have to resort to
 * heuristics.  If an explicit msync() is done, then we mark the mtime when the
 * last page is pushed.  The problem occurs when the msync() call is omitted,
 * which by far the most common case:
 *      open()
 *      mmap()
 *      <modify memory>
 *      munmap()
 *      close()
 *      <time lapse>
 *      putpage() via fsflush
 * If we wait until fsflush to come along, we can have a modification time that
 * is some arbitrary point in the future.  In order to prevent this in the
 * common case, we flush pages whenever a (MAP_SHARED, PROT_WRITE) mapping is
 * torn down.

A deadlock seems a steep price to pay for a slightly more accurate mtime;
if the call to fop_putpage() from zfs_delmap() is truly only for the mtime,
perhaps there is a way of achieving this that does not involve attempting
to push pages with the as_lock held as writer?

Comment by Former user
Created at 2011-11-15T05:16:28.000Z

Well, a couple of things. First, it turns out the mtime wasn't even being set properly at all in this case -- see OS-737. Secondly, in discussion with the engineer who originally did this work, the consensus is that these semantics are simply too expensive to implement; the fix here is to restore the semantics prior to their change to accommodate the scenario mentioned in the zfs_delmap() comment, above.

Comment by Former user
Created at 2011-11-15T05:31:43.000Z

illumos-joyent commit 397a6c3 (branch master, by Bryan Cantrill)

OS-732 zfs_read()/zfs_delmap() deadlock
OS-737 zfs_putapage() neglects to call sa_bulk_update()

Comment by Former user
Created at 2011-11-15T05:36:40.000Z

illumos-joyent commit 7391929 (branch release-20110901, by Bryan Cantrill)

OS-732 zfs_read()/zfs_delmap() deadlock
OS-737 zfs_putapage() neglects to call sa_bulk_update()