Issue Type: | Bug |
---|---|
Priority: | 4 - Normal |
Status: | Resolved |
Created at: | 2011-11-09T04:22:50.000Z |
Updated at: | 2020-09-10T05:04:57.655Z |
Created by: | Linda Derezinski |
---|---|
Reported by: | Linda Derezinski |
Assigned to: | Former user |
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2011-11-15T05:37:59.000Z)
6.5.1 (Release Date: 2011-11-17)
2011-11-17 Apu (Release Date: 2011-11-17)
was alerted at 1902 Eastern Standard Time that G9FY9R1 was unpingable
was able to ssh in on the admin network with out issues. At this time prstat -Z was working. At some point later (2 hours later ben saw) that ps hung and he was not able to debug any further.
I issued reboot -d and have core file:
[root@G9FY9R1 (us-sw-1) /var/crash/volatile]# ls -lrth total 27723635 -rw-r--r-- 1 root root 2.6G Nov 9 03:59 vmdump.0 -rw-r--r-- 1 root root 2 Nov 9 03:59 bounds -rw-r--r-- 1 root root 2.0M Nov 9 04:05 unix.0 -rw-r--r-- 1 root root 11G Nov 9 04:05 vmcore.0 [root@G9FY9R1 (us-sw-1) /var/crash/volatile]#
The ps processes are all waiting on the P_PR_LOCK for pid 19128
(beam.smp). This lock is held by this thread:
stack pointer for thread ffffff19adfd64a0: ffffff00ba11ec30 [ ffffff00ba11ec30 _resume_from_idle+0xf1() ] ffffff00ba11ec60 swtch+0x145() ffffff00ba11ed10 turnstile_block+0x760() ffffff00ba11ed80 rw_enter_sleep+0x205() ffffff00ba11edd0 vmu_calculate_proc+0x23a() ffffff00ba11ee00 vmu_calculate+0xd2() ffffff00ba11ee80 vm_getusage+0x23a() ffffff00ba11eec0 rusagesys+0x64() ffffff00ba11ef10 _sys_sysenter_post_swapgs+0x149()
It is blocked trying to get the as_lock as reader for beam.smp:
> ffffff647fe95a30::rwlock ADDR OWNER/COUNT FLAGS WAITERS ffffff647fe95a30 ffffff59ad269c20 B101 ffffff19adfd64a0 (R) | | ffffff19adeedae0 (R) WRITE_LOCKED ------+ | ffffff19bda3bb60 (R) HAS_WAITERS --------+
Here is the stack trace of the owner:
> ffffff59ad269c20::findstack stack pointer for thread ffffff59ad269c20: ffffff00b81139f0 [ ffffff00b81139f0 _resume_from_idle+0xf1() ] ffffff00b8113a20 swtch+0x145() ffffff00b8113a50 cv_wait+0x61() ffffff00b8113ac0 zfs_range_lock_writer+0x142() ffffff00b8113b20 zfs_range_lock+0xc0() ffffff00b8113be0 zfs_putpage+0xf7() ffffff00b8113c50 fop_putpage+0x74() ffffff00b8113cb0 zfs_delmap+0x91() ffffff00b8113d60 fop_delmap+0x9a() ffffff00b8113e20 segvn_unmap+0x17b() ffffff00b8113ec0 as_unmap+0x163() ffffff00b8113f00 munmap+0x80() ffffff00b8113f10 sys_syscall+0x17a()
So the owner is blocked trying to get a range lock on a znode_t. The
owner of the lock isn't terribly hard to find:
> ::stacks -m zfs ... ffffff19bda3bb60 SLEEP RWLOCK 1 swtch+0x145 turnstile_block+0x760 rw_enter_sleep+0x205 as_fault+0x43e pagefault+0x99 trap+0x1642 0xfffffffffb8001d6 0xfffffffffb85a6ea uiomove+0xe9 mappedread+0xaf zfs_read+0x269 fop_read+0xd3 read+0x2b8
This, then is the deadlock: we are reading from a file into a mapped
but unfaulted page, while at the same time another thread is unmapping
the same file with dirty pages. The first thread is acquiring the
range lock followed the address space lock; the second has the address
space lock and is trying to acquire the range lock. Indeed, knowing
this, the problem is disconcertingly easy to reproduce:
#include <unistd.h> #include <stdlib.h> #include <fcntl.h> #include <errno.h> #include <stdio.h> #include <sys/varargs.h> #include <sys/mman.h> #include <thread.h> char *g_cmd = "readmmap"; size_t g_size = 100 * 1024 * 1024; int g_fd; static void fatal(char *fmt, ...) { va_list ap; int error = errno; va_start(ap, fmt); (void) fprintf(stderr, "%s: ", g_cmd); (void) vfprintf(stderr, fmt, ap); if (fmt[strlen(fmt) - 1] != '\n') (void) fprintf(stderr, ": %s\n", strerror(error)); exit(EXIT_FAILURE); } void reader() { int ps = sysconf(_SC_PAGESIZE); size_t i; for (;;) { char *buf = mmap(NULL, g_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0); if (buf == NULL) fatal("couldn't mmap anonymous"); for (i = 0; i < g_size; i += ps) { if (pread(g_fd, &buf[i], ps, i) != ps) fatal("short read at offset %d", i); } munmap(buf, g_size); } } void writer() { int ps = sysconf(_SC_PAGESIZE); size_t i; for (;;) { char *buf = mmap(NULL, g_size, PROT_READ | PROT_WRITE, MAP_SHARED, g_fd, 0); for (i = 0; i < g_size; i += ps) (*((int *)&buf[i]))++; munmap(buf, g_size); } } int main(int argc, char **argv) { char *path = tmpnam(NULL); char c = '!'; if ((path = tmpnam(NULL)) == NULL) fatal("couldn't get temporary name"); if ((g_fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0666)) == -1) fatal("couldn't open %s", path); fprintf(stderr, "%s: file is %s\n", g_cmd, path); if (pwrite(g_fd, &c, 1, g_size) != 1) fatal("couldn't write to %s", path); if (unlink(path) != 0) fatal("couldn't unlink %s", path); if (thr_create(NULL, 0, (void *(*)(void *))reader, NULL, THR_BOUND, NULL) != 0) fatal("couldn't create reader"); if (thr_create(NULL, 0, (void *(*)(void *))writer, NULL, THR_BOUND, NULL) != 0) fatal("couldn't create writer"); while (thr_join(0, NULL, NULL) == 0) continue; return (0); }
The above reproduces the deadlock within hundreds of millseconds.
In terms of fixing this, it seems that the easier side to examine is the
fop_putpage() from zfs_delmap(); here's the block comment at the beginning
of the function:
/* * The reason we push dirty pages as part of zfs_delmap() is so that we get a * more accurate mtime for the associated file. Since we don't have a way of * detecting when the data was actually modified, we have to resort to * heuristics. If an explicit msync() is done, then we mark the mtime when the * last page is pushed. The problem occurs when the msync() call is omitted, * which by far the most common case: * * open() * mmap() * <modify memory> * munmap() * close() * <time lapse> * putpage() via fsflush * * If we wait until fsflush to come along, we can have a modification time that * is some arbitrary point in the future. In order to prevent this in the * common case, we flush pages whenever a (MAP_SHARED, PROT_WRITE) mapping is * torn down. */
A deadlock seems a steep price to pay for a slightly more accurate mtime;
if the call to fop_putpage() from zfs_delmap() is truly only for the mtime,
perhaps there is a way of achieving this that does not involve attempting
to push pages with the as_lock held as writer?
Well, a couple of things. First, it turns out the mtime wasn't even being set properly at all in this case -- see OS-737. Secondly, in discussion with the engineer who originally did this work, the consensus is that these semantics are simply too expensive to implement; the fix here is to restore the semantics prior to their change to accommodate the scenario mentioned in the zfs_delmap() comment, above.
illumos-joyent commit 397a6c3 (branch master, by Bryan Cantrill)
OS-732 zfs_read()/zfs_delmap() deadlock
OS-737 zfs_putapage() neglects to call sa_bulk_update()
illumos-joyent commit 7391929 (branch release-20110901, by Bryan Cantrill)
OS-732 zfs_read()/zfs_delmap() deadlock
OS-737 zfs_putapage() neglects to call sa_bulk_update()