In short: the sdev filesystem doesn't enforce EISDIR in non-global
zones. That is, an open(2) call with write access on a directory
doesn't return EISDIR as the man page specifies it should. This, in
turn, can lead to vnode hold leaks when calling shm_open in
lx-brand zones.
You can reproduce this by truss'ing the following code in the GZ and
then in a native zone. There is no significance to choosing
/dev/zvol/.., any dir backed by sdev will do.
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
int
main()
{
open("/dev/zvol/..", O_RDWR, 0);
open("/dev/zvol/..", O_RDWR | O_CREAT, 0);
}
Here is the relevant truss output when run in the GZ:
9643: open("/dev/zvol/..", O_RDWR) Err#21 EISDIR
9643: open("/dev/zvol/..", O_RDWR|O_CREAT, 0) Err#21 EISDIR
Here is the relevant truss output when run in the zone:
9640: open("/dev/zvol/..", O_RDWR) Err#21 EISDIR
9640: open("/dev/zvol/..", O_RDWR|O_CREAT, 0) = 3
Those are the most important facts but for completeness I will
retrace how I arrived here and how this ties into lxbrand.
This all started with the Open Posix Test Suite (PTS) shm_open_37-1
test. This tests calls shm_open() with various object names and
verifies that open either created successfully or failed with
EINVAL (because it's an invalid name). One of these names is "..":
the parent directory file.
In native illumos this name is invalid because it doesn't start with a
leading slash. If you add the slash it still won't conflict with an
existing file because the implementation adds a prefix to the name.
In native Linux the ".." name is valid but if you open the object with
write permission the underlying open(2) call will fail with
EISDIR since a directory cannot be opened for writing. Glibc
converts this to EINVAL to report the name as invalid (and because
EISDIR isn't part of the shm_open spec). The comment in glibc is
telling:
int fd = open (shm_name, oflag, mode);
if (fd == -1 && __glibc_unlikely (errno == EISDIR))
/* It might be better to fold this error with EINVAL since
directory names are just another example for unsuitable shared
object names and the standard does not mention EISDIR. */
__set_errno (EINVAL);
However, in Linux, you can get around this by opening the shared
object as O_RDONLY; in this case you'll just read the contents of
the /run directory (Linux creates its shm objects in /dev/shm, a link
to /run, which is a tmpfs mount). This is probably why illumos
prefixes its object names, to prevent conflicts like this.
Things get a bit more complicated in lx brand. First off, at least on
Ubuntu, we deviate from native Ubuntu and use /dev/shm as the shm obj
directory instead of making it a link to /run/shm (we link /run/shm to
/dev/shm, I think whoever wrote the boot script just got this mixed
up). The lx /dev mount is implemented by the lxdevfs (lxd) module.
This module is a mix of the native /dev (sdev) and tmpfs. Since
shm_open is just a userland abstraction, the OS only sees a regular
old open(2) call.
open("/dev/shm/..", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0) = 3
But when the process calls unlink(2) it fails because the
directory is a mount point.
unlink("/dev/shm/..") = -1 EBUSY (Device or resource busy)
This causes vnode holds to leak. You can verify this in MDB by cd'ing
to the lx zone's /dev/ dir, echo'ing the pid, getting the VFS
pointer, and then printing the hold count after each run. In this
example below, MDB shows 19 holds when it should be 2.
root@0cedd981-5e80-c1ac-d3f0-dd1f38498b2a:~# cd /dev
root@0cedd981-5e80-c1ac-d3f0-dd1f38498b2a:/dev# echo $$
4072
[root@testsos ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs mm lofs mpt idm sd crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> 0t4072::pid2proc | ::print proc_t p_user.u_cdir | ::print vnode_t v_vfsp
v_vfsp = 0xffffff019037d840
> ::walk vn_cache | ::grep '*(.+18) == 0xffffff019037d840' | ::printf "(%d) %s\n" vnode_t v_count v_path ! egrep '/root/dev^'
(19) /zones/0cedd981-5e80-c1ac-d3f0-dd1f38498b2a/root/dev
This open should return EISDIR because of the O_RDWR flag,
but it doesn't because the O_CREAT flag causes it to take a
different code path.
[root@testsos ~]# dtrace -Fqn 'fbt::copen:entry /execname == "shm_open_37-1.ru"/ { self->path=copyinstr(arg1); printf("BEGIN %s\n", self->path); self->t=1; } fbt:::entry /self->t/ { printf("\n"); } fbt:::return /self->t/ { printf("%d (0x%p)\n", arg1, arg1); } fbt::copen:return /self->t/ { printf("END %s\n", self->path); self->path=0; self->t=0; }'
...
1 -> lxd_create
...
1 -> fop_create
1 -> crgetmapped
1 <- crgetmapped -1092788741784 (0xffffff0190b71968)
1 -> sdev_create
1 -> prof_lookup
1 <- prof_lookup 0 (0x0)
1 <- sdev_create 0 (0x0)
1 -> vn_updatepath
1 <- vn_updatepath 1086448270555 (0xfcf55cfcdb)
1 <- fop_create 0 (0x0)
1 <- lxd_create
In this trace you can see sdev_create() is called, normally it
would return EISDIR when write access is requested but this
function has special logic for non-global zones and that logic does
not include the EISDIR check.
/* non-global do not allow pure node creation */
if (!SDEV_IS_GLOBAL(parent)) {
rw_exit(&parent->sdev_dotdot->sdev_contents);
return (prof_lookup(dvp, nm, vpp, cred));
I believe the fix is to add the EISDIR logic to this case to bring
it in line with the GZ logic.
Former user commented on 2016-10-25T16:17:32.000-0400:
illumos-joyent commit 9da8cf4 (branch master, by Ryan Zezeski)
OS-5710#icft=OS-5710 sdev_create() doesn't enforce EISDIR in non-GZ
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>