OS-5710: sdev_create() doesn't enforce EISDIR in non-GZ

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2016-10-11T03:56:45.000Z
Updated at:2016-10-25T22:06:13.000Z

People

Created by:Ryan Zezeski [X]
Reported by:Ryan Zezeski [X]
Assigned to:Ryan Zezeski [X]

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2016-10-25T22:06:13.000Z)

Fix Versions

2016-10-27 BAVARIAN DIVERSITY (Release Date: 2016-10-27)

Description

In short: the sdev filesystem doesn't enforce EISDIR in non-global
zones. That is, an open(2) call with write access on a directory
doesn't return EISDIR as the man page specifies it should. This, in
turn, can lead to vnode hold leaks when calling shm_open in
lx-brand zones.

You can reproduce this by truss'ing the following code in the GZ and
then in a native zone. There is no significance to choosing
/dev/zvol/.., any dir backed by sdev will do.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int
main()
{
       	open("/dev/zvol/..", O_RDWR, 0);
       	open("/dev/zvol/..", O_RDWR | O_CREAT, 0);
}

Here is the relevant truss output when run in the GZ:

9643:  	open("/dev/zvol/..", O_RDWR)   			Err#21 EISDIR
9643:  	open("/dev/zvol/..", O_RDWR|O_CREAT, 0)		Err#21 EISDIR

Here is the relevant truss output when run in the zone:

9640:  	open("/dev/zvol/..", O_RDWR)   			Err#21 EISDIR
9640:  	open("/dev/zvol/..", O_RDWR|O_CREAT, 0)		= 3

Those are the most important facts but for completeness I will
retrace how I arrived here and how this ties into lxbrand.

This all started with the Open Posix Test Suite (PTS) shm_open_37-1
test. This tests calls shm_open() with various object names and
verifies that open either created successfully or failed with
EINVAL (because it's an invalid name). One of these names is "..":
the parent directory file.

In native illumos this name is invalid because it doesn't start with a
leading slash. If you add the slash it still won't conflict with an
existing file because the implementation adds a prefix to the name.

In native Linux the ".." name is valid but if you open the object with
write permission the underlying open(2) call will fail with
EISDIR since a directory cannot be opened for writing. Glibc
converts this to EINVAL to report the name as invalid (and because
EISDIR isn't part of the shm_open spec). The comment in glibc is
telling:

  int fd = open (shm_name, oflag, mode);
  if (fd == -1 && __glibc_unlikely (errno == EISDIR))
    /* It might be better to fold this error with EINVAL since
       directory names are just another example for unsuitable shared
       object names and the standard does not mention EISDIR.  */
    __set_errno (EINVAL);

However, in Linux, you can get around this by opening the shared
object as O_RDONLY; in this case you'll just read the contents of
the /run directory (Linux creates its shm objects in /dev/shm, a link
to /run, which is a tmpfs mount). This is probably why illumos
prefixes its object names, to prevent conflicts like this.

Things get a bit more complicated in lx brand. First off, at least on
Ubuntu, we deviate from native Ubuntu and use /dev/shm as the shm obj
directory instead of making it a link to /run/shm (we link /run/shm to
/dev/shm, I think whoever wrote the boot script just got this mixed
up). The lx /dev mount is implemented by the lxdevfs (lxd) module.
This module is a mix of the native /dev (sdev) and tmpfs. Since
shm_open is just a userland abstraction, the OS only sees a regular
old open(2) call.

open("/dev/shm/..", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0) = 3

But when the process calls unlink(2) it fails because the
directory is a mount point.

unlink("/dev/shm/..")                   = -1 EBUSY (Device or resource busy)

This causes vnode holds to leak. You can verify this in MDB by cd'ing
to the lx zone's /dev/ dir, echo'ing the pid, getting the VFS
pointer, and then printing the hold count after each run. In this
example below, MDB shows 19 holds when it should be 2.

root@0cedd981-5e80-c1ac-d3f0-dd1f38498b2a:~# cd /dev
root@0cedd981-5e80-c1ac-d3f0-dd1f38498b2a:/dev# echo $$
4072


[root@testsos ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs mm lofs mpt idm sd crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ipc ]
> 0t4072::pid2proc | ::print proc_t p_user.u_cdir | ::print vnode_t v_vfsp
v_vfsp = 0xffffff019037d840
> ::walk vn_cache | ::grep '*(.+18) == 0xffffff019037d840' | ::printf "(%d) %s\n" vnode_t v_count v_path ! egrep '/root/dev^'
(19) /zones/0cedd981-5e80-c1ac-d3f0-dd1f38498b2a/root/dev

This open should return EISDIR because of the O_RDWR flag,
but it doesn't because the O_CREAT flag causes it to take a
different code path.

[root@testsos ~]# dtrace -Fqn 'fbt::copen:entry /execname == "shm_open_37-1.ru"/ { self->path=copyinstr(arg1); printf("BEGIN %s\n", self->path); self->t=1; } fbt:::entry /self->t/ { printf("\n"); } fbt:::return /self->t/ { printf("%d (0x%p)\n", arg1, arg1); } fbt::copen:return /self->t/ { printf("END %s\n", self->path); self->path=0; self->t=0; }'
...
  1          -> lxd_create
...
  1            -> fop_create
  1              -> crgetmapped
  1              <- crgetmapped               -1092788741784 (0xffffff0190b71968)
  1              -> sdev_create
  1                -> prof_lookup
  1                <- prof_lookup             0 (0x0)
  1              <- sdev_create               0 (0x0)
  1              -> vn_updatepath
  1              <- vn_updatepath             1086448270555 (0xfcf55cfcdb)
  1            <- fop_create                  0 (0x0)
  1          <- lxd_create

In this trace you can see sdev_create() is called, normally it
would return EISDIR when write access is requested but this
function has special logic for non-global zones and that logic does
not include the EISDIR check.

	/* non-global do not allow pure node creation */
	if (!SDEV_IS_GLOBAL(parent)) {
		rw_exit(&parent->sdev_dotdot->sdev_contents);
		return (prof_lookup(dvp, nm, vpp, cred));

I believe the fix is to add the EISDIR logic to this case to bring
it in line with the GZ logic.

Comments

Comment by Bot Bot [X]
Created at 2016-10-25T20:17:32.000Z

illumos-joyent commit 9da8cf4 (branch master, by Ryan Zezeski)

OS-5710 sdev_create() doesn't enforce EISDIR in non-GZ
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>