OS-8165: dockerinit should be able to start lx_lockd


Issue Type:Bug
Priority:4 - Normal
Created at:2020-04-28T17:09:18.007Z
Updated at:2020-04-30T16:31:22.035Z


Created by:Michael Zeller
Reported by:Michael Zeller
Assigned to:Michael Zeller


Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2020-04-30T16:31:22.019Z)

Fix Versions

2020-05-07 Reverse Peephole (Release Date: 2020-05-07)

Related Links


dockerinit uses the native mount command to mount NFS volumes. Because the native mount is used, we never go down the lx mount path, so lx_lockd is never started. Anything in lx that tries to use locking on the NFS volume will fail.

We need to allow the native_brand to make an lx_brandsys call so that lx_lockd can be started from within dockerinit.


Comment by Max Bruning
Created at 2020-04-29T22:42:36.586Z
Updated at 2020-04-29T22:45:54.014Z

To test this fix, I used elasticsearch on Docker on Triton (the way the problem first showed up). The elasticsearch process (i.e., java), when used with volumes, was failing with:

java.lang.IllegalStateException: failed to obtain node locks, tried [[/usr/share/elasticsearch/data]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])? 

After checking permissions and setting the correct uid, the problem persisted. Tracing the fcntl calls with DTrace showed:

31  12600                      fcntl:entry fcntl called for /zones/88d350d5-50ee-6f6e-ee3d-f34ee5c91d31/root/usr/share/elasticsearch/data/nodes/0/node.lock, cmd = 6

 31  12601                     fcntl:return fcntl returns errno = 46

cmd = 6 is F_SETLK, and errno 46 is ENOLCK. There was no lockd process running.
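
The same failure can be checked for without Java. The following is a minimal sketch, not part of the fix: `try_lock` is a hypothetical helper name chosen for illustration. It issues the same non-blocking F_SETLK request seen in the trace and reports ENOLCK separately from other errors, which is the symptom expected when no lock daemon is running for an NFS-backed path.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * try_lock() is a hypothetical helper, for illustration only.
 * It requests a non-blocking write lock on the whole file (F_SETLK).
 * Returns 0 on success, or the errno value of the failing call.
 */
int
try_lock(const char *path)
{
        struct flock fl;
        int fd, err = 0;

        fd = open(path, O_RDWR | O_CREAT, 0666);
        if (fd == -1)
                return (errno);

        memset(&fl, 0, sizeof (fl));
        fl.l_type = F_WRLCK;    /* exclusive (write) lock */
        fl.l_whence = SEEK_SET; /* offsets are relative to start of file */
        fl.l_start = 0;
        fl.l_len = 0;           /* 0 = lock through EOF */

        if (fcntl(fd, F_SETLK, &fl) == -1) {
                err = errno;
                if (err == ENOLCK) {
                        (void) fprintf(stderr,
                            "%s: ENOLCK (is a lock daemon running?)\n", path);
                } else {
                        (void) fprintf(stderr, "%s: %s\n", path,
                            strerror(err));
                }
        }

        (void) close(fd);       /* closing the fd also drops the lock */
        return (err);
}
```

Calling `try_lock()` on a file under the NFS-backed volume from a small `main()` should return 0 when locking works, and ENOLCK in the failing case described above.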

With the fix, fcntl is no longer returning errors and java is not exiting.

I also ran a simple test program:

cat locktest.c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
        struct flock fl;
        int fd;

        fl.l_type   = F_WRLCK;  /* write (exclusive) lock */
        fl.l_whence = SEEK_SET; /* beginning of file */
        fl.l_start  = 0;        /* offset from l_whence */
        fl.l_len    = 0;        /* length, 0 = to EOF */
        fl.l_pid    = getpid(); /* PID */

        if (argc != 2) {
                printf("Usage: %s lockfile\n", argv[0]);
                exit(1);
        }
        fd = open(argv[1], O_RDWR | O_CREAT, 0666);
        fcntl(fd, F_SETLK, &fl); /* set lock */
        printf("\n release lock \n");
        fl.l_type   = F_UNLCK;
        fcntl(fd, F_SETLK, &fl); /* unset lock */
        return 0;
}

I ran:

# docker run --rm -it ubuntu /bin/bash

Once in bash, I verified that lx_lockd was running (using ps aux), then I did `apt-get install build-essential`, copied the locktest.c file into the container (from the global zone), and compiled the test program. Without the fix, the fcntl(2) call failed, as verified with both DTrace and strace. With the fix, the fcntl succeeded.
There may still be an issue with F_SETLKW, but it is orthogonal to this issue and still needs more testing. I have succeeded in getting a three-node elasticsearch cluster to come up using `triton-compose`, along with kibana. I'm still debugging logstash (it comes up, but doesn't "see" the elasticsearch cluster).
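
For the follow-up F_SETLKW testing mentioned above, a blocking variant of the test could look like the sketch below. This is only an illustration under assumptions: `lock_wait` is a hypothetical helper name, and nothing here reflects how the F_SETLKW issue (if any) actually manifests.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * lock_wait() is a hypothetical helper, for illustration only.
 * It applies a whole-file lock of the given type (F_WRLCK, F_RDLCK or
 * F_UNLCK) using the blocking F_SETLKW command.
 * Returns 0 on success, or the errno value of the failing call.
 */
int
lock_wait(int fd, short type)
{
        struct flock fl;

        memset(&fl, 0, sizeof (fl));
        fl.l_type = type;
        fl.l_whence = SEEK_SET; /* offsets are relative to start of file */
        fl.l_start = 0;
        fl.l_len = 0;           /* 0 = lock through EOF */

        /*
         * F_SETLKW sleeps until the lock is granted. A signal can
         * interrupt the wait with EINTR, in which case we retry.
         */
        while (fcntl(fd, F_SETLKW, &fl) == -1) {
                if (errno != EINTR)
                        return (errno);
        }
        return (0);
}
```

Running two copies against the same file on the NFS volume, with the first holding the lock for a few seconds, would show whether the second blocks and is later granted the lock or fails outright.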

Comment by Jira Bot
Created at 2020-04-30T16:30:09.047Z

illumos-joyent commit 54a7e5761a35624975c2f384a98b3235bf625094 (branch master, by Michael Zeller)

OS-8165 dockerinit should be able to start lx_lockd (#297)

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Max Bruning <max@joyent.com>
Approved by: Jerry Jelinek <jerry.jelinek@joyent.com>

Comment by Jira Bot
Created at 2020-04-30T16:30:42.287Z

smartos-live commit 5cd12761a8fbc2ad5ac813e624c1fafdd8849307 (branch master, by Michael Zeller)

OS-8165 dockerinit should be able to start lx_lockd (#929)

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Max Bruning <max@joyent.com>
Approved by: Max Bruning <max@joyent.com>