Issue Type: | Bug |
---|---|
Priority: | 3 - Elevated |
Status: | Resolved |
Created at: | 2019-12-31T23:25:09.699Z |
Updated at: | 2020-01-02T23:53:21.485Z |
Created by: | Todd Whiteman |
---|---|
Reported by: | Todd Whiteman |
Assigned to: | Former user |
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2020-01-02T23:53:21.476Z)
2020-01-02 Importer Exporter (Release Date: 2020-01-02)
The `docker attach` tests are now failing in the nightly test rig as of Dec 21, 2019.
Note that `docker attach` connects a terminal's stdio to an already-running container. Under the hood it runs `zlogin -Q -I $ZONE`, with node sockets relaying the stdio between zlogin and the terminal.
I suspect the following zlogin change, which was merged around that time, is the cause (mostly because it seems the most closely related of the merged changes):
https://www.illumos.org/issues/12057 Github commit
To reproduce, run the following:
    $ cat > lx-docker.json << EOF
    {
      "alias": "lx-1",
      "brand": "lx",
      "docker": true,
      "kernel_version": "3.13.0",
      "image_uuid": "5917ca96-c888-11e5-8da0-e785a1ad1185",
      "ram": "256",
      "internal_metadata": {
        "docker:cmd": "[\"bash\",\"-c\",\"for i in {1..20}; do echo \\\"\$i\\\"; sleep 5; done; exit 2\"]",
        "docker:env": "[\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"]"
      }
    }
    EOF
    $ imgadm import 5917ca96-c888-11e5-8da0-e785a1ad1185
    $ vmadm create -f lx-docker.json
    Successfully created VM $UUID
    $ zlogin -Q -I $UUID
    3
    4
    5
    ...
    20
Now the zlogin process will hang forever after outputting 20 (and is still visible in `ps` output).
Prior to Dec 21, the zlogin process would end/exit after reaching 20.
@accountid:62562f1fcdc24000704b0435 @accountid:70121:6490ccfd-5932-4e7a-936d-554bdd3dc0d3 were reviewers on this change. Maybe they have some insight.
Note that this also affects interactive docker sessions (i.e. `docker run -ti ...`): the docker client no longer exits as it should, and instead hangs until further stdin is seen (e.g. until a key is pressed).
I was able to recreate the issue after building master (as of this morning).
On a hunch, I built a slightly patched version of zlogin:
    diff --git a/usr/src/cmd/zlogin/zlogin.c b/usr/src/cmd/zlogin/zlogin.c
    index 1b49fc221f..8e8c7e626f 100644
    --- a/usr/src/cmd/zlogin/zlogin.c
    +++ b/usr/src/cmd/zlogin/zlogin.c
    @@ -895,7 +895,7 @@ doio(int stdin_fd, int appin_fd, int stdout_fd, int stderr_fd, int sig_fd,

     	/* read from stdout of zone and write to stdout of global zone */
     	pollfds[0].fd = stdout_fd;
    -	pollfds[0].events = POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI;
    +	pollfds[0].events = POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI | POLLHUP;

     	/* read from stderr of zone and write to stderr of global zone */
     	pollfds[1].fd = stderr_fd;
    @@ -941,6 +941,9 @@ doio(int stdin_fd, int appin_fd, int stdout_fd, int stderr_fd, int sig_fd,

     		/* event from master side stderr */
     		if (pollfds[1].revents) {
    +			if (pollfds[1].revents & POLLHUP)
    +				fprintf(stderr, "XXX stderr HUP!\n");
    +
     			if (pollfds[1].revents &
     			    (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) {
     				if (process_output(stderr_fd, STDERR_FILENO)
    @@ -954,6 +957,9 @@ doio(int stdin_fd, int appin_fd, int stdout_fd, int stderr_fd, int sig_fd,

     		/* event from master side stdout */
     		if (pollfds[0].revents) {
    +			if (pollfds[0].revents & POLLHUP)
    +				fprintf(stderr, "XXX stdout HUP!\n");
    +
     			if (pollfds[0].revents &
     			    (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) {
     				if (process_output(stdout_fd, STDOUT_FILENO)
And sure enough, after the dockercmd process finishes, we start seeing HUP events for both of the zone's fds.
That suggests we can probably leverage that somehow in the fix, though I'll need to do a bit more digging to make sure other zlogin functionality isn't broken.
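One way the HUP events could be used is sketched below. This is only an illustration under stated assumptions, not the change that was eventually committed: `relay_until_hup` and `write_all` are hypothetical helpers that do not exist in zlogin.c, and the sketch handles a single descriptor rather than the full set that `doio()` polls. The idea is to keep reading after a hangup until `read` reports end-of-file, so buffered output is not lost, and only then stop polling.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, retrying on EINTR. */
static int
write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n == -1) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		buf += n;
		len -= (size_t)n;
	}
	return (0);
}

/*
 * Hypothetical helper, not taken from zlogin.c: relay bytes from in_fd to
 * out_fd until the peer has hung up and all buffered data is drained.
 * Returns the number of bytes relayed, or -1 on error.
 */
static long
relay_until_hup(int in_fd, int out_fd)
{
	char buf[4096];
	long total = 0;

	assert(in_fd >= 0 && out_fd >= 0);

	for (;;) {
		struct pollfd pfd;
		ssize_t n;

		pfd.fd = in_fd;
		pfd.events = POLLIN;
		pfd.revents = 0;

		if (poll(&pfd, 1, -1) == -1) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		if (pfd.revents & (POLLERR | POLLNVAL))
			return (-1);
		if (!(pfd.revents & (POLLIN | POLLHUP)))
			continue;

		/* Data may still be buffered after a hangup, so keep reading. */
		n = read(in_fd, buf, sizeof (buf));
		if (n == -1) {
			if (errno == EINTR || errno == EAGAIN)
				continue;
			return (-1);
		}
		if (n == 0)
			return (total);	/* EOF: nothing left, stop polling */
		if (write_all(out_fd, buf, (size_t)n) == -1)
			return (-1);
		total += n;
	}
}
```

Calling `relay_until_hup(fd, STDOUT_FILENO)` on the read end of a pipe whose writer has exited returns once the remaining bytes are copied, rather than polling forever.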
I should also add that, prior to trying the modified zlogin binary, I trussed the zlogin binary (using the above test) and saw that approximately 5 seconds after '20' appeared, zlogin would get into a loop in which it called pollsys (aka poll(2)) and then read on fds 5 and 6. This is what led me to check whether a POLLHUP event was being delivered.
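The looping pattern can be reproduced outside of zlogin with an ordinary pipe. The sketch below is illustrative only: `count_hup_wakeups` is a hypothetical function, and a pipe stands in for the zone's stdio descriptors, which may behave differently in detail. It relies on the POSIX rule that POLLHUP is reported in `revents` whether or not it was requested in `events`, so once the writer is gone every `poll` call returns immediately and every `read` returns 0.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical demo function: poll fd up to `rounds` times, requesting only
 * POLLIN, and count the iterations in which poll() reported POLLHUP and the
 * following read() returned 0 (end-of-file).
 */
int
count_hup_wakeups(int fd, int rounds)
{
	int spins = 0;
	int i;

	assert(rounds > 0);

	for (i = 0; i < rounds; i++) {
		struct pollfd pfd;
		char buf[64];

		pfd.fd = fd;
		pfd.events = POLLIN;	/* POLLHUP deliberately not requested */
		pfd.revents = 0;

		/* A long timeout: a hung-up descriptor should not wait for it. */
		if (poll(&pfd, 1, 10000) != 1)
			break;
		if ((pfd.revents & POLLHUP) &&
		    read(fd, buf, sizeof (buf)) == 0)
			spins++;
	}
	return (spins);
}
```

With the write end of a pipe closed, every round counts as a spin, which matches the repeated pollsys/read sequence seen in the truss output: a loop that never treats the hangup as a reason to exit keeps waking up without making progress.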
I tested this by creating a patched zlogin binary and running the test case in the description. zlogin would then exit approx 5 seconds after '20' was displayed.
I did note that once zlogin exited, the invoking shell no longer seemed to be displaying any output, though it was accepting input -- if I exited the shell, then logged back in, I could see the things I typed (after zlogin exited) in the shell history.
I then built a zlogin binary using a snapshot of zlogin.c from the previous release and verified it exhibited the same behavior, so that seems to be an unrelated issue.
@accountid:70121:a36ea101-b8c9-4f3d-825e-334bc077ca5e was also able to test a PI built with the fix and verified that zlogin is now exiting.
illumos-joyent commit fc356053b6fcdfb2eb1f9353e1b7e5332fbfcaf8 (branch master, by Jason King)
OS-8083 zlogin -I now hangs at zone stop (#247)
Reviewed by: Mike Gerdts <mike.gerdts@joyent.com>
Approved by: Mike Gerdts <mike.gerdts@joyent.com>