OS-8083: zlogin -I now hangs at zone stop

Details

Issue Type:Bug
Priority:3 - Elevated
Status:Resolved
Created at:2019-12-31T23:25:09.699Z
Updated at:2020-01-02T23:53:21.485Z

People

Created by:Todd Whiteman
Reported by:Todd Whiteman
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2020-01-02T23:53:21.476Z)

Fix Versions

2020-01-02 Importer Exporter (Release Date: 2020-01-02)

Description

The `docker attach` tests are now failing in the nightly test rig as of Dec 21, 2019.

Note that `docker attach` attaches a terminal/console's stdio to an existing, running container. Under the hood it uses `zlogin -Q -I $ZONE`, with node sockets relaying the stdio between zlogin and the terminal.

I suspect this zlogin bug fix, which was merged around this time (mostly because it seems the most related of the recently merged changes):
https://www.illumos.org/issues/12057 (and its GitHub commit)

To reproduce, run the following:

$ cat > lx-docker.json << 'EOF'
{
  "alias": "lx-1",
  "brand": "lx",
  "docker": true,
  "kernel_version": "3.13.0",
  "image_uuid": "5917ca96-c888-11e5-8da0-e785a1ad1185",
  "ram": "256",
  "internal_metadata": {
    "docker:cmd": "[\"bash\",\"-c\",\"for i in {1..20}; do echo \\\"$i\\\"; sleep 5; done; exit 2\"]",
    "docker:env": "[\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"]"
  }
}
EOF

$ imgadm import 5917ca96-c888-11e5-8da0-e785a1ad1185

$ vmadm create -f lx-docker.json
Successfully created VM $UUID

$ zlogin -Q -I $UUID
3
4
5
...
20

The zlogin process now hangs forever after printing 20 (and remains visible in `ps` output).

Prior to Dec 21, the zlogin process would exit after reaching 20.

Comments

Comment by Former user
Created at 2020-01-01T01:03:45.487Z

@accountid:62562f1fcdc24000704b0435 @accountid:70121:6490ccfd-5932-4e7a-936d-554bdd3dc0d3 were reviewers on this change. Maybe they have some insight.


Comment by Todd Whiteman
Created at 2020-01-02T16:50:58.515Z

Note that this also affects interactive docker sessions (i.e. `docker run -ti ...`): the docker client no longer exits as it should, and instead hangs until further stdin is seen (e.g. until a key is pressed).


Comment by Former user
Created at 2020-01-02T17:57:27.738Z

I was able to recreate the issue after building master (as of this morning).

On a hunch, I built a slightly patched version of zlogin:

diff --git a/usr/src/cmd/zlogin/zlogin.c b/usr/src/cmd/zlogin/zlogin.c
index 1b49fc221f..8e8c7e626f 100644
--- a/usr/src/cmd/zlogin/zlogin.c
+++ b/usr/src/cmd/zlogin/zlogin.c
@@ -895,7 +895,7 @@ doio(int stdin_fd, int appin_fd, int stdout_fd, int stderr_fd, int sig_fd,

        /* read from stdout of zone and write to stdout of global zone */
        pollfds[0].fd = stdout_fd;
-       pollfds[0].events = POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI;
+       pollfds[0].events = POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI | POLLHUP;

        /* read from stderr of zone and write to stderr of global zone */
        pollfds[1].fd = stderr_fd;
@@ -941,6 +941,9 @@ doio(int stdin_fd, int appin_fd, int stdout_fd, int stderr_fd, int sig_fd,

                /* event from master side stderr */
                if (pollfds[1].revents) {
+                       if (pollfds[1].revents & POLLHUP)
+                               fprintf(stderr, "XXX stderr HUP!\n");
+
                        if (pollfds[1].revents &
                            (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) {
                                if (process_output(stderr_fd, STDERR_FILENO)
@@ -954,6 +957,9 @@ doio(int stdin_fd, int appin_fd, int stdout_fd, int stderr_fd, int sig_fd,

                /* event from master side stdout */
                if (pollfds[0].revents) {
+                       if (pollfds[0].revents & POLLHUP)
+                               fprintf(stderr, "XXX stdout HUP!\n");
+
                        if (pollfds[0].revents &
                            (POLLIN | POLLRDNORM | POLLRDBAND | POLLPRI)) {
                                if (process_output(stdout_fd, STDOUT_FILENO)

And sure enough, after the dockercmd process finishes, we start seeing HUP events for both of the zone's fds.

That suggests the fix can probably make use of these POLLHUP events, though I'll need to dig a bit further to make sure other zlogin functionality isn't broken.
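One way a relay loop could use those hangups is sketched below. This is a hypothetical illustration, not the change committed as fc356053 (its diff isn't shown in this issue); `relay_until_hup` and `write_all` are made-up names, and the test harness uses pipes as stand-ins for the zone's stdout/stderr fds. The idea: read whenever any event (including POLLHUP) is reported, since data may still be buffered after a hangup, and treat `read()` returning 0 as end-of-stream, after which that fd is dropped from the poll set.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		buf += n;
		len -= (size_t)n;
	}
	return (0);
}

/*
 * Hypothetical relay (not the committed fix): copy src[i] to dst[i] for two
 * streams (zone stdout/stderr in zlogin's case) until both sources reach
 * EOF or hang up. Returns 0 once both are closed, -1 on a poll/write error.
 */
static int relay_until_hup(const int src[2], const int dst[2])
{
	struct pollfd pfds[2];
	int open_count = 2;
	char buf[4096];

	assert(src != NULL && dst != NULL);
	for (int i = 0; i < 2; i++) {
		pfds[i].fd = src[i];
		pfds[i].events = POLLIN | POLLPRI;
		pfds[i].revents = 0;
	}

	while (open_count > 0) {
		if (poll(pfds, 2, -1) < 0) {
			if (errno == EINTR)
				continue;
			return (-1);
		}
		for (int i = 0; i < 2; i++) {
			short rev = pfds[i].revents;
			int done = 0;

			/* Entries with fd < 0 are skipped by poll(): revents 0. */
			if (rev == 0)
				continue;
			if (rev & POLLNVAL) {
				done = 1;
			} else {
				/*
				 * Read on POLLIN *and* on POLLHUP/POLLERR: data may
				 * still be buffered after a hangup, and read()
				 * returning 0 says the stream is really finished.
				 */
				ssize_t n = read(pfds[i].fd, buf, sizeof (buf));

				if (n > 0) {
					if (write_all(dst[i], buf, (size_t)n) != 0)
						return (-1);
				} else if (n == 0 || errno != EINTR) {
					done = 1;	/* EOF, or an error such as EIO */
				}
			}
			if (done) {
				pfds[i].fd = -1;	/* stop polling this stream */
				open_count--;
			}
		}
	}
	return (0);
}
```

Setting an entry's fd to -1 is the POSIX way to make poll() ignore it, so the loop neither spins on a stream that has already hung up nor exits while the other stream may still have output pending.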


Comment by Former user
Created at 2020-01-02T19:58:57.702Z

I should also add that, before trying the modified zlogin binary, I ran truss on the zlogin binary (using the above test) and saw that approximately 5 seconds after '20' appeared, zlogin entered a loop, repeatedly calling pollsys (the system call behind poll(2)) and then read on fds 5 and 6. This is what led me to check whether a POLLHUP event was being delivered.
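Why a hangup can turn into a busy loop: POSIX specifies that POLLHUP is always reported in `revents`, even when it is not requested in `events`, and the condition persists. So once the other side goes away, every poll() call returns immediately. The minimal sketch below assumes a pipe as a stand-in for the zone's fds 5 and 6 (the real fds may be a different file type); `poll_after_writer_closed` is a hypothetical helper name. It polls asking only for input events, as the unpatched loop did, and still gets POLLHUP back.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Illustration only: a pipe stands in for one of the zone's output fds.
 * Close the write end (the process on the other side exits), then poll
 * the read end asking only for input events. Returns the revents reported
 * by poll(), or -1 on error. On success the read end is handed back via
 * *read_fd so the caller can probe it further.
 */
static int poll_after_writer_closed(int *read_fd)
{
	int fds[2];
	struct pollfd pfd;

	assert(read_fd != NULL);
	if (pipe(fds) != 0)
		return (-1);
	(void) close(fds[1]);		/* the writer goes away */

	pfd.fd = fds[0];
	pfd.events = POLLIN | POLLPRI;	/* POLLHUP deliberately not requested */
	pfd.revents = 0;

	/* A zero timeout means poll() never blocks here. */
	if (poll(&pfd, 1, 0) != 1) {
		(void) close(fds[0]);
		return (-1);
	}
	*read_fd = fds[0];
	return (pfd.revents);
}
```

A loop that only acts on the POLLIN-family bits, and never treats the hangup (or a 0-byte read) as end-of-stream, will keep getting this event back immediately on every iteration, which matches the poll/read pattern seen in truss.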


Comment by Former user
Created at 2020-01-02T21:47:40.336Z

I tested this by creating a patched zlogin binary and running the test case in the description. zlogin would then exit approximately 5 seconds after '20' was displayed.

I did note that once zlogin exited, the invoking shell no longer seemed to display any output, though it was still accepting input: if I exited the shell and then logged back in, I could see the things I had typed (after zlogin exited) in the shell history.
I then built a zlogin binary using a snapshot of zlogin.c from the previous release and verified it exhibited the same behavior, so that seems to be an unrelated issue.

@accountid:70121:a36ea101-b8c9-4f3d-825e-334bc077ca5e was also able to test a PI built with the fix and verified that zlogin is now exiting.


Comment by Jira Bot
Created at 2020-01-02T23:52:26.708Z

illumos-joyent commit fc356053b6fcdfb2eb1f9353e1b7e5332fbfcaf8 (branch master, by Jason King)

OS-8083 zlogin -I now hangs at zone stop (#247)

Reviewed by: Mike Gerdts <mike.gerdts@joyent.com>
Approved by: Mike Gerdts <mike.gerdts@joyent.com>