IMAGE-1014: cloud-init doesn't retry metadata lookups and hangs forever if metadata is down when VM is started

Details

Issue Type:Bug
Priority:3 - Elevated
Status:Resolved
Created at:2017-02-23T06:40:01.000Z
Updated at:2018-05-10T21:49:03.855Z

People

Created by:Former user
Reported by:Josh Wilsdon
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2018-05-10T21:49:03.836Z)

Labels

bhyve

Description

See INC-1191 for some background. Basically though:

In order to reproduce this, you can simply:

svcadm disable metadata

and then boot a KVM VM running an Ubuntu image such as:

c864f104-624c-43d2-835e-b49a39709b6b (ubuntu-certified-14.04 20150225.2)

when you do this, the VM's boot process will hang at cloud-init. If you then start the metadata service, cloud-init will not recover.

This is problematic not just because of OS-5950 which caused delays for metadata sockets showing up in KVM VMs, but also in general if there happens to be some reason the metadata agent is restarting for any reason while a KVM VM is booting.

From INC-1191, I was able to cause forward progress to be made by applying this patch:

--- /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py.ori	2017-02-23 01:28:28.405885775 +0000
+++ /usr/lib/python2.7/dist-packages/cloudinit/sources/DataSourceSmartOS.py	2017-02-23 01:35:51.281885775 +0000
@@ -286,7 +286,7 @@
     if not seed_device:
         raise AttributeError("seed_device value is not set")

-    ser = serial.Serial(seed_device, timeout=seed_timeout)
+    ser = serial.Serial(seed_device, timeout=10)
     if not ser.isOpen():
         raise SystemError("Unable to open %s" % seed_device)

which causes the following strace output:

[pid  2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid  2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid  2119] write(5, "GET user-script\n", 16) = 16
[pid  2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid  2119] close(5)                    = 0
[pid  2119] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid  2119] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid  2119] write(5, "GET iptables_disable\n", 21) = 21
[pid  2119] select(6, [5], [], [], {10, 0}) = 0 (Timeout)
[pid  2119] close(5)                    = 0

instead of:

[pid  1977] open("/dev/ttyS1", O_RDWR|O_NOCTTY|O_NONBLOCK) = 5
[pid  1977] ioctl(5, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 -opost -isig -icanon -echo ...}) = 0
[pid  1977] write(5, "GET base64_keys\n", 16) = 16
[pid  1977] select(6, [5], [], [], NULL

which you get without the patch (notice the NULL for the timeout parameter). The code that gets blocked in this version of cloud-init is:

    ser.write("GET %s\n" % noun.rstrip())
    status = str(ser.readline()).rstrip()

in cloudinit/sources/DataSourceSmartOS.py. The ser.readline() documentation says (https://pyserial.readthedocs.io/en/latest/shortintro.html#readline):

Be careful when using readline(). Do specify a timeout when opening the serial port otherwise it could block forever if no newline character is received. Also note that readlines() only works with a timeout. readlines() depends on having a timeout and interprets that as EOF (end of file). It raises an exception if the port is not opened correctly.

which is exactly the situation we've hit here.

It seems the best solution would be to have a timeout but when the timeout is hit, the GET should be retried if there's no answer rather than moving on to the next key. A negative answer (NOTFOUND for example) should not be retried, only when there's no answer (because metadata is unavailable).

Once this is resolved, it should be possible to start a VM with cloud-init and metadata disabled, and then enable metadata some time later and have the boot process complete at that time.

Comments

Comment by Josh Wilsdon
Created at 2017-02-23T06:57:09.000Z
Updated at 2018-05-07T17:50:30.386Z

The version of cloudinit in the image I mentioned above is:

root@442387fb-e102-66a7-bec6-a548532ad13b:~# dpkg -l | grep cloud-init
ii  cloud-init                       0.7.5-0ubuntu1.3                 all          Init scripts for cloud instances
root@442387fb-e102-66a7-bec6-a548532ad13b:~#

Comment by Former user
Created at 2017-02-24T16:15:54.000Z

Here's the public bug report: https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1667735


Comment by Former user
Created at 2017-03-01T15:46:40.000Z

The lead for cloud-init seems to disagree on the proposed solution. Not sure where to go from here and admittedly I don't entirely understand the underlying cloud-init issue.


Comment by Josh Wilsdon
Created at 2017-03-01T21:16:56.000Z
Updated at 2018-05-07T17:50:41.741Z

Before commenting further, I wanted to run a test with latest. So I provisioned using the image:

{
    "id": "f62a1f5b-3a41-4cdf-a186-173c60aca70d",
    "name": "ubuntu-certified-16.10",
    "version": "20170222",
    "os": "linux",
    "requirements": {},
    "type": "zvol",
    "description": "Ubuntu 16.10 (20170222 64-bit). Certified Ubuntu Server Cloud Image from Canonical.",
    "files": [
        {
            "compression": "gzip",
            "sha1": "c85dd821ed246554a4e976ca57b5c7fb2be28d26",
            "size": 310949515
        }
    ],
    "tags": {
        "default_user": "ubuntu",
        "role": "os"
    },
    "homepage": "https://docs.joyent.com/images/linux/ubuntu-certified",
    "published_at": "2017-02-24T09:29:32.000Z",
    "owner": "930896af-bf8c-48d4-885c-6573a94b1853",
    "public": true,
    "state": "active"
}

and found that though cloud-init has changed a lot, the behavior when metadata is off seems similar even though there's a timeout. The code in question now looks like:

        self.fp.write(msg.encode('ascii'))
        self.fp.flush()

        response = bytearray()
        response.extend(self.fp.read(1))
        while response[-1:] != b'\n':
            response.extend(self.fp.read(1))

and there's now a 60 second timeout, but since this is an infinite loop of reads, the result is exactly the same as the infinite timeout that was here before. Under strace this looks like:

[pid  2572] write(5, "V2 29 d75a1f7f 4087b959 GET c2Rj"..., 45) = 45
[pid  2572] select(9, [8], [5], [], NULL) = 1 (out [5])
[pid  2572] ioctl(5, TCSBRK, 1)         = 0
[pid  2572] select(7, [5 6], [], [], {60, 0}) = 0 (Timeout)
[pid  2572] select(7, [5 6], [], [], {60, 0}) = 0 (Timeout)
[pid  2572] select(7, [5 6], [], [], {60, 0}) = 0 (Timeout)
[pid  2572] select(7, [5 6], [], [], {60, 0}) = 0 (Timeout)
[pid  2572] select(7, [5 6], [], [], {60, 0}

and when I turn metadata back on, of course there's no way for this to notice, so startup of this VM will be blocked until it is rebooted again.

fwiw, the easiest way to test this seems to be:

rm /var/lib/cloud/instance/obj.pkl ; /usr/bin/cloud-init -d init --local

which will cause cloud-init to retry init and will make it query metadata. If you stop metadata and run this, it'll hang until killed, even if metadata is turned back on. Which is the same behavior you see if you boot one of these VMs while metadata is disabled.

After some more experimentation, I found that the self.fp.read(1) simply returns an empty string here, apparently due to:

https://github.com/pyserial/pyserial/blob/e0063f13ab53b1a3c9fa9f0b9c65444385e3766f/serial/serialposix.py#L472-L481

which matches what we're seeing here (the select, timeout and then return of b'').

It's clear that this implementation also doesn't follow the recommendations of https://eng.joyent.com/mdata/protocol.html either, as it's not reading and discarding bytes from the serial device. I'll do a bit more experimentation here and then update the launchpad ticket.


Comment by Former user
Created at 2017-05-16T15:36:08.000Z

Assigning back to you so this doesn't get lost.

The issue is being tracked upstream via https://bugs.launchpad.net/cpc-joyent/+bug/1667361 but I haven't seen much activity there.


Comment by Josh Wilsdon
Created at 2018-01-06T00:14:34.412Z

I'm not reasonably going to find time to do work on this anytime soon, so unassigning in case someone wants to pick this up.


Comment by Former user
Created at 2018-02-14T20:37:47.145Z

I have a working fix for this.  It is included in my preliminary guest images for bhyve


Comment by Former user
Created at 2018-02-14T20:41:28.849Z

Not sure how the assignment for this should go. I'm working on a fix for cloud-init upstream, but hope that someone else can pick up the updating of the images.


Comment by Former user
Created at 2018-04-18T19:53:59.111Z

Fixed upstream.

https://git.launchpad.net/cloud-init/commit/?id=4c573d0e0173d2b1e99a383c54a0a6c957aa1cbb


Comment by Former user
Created at 2018-04-24T18:12:49.425Z

This fix is expected to be included in Bionic Beaver (18.04) and should appear in the next SRU for Xenial (16.04) and Artful (17.10) in a bit over a month.


Comment by Former user
Created at 2018-05-10T21:49:03.852Z

Fix is upstream, images that need it can pull it in.