NAPI-327: net-agent should not leak NICs

Details

Issue Type:Bug
Priority:3 - Elevated
Status:Open
Created at:2015-09-28T06:48:22.000Z
Updated at:2019-10-19T00:00:29.624Z

People

Created by:Former user
Reported by:Former user

Related Issues

Description

There have been incidents of duplicate IP address assignment associated with IP addresses that were once assigned to instances that failed provisioning. In examining the network and vm artifacts, it is clear that there are cases when the nics are left behind for such instances:
1) When the provisioning process has already created the nics but failed before the nic were written back to the vm record, subsequent delete instance request would not attempt to remove those nics.
2) When the provisioning process has already created the nics but failed to start up for various reasons, subsequent delete instance request would fail to remove the nics since the RemoveNic workflow requires a restart of the instance. In this situation, the container is stuck in "failed" status and cannot be deleted.

(Note: There is a 3rd situation which I've observed with `docker-compose rm` operation where the IP and nic records were not unset after the container has been "successfully" deleted. The bug is still being tracked down but will likely be a different one.)

The above situation got worse when the operator uses the NAPI update IP action to `free` or `unassign` the IP address (NAPI-326) in order to reuse the IP address. What they didn't notice is that the nic records might not have been cleaned up even after they have deleted the container. To ensure the delete workflow indeed takes care of nics, it should look up ip and nic records using the belong_to_uuid, rather than attempting to do the clean up based on the list of nics present in the vm record.

UPDATE:

Attached is a script that can reproduce the leaky IPs. It creates a VM using vmapi, deletes it using vmadm, and then uses Trent's sdc-wasted-ips script to detect the wasted ips (and the vm with which they are associated).

Also attached is a sample manifest for use with sdc-vmadm. You will need to change the owner_uuid, the uuid stored in the networks member, and anything else that's system specific.

Net-agent is not cleaning up IPs only when it doesn't have to retry sendFullSample() in NetAgent.prototype.start(). The code-path for initializing the VM event watcher is executed from the onRetry() callback, which only gets called on a retry -- which isn't always going to happen.

Comments

Comment by Former user
Created at 2019-10-19T00:00:29.624Z

Found when migrating from Gerrit to PRs. The CR https://cr.joyent.us/#/c/171/ (archive: https://github.com/joyent/gerrit-migration/tree/master/archive/171) was active in 2016, then in 2018 Cody uploaded some newer versions of the patch. It was not discussed/reviewed after that.

At least the latest patchset has commit 616e56e921fd3bcb9992485314b7edf2cc742c30 for a parent. That commit isn't in the GH repo:

$ git show 616e56e921fd3bcb9992485314b7edf2cc742c30
fatal: bad object 616e56e921fd3bcb9992485314b7edf2cc742c30
[16:56:39 trentm@bluesteel:~/joy/sdc-net-agent (master)]
$ git remote -v
cr	trentm@cr.joyent.us:joyent/sdc-net-agent.git (fetch)
cr	trentm@cr.joyent.us:joyent/sdc-net-agent.git (push)
origin	git@github.com:joyent/sdc-net-agent.git (fetch)
origin	git@github.com:joyent/sdc-net-agent.git (push)

I dson't believe it is in the sdc-net-agent.git repo on cr.joyent.us either:

$ git show 616e56e921fd3bcb9992485314b7edf2cc742c30
fatal: bad object 616e56e921fd3bcb9992485314b7edf2cc742c30
[16:59:25 trentm@bluesteel:~/tmp/sdc-net-agent (master)]
$ git remote -v
origin	trentm@cr.joyent.us:joyent/sdc-net-agent.git (fetch)
origin	trentm@cr.joyent.us:joyent/sdc-net-agent.git (push)

So, I don't feel there is use right now in migrating this CR to a GitHub PR. If someone later wants to do that, they can start by looking at https://github.com/joyent/gerrit-migration/blob/master/bin/cr2pr