TRITON-2229: CN server setup needs to poll napi to ensure net-agent has had time to adopt the nics


Issue Type:Bug
Priority:4 - Normal
Created at:2021-06-04T02:47:09.161Z
Updated at:2022-04-13T21:16:12.973Z


Created by:Brian Bennett
Reported by:Brian Bennett
Assigned to:Brian Bennett




We've run into a situation with Linux where after server setup the CN will reboot back into SmartOS with an unreadable zpool.

Here's what's going on:

  1. Server boots the default PI which is SmartOS
  2. Server is assigned a Linux PI and rebooted
  3. Server setup runs
  4. The workflow job finishes before net-agent has pushed its nics to NAPI
  5. When the server reboots, booter looks in napi for nics owned by the server and finds none
  6. Booter then looks for setup servers by hostname. Since the setup field is still false, booter doesn't find a record and serves the default PI
  7. Server boots up to SmartOS and appears unsetup
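
The fallback in steps 5 through 7 can be modelled roughly as below. This is a toy sketch only, not booter's actual code: the function name `choose_platform`, the dictionary shapes, and the field names (`belongs_to_uuid`, `hostname`, `setup`, `boot_platform`) are assumptions made for illustration.

```python
def choose_platform(mac, hostname, nics_by_mac, servers_by_uuid, default_pi):
    """Toy model of the lookup order described above (hypothetical names)."""
    # Step 5: look for a nic in napi that is owned by a server.
    nic = nics_by_mac.get(mac)
    if nic is not None:
        owner = servers_by_uuid.get(nic.get("belongs_to_uuid"))
        if owner is not None:
            return owner["boot_platform"]

    # Step 6: fall back to matching setup servers by hostname.
    for server in servers_by_uuid.values():
        if server["hostname"] == hostname and server["setup"]:
            return server["boot_platform"]

    # Step 7: nothing matched, so serve the default PI.
    return default_pi


# A server assigned a Linux PI whose setup flag is still false.
servers = {"uuid-1": {"hostname": "cn01", "setup": False,
                      "boot_platform": "linux-pi"}}

# net-agent has not adopted the nic yet, so the default PI is served.
before = choose_platform("aa:bb:cc:00:00:01", "cn01", {}, servers, "smartos-pi")

# Once the nic is owned in napi, the assigned PI is served.
adopted = {"aa:bb:cc:00:00:01": {"belongs_to_uuid": "uuid-1"}}
after = choose_platform("aa:bb:cc:00:00:01", "cn01", adopted, servers, "smartos-pi")
```

In this model the only difference between the bad boot and the good boot is whether the nic record with the right owner exists at reboot time.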

At this point you can manually PUT the nic with correct ownership in NAPI and reboot, and the server will get the correct Linux PI. I've also confirmed that net-agent only needs about 10s to do the nic adoption. For now we're working around this by putting in a sleep, but I believe this condition also exists for SmartOS. It's just never been noticed.

So why isn't it a problem for SmartOS? Because prior to the introduction of Linux CNs, servers were almost exclusively set up on the default PI. If net-agent doesn't finish, the server just boots the default PI that it was supposed to boot anyway. The zpool is readable, so the system boots as expected, net-agent starts up, and it takes ownership of the nics. The next time the server is rebooted, whenever that is, it'll get its assigned PI. Worst case scenario, it boots the default PI, a different PI is assigned, server setup runs, it boots the default PI again, and net-agent does its thing. Operators are likely to think "huh, that was weird" and just reboot it again, never looking closely at it. In a SmartOS-only world, it's exceedingly rare (almost, but not quite, impossible*) to end up with a server that's been through setup but isn't capable of booting properly.

On top of this, we do several more things afterward on SmartOS that we don't do on Linux, and I believe that gives net-agent on SmartOS the time it needs. However, this problem may very well exist in reverse if your default PI is Linux and you want the occasional SmartOS CN.

Either way, the only way for the server to deterministically boot the intended platform is to have a workflow step that ensures the nics are properly owned in napi, rather than leaving it up to chance.
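
A minimal sketch of what such a polling step could look like, written in Python for illustration rather than as the actual workflow task. Everything named here is an assumption: `wait_for_nic_adoption`, the injected `fetch_nics` callable (standing in for whatever NAPI query the workflow would use), the `mac`, `belongs_to_uuid` and `belongs_to_type` fields, and the default timeout and interval. The only number taken from the observations above is that adoption needed roughly 10s.

```python
import time


def wait_for_nic_adoption(fetch_nics, server_uuid, expected_macs,
                          timeout_s=60.0, interval_s=2.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll until every expected nic is owned by the server, or time out.

    fetch_nics is a hypothetical callable that takes the server uuid and
    returns a list of nic dicts as seen in napi. Returns True once all
    expected MACs are owned by the server, False if the deadline passes.
    """
    expected = {mac.lower() for mac in expected_macs}
    deadline = clock() + timeout_s
    while True:
        owned = {
            nic["mac"].lower()
            for nic in fetch_nics(server_uuid)
            if nic.get("belongs_to_uuid") == server_uuid
            and nic.get("belongs_to_type") == "server"
        }
        # With no expected list, settle for at least one owned nic.
        adopted = expected <= owned if expected else bool(owned)
        if adopted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Unlike a fixed sleep, this returns as soon as ownership shows up and fails explicitly when it never does, so the job can report an error instead of finishing and letting the server reboot into the wrong PI.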

* It's actually the same trigger condition. Linux CN's zfs supports options that SmartOS zfs doesn't. If you have two versions of SmartOS where the version it was running at the time of setup has features that the default doesn't support, and net-agent doesn't have time to adopt the nics, you'll boot the default and not be able to import the zpool.