TRITON-2308: vmware(coal) images broken since 20220410T022206Z


Issue Type:Bug
Priority:5 - Low
Created at:2022-05-21T04:24:49.963Z
Updated at:2022-05-24T01:28:28.519Z


Created by:Brian Bennett
Reported by:Brian Bennett
Assigned to:Brian Bennett


With the release of 20220505, we’ve discovered that the VM won’t boot anymore due to a hash mismatch. Tracing it back, the earliest build exhibiting this issue is 20220410T022206Z. What’s extremely concerning is that we can’t find any change that correlates with this. There were no changes to smartos-live, illumos-joyent, or sdc-headnode. In fact, the 20220409T022215Z and 20220410T022206Z headnode builds both use the same platform image, 20220407T171243Z, so their boot files should be identical. It’s also concerning that the headnode USB and SmartOS USB images don’t have this issue, and SmartOS.vmwarevm images (after being corrected for OS-8386) do not have it either.


Comment by Brian Bennett
Created at 2022-05-21T04:33:00.979Z

I booted up a 0310 headnode vmwarevm and copied the broken 0505 vmdk into it so that I could examine it.

The boot_archive.hash is 4f8467fa264150f93fb3ffd647f57dff01a8a2b0, and the hash of the boot_archive for a working platform image is also 4f8467fa264150f93fb3ffd647f57dff01a8a2b0. The hash of the boot_archive on the broken 0505 vmwarevm image is 24387b1963889193d5e9cebe454fb305a734de2f.
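For anyone repeating this comparison, here is a minimal sketch of the check in Python. It assumes the recorded value is a SHA-1 digest (the 40 hex digit values above suggest that, but it is an assumption) and that the hash file holds the digest as its first whitespace-separated field. The function names are hypothetical and are not part of any Triton tooling.

```python
import hashlib
from pathlib import Path


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so a large boot_archive is never
        # loaded into memory all at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches_hash(archive_path, hash_path):
    """Compare the archive's computed digest with the recorded one.

    Assumes the digest is the first field in the hash file; an empty
    hash file is treated as a mismatch.
    """
    fields = Path(hash_path).read_text().split()
    recorded = fields[0].lower() if fields else ""
    return file_sha1(archive_path) == recorded
```

Running archive_matches_hash against the boot_archive and boot_archive.hash of a mounted image should return False for an image in the broken state described above.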

Comment by Brian Bennett
Created at 2022-05-21T04:54:31.765Z
Updated at 2022-05-21T04:56:06.186Z

The MNX Jenkins wasn’t created until 2022-04-14, and updates.tdc wasn’t created until 2022-05-03. Both dates are after the first failing build (20220410T022206Z), so neither of those can be the cause either.

Comment by Brian Bennett
Created at 2022-05-24T01:26:31.262Z
Updated at 2022-05-24T01:28:28.443Z

As inexplicably as this started, it seems to have gone away on its own. This is particularly concerning, but now that the issue can’t be replicated, I don’t really know what to do about it.

For posterity, the first failing build is 20220410T022206Z. The last failing build is 20220521T022448Z.

Should this issue ever come up again, this ticket hopefully can provide some guidance.

The way builds for sdc-headnode work is that we call make all, which depends on coal gz-tools ipxe. We never actually call make usb (but that target should work). For coal to be built, it needs a copy of the USB image, and since one doesn’t exist, build-coal-image calls build-usb-image. Once that’s done, we dd the rootfs out of the resulting image so that we can mount it and copy in the devtools, dd the updated rootfs back into the USB image, and then copy the image into the vmwarevm directory. The USB image created in this process is the final image that gets uploaded to Manta. There’s no separate USB build step.
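The dd round trip described above can be sketched in Python as a pair of byte-range operations. This is not the sdc-headnode build script, which does this with dd from bash. The function names are hypothetical, and the offset and length would have to come from the image's real partition layout. The read-back check at the end is an addition for illustration, as one way to catch a region that did not land in the image as written.

```python
import hashlib


def read_region(image_path, offset, length):
    """Read `length` bytes starting at `offset`, like dd with skip/count."""
    with open(image_path, "rb") as img:
        img.seek(offset)
        data = img.read(length)
    if len(data) != length:
        raise ValueError("image is shorter than the requested region")
    return data


def write_region(image_path, offset, data):
    """Overwrite bytes in place, like dd with seek and conv=notrunc."""
    # "r+b" opens for update without truncating the rest of the image.
    with open(image_path, "r+b") as img:
        img.seek(offset)
        img.write(data)


def replace_region_verified(image_path, offset, new_data):
    """Write new_data at offset, then read it back and compare digests."""
    write_region(image_path, offset, new_data)
    written = read_region(image_path, offset, len(new_data))
    return hashlib.sha1(written).hexdigest() == hashlib.sha1(new_data).hexdigest()
```

A build step using something like this could fail fast when replace_region_verified returns False, rather than shipping an image whose contents differ from what was written.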

The only thing going on here is system commands: dd, cp, mv, all controlled by bash. The fact that this had:

makes this one of the most perplexing issues I’ve ever dealt with. And now that the problem has vanished as mysteriously as it started, there’s no way to make progress on it.

If you’re dealing with broken headnode builds and have ended up here looking for answers, I deeply apologize that the only thing I can provide you, dear reader, is more questions. If this had not happened to me, I would not believe it. I bid you good luck.