Servers that already had a specific UUID were observed to change their UUID, seemingly because the UUID is read from the output of smbios. That caused a lot of unexpected behavior.
This was narrowed down to the BMC being in a bad state. By saving the UUID to the zones pool and trusting that over the smbios output going forward, we prevent this sort of hardware misbehavior from causing operator confusion.
Leveraging the code from the fix for this bug, @Dan McDonald wrote a one-liner that can check for servers where the server UUID is currently mismatched with the output of smbios (whether from override_uuid or from trusting a value stored on the pool):
[root@headnode (mass-1) ~]# sdc-oneachnode -a 'if [[ $(sysinfo | json UUID) != $(smbios | grep UUID: | awk "{print \$2;}" ) ]]; then echo "mismatch on $(sysinfo | json UUID)" ; fi'
HOSTNAME STATUS
cn-1 mismatch on 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3
cn-2
headnode
See also T8dedf5d44cd8520e-M9f0075c99310d59a5207f285
Need to test:
What happens when the smbios UUID changes
What happens when the boot MAC changes
What happens when BOTH change?
We should not test MAC address changes; those could reasonably require operator intervention to fix NIC tag associations. (If you change a NIC on a SmartOS machine, you cannot expect it to “Just Work” upon booting it up again.)
Nahum Shalman commented on 2025-11-14T10:16:22.270-0500:
sysinfo#L103 is where the UUID is read from smbios.
Nahum Shalman commented on 2025-11-19T12:33:39.452-0500:
We apparently also have an "override_uuid" bootparam:
# overwrite UUID if config dictates otherwise
tmp_uuid=$(/usr/bin/bootparams | grep "^override_uuid=" | cut -f2 -d'=')
if [[ -n $tmp_uuid ]]; then
    UUID=$tmp_uuid
fi
… which presumably one could configure booter to send to a machine; it is probably the correct place in the code for any additional logic around a further source of truth.
I think we should do:
1.) get the value from smbios
2.) get the value from the zones pool, if present; if not present, record the smbios value there
3.) get the value from override_uuid, if present, and update the zones pool if so
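A minimal sketch of that precedence as a pure function (the resolve_uuid name and argument convention are hypothetical, not the eventual sysinfo implementation):

```shell
# Hypothetical sketch of the proposed precedence; NOT the actual sysinfo code.
# Arguments: smbios UUID, UUID stored on the zones pool (may be empty),
# override_uuid bootparam (may be empty). Prints the UUID to use.
resolve_uuid() {
    local smbios_uuid="$1" stored_uuid="$2" override_uuid="$3"
    if [[ -n "$override_uuid" ]]; then
        # override_uuid wins; it would also be persisted to the pool
        echo "$override_uuid"
    elif [[ -n "$stored_uuid" ]]; then
        # trust the pool over smbios
        echo "$stored_uuid"
    else
        # first boot: fall back to smbios; the caller records it on the pool
        echo "$smbios_uuid"
    fi
}
```

The key property is that smbios is consulted only when nothing better exists, so a BMC handing back a different UUID later cannot change an already-set-up server.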
Dan McDonald commented on 2025-11-20T10:18:20.429-0500:
First round of testing on the all-VMware-guest Triton Cloud mass-1 on my workstation.
[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME UUID VERSION SETUP STATUS RAM ADMIN_IP
headnode 564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7 7.0 true running 16383 172.16.64.2
[root@headnode (mass-1) ~]# sdcadm platform list
VERSION CURRENT_PLATFORM BOOT_PLATFORM LATEST DEFAULT OS
20251119T203931Z 0 0 true false smartos
20251113T010957Z 1 1 false true smartos
[root@headnode (mass-1) ~]#
Set 'default' to be 20251119T203931Z and bring up a new CN, inspecting pre- and post-setup.
Currently all mass-1 VMware instances (headnode, CN-1) have sane BIOS settings and, as a result, sane UUIDs.
Bringing up CN-1 with new PI. Headnode sees it.
[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME UUID VERSION SETUP STATUS RAM ADMIN_IP
headnode 564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7 7.0 true running 16383 172.16.64.2
00-0c-29-7f-b3-d3 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 7.0 false running 4095 172.16.64.32
[root@headnode (mass-1) ~]#
VM setting is sane:
kebe(~)[0]% grep uuid Virtual\ Machines.localized/mass-1-CN-1.vmwarevm/mass-1-CN-1.vmx
uuid.bios = "56 4d f9 cb 7b 48 9e de-b7 a5 c8 ad 3e 7f b3 d3"
uuid.location = "56 4d f9 cb 7b 48 9e de-b7 a5 c8 ad 3e 7f b3 d3"
kebe(~)[0]%
and the system is pristine, so far no different from any other PI’s boot:
SmartOS (build: 20251119T203931Z)
[root@00-0c-29-7f-b3-d3 ~]# zpool list
no pools available
[root@00-0c-29-7f-b3-d3 ~]# svcs -xv
[root@00-0c-29-7f-b3-d3 ~]# bootparams | grep -i uuid
[root@00-0c-29-7f-b3-d3 ~]# sysinfo | grep -i uuid
"UUID": "564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3",
[root@00-0c-29-7f-b3-d3 ~]#
Let's set it up...
[root@headnode (mass-1) ~]# sdc-server setup 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 hostname=cn-1
[root@headnode (mass-1) ~]#
and...
[root@00-0c-29-7f-b3-d3 ~]#
creating pool: zones
(as potentially bootable)done
adding volume: dump done
adding volume: config done
adding volume: cores done
adding volume: opt done
adding volume: var done
Compute node, installing config files... done
adding volume: swap done
so now it reboots.
CNAPI thinks we're done with setup:
[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME UUID VERSION SETUP STATUS RAM ADMIN_IP
headnode 564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7 7.0 true running 16383 172.16.64.2
cn-1 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 7.0 true unknown 4095 172.16.64.32
[root@headnode (mass-1) ~]#
That's normal until CN-1 is fully up. CNAPI now says good to go:
[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME UUID VERSION SETUP STATUS RAM ADMIN_IP
headnode 564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7 7.0 true running 16383 172.16.64.2
cn-1 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 7.0 true running 4095 172.16.64.32
[root@headnode (mass-1) ~]#
Let's log in! Seems sane.
[root@cn-1 (mass-1) ~]# zfs get all zones/var | grep triton
zones/var com.tritondatacenter:uuid 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 local
[root@cn-1 (mass-1) ~]# bootparams | grep -i uuid
[root@cn-1 (mass-1) ~]# smbios | grep -i uuid
UUID: 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3
UUID (Endian-corrected): cbf94d56-487b-de9e-b7a5-c8ad3e7fb3d3
[root@cn-1 (mass-1) ~]# sysinfo | grep -i uuid
"UUID": "564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3",
[root@cn-1 (mass-1) ~]#
Dan McDonald commented on 2025-11-20T10:25:15.518-0500:
Continuing round 1.
1.) Get CN-2 up but with 20251113... aka current release.
2.) Make sure everything's nice and hunky-dory.
3.) Power off both CNs.
4.) Introduce a single-bit error on BOTH CNs. VMware’s Ethernet MAC address derives from its UUID, so the single-bit error goes into the vendor portion (564d becomes 564c), leaving the MAC untouched.
5.) Reboot both and see what happens.
So after rebooting both:
1.) cn-2 doubled-up in CNAPI.
2.) cn-1 (with TRITON-2520) did not.
[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME UUID VERSION SETUP STATUS RAM ADMIN_IP
cn-2 564c0626-c15f-78d0-056e-d4cc6b7aa8ff 7.0 true running 4095 172.16.64.33
cn-2 564d0626-c15f-78d0-056e-d4cc6b7aa8ff 7.0 true unknown 4095 172.16.64.33
headnode 564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7 7.0 true running 16383 172.16.64.2
cn-1 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 7.0 true unknown 4095 172.16.64.32
[root@headnode (mass-1) ~]#
CN-1 self-corrected:
cn-1 ttyb login: root
Password:
Last login: Wed Nov 19 22:33:20 on term/b
2025-11-19T23:18:55+00:00 cn-1 login: [ID 644210 auth.notice] ROOT LOGIN /dev/term/b
SmartOS (build: 20251119T203931Z)
[root@cn-1 (mass-1) ~]# zfs get all zones/var | grep triton
zones/var com.tritondatacenter:uuid 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 local
[root@cn-1 (mass-1) ~]# smbios | grep -i uuid
UUID: 564cf9cb-7b48-9ede-b7a5-c8ad3e7fb3d3
UUID (Endian-corrected): cbf94c56-487b-de9e-b7a5-c8ad3e7fb3d3
[root@cn-1 (mass-1) ~]# sysinfo | grep -i uuid
"UUID": "564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3",
[root@cn-1 (mass-1) ~]#
CN-2 did not:
cn-2 ttyb login: root
Password:
Last login: Wed Nov 19 23:14:18 on term/b
2025-11-19T23:19:45+00:00 cn-2 login: [ID 644210 auth.notice] ROOT LOGIN /dev/term/b
SmartOS (build: 20251113T010957Z)
[root@cn-2 (mass-1) ~]# zfs get all zones/var | grep triton
[root@cn-2 (mass-1) ~]# smbios | grep -i uuid
UUID: 564c0626-c15f-78d0-056e-d4cc6b7aa8ff
UUID (Endian-corrected): 26064c56-5fc1-d078-056e-d4cc6b7aa8ff
[root@cn-2 (mass-1) ~]# sysinfo | grep -i uuid
"UUID": "564c0626-c15f-78d0-056e-d4cc6b7aa8ff",
[root@cn-2 (mass-1) ~]#
NOTE that neither of these CNs had VMs deployed on them.
Attaching “right.json”, “wrong.json”, and “diff.txt” for CN-2’s TWO CNAPI entries.
Nahum Shalman commented on 2025-11-20T12:45:56.275-0500 (edited 2025-11-20T12:52:17.738-0500):
flowchart TD
Start([get_or_store_uuid called]) --> CheckZpool{Is $Zpool empty?}
CheckZpool -->|Yes| Return([Return - do nothing])
CheckZpool -->|No| GetSource[Get ZFS property source for<br/>org.smartos:server_uuid<br/>on $Zpool/var]
GetSource --> CheckSource{Is source == 'local'?}
CheckSource -->|No| NoStoredUUID[No stored UUID on pool]
CheckSource -->|Yes| ReadStored[Read UUID value from pool]
ReadStored --> CheckOverride{Is OVERRIDE_UUID set<br/>AND different from<br/>stored value?}
CheckOverride -->|Yes| ValidateOverride{Is OVERRIDE_UUID<br/>a valid UUID?}
CheckOverride -->|No| ValidateStored{Is stored UUID<br/>valid?}
ValidateOverride -->|Yes| UpdatePool[Update pool with<br/>OVERRIDE_UUID]
ValidateOverride -->|No| ValidateStored
UpdatePool --> ReadBack[Read updated value<br/>back from pool]
ReadBack --> ValidateStored
ValidateStored -->|Yes| UseStored[Set UUID to stored value]
ValidateStored -->|No| CleanupInvalid[Delete invalid UUID<br/>from pool using<br/>zfs inherit]
UseStored --> End([Return])
CleanupInvalid --> NoStoredUUID
NoStoredUUID --> ValidateCurrent{Is current UUID<br/>valid?}
ValidateCurrent -->|Yes| PersistUUID[Store current UUID<br/>on pool]
ValidateCurrent -->|No| End
PersistUUID --> End
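The "Is ... a valid UUID?" decisions in the flowchart could look something like the following (the is_valid_uuid name is illustrative, not the actual implementation):

```shell
# Hypothetical validity check matching the flowchart's decision nodes;
# the function name is illustrative, not from the real code.
is_valid_uuid() {
    # 8-4-4-4-12 lowercase hex groups,
    # e.g. 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3
    [[ "$1" =~ ^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$ ]]
}
```

With a check like this, a stored value such as the literal string "bogus" fails validation and takes the CleanupInvalid path (zfs inherit) rather than being used as the server UUID.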
Nahum Shalman commented on 2025-11-21T11:34:06.049-0500:
Cleanup of an invalid value stored on the pool has been tested as well:
[root@triton-2520 ~]# zfs get org.smartos:server_uuid zones/var; sysinfo | grep UUID; sysinfo -u; sysinfo | grep UUID; zfs get org.smartos:server_uuid zones/var
NAME PROPERTY VALUE SOURCE
zones/var org.smartos:server_uuid db40345e-669c-4f10-803f-13e212ac4116 local
"UUID": "db40345e-669c-4f10-803f-13e212ac4116",
"UUID": "db40345e-669c-4f10-803f-13e212ac4116",
NAME PROPERTY VALUE SOURCE
zones/var org.smartos:server_uuid db40345e-669c-4f10-803f-13e212ac4116 local
[root@triton-2520 ~]# zfs set org.smartos:server_uuid=bogus zones/var
[root@triton-2520 ~]# zfs get org.smartos:server_uuid zones/var; sysinfo | grep UUID; sysinfo -u; sysinfo | grep UUID; zfs get org.smartos:server_uuid zones/var
NAME PROPERTY VALUE SOURCE
zones/var org.smartos:server_uuid bogus local
"UUID": "db40345e-669c-4f10-803f-13e212ac4116",
"UUID": "db40345e-669c-4f10-803f-13e212ac4116",
NAME PROPERTY VALUE SOURCE
zones/var org.smartos:server_uuid db40345e-669c-4f10-803f-13e212ac4116 local