TRITON-2520

Servers should not change UUID on reboot

Status:
Resolved
Created:
2025-10-31T10:30:15.019-0400
Updated:
2025-11-24T10:32:30.587-0500

Description

Servers that already had a specific UUID were observed to change their UUID, seemingly because a (changed) UUID was being read from smbios output. That caused a lot of unexpected behavior.

This was narrowed down to be the result of the BMC being in a bad state. By saving the UUID to the zones pool and trusting that value over the smbios output going forward, we prevent this sort of hardware misbehavior from causing operator confusion.
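
The persisted value can be checked directly against the hardware. This is a sketch, not shipped code: the property and dataset names (org.smartos:server_uuid on zones/var) are taken from the test transcripts later in this ticket, and the function name is made up.

```shell
# Hypothetical helper: compare the UUID persisted on the pool with what
# smbios reports right now. A mismatch means the pool value is actively
# protecting against BMC/SMBIOS drift.
check_uuid_drift() {
    local stored fresh
    stored=$(zfs get -H -o value org.smartos:server_uuid zones/var 2>/dev/null)
    fresh=$(smbios | awk '/UUID:/ {print $2; exit}')
    if [[ $stored != "$fresh" ]]; then
        echo "pool has $stored but smbios reports $fresh"
    fi
}
```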

Leveraging the code from the fix for this bug, @Dan McDonald wrote a one-liner that can check for servers where the server UUID currently differs from the smbios output (whether because of override_uuid or because a value stored on the pool is being trusted):
sdc-oneachnode -a 'if [[ $(sysinfo | json UUID) != $(smbios | grep UUID: | awk "{print \$2;}" ) ]]; then echo "mismatch on $(sysinfo | json UUID)" ; fi'

[root@headnode (mass-1) ~]# sdc-oneachnode -a 'if [[ $(sysinfo | json UUID) != $(smbios | grep UUID: | awk "{print \$2;}" ) ]]; then echo "mismatch on $(sysinfo | json UUID)" ; fi'
HOSTNAME              STATUS
cn-1                  mismatch on 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3
cn-2                  
headnode              

See also T8dedf5d44cd8520e-M9f0075c99310d59a5207f285

Need to test:

  1. What happens when smbios uuid changes

  2. What happens when boot MAC changes

  3. What happens when BOTH change?

We should not test MAC address changes: those could reasonably require operator intervention to fix nic tag associations. (If you change a NIC on a SmartOS machine, you cannot expect it to “Just Work” upon booting it up again.)

Comments (6)

Nahum Shalman commented on 2025-11-14T10:16:22.270-0500:

sysinfo#L103 is where the UUID is read from smbios.

Nahum Shalman commented on 2025-11-19T12:33:39.452-0500:

We apparently also have an "override_uuid" bootparam:

    # overwrite UUID if config dictates otherwise
    tmp_uuid=$(/usr/bin/bootparams | grep "^override_uuid=" | cut -f2 -d'=')
    if [[ -n $tmp_uuid ]]; then
        UUID=$tmp_uuid
    fi

… which presumably one could configure booter to send to a machine, and which is probably the right place in the code to hook in any additional source of truth.

I think we should do:

  1. get value from smbios

  2. get value from zones pool, if present; if not present, record the value from smbios

  3. get value from override_uuid, if present; if so, also update the zones pool
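
The steps above can be sketched roughly as follows. The function name is illustrative (the real change lives in sysinfo), and the property/dataset names come from elsewhere in this ticket; validation is omitted.

```shell
# Illustrative precedence: smbios, then zones pool, then override_uuid.
resolve_uuid() {
    local uuid stored override
    uuid=$(smbios | awk '/UUID:/ {print $2; exit}')           # 1. smbios
    stored=$(zfs get -H -o value org.smartos:server_uuid zones/var 2>/dev/null)
    if [[ -n $stored && $stored != "-" ]]; then
        uuid=$stored                                          # 2. pool, if present
    else
        zfs set org.smartos:server_uuid="$uuid" zones/var     #    else seed it
    fi
    override=$(bootparams | grep '^override_uuid=' | cut -f2 -d'=')
    if [[ -n $override ]]; then
        uuid=$override                                        # 3. override wins...
        zfs set org.smartos:server_uuid="$override" zones/var #    ...and is persisted
    fi
    echo "$uuid"
}
```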

Dan McDonald commented on 2025-11-20T10:18:20.429-0500:

First round of testing on the all-VMware-guest Triton Cloud mass-1 on my workstation.

[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME             UUID                                 VERSION    SETUP    STATUS      RAM  ADMIN_IP       
headnode             564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7     7.0     true   running    16383  172.16.64.2    
[root@headnode (mass-1) ~]# sdcadm platform list
VERSION           CURRENT_PLATFORM  BOOT_PLATFORM  LATEST  DEFAULT  OS
20251119T203931Z  0                 0              true    false    smartos
20251113T010957Z  1                 1              false   true     smartos
[root@headnode (mass-1) ~]# 

Set 'default' to be 20251119T203931Z and bring up a new CN, inspecting pre-and-post setup.

Currently all mass-1 VMware instances (headnode, CN-1) have sane BIOS settings and sane UUIDs as a result.

Bringing up CN-1 with new PI. Headnode sees it.

[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME             UUID                                 VERSION    SETUP    STATUS      RAM  ADMIN_IP       
headnode             564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7     7.0     true   running    16383  172.16.64.2    
00-0c-29-7f-b3-d3    564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3     7.0    false   running     4095  172.16.64.32   
[root@headnode (mass-1) ~]# 

VM setting is sane:

kebe(~)[0]% grep uuid Virtual\ Machines.localized/mass-1-CN-1.vmwarevm/mass-1-CN-1.vmx
uuid.bios = "56 4d f9 cb 7b 48 9e de-b7 a5 c8 ad 3e 7f b3 d3"
uuid.location = "56 4d f9 cb 7b 48 9e de-b7 a5 c8 ad 3e 7f b3 d3"
kebe(~)[0]% 

and the system is pristine, and so far no different than any other PI’s boot:

SmartOS (build: 20251119T203931Z)
[root@00-0c-29-7f-b3-d3 ~]# zpool list
no pools available
[root@00-0c-29-7f-b3-d3 ~]# svcs -xv
[root@00-0c-29-7f-b3-d3 ~]# bootparams | grep -i uuid
[root@00-0c-29-7f-b3-d3 ~]# sysinfo | grep -i uuid
  "UUID": "564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3",
[root@00-0c-29-7f-b3-d3 ~]# 

Let's set it up...

[root@headnode (mass-1) ~]# sdc-server setup 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 hostname=cn-1
[root@headnode (mass-1) ~]# 

and...

[root@00-0c-29-7f-b3-d3 ~]# 
creating pool: zones                                    
                                                                  (as potentially bootable)done
adding volume: dump                                     done
adding volume: config                                   done
adding volume: cores                                    done
adding volume: opt                                      done
adding volume: var                                      done
Compute node, installing config files...                done
adding volume: swap                                     done

so now it reboots.

CNAPI thinks we're done with setup:

[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME             UUID                                 VERSION    SETUP    STATUS      RAM  ADMIN_IP       
headnode             564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7     7.0     true   running    16383  172.16.64.2    
cn-1                 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3     7.0     true   unknown     4095  172.16.64.32   
[root@headnode (mass-1) ~]# 

That's normal until CN-1 is fully up.  CNAPI now says good to go:

[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME             UUID                                 VERSION    SETUP    STATUS      RAM  ADMIN_IP       
headnode             564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7     7.0     true   running    16383  172.16.64.2    
cn-1                 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3     7.0     true   running     4095  172.16.64.32   
[root@headnode (mass-1) ~]# 

Let's log in! Seems sane.

[root@cn-1 (mass-1) ~]# zfs get all zones/var | grep triton
zones/var  com.tritondatacenter:uuid  564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3  local
[root@cn-1 (mass-1) ~]# bootparams | grep -i uuid
[root@cn-1 (mass-1) ~]# smbios | grep -i uuid
  UUID: 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3
  UUID (Endian-corrected): cbf94d56-487b-de9e-b7a5-c8ad3e7fb3d3
[root@cn-1 (mass-1) ~]# sysinfo | grep -i uuid
  "UUID": "564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3",
[root@cn-1 (mass-1) ~]# 
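
(Aside: the two UUID lines smbios prints differ because the SMBIOS table stores the first three UUID fields little-endian, so their bytes come out reversed; the last two fields are unchanged. A quick shell reproduction, with made-up helper names:)

```shell
# Reproduce smbios's "Endian-corrected" form: reverse the bytes of the
# first three UUID fields, leave the last two fields alone.
swap4() { sed -E 's/^(..)(..)/\2\1/' <<<"$1"; }
swap8() { sed -E 's/^(..)(..)(..)(..)/\4\3\2\1/' <<<"$1"; }
endian_correct() {
    local a b c d e
    IFS=- read -r a b c d e <<<"$1"
    echo "$(swap8 "$a")-$(swap4 "$b")-$(swap4 "$c")-$d-$e"
}
```

Running endian_correct on 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3 yields the cbf94d56-… form shown above.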

Dan McDonald commented on 2025-11-20T10:25:15.518-0500:

Continuing round 1.

1.) Get CN-2 up but with 20251113... aka current release.
2.) Make sure everything's nice and hunky-dory.
3.) Power off both CNs.
4.) Introduce a single-bit error on BOTH CNs. VMware’s ethernet MAC address derives from its UUID, so the single-bit error goes in the vendor portion (564d becomes 564c).
5.) Reboot both and see what happens.
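
(For the record, 564d → 564c really is a one-bit change:)

```shell
# 0x564d and 0x564c differ only in the least-significant bit.
printf '564d ^ 564c = %x\n' $((0x564d ^ 0x564c))
```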

So after rebooting both:

1.) cn-2 doubled-up in CNAPI.
2.) cn-1 (with TRITON-2520) did not.

[root@headnode (mass-1) ~]# sdc-server list
HOSTNAME             UUID                                 VERSION    SETUP    STATUS      RAM  ADMIN_IP       
cn-2                 564c0626-c15f-78d0-056e-d4cc6b7aa8ff     7.0     true   running     4095  172.16.64.33   
cn-2                 564d0626-c15f-78d0-056e-d4cc6b7aa8ff     7.0     true   unknown     4095  172.16.64.33   
headnode             564d5aa9-490e-fd3a-e08d-cad4f4a5d7b7     7.0     true   running    16383  172.16.64.2    
cn-1                 564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3     7.0     true   unknown     4095  172.16.64.32   
[root@headnode (mass-1) ~]# 

CN-1 self-corrected:

cn-1 ttyb login: root
Password: 
Last login: Wed Nov 19 22:33:20 on term/b
2025-11-19T23:18:55+00:00 cn-1 login: [ID 644210 auth.notice] ROOT LOGIN /dev/term/b
SmartOS (build: 20251119T203931Z)
[root@cn-1 (mass-1) ~]# zfs get all zones/var | grep triton
zones/var  com.tritondatacenter:uuid  564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3  local
[root@cn-1 (mass-1) ~]# smbios | grep -i uuid
  UUID: 564cf9cb-7b48-9ede-b7a5-c8ad3e7fb3d3
  UUID (Endian-corrected): cbf94c56-487b-de9e-b7a5-c8ad3e7fb3d3
[root@cn-1 (mass-1) ~]# sysinfo | grep -i uuid
  "UUID": "564df9cb-7b48-9ede-b7a5-c8ad3e7fb3d3",
[root@cn-1 (mass-1) ~]# 

CN-2 did not:

cn-2 ttyb login: root
Password: 
Last login: Wed Nov 19 23:14:18 on term/b
2025-11-19T23:19:45+00:00 cn-2 login: [ID 644210 auth.notice] ROOT LOGIN /dev/term/b
SmartOS (build: 20251113T010957Z)
[root@cn-2 (mass-1) ~]# zfs get all zones/var | grep triton
[root@cn-2 (mass-1) ~]# smbios | grep -i uuid
  UUID: 564c0626-c15f-78d0-056e-d4cc6b7aa8ff
  UUID (Endian-corrected): 26064c56-5fc1-d078-056e-d4cc6b7aa8ff
[root@cn-2 (mass-1) ~]# sysinfo | grep -i uuid
  "UUID": "564c0626-c15f-78d0-056e-d4cc6b7aa8ff",
[root@cn-2 (mass-1) ~]# 

NOTE that neither of these CNs had VMs deployed on them.

Attaching “right.json”, “wrong.json”, and “diff.txt” for CN-2’s TWO CNAPI entries.

Nahum Shalman commented on 2025-11-20T12:45:56.275-0500 (edited 2025-11-20T12:52:17.738-0500):

flowchart TD
    Start([get_or_store_uuid called]) --> CheckZpool{Is $Zpool empty?}

    CheckZpool -->|Yes| Return([Return - do nothing])
    CheckZpool -->|No| GetSource[Get ZFS property source for<br/>org.smartos:server_uuid<br/>on $Zpool/var]

    GetSource --> CheckSource{Is source == 'local'?}

    CheckSource -->|No| NoStoredUUID[No stored UUID on pool]
    CheckSource -->|Yes| ReadStored[Read UUID value from pool]

    ReadStored --> CheckOverride{Is OVERRIDE_UUID set<br/>AND different from<br/>stored value?}

    CheckOverride -->|Yes| ValidateOverride{Is OVERRIDE_UUID<br/>a valid UUID?}
    CheckOverride -->|No| ValidateStored{Is stored UUID<br/>valid?}

    ValidateOverride -->|Yes| UpdatePool[Update pool with<br/>OVERRIDE_UUID]
    ValidateOverride -->|No| ValidateStored

    UpdatePool --> ReadBack[Read updated value<br/>back from pool]
    ReadBack --> ValidateStored

    ValidateStored -->|Yes| UseStored[Set UUID to stored value]
    ValidateStored -->|No| CleanupInvalid[Delete invalid UUID<br/>from pool using<br/>zfs inherit]

    UseStored --> End([Return])

    CleanupInvalid --> NoStoredUUID
    NoStoredUUID --> ValidateCurrent{Is current UUID<br/>valid?}

    ValidateCurrent -->|Yes| PersistUUID[Store current UUID<br/>on pool]
    ValidateCurrent -->|No| End

    PersistUUID --> End
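
A rough shell rendering of the flow charted above; this is a sketch, not the actual sysinfo code. $Zpool, OVERRIDE_UUID, and the org.smartos:server_uuid property come from the flowchart; the UUID validation here is a simplified regex.

```shell
# Sketch of the get_or_store_uuid flow from the chart above.
valid_uuid() {
    [[ $1 =~ ^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$ ]]
}

get_or_store_uuid() {
    [[ -z $Zpool ]] && return    # no pool imported yet: nothing to do

    local prop="org.smartos:server_uuid" ds="$Zpool/var" stored source
    source=$(zfs get -H -o source "$prop" "$ds" 2>/dev/null)
    if [[ $source == local ]]; then
        stored=$(zfs get -H -o value "$prop" "$ds")
        if [[ -n $OVERRIDE_UUID && $OVERRIDE_UUID != "$stored" ]] \
            && valid_uuid "$OVERRIDE_UUID"; then
            zfs set "$prop=$OVERRIDE_UUID" "$ds"     # override replaces stored
            stored=$(zfs get -H -o value "$prop" "$ds")
        fi
        if valid_uuid "$stored"; then
            UUID=$stored                             # trust the pool
            return
        fi
        zfs inherit "$prop" "$ds"                    # scrub an invalid value
    fi
    # no (valid) stored UUID: persist the current one if it is sane
    valid_uuid "$UUID" && zfs set "$prop=$UUID" "$ds"
}
```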

Nahum Shalman commented on 2025-11-21T11:34:06.049-0500:

Cleanup of an invalid value stored on the pool has been tested as well:

[root@triton-2520 ~]# zfs get org.smartos:server_uuid zones/var; sysinfo | grep UUID; sysinfo -u; sysinfo | grep UUID; zfs get org.smartos:server_uuid zones/var
NAME       PROPERTY                 VALUE                                 SOURCE
zones/var  org.smartos:server_uuid  db40345e-669c-4f10-803f-13e212ac4116  local
  "UUID": "db40345e-669c-4f10-803f-13e212ac4116",
  "UUID": "db40345e-669c-4f10-803f-13e212ac4116",
NAME       PROPERTY                 VALUE                                 SOURCE
zones/var  org.smartos:server_uuid  db40345e-669c-4f10-803f-13e212ac4116  local
[root@triton-2520 ~]# zfs set org.smartos:server_uuid=bogus zones/var
[root@triton-2520 ~]# zfs get org.smartos:server_uuid zones/var; sysinfo | grep UUID; sysinfo -u; sysinfo | grep UUID; zfs get org.smartos:server_uuid zones/var
NAME       PROPERTY                 VALUE                    SOURCE
zones/var  org.smartos:server_uuid  bogus                    local
  "UUID": "db40345e-669c-4f10-803f-13e212ac4116",
  "UUID": "db40345e-669c-4f10-803f-13e212ac4116",
NAME       PROPERTY                 VALUE                                 SOURCE
zones/var  org.smartos:server_uuid  db40345e-669c-4f10-803f-13e212ac4116  local