| Issue Type | Bug |
|---|---|
| Priority | 4 - Normal |
| Status | Resolved |
| Created at | 2017-11-07T19:31:51.603Z |
| Updated at | 2017-11-08T07:24:22.083Z |
| Created by | Former user |
| Reported by | Former user |
| Assigned to | Former user |
Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-11-08T07:24:22.065Z)
2017-11-09 Edge (Release Date: 2017-11-09)
From @accountid:557058:8e9a30b7-b0e0-4093-927b-6497faf34a80 in the ~triton channel this morning: after upgrading east-3b to the latest dev images, the vmapi service failed to come up, and the following error was found in the logs:
[2017-11-07T15:39:48.662Z]  INFO: vmapi/migrations-controller/60252 on 69eadfd0-9233-49df-b419-dfaa5aa90a4d: Found 1000 records
Uncaught AssertionError: internal_metadata values must be of type string, number or boolean
FROM
    fail (assert.js:89:3)
    Function.ok (assert.js:109:15)
    Object.internalMetadataToSearchArray (/opt/smartdc/vmapi/lib/common/vm-common.js:518:77)
    migrateRecord (/opt/smartdc/vmapi/lib/data-migrations/migrations/vms/001-internal-metadata-search.js:45:16)
    Object.migrateRecords [as func] (/opt/smartdc/vmapi/lib/data-migrations/controller.js:272:43)
    Immediate._onImmediate (/opt/smartdc/vmapi/node_modules/vasync/lib/vasync.js:213:20)
    processImmediate [as _immediateCallback] (timers.js:383:17)
Shortly after @pedro mentioned:
Culprit is this mdata value, though:
'docker:restartdelay': null
The error is pretty clear: currently, data migrations deliberately make the VMAPI process abort when an internal_metadata value is not a string, boolean, or number. I'll try to find the VM object with that value and understand why it had a docker:restartdelay value of null in the first place.
Regardless, we might want to be less radical in how we treat such values because if we ran into that value in east-3b, it is likely we'll run into similar unexpected values in DCs with more VMs.
Maybe warning and moving on would be more robust in this case?
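A sketch of what "warning and moving on" could look like when building the search array: skip the offending key and log it, instead of asserting. The names here (isSearchableValue, toSearchArray, the key=value encoding) are illustrative, not VMAPI's actual API:

```javascript
// Hypothetical lenient variant of internalMetadataToSearchArray: warn and
// skip unexpected values rather than aborting the whole migration.
function isSearchableValue(value) {
    return typeof value === 'string' ||
        typeof value === 'number' ||
        typeof value === 'boolean';
}

function toSearchArray(internalMetadata, warn) {
    var searchArray = [];
    Object.keys(internalMetadata).forEach(function (key) {
        var value = internalMetadata[key];
        if (!isSearchableValue(value)) {
            // Log and move on instead of assert.ok(...): the record is
            // still migrated, minus the un-indexable key.
            warn('skipping key "' + key + '": unexpected value ' +
                JSON.stringify(value));
            return;
        }
        searchArray.push(key + '=' + value);
    });
    return searchArray;
}

// The null docker:restartdelay is skipped with a warning, and the rest of
// the metadata is still indexed.
console.log(toSearchArray(
    { 'docker:restartcount': 22301, 'docker:restartdelay': null },
    function (msg) { console.log('WARN: ' + msg); }));
```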
Using the following shell script:
current_offset=0
json_res=$(sdc-vmapi /vms | json -H)
while [[ "$json_res" != "[]" ]]; do
    echo "$json_res" | json -Ha -c "this.internal_metadata && this.internal_metadata['docker:restartdelay'] === null"
    current_offset=$((current_offset + 1000))
    echo "current offset: ${current_offset}"
    json_res=$(sdc-vmapi "/vms?offset=${current_offset}" | json -H)
done
I was able to find that the VM with a value of null for its docker:restartdelay metadata is 76d0bbe4-ca1e-4117-8ec5-ca894b768722. Looking at that VM's internal_metadata, we find:
[jgilli@5DCS842 (us-east-3b) ~]$ vmadm get 76d0bbe4-ca1e-4117-8ec5-ca894b768722 | json -H internal_metadata
{
  "docker:id": "76d0bbe4ca1e41178ec5ca894b768722d92e24bd172749b0ba26ee821be93258",
  "docker:cmd": "[\"sleep 86401\"]",
  "docker:restartpolicy": "always",
  "docker:noipmgmtd": true,
  "docker:env": "[\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"]",
  "docker:entrypoint": "[]",
  "docker:linkHosts": "",
  "docker:linkEnv": "[]",
  "docker:restartcount": 22301,
  "docker:restartdelay": null
}
So it restarted 22301 times. Trying to determine how this number of restarts could lead to a value of null, and reproducing [the computation that vmadmd performs|https://github.com/joyent/smartos-live/blob/ecb292cc3719695d2b27617ddee9cab1895d3e17/src/vm/sbin/vmadmd.js#L765-L774] to determine the next restart delay, we found that eventually the restart delay will always be serialized to null:
$ node
> 100 * Math.pow(2, 22301)
Infinity
> JSON.stringify(Infinity)
'null'
>
In other words, this occurrence seems to show that aborting when encountering unexpected data in VMs' internal_metadata may not be what we want to do going forward; instead, we should log those entries, migrate the record as best we can, and move on.
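Separately from making the migration more lenient, the overflow itself could be avoided at the source by capping the exponential backoff so it never reaches Infinity. A sketch of that idea (the cap value and function name are illustrative, not from vmadmd):

```javascript
// Hypothetical cap on the restart-delay backoff; vmadmd's actual formula
// is 100 * 2^restartcount with no upper bound, which overflows to
// Infinity and then serializes to null via JSON.stringify.
var MAX_RESTART_DELAY_MS = 60 * 60 * 1000; // illustrative 1-hour cap

function nextRestartDelay(restartCount) {
    return Math.min(100 * Math.pow(2, restartCount), MAX_RESTART_DELAY_MS);
}

console.log(nextRestartDelay(3));     // 800
console.log(nextRestartDelay(22301)); // 3600000: finite, so JSON.stringify
                                      // no longer turns it into null
```

Math.min(Infinity, cap) returns the cap, so even absurd restart counts produce a finite, JSON-serializable delay.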
Testing notes:
In COAL, I manually forced a VM to have a docker:restartdelay internal_metadata value of null, then reset the data_version column of the vmapi_vms table in manatee to NULL. I then restarted the VMAPI service and observed it abort the same way it aborted in east-3b.
Then I applied the changes in the CR linked from this ticket, restarted the VMAPI service, observed the WARN log entry, and verified that the data migrations completed. I then verified in manatee that the internal_metadata_search_array column did not index the invalid internal_metadata value.
sdc-vmapi commit 131eada3251fb7e558e9e84cd89a9b6c63e5f4c5 (branch master, by Julien Gilli)
ZAPI-810 VMAPI data migrations abort on invalid internal_metadata value
Reviewed by: Trent Mick <trent.mick@joyent.com>
Approved by: Trent Mick <trent.mick@joyent.com>