ZAPI-810: VMAPI data migrations abort on invalid internal_metadata value

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2017-11-07T19:31:51.603Z
Updated at:2017-11-08T07:24:22.083Z

People

Created by:Former user
Reported by:Former user
Assigned to:Former user

Resolution

Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2017-11-08T07:24:22.065Z)

Fix Versions

2017-11-09 Edge (Release Date: 2017-11-09)

Description

From @accountid:557058:8e9a30b7-b0e0-4093-927b-6497faf34a80 in ~triton channel this morning, after upgrading east-3b to latest dev images, the vmapi service failed to come up and the following error was found in the logs:

[2017-11-07T15:39:48.662Z]  INFO: vmapi/migrations-controller/60252 on 69eadfd0-9233-49df-b419-dfaa5aa90a4d: Found 1000 records
Uncaught AssertionError: internal_metadata values must be of type string, number or boolean

FROM
fail (assert.js:89:3)
Function.ok (assert.js:109:15)
Object.internalMetadataToSearchArray (/opt/smartdc/vmapi/lib/common/vm-common.js:518:77)
migrateRecord (/opt/smartdc/vmapi/lib/data-migrations/migrations/vms/001-internal-metadata-search.js:45:16)
Object.migrateRecords [as func] (/opt/smartdc/vmapi/lib/data-migrations/controller.js:272:43)
Immediate._onImmediate (/opt/smartdc/vmapi/node_modules/vasync/lib/vasync.js:213:20)
processImmediate [as _immediateCallback] (timers.js:383:17)

Shortly after @pedro mentioned:

Culprit is this mdata value, though: 'docker:restartdelay': null

The error is pretty clear: currently data migrations make the VMAPI process abort when an internal_metadata value is not a string, boolean or number on purpose.

I'll try to find the VM object with that value and understand why it had a docker:restartdelay value of null in the first place.

Regardless, we might want to be less radical in how we treat such values because if we ran into that value in east-3b, it is likely we'll run into similar unexpected values in DCs with more VMs.

Maybe warning and moving on would be more robust in this case?

Comments

Comment by Former user
Created at 2017-11-08T01:12:39.709Z
Updated at 2017-11-08T01:14:31.994Z

Using the following shell script:

current_offset=0;
json_res=$(sdc-vmapi /vms | json -H);
while [[ "$json_res" != "[]" ]]; do
  echo "$json_res" | json -Ha -c "this.internal_metadata && this.internal_metadata['docker:restartdelay']===null";
  current_offset=$((current_offset + 1000));
  echo "current offset: ${current_offset}"
  json_res=$(sdc-vmapi "/vms?offset=${current_offset}" | json -H)
done

I was able to find that the VM with a value of null for its docker:restartdelay metadata is 76d0bbe4-ca1e-4117-8ec5-ca894b768722.

Looking at that VM's internal_metadata, we find:

[jgilli@5DCS842 (us-east-3b) ~]$ vmadm get 76d0bbe4-ca1e-4117-8ec5-ca894b768722 | json -H internal_metadata
{
  "docker:id": "76d0bbe4ca1e41178ec5ca894b768722d92e24bd172749b0ba26ee821be93258",
  "docker:cmd": "[\"sleep 86401\"]",
  "docker:restartpolicy": "always",
  "docker:noipmgmtd": true,
  "docker:env": "[\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"]",
  "docker:entrypoint": "[]",
  "docker:linkHosts": "",
  "docker:linkEnv": "[]",
  "docker:restartcount": 22301,
  "docker:restartdelay": null
}
[jgilli@5DCS842 (us-east-3b) ~]$

So it restarted 22301 times. Now when trying to determine how this number of restarts would lead to a value of null, and reproducing [the computation that vmadmd performs|https://github.com/joyent/smartos-live/blob/ecb292cc3719695d2b27617ddee9cab1895d3e17/src/vm/sbin/vmadmd.js#L765-L774] to determine the next restart delay, we found that eventually, the restart delay will always be serialized to null:

$ node
> 100 * Math.pow(2, 22301)
Infinity
> JSON.stringify(Infinity)
'null'
>

In other words, this occurrence seems to show that aborting when encountering unexpected data in VMs' internal_metadata may not be what we want to do going forward, and that instead we should log those entries, migrate the record the best we can, and move on.


Comment by Former user
Created at 2017-11-08T02:39:33.889Z
Updated at 2017-11-08T02:39:40.811Z

Testing notes:

In COAL, I manually forced a VM to have a docker:restartdelay internal_metadata value of null, then reset the data_version column of the vmapi_vms table in manatee to NULL. I then restarted the VMAPI service and observed it abort the same way it aborted in east-3b.

Then I applied the changes in the CR linked from this ticket, restarted the VMAPI service, observed the WARN log entry and verified that the data migrations completed. I then verified in manatee that the internal_metadata_search_array column did not index the invalid internal_metadata value.


Comment by Former user
Created at 2017-11-08T06:28:03.926Z

Wow. :)


Comment by Jira Bot
Created at 2017-11-08T07:23:10.566Z

sdc-vmapi commit 131eada3251fb7e558e9e84cd89a9b6c63e5f4c5 (branch master, by Julien Gilli)

ZAPI-810 VMAPI data migrations abort on invalid internal_metadata value
Reviewed by: Trent Mick <trent.mick@joyent.com>
Approved by: Trent Mick <trent.mick@joyent.com>