TRITON-766: cainstsvc ctf module relies on CTF type ordering

Details

Issue Type:Bug
Priority:4 - Normal
Status:Resolved
Created at:2018-09-11T18:11:52.203Z
Updated at:2019-03-18T22:51:23.303Z

People

Created by:Former user
Reported by:Former user
Assigned to:Todd Whiteman

Resolution

Won't Fix: The problem described is an issue which will never be fixed.
(Resolution Date: 2019-03-18T22:51:23.288Z)

Related Issues

Description

We found the cainstsvc in maintenance in nightly:

[2018-09-11 06:43:05.964 UTC] CRIT   CA PANIC: panic due to uncaught exception: EXCEPTION: Error: Error: type not found or typdefed: lwpsinfo_t
    at ctCheckReq (/opt/smartdc/agents/lib/node_modules/cabase/node_modules/ctype/ctype.js:357:11)
    at CTypeParser.typedef (/opt/smartdc/agents/lib/node_modules/cabase/node_modules/ctype/ctype.js:489:3)
    at ctfParseStruct (/opt/smartdc/agents/lib/node_modules/cabase/node_modules/ctype/ctf.js:167:8)
    at ctfParseEntry (/opt/smartdc/agents/lib/node_modules/cabase/node_modules/ctype/ctf.js:206:3)
    at Object.ctfParseJson (/opt/smartdc/agents/lib/node_modules/cabase/node_modules/ctype/ctf.js:242:3)
    at Object.parseCTF (/opt/smartdc/agents/lib/node_modules/cabase/node_modules/ctype/ctype.js:906:10)
    at ChildProcess.<anonymous> (/opt/smartdc/agents/lib/node_modules/cabase/lib/ca/ca-proc.js:97:21)
    at ChildProcess.emit (events.js:70:17)
    at maybeExit (child_process.js:360:16)
    at Process.onexit (child_process.js:396:5)
[2018-09-11 06:43:05.964 UTC] CRIT   writing core dump to /opt/smartdc/agents/lib/node_modules/cabase/cacore.7630
[2018-09-11 06:43:05.983 UTC] CRIT   finished writing core dump
[ Sep 11 06:43:05 Stopping because all processes in service exited. ]
[ Sep 11 06:43:05 Executing stop method (:kill). ]
[ Sep 11 06:43:06 Executing start method ("/opt/smartdc/agents/lib/node_modules/cabase/smf/method/canodesvc"). ]

Here we're trying to load up the proc backend for the CA instrumenter. However, it's failed in trying to parse the CTF for the psinfo_t which has an lwpsinfo_t as a member of it. This is a bit surprising as there's no reason that this should normally happen. Let's look at the generated JSON and make sure that we have the corresponding data that we'd expect:

[root@headnode (nightly-1) /var/tmp/rm]# /opt/smartdc/agents/lib/node_modules/cabase/cmd/ctf2json -f /usr/lib/libc.so.1 -t psinfo_t
{ "metadata":
        {
                "ctf2json_version": "1.0",
                "created_at": 1536689217,
                "derived_from": "/usr/lib/libc.so.1",
                "ctf_version": 2,
                "requested_types": [ "psinfo_t" ]
        },
"data":
        [
                { "name": "unsigned int", "integer": { "length": 4, "signed": false } },
                { "name": "size_t", "typedef": "unsigned int" },
                { "name": "uid_t", "typedef": "unsigned int" },
                { "name": "int", "integer": { "length": 4, "signed": true } },
                { "name": "unsigned short", "integer": { "length": 2, "signed": false } },
                { "name": "uintptr_t", "typedef": "unsigned int" },
                { "name": "ushort_t", "typedef": "unsigned short" },
                { "name": "char", "integer": { "length": 1, "signed": true } },
                { "name": "long", "integer": { "length": 4, "signed": true } },
                { "name": "id_t", "typedef": "long" },
                { "name": "short", "integer": { "length": 2, "signed": true } },
                { "name": "zoneid_t", "typedef": "long" },
                { "name": "pid_t", "typedef": "long" },
                { "name": "time_t", "typedef": "long" },
                { "name": "struct timespec", "struct": [
                        { "name": "tv_sec", "type": "time_t" },
                        { "name": "tv_nsec", "type": "long" }
                ] },
                { "name": "timestruc_t", "typedef": "struct timespec" },
                { "name": "processorid_t", "typedef": "int" },
                { "name": "unsigned long", "integer": { "length": 4, "signed": false } },
                { "name": "gid_t", "typedef": "unsigned int" },
                { "name": "dev_t", "typedef": "unsigned long" },
                { "name": "taskid_t", "typedef": "long" },
                { "name": "projid_t", "typedef": "long" },
                { "name": "psetid_t", "typedef": "int" },
                { "name": "poolid_t", "typedef": "long" },
                { "name": "struct psinfo", "struct": [
                        { "name": "pr_flag", "type": "int" },
                        { "name": "pr_nlwp", "type": "int" },
                        { "name": "pr_pid", "type": "pid_t" },
                        { "name": "pr_ppid", "type": "pid_t" },
                        { "name": "pr_pgid", "type": "pid_t" },
                        { "name": "pr_sid", "type": "pid_t" },
                        { "name": "pr_uid", "type": "uid_t" },
                        { "name": "pr_euid", "type": "uid_t" },
                        { "name": "pr_gid", "type": "gid_t" },
                        { "name": "pr_egid", "type": "gid_t" },
                        { "name": "pr_addr", "type": "uintptr_t" },
                        { "name": "pr_size", "type": "size_t" },
                        { "name": "pr_rssize", "type": "size_t" },
                        { "name": "pr_pad1", "type": "size_t" },
                        { "name": "pr_ttydev", "type": "dev_t" },
                        { "name": "pr_pctcpu", "type": "ushort_t" },
                        { "name": "pr_pctmem", "type": "ushort_t" },
                        { "name": "pr_start", "type": "timestruc_t" },
                        { "name": "pr_time", "type": "timestruc_t" },
                        { "name": "pr_ctime", "type": "timestruc_t" },
                        { "name": "pr_fname", "type": "char [16]" },
                        { "name": "pr_psargs", "type": "char [80]" },
                        { "name": "pr_wstat", "type": "int" },
                        { "name": "pr_argc", "type": "int" },
                        { "name": "pr_argv", "type": "uintptr_t" },
                        { "name": "pr_envp", "type": "uintptr_t" },
                        { "name": "pr_dmodel", "type": "char" },
                        { "name": "pr_pad2", "type": "char [3]" },
                        { "name": "pr_taskid", "type": "taskid_t" },
                        { "name": "pr_projid", "type": "projid_t" },
                        { "name": "pr_nzomb", "type": "int" },
                        { "name": "pr_poolid", "type": "poolid_t" },
                        { "name": "pr_zoneid", "type": "zoneid_t" },
                        { "name": "pr_contract", "type": "id_t" },
                        { "name": "pr_filler", "type": "int [1]" },
                        { "name": "pr_lwp", "type": "lwpsinfo_t" }
                ] },
                { "name": "struct lwpsinfo", "struct": [
                        { "name": "pr_flag", "type": "int" },
                        { "name": "pr_lwpid", "type": "id_t" },
                        { "name": "pr_addr", "type": "uintptr_t" },
                        { "name": "pr_wchan", "type": "uintptr_t" },
                        { "name": "pr_stype", "type": "char" },
                        { "name": "pr_state", "type": "char" },
                        { "name": "pr_sname", "type": "char" },
                        { "name": "pr_nice", "type": "char" },
                        { "name": "pr_syscall", "type": "short" },
                        { "name": "pr_oldpri", "type": "char" },
                        { "name": "pr_cpu", "type": "char" },
                        { "name": "pr_pri", "type": "int" },
                        { "name": "pr_pctcpu", "type": "ushort_t" },
                        { "name": "pr_pad", "type": "ushort_t" },
                        { "name": "pr_start", "type": "timestruc_t" },
                        { "name": "pr_time", "type": "timestruc_t" },
                        { "name": "pr_clname", "type": "char [8]" },
                        { "name": "pr_name", "type": "char [16]" },
                        { "name": "pr_onpro", "type": "processorid_t" },
                        { "name": "pr_bindpro", "type": "processorid_t" },
                        { "name": "pr_bindpset", "type": "psetid_t" },
                        { "name": "pr_lgrp", "type": "int" },
                        { "name": "pr_filler", "type": "int [4]" }
                ] },
                { "name": "lwpsinfo_t", "typedef": "struct lwpsinfo" },
                { "name": "psinfo_t", "typedef": "struct psinfo" }
        ]
}

OK. So we do have the struct lwpsinfo. However, note that it occurs after the struct psinfo. There's nothing in CTF that says the types have to come in a specific order. It's totally fine to have a struct appear with a type id that's less than a type it depends on. This is only practical due to the way that forwards and recursively defined structures work.

However, due to the way that the node-ctype parser works, it exepcts this to have been defined already. The node-ctype parser is unfortunately, naive, and rather than recursively walking things to try and find the type, it expects to have found it already. What needs to happen here is that the node-ctype parser needs to be modified to search for the missing types and go from there.

It's also likely the case that the instrumenter should be more resilient to this type of failure and drive on without enabling the back end.

Comments

Comment by Todd Whiteman
Created at 2018-09-11T19:10:13.200Z

Also hit this in a new COAL build today - 2018-09-11, so it looks to be affected by a recent platform change:

$ tail /var/svc/log/smartdc-agent-ca-cainstsvc:default.log
[2018-09-11 18:41:43.898 UTC] INFO   loaded module "cainstr".
[ Sep 11 18:41:43 Method "start" exited with status 0. ]
[2018-09-11 18:41:43.938 UTC] CRIT   CA PANIC: panic due to uncaught exception: EXCEPTION: Error: Error: type not found or typdefed: lwpsinfo_t

Comment by Former user
Created at 2018-09-11T21:13:08.514Z

Options:


Comment by Former user
Created at 2018-09-11T22:44:37.628Z

For the record, untested patch to sdcadm to get it to exclude cainstsvc and cabase from 'sdcadm insts' and 'sdcadm health':

diff --git a/lib/sdcadm.js b/lib/sdcadm.js
index 156d743..4063808 100644
--- a/lib/sdcadm.js
+++ b/lib/sdcadm.js
@@ -669,6 +669,9 @@ SdcAdm.prototype.listInsts = function listInsts(opts, cb) {
                 var agents = server.agents ||
                              server.sysinfo['SDC Agents'] || [];
                 agents.forEach(function (agent) {
+                    if (/^ca/.test(agent.name)) {
+                        return;
+                    }
                     if (!isWantedSvc || isWantedSvc[agent.name]) {
                         var inst = {
                             type: 'agent',

Comment by Former user
Created at 2018-09-12T18:06:49.863Z

nightly-1 did NOT hit TRITON-766 today, so I'm not sure if that's just not always reproducible or what happened. This is nightly-1 which is using the vminfod platform, which lags a little bit.

There were some smartos-live.git#vminfod changes recently:

commit 06db035b5a6066ae320d6afd6b9cad3a7a121d30 (HEAD -> vminfod, origin/vminfod)
Author: Dave Eddy <dave@daveeddy.com>
Date:   2018-09-11T19:08:41-04:00 (19 hours ago)

    remove circular objects from vminfod watchForChanges

 src/vm/node_modules/vminfod/client.js | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

commit 7384a8772160903b99693f70c9c62321f2d6ab2f
Author: Dave Eddy <dave@daveeddy.com>
Date:   2018-09-11T16:45:14-04:00 (21 hours ago)

    better error messages for vminfod client

 src/vm/node_modules/vminfod/client.js | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

commit 59f7e23f4e00fc6d04df193e3088ace04eb54360
Author: Dave Eddy <dave@daveeddy.com>
Date:   2018-09-11T16:45:10-04:00 (21 hours ago)

    clean up log output

 src/vm/node_modules/vminfod/vminfod.js | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

commit 4ea264619ec81db64487d28a9dc6ce2d34c67260
Author: Dave Eddy <dave@daveeddy.com>
Date:   2018-09-11T12:45:10-04:00 (25 hours ago)

    Merge branch 'master' into vminfod

rm
trentm: I suspect that with luck OS-7200 which backed out somehow induced the change that triggered this.


Comment by Todd Whiteman
Created at 2018-09-12T18:12:46.097Z

My coal was using platform-vminfod-20180910T233935Z.tgz at the time of the issue (which included the OS-7200 change - before it was backed out).


Comment by Jira Bot
Created at 2018-10-18T20:07:49.019Z

globe-theatre commit fa7acec2af29d9b0ffb21bcc7dfc0361716159da (branch master, by Trent Mick)

TRITON-884 remove cloud analytics (ca) from Triton

Early workaround cainstsvc being faily (TRITON-766). We remove
cainstsvc everywhere so it doesn't cause spurious healthcheck failures.


Comment by Todd Whiteman
Created at 2018-10-18T22:15:10.700Z

Pushed up a change, which uses CTF deferred type checking:
https://github.com/rmustacc/node-ctype/pull/65

This solves the issue for me (in nightly-1).


Comment by Former user
Created at 2019-02-12T18:48:50.947Z
Updated at 2019-02-13T19:39:37.666Z

Found this issue on:
36 servers in southeast-1a
152 servers in southeast-1c
10 servers in central-1a
16 servers in central-1b
11 servers in central-1c


Comment by Todd Whiteman
Created at 2019-02-12T19:12:21.527Z

@accountid:62561aa44f1d57006a24d40c should I push up this change for CA (https://cr.joyent.us/#/c/5094/)?

Note: I had thought TRITON-884 was the way forward (i.e. removing ca altogether)?


Comment by Former user
Created at 2019-02-12T20:15:17.499Z

Todd: I'm still labouring at removing CA, yah. Too slowly I know.

I don't htink there is a point to the cainstsvc fix because the ca agents have already been removed from the agentsshar.


Comment by Former user
Created at 2019-03-18T22:51:23.301Z

CA has been removed from Triton (TRITON-884).