OS-3042: Unable to fix bad coreadm paths

Details

Issue Type:Bug
Priority:4 - Normal
Status:Open
Created at:2014-05-23T12:58:23.000Z
Updated at:2021-03-26T02:07:00.169Z

People

Created by:Former user
Reported by:Jonathan Perkin

Related Issues

Description

Whilst trying to alter coreadm in build zones to retain cores locally, I ended up in a catch-22 situation. The global path I had set as a test was incorrect, meaning coreadm failed to start with:

[root@pkgsrc-pbulk-sdc7-2 ~]# tail -3 $(svcs -L coreadm)
[ May 23 12:38:53 Executing start method ("/usr/bin/coreadm -U"). ]
core_set_global_path(): Invalid argument
[ May 23 12:38:53 Method "start" exited with status 95. ]

But I couldn't reset it back to the default, because coreadm wasn't running:

[root@pkgsrc-pbulk-sdc7-2 ~]# coreadm -g /%Z/cores/core.%f.%p
coreadm: coreadm service not online

I'm currently figuring out how to fix this directly in the SMF properties, but Robert suggested raising this as a bug anyway so we can avoid leaving users in a hurty place.

Comments

Comment by Former user
Created at 2015-01-14T07:21:53.000Z

I ran into this issue recently as well -- boy, does it ever suck!


Comment by Former user
Created at 2017-05-06T04:13:36.000Z

I just ran into this the other night.

What's happening is that modifying the non-per-process parameters (the first form in coreadm(1m)) makes coreadm modify the smf properties and then perform an smf refresh, but only if the service is online. The refresh causes svc.startd to run coreadm -U which makes the changes active. While rather indirect, it allows non-root users with the proper authorization to be able to modify the parameters without ever directly invoking any binaries with any elevated privileges. It also means that any errors (such as an invalid path) aren't easily relayed back to the user. To make matters worse, as seen here, if the coreadm -U invocation fails, the service is placed into maintenance mode, preventing any further changes via coreadm(1m) and forcing the poor user to resort to manipulating the smf(5) properties directly to fix.

A few modifications in behavior seem reasonable here:

  1. Error when g is invoked with an argument that does not have a leading / - this matches the checks in the kernel. While redundant, due to the disconnect between requesting the change and the actual attempt to change it due to smf(5), this would provide a better experience.
  2. Excepting the above and any existing error checks, any invocations of non-per-process parameters form should allow changes when the service is both online and in maintenance mode. If in maintenance mode, it should attempt to bring the service out of maintenance mode after committing the smf(5) changes, and if unsuccessful, update the smf(5) configuration with the current in-kernel settings. This will hopefully minimize the opportunity for the smf(5) configuration to become invalid, as well as minimize the chance it diverges from the kernel configuration.

Comment by Former user
Created at 2017-05-09T16:18:56.000Z

Tested for expected failure with invalid global pattern. Then set invalid global pattern in SMF, refreshed (placing service into maintenance mode), updated path with coreadm (which succeeded and enabled service). Then commented out path check to set invalid path via coreadm, verified it failed, detected maintenance mode, and reset SMF configuration from current kernel configuration.


Comment by Jonathan Perkin
Created at 2017-05-15T12:11:44.000Z
Updated at 2017-12-14T17:28:34.561Z

illumos#8225


Comment by Former user
Created at 2017-09-15T20:35:22.000Z

Ideally, I had originally wanted to be able to have coreadm try to commit the new settings, and on failure, re-sync the SMF settings with the kernel settings and take the service out of maintenance mode -- the idea being that any invocation that caused the service to transition to maintenance mode would be rolled back and the service left online (keep in mind that coreadm changes the SMF config then tries to force the coreadm service to refresh to invoke the correct sys calls to update the kerne state, so the process the user invokes to make the changes has no relationship to the process that issues the sys call). To accomplish this with any sort of reliability has proved incredibly problematic due to the lack of synchronous SMF service state transition functions -- a busy system could misinterpret a correct setting as a failure and attempt to rollback, or could fail to successfully rollback because it cannot know when SMF has completed all of it's state transitions to attempt to clear the service.

Like anything dealing with SMF, the work necessary to implement such synchronous SMF state manipulation functions is non-trivial. However, as it turns out, almost all of the time, the problem is that the g option requires a full path and the corectl syscall will fail when that is not the case. Merely adding this check to coreadm(1M) should prevent any invocations of coreadm(1M) that change settings to cause the coreadm service to go into maintenance mode. If someone directly manipulates the coreadm SMF configuration, they can still set invalid values and cause the coreadm service to fail. I suspect that it's reasonable to expect someone doing such things to know how to restore values (since it's the same as changing them), and be less concerned about this possibility - at least until the necessary SMF interfaces are available.


Comment by Former user
Created at 2017-09-22T16:44:55.000Z
Updated at 2017-09-22T16:45:47.000Z

Testing:

# ./coreadm -g foo
coreadm: The -g option must specify an absolute path
usage:
    coreadm [ -g pattern ] [ -i pattern ] [ -G content ] [ -I content ]
            [ -e {global | process | global-setid | proc-setid | log} ]
            [ -d {global | process | global-setid | proc-setid | log} ]
    coreadm [ -p pattern ] [ -P content ] [ pid ... ]
# ./coreadm -g /foo
[root@c1748f12-9915-67d4-a625-a1a41dd84898 /home/build/smartos-live/projects/illumos/usr/src/cmd/coreadm]# ./coreadm
     global core file pattern: /foo
     global core file content: default
       init core file pattern: core.%f.%p
       init core file content: default
            global core dumps: enabled
       per-process core dumps: enabled
      global setid core dumps: disabled
 per-process setid core dumps: disabled
     global core dump logging: disabled

As I stated above, with this fix, someone can still go in behind coreadm's back and directly change the stuff in SMF and cause the service to fail. However it seems reasonable that people doing such things should expect that they are operating without a safety net (somewhat analogous to directly editing configuration files that are typically managed by a command), and with just this fix, it's addresses what is likely almost all instances where things get messed up.