OS-3250: lxbrand initctl stop/start svcs can hang


Issue Type:Bug
Priority:4 - Normal
Created at:2014-07-28T19:15:13.000Z
Updated at:2019-08-28T23:38:09.364Z


Created by:Former user
Reported by:Former user
Assigned to:Former user


Fixed: A fix for this issue is checked into the tree and tested.
(Resolution Date: 2015-01-08T19:30:21.000Z)

Fix Versions

2015-01-22 Galloping Gertie (Release Date: 2015-01-22)




Tests that hang LTP on the lx brand; each was terminated with pkill to get through the test list.


For resolution, these tests need to be unstuck, or there needs to be some indication that they should be skipped. See also the float_* tests in OS-3248.


Comment by Former user
Created at 2014-08-09T00:10:49.000Z
Updated at 2017-12-14T17:23:54.337Z

proc01 resolved by OS-3329

Comment by Former user
Created at 2014-08-09T01:39:46.000Z
Updated at 2017-12-14T17:23:54.416Z

sbrk02 is passing

Comment by Former user
Created at 2014-12-02T19:39:20.000Z
Updated at 2017-12-14T17:23:57.728Z

cron02 passing

Comment by Former user
Created at 2014-12-11T19:35:08.000Z
Updated at 2017-12-14T17:24:05.501Z

The real issue here is that stopping and then starting a service under upstart's control can hang:

# stop atd
# start atd

The stop will say it worked, but atd is still running and the start then hangs.

Comment by Former user
Created at 2014-12-15T14:15:11.000Z
Updated at 2017-12-14T17:24:07.824Z

Here is more detail about what is going on and how to reproduce it. Although starting atd is what hangs, the stop could be the underlying issue, or there could be two separate bugs here.

When we issue 'initctl stop atd' it succeeds and says atd was stopped, but in reality the atd process still exists.

When I enable 'debug' logging in upstart and run 'initctl stop atd', I see this logged on real Ubuntu:

[1432123.124225] init: atd goal changed from start to stop
[1432123.129060] init: atd state changed from running to pre-stop
[1432123.130243] init: atd state changed from pre-stop to stopping
[1432123.131354] init: event_new: Pending stopping event
[1432123.131799] init: Handling stopping event
[1432123.132040] init: event_finished: Finished stopping event
[1432123.132040] init: atd state changed from stopping to killed
[1432123.133408] init: Sending TERM signal to atd main process (29776)
[1432123.135824] init: atd main process (29776) exited normally
[1432123.140370] init: atd state changed from killed to post-stop
[1432123.141511] init: atd state changed from post-stop to waiting
[1432123.142567] init: event_new: Pending stopped event
[1432123.142885] init: job_change_state: Destroyed inactive instance atd

Note that init sends SIGTERM to the atd process. On lx, we see mostly the same logging, but init never indicates that it sends a SIGTERM to atd. Looking at the upstart code, job_change_state() does this for the JOB_KILLED case:

    if (job->class->process[PROCESS_MAIN] && (job->pid[PROCESS_MAIN] > 0)) {
        job_process_kill (job, PROCESS_MAIN);
    } else {
        state = job_next_state (job);
    }

job_process_kill() is what sends the SIGTERM, so upstart's internal state indicates that it doesn't have to issue the SIGTERM: given the condition above, either no main process is defined for the job or the recorded main pid is not greater than 0. I am not sure why the state is wrong here, but my suspicion is that something in the way we handle the ptrace watching gets messed up when the atd process forks itself to daemonize. This may be the root cause of why the subsequent 'initctl start atd' hangs, or that hang could be a second bug. After stopping the atd svc and manually killing the existing atd process, the start still hangs, so it doesn't seem to be the existence of the process that causes the start to hang.
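The JOB_KILLED check quoted above can be modeled in isolation to show why a missing pid means no SIGTERM is sent. This is only a minimal sketch, not upstart code: struct model_job, killed_action() and the killed_action_t values are hypothetical names invented for illustration, and the two fields merely stand in for job->class->process[PROCESS_MAIN] and job->pid[PROCESS_MAIN].

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical, simplified model of the JOB_KILLED decision.
 * These types and names are illustrative only; they are not upstart's.
 */
typedef enum {
    KILLED_SEND_TERM,   /* upstart would call job_process_kill() */
    KILLED_NEXT_STATE   /* upstart would skip the signal and call job_next_state() */
} killed_action_t;

struct model_job {
    int   has_main_process; /* stands in for job->class->process[PROCESS_MAIN] */
    pid_t main_pid;         /* stands in for job->pid[PROCESS_MAIN] */
};

/* Return what the JOB_KILLED case would do for the given tracked state. */
killed_action_t
killed_action(const struct model_job *job)
{
    assert(job != NULL);

    /* Same shape as the quoted condition: a main process is defined and a pid is recorded. */
    if (job->has_main_process && job->main_pid > 0)
        return KILLED_SEND_TERM;

    /* No recorded pid: nothing is signalled, the job just advances. */
    return KILLED_NEXT_STATE;
}
```

In this model, a job with has_main_process set but main_pid equal to 0 takes the KILLED_NEXT_STATE path, which is consistent with the lx log showing no "Sending TERM signal" line while the real atd process keeps running.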

I do have updated kernel code which makes sure that the zone's init process inherits children, but that does not fix the issue described here. I'll probably go ahead and push that code under a separate ticket.

Comment by Former user
Created at 2014-12-15T20:06:20.000Z

Per Jerry, upstart can be placed in debug mode with this command:

initctl log-priority debug

... which should send the output to /dev/kmsg, available via zlogin -C

Comment by Former user
Created at 2015-01-08T19:05:06.000Z

illumos-joyent commit 8ca0fc0 (branch master, by Jerry Jelinek)

OS-3675 lxbrand ptrace setoptions handling is broken
OS-3352 ptrace needs to work for linux, wait on non-children etc.
OS-3306 PTRACE_O_TRACEFORK should stop parent and child as they exit fork
OS-3250 lxbrand initctl stop/start svcs can hang
OS-2916 lxbrand add ptrace PTRACE_GETEVENTMSG support