OS-3250

lxbrand initctl stop/start svcs can hang

Status:
Resolved
Resolution:
Fixed
Created:
2014-07-28T15:15:13.000-0400
Updated:
2019-08-28T19:38:09.364-0400

Description

The following LTP tests hang on the lx brand. Each had to be terminated with pkill to get through the test list.

proc01
sbrk02
cron02
at_deny01
at_allow01

To resolve this, we need to either get these tests unstuck or have some indication that they should be skipped. See also the float_* tests in OS-3248.

Comments (7)

Former user commented on 2014-08-08T20:10:49.000-0400 (edited 2017-12-14T12:23:54.337-0500):

proc01 resolved by OS-3329

Former user commented on 2014-08-08T21:39:46.000-0400 (edited 2017-12-14T12:23:54.416-0500):

sbrk02 is passing

Former user commented on 2014-12-02T14:39:20.000-0500 (edited 2017-12-14T12:23:57.728-0500):

cron02 passing

Former user commented on 2014-12-11T14:35:08.000-0500 (edited 2017-12-14T12:24:05.501-0500):

The real issue here is that stopping/starting a svc under upstart's control can hang:

# stop atd
# start atd

The stop will say it worked, but atd is still running and the start then hangs.

Former user commented on 2014-12-15T09:15:11.000-0500 (edited 2017-12-14T12:24:07.824-0500):

Here is more detail about what is going on and how to reproduce. Although starting atd is what is hanging, it could be that the stopping is the underlying issue. Or, it is possible that there are two separate bugs here.

When we issue 'initctl stop atd' it succeeds and says atd was stopped, but in reality the atd process still exists.
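
One way to confirm that the atd process really is still around after the stop is to probe its pid with signal 0. The following is only a minimal sketch: sketch_process_exists() is a hypothetical helper name, not something from upstart or the lx brand, and it relies solely on the standard POSIX kill() call.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of upstart or the lx brand): report whether
 * a process with the given pid currently exists. Signal 0 performs only the
 * existence and permission checks; no signal is actually delivered.
 */
static bool
sketch_process_exists(pid_t pid)
{
    if (pid <= 0)
        return false;       /* 0 and negative values address process groups */
    if (kill(pid, 0) == 0)
        return true;        /* process exists and we may signal it */
    return errno == EPERM;  /* process exists, but we lack permission */
}
```

Note that a zombie which has not been reaped yet also counts as existing for this check, so a positive result alone does not show that atd is still doing work.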

When I enable 'debug' logging in upstart and run 'initctl stop atd', this is what gets logged on real Ubuntu:

[1432123.124225] init: atd goal changed from start to stop
[1432123.129060] init: atd state changed from running to pre-stop
[1432123.130243] init: atd state changed from pre-stop to stopping
[1432123.131354] init: event_new: Pending stopping event
[1432123.131799] init: Handling stopping event
[1432123.132040] init: event_finished: Finished stopping event
[1432123.132040] init: atd state changed from stopping to killed
[1432123.133408] init: Sending TERM signal to atd main process (29776)
[1432123.135824] init: atd main process (29776) exited normally
[1432123.140370] init: atd state changed from killed to post-stop
[1432123.141511] init: atd state changed from post-stop to waiting
[1432123.142567] init: event_new: Pending stopped event
[1432123.142885] init: job_change_state: Destroyed inactive instance atd

Note that init sends SIGTERM to the atd process. On lx we see mostly the same logging, but init never indicates that it sends a SIGTERM to atd. Looking at the upstart code, in job_change_state(), for the JOB_KILLED case it does this:

if (job->class->process[PROCESS_MAIN] && (job->pid[PROCESS_MAIN] > 0)) {
    job_process_kill (job, PROCESS_MAIN);
} else {
    state = job_next_state (job);
}

job_process_kill() is what sends the SIGTERM. So upstart's internal state indicates that it doesn't have to issue the SIGTERM. I am not sure why the state is wrong here, but my suspicion is that something in the way we handle the ptrace watching gets messed up when the atd process forks itself to daemonize. This may be the underlying root cause of why the subsequent 'initctl start atd' hangs, or the hang could be a second bug. Even after stopping the atd svc and manually killing the existing atd process, the start still hangs, so it doesn't seem to be the existence of the process that causes the start to hang.
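
To make the JOB_KILLED branch concrete, here is a minimal, self-contained sketch of the same condition. The struct and function names (sketch_job, sketch_would_send_term) are hypothetical stand-ins, not upstart's real types. The sketch only models the test quoted above, under the assumption that upstart has lost track of the main pid (recorded as 0) after atd daemonized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified stand-ins for upstart's job bookkeeping. */
enum { PROCESS_MAIN = 0, PROCESS_LAST = 1 };

struct sketch_job {
    bool  has_process[PROCESS_LAST]; /* job class defines this process */
    pid_t pid[PROCESS_LAST];         /* pid upstart believes it is tracking */
};

/*
 * Mirrors the condition from the JOB_KILLED case: TERM is sent only when a
 * main process is defined AND a positive pid is recorded for it. Otherwise
 * the job simply advances to its next state and no signal is sent.
 */
static bool
sketch_would_send_term(const struct sketch_job *job)
{
    assert(job != NULL);
    return job->has_process[PROCESS_MAIN] && job->pid[PROCESS_MAIN] > 0;
}
```

With a recorded pid such as 29776 (the value from the Ubuntu log) the function returns true. With a recorded pid of 0 it returns false, which would match the lx behavior where no "Sending TERM signal" line appears even though atd is still running.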

I do have updated kernel code which makes sure that the zone's init process inherits children, but that does not fix the issue described here. I'll probably go ahead and push that code under a separate ticket.

Former user commented on 2014-12-15T15:06:20.000-0500:

Per Jerry, upstart can be placed in debug mode with this command:

initctl log-priority debug

... which should send the output to /dev/kmsg, available via zlogin -C

Former user commented on 2015-01-08T14:05:06.000-0500:

illumos-joyent commit 8ca0fc0 (branch master, by Jerry Jelinek)

OS-3675 lxbrand ptrace setoptions handling is broken
OS-3352 ptrace needs to work for linux, wait on non-children etc.
OS-3306 PTRACE_O_TRACEFORK should stop parent and child as they exit fork
OS-3250 lxbrand initctl stop/start svcs can hang
OS-2916 lxbrand add ptrace PTRACE_GETEVENTMSG support