Clarive 7.0.13 introduces a new feature that allows remote jobs to be killed when a pipeline job is cancelled.

Normally, pipeline job cancelation will only end processes local to the Clarive server and keep remote processes running. This was working as designed, as we did not intend to nuke remote processes inadvertently.

This is an interesting subject that we think is useful both within and outside the scope of Clarive, particularly if you’re wondering how to interrupt job pipelines while they’re running, or how to kill scripts that run remote processes.


Why remote processes

Pipeline job remote execution starts remote processes using one of our three communication agents/transports: SSH, ClaX (lightweight push agent) and ClaW (lightweight pull-worker). This article focuses on the SSH transport, as it’s more generic, but it also applies to ClaX and ClaW.

When a pipeline kicks off a remote job, Clarive connects to a remote server and starts the command requested. The connection between the Clarive server and the remote machine blocks (unless in parallel mode) and remains blocked for the duration of the remote command.

Here’s a rulebook pipeline example:

do:
   shell:
     host: [email protected]
     cmd: sleep 30

The above example will block for 30 seconds, waiting for the remote sleep command to finish.

During the execution of the command, if we go to the remote machine and do a ps -ef, this is what we’d find:

 user  12042 12012  0 07:47 ?        00:00:00 sshd: [email protected]
 user  12043 12042  0 07:47 ?        00:00:00 sleep 30

Most remote execution engines do not track and kill remote processes. The problem of killing remote processes, and of giving the user feedback about it (in the process or in the UI), is present in DevOps tools from Ansible to GitLab and many others.

 https://gitlab.com/gitlab-org/gitlab-ce/issues/18909

Currently killing a job will not stop remote processes

Killing the remote parent process

Before this release, canceling a job would end the local process and leave the remote command running.

You can reproduce the same behavior from the Clarive server with the ssh client command:

 [email protected] $ ssh [email protected] sleep 30
 Killed: 9

Now suppose we kill that process on the Clarive server, with Clarive’s job cancel command, a simple Ctrl-C or even a kill -9 [pid] on the SSH client:

 [email protected] $ ssh [email protected] sleep 30
 Killed: 9

That typically does not stop the remote command: the child processes remain alive and are adopted by the init process (process ID 1). This would be the result on the remote server after the local process is killed or the Clarive job canceled:

 user  12043     1  0 07:47 ?        00:00:00 sleep 30

The sshd server process that was overseeing the execution of the remote command terminates. That’s because the socket connection has been interrupted. But the remote command is still running.
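The orphaning behavior can be reproduced on a single machine, without SSH. The following is a minimal sketch under stated assumptions: a plain `sh` process stands in for the remote sshd, `sleep 30` stands in for the remote command, and the variable names are ours, not Clarive's.

```shell
# Local stand-in for the remote side: the parent shell plays the role of
# sshd, and `sleep 30` plays the role of the remote command.
pidfile=$(mktemp)

# The parent starts the child in the background, records its PID, then waits.
sh -c 'sleep 30 & echo $! > "$1"; wait' parent "$pidfile" &
parent_pid=$!

sleep 1                        # give the parent time to start the child
child_pid=$(cat "$pidfile")

# Kill only the parent, as happens when the SSH connection is torn down.
kill -9 "$parent_pid"
wait "$parent_pid" 2>/dev/null || true
sleep 1

# kill -0 sends no signal; it only checks that the PID still exists.
if kill -0 "$child_pid" 2>/dev/null; then
  orphan_alive=yes
else
  orphan_alive=no
fi
echo "child $child_pid alive after parent was killed: $orphan_alive"

# Clean up the orphan and the temporary file.
kill "$child_pid" 2>/dev/null || true
rm -f "$pidfile"
```

The child should be reported as still alive: killing the parent does nothing to a child that has no controlling terminal and receives no signal of its own.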

Pseudo-TTY

A way to interrupt the remote command could be the ssh -t option. The -t option tells the SSH client to allocate a pseudo-TTY, which basically means making the local terminal a mirror of what a remote terminal would be, instead of just running a command.

If you have never used it, give it a try:

$ ssh -t [email protected] vim /tmp/

It will open vim in your local terminal as if you had a terminal open on the remote machine.

Now if you kill a process started with -t using Ctrl-C, the remote sshd process will terminate its child processes as well, just like when you hit Ctrl-C on a local process.

$ ssh -t [email protected] sleep 30
^C
Connection to remserver closed.

No remote processes remain alive after the kill, and sleep 30 disappears on remserver.

However, this technique does not solve our problem: pipeline jobs are not interactive, so we cannot tell the ssh channel to send a remote kill just by setting up a pseudo-TTY. The kill signal only affects the local process and the remote sshd, and is not interpreted as a user manually hitting Ctrl-C.

The solution: tracking and pkill

The way to correctly stop remote processes when pipeline jobs are cancelled is to do it in a controlled fashion:

1) Clarive job process starts remote command and keeps the connection open

2) Clarive job is canceled (by the user normally, through the job monitor)

3) Clarive creates a new connection to all servers where commands are being executed

4) A pkill -[signal] -P $PPID command is sent through the same sshd tunnel

5) The pkill will kill the parent remote sshd process and all its children, also called the process tree

That way all the remote processes are stopped with the job cancel.
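The tracking-and-pkill idea in the steps above can be tried locally. This is only a sketch, not Clarive's implementation: a local `sh` process stands in for the remote sshd, `tracked_pid` is a hypothetical name for the parent PID recorded when the command starts, and the `pkill` run here would in practice be sent over a connection to the remote server.

```shell
pidfile=$(mktemp)

# Stand-in for the remote sshd: a shell that starts two commands and waits.
sh -c 'sleep 30 & echo $! > "$1"; sleep 31 & echo $! >> "$1"; wait' parent "$pidfile" &
tracked_pid=$!                 # the parent PID recorded when the command starts

sleep 1                        # give the parent time to start its children
child_pids=$(cat "$pidfile")

# On cancel: signal every process whose parent is the tracked PID.
kill_children_signal=15
pkill -"$kill_children_signal" -P "$tracked_pid"

# With its children gone, the waiting parent exits on its own.
wait "$tracked_pid" 2>/dev/null || true

# Count how many of the recorded children are still around.
survivors=0
for pid in $child_pids; do
  if kill -0 "$pid" 2>/dev/null; then
    survivors=$((survivors + 1))
  fi
done
echo "children still running: $survivors"
rm -f "$pidfile"
```

Note that `pkill -P` matches direct children of the given PID only. In this stand-in the parent exits by itself once its children are gone; a deeper process tree would need the same step repeated per level, or a different grouping mechanism.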


Successfully killing remote processes will kill the full remote tree

Picking a signal

Additionally, we’ve introduced control over the local and remote signals to send to end the processes. You may be interested in sending a more stern kill -9 or just a nice kill -15 to the remote process.

Clarive will not wait for the remote process to finish since, as we have witnessed many times, certain shutdown procedures may take forever. It does, however, apply a timeout to the local job processes that are still running and may be waiting for the remote process to finish.

The following config/[yourconfig].yml file options are available:

# kill signal used to cancel job processes
# - 9 if you want the process to stop immediately
# - 2 or 15 if you want the process to stop normally
kill_signal: 15

# 1|0 - if you want to kill the job children processes as well
kill_job_children: 1
# signal that will be sent to remote children
kill_children_signal: 15

# seconds to wait for killed job child processes to be reaped
kill_job_children_timeout: 30
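To illustrate how a signal choice and a timeout interact, here is a small sketch. It is not how Clarive applies these options internally: `stop_with_timeout` is a hypothetical helper, and the short limits are chosen only so that the demo finishes quickly.

```shell
# Hypothetical helper: send a signal, then wait up to a limit (in seconds)
# for the process to disappear. Returns 0 if it exited, 1 on timeout.
stop_with_timeout() {
  pid=$1; signal=$2; limit=$3
  kill -"$signal" "$pid" 2>/dev/null || return 0   # already gone
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# A process that honours signal 15 (SIGTERM).
sleep 30 &
graceful_pid=$!
if stop_with_timeout "$graceful_pid" 15 5; then
  graceful_result=stopped
else
  graceful_result=timed_out
fi

# A process that ignores SIGTERM (an ignored signal stays ignored across exec).
sh -c 'trap "" TERM; exec sleep 30' &
stubborn_pid=$!
sleep 1                        # let the trap be installed first
if stop_with_timeout "$stubborn_pid" 15 2; then
  stubborn_result=stopped
else
  stubborn_result=timed_out
fi
kill -9 "$stubborn_pid" 2>/dev/null || true   # clean up the demo process
wait "$stubborn_pid" 2>/dev/null || true

echo "graceful: $graceful_result, stubborn: $stubborn_result"
```

A process that ignores signal 15 is the case where a timeout matters: the caller gives up waiting instead of blocking indefinitely, and signal 9 remains available for processes that must stop immediately.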

Why killing remote processes is important

When we get down to business, DevOps is as much about running processes on remote servers, cloud infrastructure and containers as it is about promoting an empowered, do-IT-yourself culture.

If you are building DevOps pipelines, or doing remote process execution in general, and want to stop a run midway for whatever reason, it’s important to have a resilient process tree that is tracked and can be killed when requested by the master process.

Happy scripting!

