Move killing logic solely to process #6868

agoscinski · 2025-05-08T21:14:42Z

The killing process is very convoluted because it is performed partly in tasks.py:Waiting (a ProcessState) and partly in process.py:Process. The architecture tried to split the killing process into two parts: one responsible for cancelling the job in the scheduler (in tasks.py:Waiting), and one responsible for killing the process by transitioning it to the KILLED state. Here is a summary of these two parts:

Killing the plumpy process

calcjob/process:Process
Event: KillMessage (through RabbitMQ, sent by verdi)
kill -> self.runner.controller.kill_process # (sending message to kill)

Killing the scheduler job

calcjob/tasks:Waiting (The task running the actual CalcJob)
Event: CalcJobMonitorAction.KILL (through monitoring), KillInterrupt (through verdi)
execute --> _kill_job -> task_kill_job -> do_kill -> execmanager.kill_calculation

In this PR I am moving most of the killing logic to the process to simplify the design. This is required to fix a bug that appears when two kill commands are sent in a row. The first kill command sends the KillInterruption (within process.py:Process, part of the logic in the parent class) to tasks.py:Waiting, which receives it and starts cancelling the scheduler job. Since the cancellation is only triggered from a try-except block catching the KillInterruption, it cannot be repeated when the user invokes a second kill command. This bug was introduced by PR #6793 (the one that introduced force kill), because that PR also started to fix the timeout issue (verdi process kill partially ignores the timeout). Moving all killing logic to the process, as done in this PR, solves the problem: when a new kill action is received in the process class, the scheduler job can be cancelled again, so the EBM is reinvoked.

Furthermore, I deliberately do not call the parent kill method from plumpy, as this only added to the entanglement and made it harder to read what happens during the kill. The reasoning behind the separation of code between plumpy.Process and the aiida-core Process (even before the force kill PR) is not clear to me; it seems to be something that grew incrementally to fix bugs and just became more convoluted to read. Since we do not have any clear usage of plumpy outside of aiida-core, I would be practical here and continue with this approach.
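The difference between the two layouts can be illustrated with a toy asyncio model. This is only a sketch under simplifying assumptions: `old_waiting_state`, `old_kill`, `NewProcess` and `cancel_scheduler_job` are hypothetical stand-ins, not the actual aiida-core or plumpy code, and the hanging sleep merely imitates a cancel that is stuck retrying in the EBM.

```python
import asyncio


class KillInterruption(Exception):
    """Hypothetical stand-in for the interruption sent to the waiting state."""


async def cancel_scheduler_job(attempts):
    """Record a cancel attempt, then hang like a retry loop that never succeeds."""
    attempts.append("cancel")
    await asyncio.sleep(3600)


async def old_waiting_state(interrupt, attempts):
    """Old layout: the cancel is reachable only from this single except block."""
    try:
        await interrupt
    except KillInterruption:
        await cancel_scheduler_job(attempts)


def old_kill(interrupt):
    """Deliver the interruption; once it is consumed, later kills have no effect."""
    if not interrupt.done():
        interrupt.set_exception(KillInterruption())


class NewProcess:
    """New layout: the process owns the cancel, so every kill launches it again."""

    def __init__(self):
        self.attempts = []
        self.tasks = []

    def kill(self):
        self.tasks.append(asyncio.create_task(cancel_scheduler_job(self.attempts)))


async def main():
    interrupt = asyncio.get_running_loop().create_future()
    old_attempts = []
    state = asyncio.create_task(old_waiting_state(interrupt, old_attempts))
    process = NewProcess()
    for _ in range(2):  # the user sends two kill commands in a row
        old_kill(interrupt)
        process.kill()
        await asyncio.sleep(0.01)
    tasks = [state, *process.tasks]
    for task in tasks:  # clean up the hanging toy tasks
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return len(old_attempts), len(process.attempts)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this model the old layout records one cancel attempt for two kills, while the process-owned layout records one attempt per kill.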

TODO:

When killing a child during the action (and the child is not responding because it is blocked), this can block a worker. Therefore the killing has to happen in a way that does not block the worker, so that the old killing action can be cancelled if the killing action is resent. I think the code before did this, but I removed it temporarily to simplify reading the logic.
Also, we have a design issue: the children must be killed first so that the parent does not except, but we also need to kill the children first in order to continue when the kill action is resent. I think the easy solution would be to allow killing an already killed process, so that it can continue killing its children.
The same _launch_task now exists in two places (process.py and tasks.py); these need to be unified.

Here are the behaviors I identified as important for the kill command to work correctly. I use them as guidelines to produce a robust killing action, since I need to touch code that I don't fully understand but which I think is only there due to incremental changes and the entanglement of the logic between plumpy and aiida-core (no clear modularity between the two).

Kill action in worker must not deadlock

Due to the broker design, this scenario has to be avoided:
USER sends kill
WORKER receives message and is killing, then gets blocked because it is waiting on something
USER cancels killing
USER sends second kill
WORKER still blocked forever

The user cannot cancel the killing action through RabbitMQ. Therefore the logic should avoid any potential deadlock when freeing resources during the kill (e.g. killing child processes, cancelling scheduler jobs). So the worker should kill in a non-blocking manner using asyncio functionality. The killing of the scheduler job already works in a non-blocking way.
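A minimal sketch of what "non-blocking" could mean here, assuming the kill handler schedules the cleanup as a background asyncio task instead of awaiting it. `Worker`, `on_kill_message` and `free_resources` are hypothetical names for illustration, not the aiida-core API.

```python
import asyncio


async def free_resources(done):
    """Hypothetical cleanup (kill children, cancel scheduler job) that may hang."""
    await asyncio.sleep(3600)
    done.append(True)


class Worker:
    def __init__(self):
        self.background = set()

    async def on_kill_message(self, done):
        """Schedule the cleanup and return immediately instead of awaiting it."""
        task = asyncio.create_task(free_resources(done))
        self.background.add(task)  # keep a reference so the task is not garbage collected
        task.add_done_callback(self.background.discard)
        return "kill scheduled"


async def main():
    worker = Worker()
    done = []
    # Both messages are handled quickly even though the first cleanup is still hanging.
    first = await asyncio.wait_for(worker.on_kill_message(done), timeout=1)
    second = await asyncio.wait_for(worker.on_kill_message(done), timeout=1)
    tasks = list(worker.background)
    for task in tasks:  # clean up the hanging toy tasks
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return first, second, len(tasks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because the handler never awaits the cleanup itself, a second kill message can still be processed while the first cleanup is stuck.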

Cancel old killing action worker side

USER sends kill
WORKER receives message and is killing, then gets stuck in the EBM
USER sends kill with new parameters (e.g. force kill)
WORKER killing is already in progress with old parameters

So the worker must cancel the old killing action and start a new one with the new parameters.
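One way to picture this requirement is a process that keeps a handle on its current kill task and replaces it whenever a new kill arrives. The sketch below is a toy model; `Process`, `_do_kill` and the `force` flag are illustrative assumptions rather than the real implementation.

```python
import asyncio


class Process:
    """Toy process that keeps at most one kill action in flight."""

    def __init__(self):
        self.log = []
        self._kill_task = None

    async def _do_kill(self, force):
        if force:
            # Assumption: a force kill bypasses the retrying scheduler cancel.
            self.log.append("force kill: bypass retry loop")
            return
        self.log.append("graceful kill: cancelling scheduler job")
        try:
            await asyncio.sleep(3600)  # imitates a cancel stuck in a retry loop
        except asyncio.CancelledError:
            self.log.append("graceful kill: superseded")
            raise

    def kill(self, force=False):
        """Cancel a kill that is still running, then start one with the new parameters."""
        if self._kill_task is not None and not self._kill_task.done():
            self._kill_task.cancel()
        self._kill_task = asyncio.create_task(self._do_kill(force))
        return self._kill_task


async def main():
    process = Process()
    process.kill(force=False)
    await asyncio.sleep(0.01)  # the first kill is now stuck
    await process.kill(force=True)
    await asyncio.sleep(0.01)  # give the cancelled task time to finish
    return process.log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The second call does not queue behind the stuck one: the old task is cancelled and the kill with the new parameters runs to completion.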

Timeout actions on the worker side

USER sends kill
WORKER receives message and is killing
USER cancels killing with ctrl+c
USER does something else and assumes that it did not work
WORKER is still killing and might cause unexpected behavior for the user if it suddenly succeeds hours later

Again, this is because the user cannot cancel the kill through RabbitMQ: we cannot cancel a kill that has been sent (without sending another one). I think we can at least take advantage of the timeout argument and cancel the kill once the timeout kicks in, which by default is after 5 seconds, so that at least no kill action is left hanging in the event loop for hours, e.g. in the EBM. That would require a change in plumpy to pass the timeout in the message.
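A rough sketch of the timeout idea, assuming the kill message carried a `timeout` field (which, as noted, plumpy does not currently pass). `handle_kill`, the message layout and `cancel_scheduler_job` are hypothetical.

```python
import asyncio


async def cancel_scheduler_job():
    """Imitates a scheduler cancel that hangs, e.g. while retrying."""
    await asyncio.sleep(3600)


async def handle_kill(message):
    """Abandon the kill once the timeout taken from the message has elapsed."""
    timeout = message.get("timeout", 5.0)  # assumed default, mirroring the 5 seconds above
    try:
        await asyncio.wait_for(cancel_scheduler_job(), timeout=timeout)
    except asyncio.TimeoutError:
        return "kill abandoned after timeout"
    return "killed"


async def main():
    return await handle_kill({"intent": "kill", "timeout": 0.05})


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.wait_for` cancels the wrapped coroutine when the timeout expires, so no kill action is left behind in the event loop.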

The killing process is very convoluted because it is performed partly in `tasks.py:Waiting` and partly in `process.py:Process`. The architecture tried to split the killing process into two parts: one responsible for cancelling the job in the scheduler (in `tasks.py:Waiting`), and one responsible for killing the process by transitioning it to the KILLED state. Here is a summary of these two steps:

Killing the plumpy process

calcjob/process:Process
Event: KillMessage (through RabbitMQ, sent by verdi)
kill -> self.runner.controller.kill_process # (sending message to kill)

Killing the scheduler job

calcjob/tasks:Waiting (The task running the actual CalcJob)
Event: CalcJobMonitorAction.KILL (through monitoring), KillInterrupt (through verdi)
execute --> _kill_job -> task_kill_job -> do_kill -> execmanager.kill_calculation

In this PR I am moving most of the killing logic to the process to simplify the design. This is required to fix a bug that appears when two kill commands are sent. The first kill command sends the KillInterruption (within `process.py:Process`, part of the logic in the parent class) to `tasks.py:Waiting`, which receives it and starts cancelling the scheduler job. Since the cancellation is only triggered from a try-except block catching the `KillInterruption`, it cannot be repeated when the user invokes a second kill command. This bug was introduced by PR TODO (the one that introduced force kill), because it also started to fix the timeout issue (verdi process kill is ignoring the timeout). Moving all killing logic to the process, as done in this PR, solves the problem, as the cancelation of the job is now reinvoked in the process class.

This is the function that is invoked when a worker receives a kill message through RMQ. I put very verbose comments for the review that I will remove later.

I must say the kill process seems not well tested, as I did not have to adapt much in the tests. The tests in `test_work_chain.py` need some adaptation to also be able to kill a scheduler job in a dummy manner.

PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem that if two kill commands are sent in sequence, the second kill action cancels the first one, which had triggered the cancelation of the scheduler job within an EBM. The second kill command, however, did not retrigger the cancelation of the scheduler job. This bug appeared because the killing logic is placed in two locations. More information about this can be found in PR aiidateam#6868, which fixes this properly by refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the kill command failed through the EBM, the scheduler job could no longer be cancelled through a kill. Since we now have the force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.

…f scheduler job (#6870)

PR #6793 introduced logic to cancel earlier kill actions when a new one is issued. However, this led to a bug: when two kill actions are sent in succession, the second cancels the first, including the cancelation of the scheduler job that the first one triggered and that is stuck in the EBM. The second kill command does not re-initiate the scheduler job cancelation, leaving it in an inconsistent state. This issue arises because the kill logic is split across two places. PR #6868 addresses this properly with a refactor of the kill action. In contrast, this PR provides a temporary workaround to mitigate the issue.

Furthermore, before this PR, if the kill command failed through the EBM, the cancelation could not be rescheduled by a subsequent kill action, as the process had already transitioned to the `Excepted` state. With the new force-kill option the user can bypass the EBM as desired, so the transition to the `Excepted` state is no longer needed and has been removed to allow the user to still kill a job gracefully.

---------

Co-authored-by: Ali Khosravi <[email protected]>

agoscinski force-pushed the killing-time branch from 7cf9ccd to 1269ac8 on May 8, 2025 21:15

[pre-commit.ci] auto fixes from pre-commit.com hooks

b1690d7

for more information, see https://pre-commit.ci

agoscinski mentioned this pull request May 9, 2025

Regular killing reschedules a cancel of scheduler job #6870

Merged

agoscinski added this to aiida-core v2.8.0 May 9, 2025
