Move killing logic solely to process #6868

Draft: wants to merge 2 commits into main

agoscinski (Contributor) commented May 8, 2025

The killing process is very convoluted because it is performed partly in tasks.py:Waiting (a ProcessState) and partly in process.py:Process. The architecture tried to split the killing process into two parts: one responsible for cancelling the job in the scheduler (tasks.py:Waiting), and one responsible for killing the process by transitioning it to the KILLED state. Here is a summary of these two parts:

Killing the plumpy process

calcjob/process:Process
Event: KillMessage (through RabbitMQ, triggered by verdi)
kill -> self.runner.controller.kill_process # (sending message to kill)

Killing the scheduler job

calcjob/tasks:Waiting (The task running the actual CalcJob)
Event: CalcJobMonitorAction.KILL (through monitoring), KillInterrupt (through verdi)
execute --> _kill_job -> task_kill_job -> do_kill -> execmanager.kill_calculation

In this PR I am moving most of the killing logic to the process to simplify the design. This is required to fix a bug that appears when two kill commands are sent in a row. The first kill command sends the KillInterruption (within process.py:Process, part of the logic in the parent class) to tasks.py:Waiting, which receives it and starts cancelling the scheduler job. Since this is only triggered through a try-except block catching the KillInterruption, it cannot be repeated when the user invokes a second kill command. This bug was introduced by PR #6793 (the one that introduced force kill), because it also started to fix the timeout issue (verdi process kill is partially ignoring the timeout). Moving all killing logic to the process, as done in this PR, solves the problem: when a new kill action is received in the process class, the scheduler job can be cancelled again, and thus the EBM (exponential backoff mechanism) is reinvoked.

Furthermore, I deliberately do not call the parent kill method from plumpy, as this just added to the entanglement and made it harder to read what is happening during the kill. The reasoning behind the separation of code between plumpy.Process and the aiida-core Process (even before the force kill PR) is not clear to me. It seems to be something that grew incrementally to fix bugs and became more convoluted to read along the way. Since we do not have any clear usage of plumpy outside of aiida-core, I would be practical here and continue with this approach.

TODO:

  • When killing a child during the action (and the child is not responding because it is blocked), this can block a worker. Therefore killing has to happen in a way that does not block the worker, so that the old kill action can be cancelled if the kill action is resent. I think the code before did this, but I removed it temporarily to simplify reading the logic.
  • We also have a design issue: the children must be killed first so that the parent does not except, but we still need to be able to continue killing the children if the kill action is resent. I think the easy solution would be to allow killing an already killed process, so that it can continue killing its children.
  • The same _launch_task now exists in two places (process.py and tasks.py); this needs to be unified.

Here are the behaviors I identified as important for the kill command to work correctly. I use them as guidelines to produce a robust kill action, since I need to touch code that I don't fully understand, but which I think is only there due to incremental changes and the entanglement of the logic between plumpy and aiida-core (no clear modularity between the two).

  1. The kill action in the worker must not deadlock
Due to the broker design, this scenario has to be avoided:
USER sends kill
WORKER receives message and starts killing, then gets blocked because it is waiting on something
USER cancels killing
USER sends second kill
WORKER still blocked forever

The user cannot cancel the kill action through RabbitMQ. Therefore the logic should avoid any potential deadlock when freeing resources during the kill (e.g. killing child processes, cancelling scheduler jobs). So the worker should kill in a nonblocking manner using asyncio functionality. Cancelling the scheduler job already works in a nonblocking manner.
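
To make the nonblocking requirement concrete, here is a minimal asyncio sketch. It is not the aiida-core or plumpy implementation: `Worker`, `on_kill_message` and `cancel_scheduler_job` are hypothetical names invented for illustration. The point is only that the message handler schedules the slow cancellation as a task and returns immediately, so the worker stays able to receive further messages.

```python
import asyncio
from typing import Optional


async def cancel_scheduler_job(job_id: str, delay: float) -> str:
    """Hypothetical stand-in for a slow scheduler cancellation."""
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking it
    return f"cancelled {job_id}"


class Worker:
    """Toy worker (assumption): handling a kill message only schedules work."""

    def __init__(self) -> None:
        self.kill_task: Optional[asyncio.Task] = None

    def on_kill_message(self, job_id: str, delay: float) -> None:
        # create_task returns immediately, so the worker can keep
        # receiving messages while the cancellation is in flight.
        self.kill_task = asyncio.get_running_loop().create_task(
            cancel_scheduler_job(job_id, delay)
        )


async def demo() -> list:
    events = []
    worker = Worker()
    worker.on_kill_message("job-1", delay=0.05)
    events.append("handler returned")
    await asyncio.sleep(0)  # the loop is still responsive while the kill is pending
    events.append("loop responsive")
    events.append(await worker.kill_task)
    return events


print(asyncio.run(demo()))
```

A handler that awaited the cancellation directly (or called a blocking function) would instead hold up every later message until the cancellation finished.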

  2. Cancel the old kill action on the worker side
USER sends kill
WORKER receives message, starts killing and gets stuck in the EBM
USER sends kill with new parameters (e.g. force kill)
WORKER is already killing with the old parameters

So the worker must cancel the old kill action and start a new one with the new parameters.
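
The cancel-and-restart behavior can be sketched with plain asyncio tasks. This is an illustration under assumptions, not the code of this PR: `KillHandler` and `request_kill` are made-up names, and the graceful path simply sleeps to imitate a kill that is stuck retrying.

```python
import asyncio
from typing import Optional


class KillHandler:
    """Toy handler (assumption) keeping at most one kill action in flight."""

    def __init__(self) -> None:
        self._task: Optional[asyncio.Task] = None
        self.log: list = []

    async def _kill(self, force: bool) -> str:
        if not force:
            # Graceful path: imitate a retry loop that is stuck on the scheduler.
            try:
                await asyncio.sleep(3600)
            except asyncio.CancelledError:
                self.log.append("graceful kill cancelled")
                raise
        self.log.append(f"killed (force={force})")
        return "KILLED"

    def request_kill(self, force: bool = False) -> asyncio.Task:
        # Drop the action that was started with the old parameters ...
        if self._task is not None and not self._task.done():
            self._task.cancel()
        # ... and start a fresh one with the new parameters.
        self._task = asyncio.get_running_loop().create_task(self._kill(force))
        return self._task


async def demo() -> tuple:
    handler = KillHandler()
    first = handler.request_kill(force=False)
    await asyncio.sleep(0)  # let the first kill start and get stuck
    second = handler.request_kill(force=True)
    result = await second
    return result, first.cancelled(), handler.log


print(asyncio.run(demo()))
```

Because every request goes through the same entry point, a repeated kill always restarts the action, which is the property that a one-off interrupt handler cannot provide.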

  3. Time out actions on the worker side
USER sends kill
WORKER receives message and starts killing
USER cancels killing with ctrl+c
USER does something else and assumes that it did not work
WORKER is still killing and might cause unexpected behavior for the user if it suddenly succeeds hours later

Again, the user cannot cancel the kill through RabbitMQ: a kill that has been sent cannot be cancelled without sending another one. I think we can at least take advantage of the timeout argument and cancel the kill once the timeout kicks in, which is 5 seconds by default, so that at least no kill action hangs in the event loop for hours, e.g. in the EBM. That would require a change in plumpy to pass the timeout in the message.
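
A worker-side timeout could look roughly like the following sketch. It assumes the timeout value has somehow reached the worker (which, as noted above, plumpy does not currently pass in the message); `handle_kill` and `kill_with_retries` are hypothetical names. `asyncio.wait_for` cancels the inner coroutine when the timeout expires, so no kill action lingers in the event loop.

```python
import asyncio


async def kill_with_retries() -> str:
    """Hypothetical stand-in for a kill that keeps retrying without success."""
    while True:
        await asyncio.sleep(0.01)  # each "attempt" fails and is retried


async def handle_kill(timeout: float) -> str:
    try:
        return await asyncio.wait_for(kill_with_retries(), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the retry loop at this point.
        return "kill abandoned after timeout"


async def demo() -> tuple:
    outcome = await handle_kill(timeout=0.05)
    # Count tasks other than the demo itself that are still on the loop.
    leftover = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    return outcome, len(leftover)


print(asyncio.run(demo()))
```

Whether an abandoned kill should also be reported back to the user is a separate design question that this sketch does not address.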

The killing process is very convoluted because it is performed partly
in `tasks.py:Waiting` and partly in `process.py:Process`. The
architecture tried to split the killing process into two parts: one
responsible for cancelling the job in the scheduler
(`tasks.py:Waiting`), and one responsible for killing the process by
transitioning it to the KILLED state. Here is a summary of these two
steps:

Killing the plumpy process
calcjob/process:Process
Event: KillMessage (through RabbitMQ, triggered by verdi)
kill -> self.runner.controller.kill_process # (sending message to kill)

Killing the scheduler job
calcjob/tasks:Waiting (The task running the actual CalcJob)
Event: CalcJobMonitorAction.KILL (through monitoring), KillInterrupt (through verdi)
execute --> _kill_job -> task_kill_job -> do_kill -> execmanager.kill_calculation

In this PR I am moving most of the killing logic to the process to
simplify the design. This is required to fix a bug that appears when
two kill commands are sent. The first kill command sends the
`KillInterruption` (within `process.py:Process`, part of the logic in
the parent class) to `tasks.py:Waiting`, which receives it and starts
cancelling the scheduler job. Since this is only triggered through a
try-except block catching the `KillInterruption`, it cannot be repeated
when the user invokes a second kill command. This bug was introduced by
PR TODO (the one that introduced force kill), because it also started
to fix the timeout issue (verdi process kill is ignoring the timeout).
Moving all killing logic to the process, as done in this PR, solves the
problem: the cancelation of the job now lives entirely in the process
class, where it is reinvoked on every kill. This is the code that is
invoked when a worker receives a kill message through RMQ.

I added very verbose comments for the review that I will remove later.
I must say the kill process does not seem well tested, as I did not
have to adapt much in the tests. The tests in `test_work_chain.py` need
some adaptation to also be able to kill a scheduler job in a dummy
manner.
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This
caused a problem: if two kill commands are sent in sequence, the second
kill action cancels the first one, which had triggered the cancelation
of the scheduler job within an EBM. The second kill command, however,
did not retrigger the cancelation of the scheduler job. This bug
appeared because the killing logic is placed in two locations. More
information about this can be found in PR aiidateam#6868, which fixes this
properly by refactoring the kill action. This PR only serves as a fast
temporary fix with workarounds.

Before this PR, when the kill command failed through the EBM, the
scheduler job could no longer be cancelled through a kill. Since we now
have the force-kill option to bypass the EBM, we can reschedule the
cancelation of the scheduler job to gracefully kill a process.
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

agoscinski added a commit that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 2, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit that referenced this pull request Jun 4, 2025
…f scheduler job (#6870)

PR #6793 introduced logic to cancel earlier kill actions when a new one is
issued. However, this led to a bug: when two kill actions are sent in
succession, the second cancels the first, including the cancelation it
triggered of the scheduler job that is stuck in the EBM. The second kill
command does not re-initiate the scheduler job cancelation, leaving it in an
inconsistent state. This issue arises because the kill logic is split across
two places. PR #6868 addresses this properly with a refactor of the kill
action. In contrast, this PR provides a temporary workaround to mitigate the
issue.

Furthermore, before this PR, if the kill command failed through the EBM, the
cancelation could not be rescheduled by a subsequent kill action, as the
process had already transitioned to the `Excepted` state. With the new
force-kill option the user can bypass the EBM as desired, so the transition to
the `Excepted` state is no longer needed and has been removed to allow the
user to continue gracefully killing a job.

---------

Co-authored-by: Ali Khosravi <[email protected]>