-
Notifications
You must be signed in to change notification settings - Fork 226
Move killing logic solely to process #6868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
agoscinski
wants to merge
2
commits into
aiidateam:main
Choose a base branch
from
agoscinski:killing-time
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The killing process is very convoluted due to being partially performed in `tasks.py:Waiting` and `process.py:Process`. The architecture tried to split the killing process in two parts, one responsible for cancelling the job in the scheduler in (`tasks.py:Waiting`), one responsible for killing the process transitioning it to the KILLED state. Here a summary of these two steps Killing the plumpy calcjob/process:Process Event: KillMessage (through rabbitmq by through verdi) kill -> self.runner.controller.kill_process # (sending message to kill) Killing the scheduler job calcjob/tasks:Waiting (The task running the actual CalcJob) Event: CalcJobMonitorAction.KILL (through monitoring), KillInterrupt (through verdi) execute --> _kill_job -> task_kill_job -> do_kill -> execmanager.kill_calculation In this PR I am moving most of the killing logic to the process to simplify the design. This is required to fix a bug that appears when two killing commands are sent. The first killing command is sending the KillInterruption (within `process.py:Process`, part of the logic in parent class) to the `tasks.py:Waiting` that receives it and start the cancelling of the scheduler job. Since this is only triggered through a try-catch block of the `KillInterruption` it cannot be repeated when a second kill command is invoked by the user. This bug was introduced by PR TODO (the one introduced force kill), because it also started to fix the timeout issue (verdi process kill is ignoring the timeout). Moving all killing logic to the process as done in this PR solves the problem as we completely moved the cancelation of the job is reinvoked in the process class. This is the function that is invoked when a worker receives a kill message through RMQ. I put very verbose comments for the review that I will remove later. I must say the kill process seems not well tested as I had not to adapt much in the tests. The tests in `test_work_chain.py` need some adaption to also be able to kill a scheduler job in a dummy manner.
for more information, see https://pre-commit.ci
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
May 9, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
May 9, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
May 9, 2025
Squashed commit at 2025-05-09 21:53 PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
that referenced
this pull request
May 9, 2025
Squashed commit at 2025-05-09 21:53 PR #6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR #6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
Jun 2, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
Jun 2, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
Jun 4, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
Jun 4, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
to agoscinski/aiida-core
that referenced
this pull request
Jun 4, 2025
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem if two kill commands are set in a sequence, the second kill action will cancel the first one which triggered the cancelation of the scheduler job within an EBM. The second kill command however did not retrigger the cancelation of the scheduler job. This bug appeared because we have two places where the killing logic is placed. More information about this can be found in PR aiidateam#6868 that fixes this properly refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the killing command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we have now force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
agoscinski
added a commit
that referenced
this pull request
Jun 4, 2025
…f scheduler job (#6870) PR #6793 introduced logic to cancel earlier kill actions when a new one is issued. However, this led to a bug: when two kill actions are sent in succession, the second cancels the first including its triggered cancelation of the scheduler job that is stuck in the EBM. The second kill command does not re-initiate the scheduler job cancelation, leaving it in an inconsistent state. This issue arises because the kill logic is split across two places. PR #6868 addresses this properly with a refactor of the kill action. In contrast, this PR provides a temporary workaround to mitigate the issue. Furthermore, before this PR if the kill command failed through the EBM, the cancelation could not be rescheduled by a subsequent kill action as the process state already transition to the `Expected` state. With the new force-kill option the user can bypasses the EBM as desired, thus the transition to the `Expected` state is not needed anymore and has been removed to allow the user to further gracefully kill a job. --------- Co-authored-by: Ali Khosravi <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The killing process is very convoluted due to being partially performed in
tasks.py:Waiting
(ProcessState) and inprocess.py:Process
. The architecture tried to split the killing process in two parts, one responsible for cancelling the job in the scheduler in (tasks.py:Waiting
), one responsible for killing the process transitioning it to theKILLED
state. Here a summary of these two parts:Killing the plumpy process
Killing the scheduler job
In this PR I am moving most of the killing logic to the process to simplify the design. This is required to fix a bug that appears when two killing commands are sent in a row. The first killing command is sending the
KillInterruption
(withinprocess.py:Process
, part of the logic in parent class) to thetasks.py:Waiting
that receives it and start the cancelling of the scheduler job. Since this is only triggered through a try-catch block of theKillInterruption
it cannot be repeated when a second kill command is invoked by the user. This bug was introduced by PR #6793 (the one introduced force kill), because it also started to fix the timeout issue (verdi process kill
is partially ignoring the timeout). Moving all killing logic to the process as done in this PR solves the problem as when a new killing action is received in the process class the scheduler job can be cancelled again, thus the EBM is reinvoked. I further on purpose do not call the parent kill method from plumpy as this just added to the entanglement and made it harder to read what is happening in the kill. The logic between the separation of code in theplumpy.Process
and in the aiida-core Process (even before the force kill PR) is not clear to me and seems to me also as something that just grew incrementally to fix bugs becoming just more convoluted to read. Since we do not have any clear usage of plumpy outside of aiida-core, I would be pratical here and continue with this approach.TODO:
_launch_task
is now at two places (process.py and tasks.py) need to unify thisHere are behavoirs I identified to be important for the kill command to work correctly. I use them as guidelines to produce a robust killing action since I need to touch code that I don't fully understand, but which I think is only there due to incrementally changes and the entanglement of the logic between plumpy and aiida-core (no clear modularity between the two).
The user cannot cancel the killing action through rabbitmq. Therefore the logic should avoid any potential deadlock when freeing resources during the kill (e.g. killing child processes, cancellingscheduler jobs). So the worker should kill in a nonblocking manner using asyncio functionalities. The killing of the scheduler is working nonblocking
So the worker must cancel the old killing and need to start a new one with new paramaters.
Again because the user cannot cancel the kill through rabbitmq. We cannot cancel any sent kill (without resendng another). I think we can at least take an advantage of the timeout argument and cancel the kill once the timeout kicks in which is by default 5 seconds, so at least there will be no kill action in the event loop for hours hanging e.g. in the EBM. That would require a change in plumpy to pass the timeout in the message.