WIP: Implement force-kill option in verdi process kill #6575

agoscinski · 2024-09-27T18:23:03Z

This allows plumpy ignore the callback so it does not get stuck in it but directly transitions the process to killed. The disadvantage is that the job behind the process might not be killed as we do not wait for. But I think that can be clearly communicated to the user in the help message and it has a similar logic as kill -9 <PID>.

This PR goes together with the PR in plumpy aiidateam/plumpy#288

TODO:
As you feared @khsrali with this solution the worker process will not be killed, so the upload async function still continues, so maybe we still need to find a solution that actually sends the message back to aiida-core. It is really weird, I feel like _stepping part in plumpy is buggy, because it only the transition function actually returns back to the sender of the message (where verdi process kill <PID> was sent), while only call the callback do calcjobs.tasks.task_kill_job in the _stepping part kills the process occupying the worker (when removing the exponential backof from the kill), but it should do both. I have this a bug on my system also when just killing local jobs that sleep. Maybe we can check on your machine.

Co-author @khsrali

@kh

This allows plumpy ignore the callback so it does not get stuck in it but directly transitions the process to killed. The disadvantage is that the job behind the process might not be killed as we do not wait for. But I think that can be clearly communicated to the user in the help message and it has a similar logic as `kill -9 <PID>`. TODO: As you mentioned @kh with this solution the worker process will not be killed. Some parts of plumpy in stepping are needed to kill the process, but I am not sure which ones.

for more information, see https://pre-commit.ci

khsrali · 2024-09-30T12:26:26Z

Thanks @agoscinski
I write things here, only not to forget

Agreed, the process doesn't seem to be completely taken off from the concurrent Q, it's still there until its exponential back-off thing finishes, even though it disappears from verdi process list
One other problem:
Right now, if you do first without force kill verdi process kill <pk> and then press Ctrl +C because it has stuck and do again verdi process kill <pk> -fk , it cannot force kill. Because a killing signal has already been queued (and stuck) in the process:

09/30/2024 12:30:13 PM <4187692> aiida.engine.runners: [WARNING] runner received interrupt, process 72 already being killed
WARNING: aiida.engine.runners: runner received interrupt, process 72 already being killed

and says:

raise TransportTaskException(f'kill_calculation failed {max_attempts} times consecutively') from exception
aiida.common.exceptions.TransportTaskException: kill_calculation failed 5 times consecutively

The other thing, is verdi process kill <pk> sometimes gets stuck, even though the process is already killed. (As agreed this should be listed as a different bug- not for this PR)

sphuber · 2024-09-30T18:07:02Z

raise TransportTaskException(f'kill_calculation failed {max_attempts} times consecutively') from exception
aiida.common.exceptions.TransportTaskException: kill_calculation failed 5 times consecutively

This part is key. If you want to kill a CalcJob that was stuck in the EBM because of a problem with the transport, the verdi process kill command will first call the kill command through the scheduler of the CalcJob because it wants to actually kill the associated job on the computer (before it kills the CalcJob in AiiDA itself). But if the transport is not working, that kill command over the scheduler will fail because it requires the transport.

The question is what the code should do in this case. Maybe the EBM should be disabled when calling the scheduler's kill command. But then again, what if the CalcJob was running just fine without transport issues but the user wants to kill it for another reason, but then it just so happens that the transport has an issue when performing the kill, ignoring it in this case would be a bad idea. So we would really have to check whether the job was already in EBM before the kill call, and only then ignore any exceptions and not go in the EBM again.

khsrali · 2024-10-01T07:19:06Z

So we would really have to check whether the job was already in EBM before the kill call, and only then ignore any exceptions and not go in the EBM again.

Yes, exactly! I was thinking about this as well.
I think this idea makes a lot of sense; I'm going to change this PR in this way and push again.

Once done, this PR should address all possibilities:

CalcJob got stuck in EBM
- verdi process kill also gets stuck (transport not available)
  - 🔧 ➡️ PROPOSED SOLUTION:
  - verdi process kill TIMEOUT after 5 Seconds, and echo if you want to kill with EBM do the next line:
  - verdi process kill --timeout 0 (0 for infinite, theoretically can take up to 5 minutes!)
  - verdi process kill -F do not even try to open a transport, just kill
    In this case, user accepts consequences: might still be running an HPC, but not monitored, retrieved, etc.
- Right about executing verdi process kill transport becomes available, but it gets stuck because CalcJob is sleeping.
  - Should be covered by the PROPOSED SOLUTION (--timeout default 5 seconds)
CalcJob has no problem (e.g. was submitted when connection was available)
- verdi process kill gets stuck (transport not available)
  - 🔧 ➡️ The same as PROPOSED SOLUTION
- verdi process kill successful (transport is still available.)
  - works ✔️

khsrali · 2024-10-08T08:36:08Z

Don't worry about tests failing.
They require aiidateam/plumpy#288 to function.. which I couldn't figure out how to properly hook it as dependency.

sphuber

Thanks @agoscinski some initial comments/suggestions

src/aiida/engine/processes/calcjobs/tasks.py

sphuber · 2024-10-08T20:33:16Z

src/aiida/engine/processes/calcjobs/tasks.py

        raise
    except Exception as exception:
-        logger.warning(f'killing CalcJob<{node.pk}> failed')
+        logger.warning(f'killing CalcJob<{node.pk}> excepted, the job might be orphaned.')


Don't think this warning is correct. When a TransportTaskException is raised, the Waiting.execute method catches it and pauses the process.

If I'm not wrong, what you are referring to is already in my test scenario
(see here test_process.py::test_process_kill # 10)
And is being passed, meaning the job is being EXCEPTED, not paused..
(maybe there is a bug somewhere else?)

I don't understand. The exception being caught here is thrown by do_kill so it is the kill operation (i.e. getting the transport or using it to call the kill command of the scheduler) that excepted, not the process itself. The exception is then reraised as a TransportTaskException which is caught in Waiting.execute and rethrown as PauseInterruption on line 571, which means the process is going into pause. So how does this mean the process is already excepted?

I see.
Just added one more test, and I'm now explicitly patching and returning TransportTaskException from task_kill_job.
We still receive EXCEPTED regardless of whether the process was running normally or was stuck in EBM.
Haven't figured out why yet.

src/aiida/engine/processes/calcjobs/tasks.py

unkcpz · 2024-12-02T00:16:55Z

tests/cmdline/commands/test_process.py

+        raise Exception('Mock open exception')
+
+    # patch a faulty transport open, to make EBM go crazy
+    inject_patch.patch('aiida.transports.plugins.local.LocalTransport.open', mock_open)


I think the way to implement an inject_patch fixture is quite hacky. Why not defined a MockTransport with open method implemented as mock_open?

Well, the main problem I had here was, the daemon being instantiated from another run of the interpreter so it couldn't see any mocking function you defined here.. So that I made this hard patch fixture, which might not be an ideal way, as we discussed before.

Anyways, I remember @agoscinski had a suggestion that maybe he could instantiate the daemon on the same running interpreter so what could just mock those functions normally.

@agoscinski can you try your idea, to see if that works?

We pass the monkey patched code to the daemon worker by creating the daemon worker with multiprocessing. I think in this case multiprocess is forked copying the monkeypatchey version of the class into the forked process memory. Here code snippet how it could work.

import multiprocessing as mp # more imports .. # monekypatch def prepare_for_submission(self, folder: Folder) -> CalcInfo: """Prepare the calculation for submission. ... aiida.calculations.arithmetic.add.ArithmeticAddCalculation.prepare_for_submission = prepare_for_submission from aiida.engine.daemon.worker import start_daemon_worker process = mp.Process(target=start_daemon_worker, args=(True,)) process.start() # will run forever process.join()

We could insert this probably somehow in submit_and_await. @unkcpz mentioned it should not affect running the tests in parallel because the daemon is started pre profile. One needs to ensure that a new profile is created for each test that is relying on a specific worker. I am not sure at the moment if each test creates a new profile, need to check it.

agoscinski · 2024-12-04T15:16:40Z

@khsrali @unkcpz we agreed in a discussion that we implement the solution mentioned in #6575 (comment) to avoid the inject. Maybe a separate PR would be good since this changes the infrastructure a bit. Probably this will not happen before the new year because @khsrali and me are busy with some other tasks.

agoscinski · 2025-04-30T08:14:29Z

Replaced by #6793

agoscinski and others added 2 commits September 27, 2024 20:12

[pre-commit.ci] auto fixes from pre-commit.com hooks

3cb8539

for more information, see https://pre-commit.ci

agoscinski mentioned this pull request Sep 27, 2024

WIP: Implement option to force kill aiidateam/plumpy#288

Closed

khsrali force-pushed the fix/6524/force-kill branch from 2af67e7 to 8e080b0 Compare October 2, 2024 12:56

working implementation

75bb430

khsrali force-pushed the fix/6524/force-kill branch from 8e080b0 to 75bb430 Compare October 2, 2024 13:26

khsrali linked an issue Oct 3, 2024 that may be closed by this pull request

verdi process kill takes long time for erroneous processes #6524

Closed

tests implemented

cad6738

khsrali marked this pull request as ready for review October 8, 2024 08:33

khsrali requested a review from sphuber October 8, 2024 08:36

sphuber requested changes Oct 8, 2024

View reviewed changes

khsrali added 3 commits October 9, 2024 11:11

review applied

dbd0a06

more test added

144680d

Merge branch 'main' into fix/6524/force-kill

bb31b69

unkcpz reviewed Dec 2, 2024

View reviewed changes

agoscinski mentioned this pull request Dec 4, 2024

Tests: Coverage for verdi process commands when no connection can be established #6650

Open

agoscinski assigned agoscinski and khsrali Feb 19, 2025

khsrali mentioned this pull request Mar 10, 2025

Implement force-kill option in verdi process kill #6793

Merged

agoscinski closed this Apr 30, 2025

github-project-automation bot moved this to Done in aiida-core v2.7.0 Apr 30, 2025

agoscinski deleted the fix/6524/force-kill branch May 2, 2025 10:02

WIP: Implement force-kill option in verdi process kill #6575

WIP: Implement force-kill option in verdi process kill #6575

Uh oh!

Conversation

agoscinski commented Sep 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

khsrali commented Sep 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sphuber commented Sep 30, 2024

Uh oh!

khsrali commented Oct 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

khsrali commented Oct 8, 2024

Uh oh!

sphuber left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sphuber Oct 8, 2024

Choose a reason for hiding this comment

Uh oh!

khsrali Oct 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sphuber Oct 9, 2024

Choose a reason for hiding this comment

Uh oh!

khsrali Oct 10, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

unkcpz Dec 2, 2024

Choose a reason for hiding this comment

Uh oh!

khsrali Dec 4, 2024

Choose a reason for hiding this comment

Uh oh!

agoscinski Dec 4, 2024

Choose a reason for hiding this comment

Uh oh!

agoscinski Dec 4, 2024

Choose a reason for hiding this comment

Uh oh!

agoscinski commented Dec 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

agoscinski commented Apr 30, 2025

Uh oh!

Uh oh!

agoscinski commented Sep 27, 2024 •

edited

Loading

khsrali commented Sep 30, 2024 •

edited

Loading

khsrali commented Oct 1, 2024 •

edited

Loading

khsrali Oct 9, 2024 •

edited

Loading

agoscinski commented Dec 4, 2024 •

edited

Loading