Implement force-kill option in verdi process kill #6793


Merged: agoscinski merged 11 commits into aiidateam:main from fix/6524/force-kill-v2 on May 2, 2025

Conversation

@agoscinski (Contributor) commented Mar 7, 2025

I don't want to lose the contributions and conversations by force-pushing to the older branch (in PR #6575), therefore I created a new PR on a new branch.


This PR goes together with the plumpy PR aiidateam/plumpy#320, which has not changed much from the original one.


I changed the calcjob that is killed in the tests to a mock RunningProcess that stays in the running state. The previous tests were not really testing whether the process is killable during the running state, because the ArithmeticAdd CalcJob goes immediately to the waiting state (while it is sleeping it is in the waiting state) and is then killable. The running state can be blocking if the run_fn is blocking, so no kill command can go through. I am not sure it makes sense to test these cases: in the end we want to test that a CalcJob does not get stuck during killing, we do not care so much about custom Processes. I would go forward by only using the ArithmeticAdd calcjob in tests 1-6.

From the tests for the EBM (7-11) I only kept test 7. I think it is not important to test that the kill is getting stuck. I see some value in test 9, but kept it short for now. Let me know your opinion about which tests to keep; I think there were too many before.

@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from d9941bd to 4d09fa8 Compare March 8, 2025 07:29
codecov bot commented Mar 8, 2025

Codecov Report

Attention: Patch coverage is 91.42857% with 3 lines in your changes missing coverage. Please review.

Project coverage is 78.57%. Comparing base (6eb17a7) to head (83e4d84).
Report is 1 commit behind head on main.

Files with missing lines                         Patch %   Lines
src/aiida/engine/processes/calcjobs/tasks.py     88.89%    1 Missing ⚠️
src/aiida/engine/processes/control.py            87.50%    1 Missing ⚠️
src/aiida/manage/tests/pytest_fixtures.py        0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6793      +/-   ##
==========================================
+ Coverage   78.44%   78.57%   +0.13%     
==========================================
  Files         568      568              
  Lines       43093    43107      +14     
==========================================
+ Hits        33802    33867      +65     
+ Misses       9291     9240      -51     


@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from 2932c58 to 605e9c9 Compare March 10, 2025 13:39
@khsrali (Contributor) commented Mar 11, 2025

Thanks @agoscinski for following up on #6575

> I changed the calcjob that is killed in the tests to a mock RunningProcess that stays in the running state. The previous tests were not really testing whether the process is killable during the running state, because the ArithmeticAdd CalcJob goes immediately to the waiting state (while it is sleeping it is in the waiting state) and is then killable. The running state can be blocking if the run_fn is blocking, so no kill command can go through. I am not sure it makes sense to test these cases: in the end we want to test that a CalcJob does not get stuck during killing, we do not care so much about custom Processes. I would go forward by only using the ArithmeticAdd calcjob in tests 1-6.

ArithmeticAdd is slow in this sense. The process state was set to RUNNING in #6575. We already have a fixture for that, submit_and_await, which I used:
node = submit_and_await(make_a_builder(sleep_seconds=100), ProcessState.RUNNING)

> From the tests for the EBM (7-11) I only kept test 7. I think it is not important to test that the kill is getting stuck. I see some value in test 9, but kept it short for now. Let me know your opinion about which tests to keep; I think there were too many before.

I remember I covered all possible scenarios in which I manually discovered verdi process kill would fail, so a lot of time and effort was spent designing those tests.
I think removing those tests instead of migrating them is not a good idea 🙃

For example, I specifically remember the very important one: when the kill is getting stuck! I had to design the force kill in a way that would also be able to kill those processes.

P.S.
The other very important one is when there is a history of a failed kill, which could result in -F not working; that inspired tests 7-9.
Tests 10-11 were there to document the current behavior, which nobody knew how it works. I had to spend half a day to understand that this is the behavior. It's important to keep them, so that a change in the code base that changes that behavior wouldn't go unnoticed...

@agoscinski (Contributor, Author) commented Mar 11, 2025

> ArithmeticAdd is slow in this sense. The process state was set to RUNNING in #6575. We already have a fixture for that, submit_and_await, which I used.

The problem is not that; it is that ArithmeticAdd will be in the WAITING state during the sleep, so you are effectively killing a process in the WAITING state. What happens is that the state is RUNNING after submit_and_await, then it goes to WAITING during the sleep, and then it gets killed. I also had the case that submit_and_await failed because the process went so fast to the WAITING state that submit_and_await did not register the RUNNING phase (I think this is related to the sleep command I added). Furthermore, the RUNNING state can be blocking, so I don't think it makes much sense to test killing during the RUNNING state, since you can have a Process that cannot be killed during the RUNNING state. In the end we want to test whether one can kill a CalcJob, not some arbitrary Process. Therefore I suggest to only do the tests for ArithmeticAdd (using the builder that was used before).
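
A minimal sketch of the suggested test shape, assuming the `make_a_builder` fixture from this PR's test module together with the standard aiida-core `submit_and_await`, `run_cli_command` and `started_daemon_client` fixtures (illustrative only, not the exact test added in this PR):

```python
import time

import pytest
from plumpy import ProcessState

from aiida.cmdline.commands import cmd_process


@pytest.mark.usefixtures('started_daemon_client')
def test_kill_waiting_arithmetic_add(submit_and_await, run_cli_command, make_a_builder):
    """Kill an ArithmeticAdd calcjob while its scheduler job sleeps, i.e. in the WAITING state."""
    node = submit_and_await(make_a_builder(sleep_seconds=100), ProcessState.WAITING)

    # Send the kill through the CLI entry point of ``verdi process kill``.
    run_cli_command(cmd_process.process_kill, [str(node.pk)])

    # Poll until the daemon has processed the kill; fail if it gets stuck.
    deadline = time.monotonic() + 60
    while not node.is_terminated:
        assert time.monotonic() < deadline, 'process did not terminate after the kill'
        time.sleep(0.5)

    assert node.is_killed
```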

> I remember I covered all possible scenarios in which I manually discovered verdi process kill would fail, so a lot of time and effort was spent designing those tests.
> I think removing those tests instead of migrating them is not a good idea 🙃

Yes, for sure, that is why I would like to have your okay on the tests. I think we agree more or less; it is more about some details.

> The other very important one is when there is a history of a failed kill, which could result in -F not working; that inspired tests 7-9.

Okay, I misunderstood the documentation of test 8. I will add test 8 back. Test 9 I was anyway considering putting back.

> Tests 10-11 were there to document the current behavior, which nobody knew how it works. I had to spend half a day to understand that this is the behavior. It's important to keep them, so that a change in the code base that changes that behavior wouldn't go unnoticed...

I don't think the tests are the right place. I think this should be developer documentation in aiida-core. Generally, better documentation for the state machine would be very useful. Is it okay if we open an issue for this? Because this would also take quite a long time.

@khsrali (Contributor) commented Mar 11, 2025

> I don't think the tests are the right place. I think this should be developer documentation in aiida-core. Generally, better documentation for the state machine would be very useful. Is it okay if we open an issue for this? Because this would also take quite a long time.

Fair enough, we can take them away. Then we should put them as documentation in some place visible and easily accessible to developers... I don't know where.

@agoscinski (Contributor, Author) commented Mar 11, 2025

Talked with @khsrali; we agreed on these TODOs:

  • Replace the tests 1-6 with only ArithmeticAdd and check with @unkcpz if he agrees
  • Add 8-9 back
  • Make an issue for tests 10-11 to be documented

return result


# TODO this test fails if I run something daemon related before
Contributor (Author):

If I move the test below some other test, I get this error:

  03/10/2025 02:13:37 PM <22178> aiida.engine.daemon.worker: [ERROR] daemon worker failed to start
  Traceback (most recent call last):
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/pytray/aiothreads.py", line 164, in await_
      return self.await_submit(awaitable).result(timeout=self.task_timeout)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/concurrent/futures/_base.py", line 458, in result
      raise TimeoutError()
  TimeoutError
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/Users/alexgo/code/aiida-core/src/aiida/engine/daemon/worker.py", line 55, in start_daemon_worker
      runner = manager.create_daemon_runner()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/src/aiida/manage/manager.py", line 462, in create_daemon_runner
      runner.communicator.add_task_subscriber(task_receiver)
    File "/Users/alexgo/code/plumpy/src/plumpy/communications.py", line 153, in add_task_subscriber
      return self._communicator.add_task_subscriber(converted, identifier)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/kiwipy/rmq/threadcomms.py", line 199, in add_task_subscriber
      return self._loop_scheduler.await_(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/pytray/aiothreads.py", line 168, in await_
      raise concurrent.futures.TimeoutError(
  TimeoutError: add_task_subscriber after 10 seconds

src/aiida/tools/pytest_fixtures/daemon.py:153: RuntimeError


self = <kiwipy.rmq.communicator.RmqCommunicator object at 0x12f19b590>, task = {'args': {'nowait': False, 'pid': 11, 'tag': None}, 'task': 'continue'}, no_reply = True

    async def task_send(self, task, no_reply=False):
        try:
            task_queue = await self.get_default_task_queue()
            result = await task_queue.task_send(task, no_reply)
            return result
        except aio_pika.exceptions.DeliveryError as exception:
>           raise kiwipy.UnroutableError(str(exception))
E           kiwipy.exceptions.UnroutableError: ('NO_ROUTE', 'aiida-280823cefe014e1bab90f6c0ff368ce7.process.queue')

.pixi/envs/default/lib/python3.12/site-packages/kiwipy/rmq/communicator.py:529: UnroutableError

@@ -201,7 +202,7 @@ def kill_processes(
return

controller = get_manager().get_process_controller()
- action = functools.partial(controller.kill_process, msg_text=msg_text)
+ action = functools.partial(controller.kill_process, msg_text=msg_text, force_kill=force_kill)
Member:

In my opinion, this is one of the changes that needs to be tested. The CLI tests are calling this and should not be tested against the daemon, which crosses too many boundaries.

In essence, if I understand correctly, to make verdi process kill --force work, it needs the following (a rough sketch of a propagation check follows below):

  • --force must propagate to kill_processes so that force_kill is set to True
  • The kill_processes function should send the message through RMQ, with the message resolved with force as an attribute of the function.
  • Then it is plumpy's part (overridden by aiida-core's engine/processes/process.py::Process.kill, which is what I called the evil of inheritance) that needs to make sure it calls kill properly, so that is out of the scope of the tests covered in aiida-core.
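
A minimal, daemon-free sketch of the propagation check that the first two points describe; `FakeController` is a hypothetical stand-in for the real RMQ process controller, and the `functools.partial` line mirrors the changed line in `kill_processes` shown above. It illustrates the idea rather than testing aiida-core itself:

```python
import functools


class FakeController:
    """Hypothetical stand-in that records what ``kill_processes`` would send over RMQ."""

    def __init__(self):
        self.calls = []

    def kill_process(self, pid, msg_text=None, force_kill=False):
        self.calls.append((pid, msg_text, force_kill))


def test_force_kill_is_bound_into_the_action():
    controller = FakeController()
    # Mirrors the changed line in ``kill_processes``: the flag is bound into the RMQ action.
    action = functools.partial(controller.kill_process, msg_text='Force killed', force_kill=True)
    action(4321)
    assert controller.calls == [(4321, 'Force killed', True)]
```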

Contributor (Author):

This should now be tested.

@@ -338,7 +338,7 @@ def kill(self, msg_text: str | None = None) -> Union[bool, plumpy.futures.Future

had_been_terminated = self.has_terminated()

- result = super().kill(msg_text)
+ result = super().kill(msg_text, force_kill)
Member:

This is the other thing I think requires a test, but from the change you made, it should be plumpy's responsibility to make sure it kills properly.

Member:

Ah, you already did the change in aiidateam/plumpy#320; I think that is the place that requires more tests to make sure the behavior of kill works as expected.

Contributor (Author):

We now have several tests. Do you think something is still missing?

@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from 6bd708c to 2ce09b7 Compare April 10, 2025 05:27
@khsrali khsrali self-requested a review April 10, 2025 08:02
@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from 35e1ffd to a62d60e Compare April 10, 2025 08:07
@agoscinski agoscinski marked this pull request as ready for review April 12, 2025 08:57
@@ -102,7 +103,7 @@ async def do_upload():
try:
logger.info(f'scheduled request to upload CalcJob<{node.pk}>')
ignore_exceptions = (plumpy.futures.CancelledError, PreSubmitException, plumpy.process_states.Interruption)
- skip_submit = await exponential_backoff_retry(
+ skip_submit = await utils.exponential_backoff_retry(
Contributor (Author):

These changes are important to be able to monkeypatch; otherwise one imports a local reference to the function that cannot be monkeypatched.
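
A generic illustration of the point (not the aiida-core code itself): `monkeypatch.setattr` replaces the attribute on the module object, so only call sites that look the name up on the module at call time see the patch.

```python
import math
from math import sqrt  # local binding, frozen at import time


def call_via_module(x: float) -> float:
    return math.sqrt(x)  # attribute lookup on the module at call time -> patchable


def call_via_local_name(x: float) -> float:
    return sqrt(x)  # uses the local binding -> the patch is invisible here


def test_patching(monkeypatch):
    monkeypatch.setattr(math, 'sqrt', lambda x: 42.0)
    assert call_via_module(4.0) == 42.0      # sees the patched function
    assert call_via_local_name(4.0) == 2.0   # still the original implementation
```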

@@ -201,7 +202,7 @@ def kill_processes(
return

controller = get_manager().get_process_controller()
- action = functools.partial(controller.kill_process, msg_text=msg_text)
+ action = functools.partial(controller.kill_process, msg_text=msg_text, force_kill=force_kill)
Contributor (Author):

This should now be tested.

@@ -760,6 +760,7 @@ def _factory(
f'Daemon <{started_daemon_client.profile.name}|{daemon_status}> log file content: \n'
f'{daemon_log_file}'
)
+ time.sleep(0.1)
Contributor (Author):

To avoid keeping the CPU spinning (busy-waiting) in the polling loop.
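
A minimal sketch of the pattern, with illustrative names rather than the fixture's actual code: poll a condition in a loop, but yield the CPU between checks instead of busy-waiting.

```python
import time


def wait_for(condition, timeout: float = 30.0, interval: float = 0.1) -> None:
    """Poll ``condition`` until it returns True, sleeping between checks."""
    start = time.monotonic()
    while not condition():
        if time.monotonic() - start > timeout:
            raise TimeoutError('condition was not met in time')
        time.sleep(interval)  # yield the CPU instead of spinning at 100%
```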

@@ -282,7 +283,7 @@ def handle_result(result):
try:
# unwrap is need here since LoopCommunicator will also wrap a future
unwrapped = unwrap_kiwi_future(future)
- result = unwrapped.result()
+ result = unwrapped.result(timeout)
Contributor (Author):

This needs to be discussed. When the task is successfully sent but blocks (because it is in the EBM), this will block the kill command. I feel it makes more sense to have the timeout here as well.

Contributor:

Perhaps the timeout should also be passed to plumpy/futures.py, line 100?

@agoscinski (Contributor, Author) commented Apr 28, 2025:

unwrap_kiwi_future is not blocking (or, to be more precise here, the local unwrap function inside unwrap_kiwi_future is not blocking, as it only gets the result when the future is done). So passing the timeout here is enough. The question I had is more: we are waiting here for the result of the kill command. If we time out while waiting for the kill command to finish, the kill command is still in the event loop. If the user sends another kill command, it will be cancelled, which is okay, but if the user does not send another kill command, the process might be killed later (hidden from the user). I will try to add a cancel if we time out.
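
A hedged sketch of the "cancel if timed out" idea, relying on plain concurrent.futures semantics (which kiwipy futures follow); `result_or_cancel` is an illustrative helper, not aiida-core API:

```python
import concurrent.futures


def result_or_cancel(future: concurrent.futures.Future, timeout: float):
    """Return the future's result, cancelling the pending request if it times out."""
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Keep a timed-out kill from silently completing later, hidden from the user.
        future.cancel()
        raise
```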

Member:

In async programming there should always be a timeout for outer resource access.
But I think it may be clearer to use another parameter for this inner unwrap timeout, to distinguish it from the timeout passed to as_completed above.

@agoscinski (Contributor, Author) commented Apr 29, 2025:

Giving the user two timeout options also seems bad, since it complexifies the CLI.

@agoscinski (Contributor, Author) commented Apr 30, 2025:

EDIT: The text below does not apply to unwrap_kiwi_future, as what I wrote is only true for asyncio.Future and not for concurrent.futures.Future (which kiwipy.Future is).


The whole usage of unwrap_kiwi_future actually seems wrong to me. unwrapping.result() can raise an exception if the future (the input of unwrap_kiwi_future) is not done yet, because no result has been set yet in unwrapping, which would be done by calling unwrap. Look at this example where I on purpose do not resolve the inner future immediately.

import asyncio


def unwrap_future(future: asyncio.Future) -> asyncio.Future:
    """
    Unwraps nested futures: if a future resolves to another future, keep unwrapping
    until we get a concrete value.
    """
    loop = asyncio.get_event_loop()
    unwrapping = loop.create_future()

    def unwrap(fut: asyncio.Future):
        try:
            result = fut.result()
            if isinstance(result, asyncio.Future):
                result.add_done_callback(unwrap)
            else:
                unwrapping.set_result(result)
        except Exception as e:
            unwrapping.set_exception(e)

    future.add_done_callback(unwrap)
    return unwrapping


# Simulate this outside of async def to show the error synchronously
loop = asyncio.get_event_loop()

# Step 1: Create nested futures
inner = loop.create_future()
outer = loop.create_future()
outer.set_result(inner)

# Step 2: Unwrap the outer future
unwrapped = unwrap_future(outer)

# Step 3: Try to access result too early
try:
    print("Attempting to access result before unwrapped is done...")
    print(unwrapped.result())  # This raises InvalidStateError
except asyncio.InvalidStateError as e:
    print("Caught expected error:", e)

# Step 4: Now resolve the inner future
inner.set_result("final value")

# Step 5: Give the event loop a moment to run callbacks
loop.run_until_complete(asyncio.sleep(0))  # flush callbacks

# Step 6: Access result again, this time it works
print("Now the result is ready:", unwrapped.result())

Contributor (Author):

> In async programming there should always be a timeout for outer resource access.

That seems wrong in this case, because the outer one just has another Future as its result, so it returns immediately. The total recursive process should apply the timeout until the actual result is retrieved, which is the innermost one. I think @khsrali has a point that it needs to be passed to the last instance that is really blocking.

Contributor (Author):

Okay, I think I understood the whole sequence. The functionality is a bit complex. Only the last future will set the result of unwrapping, so we only want a timeout on the final result. Putting the timeout only on unwrapping is therefore the correct usage. Any timeout on the other futures would just add to the total timeout and is therefore not correct.

Member:

I need more time to look at this. There is such a timeout here that we control, but one needs to be aware that in the pytray dependency the synchronous function is scheduled and has a timeout as well.

> So we only want a timeout on the final result.

This is a valid point, in the context of a chain of futures.

Member:

If unwrap_kiwi_future is the innermost call, then it is the correct change, as you mentioned offline. I don't have any more thoughts or evidence to say whether this is correct or not. It seems a correct and great fix. The tests all pass, so there is nothing more to block it. Cheers.

@agoscinski agoscinski requested a review from unkcpz April 12, 2025 09:06
@khsrali (Contributor) left a comment:

Thanks @agoscinski for the hard work!
Overall it looks good to me; I'll have another look at the tests later.

@@ -531,7 +535,8 @@ def start_daemon(
try:
subprocess.check_output(command, env=env, stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as exception:
raise DaemonException('The daemon failed to start.') from exception
# CalledProcessError is not passing the subprocess stderr in its message so we add it in DaemonException
Contributor:

nice 👍

@@ -282,7 +283,7 @@ def handle_result(result):
try:
# unwrap is need here since LoopCommunicator will also wrap a future
unwrapped = unwrap_kiwi_future(future)
- result = unwrapped.result()
+ result = unwrapped.result(timeout)
Contributor:

Perhaps the timeout should also be passed to plumpy/futures.py, line 100?

@unkcpz (Member) left a comment:

Thanks @agoscinski @khsrali!

I just have some nitpick requests; the implementation looks all good, including the neat testing approach of using multiprocessing to start a customized worker. Cheers!

@@ -282,7 +283,7 @@ def handle_result(result):
try:
# unwrap is need here since LoopCommunicator will also wrap a future
unwrapped = unwrap_kiwi_future(future)
- result = unwrapped.result()
+ result = unwrapped.result(timeout)
Member:

In async programming there should always be a timeout for outer resource access.
But I think it may be clearer to use another parameter for this inner unwrap timeout, to distinguish it from the timeout passed to as_completed above.

@@ -451,7 +451,7 @@ Below is a list with all available subcommands.
--broker-host HOSTNAME Hostname for the message broker. [default: 127.0.0.1]
--broker-port INTEGER Port for the message broker. [default: 5672]
--broker-virtual-host TEXT Name of the virtual host for the message broker without
- leading forward slash.
+ leading forward slash. [default: ""]
Contributor (Author):

Not sure what is automatically added here; I need to check this.

pre-commit-ci bot and others added 2 commits May 1, 2025 10:53
As this option is only used there, it does not need to be in a common file.
@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from ce220dd to 83e4d84 Compare May 1, 2025 08:53
@agoscinski agoscinski mentioned this pull request May 1, 2025
@unkcpz (Member) left a comment:

I just want to see, once the pre-commit is brought back, whether all changes here pass the mypy check. I don't have other comments on the changes. Cheers!


@agoscinski agoscinski requested a review from unkcpz May 2, 2025 04:31
@agoscinski agoscinski merged commit b6d0fe5 into aiidateam:main May 2, 2025
26 checks passed
@agoscinski agoscinski deleted the fix/6524/force-kill-v2 branch May 2, 2025 04:41
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
PR aiidateam#6793 introduced the cancellation of earlier kill actions. This had
the problem that if two kill commands are issued in sequence, the second kill
action cancels the first one, which had triggered the cancellation of the
scheduler job within an EBM. The second kill command, however, did not
retrigger the cancellation of the scheduler job. This bug appeared
because the killing logic is placed in two different places. More
information about this can be found in PR aiidateam#6868, which fixes this properly
by refactoring the kill action. This PR only serves as a fast temporary fix
with workarounds.

Before this PR, when the kill command failed through the EBM, the
scheduler job could not be cancelled through a kill anymore. Since we
now have a force-kill option to bypass the EBM, we can reschedule the
cancellation of the scheduler job to gracefully kill a process.
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

agoscinski added a commit that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 2, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 2, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit that referenced this pull request Jun 4, 2025
…f scheduler job (#6870)

PR #6793 introduced logic to cancel earlier kill actions when a new one is
issued. However, this led to a bug: when two kill actions are sent in
succession, the second cancels the first, including its triggered cancellation of
the scheduler job that is stuck in the EBM. The second kill command does not
re-initiate the scheduler job cancellation, leaving it in an inconsistent state.
This issue arises because the kill logic is split across two places. PR #6868
addresses this properly with a refactor of the kill action. In contrast, this PR
provides a temporary workaround to mitigate the issue.

Furthermore, before this PR, if the kill command failed through the EBM, the
cancellation could not be rescheduled by a subsequent kill action, as the process
state had already transitioned to the `Excepted` state. With the new force-kill option
the user can bypass the EBM as desired, thus the transition to the `Excepted`
state is not needed anymore and has been removed to allow the user to still
gracefully kill a job.

---------

Co-authored-by: Ali Khosravi <[email protected]>
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
The original design of `wait` and `timeout` was to distinguish between
actions that return immediately and actions that are scheduled. This
mechanism was however never used, and it resulted in a misinterpretation
in the force-kill PR aiidateam#6793, introducing a bug fixed in PR aiidateam#6870.
The mechanism of `wait` and `timeout` was also never correctly implemented.
In this PR we rectify the logic and simplify it by handling immediate
and scheduled actions the same way.

Related commits in aiida-core 8388018, cd0d15c and plumpy 1b6ecb8

One can specify a `timeout <= 0` to express that the action should not wait for
a response, while one can specify `timeout==float('inf')` (the default value) to wait
until a response has been received, without a timeout.
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit that referenced this pull request Jun 11, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 11, 2025
agoscinski added a commit that referenced this pull request Jun 11, 2025
…|play|pause}` (#6902)

The original design of `wait` and `timeout` was to distinguish between actions
that return immediately and actions that are scheduled. This mechanism was however
never used, and it resulted in a misinterpretation in the force-kill PR #6793,
introducing a bug fixed in PR #6870. The mechanism of `wait` and `timeout` was
also never correctly implemented. In this PR we rectify the logic and simplify
it by handling immediate and scheduled actions the same way.

My interpretation is that the original logic (seen in commit 8388018) is that it
unwraps the future once; if the result is another future, then it is a scheduled action,
otherwise an immediate one. Then 1b6ecb8 in plumpy introduced a double wrapping
of the return value for `play`, `pause` and `kill`, which I think is correct, as
these are scheduled actions. The problem appears in cd0d15c, where on the
aiida-core side the unwrapping in the first round is always done down to the actual
(non-future) result by using `unwrap_kiwi_future`. This makes the usage of
`wait` completely obsolete, as we always get the final non-future result in the
first step. Also, the `timeout` was not passed to `result`, which made it a
blocking action for scheduled tasks (the `timeout` is only applied on the
first layer of futures in `futures.as_completed`, which is meaningless since the
first layer is always a future and returns immediately).

This is fixed by simplifying the two-step procedure (unwrap once and, if the result is
a future, unwrap again) into one step: unwrap until a non-future result is obtained,
with a timeout. One can specify a `timeout == 0` to express that the action should not wait for
a response, while one can specify `timeout==float('inf')` (the default value) to wait
until a response has been received, without a timeout.
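
A hedged sketch of the one-step unwrapping described above, using plain concurrent.futures semantics (which kiwipy futures follow): resolve nested futures until a non-future result is obtained, charging the whole chain against a single timeout. The helper name and loop are illustrative, not the actual aiida-core implementation.

```python
import concurrent.futures
import time


def unwrapped_result(future: concurrent.futures.Future, timeout: float | None = None):
    """Resolve nested futures until a concrete result is obtained, within one overall timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    result = future
    while isinstance(result, concurrent.futures.Future):
        remaining = None if deadline is None else max(0.0, deadline - time.monotonic())
        # Raises concurrent.futures.TimeoutError once the overall budget is exhausted.
        result = result.result(timeout=remaining)
    return result
```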
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 17, 2025
…|play|pause}` (aiidateam#6902)
