Implement force-kill option in verdi process kill #6793
Conversation
Force-pushed from d9941bd to 4d09fa8.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #6793      +/-   ##
==========================================
+ Coverage   78.44%   78.57%   +0.13%
==========================================
  Files         568      568
  Lines       43093    43107      +14
==========================================
+ Hits        33802    33867      +65
+ Misses       9291     9240      -51
```

View full report in Codecov by Sentry.
Force-pushed from 2932c58 to 605e9c9.
Thanks @agoscinski for following up on #6575.
I remember I covered all the scenarios that I discovered manually. For example, I specifically remember the very important one: when the kill gets stuck! I had to design the force kill in a way that would also be able to kill those cases.
The problem is not that; it is that ArithmeticAdd will be in the waiting state.
Yes, for sure, that is why I would like to have your okay on the tests. I think we more or less agree; it is more about some details.
Okay, I misunderstood the documentation of test 8, so I will add it back. I was considering putting test 9 back anyway.
I don't think the tests are the right place. I think this should be developer documentation in aiida-core. Generally, better documentation for the state machine would be very useful. Is it okay if we open an issue for this? Because this would also take quite a long time.
Fair enough, we can take them away. Then we should put them as documentation in some place visible and easily accessible to developers. I don't know where.
```python
return result

# TODO this test fails if I run something daemon related before
```
If I move the test below some other test I get this error
03/10/2025 02:13:37 PM <22178> aiida.engine.daemon.worker: [ERROR] daemon worker failed to start
Traceback (most recent call last):
File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/pytray/aiothreads.py", line 164, in await_
return self.await_submit(awaitable).result(timeout=self.task_timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/concurrent/futures/_base.py", line 458, in result
raise TimeoutError()
TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/alexgo/code/aiida-core/src/aiida/engine/daemon/worker.py", line 55, in start_daemon_worker
runner = manager.create_daemon_runner()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/alexgo/code/aiida-core/src/aiida/manage/manager.py", line 462, in create_daemon_runner
runner.communicator.add_task_subscriber(task_receiver)
File "/Users/alexgo/code/plumpy/src/plumpy/communications.py", line 153, in add_task_subscriber
return self._communicator.add_task_subscriber(converted, identifier)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/kiwipy/rmq/threadcomms.py", line 199, in add_task_subscriber
return self._loop_scheduler.await_(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/pytray/aiothreads.py", line 168, in await_
raise concurrent.futures.TimeoutError(
TimeoutError: add_task_subscriber after 10 seconds
src/aiida/tools/pytest_fixtures/daemon.py:153: RuntimeError
self = <kiwipy.rmq.communicator.RmqCommunicator object at 0x12f19b590>, task = {'args': {'nowait': False, 'pid': 11, 'tag': None}, 'task': 'continue'}, no_reply = True
async def task_send(self, task, no_reply=False):
try:
task_queue = await self.get_default_task_queue()
result = await task_queue.task_send(task, no_reply)
return result
except aio_pika.exceptions.DeliveryError as exception:
> raise kiwipy.UnroutableError(str(exception))
E kiwipy.exceptions.UnroutableError: ('NO_ROUTE', 'aiida-280823cefe014e1bab90f6c0ff368ce7.process.queue')
.pixi/envs/default/lib/python3.12/site-packages/kiwipy/rmq/communicator.py:529: UnroutableError
```diff
@@ -201,7 +202,7 @@ def kill_processes(
     return

     controller = get_manager().get_process_controller()
-    action = functools.partial(controller.kill_process, msg_text=msg_text)
+    action = functools.partial(controller.kill_process, msg_text=msg_text, force_kill=force_kill)
```
In my opinion, this is one of the changes that needs to be tested. The CLI tests call this and should not test against the daemon, which crosses too many boundaries. In essence, if I understand correctly, to make `verdi process kill --force` work it needs to:

- propagate `--force` to `kill_processes` so that `force_kill` is set to `True`;
- have the `kill_processes` function send the message through RMQ with the message resolved with `force` as an attribute of the function;
- then the plumpy part (overridden by `aiida-core/engine/processes/process.py::Process::kill`, which is what I called the evil of inheritance) needs to make sure it calls kill properly, so that is out of the scope of the tests covered in aiida-core.
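For illustration only: the forwarding mechanics are just `functools.partial` binding the keyword argument once for all process PKs. The `kill_process` stub below is hypothetical and not the real controller API.

```python
import functools


def kill_process(pk, msg_text=None, force_kill=False):
    """Hypothetical stand-in for controller.kill_process, used only to show the binding."""
    return pk, msg_text, force_kill


# Mirrors the changed line: force_kill is bound once and applied to every pk.
action = functools.partial(kill_process, msg_text='killed by user', force_kill=True)

assert action(11) == (11, 'killed by user', True)
assert action(12) == (12, 'killed by user', True)
```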
This should now be tested.
```diff
@@ -338,7 +338,7 @@ def kill(self, msg_text: str | None = None) -> Union[bool, plumpy.futures.Future

     had_been_terminated = self.has_terminated()

-    result = super().kill(msg_text)
+    result = super().kill(msg_text, force_kill)
```
This is the other thing I think requires a test, but from the change you made, it should be plumpy's responsibility to make sure it kills properly.
Ah, you already made the change in aiidateam/plumpy#320; I think that is the place that requires more tests to make sure the kill behavior works as expected.
We now have several tests. Do you think something is still missing?
Force-pushed from 6bd708c to 2ce09b7.
Force-pushed from 35e1ffd to a62d60e.
```diff
@@ -102,7 +103,7 @@ async def do_upload():
     try:
         logger.info(f'scheduled request to upload CalcJob<{node.pk}>')
         ignore_exceptions = (plumpy.futures.CancelledError, PreSubmitException, plumpy.process_states.Interruption)
-        skip_submit = await exponential_backoff_retry(
+        skip_submit = await utils.exponential_backoff_retry(
```
These changes are important to be able to monkeypatch; otherwise one imports a local copy of the function that cannot be monkeypatched.
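A standalone sketch of the underlying Python behaviour, using a throwaway namespace instead of the real `aiida.engine.utils` module (names here are purely illustrative):

```python
import types

# Stand-in for a module such as aiida.engine.utils (illustrative only).
utils = types.SimpleNamespace(exponential_backoff_retry=lambda: 'real')

# Like `from utils import exponential_backoff_retry`: captures the function object now.
exponential_backoff_retry = utils.exponential_backoff_retry


def run_direct():
    return exponential_backoff_retry()  # uses the captured object, blind to later patches


def run_via_module():
    return utils.exponential_backoff_retry()  # attribute lookup on the module at call time


# What monkeypatch.setattr(utils, 'exponential_backoff_retry', ...) effectively does:
utils.exponential_backoff_retry = lambda: 'patched'

print(run_direct())      # 'real'    -- the local binding cannot be monkeypatched
print(run_via_module())  # 'patched' -- only the module-attribute lookup sees the patch
```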
```diff
@@ -760,6 +760,7 @@ def _factory(
             f'Daemon <{started_daemon_client.profile.name}|{daemon_status}> log file content: \n'
             f'{daemon_log_file}'
         )
+        time.sleep(0.1)
```
To not keep the CPU spinning.
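The pattern in question is a poll-with-sleep loop; a generic sketch (the `condition` callable and the interval are placeholders, not the fixture's actual code):

```python
import time


def wait_until(condition, poll_interval=0.1, timeout=30.0):
    """Poll `condition()` until it is true, sleeping between checks so the CPU is not pinned at 100%."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() > deadline:
            raise TimeoutError('condition was not met in time')
        time.sleep(poll_interval)  # yield the CPU instead of busy-waiting
```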
```diff
@@ -282,7 +283,7 @@ def handle_result(result):
     try:
         # unwrap is need here since LoopCommunicator will also wrap a future
         unwrapped = unwrap_kiwi_future(future)
-        result = unwrapped.result()
+        result = unwrapped.result(timeout)
```
This needs to be discussed. When the task is successfully sent but blocking (because it is in the EBM), this will block the kill command. I feel like it makes more sense to have the timeout here as well.
Perhaps `timeout` should also be passed to `plumpy/futures.py`, line 100?
`unwrap_kiwi_future` is not blocking (or, to be more precise, the local `unwrap` function inside `unwrap_kiwi_future` is not blocking, as it only gets the result when the future is done). So passing the timeout here is enough. The question I had is rather this: we are waiting here for the result of the kill command. If we time out while waiting for the kill command to finish, it is still in the event loop. If the user sends another kill command, it will be cancelled, which is ok, but if the user does not send another kill command, the process might be killed later (hidden from the user). I will try to add a cancel if it times out.
In async programming there should always be a timeout for outer resource access. But I think it may be clearer to use another parameter for this inner unwrap timeout, to distinguish it from the `timeout` passed to `as_completed` above.
Giving the user two timeout options also seems bad, since it complicates the CLI.
EDIT: The text below does not apply to `unwrap_kiwi_future`, as what I wrote is only true for `asyncio.Future` and not for `concurrent.futures.Future` (which `kiwipy.Future` is).

The whole usage of `unwrap_kiwi_future` seems to me actually wrong. The `unwrapping.result()` can raise an exception if `future` (the input of `unwrap_kiwi_future`) is not done yet, because no result has yet been set in `unwrapping`, which would be done by calling `unwrap`. Look at this example where I on purpose do not mark the `outer` future as done.
```python
import asyncio


def unwrap_future(future: asyncio.Future) -> asyncio.Future:
    """
    Unwraps nested futures: if a future resolves to another future, keep unwrapping
    until we get a concrete value.
    """
    loop = asyncio.get_event_loop()
    unwrapping = loop.create_future()

    def unwrap(fut: asyncio.Future):
        try:
            result = fut.result()
            if isinstance(result, asyncio.Future):
                result.add_done_callback(unwrap)
            else:
                unwrapping.set_result(result)
        except Exception as e:
            unwrapping.set_exception(e)

    future.add_done_callback(unwrap)
    return unwrapping


# Simulate this outside of async def to show the error synchronously
loop = asyncio.get_event_loop()

# Step 1: Create nested futures
inner = loop.create_future()
outer = loop.create_future()
outer.set_result(inner)

# Step 2: Unwrap the outer future
unwrapped = unwrap_future(outer)

# Step 3: Try to access result too early
try:
    print("Attempting to access result before unwrapped is done...")
    print(unwrapped.result())  # This raises InvalidStateError
except asyncio.InvalidStateError as e:
    print("Caught expected error:", e)

# Step 4: Now resolve the inner future
inner.set_result("final value")

# Step 5: Give the event loop a moment to run callbacks
loop.run_until_complete(asyncio.sleep(0))  # flush callbacks

# Step 6: Access result again, this time it works
print("Now the result is ready:", unwrapped.result())
```
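For contrast, a `kiwipy.Future` is a `concurrent.futures.Future`, whose `result()` blocks (optionally up to a timeout) rather than raising `InvalidStateError`; a small standard-library sketch, unrelated to the aiida-core code path:

```python
import concurrent.futures
import threading

outer = concurrent.futures.Future()

# Resolve the future from another thread after 0.1 s.
threading.Timer(0.1, lambda: outer.set_result('final value')).start()

try:
    # Not done yet: result(timeout) blocks for up to 0.01 s, then raises TimeoutError.
    outer.result(timeout=0.01)
except concurrent.futures.TimeoutError:
    print('timed out while waiting')

# Now wait long enough for the timer to fire: prints 'final value'.
print(outer.result(timeout=1))
```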
> In async programming there should always be a timeout for outer resource access.

That seems wrong in this case, because the outer future just has another future as its result, so it returns immediately. The total recursive process should apply the timeout until the actual result is retrieved, which is the innermost one. I think @khsrali has a point that it needs to be passed to the last instance that is really blocking.
Okay, I think I understood the whole sequence. The functionality is a bit complex. Only the last future will set the result of `unwrapping`. So we only want a timeout on the final result, and putting the timeout only on `unwrapping` is the correct usage. Any timeout on the other futures would just increase the total timeout and is therefore not correct.
I need more time to look at this. There is such a timeout here that we control, but we need to be aware that in the `pytray` dependency the synchronous function is scheduled and has a timeout as well.

> So we want to only have a timeout on the final result.

This is a valid point in the context of a chain of futures.
If `unwrap_kiwi_future` is the innermost call, then this is the correct change, as you mentioned offline. I don't have any more thoughts or evidence to say whether this is correct or not; it seems to be a correct and great fix. The tests all pass, so nothing more blocks it. Cheers.
Thanks @agoscinski for the hard work! Overall it looks good to me; I'll have another look at the tests later.
```diff
@@ -531,7 +535,8 @@ def start_daemon(
     try:
         subprocess.check_output(command, env=env, stderr=subprocess.STDOUT)
     except subprocess.CalledProcessError as exception:
-        raise DaemonException('The daemon failed to start.') from exception
+        # CalledProcessError is not passing the subprocess stderr in its message so we add it in DaemonException
```
nice 👍
Thanks @agoscinski @khsrali! I just have some nitpick requests; the implementation all looks good, including the neat test approach of using multiprocessing to start a customized worker. Cheers!
```diff
@@ -451,7 +451,7 @@ Below is a list with all available subcommands.
     --broker-host HOSTNAME        Hostname for the message broker. [default: 127.0.0.1]
     --broker-port INTEGER         Port for the message broker. [default: 5672]
     --broker-virtual-host TEXT    Name of the virtual host for the message broker without
-                                  leading forward slash.
+                                  leading forward slash. [default: ""]
```
Not sure what is automatically added here; I need to check this.
Force-pushed from d053fd4 to c6432c3.
for more information, see https://pre-commit.ci
As this option is only used there it does not need to be in a common file.
Force-pushed from ce220dd to 83e4d84.
I just want to check, once the pre-commit is brought back, whether all changes here pass the mypy check. I don't have other comments on the changes. Cheers!
It already ran through https://github.com/aiidateam/aiida-core/actions/runs/14772755215/job/41475532064
PR aiidateam#6793 introduced the cancelation of earlier kill actions. This had the problem that if two kill commands are issued in sequence, the second kill action cancels the first one, which had triggered the cancelation of the scheduler job within the EBM. The second kill command, however, did not retrigger the cancelation of the scheduler job. This bug appeared because the killing logic lives in two places. More information can be found in PR aiidateam#6868, which fixes this properly by refactoring the kill action. This PR only serves as a fast temporary fix with workarounds. Before this PR, when the kill command failed through the EBM, the scheduler job could not be cancelled through a kill anymore. Since we now have the force-kill option to bypass the EBM, we can reschedule the cancelation of the scheduler job to gracefully kill a process.
…f scheduler job (#6870)

PR #6793 introduced logic to cancel earlier kill actions when a new one is issued. However, this led to a bug: when two kill actions are sent in succession, the second cancels the first, including its triggered cancelation of the scheduler job that is stuck in the EBM. The second kill command does not re-initiate the scheduler job cancelation, leaving it in an inconsistent state. This issue arises because the kill logic is split across two places. PR #6868 addresses this properly with a refactor of the kill action; in contrast, this PR provides a temporary workaround to mitigate the issue.

Furthermore, before this PR, if the kill command failed through the EBM, the cancelation could not be rescheduled by a subsequent kill action, as the process state had already transitioned to the `Excepted` state. With the new force-kill option the user can bypass the EBM as desired, so the transition to the `Excepted` state is no longer needed and has been removed, allowing the user to still gracefully kill a job.

---------

Co-authored-by: Ali Khosravi <[email protected]>
The original design of `wait` and `timeout` was to distinguish between actions that return immediately and actions that are scheduled. This mechanism was however never used and resulted in a misinterpretation in the force-kill PR aiidateam#6793, introducing a bug fixed in PR aiidateam#6870. The mechanism of `wait` and `timeout` was also never correctly implemented. In this PR we rectify the logic and simplify it by handling immediate and scheduled actions the same way. Related commits: aiida-core 8388018 and cd0d15c, plumpy 1b6ecb8. One can specify a `timeout <= 0` to express that the action should not wait for a response, while one can specify `timeout == float('inf')` (the default value) to wait for a response without a timeout.
…|play|pause}` (#6902)

The original design of `wait` and `timeout` was to distinguish between actions that return immediately and actions that are scheduled. This mechanism was however never used and resulted in a misinterpretation in the force-kill PR #6793, introducing a bug fixed in PR #6870. The mechanism of `wait` and `timeout` was also never correctly implemented. In this PR we rectify the logic and simplify it by handling immediate and scheduled actions the same way.

My interpretation is that the original logic (seen in commit 8388018) unwraps the future once: if the result is another future, it is a scheduled action, otherwise an immediate one. Then 1b6ecb8 in plumpy introduced a double wrapping of the return value for `play`, `pause` and `kill`, which I think is correct as these are scheduled actions. The problem appears in cd0d15c, where on the aiida-core side the first round of unwrapping is always done down to the actual (non-future) result by using `unwrap_kiwi_future`. This makes the usage of `wait` completely obsolete, as we always get the final non-future result in the first step. Also, the `timeout` was not passed to `result`, which made it a blocking action for scheduled tasks (the `timeout` is only applied to the first layer of futures in `futures.as_completed`, which is meaningless since the first layer is always a future and returns immediately). This is fixed by simplifying the two-step procedure (unwrap once and, if it is a future, unwrap again) to one step: unwrap until a non-future result is obtained, with a timeout.

One can specify a `timeout == 0` to express that the action should not wait for a response, while one can specify `timeout == float('inf')` (the default value) to wait for a response without a timeout.
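A sketch of the described `timeout` convention on top of a plain `concurrent.futures.Future`; the helper name is made up and this is not the aiida-core implementation:

```python
import concurrent.futures
import math


def wait_for_response(future: concurrent.futures.Future, timeout: float = math.inf):
    """timeout <= 0: do not wait; timeout == inf: wait indefinitely; otherwise wait up to `timeout` seconds."""
    if timeout <= 0:
        # Fire-and-forget: return whatever is there without blocking.
        return future.result(timeout=0) if future.done() else None
    # concurrent.futures treats timeout=None as "block until the result is set".
    return future.result(timeout=None if math.isinf(timeout) else timeout)
```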
I don't want to remove the contributions and conversations by force-pushing to the older branch (in PR #6575), therefore I created a new PR on a new branch.
This PR goes with the plumpy PR aiidateam/plumpy#320, which has not changed much from the original one.
I changed the killing CalcJob to the mock `RunningProcess` that stays in the running state. The tests before were not really testing whether a process is killable during the running state, as the ArithmeticAdd CalcJob goes immediately to the waiting state (while it is sleeping it is in the waiting state) and is then killable. Running can be blocking if the `run_fn` is blocking, so no kill command can go through. I am not sure if it makes sense to test these cases, since in the end we want to test that a CalcJob does not get stuck during killing; we do not care so much about custom Processes. I would go forward by only using the ArithmeticAdd CalcJob in tests 1-6.

From the tests for the EBM (7-11) I only kept 7. I think it is not important to test that the kill gets stuck. I see some value in test 9, but kept it short for now. Let me know your opinion about which tests to keep; I think before there were too many.
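The point about a blocking `run_fn` can be illustrated with a bare asyncio sketch (no aiida involved): while a synchronous call blocks the event loop, a scheduled "kill" callback cannot run.

```python
import asyncio
import time


async def main():
    loop = asyncio.get_running_loop()
    kill_handled = asyncio.Event()

    def handle_kill():
        # Stand-in for the kill callback delivered via the communicator.
        print('kill handled')
        kill_handled.set()

    loop.call_soon(handle_kill)

    # Stand-in for a blocking run_fn executed on the event loop thread:
    # nothing else, including the kill callback, can run until it returns.
    time.sleep(1)
    print('blocking run_fn finished')

    await kill_handled.wait()  # the callback only ran after the blocking call returned


asyncio.run(main())
```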