Implement force-kill option in verdi process kill #6793


Merged: agoscinski merged 11 commits into aiidateam:main from fix/6524/force-kill-v2 on May 2, 2025

Conversation

@agoscinski (Contributor) commented Mar 7, 2025

I don't want to lose the contributions and conversations by force-pushing to the older branch (in PR #6575), therefore I created a new PR on a new branch.


This PR goes together with the plumpy PR aiidateam/plumpy#320, which has not changed much from the original one.


I changed the calcjob that is killed in the tests to a mock RunningProcess that stays in the running state. The previous tests were not really testing whether the process is killable during the running state, because the ArithmeticAdd CalcJob goes immediately to the waiting state (while it is sleeping it is in the waiting state) and is then killable. The running state can be blocking if the run_fn is blocking, so no kill command can go through. I am not sure it makes sense to test these cases: in the end we want to test that a CalcJob does not get stuck during killing, we do not care so much about custom Processes. I would go forward by only using the ArithmeticAdd calcjob in tests 1-6.

From the tests for the EBM (7-11) I only kept test 7. I think it is not important to test that the kill is getting stuck. I see some value in test 9, but kept it short for now. Let me know your opinion about which tests to keep; I think there were too many before.

@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from d9941bd to 4d09fa8 Compare March 8, 2025 07:29
codecov bot commented Mar 8, 2025

Codecov Report

Attention: Patch coverage is 91.42857% with 3 lines in your changes missing coverage. Please review.

Project coverage is 78.57%. Comparing base (6eb17a7) to head (83e4d84).
Report is 1 commit behind head on main.

Files with missing lines                         Patch %   Lines
src/aiida/engine/processes/calcjobs/tasks.py     88.89%    1 Missing ⚠️
src/aiida/engine/processes/control.py            87.50%    1 Missing ⚠️
src/aiida/manage/tests/pytest_fixtures.py        0.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6793      +/-   ##
==========================================
+ Coverage   78.44%   78.57%   +0.13%     
==========================================
  Files         568      568              
  Lines       43093    43107      +14     
==========================================
+ Hits        33802    33867      +65     
+ Misses       9291     9240      -51     


@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from 2932c58 to 605e9c9 Compare March 10, 2025 13:39
@khsrali (Contributor) commented Mar 11, 2025

Thanks @agoscinski for following up on #6575

> I changed the calcjob that is killed in the tests to a mock RunningProcess that stays in the running state. The previous tests were not really testing whether the process is killable during the running state, because the ArithmeticAdd CalcJob goes immediately to the waiting state (while it is sleeping it is in the waiting state) and is then killable. The running state can be blocking if the run_fn is blocking, so no kill command can go through. I am not sure it makes sense to test these cases: in the end we want to test that a CalcJob does not get stuck during killing, we do not care so much about custom Processes. I would go forward by only using the ArithmeticAdd calcjob in tests 1-6.

ArithmeticAdd is slow in this sense. The process state was set to RUNNING in #6575. We already have a fixture for that, submit_and_await, which I used:
node = submit_and_await(make_a_builder(sleep_seconds=100), ProcessState.RUNNING)

> From the tests for the EBM (7-11) I only kept test 7. I think it is not important to test that the kill is getting stuck. I see some value in test 9, but kept it short for now. Let me know your opinion about which tests to keep; I think there were too many before.

I remember I covered all possible scenarios in which I manually discovered verdi process kill would fail, so a lot of time and effort was spent designing those tests.
I think removing those tests instead of migrating them is not a good idea 🙃

For example, I specifically remember the very important one: when the kill is getting stuck! I had to design the force kill in a way that would also be able to kill those processes.

P.S.
The other very important one is when there is a history of a failed kill, which could result in -F not working; that inspired tests 7-9.
Tests 10-11 were there to document the current behavior, which nobody knew how it works. I had to spend half a day to understand that this is the behavior. It's important to keep them, so that a change in the code base that changes that behavior wouldn't go unnoticed...

@agoscinski (Contributor, Author) commented Mar 11, 2025

> ArithmeticAdd is slow in this sense. The process state was set to RUNNING in #6575. We already have a fixture for that, submit_and_await, which I used.

The problem is not that; it is that ArithmeticAdd will be in the WAITING state during the sleep, so you are effectively killing a process in the WAITING state. What happens is that the state is RUNNING after submit_and_await, then it goes to WAITING during the sleep, and then it gets killed. I also had the case that submit_and_await failed because the process went so fast to the WAITING state that submit_and_await did not register the RUNNING phase (I think this is related to the sleep command I added). Furthermore, the RUNNING state can be blocking, so I don't think it makes much sense to test killing during the RUNNING state, since you can have a Process that cannot be killed during the RUNNING state. In the end we want to test whether one can kill a CalcJob, not some arbitrary Process. Therefore I suggest to only do the tests for ArithmeticAdd (using the builder that was used before).
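
A minimal sketch of the suggested test shape, assuming the `make_a_builder` fixture from this PR's test module together with the standard aiida-core `submit_and_await`, `run_cli_command` and `started_daemon_client` fixtures (illustrative only, not the exact test added in this PR):

```python
import time

import pytest
from plumpy import ProcessState

from aiida.cmdline.commands import cmd_process


@pytest.mark.usefixtures('started_daemon_client')
def test_kill_waiting_arithmetic_add(submit_and_await, run_cli_command, make_a_builder):
    """Kill an ArithmeticAdd calcjob while its scheduler job sleeps, i.e. in the WAITING state."""
    node = submit_and_await(make_a_builder(sleep_seconds=100), ProcessState.WAITING)

    # Send the kill through the CLI entry point of ``verdi process kill``.
    run_cli_command(cmd_process.process_kill, [str(node.pk)])

    # Poll until the daemon has processed the kill; fail if it gets stuck.
    deadline = time.monotonic() + 60
    while not node.is_terminated:
        assert time.monotonic() < deadline, 'process did not terminate after the kill'
        time.sleep(0.5)

    assert node.is_killed
```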

> I remember I covered all possible scenarios in which I manually discovered verdi process kill would fail, so a lot of time and effort was spent designing those tests.
> I think removing those tests instead of migrating them is not a good idea 🙃

Yes, for sure, that is why I would like to have your okay on the tests. I think we agree more or less; it is more about some details.

> The other very important one is when there is a history of a failed kill, which could result in -F not working; that inspired tests 7-9.

Okay, I misunderstood the documentation of test 8. I will add test 8 back. Test 9 I was anyway considering putting back.

> Tests 10-11 were there to document the current behavior, which nobody knew how it works. I had to spend half a day to understand that this is the behavior. It's important to keep them, so that a change in the code base that changes that behavior wouldn't go unnoticed...

I don't think the tests are the right place. I think this should be developer documentation in aiida-core. Generally, better documentation for the state machine would be very useful. Is it okay if we open an issue for this? Because this would also take quite a long time.

@khsrali (Contributor) commented Mar 11, 2025

> I don't think the tests are the right place. I think this should be developer documentation in aiida-core. Generally, better documentation for the state machine would be very useful. Is it okay if we open an issue for this? Because this would also take quite a long time.

Fair enough, we can take them away. Then we should put them as documentation in some place visible and easily accessible to developers... I don't know where.

@agoscinski (Contributor, Author) commented Mar 11, 2025

Talked with @khsrali; we agreed on these TODOs:

  • Replace the tests 1-6 with only ArithmeticAdd and check with @unkcpz if he agrees
  • Add 8-9 back
  • Make an issue for tests 10-11 to be documented

return result


# TODO this test fails if I run something daemon related before
Contributor (Author):

If I move the test below some other test, I get this error:

  03/10/2025 02:13:37 PM <22178> aiida.engine.daemon.worker: [ERROR] daemon worker failed to start
  Traceback (most recent call last):
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/pytray/aiothreads.py", line 164, in await_
      return self.await_submit(awaitable).result(timeout=self.task_timeout)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/concurrent/futures/_base.py", line 458, in result
      raise TimeoutError()
  TimeoutError
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "/Users/alexgo/code/aiida-core/src/aiida/engine/daemon/worker.py", line 55, in start_daemon_worker
      runner = manager.create_daemon_runner()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/src/aiida/manage/manager.py", line 462, in create_daemon_runner
      runner.communicator.add_task_subscriber(task_receiver)
    File "/Users/alexgo/code/plumpy/src/plumpy/communications.py", line 153, in add_task_subscriber
      return self._communicator.add_task_subscriber(converted, identifier)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/kiwipy/rmq/threadcomms.py", line 199, in add_task_subscriber
      return self._loop_scheduler.await_(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/Users/alexgo/code/aiida-core/.pixi/envs/default/lib/python3.12/site-packages/pytray/aiothreads.py", line 168, in await_
      raise concurrent.futures.TimeoutError(
  TimeoutError: add_task_subscriber after 10 seconds

src/aiida/tools/pytest_fixtures/daemon.py:153: RuntimeError


self = <kiwipy.rmq.communicator.RmqCommunicator object at 0x12f19b590>, task = {'args': {'nowait': False, 'pid': 11, 'tag': None}, 'task': 'continue'}, no_reply = True

    async def task_send(self, task, no_reply=False):
        try:
            task_queue = await self.get_default_task_queue()
            result = await task_queue.task_send(task, no_reply)
            return result
        except aio_pika.exceptions.DeliveryError as exception:
>           raise kiwipy.UnroutableError(str(exception))
E           kiwipy.exceptions.UnroutableError: ('NO_ROUTE', 'aiida-280823cefe014e1bab90f6c0ff368ce7.process.queue')

.pixi/envs/default/lib/python3.12/site-packages/kiwipy/rmq/communicator.py:529: UnroutableError

@@ -201,7 +202,7 @@ def kill_processes(
return

controller = get_manager().get_process_controller()
- action = functools.partial(controller.kill_process, msg_text=msg_text)
+ action = functools.partial(controller.kill_process, msg_text=msg_text, force_kill=force_kill)
Member:

In my opinion, this is one of the changes that needs to be tested. The CLI tests are calling this and should not be tested against the daemon, which crosses too many boundaries.

In essence, if I understand correctly, to make verdi process kill --force work, it needs the following (a rough sketch of a propagation check follows below):

  • --force must propagate to kill_processes so that force_kill is set to True
  • The kill_processes function should send the message through RMQ, with the message resolved with force as an attribute of the function.
  • Then it is plumpy's part (overridden by aiida-core's engine/processes/process.py::Process.kill, which is what I called the evil of inheritance) that needs to make sure it calls kill properly, so that is out of the scope of the tests covered in aiida-core.
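
A minimal, daemon-free sketch of the propagation check that the first two points describe; `FakeController` is a hypothetical stand-in for the real RMQ process controller, and the `functools.partial` line mirrors the changed line in `kill_processes` shown above. It illustrates the idea rather than testing aiida-core itself:

```python
import functools


class FakeController:
    """Hypothetical stand-in that records what ``kill_processes`` would send over RMQ."""

    def __init__(self):
        self.calls = []

    def kill_process(self, pid, msg_text=None, force_kill=False):
        self.calls.append((pid, msg_text, force_kill))


def test_force_kill_is_bound_into_the_action():
    controller = FakeController()
    # Mirrors the changed line in ``kill_processes``: the flag is bound into the RMQ action.
    action = functools.partial(controller.kill_process, msg_text='Force killed', force_kill=True)
    action(4321)
    assert controller.calls == [(4321, 'Force killed', True)]
```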

Contributor (Author):

This should now be tested.

@@ -338,7 +338,7 @@ def kill(self, msg_text: str | None = None) -> Union[bool, plumpy.futures.Future

had_been_terminated = self.has_terminated()

- result = super().kill(msg_text)
+ result = super().kill(msg_text, force_kill)
Member:

This is the other thing I think requires a test, but from the change you made, it should be plumpy's responsibility to make sure it kills properly.

Member:

Ah, you already did the change in aiidateam/plumpy#320; I think that is the place that requires more tests to make sure the behavior of kill works as expected.

Contributor (Author):

We now have several tests. Do you think something is still missing?

@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from 6bd708c to 2ce09b7 Compare April 10, 2025 05:27
@khsrali khsrali self-requested a review April 10, 2025 08:02
@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from 35e1ffd to a62d60e Compare April 10, 2025 08:07
@agoscinski agoscinski marked this pull request as ready for review April 12, 2025 08:57
@@ -102,7 +103,7 @@ async def do_upload():
try:
logger.info(f'scheduled request to upload CalcJob<{node.pk}>')
ignore_exceptions = (plumpy.futures.CancelledError, PreSubmitException, plumpy.process_states.Interruption)
- skip_submit = await exponential_backoff_retry(
+ skip_submit = await utils.exponential_backoff_retry(
Contributor (Author):

These changes are important to be able to monkeypatch; otherwise one imports a local reference to the function that cannot be monkeypatched.
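
A generic illustration of the point (not the aiida-core code itself): `monkeypatch.setattr` replaces the attribute on the module object, so only call sites that look the name up on the module at call time see the patch.

```python
import math
from math import sqrt  # local binding, frozen at import time


def call_via_module(x: float) -> float:
    return math.sqrt(x)  # attribute lookup on the module at call time -> patchable


def call_via_local_name(x: float) -> float:
    return sqrt(x)  # uses the local binding -> the patch is invisible here


def test_patching(monkeypatch):
    monkeypatch.setattr(math, 'sqrt', lambda x: 42.0)
    assert call_via_module(4.0) == 42.0      # sees the patched function
    assert call_via_local_name(4.0) == 2.0   # still the original implementation
```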

@@ -201,7 +202,7 @@ def kill_processes(
return

controller = get_manager().get_process_controller()
- action = functools.partial(controller.kill_process, msg_text=msg_text)
+ action = functools.partial(controller.kill_process, msg_text=msg_text, force_kill=force_kill)
Contributor (Author):

This should now be tested.

@@ -760,6 +760,7 @@ def _factory(
f'Daemon <{started_daemon_client.profile.name}|{daemon_status}> log file content: \n'
f'{daemon_log_file}'
)
+ time.sleep(0.1)
Contributor (Author):

To avoid keeping the CPU spinning (busy-waiting) in the polling loop.
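
A minimal sketch of the pattern, with illustrative names rather than the fixture's actual code: poll a condition in a loop, but yield the CPU between checks instead of busy-waiting.

```python
import time


def wait_for(condition, timeout: float = 30.0, interval: float = 0.1) -> None:
    """Poll ``condition`` until it returns True, sleeping between checks."""
    start = time.monotonic()
    while not condition():
        if time.monotonic() - start > timeout:
            raise TimeoutError('condition was not met in time')
        time.sleep(interval)  # yield the CPU instead of spinning at 100%
```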

@@ -282,7 +283,7 @@ def handle_result(result):
try:
# unwrap is need here since LoopCommunicator will also wrap a future
unwrapped = unwrap_kiwi_future(future)
- result = unwrapped.result()
+ result = unwrapped.result(timeout)
Contributor (Author):

This needs to be discussed. When the task is successfully sent but blocks (because it is in the EBM), this will block the kill command. I feel it makes more sense to have the timeout here as well.

Contributor:

Perhaps the timeout should also be passed to plumpy/futures.py, line 100?

@agoscinski (Contributor, Author) commented Apr 28, 2025:

unwrap_kiwi_future is not blocking (or, to be more precise here, the local unwrap function inside unwrap_kiwi_future is not blocking, as it only gets the result when the future is done). So passing the timeout here is enough. The question I had is more: we are waiting here for the result of the kill command. If we time out while waiting for the kill command to finish, the kill command is still in the event loop. If the user sends another kill command, it will be cancelled, which is okay, but if the user does not send another kill command, the process might be killed later (hidden from the user). I will try to add a cancel if we time out.
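
A hedged sketch of the "cancel if timed out" idea, relying on plain concurrent.futures semantics (which kiwipy futures follow); `result_or_cancel` is an illustrative helper, not aiida-core API:

```python
import concurrent.futures


def result_or_cancel(future: concurrent.futures.Future, timeout: float):
    """Return the future's result, cancelling the pending request if it times out."""
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Keep a timed-out kill from silently completing later, hidden from the user.
        future.cancel()
        raise
```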

Member:

In async programming there should always be a timeout for outer resource access.
But I think it may be clearer to use another parameter for this inner unwrap timeout, to distinguish it from the timeout passed to as_completed above.

@agoscinski (Contributor, Author) commented Apr 29, 2025:

Giving the user two timeout options also seems bad, since it complexifies the CLI.

@agoscinski (Contributor, Author) commented Apr 30, 2025:

EDIT: The text below does not apply to unwrap_kiwi_future, as what I wrote is only true for asyncio.Future and not for concurrent.futures.Future (which kiwipy.Future is).


The whole usage of unwrap_kiwi_future actually seems wrong to me. unwrapping.result() can raise an exception if the future (the input of unwrap_kiwi_future) is not done yet, because no result has been set yet in unwrapping, which would be done by calling unwrap. Look at this example where I on purpose do not resolve the inner future immediately.

import asyncio


def unwrap_future(future: asyncio.Future) -> asyncio.Future:
    """
    Unwraps nested futures: if a future resolves to another future, keep unwrapping
    until we get a concrete value.
    """
    loop = asyncio.get_event_loop()
    unwrapping = loop.create_future()

    def unwrap(fut: asyncio.Future):
        try:
            result = fut.result()
            if isinstance(result, asyncio.Future):
                result.add_done_callback(unwrap)
            else:
                unwrapping.set_result(result)
        except Exception as e:
            unwrapping.set_exception(e)

    future.add_done_callback(unwrap)
    return unwrapping


# Simulate this outside of async def to show the error synchronously
loop = asyncio.get_event_loop()

# Step 1: Create nested futures
inner = loop.create_future()
outer = loop.create_future()
outer.set_result(inner)

# Step 2: Unwrap the outer future
unwrapped = unwrap_future(outer)

# Step 3: Try to access result too early
try:
    print("Attempting to access result before unwrapped is done...")
    print(unwrapped.result())  # This raises InvalidStateError
except asyncio.InvalidStateError as e:
    print("Caught expected error:", e)

# Step 4: Now resolve the inner future
inner.set_result("final value")

# Step 5: Give the event loop a moment to run callbacks
loop.run_until_complete(asyncio.sleep(0))  # flush callbacks

# Step 6: Access result again, this time it works
print("Now the result is ready:", unwrapped.result())

Contributor (Author):

> In async programming there should always be a timeout for outer resource access.

That seems wrong in this case, because the outer one just has another Future as its result, so it returns immediately. The total recursive process should apply the timeout until the actual result is retrieved, which is the innermost one. I think @khsrali has a point that it needs to be passed to the last instance that is really blocking.

Contributor (Author):

Okay, I think I understood the whole sequence. The functionality is a bit complex. Only the last future will set the result of unwrapping, so we only want a timeout on the final result. Putting the timeout only on unwrapping is therefore the correct usage. Any timeout on the other futures would just add to the total timeout and is therefore not correct.

Member:

I need more time to look at this. There is such a timeout here that we control, but one needs to be aware that in the pytray dependency the synchronous function is scheduled and has a timeout as well.

> So we only want a timeout on the final result.

This is a valid point, in the context of a chain of futures.

Member:

If unwrap_kiwi_future is the innermost call, then it is the correct change, as you mentioned offline. I don't have any more thoughts or evidence to say whether this is correct or not. It seems a correct and great fix. The tests all pass, so there is nothing more to block it. Cheers.

@agoscinski agoscinski requested a review from unkcpz April 12, 2025 09:06
@khsrali (Contributor) left a comment:

Thanks @agoscinski for the hard work!
Overall it looks good to me; I'll have another look at the tests later.

@@ -531,7 +535,8 @@ def start_daemon(
try:
subprocess.check_output(command, env=env, stderr=subprocess.STDOUT)
except subprocess.CalledProcessError as exception:
raise DaemonException('The daemon failed to start.') from exception
# CalledProcessError is not passing the subprocess stderr in its message so we add it in DaemonException
Contributor:

nice 👍

@@ -282,7 +283,7 @@ def handle_result(result):
try:
# unwrap is need here since LoopCommunicator will also wrap a future
unwrapped = unwrap_kiwi_future(future)
- result = unwrapped.result()
+ result = unwrapped.result(timeout)
Contributor:

Perhaps the timeout should also be passed to plumpy/futures.py, line 100?

@unkcpz (Member) left a comment:

Thanks @agoscinski @khsrali!

I just have some nitpick requests; the implementation looks all good, including the neat testing approach of using multiprocessing to start a customized worker. Cheers!

@@ -282,7 +283,7 @@ def handle_result(result):
try:
# unwrap is need here since LoopCommunicator will also wrap a future
unwrapped = unwrap_kiwi_future(future)
- result = unwrapped.result()
+ result = unwrapped.result(timeout)
Member:

In async programming there should always be a timeout for outer resource access.
But I think it may be clearer to use another parameter for this inner unwrap timeout, to distinguish it from the timeout passed to as_completed above.

@@ -451,7 +451,7 @@ Below is a list with all available subcommands.
--broker-host HOSTNAME Hostname for the message broker. [default: 127.0.0.1]
--broker-port INTEGER Port for the message broker. [default: 5672]
--broker-virtual-host TEXT Name of the virtual host for the message broker without
- leading forward slash.
+ leading forward slash. [default: ""]
Contributor (Author):

Not sure what is automatically added here; I need to check this.

pre-commit-ci bot and others added 2 commits May 1, 2025 10:53
As this option is only used there, it does not need to be in a common file.
@agoscinski agoscinski force-pushed the fix/6524/force-kill-v2 branch from ce220dd to 83e4d84 Compare May 1, 2025 08:53
@agoscinski agoscinski mentioned this pull request May 1, 2025
@unkcpz (Member) left a comment:

I just want to see, once the pre-commit is brought back, whether all changes here pass the mypy check. I don't have other comments on the changes. Cheers!


@agoscinski agoscinski requested a review from unkcpz May 2, 2025 04:31
@agoscinski agoscinski merged commit b6d0fe5 into aiidateam:main May 2, 2025
26 checks passed
@agoscinski agoscinski deleted the fix/6524/force-kill-v2 branch May 2, 2025 04:41
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
PR aiidateam#6793 introduced the cancellation of earlier kill actions. This had
the problem that if two kill commands are issued in sequence, the second kill
action cancels the first one, which had triggered the cancellation of the
scheduler job within an EBM. The second kill command, however, did not
retrigger the cancellation of the scheduler job. This bug appeared
because the killing logic is placed in two different places. More
information about this can be found in PR aiidateam#6868, which fixes this properly
by refactoring the kill action. This PR only serves as a fast temporary fix
with workarounds.

Before this PR, when the kill command failed through the EBM, the
scheduler job could not be cancelled through a kill anymore. Since we
now have a force-kill option to bypass the EBM, we can reschedule the
cancellation of the scheduler job to gracefully kill a process.
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

agoscinski added a commit that referenced this pull request May 9, 2025
Squashed commit at 2025-05-09 21:53

agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 2, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 2, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit that referenced this pull request Jun 4, 2025
…f scheduler job (#6870)

PR #6793 introduced logic to cancel earlier kill actions when a new one is
issued. However, this led to a bug: when two kill actions are sent in
succession, the second cancels the first, including its triggered cancellation of
the scheduler job that is stuck in the EBM. The second kill command does not
re-initiate the scheduler job cancellation, leaving it in an inconsistent state.
This issue arises because the kill logic is split across two places. PR #6868
addresses this properly with a refactor of the kill action. In contrast, this PR
provides a temporary workaround to mitigate the issue.

Furthermore, before this PR, if the kill command failed through the EBM, the
cancellation could not be rescheduled by a subsequent kill action, as the process
state had already transitioned to the `Excepted` state. With the new force-kill option
the user can bypass the EBM as desired, thus the transition to the `Excepted`
state is not needed anymore and has been removed to allow the user to still
gracefully kill a job.

---------

Co-authored-by: Ali Khosravi <[email protected]>
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
The original design of `wait` and `timeout` was to distinguish between
actions that return immediately and actions that are scheduled. This
mechanism was however never used, and it resulted in a misinterpretation
in the force-kill PR aiidateam#6793, introducing a bug fixed in PR aiidateam#6870.
The mechanism of `wait` and `timeout` was also never correctly implemented.
In this PR we rectify the logic and simplify it by handling immediate
and scheduled actions the same way.

Related commits in aiida-core 8388018, cd0d15c and plumpy 1b6ecb8

One can specify a `timeout <= 0` to express that the action should not wait for
a response, while one can specify `timeout==float('inf')` (the default value) to wait
until a response has been received, without a timeout.
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 4, 2025
agoscinski added a commit that referenced this pull request Jun 11, 2025
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 11, 2025
agoscinski added a commit that referenced this pull request Jun 11, 2025
…|play|pause}` (#6902)

The original design of `wait` and `timeout` was to distinguish between actions
that return immediately and actions that are scheduled. This mechanism was however
never used, and it resulted in a misinterpretation in the force-kill PR #6793,
introducing a bug fixed in PR #6870. The mechanism of `wait` and `timeout` was
also never correctly implemented. In this PR we rectify the logic and simplify
it by handling immediate and scheduled actions the same way.

My interpretation is that the original logic (seen in commit 8388018) is that it
unwraps the future once; if the result is another future, then it is a scheduled action,
otherwise an immediate one. Then 1b6ecb8 in plumpy introduced a double wrapping
of the return value for `play`, `pause` and `kill`, which I think is correct, as
these are scheduled actions. The problem appears in cd0d15c, where on the
aiida-core side the unwrapping in the first round is always done down to the actual
(non-future) result by using `unwrap_kiwi_future`. This makes the usage of
`wait` completely obsolete, as we always get the final non-future result in the
first step. Also, the `timeout` was not passed to `result`, which made it a
blocking action for scheduled tasks (the `timeout` is only applied on the
first layer of futures in `futures.as_completed`, which is meaningless since the
first layer is always a future and returns immediately).

This is fixed by simplifying the two-step procedure (unwrap once and, if the result is
a future, unwrap again) into one step: unwrap until a non-future result is obtained,
with a timeout. One can specify a `timeout == 0` to express that the action should not wait for
a response, while one can specify `timeout==float('inf')` (the default value) to wait
until a response has been received, without a timeout.
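
A hedged sketch of the one-step unwrapping described above, using plain concurrent.futures semantics (which kiwipy futures follow): resolve nested futures until a non-future result is obtained, charging the whole chain against a single timeout. The helper name and loop are illustrative, not the actual aiida-core implementation.

```python
import concurrent.futures
import time


def unwrapped_result(future: concurrent.futures.Future, timeout: float | None = None):
    """Resolve nested futures until a concrete result is obtained, within one overall timeout."""
    deadline = None if timeout is None else time.monotonic() + timeout
    result = future
    while isinstance(result, concurrent.futures.Future):
        remaining = None if deadline is None else max(0.0, deadline - time.monotonic())
        # Raises concurrent.futures.TimeoutError once the overall budget is exhausted.
        result = result.result(timeout=remaining)
    return result
```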
agoscinski added a commit to agoscinski/aiida-core that referenced this pull request Jun 17, 2025
…|play|pause}` (aiidateam#6902)
