
verdi process kill takes long time for erroneous processes #6524

Closed
@GeigerJ2


@khsrali and I noticed this while trying out the new FirecREST implementation. It may not be the most general example, but this is where it occurred for me: one installs aiida-firecrest and submits a job using it, but forgets to run `verdi daemon restart` beforehand, so the requested transport plugin is not available. `verdi daemon logshow` then shows the expected exception (which exception it is doesn't actually matter):
Error: iteration 1 of do_upload excepted, retrying after 20 seconds
Traceback (most recent call last):
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/orm/authinfos.py", line 175, in get_transport
    transport_class = TransportFactory(transport_type)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/factories.py", line 432, in TransportFactory
    entry_point = BaseFactory(entry_point_group, entry_point_name, load=load)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/factories.py", line 75, in BaseFactory
    return load_entry_point(group, name)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/entry_point.py", line 276, in load_entry_point
    entry_point = get_entry_point(group, name)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/entry_point.py", line 324, in get_entry_point
    raise MissingEntryPointError(f"Entry point '{name}' not found in group '{group}'")
aiida.common.exceptions.MissingEntryPointError: Entry point 'firecrest' not found in group 'aiida.transports'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py", line 203, in exponential_backoff_retry
    result = await coro()
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/calcjobs/tasks.py", line 85, in do_upload
    with transport_queue.request_transport(authinfo) as request:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/transports.py", line 78, in request_transport
    transport = authinfo.get_transport()
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/orm/authinfos.py", line 177, in get_transport
    raise exceptions.ConfigurationError(f'transport type `{transport_type}` could not be loaded: {exception}')
aiida.common.exceptions.ConfigurationError: transport type `firecrest` could not be loaded: Entry point 'firecrest' not found in group 'aiida.transports'

and the job is stuck in the state

⏵ Waiting Waiting for transport task: upload

as expected. (Although, should it perhaps show here that the job excepted, rather than being stuck in the waiting state?)

Running `verdi process kill` now leaves the command hanging for minutes on end, while `verdi daemon logshow` gives the following output:
plumpy.process_states.KillInterruption: Killed through `verdi process kill`

Task exception was never retrieved
future: <Task finished name='Task-49' coro=<interruptable_task.<locals>.execute_coroutine() done, defined at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py:132> exception=KillInterruption('Killed through `verdi process kill`')>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py", line 142, in execute_coroutine
    future.result(),
plumpy.process_states.KillInterruption: Killed through `verdi process kill`

This shows that plumpy reports having killed the process, but the daemon does not pick this up. I'm opening this issue for future reference. Please correct me if I got something wrong, @khsrali, or if you have a simpler, more general example where this occurs.
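One general way a kill request can be slow to take effect is a retry loop that sleeps between attempts without watching for an interruption. The following is a minimal, generic asyncio sketch of a backoff loop that races its sleep against an interrupt future. It is not the aiida-core or plumpy implementation and does not diagnose this issue; `backoff_retry`, `failing_upload`, and the local `KillInterruption` class are hypothetical stand-ins used for illustration only.

```python
import asyncio


class KillInterruption(Exception):
    """Hypothetical stand-in for an interruption raised on kill."""


async def backoff_retry(coro_factory, interrupt, initial_interval=0.01, max_attempts=3):
    """Retry `coro_factory` with exponential backoff.

    `interrupt` is a future; if an exception is set on it while the loop is
    sleeping between attempts, the sleep is abandoned and the exception raised.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except KillInterruption:
            raise
        except Exception:
            if attempt == max_attempts:
                raise
            # Race the backoff sleep against the interrupt future.
            sleep = asyncio.ensure_future(asyncio.sleep(interval))
            done, _ = await asyncio.wait(
                {sleep, interrupt}, return_when=asyncio.FIRST_COMPLETED
            )
            if interrupt in done:
                sleep.cancel()
                interrupt.result()  # re-raises the exception set on the future
            interval *= 2


async def failing_upload():
    # Assumed failure, mimicking a transport that can never be loaded.
    raise RuntimeError("transport could not be loaded")


async def demo():
    loop = asyncio.get_running_loop()
    interrupt = loop.create_future()
    # Simulate a kill request arriving shortly after the first failed attempt.
    loop.call_later(0.05, interrupt.set_exception, KillInterruption("killed"))
    try:
        await backoff_retry(failing_upload, interrupt, initial_interval=10, max_attempts=5)
    except KillInterruption as exc:
        return f"interrupted: {exc}"
    return "finished"


result = asyncio.run(demo())
print(result)
```

In this sketch the retry interval is 10 seconds, yet the loop exits as soon as the interrupt arrives, because the sleep is never awaited on its own.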

EDIT: To add to this, running `verdi daemon stop` afterwards leads to the following update in `verdi daemon logshow`:
Exception in callback Process.kill.<locals>.done(<_GatheringFu...celledError()>) at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py:369
handle: <Handle Process.kill.<locals>.done(<_GatheringFu...celledError()>) at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py:369>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py", line 370, in done
    is_all_killed = all(done_future.result())
asyncio.exceptions.CancelledError

Starting the daemon again then puts the process in the QUEUED state rather than killing it, so the `verdi process kill` command is effectively lost.
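The `CancelledError` in the last traceback is raised by calling `.result()` on a gathered future inside a done-callback. The sketch below reproduces that generic asyncio behaviour in isolation and shows one possible guard. It is only an illustration under stated assumptions, not the code in `process.py` and not a proposed fix; `make_done_callback` and `never_finishes` are hypothetical names.

```python
import asyncio


def make_done_callback(outcome):
    """Build a done-callback that records whether all gathered results are truthy."""

    def done(done_future):
        try:
            # Raises CancelledError if the gathered future was cancelled.
            results = done_future.result()
        except asyncio.CancelledError:
            outcome.append("cancelled")
            return
        outcome.append(all(results))

    return done


async def never_finishes():
    # Stand-in for a task that is still pending when shutdown happens.
    await asyncio.sleep(3600)
    return True


async def demo():
    outcome = []
    gathered = asyncio.gather(never_finishes(), never_finishes())
    gathered.add_done_callback(make_done_callback(outcome))
    await asyncio.sleep(0)  # let the child tasks start
    gathered.cancel()  # e.g. what may happen when the event loop is shut down
    try:
        await gathered
    except asyncio.CancelledError:
        pass
    await asyncio.sleep(0)  # give pending callbacks a chance to run
    return outcome


outcome = asyncio.run(demo())
print(outcome)
```

Without the `try`/`except` in `done`, the callback itself raises and asyncio logs an "Exception in callback" message similar to the one shown above.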
