
Galaxy unable to delete jobs submitted through TES Pulsar library #153


Open
micoleaoo opened this issue Apr 11, 2025 · 0 comments

@micoleaoo

Datasets can be deleted through the UI, but the corresponding jobs still linger in the database, which results in a nonstop cycle of the following logs:

Galaxy log:

Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: galaxy.jobs.handler DEBUG 2025-04-11 07:57:24,426 [pN:handler_0,p:875355,tN:JobHandlerStopQueue.monitor_thread] Stopping job 260 in pulsar_tes runner
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: galaxy.jobs.runners.pulsar DEBUG 2025-04-11 07:57:24,438 [pN:handler_0,p:875355,tN:JobHandlerStopQueue.monitor_thread] Attempt remote Pulsar kill of job with url pulsar_tes and id 67f52f266dd4e6cf81e81e87
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: galaxy.jobs.handler ERROR 2025-04-11 07:57:24,467 [pN:handler_0,p:875355,tN:JobHandlerStopQueue.monitor_thread] Exception in monitor_step
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: Traceback (most recent call last):
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/venv/lib/python3.11/site-packages/requests/models.py", line 974, in json
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     return complexjson.loads(self.text, **kwargs)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     return _default_decoder.decode(s)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     raise JSONDecodeError("Expecting value", s, err.value) from None
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: During handling of the above exception, another exception occurred:
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: Traceback (most recent call last):
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/server/lib/galaxy/jobs/handler.py", line 1086, in __monitor
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     self.__monitor_step()
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/server/lib/galaxy/jobs/handler.py", line 1118, in __monitor_step
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     self._check_jobs(session, jobs_to_check)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/server/lib/galaxy/jobs/handler.py", line 1188, in _check_jobs
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     self.dispatcher.stop(job, job_wrapper)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/server/lib/galaxy/jobs/handler.py", line 1267, in stop
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     self.job_runners[runner_name].stop_job(job_wrapper)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/server/lib/galaxy/jobs/runners/pulsar.py", line 773, in stop_job
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     client.kill()
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/venv/lib/python3.11/site-packages/pulsar/client/client.py", line 745, in kill
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     self._tes_client.cancel_task(self.job_id)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/venv/lib/python3.11/site-packages/pydantictes/api.py", line 45, in cancel_task
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     return TesCancelTaskResponse(**response.json())
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:                                    ^^^^^^^^^^^^^^^
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:   File "/srv/galaxy/venv/lib/python3.11/site-packages/requests/models.py", line 978, in json
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]:     raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875355]: requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875356]: galaxy.jobs.handler DEBUG 2025-04-11 07:57:24,689 [pN:handler_1,p:875356,tN:JobHandlerStopQueue.monitor_thread] Stopping job 258 in pulsar_tes runner
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875356]: galaxy.jobs.runners.pulsar DEBUG 2025-04-11 07:57:24,693 [pN:handler_1,p:875356,tN:JobHandlerStopQueue.monitor_thread] Attempt remote Pulsar kill of job with url pulsar_tes and id 67f52cf35172e48cefc64499
Apr 11 07:57:24 galaxy-qa-nd-2 galaxyctl[875356]: galaxy.jobs.handler ERROR 2025-04-11 07:57:24,715 [pN:handler_1,p:875356,tN:JobHandlerStopQueue.monitor_thread] Exception in monitor_step
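Both tracebacks end in the same message, `Expecting value: line 1 column 1 (char 0)`. Python's `json` module raises exactly this when the text it receives is empty or does not start with a JSON value, so `response.json()` inside `cancel_task` most likely got an empty (or non-JSON) body. The standard-library check below reproduces the message; `decode_error_message` is a hypothetical helper written only for illustration:

```python
import json


def decode_error_message(body: str) -> str:
    """Return the JSONDecodeError message for `body`, or "" if it parses."""
    try:
        json.loads(body)
    except json.JSONDecodeError as exc:
        return str(exc)
    return ""


# An empty response body reproduces the exact message seen in the Galaxy log.
print(decode_error_message(""))  # Expecting value: line 1 column 1 (char 0)
# An empty JSON object parses without error.
print(repr(decode_error_message("{}")))  # ''
```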

TESP log:

tesp-api       | 2025-04-11 08:03:10.812 | INFO     | uvicorn.protocols.http.h11_impl:send:431 - 147.251.245.115:37333 - "POST /v1/tasks/67f52cf35172e48cefc64499%3Acancel HTTP/1.1" 200
tesp-api       | 2025-04-11 08:03:10.820 | DEBUG    | asyncio.selector_events:_read_ready__on_eof:885 - <_SelectorSocketTransport fd=12 read=polling write=<idle, bufsize=0>> received EOF
tesp-api       | 2025-04-11 08:03:11.426 | DEBUG    | asyncio.selector_events:_accept_connection:161 - <Server sockets=(<asyncio.TransportSocket fd=3, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('0.0.0.0', 8080)>,)> got a new connection from ('147.251.245.115', 21065): <socket.socket fd=12, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.16.4', 8080), raddr=('147.251.245.115', 21065)>
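The TESP log shows that the cancel request for task 67f52cf35172e48cefc64499 (job 258 in the Galaxy log) was answered with HTTP 200, so the cancellation itself appears to succeed on the TES side and only the client-side parsing of the reply fails. The GA4GH TES specification describes the cancel response as a JSON object with no fields, so either TESP could return `{}` or the client could tolerate an empty body. A minimal sketch of the client-side option, assuming a hypothetical `parse_cancel_response` helper (this is not the pydantictes or Pulsar API):

```python
import json
from typing import Any, Dict


def parse_cancel_response(status_code: int, body: str) -> Dict[str, Any]:
    """Interpret the reply to POST /v1/tasks/{id}:cancel.

    Hypothetical helper: a 2xx status with an empty body is treated as a
    successful cancel, equivalent to the empty object {} the TES spec describes.
    """
    if not 200 <= status_code < 300:
        # Real failures still surface instead of being silently ignored.
        raise RuntimeError(f"TES cancel failed with HTTP {status_code}: {body!r}")
    if not body.strip():
        # Server acknowledged the cancel but sent no JSON; do not crash.
        return {}
    return json.loads(body)


print(parse_cancel_response(200, ""))    # {}
print(parse_cancel_response(200, "{}"))  # {}
```

Treating an empty 2xx body as `{}` would let Galaxy's stop loop finish instead of raising on every pass, while non-2xx replies still raise.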

Temporary fix:
Manually change the job state from deleting to deleted in the Galaxy SQL database.
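The workaround can be scripted rather than edited by hand. A minimal sketch using the Python DB-API, assuming Galaxy's `job` table has `id` and `state` columns (check this against your own schema first); `mark_jobs_deleted` is a hypothetical helper. It is demonstrated on an in-memory SQLite database; with PostgreSQL via psycopg2 the placeholder would be `%s` instead of `?`:

```python
import sqlite3
from typing import Iterable


def mark_jobs_deleted(conn, job_ids: Iterable[int], placeholder: str = "?") -> int:
    """Move jobs stuck in 'deleting' to 'deleted' and return the rows changed.

    Hypothetical helper. Assumes a `job` table with `id` and `state` columns.
    Only rows still in 'deleting' are touched, so re-running is harmless.
    `placeholder` is "?" for sqlite3 and "%s" for psycopg2.
    """
    ids = list(job_ids)
    if not ids:
        return 0
    marks = ",".join(placeholder for _ in ids)
    cur = conn.cursor()
    cur.execute(
        "UPDATE job SET state = 'deleted' "
        f"WHERE state = 'deleting' AND id IN ({marks})",
        ids,
    )
    conn.commit()
    return cur.rowcount


# Demo on an in-memory SQLite database standing in for Galaxy's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO job (id, state) VALUES (?, ?)",
    [(258, "deleting"), (260, "deleting"), (261, "ok")],
)
changed = mark_jobs_deleted(conn, [258, 260, 261])
print(changed)  # 2
```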
