Hi!
On usegalaxy.fr we sometimes get sporadic errors on some jobs, due to TPV trying to submit jobs with weird mem values.
Here's an example log:
tpv.core.entities DEBUG 2025-12-12 08:19:32,230 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [runner=slurm, dest_name=slurm, min_accepted_cores=None, min_accepted_mem=None, min_accepted_gpus=None, max_accepted_cores=None, max_accepted_mem=None, max_accepted_gpus=None, tpv_dest_tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=docker, type=TagType.ACCEPT>, <Tag: name=scheduling, value=singularity, type=TagType.ACCEPT>, <Tag: name=scheduling, value=pulsar, type=TagType.ACCEPT>], handler_tags=None<class 'tpv.core.entities.Destination'> id=slurm, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[{'name': 'TMP', 'value': '$_GALAXY_JOB_TMP_DIR'}, {'name': 'TEMP', 'value': '$_GALAXY_JOB_TMP_DIR'}, {'name': 'TMPDIR', 'value': '$_GALAXY_JOB_TMP_DIR'}], params={'tpv_cores': '{cores}', 'tpv_gpus': '{gpus}', 'tpv_mem': '{mem}', 'nativeSpecification': '--cpus-per-task={round(cores)} --mem={round(mem*1024)} --partition={partition} {additional_spec} {reservation}'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=_slurm_destination, context={}, rules={'slurm_destination_singularity_rule': <class 'tpv.core.entities.Rule'> id=slurm_destination_singularity_rule, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[{'name': 'LC_ALL', 'value': 'C'}, {'name': 'APPTAINER_DISABLE_CACHE', 'value': 'True'}, {'name': 'APPTAINER_CACHEDIR', 'value': '/tmp/singularity'}, {'name': 'APPTAINER_TMPDIR', 'value': '/home/galaxy/.singularity'}], params={'require_container': True, 'singularity_volumes': 
'$galaxy_root:ro,$tool_directory:ro,$job_directory:rw,$working_directory:rw,$default_file_path:rw,/cvmfs/data.galaxyproject.org:rw,/tmp:rw,/foobar/galaxy/mutable-data/tmp:rw,/foobar/galaxy/datasets2:rw,/foobar/galaxy/mutable-data/tool-data:ro', 'singularity_default_container_id': '/cvmfs/singularity.galaxyproject.org/all/ubuntu:20.04', 'tmp_dir': 'True'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context={}, if=entity.par, execute=, fail=, 'slurm_destination_docker_rule': <class 'tpv.core.entities.Rule'> id=slurm_destination_docker_rule, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'nativeSpecification': '--cpus-per-task={round(cores)} --mem={round(mem*1024)} --partition=docker {additional_spec} {reservation}', 'require_container': True, 'docker_volumes': '$defaults,/foobar/galaxy/datasets2:ro', 'docker_memory': '{mem}G', 'docker_sudo': False, 'docker_auto_rm': True, 'docker_default_container_id': 'busybox:ubuntu-14.04', 'docker_set_user': '', 'tmp_dir': 'True'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context={}, if=entity.par, execute=, fail=}] for entity: <class 'tpv.core.entities.Tool'> id=__DATA_FETCH__, Rule: __DATA_FETCH__, abstract=False, cores=1, mem=12582912.0, gpus=0, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'metadata_strategy': 'extended', 'SCALING_FACTOR': "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"}, resubmit={'with_more_mem_on_failure': {'condition': 'memory_limit_reached and attempt <= 3', 'destination': 'tpv_dispatcher', 'delay': 'attempt * 30'}}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=, inherits=None, context={'partition': 
'fast', 'additional_spec': '', 'reservation': ''}, rules={} using default ranker
tpv.core.entities DEBUG 2025-12-12 08:19:32,231 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Destination: <class 'tpv.core.entities.Tool'> id=__DATA_FETCH__, Rule: __DATA_FETCH__, abstract=False, cores=1, mem=12582912.0, gpus=0, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'metadata_strategy': 'extended', 'SCALING_FACTOR': "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"}, resubmit={'with_more_mem_on_failure': {'condition': 'memory_limit_reached and attempt <= 3', 'destination': 'tpv_dispatcher', 'delay': 'attempt * 30'}}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=, inherits=None, context={'partition': 'fast', 'additional_spec': '', 'reservation': ''}, rules={} scored: 0
galaxy.jobs.mapper DEBUG 2025-12-12 08:19:32,233 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Mapped job to destination id: slurm
galaxy.jobs.handler DEBUG 2025-12-12 08:19:32,244 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Dispatching to slurm runner
galaxy.objectstore DEBUG 2025-12-12 08:19:32,301 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Selected backend 'newdata' for creation of Dataset 12665025
galaxy.objectstore DEBUG 2025-12-12 08:19:32,304 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Using preferred backend 'newdata' for creation of Job 6915953
galaxy.jobs DEBUG 2025-12-12 08:19:32,309 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Working directory for job is: /foobar/galaxy/datasets2/jobs/006/915/6915953
galaxy.jobs.runners DEBUG 2025-12-12 08:19:32,329 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Job [6915953] queued (84.953 ms)
galaxy.jobs.handler INFO 2025-12-12 08:19:32,334 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Job dispatched
galaxy.jobs DEBUG 2025-12-12 08:19:32,459 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] Job wrapper for Job [6915953] prepared (113.267 ms)
galaxy.jobs.command_factory INFO 2025-12-12 08:19:32,473 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] Built script [/foobar/galaxy/datasets2/jobs/006/915/6915953/tool_script.sh] for tool command [python '/foobar/galaxy/server/lib/galaxy/tools/data_fetch.py' --galaxy-root '/foobar/galaxy/server' --datatypes-registry '/foobar/galaxy/datasets2/jobs/006/915/6915953/registry.xml' --request-version '1' --request '/foobar/galaxy/datasets2/jobs/006/915/6915953/configs/tmpnq3ddc4y']
galaxy.jobs.runners DEBUG 2025-12-12 08:19:32,580 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) command is: cd working; /bin/bash /foobar/galaxy/datasets2/jobs/006/915/6915953/tool_script.sh > '../outputs/tool_stdout' 2> '../outputs/tool_stderr'; return_code=$?; echo $return_code > /foobar/galaxy/datasets2/jobs/006/915/6915953/galaxy_6915953.ec; cd '/foobar/galaxy/datasets2/jobs/006/915/6915953';
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True; python metadata/set.py; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2025-12-12 08:19:32,598 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) submitting file /foobar/galaxy/datasets2/jobs/006/915/6915953/galaxy_6915953.sh
galaxy.jobs.runners.drmaa DEBUG 2025-12-12 08:19:32,598 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) native specification is: --cpus-per-task=1 --mem=12884901888 --partition=fast
galaxy.jobs.runners.drmaa ERROR 2025-12-12 08:19:32,599 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) drmaa.Session.runJob() failed unconditionally
Traceback (most recent call last):
File "/foobar/galaxy/server/lib/galaxy/jobs/runners/drmaa.py", line 188, in queue_job
external_job_id = self.ds.run_job(**jt)
^^^^^^^^^^^^^^^^^^^^^
File "/foobar/galaxy/.venv/lib/python3.11/site-packages/pulsar/managers/util/drmaa/__init__.py", line 73, in run_job
return DrmaaSession.session.runJob(template)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/session.py", line 314, in runJob
c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/helpers.py", line 302, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/errors.py", line 151, in error_check
raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: value out of range: 12884901888
galaxy.jobs.runners.drmaa ERROR 2025-12-12 08:19:32,627 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) All attempts to submit job failed
This is for a __DATA_FETCH__ job, which is supposed to reserve only 6 GB of memory.
I don't understand why we suddenly get mem=12582912, which is then multiplied again by 1024 to produce the --mem=12884901888 Slurm specification. The only clue is that 12582912 = 6 × 2 × 1024 × 1024.
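For what it's worth, the numbers line up exactly with a 6 GB request that was doubled once (the ×2 from the resubmit SCALING_FACTOR, perhaps?) and then converted GB→KB somewhere, before the destination template applies its own ×1024. This is just the arithmetic, not actual TPV code:

```python
# Hypothetical reconstruction of where --mem=12884901888 could come from.
# Assumption: the tool asks for 6 GB, something doubles it, and something
# converts GB -> KB (x1024x1024) before TPV renders the destination params.

requested_gb = 6
doubled = requested_gb * 2                 # 12 (resubmit SCALING_FACTOR doubling?)
mem_seen_by_tpv = doubled * 1024 * 1024    # the mem= value in the log above

# The slurm destination then applies its own GB -> MB conversion:
#   nativeSpecification: --mem={round(mem*1024)}
slurm_mem = round(mem_seen_by_tpv * 1024)  # the value Slurm rejects as out of range

print(mem_seen_by_tpv)  # 12582912
print(slurm_mem)        # 12884901888
```

If that reading is right, the bug is an extra unit conversion upstream of the destination, since TPV expects mem in GB at that point.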
We're apparently using TPV v2.5.0 (I don't remember whether it needs to be updated manually, or whether it gets updated with new Galaxy releases).
Our config files are in our GitLab repo, and they're loaded in this order.
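In case it helps narrow things down, this is roughly where the 6 GB would be pinned in a TPV config (a hedged sketch, not our actual file; the tool id and values are taken from the log above):

```yaml
# Sketch of a TPV tool entry. If mem is already 6 here, then the doubling and
# the GB->KB blow-up must happen later: in a rule, in the resubmit path, or in
# a second config file merged on top of this one.
tools:
  __DATA_FETCH__:
    cores: 1
    mem: 6        # TPV expects GB here, not KB or bytes
    scheduling:
      reject:
        - offline
```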
Any help debugging this would be very much appreciated :)