Sporadic weird mem value #181

@abretaud

Description

Hi!
On usegalaxy.fr we sometimes get sporadic errors on some jobs because TPV tries to submit them with weird mem values.

Here's an example log:

tpv.core.entities DEBUG 2025-12-12 08:19:32,230 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [runner=slurm, dest_name=slurm, min_accepted_cores=None, min_accepted_mem=None, min_accepted_gpus=None, max_accepted_cores=None, max_accepted_mem=None, max_accepted_gpus=None, tpv_dest_tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=docker, type=TagType.ACCEPT>, <Tag: name=scheduling, value=singularity, type=TagType.ACCEPT>, <Tag: name=scheduling, value=pulsar, type=TagType.ACCEPT>], handler_tags=None<class 'tpv.core.entities.Destination'> id=slurm, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[{'name': 'TMP', 'value': '$_GALAXY_JOB_TMP_DIR'}, {'name': 'TEMP', 'value': '$_GALAXY_JOB_TMP_DIR'}, {'name': 'TMPDIR', 'value': '$_GALAXY_JOB_TMP_DIR'}], params={'tpv_cores': '{cores}', 'tpv_gpus': '{gpus}', 'tpv_mem': '{mem}', 'nativeSpecification': '--cpus-per-task={round(cores)} --mem={round(mem*1024)} --partition={partition} {additional_spec} {reservation}'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=_slurm_destination, context={}, rules={'slurm_destination_singularity_rule': <class 'tpv.core.entities.Rule'> id=slurm_destination_singularity_rule, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[{'name': 'LC_ALL', 'value': 'C'}, {'name': 'APPTAINER_DISABLE_CACHE', 'value': 'True'}, {'name': 'APPTAINER_CACHEDIR', 'value': '/tmp/singularity'}, {'name': 'APPTAINER_TMPDIR', 'value': '/home/galaxy/.singularity'}], params={'require_container': True, 'singularity_volumes': 
'$galaxy_root:ro,$tool_directory:ro,$job_directory:rw,$working_directory:rw,$default_file_path:rw,/cvmfs/data.galaxyproject.org:rw,/tmp:rw,/foobar/galaxy/mutable-data/tmp:rw,/foobar/galaxy/datasets2:rw,/foobar/galaxy/mutable-data/tool-data:ro', 'singularity_default_container_id': '/cvmfs/singularity.galaxyproject.org/all/ubuntu:20.04', 'tmp_dir': 'True'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context={}, if=entity.par, execute=, fail=, 'slurm_destination_docker_rule': <class 'tpv.core.entities.Rule'> id=slurm_destination_docker_rule, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'nativeSpecification': '--cpus-per-task={round(cores)} --mem={round(mem*1024)} --partition=docker {additional_spec} {reservation}', 'require_container': True, 'docker_volumes': '$defaults,/foobar/galaxy/datasets2:ro', 'docker_memory': '{mem}G', 'docker_sudo': False, 'docker_auto_rm': True, 'docker_default_container_id': 'busybox:ubuntu-14.04', 'docker_set_user': '', 'tmp_dir': 'True'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context={}, if=entity.par, execute=, fail=}] for entity: <class 'tpv.core.entities.Tool'> id=__DATA_FETCH__, Rule: __DATA_FETCH__, abstract=False, cores=1, mem=12582912.0, gpus=0, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'metadata_strategy': 'extended', 'SCALING_FACTOR': "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"}, resubmit={'with_more_mem_on_failure': {'condition': 'memory_limit_reached and attempt <= 3', 'destination': 'tpv_dispatcher', 'delay': 'attempt * 30'}}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=, inherits=None, context={'partition': 
'fast', 'additional_spec': '', 'reservation': ''}, rules={} using default ranker
tpv.core.entities DEBUG 2025-12-12 08:19:32,231 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Destination: <class 'tpv.core.entities.Tool'> id=__DATA_FETCH__, Rule: __DATA_FETCH__, abstract=False, cores=1, mem=12582912.0, gpus=0, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'metadata_strategy': 'extended', 'SCALING_FACTOR': "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"}, resubmit={'with_more_mem_on_failure': {'condition': 'memory_limit_reached and attempt <= 3', 'destination': 'tpv_dispatcher', 'delay': 'attempt * 30'}}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=, inherits=None, context={'partition': 'fast', 'additional_spec': '', 'reservation': ''}, rules={} scored: 0
galaxy.jobs.mapper DEBUG 2025-12-12 08:19:32,233 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Mapped job to destination id: slurm
galaxy.jobs.handler DEBUG 2025-12-12 08:19:32,244 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Dispatching to slurm runner
galaxy.objectstore DEBUG 2025-12-12 08:19:32,301 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Selected backend 'newdata' for creation of Dataset 12665025
galaxy.objectstore DEBUG 2025-12-12 08:19:32,304 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Using preferred backend 'newdata' for creation of Job 6915953
galaxy.jobs DEBUG 2025-12-12 08:19:32,309 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Working directory for job is: /foobar/galaxy/datasets2/jobs/006/915/6915953
galaxy.jobs.runners DEBUG 2025-12-12 08:19:32,329 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Job [6915953] queued (84.953 ms)
galaxy.jobs.handler INFO 2025-12-12 08:19:32,334 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Job dispatched
galaxy.jobs DEBUG 2025-12-12 08:19:32,459 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] Job wrapper for Job [6915953] prepared (113.267 ms)
galaxy.jobs.command_factory INFO 2025-12-12 08:19:32,473 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] Built script [/foobar/galaxy/datasets2/jobs/006/915/6915953/tool_script.sh] for tool command [python '/foobar/galaxy/server/lib/galaxy/tools/data_fetch.py' --galaxy-root '/foobar/galaxy/server' --datatypes-registry '/foobar/galaxy/datasets2/jobs/006/915/6915953/registry.xml' --request-version '1' --request '/foobar/galaxy/datasets2/jobs/006/915/6915953/configs/tmpnq3ddc4y']
galaxy.jobs.runners DEBUG 2025-12-12 08:19:32,580 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) command is: cd working; /bin/bash /foobar/galaxy/datasets2/jobs/006/915/6915953/tool_script.sh > '../outputs/tool_stdout' 2> '../outputs/tool_stderr'; return_code=$?; echo $return_code > /foobar/galaxy/datasets2/jobs/006/915/6915953/galaxy_6915953.ec; cd '/foobar/galaxy/datasets2/jobs/006/915/6915953'; 
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True; python metadata/set.py; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2025-12-12 08:19:32,598 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) submitting file /foobar/galaxy/datasets2/jobs/006/915/6915953/galaxy_6915953.sh
galaxy.jobs.runners.drmaa DEBUG 2025-12-12 08:19:32,598 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) native specification is: --cpus-per-task=1 --mem=12884901888 --partition=fast  
galaxy.jobs.runners.drmaa ERROR 2025-12-12 08:19:32,599 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) drmaa.Session.runJob() failed unconditionally
Traceback (most recent call last):
  File "/foobar/galaxy/server/lib/galaxy/jobs/runners/drmaa.py", line 188, in queue_job
    external_job_id = self.ds.run_job(**jt)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/pulsar/managers/util/drmaa/__init__.py", line 73, in run_job
    return DrmaaSession.session.runJob(template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/session.py", line 314, in runJob
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/helpers.py", line 302, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: value out of range: 12884901888
galaxy.jobs.runners.drmaa ERROR 2025-12-12 08:19:32,627 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) All attempts to submit job failed

This is for a __DATA_FETCH__ job, which is supposed to reserve only 6 GB of memory.

I don't understand why we suddenly get mem=12582912, which is then multiplied by 1024 again to produce the --mem=12884901888 Slurm specification. The only clue is that 12582912 = 6 × 2 × 1024 × 1024.
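For reference, here is how the numbers in the log seem to decompose. This is only a sketch of the arithmetic: the ×2 (a resubmit doubling via SCALING_FACTOR?) and the spurious ×1024×1024 (a GB→KB unit conversion applied to a value that should have stayed in GB?) are guesses at where the inflation could come from, not confirmed behaviour.

```python
# Hypothetical reconstruction of the mem arithmetic from the log above.
# Assumption: TPV's `mem` is meant to be in GB, and the destination's
# nativeSpecification template multiplies it by 1024 to get MB for Slurm.

intended_gb = 6

# Expected path: mem stays in GB, a single x1024 converts it to MB.
expected_mem = intended_gb                     # mem=6
expected_slurm = round(expected_mem * 1024)    # --mem=6144

# Observed path: mem arrives already inflated by 2 * 1024 * 1024
# before the template's own x1024 is applied on top.
observed_mem = intended_gb * 2 * 1024 * 1024   # 12582912, matches the log
observed_slurm = round(observed_mem * 1024)    # 12884901888, rejected by Slurm

print(expected_slurm, observed_mem, observed_slurm)
```

If that decomposition is right, the value Slurm rejects is exactly the intended 6 GB doubled once and pushed through two extra ×1024 conversions.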

We're apparently using TPV v2.5.0 (I don't remember whether it needs to be updated manually, or whether it gets updated with each new Galaxy release).

Our config files are in our GitLab repo, and they're loaded in this order.

Any help debugging this would be very much appreciated :)
