Sporadic weird mem value #181

@abretaud

Description

Hi!
On usegalaxy.fr we sometimes get sporadic errors on some jobs because TPV tries to submit them with weird mem values.

Here's an example log:

tpv.core.entities DEBUG 2025-12-12 08:19:32,230 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Ranking destinations: [runner=slurm, dest_name=slurm, min_accepted_cores=None, min_accepted_mem=None, min_accepted_gpus=None, max_accepted_cores=None, max_accepted_mem=None, max_accepted_gpus=None, tpv_dest_tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=docker, type=TagType.ACCEPT>, <Tag: name=scheduling, value=singularity, type=TagType.ACCEPT>, <Tag: name=scheduling, value=pulsar, type=TagType.ACCEPT>], handler_tags=None<class 'tpv.core.entities.Destination'> id=slurm, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[{'name': 'TMP', 'value': '$_GALAXY_JOB_TMP_DIR'}, {'name': 'TEMP', 'value': '$_GALAXY_JOB_TMP_DIR'}, {'name': 'TMPDIR', 'value': '$_GALAXY_JOB_TMP_DIR'}], params={'tpv_cores': '{cores}', 'tpv_gpus': '{gpus}', 'tpv_mem': '{mem}', 'nativeSpecification': '--cpus-per-task={round(cores)} --mem={round(mem*1024)} --partition={partition} {additional_spec} {reservation}'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=_slurm_destination, context={}, rules={'slurm_destination_singularity_rule': <class 'tpv.core.entities.Rule'> id=slurm_destination_singularity_rule, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[{'name': 'LC_ALL', 'value': 'C'}, {'name': 'APPTAINER_DISABLE_CACHE', 'value': 'True'}, {'name': 'APPTAINER_CACHEDIR', 'value': '/tmp/singularity'}, {'name': 'APPTAINER_TMPDIR', 'value': '/home/galaxy/.singularity'}], params={'require_container': True, 'singularity_volumes': 
'$galaxy_root:ro,$tool_directory:ro,$job_directory:rw,$working_directory:rw,$default_file_path:rw,/cvmfs/data.galaxyproject.org:rw,/tmp:rw,/foobar/galaxy/mutable-data/tmp:rw,/foobar/galaxy/datasets2:rw,/foobar/galaxy/mutable-data/tool-data:ro', 'singularity_default_container_id': '/cvmfs/singularity.galaxyproject.org/all/ubuntu:20.04', 'tmp_dir': 'True'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context={}, if=entity.par, execute=, fail=, 'slurm_destination_docker_rule': <class 'tpv.core.entities.Rule'> id=slurm_destination_docker_rule, abstract=False, cores=None, mem=None, gpus=None, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'nativeSpecification': '--cpus-per-task={round(cores)} --mem={round(mem*1024)} --partition=docker {additional_spec} {reservation}', 'require_container': True, 'docker_volumes': '$defaults,/foobar/galaxy/datasets2:ro', 'docker_memory': '{mem}G', 'docker_sudo': False, 'docker_auto_rm': True, 'docker_default_container_id': 'busybox:ubuntu-14.04', 'docker_set_user': '', 'tmp_dir': 'True'}, resubmit={}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[], rank=, inherits=None, context={}, if=entity.par, execute=, fail=}] for entity: <class 'tpv.core.entities.Tool'> id=__DATA_FETCH__, Rule: __DATA_FETCH__, abstract=False, cores=1, mem=12582912.0, gpus=0, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'metadata_strategy': 'extended', 'SCALING_FACTOR': "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"}, resubmit={'with_more_mem_on_failure': {'condition': 'memory_limit_reached and attempt <= 3', 'destination': 'tpv_dispatcher', 'delay': 'attempt * 30'}}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=, inherits=None, context={'partition': 
'fast', 'additional_spec': '', 'reservation': ''}, rules={} using default ranker
tpv.core.entities DEBUG 2025-12-12 08:19:32,231 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Destination: <class 'tpv.core.entities.Tool'> id=__DATA_FETCH__, Rule: __DATA_FETCH__, abstract=False, cores=1, mem=12582912.0, gpus=0, min_cores = None, min_mem = None, min_gpus = None, max_cores = None, max_mem = None, max_gpus = None, env=[], params={'metadata_strategy': 'extended', 'SCALING_FACTOR': "{2 * int(job.destination_params.get('SCALING_FACTOR', 2)) if job.destination_params else 2}"}, resubmit={'with_more_mem_on_failure': {'condition': 'memory_limit_reached and attempt <= 3', 'destination': 'tpv_dispatcher', 'delay': 'attempt * 30'}}, tags=<class 'tpv.core.entities.TagSetManager'> tags=[<Tag: name=scheduling, value=offline, type=TagType.REJECT>], rank=, inherits=None, context={'partition': 'fast', 'additional_spec': '', 'reservation': ''}, rules={} scored: 0
galaxy.jobs.mapper DEBUG 2025-12-12 08:19:32,233 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Mapped job to destination id: slurm
galaxy.jobs.handler DEBUG 2025-12-12 08:19:32,244 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Dispatching to slurm runner
galaxy.objectstore DEBUG 2025-12-12 08:19:32,301 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Selected backend 'newdata' for creation of Dataset 12665025
galaxy.objectstore DEBUG 2025-12-12 08:19:32,304 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Using preferred backend 'newdata' for creation of Job 6915953
galaxy.jobs DEBUG 2025-12-12 08:19:32,309 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Working directory for job is: /foobar/galaxy/datasets2/jobs/006/915/6915953
galaxy.jobs.runners DEBUG 2025-12-12 08:19:32,329 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] Job [6915953] queued (84.953 ms)
galaxy.jobs.handler INFO 2025-12-12 08:19:32,334 [pN:handler_job_2,p:3824998,tN:JobHandlerQueue.monitor_thread] (6915953) Job dispatched
galaxy.jobs DEBUG 2025-12-12 08:19:32,459 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] Job wrapper for Job [6915953] prepared (113.267 ms)
galaxy.jobs.command_factory INFO 2025-12-12 08:19:32,473 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] Built script [/foobar/galaxy/datasets2/jobs/006/915/6915953/tool_script.sh] for tool command [python '/foobar/galaxy/server/lib/galaxy/tools/data_fetch.py' --galaxy-root '/foobar/galaxy/server' --datatypes-registry '/foobar/galaxy/datasets2/jobs/006/915/6915953/registry.xml' --request-version '1' --request '/foobar/galaxy/datasets2/jobs/006/915/6915953/configs/tmpnq3ddc4y']
galaxy.jobs.runners DEBUG 2025-12-12 08:19:32,580 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) command is: cd working; /bin/bash /foobar/galaxy/datasets2/jobs/006/915/6915953/tool_script.sh > '../outputs/tool_stdout' 2> '../outputs/tool_stderr'; return_code=$?; echo $return_code > /foobar/galaxy/datasets2/jobs/006/915/6915953/galaxy_6915953.ec; cd '/foobar/galaxy/datasets2/jobs/006/915/6915953'; 
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True; python metadata/set.py; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2025-12-12 08:19:32,598 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) submitting file /foobar/galaxy/datasets2/jobs/006/915/6915953/galaxy_6915953.sh
galaxy.jobs.runners.drmaa DEBUG 2025-12-12 08:19:32,598 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) native specification is: --cpus-per-task=1 --mem=12884901888 --partition=fast  
galaxy.jobs.runners.drmaa ERROR 2025-12-12 08:19:32,599 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) drmaa.Session.runJob() failed unconditionally
Traceback (most recent call last):
  File "/foobar/galaxy/server/lib/galaxy/jobs/runners/drmaa.py", line 188, in queue_job
    external_job_id = self.ds.run_job(**jt)
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/pulsar/managers/util/drmaa/__init__.py", line 73, in run_job
    return DrmaaSession.session.runJob(template)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/session.py", line 314, in runJob
    c(drmaa_run_job, jid, sizeof(jid), jobTemplate)
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/helpers.py", line 302, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/foobar/galaxy/.venv/lib/python3.11/site-packages/drmaa/errors.py", line 151, in error_check
    raise _ERRORS[code - 1](error_string)
drmaa.errors.InvalidArgumentException: code 4: value out of range: 12884901888
galaxy.jobs.runners.drmaa ERROR 2025-12-12 08:19:32,627 [pN:handler_job_2,p:3824998,tN:SlurmRunner.work_thread-1] (6915953) All attempts to submit job failed

This is for a __DATA_FETCH__ job, which is supposed to reserve only 6 GB of memory.

I don't understand why we suddenly get mem=12582912, which is then multiplied by 1024 again to produce the --mem=12884901888 Slurm specification. The only clue is that 12582912 = 6 × 2 × 1024 × 1024.
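For reference, here is how the numbers in the log seem to decompose. This is only a sketch of the arithmetic: the ×2 (a resubmit doubling via SCALING_FACTOR?) and the spurious ×1024×1024 (a GB→KB unit conversion applied to a value that should have stayed in GB?) are guesses at where the inflation could come from, not confirmed behaviour.

```python
# Hypothetical reconstruction of the mem arithmetic from the log above.
# Assumption: TPV's `mem` is meant to be in GB, and the destination's
# nativeSpecification template multiplies it by 1024 to get MB for Slurm.

intended_gb = 6

# Expected path: mem stays in GB, a single x1024 converts it to MB.
expected_mem = intended_gb                     # mem=6
expected_slurm = round(expected_mem * 1024)    # --mem=6144

# Observed path: mem arrives already inflated by 2 * 1024 * 1024
# before the template's own x1024 is applied on top.
observed_mem = intended_gb * 2 * 1024 * 1024   # 12582912, matches the log
observed_slurm = round(observed_mem * 1024)    # 12884901888, rejected by Slurm

print(expected_slurm, observed_mem, observed_slurm)
```

If that decomposition is right, the value Slurm rejects is exactly the intended 6 GB doubled once and pushed through two extra ×1024 conversions.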

We're apparently using TPV v2.5.0 (I don't remember whether it needs to be updated manually, or whether it gets updated with each new Galaxy release).

Our config files are in our GitLab repo, and they're loaded in this order.

Any help debugging this would be very much appreciated :)
