PBS job manager #151
Conversation
adam-azarchs left a comment:
Thank you! This looks pretty good other than a few minor nit-picks.
jobmanagers/pbs_queue.py (outdated)

```python
import json

# PBS Pro "job states" to be regarded as "alive"
ALIVE = {'Q', 'H', 'W', 'S', 'R', 'E'}
```
This could be simply:

```python
ALIVE = "QHWSRE"
```
`in` will still work on it the same way and, for a set that small, will probably be faster than a hash lookup.
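A quick illustrative sketch (not part of the patch) of why the membership test is unaffected for single-character state codes:

```python
# Membership tests give the same answer either way for single-character codes.
ALIVE_SET = {'Q', 'H', 'W', 'S', 'R', 'E'}
ALIVE_STR = "QHWSRE"

for state in ("Q", "R", "F", "X"):
    assert (state in ALIVE_SET) == (state in ALIVE_STR)

# Caveat: on a string, `in` is a substring test, so "QH" in ALIVE_STR is True
# while "QH" in ALIVE_SET is False. PBS job states are single characters, so
# this difference does not matter here.
```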
jobmanagers/pbs_queue.py (outdated)

```python
"""Gets the command line for qstat."""
if not ids:
    sys.exit(0)
return ['qstat', '-x', '-F', 'json', '-f'] + ids
```
Please format this script with black for consistency with the other scripts.
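For reference, a sketch of roughly what black's output would look like for this snippet; it mainly normalizes the string quoting. The enclosing function signature shown here is assumed for illustration, not taken from the actual file:

```python
import sys


def qstat_cmd(ids):
    """Gets the command line for qstat."""
    if not ids:
        sys.exit(0)
    return ["qstat", "-x", "-F", "json", "-f"] + ids
```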
@johnyaku thank you for the patch!

Thanks for the review @macklin-10x

@adam-azarchs can you please take a look over the changes and see if this is ready to merge?
adam-azarchs left a comment:
My remaining comments are minor nitpicks. If you have a chance to address them soon, great; otherwise I'll merge it either way before we cut another release.
```
# is sufficient, but this can often be reduced to 4 hours or less if
# '--maxjobs' is at least 10.
```
While it's probably fine to set 4 hours regardless, this comment is misleading. The wall time being set in this template is for each individual job, not for the overall pipeline run. --maxjobs controls the number of jobs that can be queued simultaneously; increasing it may improve the overall pipeline wall time, but it would not be expected to improve the runtime of each individual job. More likely, in fact, it would increase I/O contention, which would increase the runtime of those jobs.
I haven't seen any jobs run for more than an hour in cluster mode, but I'm happy to be educated here. My naive assumption was that --maxjobs affected the number (and size) of the chunks when input data is split. If I'm wrong, then you can revert the second half of this change.
My reason for reducing this is that on our HPC, and perhaps on PBS Pro systems more generally, account quota is "reserved" based on resources × walltime. We often run dozens of cellranger or spaceranger jobs in parallel, so requesting too much walltime can result in all of our quota getting reserved, even though we only get billed for actual usage. Hogging quota like this can impact the rest of the team.
Perhaps: "24 hours (24:00:00) is sufficient, but this can be adjusted by comparing against actual usage."
In our system, PBS Pro dumps actual usage information to the _stdout files.
--maxjobs controls a semaphore limiting the number of jobs which are queued to the cluster at a time. It does not affect the size or number of jobs in total. Together with --jobinterval, the intent is to prevent a single pipeline job from slamming a cluster hard enough to make sysadmins angry at our users.
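A minimal, conceptual sketch of that throttling behaviour (this is not martian's actual implementation; the numbers and the qsub placeholder are purely illustrative):

```python
import threading
import time

MAXJOBS = 10       # analogous to --maxjobs: at most this many jobs in flight at once
JOBINTERVAL = 0.1  # analogous to --jobinterval: pause between submissions (seconds)

slots = threading.Semaphore(MAXJOBS)
submit_lock = threading.Lock()


def run_chunk(chunk_id):
    with slots:                      # block until one of the MAXJOBS slots is free
        with submit_lock:
            time.sleep(JOBINTERVAL)  # space out submissions to avoid hammering the scheduler
            print(f"qsub chunk {chunk_id}")  # placeholder for the real submission
        time.sleep(0.5)              # placeholder for the chunk's runtime on the cluster
    # the slot is released here, allowing the next chunk to be queued


threads = [threading.Thread(target=run_chunk, args=(i,)) for i in range(30)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that the total number of chunks (30 in this sketch) is fixed by the pipeline's splits; --maxjobs only changes how many are queued at any one time.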
I have not previously encountered a cluster that charged based on reserved wall time as opposed to actually-used wall time; if you have that then more careful accounting is certainly in order! You can get the actual wall time (as measured by the job itself, so might be a slight underestimate as far as the cluster is concerned) from the _perf json file at the end of the run, though care must be taken when interpreting what that means in anything but the leaf nodes of the graph.
In our internal infrastructure, we usually run on spot instances in AWS, which are prone to involuntary preemption, so we aim to keep individual jobs under 1 hour (for most datasets anyway) to avoid losing too much work when one of them fails this way and needs to be restarted. I think in most cases very few jobs will even exceed 15 minutes, depending on the performance of your hardware (the slow stages tend to be mostly I/O limited).
Thanks for the clarification. We only get billed for actual usage, but potential usage gets "reserved" until the jobs finish. 1000 jobs (from multiple parallel runs) all claiming 24 hours walltime can result in all our quota getting reserved, which prevents new jobs from queuing, breaking the pipeline orchestrating the parallel runs.
This PR closes #150.

pbs_queue.py borrows heavily from slurm_queue.py and sge_queue.py, with adaptations for PBS Pro. These changes have been successfully tested on the Gadi HPC system at the National Computational Infrastructure in Australia by running cellranger 9.0.1 on 125 captures.
Checks
pylint: ✅
pyformat: ✅