
Conversation

@johnyaku

This PR closes #150 and

  • adds pbs_queue.py
  • updates config.json
  • updates pbspro.template.example (https://github.com/johnyaku/martian/blob/pbs-job-manager/jobmanagers/pbspro.template.example)

The new pbs_queue.py borrows heavily from slurm_queue.py and sge_queue.py, with adaptations for PBS Pro.

These changes have been successfully tested on the Gadi HPC system at the National Computational Infrastructure (NCI) in Australia by running cellranger 9.0.1 on 125 captures.

Checks

  • pylint: ✅
  • pyformat: ✅

Member

@adam-azarchs left a comment

Thank you! This looks pretty good other than a few minor nit-picks.

import json

# PBS Pro "job states" to be regarded as "alive"
ALIVE = {'Q', 'H', 'W', 'S', 'R', 'E'}
Member

This could be simply

Suggested change
ALIVE = {'Q', 'H', 'W', 'S', 'R', 'E'}
ALIVE = "QHWSRE"

Checking membership with in will still work the same way and, for a set that small, a linear scan will probably be faster than a hash lookup.
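
For illustration, a quick standalone check (not part of the patch) confirming that single-character membership tests behave identically for the set and the string:

# Illustrative only: membership of a single-character state is the same either way.
ALIVE_SET = {'Q', 'H', 'W', 'S', 'R', 'E'}
ALIVE_STR = "QHWSRE"

for state in ("R", "F", "E"):
    assert (state in ALIVE_SET) == (state in ALIVE_STR)

# Note: `state in ALIVE_STR` is a substring test, so the equivalence only holds
# for single-character states, which is what qstat reports.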

"""Gets the command line for qstat."""
if not ids:
sys.exit(0)
return ['qstat', '-x', '-F', 'json', '-f'] + ids
Member

Please format this script with black for consistency with the other scripts.
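
For context, a minimal sketch of how a command list like this might be consumed. The JSON layout assumed below (a top-level "Jobs" object keyed by job ID, each entry carrying a "job_state" field) is an assumption about qstat -F json output and is not taken from pbs_queue.py:

import json
import subprocess

def alive_job_ids(ids):
    """Sketch only: return the subset of `ids` still in an 'alive' PBS Pro state."""
    if not ids:
        return []
    out = subprocess.check_output(["qstat", "-x", "-F", "json", "-f"] + ids)
    jobs = json.loads(out).get("Jobs", {})  # assumed layout; verify on your site
    alive_states = set("QHWSRE")
    return [jid for jid, info in jobs.items()
            if info.get("job_state") in alive_states]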

@macklin-10x
Contributor

@johnyaku thank you for the patch!

@johnyaku
Author

johnyaku commented Jun 11, 2025

Thanks for the review, @macklin-10x.
I've made and successfully tested the suggested changes on PBS Pro, except for the build file, which I guess CI will check for us.

@macklin-10x
Contributor

@adam-azarchs can you please take a look over the changes and see if this is ready to merge?

Member

@adam-azarchs left a comment

My remaining comments are minor nitpicks. If you have a chance to address them soon, great; otherwise I'll merge it either way before we cut another release.

Comment on lines +12 to +13
# is sufficient, but this can often be reduced to 4 hours or less if
# '--maxjobs' is at least 10.
Member

While it's probably fine to set 4 hours regardless, this comment is misleading. The wall time being set in this template is for each individual job, not for the overall pipeline run. --maxjobs controls the number of jobs that can be queued simultaneously; increasing it may improve the overall pipeline wall time, but it would not be expected to improve the runtime of each individual job. More likely, in fact, it would cause higher I/O contention, which would increase the runtime of those jobs.

Author

I haven't seen any jobs run for more than an hour in cluster mode, but I'm happy to be educated here. My naive assumption was that --maxjobs affected the number (and size) of the chunks when input data is split. If I'm wrong then you can revert the second half of this change.
My reason for reducing this is that on our HPC, and perhaps on PBS Pro systems more generally, account quota is "reserved" based on resources × walltime. We often run dozens of cellranger or spaceranger jobs in parallel, and so requesting too much walltime can result in all of our quota getting reserved, even though we only get billed based on actual usage. Hogging quota like this can impact the rest of the team.
Perhaps something like:

24 hours (24:00:00) is sufficient but this can be adjusted by comparing against actual usage

In our system, PBS Pro dumps actual usage information to the _stdout files.

Member

--maxjobs controls a semaphore limiting the number of jobs which are queued to the cluster at a time. It does not affect the size or number of jobs in total. Together with --jobinterval, the intent is to prevent a single pipeline job from slamming a cluster hard enough to make sysadmins angry at our users.
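
For intuition, a rough sketch of that kind of throttle; this is not Martian's actual implementation, and submit_to_cluster is a hypothetical placeholder:

import threading

MAX_JOBS = 10  # analogous to --maxjobs: at most this many jobs queued at once
slots = threading.BoundedSemaphore(MAX_JOBS)

def submit_with_limit(job):
    slots.acquire()             # blocks while MAX_JOBS jobs are already queued
    try:
        submit_to_cluster(job)  # hypothetical stand-in for the real qsub path
    except Exception:
        slots.release()         # a failed submission frees its slot immediately
        raise

def on_job_finished(job):
    slots.release()             # each finished job frees a slot for the next one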

I have not previously encountered a cluster that charged based on reserved wall time as opposed to actually-used wall time; if you have that then more careful accounting is certainly in order! You can get the actual wall time (as measured by the job itself, so might be a slight underestimate as far as the cluster is concerned) from the _perf json file at the end of the run, though care must be taken when interpreting what that means in anything but the leaf nodes of the graph.

In our internal infrastructure, we usually run on spot instances in AWS, which are prone to involuntary preemption, so we aim to keep individual jobs under 1 hour (for most datasets, anyway) to avoid losing too much work when one of them fails this way and needs to be restarted. I think in most cases there will be very few jobs over 15 minutes, even, depending on the performance of your hardware (the slow stages tend to be mostly I/O limited).

Author

Thanks for the clarification. We only get billed for actual usage, but potential usage gets "reserved" until the jobs finish. 1000 jobs (from multiple parallel runs) all claiming 24 hours walltime can result in all our quota getting reserved, which prevents new jobs from queuing, breaking the pipeline orchestrating the parallel runs.
