Description
Axel kindly requested (on my behalf) to increase the maximum number of running jobs on the “interactive” partition for the RTW (following the instructions at DKRZ documentation: Limits).
Carsten shared the following:

> I created a QOS called "rtw" to overcome the limit of 5 jobs per user in the partition "interactive". Currently it allows up to 50 jobs in parallel or a maximum of 256 CPU's and we could adjust that for more resources (cpu/maxjobs) if we see that it fits with our other needs in "interactive" (see below).
>
> It is already added to user < my username! > for project/account < our project and account! > and you need to add it to your SLURM scripts => "#SBATCH --qos=rtw"
>
> Could you please run the jobs in the evening or over night, because the partition "interactive" is for interactive work (Jupyterhub/VScode/salloc) and normally NOT for batch job processing. This QOS setup is currently an exception and we have most free resources in "interactive" from evening till morning. On daytime the nodes are most times fully used with interactive sessions and we have a setup the sessions/jobs in "interactive" have highest scheduling priority in the cluster. Too much batch jobs on the day could block users with interactive sessions to get resources.

I only used the "interactive" partition because that was the default partition used in the generate.py script. I asked Carsten about using other partitions instead and he shared:

> Since your jobs are pretty small (8 cpus) you would 'burn' a lot of your computing time when using the 'compute' partition, but yes you could do it, if needed or urgent.
>
> In 'interactive' and 'shared' partition several jobs could run on one node and on 'compute' you would get for each job a complete node exclusively and I think you would waste a lot of your computing time for that. Maybe you could adopt your scripts to run several tasks on a 'compute' node in parallel and replace the need for 'interactive/shared'.
>
> If it's not time critical you could use the 'shared' partition for your small jobs. You might also check if there are idle resources in 'interactive' and might run your development there if there're not too much jobs on daytime. We might restrict this if we have workshops/hackathons ongoing on 'interactive' (in that case, we also often increase the number of nodes in 'interactive' temporarily).

Following this, Carsten enabled the QOS on the "shared" partition as well. I'm wondering whether the "shared" partition would be better for the nightly runs, and whether to use the "interactive" partition for development purposes.
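
For reference, Carsten's suggestion of running several tasks in parallel on one 'compute' node could look roughly like the sketch below. It is not taken from the RTW: the helper names `max_parallel_tasks` and `run_tasks` are hypothetical, the node CPU count is passed in rather than assumed, and each task is assumed to be an independent command needing 8 CPUs.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def max_parallel_tasks(node_cpus: int, cpus_per_task: int) -> int:
    """Number of tasks that fit on one node at the same time (at least 1)."""
    return max(1, node_cpus // cpus_per_task)


def run_tasks(commands, node_cpus: int, cpus_per_task: int = 8):
    """Run each command as a subprocess, with at most
    node_cpus // cpus_per_task of them running at once.

    Returns the exit codes in the same order as `commands`.
    """
    workers = max_parallel_tasks(node_cpus, cpus_per_task)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so codes[i] belongs to commands[i].
        codes = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return list(codes)


if __name__ == "__main__":
    # Placeholder commands standing in for the real small jobs.
    demo = [[sys.executable, "-c", f"print('task {i}')"] for i in range(4)]
    print(run_tasks(demo, node_cpus=16))
```

Whether this is worth it compared with simply using "shared" depends on how many small jobs are available to fill a node at once; a mostly idle exclusive node would still be charged in full.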
Tasks:
- Add the `--qos=rtw` directive to the RTW to enable the use of the QOS
- Update the default partition to "shared"
- Add a `dev` mode that uses the "interactive" partition
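
A minimal sketch of how the three tasks could fit together when generating the SLURM header is shown below. It is only an illustration, not the actual `generate.py` code: the function name `sbatch_header`, its parameters, and the `dev` flag are assumptions, and the account value is a placeholder.

```python
def sbatch_header(job_name: str, account: str, dev: bool = False,
                  cpus_per_task: int = 8) -> str:
    """Build the #SBATCH header for one small job.

    dev=False (default): use the "shared" partition, e.g. for nightly runs.
    dev=True: use the "interactive" partition for development.
    The "rtw" QOS is requested in both cases, since it is enabled on both.
    """
    partition = "interactive" if dev else "shared"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        "#SBATCH --qos=rtw",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "my_account" is a placeholder, not the real project/account.
    print(sbatch_header("example_job", "my_account"))
    print(sbatch_header("example_job", "my_account", dev=True))
```

Keeping the partition choice behind a single flag means the default stays on "shared", and "interactive" is only used when someone opts in explicitly.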