Job Defense Shield is a software tool for identifying and reducing instances of underutilization by the users of high-performance computing systems. The software can (1) send automated email alerts to users, (2) create reports for system administrators, and (3) automatically cancel GPU jobs at 0% utilization. Job Defense Shield is a component of the Jobstats job monitoring platform.
Below is an example report for 0% GPU utilization:
                     GPU-Hours at 0% Utilization
---------------------------------------------------------------------
    User   GPU-Hours-At-0%  Jobs             JobID             Emails
---------------------------------------------------------------------
1  u12998        308         39   62285369,62303767,62317153+   1 (7)
2  u9l487         84         14   62301737,62301738,62301742+   0
3  u39635         25          2   62184669,62187323             2 (4)
4  u24074         24         13   62303182,62303183,62303184+   0
---------------------------------------------------------------------
   Cluster: della
Partitions: gpu, llm
     Start: Wed Feb 12, 2025 at 09:50 AM
       End: Wed Feb 19, 2025 at 09:50 AM
Below is an example email to a user who is requesting too much CPU memory:
Hi Alan (u12345),
Below are your jobs that ran on the Stellar cluster in the past 7 days:
  JobID   Memory-Used  Memory-Allocated  Percent-Used  Cores  Hours
 5761066      2 GB          100 GB            2%         1      48
 5761091      4 GB          100 GB            4%         1      48
 5761092      3 GB          100 GB            3%         1      48
It appears that you are requesting too much CPU memory for your jobs since
you are only using on average 3% of the allocated memory. For help on
allocating CPU memory with Slurm, please see:
https://your-institution.edu/knowledge-base/memory
Replying to this automated email will open a support ticket with Research
Computing.
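The "3% of the allocated memory" figure in the email above can be reproduced from the table with a few lines of Python. This is only a sketch of the arithmetic, not the code Job Defense Shield actually runs: the `Job` class and the `mean_percent_used` function are hypothetical names introduced for illustration, and the unweighted mean over jobs is an assumption that happens to match the numbers shown.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical record holding the memory columns from the example email."""
    jobid: str
    mem_used_gb: float
    mem_alloc_gb: float

    @property
    def percent_used(self) -> float:
        # Share of the allocated CPU memory that the job actually used.
        return 100.0 * self.mem_used_gb / self.mem_alloc_gb


def mean_percent_used(jobs):
    """Unweighted mean of the per-job Percent-Used values (an assumption)."""
    return sum(job.percent_used for job in jobs) / len(jobs)


# The three jobs listed in the example email.
jobs = [
    Job("5761066", 2, 100),
    Job("5761091", 4, 100),
    Job("5761092", 3, 100),
]

for job in jobs:
    print(f"{job.jobid}: {job.percent_used:.0f}%")

print(f"average: {mean_percent_used(jobs):.0f}%")  # prints "average: 3%"
```

Because all three jobs request the same 100 GB, a mean weighted by allocation would give the same 3% here; the two definitions differ only when allocations vary across jobs.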
See the documentation for instructions on installing and running the software.