Skip to content

PrincetonUniversity/job_defense_shield

Repository files navigation

Tests codecov PyPI License: GPL v2 DOI

Job Defense Shield

Job Defense Shield is a software tool for identifying and reducing instances of underutilization by the users of high-performance computing systems. The software can (1) send automated email alerts to users, (2) create reports for system administrators, and (3) automatically cancel GPU jobs at 0% utilization. Job Defense Shield is a component of the Jobstats job monitoring platform.

Below is an example report for 0% GPU utilization:

                         GPU-Hours at 0% Utilization
---------------------------------------------------------------------
    User   GPU-Hours-At-0%  Jobs             JobID             Emails
---------------------------------------------------------------------
1  u12998        308         39   62285369,62303767,62317153+   1 (7)
2  u9l487         84         14   62301737,62301738,62301742+   0         
3  u39635         25          2            62184669,62187323    2 (4)         
4  u24074         24         13   62303182,62303183,62303184+   0         
---------------------------------------------------------------------
   Cluster: della
Partitions: gpu, llm
     Start: Wed Feb 12, 2025 at 09:50 AM
       End: Wed Feb 19, 2025 at 09:50 AM

Below is an example email to a user that is requesting too much CPU memory:

Hi Alan (u12345),

Below are your jobs that ran on the Stellar cluster in the past 7 days:

     JobID   Memory-Used  Memory-Allocated  Percent-Used  Cores  Hours
    5761066      2 GB          100 GB            2%         1     48
    5761091      4 GB          100 GB            4%         1     48
    5761092      3 GB          100 GB            3%         1     48

It appears that you are requesting too much CPU memory for your jobs since
you are only using on average 3% of the allocated memory. For help on
allocating CPU memory with Slurm, please see:

    https://your-institution.edu/knowledge-base/memory

Replying to this automated email will open a support ticket with Research
Computing.

Getting Started

See the documentation for installing and running the software.