Pull request #1: Add files via upload
# CHTC Summer Research Facilitation Project

**Fellow**: Kashika Mahajan
**Mentors**: Andrew Owen, Ian Ross
**Fellowship Dates**: May 19 – August 8, 2025

________

## 📚 Background

Researchers using HTCondor often struggle to quickly understand how their computational workloads (clusters of jobs) are performing. Current interfaces expose too much raw data, making it difficult, especially for less experienced users, to diagnose issues like jobs on hold, poor resource utilization, or unexpected failures.

This project aimed to build tools that simplify job monitoring, flag issues, and offer meaningful insights into workload behavior using accessible metrics and clear visual feedback.

________
## 📁 Repository Structure

- hold_classifer.py             # Diagnoses held jobs and groups them by hold reason
- runtime_histogram.py          # Plots job runtime distribution using ASCII histograms
- resource_usage_summary.py     # Summarizes requested vs actual CPU, memory, and disk
- cluster_status_dashboard.py   # Prints the status distribution of jobs in a cluster
- README.md                     # This file

________
## ⚙️ Setup and Installation

1. Clone this repository.
2. Install the packages listed in requirements.txt.
3. Make sure you have access to:
   - HTCondor Python bindings
   - Elasticsearch (if querying historical job data)

⚠️ Note: Some tools require authentication to the CHTC Elasticsearch instance, which is currently not available to general users.

________
## 🚀 Usage Instructions

Each tool is meant to be run as a standalone Python script with a cluster ID as input.

Example command: `python cluster_status_dashboard.py <ClusterId>`

________
## Features and Deliverables

1. Cluster Status Dashboard
Purpose: Quickly visualize job statuses (Idle, Running, Held, Completed)
Features:
- Combines data from both the queue and the history
- Highlights abnormal patterns using ASCII charts
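The counting step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the `STATUS_LABELS` mapping and the `status_distribution` helper are assumed names, and the label set simply mirrors the rows in the example output.

```python
from collections import Counter

# HTCondor JobStatus integer codes -> labels (illustrative subset)
STATUS_LABELS = {
    1: "Idle", 2: "Running", 3: "Removing", 4: "Completed",
    5: "Held", 6: "Transferring Output", 7: "Suspended",
}

def status_distribution(job_statuses, width=35):
    """Render one ASCII bar row per status from a list of JobStatus codes."""
    counts = Counter(job_statuses)
    total = len(job_statuses) or 1
    rows = []
    for code, label in STATUS_LABELS.items():
        n = counts.get(code, 0)
        pct = 100 * n / total
        bar = "█" * int(pct / 100 * width)
        rows.append(f"{label:<20}| {bar:<{width}} | {n:>5} | {pct:.1f}%")
    return rows

# Toy data: 3 idle, 1 running, 4 held
for row in status_distribution([1, 1, 1, 2, 5, 5, 5, 5]):
    print(row)
```

Combining queue and history data would just mean concatenating the `JobStatus` values pulled from both sources before counting.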
### Example: Cluster Status Dashboard Output
```
Cluster 12345 Status Dashboard
Status              | Bar                                 | Count | %
-----------------------------------------------------------------------------------------
Idle                | █████████████                       |  8686 | 27.0%
Running             |                                     |   433 |  1.3%
Removing            |                                     |     0 |  0.0%
Completed           |                                     |     0 |  0.0%
Held                | ███████████████████████████████████ | 23067 | 71.7%
Transferring Output |                                     |     0 |  0.0%
Suspended           |                                     |     0 |  0.0%
```
2. Cluster Runtime Histogram
Purpose: Understand runtime variance across jobs
Features:
- Binned by percentiles
- Flags jobs with runtime < 10 min
- Can print a list of affected job IDs
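The percentile binning and the short-job flag can be sketched as follows. This is an assumed, self-contained outline, not the tool's actual code; the function names and the toy runtimes are illustrative.

```python
import statistics

SHORT_CUTOFF_S = 600  # "short" means a runtime under 10 minutes

def runtime_histogram(runtimes_s, n_bins=4, width=30):
    """Bin runtimes (in seconds) at percentile edges and draw ASCII bars."""
    cuts = statistics.quantiles(runtimes_s, n=n_bins)      # n_bins - 1 cut points
    bounds = [min(runtimes_s)] + cuts + [max(runtimes_s)]
    rows = []
    for i, (lo, hi) in enumerate(zip(bounds, bounds[1:])):
        last = (i == len(bounds) - 2)
        # half-open bins, except the last bin which includes the maximum
        n = sum(1 for r in runtimes_s if lo <= r < hi or (last and r == hi))
        bar = "█" * int(n / len(runtimes_s) * width)
        rows.append(f"{lo:7.0f}s - {hi:7.0f}s | {bar} ({n})")
    return rows

def short_jobs(job_runtimes):
    """Return the IDs of (job_id, runtime_s) pairs that ran under the cutoff."""
    return [jid for jid, rt in job_runtimes if rt < SHORT_CUTOFF_S]

runtimes = [120, 300, 660, 900, 1200, 3600, 5400, 7200]
for row in runtime_histogram(runtimes):
    print(row)
print("Short jobs:", short_jobs(list(enumerate(runtimes))))
```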
### Example: Cluster Runtime Histogram Output

<img width="500" height="300" alt="image" src="https://github.com/user-attachments/assets/d6102c28-8a1b-4d7e-b87b-2b0d6be26019" />
3. Hold Classifier
Purpose: Explain why jobs were held
Features:
- Clusters jobs by HoldReasonCode + HoldReasonSubCode
- Displays percentages and example reasons
- Includes a human-readable legend of hold codes
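The grouping logic can be sketched as below. This is an assumed outline, not the classifier's actual code: `classify_holds` is an illustrative name, and `HOLD_CODE_LABELS` is a two-entry subset of the hold-code legend.

```python
from collections import Counter

# Illustrative subset of HTCondor hold-reason codes
HOLD_CODE_LABELS = {21: "StartdHeldJob", 47: "JobExecuteExceeded"}

def classify_holds(held_jobs):
    """Group held-job ads by (HoldReasonCode, HoldReasonSubCode).

    Returns (label, subcode, pct_of_held, count, example_reason) rows,
    largest group first.
    """
    key = lambda j: (j["HoldReasonCode"], j["HoldReasonSubCode"])
    groups = Counter(key(j) for j in held_jobs)
    examples = {}
    for j in held_jobs:
        examples.setdefault(key(j), j["HoldReason"])  # keep first reason seen
    total = len(held_jobs)
    return [
        (HOLD_CODE_LABELS.get(code, f"Code {code}"), sub,
         100 * n / total, n, examples[(code, sub)])
        for (code, sub), n in groups.most_common()
    ]

held = [
    {"HoldReasonCode": 21, "HoldReasonSubCode": 0,
     "HoldReason": "Job failed to complete in 72 hrs"},
    {"HoldReasonCode": 21, "HoldReasonSubCode": 0,
     "HoldReason": "Job failed to complete in 72 hrs"},
    {"HoldReasonCode": 47, "HoldReasonSubCode": 0,
     "HoldReason": "The job exceeded allowed execute duration"},
]
for label, sub, pct, n, reason in classify_holds(held):
    print(f"{label:<20} sub={sub} {pct:5.1f}% ({n})  e.g. {reason}")
```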
### Example: Hold Classifier Output

```
Cluster ID: 12345
Held Jobs in Cluster: 109
+---------------------+-----------+--------------------------+---------------------------------------------------------+
| Hold Reason Label   | SubCode   | % of Held Jobs (Count)   | Example Reason                                          |
+=====================+===========+==========================+=========================================================+
| StartdHeldJob       | 0         | 95.4% (104)              | Job failed to complete in 72 hrs                        |
+---------------------+-----------+--------------------------+---------------------------------------------------------+
| JobExecuteExceeded  | 0         | 4.6% (5)                 | The job exceeded allowed execute duration of 3+00:00:00 |
+---------------------+-----------+--------------------------+---------------------------------------------------------+
Legend:
╒════════╤════════════════════╤═══════════════════════════════════════════════════════════════════════════╕
│ Code   │ Label              │ Reason                                                                    │
╞════════╪════════════════════╪═══════════════════════════════════════════════════════════════════════════╡
│ 21     │ StartdHeldJob      │ The job was put on hold because WANT_HOLD in the machine policy was true. │
├────────┼────────────────────┼───────────────────────────────────────────────────────────────────────────┤
│ 47     │ JobExecuteExceeded │ The job's allowed execution time was exceeded.                            │
╘════════╧════════════════════╧═══════════════════════════════════════════════════════════════════════════╛
```
4. Resource Utilization Report
Purpose: Compare requested vs actual usage
Features:
- Summarizes CPU, memory, and disk usage
- Adds flags for under- (<15%) or over- (>80%) utilization
- Includes bar charts and percentiles
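The flagging rule can be sketched as two small helpers. These are illustrative names with the thresholds taken from the feature list above, not the report's actual code:

```python
def efficiency_pct(used, requested):
    """Used-to-requested ratio as a percentage (0.0 if nothing requested)."""
    if not requested:
        return 0.0
    return 100.0 * used / requested

def utilization_flag(pct, low=15.0, high=80.0):
    """Classify a utilization percentage against under/over thresholds."""
    if pct < low:
        return "under-utilized"
    if pct > high:
        return "over-utilized"
    return "ok"

# e.g. a cluster whose median job used 6.1 GiB of a 50 GiB memory request
pct = efficiency_pct(6.1, 50.0)
print(f"Memory: {pct:.1f}% -> {utilization_flag(pct)}")
```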
### Example: Resource Utilization Report Output
```
================================================================================
                       HTCondor Cluster Resource Summary
================================================================================
          Cluster ID: 12345
           Job Count: 748
         Avg Runtime: 0:56:52

                              Requested Resources
================================================================================
Memory (GiB)   :
                0.49  GiB    1 job(s)
                12.0  GiB    1 job(s)
                50.0  GiB    746 job(s)
Disk (GiB)     :
                0.1   GiB    1 job(s)
                10.0  GiB    1 job(s)
                30.0  GiB    746 job(s)
CPUs           :
                1            2 job(s)
                8            746 job(s)
GPUs           : No data

                              Number Summary Table
================================================================================
Resource (units)         :    Min     Q1  Median     Q3    Max StdDev
--------------------------------------------------------------------------------
Memory Used (GiB)        :    0.1    1.2     6.1   14.2   47.4   10.4
Disk Used (GiB)          :    0.0    0.8     0.8    0.8    1.1    0.1
CPU Usage (%)            :   0.0%  32.1%   35.8%  44.8%  85.5%  11.0%

                              Overall Utilization
================================================================================
 Memory usage [██████                                            ] 12.2%
 Disk usage   [█                                                 ] 2.6%
 CPU usage    [█████████████████                                 ] 35.8%

                                Efficiency Notes
================================================================================
 ⚠️ Memory usage is 12.2%
 ⚠️ Disk usage is 2.6%
 ✅ CPU usage is 35.8%

                                 End of Summary
================================================================================
```
________

### Cluster resource summary script

```python
import os
import sys
import csv
import statistics
from collections import Counter
from datetime import timedelta

from utils import safe_float

"""
This program prints a report on the resource requests and usage for a cluster.
"""


# Render an ASCII bar for a percentage value.
def bar(pct, width=50):
    filled = int(pct / 100 * width)
    return "[" + "█" * filled + " " * (width - filled) + f"] {pct:.1f}%"


# Efficiency of actual usage relative to the expected (requested) amount.
def efficiency(used, expected):
    if not expected:
        return 0.0
    return (used / expected) * 100


# Build one row of the usage summary table (min/Q1/median/Q3/max/stdev).
def compute_usage_summary(data, label, percentage=False):
    if not data or len(data) < 2:
        return f"{label:<25}: Not enough data"

    data_sorted = sorted(data)
    min_val = data_sorted[0]
    q1 = statistics.quantiles(data_sorted, n=4)[0]
    median = statistics.median(data_sorted)
    q3 = statistics.quantiles(data_sorted, n=4)[2]
    max_val = data_sorted[-1]
    std_dev = statistics.stdev(data_sorted)

    fmt = "{:.1f}%" if percentage else "{:.1f}"
    return (
        f"{label:<25}: "
        f"{fmt.format(min_val):>6} {fmt.format(q1):>6} {fmt.format(median):>7} "
        f"{fmt.format(q3):>6} {fmt.format(max_val):>6} {fmt.format(std_dev):>6}"
    )


# Print the resource request table.
def print_resource_table(name, values, unit=""):
    if not values:
        print(f"{name:<15}: No data")
        return

    counts = Counter(values)
    print(f"{name:<15}:")
    for val, count in sorted(counts.items()):
        print(f"{'':<15} {val:<10} {unit:<5} {count} job(s)")
    print()


# Print the full report for one cluster.
def summarize(cluster_id):
    script_dir = os.path.dirname(os.path.abspath(__file__))
    data_dir = os.path.join(script_dir, "cluster_data")
    filepath = os.path.join(data_dir, f"cluster_{cluster_id}_jobs.csv")

    if not os.path.exists(filepath):
        print(f"File not found: {filepath}")
```
> **Collaborator:** This won't be an informative error message for a user.
```python
        sys.exit(1)

    with open(filepath, newline='', encoding='utf-8') as f:
        jobs = list(csv.DictReader(f))

    mem_requested, mem_used = [], []
    disk_requested, disk_used = [], []
    run_time, cpu_used_time = [], []
    runtimes = []
    cpu_requests = []
    gpu_requests = []

    for job in jobs:
        mem_req = safe_float(job.get("RequestMemory"))
        mem_use = safe_float(job.get("ResidentSetSize_RAW"))
        if mem_req:
            mem_requested.append(round(mem_req / 1024, 2))  # Convert MiB to GiB
        if mem_use:
            mem_used.append(mem_use / 1024 / 1024)  # Convert KiB to GiB

        disk_req = safe_float(job.get("RequestDisk"))
        disk_use = safe_float(job.get("DiskUsage_RAW"))
        if disk_req:
            disk_requested.append(round(disk_req / (1024 * 1024), 2))  # Convert KiB to GiB
        if disk_use:
            disk_used.append(disk_use / (1024 * 1024))  # Convert KiB to GiB

        cpus = safe_float(job.get("RequestCpus"))
        if cpus:
            cpu_requests.append(int(cpus))

        gpus = safe_float(job.get("RequestGpus"))
        if gpus:
            gpu_requests.append(int(gpus))

        user_cpu = safe_float(job.get("RemoteUserCpu")) or 0
        sys_cpu = safe_float(job.get("RemoteSysCpu")) or 0
        wall_time = safe_float(job.get("RemoteWallClockTime"))

        if wall_time and cpus and (user_cpu or sys_cpu):
            # Total CPU time (user + system), normalized per requested core
            total_cpu_used = (user_cpu + sys_cpu) / cpus
            cpu_used_time.append(total_cpu_used)
            run_time.append(wall_time)

        if wall_time:
            runtimes.append(wall_time)

    from statistics import median
```
> **Collaborator:** You've already imported `statistics`.
```python
    # Compute per-job efficiency lists
    per_job_cpu_eff = [
        efficiency(cpu_used_time[i], run_time[i])
        for i in range(len(cpu_used_time))
        if run_time[i]
    ]

    per_job_mem_eff = [
        efficiency(mem_used[i], mem_requested[i])
        for i in range(min(len(mem_used), len(mem_requested)))
        if mem_requested[i]
    ]

    per_job_disk_eff = [
        efficiency(disk_used[i], disk_requested[i])
        for i in range(min(len(disk_used), len(disk_requested)))
        if disk_requested[i]
    ]

    # Take medians
    avg_cpu_eff = median(per_job_cpu_eff) if per_job_cpu_eff else 0
    avg_mem_eff = median(per_job_mem_eff) if per_job_mem_eff else 0
    avg_disk_eff = median(per_job_disk_eff) if per_job_disk_eff else 0

    total_jobs = len(jobs)
    avg_runtime = statistics.mean(runtimes) if runtimes else 0
    avg_runtime_str = str(timedelta(seconds=int(avg_runtime))) if avg_runtime else "N/A"

    print("=" * 80)
    print(f"{'HTCondor Cluster Resource Summary':^80}")
    print("=" * 80)
    print(f"{'Cluster ID':>20}: {cluster_id}")
    print(f"{'Job Count':>20}: {total_jobs}")
    print(f"{'Avg Runtime':>20}: {avg_runtime_str}")
    print()

    print(f"{'Requested Resources':^80}")
    print("=" * 80)
    print_resource_table("Memory (GiB)", mem_requested, "GiB")
    print_resource_table("Disk (GiB)", disk_requested, "GiB")
    print_resource_table("CPUs", cpu_requests, "")
    print_resource_table("GPUs", gpu_requests, "")

    print(f"{'Number Summary Table':^80}")
    print("=" * 80)
    print(f"{'Resource (units)':<25}: {'Min':>6} {'Q1':>6} {'Median':>7} {'Q3':>6} {'Max':>6} {'StdDev':>6}")
    print("-" * 80)

    cpu_usages, mem_values, disk_values = [], [], []

    for i in range(len(jobs)):
        if i < len(cpu_used_time) and i < len(run_time) and run_time[i]:
            cpu_usages.append(efficiency(cpu_used_time[i], run_time[i]))
        if i < len(mem_used):
            mem_values.append(mem_used[i])
        if i < len(disk_used):
            disk_values.append(disk_used[i])

    print(compute_usage_summary(mem_values, "Memory Used (GiB)"))
    print(compute_usage_summary(disk_values, "Disk Used (GiB)"))
    print(compute_usage_summary(cpu_usages, "CPU Usage (%)", percentage=True))

    print()

    print(f"{'Overall Utilization':^80}")
    print("=" * 80)
    print(f" Memory usage {bar(avg_mem_eff)}")
    print(f" Disk usage   {bar(avg_disk_eff)}")
    print(f" CPU usage    {bar(avg_cpu_eff)}")
    print()

    # Human-readable notes and warnings on efficiency
    print(f"{'Efficiency Notes':^80}")
    print("=" * 80)

    def warn(resource, efficiency):
```
> **Collaborator:** I think the upper and lower bounds should be arguments. I can definitely see value in being able to set different thresholds for each resource type.
```python
        if efficiency < 15 or efficiency > 80:
            print(f" ⚠️ {resource} usage is {efficiency:.1f}%")
        else:
            print(f" ✅ {resource} usage is {efficiency:.1f}%")

    warn("Memory", avg_mem_eff)
    warn("Disk", avg_disk_eff)
    warn("CPU", avg_cpu_eff)

    print()
    print(f"{'End of Summary':^80}")
    print("=" * 80)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python htcondor_cluster_summary.py <ClusterId>")
```
> **Collaborator:** Name of the script is wrong.
```python
        sys.exit(1)
    summarize(sys.argv[1])
```
> **Collaborator:** It looks like the files have been renamed since this readme was written.