A Prometheus exporter for PBS (Portable Batch System) cluster monitoring.
Real-time PBS cluster monitoring dashboard showing job status, node availability, and resource utilization
- Job Metrics: Track running jobs by user, queue, and status
- Node Metrics: Monitor node states, CPU/GPU usage, and memory utilization
- Queue Metrics: Track job distribution across different queues
- Real-time Updates: Metrics are updated every 60 seconds
- Dashboard Integration: Compatible with Grafana and other monitoring dashboards
The application is structured into several packages for better maintainability:
Contains all Prometheus metrics definitions and registry management:
- Job-related metrics (running jobs by user/queue, total jobs by status)
- Node-related metrics (state, CPU/GPU/memory usage)
- Node count metrics (free, busy, offline, down nodes)
Handles PBS command execution and data parsing:
- Client: Executes PBS commands (qstat, pbsnodes)
- JobData: Structured representation of job information
- NodeData: Structured representation of node information
- Parsing utilities for PBS output formats
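
As a rough illustration of the parsing step, the sketch below reads node records from text shaped like `pbsnodes -a` output: an unindented node name followed by indented `key = value` lines. The `NodeData` fields, the `parseNodes` function, and the sample input are assumptions made for this example, not the exporter's actual code, and real output varies between PBS versions.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// NodeData is a hypothetical, trimmed-down node record for this sketch.
type NodeData struct {
	Name      string
	State     string
	CPUsTotal int
	CPUsUsed  int
}

// parseNodes reads text assumed to look like `pbsnodes -a` output:
// an unindented node name, then indented "key = value" attribute lines.
func parseNodes(output string) []NodeData {
	var nodes []NodeData
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.TrimSpace(line) == "" {
			continue
		}
		// A line with no leading whitespace starts a new node record.
		if line[0] != ' ' && line[0] != '\t' {
			nodes = append(nodes, NodeData{Name: strings.TrimSpace(line)})
			continue
		}
		if len(nodes) == 0 {
			continue // attribute line before any node name; skip it
		}
		key, value, ok := strings.Cut(strings.TrimSpace(line), " = ")
		if !ok {
			continue
		}
		node := &nodes[len(nodes)-1]
		switch key {
		case "state":
			node.State = value
		case "resources_available.ncpus":
			// Malformed numbers fall back to 0 in this sketch.
			node.CPUsTotal, _ = strconv.Atoi(value)
		case "resources_assigned.ncpus":
			node.CPUsUsed, _ = strconv.Atoi(value)
		}
	}
	return nodes
}

func main() {
	sample := "node01\n     state = free\n     resources_available.ncpus = 32\n     resources_assigned.ncpus = 4\n"
	for _, n := range parseNodes(sample) {
		fmt.Printf("%s state=%s cpus=%d/%d\n", n.Name, n.State, n.CPUsUsed, n.CPUsTotal)
	}
}
```

Keeping the parser a pure function of the command output means it can be tested without a PBS installation; the piece that actually runs `pbsnodes` can stay a thin wrapper around it.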
Coordinates the HTTP server and metrics updates:
- Server: Manages the overall application state
- Metrics update coordination
- Data flow between PBS client and metrics registry
Entry point that orchestrates all components:
- Initializes all packages
- Starts background metrics collection
- Runs the HTTP server
- qstat_running_jobs_by_user: Number of running jobs per user
- qstat_running_jobs_by_queue: Number of running jobs per queue
- qstat_jobs_in_queue: Total number of jobs in each queue
- qstat_total_running_jobs: Total number of running jobs
- qstat_total_r_jobs: Total Running (R) jobs
- qstat_total_h_jobs: Total Hold (H) jobs
- qstat_total_f_jobs: Total Finished (F) jobs
- qstat_total_q_jobs: Total Queuing (Q) jobs
- qstat_total_e_jobs: Total Error (E) jobs
- qstat_total_b_jobs: Total Array Job Running (B) jobs
- qstat_total_all_jobs: Total number of all jobs
- qstat_jobs_by_status: Number of jobs by status
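
The job metrics above are all aggregates over a list of jobs, which the following sketch illustrates. The `JobData` fields, the `JobCounts` type, and `countJobs` are assumptions chosen for this example and may not match the exporter's own types.

```go
package main

import "fmt"

// JobData is a hypothetical job record; the field names are assumptions.
type JobData struct {
	ID     string
	User   string
	Queue  string
	Status string // single-letter PBS state such as "R", "Q", or "H"
}

// JobCounts holds the aggregates behind the job gauges in this sketch.
type JobCounts struct {
	RunningByUser  map[string]int // qstat_running_jobs_by_user
	RunningByQueue map[string]int // qstat_running_jobs_by_queue
	InQueue        map[string]int // qstat_jobs_in_queue
	ByStatus       map[string]int // qstat_jobs_by_status
	Total          int            // qstat_total_all_jobs
}

// countJobs makes one pass over the jobs and fills every aggregate.
func countJobs(jobs []JobData) JobCounts {
	c := JobCounts{
		RunningByUser:  make(map[string]int),
		RunningByQueue: make(map[string]int),
		InQueue:        make(map[string]int),
		ByStatus:       make(map[string]int),
	}
	for _, j := range jobs {
		c.Total++
		c.ByStatus[j.Status]++
		c.InQueue[j.Queue]++
		if j.Status == "R" {
			c.RunningByUser[j.User]++
			c.RunningByQueue[j.Queue]++
		}
	}
	return c
}

func main() {
	jobs := []JobData{
		{ID: "1", User: "alice", Queue: "workq", Status: "R"},
		{ID: "2", User: "alice", Queue: "workq", Status: "Q"},
		{ID: "3", User: "bob", Queue: "gpuq", Status: "R"},
	}
	c := countJobs(jobs)
	fmt.Println("running by user:", c.RunningByUser)
	fmt.Println("jobs by status:", c.ByStatus)
	fmt.Println("total jobs:", c.Total)
}
```

The per-status totals such as qstat_total_r_jobs and qstat_total_q_jobs can then be read from `ByStatus["R"]`, `ByStatus["Q"]`, and so on.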
- pbs_node_state: Node state (1=free, 2=busy, 3=offline, 4=down)
- pbs_node_jobs: Number of jobs on node
- pbs_node_cpus_available: Available CPUs on node
- pbs_node_cpus_used: Used CPUs on node
- pbs_node_cpus_total: Total CPUs on node
- pbs_node_gpus_available: Available GPUs on node
- pbs_node_gpus_used: Used GPUs on node
- pbs_node_gpus_total: Total GPUs on node
- pbs_node_memory_available_gb: Available memory on node in GB
- pbs_node_memory_used_gb: Used memory on node in GB
- pbs_node_memory_total_gb: Total memory on node in GB
- pbs_node_count_free: Number of nodes in free state
- pbs_node_count_busy: Number of nodes in busy state
- pbs_node_count_offline: Number of nodes in offline state
- pbs_node_count_down: Number of nodes in down state
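
The numeric encoding of pbs_node_state and the per-state node counts can be sketched as below. The function names are hypothetical, and returning 0 for any state outside the four documented ones is an assumption of this example; real `pbsnodes` output can report other or combined states that the exporter may handle differently.

```go
package main

import "fmt"

// stateValue maps a node state name to the numeric encoding documented for
// pbs_node_state. Returning 0 for any other state is an assumption.
func stateValue(state string) int {
	switch state {
	case "free":
		return 1
	case "busy":
		return 2
	case "offline":
		return 3
	case "down":
		return 4
	default:
		return 0
	}
}

// countByState tallies nodes per state name, which is the data behind
// the pbs_node_count_* gauges. The input is a list of node states.
func countByState(states []string) map[string]int {
	counts := make(map[string]int)
	for _, s := range states {
		counts[s]++
	}
	return counts
}

func main() {
	states := []string{"free", "free", "busy", "down"}
	for _, s := range states {
		fmt.Printf("state %q is exported as %d\n", s, stateValue(s))
	}
	counts := countByState(states)
	fmt.Println("free nodes:", counts["free"], "offline nodes:", counts["offline"])
}
```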
- Build the application:
  go build -o pbs-exporter
- Run the exporter:
  ./pbs-exporter
- Access metrics at http://localhost:8888/metrics
The application runs on port 8888 by default and updates metrics every 60 seconds. These values can be modified in the main.go file.
- Go 1.21+
- Prometheus client library
- PBS commands (qstat, pbsnodes) must be available in PATH
