The sams-collector runs on the compute node and collects information about the running jobs.
The collector consists of three parts: the pidfinder, the sampler and the outputs.
The pidfinder plugin finds the process IDs (PIDs) of a job.
The sampler plugins get the PIDs from the pidfinder and collect metrics about the processes.
The output plugins write the results of the samplers in different ways.
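The flow between the three parts can be sketched as follows. This is a minimal illustration only: the function names, the process table and the toy metric are hypothetical and are not the SAMS plugin API.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-ins for the three plugin roles; these are NOT the SAMS classes.

def find_pids(jobid: str, process_table: Dict[int, str]) -> List[int]:
    """pidfinder role: return the PIDs that belong to the given job."""
    return sorted(pid for pid, owner in process_table.items() if owner == jobid)


def sample(pids: Iterable[int]) -> Dict[str, int]:
    """sampler role: collect a (toy) metric about the processes."""
    return {"process_count": len(list(pids))}


def run_once(jobid: str,
             process_table: Dict[int, str],
             outputs: List[Callable[[Dict[str, int]], None]]) -> Dict[str, int]:
    """Wire the three parts together for a single collection round."""
    pids = find_pids(jobid, process_table)
    metrics = sample(pids)
    for write in outputs:  # output role: every output receives the same metrics
        write(metrics)
    return metrics


collected = []
table = {101: "4242", 102: "4242", 200: "9999"}  # assumed mapping of PID to job ID
run_once("4242", table, [collected.append])
print(collected)  # [{'process_count': 2}]
```

In the real collector each role is a configurable plugin, so several samplers and several outputs can be active at the same time.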
Usage information
Jobid to collect information about.
Note: In Slurm this must be the ''JobIDRaw'' and not the job ID with the job array extension (NNNNNN_A).
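A wrapper script could reject array-style job IDs before starting the collector. The helper below is a hypothetical sketch, not part of SAMS, and it assumes that a raw job ID consists of decimal digits only.

```python
import re


def is_raw_jobid(jobid: str) -> bool:
    """Return True if jobid is purely numeric, i.e. carries no array suffix such as _A."""
    return re.fullmatch(r"[0-9]+", jobid) is not None


print(is_raw_jobid("123456"))    # True
print(is_raw_jobid("123456_7"))  # False
```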
Path to configuration file.
Default: /etc/sams/sams-collector.yaml
Name of the current node.
Default: ''hostname'' of the machine.
Send collector into background.
Create a pid file at the given path.
Test output module with data from <json_file>.
Core options of SAMS collector.
The number of seconds to wait before trying to find new PIDs (pid_finder_update_interval).
Name of the plugin that finds PIDs (pid_finder).
A list of plugins that sample metrics about the PIDs (samplers).
A list of plugins that store the metrics from the samplers (outputs).
---
sams-collector:
  pid_finder_update_interval: 30
  pid_finder: sams.pidfinder.Slurm
  samplers:
    - sams.sampler.Core
    - sams.sampler.Software
    - sams.sampler.SlurmInfo
  outputs:
    - sams.output.File
  umask: '077' # only used in daemon mode.
  logfile: /var/log/sams-collector.%(jobid)s.%(node)s.log
  loglevel: ERROR

sams.pidfinder.Slurm:
  grace_period: 600

sams.sampler.SlurmInfo:
  sampler_interval: 30

sams.sampler.Software:
  sampler_interval: 30

sams.output.File:
  base_path: /var/spool/softwareaccounting/data
  file_pattern: "%(jobid)s.%(node)s.json"
  jobid_hash_size: 1000
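The logfile and file_pattern values use Python's percent-style placeholders with named keys. As a sketch, and assuming the collector expands them with the % operator and a mapping of jobid and node, the resulting names would look like this. The expand helper and the sample values are illustrative only.

```python
def expand(pattern: str, jobid: int, node: str) -> str:
    """Fill the %(jobid)s and %(node)s placeholders of a pattern."""
    return pattern % {"jobid": jobid, "node": node}


# Example values (assumptions): job 123456 running on node n001.
print(expand("%(jobid)s.%(node)s.json", 123456, "n001"))
# 123456.n001.json
print(expand("/var/log/sams-collector.%(jobid)s.%(node)s.log", 123456, "n001"))
# /var/log/sams-collector.123456.n001.log
```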
In the Slurm prolog, start the collector:
sams-collector.py --config=/path/config.yaml --jobid=$SLURM_JOB_ID --daemon --pidfile=/var/run/sams-collector.$SLURM_JOB_ID
The sams-collector needs to run as root.
In the Slurm epilog, send the collector a HUP signal with kill -HUP.
If the HUP signal is missing, the collector will exit after 10 minutes without active processes.
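The two exit conditions (a HUP signal, or a period without active processes) can be modelled as below. This is a hedged sketch of the described behaviour, not the collector's actual code: the ExitPolicy class and its method names are hypothetical, and the 600 second value mirrors the grace_period of the example configuration.

```python
class ExitPolicy:
    """Hypothetical helper that decides when a collector loop should stop."""

    def __init__(self, grace_period: float):
        self.grace_period = grace_period
        self.hup_received = False
        # Time of the last check that saw active processes (or of the first check).
        self.last_active = None

    def on_hup(self, signum=None, frame=None):
        """Signal-handler compatible callback: remember that HUP arrived."""
        self.hup_received = True

    def should_exit(self, now: float, active_pids) -> bool:
        """Return True after a HUP, or once no process was active for grace_period seconds."""
        if self.hup_received:
            return True
        if active_pids or self.last_active is None:
            self.last_active = now
            return False
        return now - self.last_active >= self.grace_period


policy = ExitPolicy(grace_period=600)
print(policy.should_exit(0, [4711]))  # False: a process is active
print(policy.should_exit(599, []))    # False: still inside the grace period
print(policy.should_exit(600, []))    # True: 10 minutes without active processes
```

In a real daemon the callback would be registered with signal.signal(signal.SIGHUP, policy.on_hup), and now would come from time.monotonic().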
See below for example usage with systemd.
Starting and stopping the software accounting with systemd is straightforward.
Create the file /etc/systemd/system/softwareaccounting@.service with the following content:
'''
[Unit]
Description=SAMS Software Accounting (%i)

[Service]
Environment=PYTHONPATH=/lap/softwareaccounting/lib/python3.5/site-packages
PIDFile=/var/run/software-accounting.%i.pid
ExecStart=/lap/softwareaccounting/bin/sams-collector.py --jobid=%i --config=/etc/slurm/softwareaccounting.yaml
KillSignal=SIGHUP
KillMode=process
'''
To start the accounting process, run systemctl start softwareaccounting@${SLURM_JOB_ID}.service in the Slurm prolog. To stop it, run systemctl stop softwareaccounting@${SLURM_JOB_ID}.service in the Slurm epilog.
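For sites whose prolog and epilog are written in Python rather than shell, the same systemctl calls can be built as argument lists. The helper names below are hypothetical. Only the unit name pattern and the start and stop actions come from the text above.

```python
from typing import List


def unit_name(jobid: str) -> str:
    """Instance name of the template unit for one job."""
    return "softwareaccounting@{}.service".format(jobid)


def systemctl_command(action: str, jobid: str) -> List[str]:
    """Build the systemctl argument list for the prolog (start) or epilog (stop)."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return ["systemctl", action, unit_name(jobid)]


print(systemctl_command("start", "123456"))
# ['systemctl', 'start', 'softwareaccounting@123456.service']
```

A prolog could then call subprocess.run(systemctl_command("start", os.environ["SLURM_JOB_ID"]), check=True), and the epilog would do the same with "stop".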