slurmShip is a Docker Compose-based demonstration of a small Slurm cluster (Slurm 25.11.0). It provides container images and orchestration for a minimal cluster consisting of:
- `database` (slurmdbd backend)
- `controller` (slurmctld)
- `worker01`, `worker02` (slurmd nodes)
- `login` (user login node for job submission)
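Once you've cloned the repo, you can confirm these service names straight from the compose file:

```bash
# List the services defined in docker-compose.yml (run from the repository root)
docker compose config --services
# expected names: database, controller, worker01, worker02, login
```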
The repository includes local RPMs for an offline, reproducible build and a set of entrypoint scripts to wire up munge, SSH keys, and Slurm configuration between containers.
This README explains how to build, run, and verify the cluster, plus some troubleshooting tips.
```
User
  |
  | SSH :22
  v
+--------+        Volumes: munge_keys, slurm_state, db_data
| Login  |
| Node   |
+--------+
  |
  | Slurm commands (srun, sbatch, sinfo)
  | to slurmctld :6817
  v
+--------------+      Accounting       +-----------------+
|  Controller  |---------------------->|    Database     |
|  (slurmctld) |   (slurmdbd :6819)    |    (slurmdbd)   |
|    :6817     |                       |     :6819       |
+--------------+                       |  MariaDB :3306  |
   |       |                           +-----------------+
   |       +---------------+
   | Job dispatch to       |
   | slurmd :6818          |
   v                       v
+----------+          +----------+
| Worker01 |          | Worker02 |
| (slurmd) |          | (slurmd) |
|  :6818   |          |  :6818   |
+----------+          +----------+
```
- System architecture: x86_64
- Operating system: Rocky Linux (container)
- Tested on GitHub Codespaces
- Docker (or Docker Desktop) with Buildx support
- docker-compose (or the `docker compose` plugin)
- Sufficient disk space to hold images and volumes
- On systems with SELinux, you may need to adjust volume mount options
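Before building, it's worth confirming the toolchain is present (exact version strings will differ):

```bash
docker --version          # Docker engine / Docker Desktop
docker buildx version     # Buildx support
docker compose version    # Compose v2 plugin; or: docker-compose --version
```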
The `login` image uses the local RPMs in `packages/rocky/rpms`, so build images from the repository root.
From the repository root:

```bash
# Build controller, database, workers, and login
make
```

Or build a single service (example: login):

```bash
make -C login build
```

Bring up the cluster:

```bash
docker compose up
# or run detached:
docker compose up -d
```

The compose file defines named volumes (`munge_keys`, `slurm_state`, and `db_data`) that persist state between runs.
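If you want to see where that state lives, the volumes can be inspected from the host. Note that Compose prefixes volume names with the project name (assumed below to be the checkout directory name; check `docker volume ls` for the actual names):

```bash
docker volume ls                          # find the prefixed volume names
docker volume inspect slurmship_db_data   # hypothetical name; adjust the prefix
```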
List containers and health status:

```bash
docker compose ps
# or
docker ps
```

Check logs for errors (examples):

```bash
docker compose logs controller
docker compose logs worker01
docker compose logs worker02
```

The `login` container is intended as the user-facing node. Use it to run `sinfo`, `scontrol`, `squeue`, `sbatch`, and `srun`.
Run these from the host to execute inside the `login` container:

```bash
# Check if nodes are visible
docker exec -it login sinfo

# See detailed node information
docker exec -it login scontrol show nodes

# Access an interactive shell (then su to the worker user to submit jobs)
docker exec -it login bash
```

Notes:
- `sinfo` and other Slurm commands will only show nodes once `slurmctld` and `slurmd` are communicating and munge authentication is working.
- The `login` image copies `munge.key` and `slurm.conf` from the shared `/.secret` volume so its configuration matches the controller.
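Because everything hinges on that shared key, a quick way to confirm the copy worked is to compare checksums across containers (this assumes the compose service names double as container names, as the `docker exec` examples above do):

```bash
# The same munge.key must be byte-identical in every container
for c in controller worker01 worker02 login; do
  printf '%-12s' "$c:"
  docker exec "$c" md5sum /etc/munge/munge.key
done
```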
From inside the `login` container (or via `docker exec`):

```bash
su - worker
cat > ~/hello.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello.out
echo "Hello from $(hostname)"
sleep 5
EOF
sbatch ~/hello.slurm
squeue -u $USER
```
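Once the job leaves the queue, check its output file; if accounting through `slurmdbd` is wired up, `sacct` can show the job's history as well:

```bash
cat ~/hello.out                                        # should print the worker's hostname
sacct -u "$USER" --format=JobID,JobName,State,Elapsed  # accounting via slurmdbd
```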
Troubleshooting:

- If `sinfo` shows nodes as `UNKNOWN` or they don't appear:
  - Check `docker compose logs controller` and `docker compose logs worker01`/`worker02` for error messages.
  - Verify `munged` is running in each relevant container and that `/etc/munge/munge.key` matches across containers.
- If munge authentication fails:
  - Ensure `munge.key` is present in `/.secret` (the controller copies it there); the `login` and worker containers read it from that volume.
  - Run `munge -n | unmunge` inside a container to test (see the cross-container check below).
- If Slurm client tools are missing in `login`:
  - The `login` image installs the local `slurm-25.11.0` RPMs from `packages/rocky/rpms` during the build.
  - Rebuild the `login` image if the RPMs have changed: `make -C login build`.
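The `munge -n | unmunge` test can also be run across containers from the host, which exercises both daemons and the shared key at once (container names as above):

```bash
# Encode a credential in login, decode it in controller
docker exec login munge -n | docker exec -i controller unmunge
# A "STATUS: Success (0)" line means the keys match and both munged daemons work
```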
- The current setup mounts `./home` into `/home/worker` and uses `./secret` as the shared secret volume. The `controller` creates `/.secret/worker-secret.tar.gz`, `munge.key`, and `slurm.conf`/`cgroup.conf` there for the other services to pick up.
- For a production-like setup you may prefer to:
  - Run `sshd` on the `login` container and expose SSH port(s) so users can connect with standard SSH clients.
  - Use `gosu`/`tini` for better signal handling in entrypoints (a sketch follows this list).
  - Harden SSH and munge key handling (avoid disabling `StrictHostKeyChecking` in production).
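To illustrate the `gosu`/`tini` point: with `tini` installed as the image `ENTRYPOINT`, a login-node entrypoint script might end like this. This is a minimal sketch, not this repo's actual scripts; the user and daemon names are assumptions.

```bash
#!/bin/bash
set -euo pipefail

# Start munged under its unprivileged service account (it daemonizes itself)
gosu munge /usr/sbin/munged

# exec replaces this shell, so tini (PID 1) forwards signals directly
# to the long-running foreground process, e.g. sshd
exec /usr/sbin/sshd -D
```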
Stop and remove containers and volumes:

```bash
docker compose down --volumes --remove-orphans
# or
make clean
```

Possible next steps:

- Add an SSH server to the `login` image and expose port 22
- Make the `login` container install RPMs at runtime if they're not present
- Add CI steps or a Makefile target to build all images in sequence

Open an issue with what you'd like to see next.