FireHPC: Instantly fire up container-based emulated HPC cluster

Description

FireHPC is a tool designed to quickly start and set up a tiny emulated HPC cluster based on Linux containers, ready to run non-intensive MPI jobs with Slurm.

Obviously, FireHPC does not aim at performance, as you get better performance out of your computer without the overhead of containers.

Its purposes are the following:

  • Set up development, CI and test environments
  • Learn tools and software in a breakable environment
  • Test and discover new technologies in an isolated environment

FireHPC aims to emulate HPC clusters with multiple distributions. It supports running multiple emulated HPC clusters in parallel on the same host, each cluster running in its dedicated virtual network.

The following services are automatically deployed in the emulated cluster:

  • Slurm workload manager
  • SlurmDBD accounting with MariaDB backend
  • Slurmrestd REST API
  • LDAP directory with TLS
  • SSH with public keys
  • OpenMPI

Additional components can also be deployed, such as:

  • Slurm-web
  • Metrics stack with Prometheus, Grafana and Grafana Alloy

Architecture

FireHPC requires Python >= 3.9.

FireHPC relies on:

FireHPC also requires the following community Ansible collections:

They may not be installed along with Ansible on your distribution. In this case, you can install them on your system using ansible-galaxy.
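
For example, installing a collection with ansible-galaxy looks like this (community.general is only a placeholder here, install the collections listed above):

$ ansible-galaxy collection install community.general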

Quickstart

Download and install the Rackslab packages repository signing keyring:

$ curl -sS https://pkgs.rackslab.io/keyring.asc | gpg --dearmor | sudo tee /usr/share/keyrings/rackslab.gpg > /dev/null

Create /etc/apt/sources.list.d/rackslab.sources:

  • For Debian 12 Bookworm:
Types: deb
URIs: https://pkgs.rackslab.io/deb
Suites: bookworm
Components: main
Architectures: amd64
Signed-By: /usr/share/keyrings/rackslab.gpg
  • For Debian 13 Trixie:
Types: deb
URIs: https://pkgs.rackslab.io/deb
Suites: trixie
Components: main
Architectures: amd64
Signed-By: /usr/share/keyrings/rackslab.gpg
  • For Debian sid:
Types: deb
URIs: https://pkgs.rackslab.io/deb
Suites: sid
Components: main
Architectures: amd64
Signed-By: /usr/share/keyrings/rackslab.gpg

Update the packages database:

$ sudo apt update

Install firehpc:

$ sudo apt install firehpc

Add your user to the firehpc system group:

$ sudo usermod -a -G firehpc ${USERNAME}

Import the HPCk.it repository signing key in the systemd-importd public keyring, so that downloaded container images can be verified:

$ curl -s https://hpck.it/keyring.asc | \
  sudo gpg --no-default-keyring --keyring=/etc/systemd/import-pubring.gpg --import
gpg: key F2EB7900E8151A0D: public key "HPCk.it team <[email protected]>" imported
gpg: Total number processed: 1
gpg:               imported: 1
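
Optionally, you can check the key has been imported into this keyring with, for instance:

$ sudo gpg --no-default-keyring --keyring=/etc/systemd/import-pubring.gpg --list-keys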

Start and enable systemd-networkd service:

$ sudo systemctl start systemd-networkd.service
$ sudo systemctl enable systemd-networkd.service

Install systemd-resolved:

$ sudo apt install systemd-resolved

Unfortunately, there is an ongoing bug #1031236 in Debian on the ifupdown/systemd-resolved automatic integration. One workaround is to define the DNS server IP addresses and search domains in the systemd-resolved configuration file /etc/systemd/resolved.conf. Then restart the service so the changes can take effect:

$ sudo systemctl restart systemd-resolved.service
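
As an illustration, the relevant part of /etc/systemd/resolved.conf could look like this, where the DNS server address and search domain are placeholders for your local network settings:

[Resolve]
DNS=192.168.1.1
Domains=example.lan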

Fix the order between the mymachines and resolve services in /etc/nsswitch.conf so the mymachines service can resolve the IP addresses of container names:

--- a/etc/nsswitch.conf
+++ b/etc/nsswitch.conf
@@ -9,7 +9,7 @@
 shadow:         files systemd sss
 gshadow:        files systemd
 
-hosts:          files resolve [!UNAVAIL=return] dns mymachines myhostname
+hosts:          files mymachines resolve [!UNAVAIL=return] dns myhostname
 networks:       files
 
 protocols:      db files

Without this modification, the mymachines service is basically ignored due to the return action on the resolve service. For reference, see nss-mymachines(8).

It is also recommended to increase the maximum number of inotify instances from the default 128 to, for instance, 1024 in order to avoid weird issues when starting a large number of containers:

# sysctl fs.inotify.max_user_instances=1024

This can be made persistent with:

# echo fs.inotify.max_user_instances=1024 > /etc/sysctl.d/99-firehpc.conf

Deployment of Debian 13 « trixie » clusters requires Ansible core >= 2.16 (for this fix), while the RHEL8 target is not supported by Ansible >= 2.17 because of its dropped support of Python 3.6 (see the Ansible support matrix for reference). This means Ansible core 2.16 is the only compatible release if you want to deploy both RHEL8 and Debian 13 « trixie » clusters. The easiest way to satisfy this requirement is to install Ansible from the PyPI repository. Create a virtual environment and install Ansible with this command:

$ pip install 'ansible-core<2.17' ansible ansible-runner ClusterShell
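
For reference, a minimal sketch of creating and activating such a virtual environment beforehand (the ~/firehpc-venv path is only an example):

$ python3 -m venv ~/firehpc-venv
$ source ~/firehpc-venv/bin/activate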

Note

ClusterShell is required in the virtual environment for the NodeSet filters.

Usage

Deploy

For a quick start, copy the simple example RacksDB database:

$ cp /usr/share/doc/firehpc/examples/db/racksdb.yml racksdb.yml

This file can optionally be modified to add nodes or change hostnames.

With your regular user, run FireHPC with a cluster name and an OS as arguments. For example:

$ firehpc deploy --db racksdb.yml --cluster hpc --os debian12

The available OSes are reported by this command:

$ firehpc images

Status

Once it is deployed, check the status of the emulated cluster:

$ firehpc status --cluster hpc

This reports the started containers and the randomly generated user accounts.

MPI

You can connect to your containers (e.g. admin) with this command:

$ firehpc ssh admin.hpc

Connect with a generated user account on the login node:

$ firehpc ssh <user>@login.hpc

Then compile and run an MPI program in a Slurm job:

[<user>@login ~]$ curl --silent https://raw.githubusercontent.com/mpitutorial/mpitutorial/gh-pages/tutorials/mpi-hello-world/code/mpi_hello_world.c -o helloworld.c
[<user>@login ~]$ export PATH=$PATH:/usr/lib64/openmpi/bin  # required on rocky8, not on debian11
[<user>@login ~]$ mpicc -o helloworld helloworld.c
[<user>@login ~]$ salloc -N 2
salloc: Granted job allocation 2
[<user>@login ~]$ mpirun helloworld
Hello world from processor cn1.hpc, rank 0 out of 4 processors
Hello world from processor cn1.hpc, rank 1 out of 4 processors
Hello world from processor cn2.hpc, rank 2 out of 4 processors
Hello world from processor cn2.hpc, rank 3 out of 4 processors

You can also try the Slurm REST API:

[<user>@login ~]$ export $(scontrol token)
[<user>@login ~]$ curl -H "X-SLURM-USER-NAME: ${USER}" -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" http://admin:6820/slurm/v0.0.39/nodes

Slurm-web

When Slurm-web is enabled, it is available at: http://admin.hpc/

Metrics

When the metrics stack is enabled, Grafana is available at: http://admin.hpc:3000/

Grafana is set up by default with a Slurm dashboard showing diagrams of node states and the job queue.

Clean

When you are done, you can clean up everything for a cluster with this command:

$ firehpc clean --cluster hpc

Authors

FireHPC is developed by Rackslab.

License

FireHPC is distributed under the terms of the GNU General Public License v3.0 or later (GPLv3+).