  1. Getting started
  2. Submit a job
  3. Additional info
  4. Conda and git
  5. Singularity
  6. SLURM

Getting started

For registration and account association follow:

Update (08/09/2023): If you already have an account on CINECA, note that the authentication procedure for logging in to the cluster has recently changed:

  • follow this guide to activate 2FA (send an email to [email protected] to get the activation link)
  • follow this guide from point 3 onwards: you will install smallstep, which creates a new certificate valid for 12 hours on your PC
	eval $(ssh-agent) # activate the ssh-agent
	step ssh login '<user-email>' --provisioner cineca-hpc #  obtain the certificate
  • Enter your cluster credentials (username and password) and click "Sign in". Keycloak will then ask for the OTP code generated by the Authenticator
  • Once authenticated, you will see a Success message in your browser, meaning that the certificate has been generated and is available on your PC.

IMPORTANT: the temporary certificate is valid for 12 hours. If you reboot your PC, the certificate is lost and you need to obtain a new one by running the "step ssh login ..." command again.
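
A quick way to tell whether a new login is needed is to ask the ssh-agent what it currently holds. The sketch below assumes that "step ssh login" stores the certificate in the running ssh-agent (which would explain why a reboot loses it); check_agent is a hypothetical helper name, and the exit codes are the ones documented for "ssh-add -l". It only tells you that some identity is loaded, not that the certificate is still within its 12 hours.

```shell
# Hypothetical helper: report whether the ssh-agent currently holds an identity.
# "ssh-add -l" exits with 0 (identities present), 1 (agent holds none)
# or 2 (agent not reachable).
check_agent() {
    rc=0
    ssh-add -l >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0) echo "agent holds at least one identity" ;;
        1) echo "agent is empty: run 'step ssh login ...' again" ;;
        *) echo "no ssh-agent reachable: run 'eval \$(ssh-agent)' first" ;;
    esac
}
```

Run check_agent before connecting to the cluster; if it reports an empty or unreachable agent, repeat the two commands above.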

Command and scripts inside the cluster (CINECA) to submit a job

./ <num_cpu> <max_walltime> (e.g. ./ 12 24:00:00)

	# >>> Pulling repos
	sbatch --job-name=job_example --cpus-per-task=${1} --time=${2} --output=./slurm_output/job_example.out --error=./slurm_output/job_example.out train.sbatch
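
The submit script above passes its two arguments straight to sbatch. A minimal sketch of how they could be validated first is shown below; build_submit_cmd is a hypothetical helper, not part of the original script, and it only accepts the HH:MM:SS and D-HH:MM:SS walltime formats (Slurm understands a few more).

```shell
# Hypothetical helper: validate <num_cpu> and <max_walltime>, then print the
# sbatch command used by the submit script instead of running it blindly.
build_submit_cmd() {
    cpus="$1"
    walltime="$2"
    # num_cpu must be a positive integer
    if ! printf '%s\n' "$cpus" | grep -Eq '^[1-9][0-9]*$'; then
        echo "num_cpu must be a positive integer" >&2
        return 1
    fi
    # accept HH:MM:SS or D-HH:MM:SS
    if ! printf '%s\n' "$walltime" | grep -Eq '^([0-9]+-)?[0-9]{1,2}:[0-9]{2}:[0-9]{2}$'; then
        echo "max_walltime must look like HH:MM:SS" >&2
        return 1
    fi
    echo "sbatch --job-name=job_example --cpus-per-task=${cpus} --time=${walltime}" \
         "--output=./slurm_output/job_example.out" \
         "--error=./slurm_output/job_example.out train.sbatch"
}

# Usage inside the submit script. Slurm does not create the output
# directory, so it is created before submitting:
#   mkdir -p ./slurm_output
#   cmd=$(build_submit_cmd "$1" "$2") && $cmd
```

Creating ./slurm_output up front matters because a job whose output path does not exist fails without leaving a log file behind.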


	#SBATCH --partition=g100_usr_prod
	#SBATCH --mem=20000M
	#SBATCH --ntasks=1
	#SBATCH --mail-type=ALL
	#SBATCH --mail-user=<your email>

	# >>> IF YOU NEED TO USE A CONTAINER FOLLOW THE CODE BELOW (otherwise use your code here)<<<
	# Load the module
	module load singularity
	# Run the container
	singularity exec --hostname ${SLURM_SUBMIT_HOST}${SLURM_JOB_ID} ./container.sif bash ./

	# >>> Activate the conda environment
	# >>> Execute code

CINECA: Additional info

CINECA allows the use of tmux as a terminal multiplexer:

Compute nodes on CINECA have no internet access while a job is running. Therefore:

  • pull the repos before submitting the job (e.g. in
  • to use a logger (e.g. wandb):
    • set wandb_mode to offline
    • to sync with the server, run inside the wandb folder: wandb sync --include-offline ./offline-*
    • Script to synchronize wandb offline runs (assuming a group directory that contains more than one run)
            # argument 1: group directory
            conda activate <env_name>
            RAND_ID=$(python3 -c "import wandb; print(wandb.util.generate_id());")
            echo "Syncing runs $1 to new run id $RAND_ID"
            # first, sync the most recent series of logs to the new id
            first_dir=$(ls -t $1 | head -1)
            wandb sync $1/$first_dir/wandb/$(ls -t $1/$first_dir/wandb/ | grep offline | head -1) --id $RAND_ID
            # then all the others + the last one again to sync hyper-params
            for dir in $(ls $1); do
                run=$(ls $1/$dir/wandb/ | grep offline)
                echo $run
                wandb sync $1/$dir/wandb/$run --id $RAND_ID
            done
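
The offline setting from the list above can also be applied from the job script through the WANDB_MODE environment variable, and it can help to list what would be synced before running the sync script. The sketch below assumes the same <group>/<run>/wandb/offline-* layout the sync script relies on; list_offline_runs is a hypothetical helper name.

```shell
# In the job script: make wandb write runs to disk instead of contacting the server
export WANDB_MODE=offline

# Hypothetical helper: print every offline run directory of a group.
# $1: group directory laid out as <group>/<run>/wandb/offline-*
list_offline_runs() {
    for d in "$1"/*/wandb/offline-*; do
        # skip the unexpanded pattern when nothing matches
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
        fi
    done
}

# Usage on the login node, before syncing:
#   list_offline_runs <group_dir>
```

An empty listing means there is nothing to sync, for example because the job wrote its wandb folder somewhere else.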

New setup for live tracking of experiments with wandb

  • the login node has to create a reverse proxy towards the compute node, and the running job has to wait until this proxy is up before using wandb

  • add this at the beginning of your script (use any port you prefer):

     echo Waiting for the reverse proxy...
     while ! netstat -an | grep 34567 &> /dev/null; do sleep 1; done
     export HTTP_PROXY=socks5://
     export HTTPS_PROXY=socks5://
     export SOCK_PROXY=socks5://
     echo Reverse proxy is up and running!
  • this other script must keep running on the login node for the whole duration of the job: it periodically checks which jobs are running and opens a new proxy for each of them

     # INTERVAL (seconds between checks) must be set before this loop
     while true; do
         # Get the list of nodes running jobs for the user
         nodes=$(squeue -u $USER -h -t R -o "%N" | uniq)
         for node in $nodes; do
             # Check if a reverse proxy is already set up for this node
             # (a count of 1 usually means only this grep matched, i.e. no ssh proxy yet)
             n=$(ps -f -u $USER | grep -e "ssh.*$node" | wc -l)
             if [ $n -eq 1 ]; then
                 echo Creating proxy for $node...
                 ssh -oStrictHostKeyChecking=no -N -R 34567 -f $node
             fi
         done
         sleep $INTERVAL
     done
  • the ssh connection is kept in the background and is killed by CINECA when the job ends

  • this solution works for wandb, huggingfacehub, and any library/application that uses requests, but NOT for dataset downloads by torchvision
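
The wait loop in the job script never gives up, so a job whose proxy is never created keeps consuming walltime. A minimal sketch of the same wait with a timeout follows; wait_until and proxy_is_up are hypothetical helper names, 34567 is the example port used above, and the 300 second limit is an arbitrary choice.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout_s="$1"
    shift
    waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Same check as in the job script, for the example port 34567
proxy_is_up() {
    netstat -an | grep -q 34567
}

# Usage at the top of the job script:
#   wait_until 300 proxy_is_up || { echo "Reverse proxy not available" >&2; exit 1; }
```

Failing early like this frees the node instead of leaving the job idle until the walltime runs out.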

Installing conda and git on the cluster

  • Install conda:
        mkdir -p ~/miniconda3
        wget -O ~/miniconda3/
        bash ~/miniconda3/ -b -u -p ~/miniconda3
        rm -rf ~/miniconda3/
        ~/miniconda3/bin/conda init bash
        ~/miniconda3/bin/conda init zsh
  • Create a conda environment:
        conda create -n <env_name> python=3.8
        conda activate <env_name>
  • If singularity is not installed:
        conda install -c conda-forge singularity
  • Clone git repositories:
        conda install gh --channel conda-forge
        gh auth login
        <gh token>
        git clone <repo>
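
After these steps it can be useful to confirm that the tools are actually on the PATH of a fresh shell. The sketch below uses a hypothetical missing_tools helper; the tool names passed to it are the ones installed above.

```shell
# Hypothetical helper: print every tool from the argument list
# that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Usage after opening a new shell on the cluster:
#   missing_tools conda git gh singularity
```

No output means everything was found; if conda is listed, the "conda init" step probably has not taken effect in the current shell yet.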

Singularity: Additional info

Clusters (e.g. CINECA, HPC) usually do not allow Docker for security reasons, but Singularity can be used as an alternative. Unlike Docker, Singularity creates the container as a directory inside the original host filesystem. Therefore, if you originally created the Docker container in the path /home/a/b/c, Singularity virtually creates a path /home/a/b/c inside the actual host filesystem. When you use Singularity for the first time, take note of these steps:

  • add in ~/.bashrc file:
  • pull the docker image <docker_path> and convert it into a singularity image <container>.sif (on your login node)
     module load singularity
     singularity pull <container>.sif docker://<docker_path>
  • NOTE: if you cannot pull it from the cluster, you can copy an existing .sif file to the cluster
  • test the singularity container using
       singularity shell <container>.sif
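
Pulling and converting an image can take a while, so it is worth skipping the pull when the .sif file is already there (for instance after copying it to the cluster by hand). The sketch below only prints what it would do; pull_cmd_if_missing is a hypothetical helper, and the image name in the usage line is purely illustrative.

```shell
# Hypothetical helper: print the pull command only when the image is missing.
# $1: target .sif file, $2: docker path (without the docker:// prefix)
pull_cmd_if_missing() {
    sif="$1"
    docker_path="$2"
    if [ -f "$sif" ]; then
        echo "# $sif already exists, nothing to pull"
    else
        echo "singularity pull $sif docker://$docker_path"
    fi
}

# Usage on the login node (illustrative image name):
#   module load singularity
#   $(pull_cmd_if_missing container.sif ubuntu:22.04)
```

When the file exists the helper prints a comment line, so running its output as a command does nothing.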

Useful links:

SLURM cheatsheet

  • submit a job: sbatch <file_name>.sbatch
  • show all jobs: squeue
  • show your jobs: squeue -u <username>
  • show job info: scontrol show job <job_id>
  • partition status: sinfo
  • cancel a job: scancel <job_id>
  • run an interactive session on a node: srun --nodes=1 --ntasks-per-node=1 --pty /bin/bash
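
The squeue output can be condensed into a per-state summary with standard text tools. A minimal sketch follows, assuming the input is one job state per line as printed by squeue with the "%T" format; count_states is a hypothetical helper name.

```shell
# Hypothetical helper: read one job state per line on stdin and
# print "<STATE>: <count>" for each distinct state, sorted by name.
count_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Usage on the cluster:
#   squeue -h -u "$USER" -o "%T" | count_states
```

For example, the three input lines RUNNING, PENDING, RUNNING give "PENDING: 1" followed by "RUNNING: 2".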


