For registration and account association follow:
Update (08/09/2023): If you already have an account on CINECA, note that the authentication procedure for logging in to the cluster has recently changed:
- follow this guide for activating the 2FA (send an email to [email protected] to get the activation link)
- follow this guide from step 3: you will install smallstep to create a new certificate, valid for 12 hours, on your PC
eval $(ssh-agent) # activate the ssh-agent
step ssh login '<user-email>' --provisioner cineca-hpc # obtain the certificate
- Enter your cluster credentials (username and password) and click "Sign in". Keycloak will then ask for the OTP code generated by the authenticator app.
- Once authenticated, you will see a Success message in your browser, meaning that the certificate has been generated and is available on your PC.
IMPORTANT: the temporary certificate is valid for 12 hours. If you reboot your PC, the certificate is lost and you need to obtain a new one by running the "step ssh login ..." command again.
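Since the certificate expires after 12 hours and does not survive a reboot, it can be handy to check whether one is still loaded before connecting. The sketch below is one possible way to do this, not part of the official procedure. `needs_new_cert` is a hypothetical helper name, and the check assumes that `ssh-add -l` lists certificate identities with a key type ending in `-CERT` (for example `ECDSA-CERT`).

```shell
# Hypothetical helper: reads the output of `ssh-add -l` on standard input and
# succeeds (exit status 0) when NO certificate identity is listed.
# Assumption: certificate identities show a key type ending in "-CERT".
needs_new_cert() {
    if grep -q -e '-CERT'; then
        return 1   # a certificate is loaded in the agent
    fi
    return 0       # no certificate found: a new login is needed
}
```

A possible use, from the same shell where the ssh-agent is active: `ssh-add -l 2>/dev/null | needs_new_cert && step ssh login '<user-email>' --provisioner cineca-hpc`.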
./ <num_cpu> <max_walltime>
(e.g. ./ 12 24:00:00)
# >>> Pulling repos
sbatch --job-name=job_example --cpus-per-task=${1} --time=${2} --output=./slurm_output/job_example.out --error=./slurm_output/job_example.out train.sbatch
#SBATCH --partition=g100_usr_prod
#SBATCH --mem=20000M
#SBATCH --ntasks=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email>
# >>> IF YOU NEED TO USE A CONTAINER FOLLOW THE CODE BELOW (otherwise use your code here)<<<
# Load the module
module load singularity
# Run the container
singularity exec --hostname ${SLURM_SUBMIT_HOST}${SLURM_JOB_ID} ./container.sif bash ./
# >>> Activate the conda environment
# >>> Execute code
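The launcher passes its two arguments straight to `sbatch`, so a typo only shows up once the submission fails. The sketch below shows one way to validate them first. `validate_launch_args` is a hypothetical helper, and it assumes the walltime is always given in the `HH:MM:SS` form used in the example above, although Slurm accepts other formats as well.

```shell
# Hypothetical helper for the launcher script: validates <num_cpu> and
# <max_walltime> before anything is submitted.
# Assumption: walltime is restricted to the HH:MM:SS form.
validate_launch_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: <launcher> <num_cpu> <max_walltime>" >&2
        return 1
    fi
    # <num_cpu> must be made of digits only and be greater than zero.
    case "$1" in
        ''|*[!0-9]*)
            echo "num_cpu must be a positive integer, got: $1" >&2
            return 1
            ;;
    esac
    if [ "$1" -le 0 ]; then
        echo "num_cpu must be greater than zero" >&2
        return 1
    fi
    # <max_walltime> must look like HH:MM:SS, e.g. 24:00:00.
    if ! printf '%s\n' "$2" | grep -Eq '^[0-9]+:[0-5][0-9]:[0-5][0-9]$'; then
        echo "max_walltime must look like HH:MM:SS, got: $2" >&2
        return 1
    fi
    return 0
}
```

In the launcher this could be called as `validate_launch_args "$@" || exit 1` before the `sbatch` line. It is also worth running `mkdir -p ./slurm_output` at the same point, because Slurm does not create the directory given to `--output`.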
Cineca allows the use of tmux as a terminal multiplexer:
On Cineca, the node running your job has no internet access. Therefore:
- pull the repos before submitting the job (e.g. in the launcher script, under "# >>> Pulling repos")
- to use a logger (e.g. wandb):
- use
- to sync with the server, inside the wandb folder:
wandb sync --include-offline ./offline-*
- Script to synchronize wandb offline runs (assuming a group directory that contains more than one run)
#!/bin/bash
# argument 1: group directory
conda activate <env_name>
RAND_ID=$(python3 -c "import wandb; print(wandb.util.generate_id());")
echo "Syncing runs $1 to run new id $RAND_ID"
# first, sync last series of logs to new id
first_dir=$(ls -t $1| head -1)
wandb sync $1/$first_dir/wandb/$(ls -t $1/$first_dir/wandb/ | grep offline | head -1) --id $RAND_ID
# then all the others + last again to sync hyper-params.
for dir in $(ls $1)
do
    run=$(ls $1/$dir/wandb/ | grep offline)
    echo $run
    wandb sync $1/$dir/wandb/$run --id $RAND_ID;
done
- use
the login node has to create a reverse proxy towards the compute node, while the running job has to wait until this proxy is up before using wandb
add this at the beginning of your script (use any port you prefer):
echo Waiting for the reverse proxy...
while ! netstat -an | grep 34567 &> /dev/null; do sleep 1; done
export HTTP_PROXY=socks5://
export HTTPS_PROXY=socks5://
export SOCK_PROXY=socks5://
echo Reverse proxy is up and running!
this other script must keep running on the login node for the whole duration of the job: it periodically checks which jobs are running and opens a new proxy for each of them
#!/bin/bash
INTERVAL=10
while true; do
    # Get the list of running jobs for the user
    nodes=$(squeue -u $USER -h -t R -o "%N" | uniq)
    for node in $nodes; do
        # Check if a reverse proxy is already set up for this job
        # (a count of 1 means only the grep process itself matched)
        n=$(ps -f -u $USER | grep -e "ssh.*$node" | wc -l)
        if [ $n -eq 1 ]; then
            echo Creating proxy for $node...
            ssh -oStrictHostKeyChecking=no -N -R 34567 -f $node
        fi
    done
    sleep $INTERVAL
done
the ssh connection is kept in the background and is killed by Cineca when the job ends
this solution works for wandb, huggingfacehub, and any library/application that uses requests, but NOT for dataset downloads through torchvision
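The waiting loop in the job script never gives up, so a job whose proxy is never opened would sit idle until its walltime runs out. A possible variant with a timeout is sketched below. `port_is_listening` and `wait_for_proxy` are hypothetical names, the port check mirrors the `netstat` test already used in the job script, and the timeout value is an arbitrary choice.

```shell
# Succeeds when netstat reports the given port.
# Mirrors the netstat/grep check used in the job script.
port_is_listening() {
    netstat -an 2>/dev/null | grep -q "$1"
}

# Hypothetical helper: waits up to <timeout> seconds for the proxy port.
# Returns 0 as soon as the port is seen, 1 if the timeout expires.
wait_for_proxy() {
    port="$1"
    timeout="$2"
    waited=0
    while ! port_is_listening "$port"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Reverse proxy on port $port not available after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "Reverse proxy is up and running!"
    return 0
}
```

In the job script, `wait_for_proxy 34567 300 || exit 1` would stop the job after five minutes without a proxy instead of letting it wait indefinitely.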
- Install conda:
mkdir -p ~/miniconda3
wget -O ~/miniconda3/
bash ~/miniconda3/ -b -u -p ~/miniconda3
rm -rf ~/miniconda3/
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
- Create a conda environment:
conda create -n <env_name> python=3.8
conda activate <env_name>
- If singularity is not installed:
conda install -c conda-forge singularity
- Clone git repositories:
conda install gh --channel conda-forge
echo "<gh token>" | gh auth login --with-token
git clone <repo>
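Since compute nodes are offline, a missing tool cannot be installed once the job has started. The sketch below checks on the login node that the tools from the steps above are on the PATH. `require_cmds` is a hypothetical helper name, and the list of commands passed to it is only an example.

```shell
# Hypothetical helper: reports every command from the argument list that is
# not available on PATH; succeeds only when all of them are found.
require_cmds() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `require_cmds conda gh git singularity || exit 1` could be run before submitting a job (after `module load singularity`, if Singularity is provided as a module).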
Clusters (e.g. CINECA, HPC) usually do not allow the use of Docker for security reasons; however, Singularity can be used as an alternative.
Unlike Docker, Singularity creates a container as a directory inside the original host filesystem.
Therefore, if you originally created the Docker container in the path /home/a/b/c, Singularity would virtually create a path /home/a/b/c inside the actual host filesystem.
When you use Singularity for the first time, take note of these steps:
- add in
- pull the Docker image and convert it into a Singularity image <container>.sif (on your login node)
module load singularity
singularity pull <new_sing_img>.sif docker://<docker_path>
- NOTE: if you are not able to pull it from the cluster, you can copy a pre-existing .sif onto the cluster
- test singularity container using
singularity shell <container>.sif
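Pulling and converting a Docker image can take a while, so re-pulling an image that is already on disk is usually unnecessary. The sketch below wraps the pull step so that it only runs when the .sif file is missing. `pull_if_missing` is a hypothetical helper name, and it uses only the `singularity pull <image> <uri>` form shown above.

```shell
# Hypothetical helper: pulls <docker_uri> into <image> only when the image
# file is not already present.
pull_if_missing() {
    image="$1"
    docker_uri="$2"
    if [ -f "$image" ]; then
        echo "$image already exists, skipping pull"
        return 0
    fi
    singularity pull "$image" "$docker_uri"
}
```

On the login node this could be used as `pull_if_missing <new_sing_img>.sif docker://<docker_path>`, after `module load singularity`.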
Useful links:
- submit a job
sbatch <file_name>.sbatch
- show all jobs
- show your jobs
squeue -u <username>
- show job info
scontrol show job <job_id>
- partition status
- delete a job
scancel <job_id>
- running an interactive session on a node
srun --nodes=1 --tasks-per-node=1 --pty /bin/bash
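Several of the commands above need a job ID. One way to capture it at submission time is the `--parsable` option of `sbatch`, which prints the job ID, optionally followed by `;` and the cluster name. The sketch below keeps only the ID; `job_id_from_parsable` is a hypothetical helper name.

```shell
# Extracts the job ID from the output of `sbatch --parsable`, which is either
# "<job_id>" or "<job_id>;<cluster_name>". Reads from standard input.
job_id_from_parsable() {
    cut -d ';' -f 1
}

# Possible usage on the login node (commented out: it needs Slurm):
#   job_id=$(sbatch --parsable <file_name>.sbatch | job_id_from_parsable)
#   scontrol show job "$job_id"
#   scancel "$job_id"
```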