For registration and account association follow:
https://wiki.u-gov.it/confluence/display/SCAIUS/UG2.1+Getting+started#expand-3Connectingtothecluster
Update (08/09/2023): if you already have an account on CINECA, note that the authentication procedure for logging into the cluster has recently changed:
- follow this guide to activate the 2FA (send an email to [email protected] to get the activation link)
- follow this guide from point n.3: you will install smallstep to create a new certificate, valid for 12 hours, on your PC
eval $(ssh-agent) # activate the ssh-agent
step ssh login '<user-email>' --provisioner cineca-hpc # obtain the certificate
- Enter your cluster credentials (username and password) and press the "Sign in" button. Keycloak will then ask for the OTP code generated by the Authenticator app.
- Once authenticated, you will see a Success message in your browser, meaning that the certificate has been generated and is available on your PC.
IMPORTANT: the temporary certificate is valid for 12 hours. If you reboot your PC the certificate is lost and you need to download a new one by running the "step ssh login ..." command again.
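A minimal end-to-end login sketch; the login hostname is an assumption, check the GALILEO100 user guide for the correct address:
eval $(ssh-agent)                                        # start the ssh-agent
step ssh login '<user-email>' --provisioner cineca-hpc   # obtain the 12-hour certificate via browser + 2FA
ssh <username>@login.g100.cineca.it                      # the certificate held by the agent is used automatically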
Submit a training job with:
./train.sh <num_cpu> <max_walltime>
(e.g. ./train.sh 12 24:00:00)
train.sh:
#!/bin/bash
# >>> Pulling repos
...
sbatch --job-name=job_example --cpus-per-task=${1} --time=${2} --output=./slurm_output/job_example.out --error=./slurm_output/job_example.out train.sbatch
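A possible expanded train.sh; the repository path and the output-directory handling are illustrative assumptions:
#!/bin/bash
# >>> Pulling repos (compute nodes have no internet access, so update the code from the login node)
cd ~/repos/<your_repo> && git pull && cd -   # hypothetical repository location
mkdir -p ./slurm_output                      # make sure the output directory exists before submission
# >>> Submit the job: $1 = number of CPUs, $2 = max walltime
sbatch --job-name=job_example --cpus-per-task=${1} --time=${2} --output=./slurm_output/job_example.out --error=./slurm_output/job_example.out train.sbatch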
train.sbatch:
#!/bin/bash
#SBATCH --partition=g100_usr_prod
#SBATCH --mem=20000M
#SBATCH --ntasks=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email>
# >>> IF YOU NEED TO USE A CONTAINER FOLLOW THE CODE BELOW (otherwise use your code here)<<<
# Load the module
module load singularity
# Run the container
singularity exec --hostname ${SLURM_SUBMIT_HOST}${SLURM_JOB_ID} ./container.sif bash ./container_train.sh
container_train.sh:
#!/bin/bash
# >>> Activate the conda environment
...
# >>> Execute code
...
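For reference, a hypothetical filled-in version of container_train.sh; the environment name and the entry point are illustrative assumptions:
#!/bin/bash
# >>> Activate the conda environment (assuming miniconda is installed in the home directory, which Singularity mounts by default)
source ~/miniconda3/etc/profile.d/conda.sh
conda activate <env_name>
# >>> Execute code
python train.py   # hypothetical entry point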
https://wiki.u-gov.it/confluence/display/SCAIUS/UG3.3%3A+GALILEO100+UserGuide
CINECA allows the usage of tmux as a terminal multiplexer: https://tmuxcheatsheet.com/
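A few tmux commands that are handy on the login node (e.g. to keep helper scripts running after you disconnect); the session name is arbitrary:
tmux new -s work      # start a named session
# detach with Ctrl-b d: the session keeps running on the login node
tmux attach -t work   # re-attach to the session later
tmux ls               # list active sessions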
CINECA compute nodes are offline (no internet access) while a job is running. Therefore:
- pull the repos before submitting the job (e.g. in train.sh)
- to use a logger (e.g. wandb):
  - set wandb_mode to offline (see the sketch after the sync script below)
  - to sync with the server afterwards, inside the wandb folder run:
    wandb sync --include-offline ./offline-*
- Script to synchronize wandb offline runs (supposing you have a group directory containing more than one run):
#!/bin/bash
# argument 1: group directory
conda activate <env_name>
RAND_ID=$(python3 -c "import wandb; print(wandb.util.generate_id());")
echo "Syncing runs $1 to new run id $RAND_ID"
# first, sync the last series of logs to the new id
first_dir=$(ls -t $1 | head -1)
wandb sync $1/$first_dir/wandb/$(ls -t $1/$first_dir/wandb/ | grep offline | head -1) --id $RAND_ID
# then all the others + the last again to sync hyper-params
for dir in $(ls $1)
do
    run=$(ls $1/$dir/wandb/ | grep offline)
    echo $run
    wandb sync $1/$dir/wandb/$run --id $RAND_ID
done
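A minimal sketch of how the two offline-mode pieces fit together, assuming the script above is saved as sync_wandb.sh and the runs of a group are collected under ./runs/<group> (both names are illustrative):
# inside the job script (e.g. container_train.sh): force wandb to log locally
export WANDB_MODE=offline
# after the job has finished, from the login node (which has internet access):
bash sync_wandb.sh ./runs/<group>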
- to use wandb online directly from the compute node (via a reverse proxy):
  - the login node has to create a reverse proxy towards the compute node, while the running job has to wait until this proxy is up before using wandb
  - add this at the beginning of your script (use any port you prefer):
echo Waiting for the reverse proxy...
while ! netstat -an | grep 34567 &> /dev/null; do sleep 1; done
export HTTP_PROXY=socks5://127.0.0.1:34567
export HTTPS_PROXY=socks5://127.0.0.1:34567
export SOCK_PROXY=socks5://127.0.0.1:34567
echo Reverse proxy is up and running!
  - this other script must be running on the login node for the whole duration of the job, periodically checking which jobs are running and opening a new proxy for each of them:
#!/bin/bash
INTERVAL=10
while true; do
    # Get the list of nodes running jobs for the user
    nodes=$(squeue -u $USER -h -t R -o "%N" | uniq)
    for node in $nodes; do
        # Check if a reverse proxy is already set up for this node
        n=$(ps -f -u $USER | grep -e "ssh.*$node" | wc -l)
        if [ $n -eq 1 ]; then
            echo Creating proxy for $node...
            ssh -oStrictHostKeyChecking=no -N -R 34567 -f $node
        fi
    done
    sleep $INTERVAL
done
  - the ssh connection is kept in the background and is killed by CINECA when the job ends
  - this solution works for wandb, Hugging Face Hub, and any library/application that uses requests, but NOT for dataset downloads via torchvision
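A possible way to keep the watcher alive for the whole job, assuming the second script above is saved as proxy_watcher.sh on the login node (the file name and the session name are illustrative):
# on the login node, run the watcher inside a tmux session so it survives disconnections
tmux new -s proxy
bash proxy_watcher.sh
# detach with Ctrl-b d; the watcher keeps opening a reverse SOCKS proxy (port 34567) for every running job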
- Install conda:
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
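After the init commands, reload the shell configuration so that conda is available in the current session:
source ~/.bashrc
conda --version   # should print the installed conda version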
- Create a conda environment:
conda create -n <env_name> python=3.8
conda activate <env_name>
- If singularity is not installed:
conda install -c conda-forge singularity
- Clone git repositories:
conda install gh --channel conda-forge
echo <gh_token> | gh auth login --with-token
git clone <repo>
Usually a cluster (e.g. a CINECA HPC system) does not allow the use of Docker for security reasons; however, it is possible to use Singularity as an alternative.
Singularity, differently from Docker, creates a container as a directory inside the original host filesystem.
Therefore, if you originally created the Docker container in the path /home/a/b/c, Singularity would virtually create the path /home/a/b/c inside the actual host filesystem.
When you use Singularity for the first time, take note of these steps:
- add to your ~/.bashrc file:
export SINGULARITY_CACHEDIR=/scratch/gpfs/$USER/SINGULARITY_CACHE
export SINGULARITY_TMPDIR=/tmp
- pull the Docker image <docker_path> and convert it into a Singularity image <container>.sif (on your login node):
module load singularity
singularity pull <new_sing_img>.sif docker://<docker_path>
- NOTE: if you are not able to pull it from the cluster, you can copy a pre-existing .sif into the cluster
- test the Singularity container using:
singularity shell <container>.sif
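If the code inside the container needs data stored outside of it, directories can be bind-mounted at run time. A sketch, assuming your data lives in the scratch area pointed to by $CINECA_SCRATCH (the variable and the /data mount point are assumptions):
module load singularity
singularity exec --bind $CINECA_SCRATCH:/data ./container.sif bash ./container_train.sh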
Useful links:
- https://www.hpc.iastate.edu/guides/containers
- https://researchcomputing.princeton.edu/support/knowledge-base/singularity
- https://docs.sylabs.io/guides/3.0/user-guide/build_a_container.html
- https://www.hpc.polito.it/docs/guide-slurm-it.pdf
- https://help.itc.rwth-aachen.de/en/service/rhr4fjjutttf/article/1f18ef48d8444f15bd908c592e0c44fb/
- https://docs.sylabs.io/guides/3.1/user-guide/cli/singularity_shell.html
- submit a job
sbatch <file_name>.sbatch
- show all jobs
squeue
- show your jobs
squeue -u <username>
- show job info
scontrol show job <job_id>
- partitions status
sinfo
- delete a job
scancel <job_id>
- running an interactive session on a node
srun --nodes=1 --ntasks-per-node=1 --pty /bin/bash
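On GALILEO100, an interactive session usually also needs a partition, an account, and a time limit; a sketch with hypothetical resource values (replace <account_name> with your project account):
srun --partition=g100_usr_prod --account=<account_name> --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 --time=01:00:00 --pty /bin/bash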