To install the conda environment, run:

```bash
conda env create -f environment.yml
```

then activate the environment with:

```bash
conda activate drl_artificial_unintelligence
```

On first install, retrieve the additional pip dependencies by running the `post_install.sh` script in a bash terminal:

```bash
bash post_install.sh
```
To monitor training, a TensorBoard logger is provided. It can be accessed by opening a bash terminal and running:

```bash
tensorboard --logdir runs --port 6006
```

The plots can then be viewed at http://localhost:6006/ in your local browser.
This guide walks you through submitting jobs to the HPC cluster using the SLURM workload manager.
Download and install Bitvise SSH Client from the official website. This will be your primary tool for connecting to the cluster and transferring files.
In Bitvise SSH Client, set up your connection:

- Host: `cool.hpc.lrz.de`
- Username: `drlearn001`
- Authentication: `sm-7xwnZP+8s`

Configure the authentication:

- Password: Enter your account password
- MFA (Multi-Factor Authentication): Use `drlearn001` as the MFA token
Once connected, you'll see a sidebar on the left with two important buttons:
- New Terminal Console: Opens a command-line interface for running commands
- New SFTP Window: Opens a file transfer interface for uploading/downloading files
Use the SFTP window to upload the following essential files to your cluster home directory:
- `master.sh` - The main submission script
- `run_bash.cmd` - The SLURM job script
- `learning.py` - Your Python script (or other computational files)
The `run_bash.cmd` file is a SLURM batch script that defines how your job should be executed on the cluster. Here's what each section does:

- `#SBATCH -J test`: Sets the job name to "test"
- `#SBATCH -o ./%x.%j.%N.out`: Defines the output file naming pattern (job name, job ID, node name)
- `#SBATCH -D ./`: Sets the working directory to the current directory
- `#SBATCH --clusters=serial`: Selects the serial cluster
- `#SBATCH --partition=serial_std`: Uses the standard serial partition
- `#SBATCH --mem=5000mb`: Allocates 5000 MB (~5 GB) of memory
- `#SBATCH --cpus-per-task=1`: Requests 1 CPU core
- `#SBATCH --time=10:00:00`: Sets the maximum runtime to 10 hours
- `#SBATCH [email protected]`: Email address for job notifications

- `module load python/3.8.11-base`: Loads Python 3.8.11
- `module load slurm_setup`: Loads the SLURM configuration
- `source ../venv/bin/activate`: Activates the Python virtual environment

```bash
python learning.py --dummy_variable=${VARIABLE} --hyperparameter_value=${HYPERPARAMETER}
```

Runs your Python script with the exported environment variables.
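Put together, a `run_bash.cmd` along these lines matches the directives described above. Treat it as a sketch reassembled from the description, not the exact file:

```shell
#!/bin/bash
#SBATCH -J test
#SBATCH -o ./%x.%j.%N.out
#SBATCH -D ./
#SBATCH --clusters=serial
#SBATCH --partition=serial_std
#SBATCH --mem=5000mb
#SBATCH --cpus-per-task=1
#SBATCH --time=10:00:00
#SBATCH [email protected]

# Environment setup
module load python/3.8.11-base
module load slurm_setup
source ../venv/bin/activate

# Run the script with the variables exported by master.sh
python learning.py --dummy_variable=${VARIABLE} --hyperparameter_value=${HYPERPARAMETER}
```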
The `master.sh` script automates the submission of multiple jobs with different parameters:

```bash
HYPERPARAMETER_A="hyp_A"
HYPERPARAMETER_B="hyp_B"
VARIABLE=4

for HYPERPARAMETER in "$HYPERPARAMETER_A" "$HYPERPARAMETER_B"
do
    sbatch --job-name="test" --export=VARIABLE=$VARIABLE,HYPERPARAMETER=$HYPERPARAMETER run_bash.cmd
done
```

The script:

- Defines two hyperparameter values (`hyp_A` and `hyp_B`)
- Sets a variable value (`4`)
- Loops through each hyperparameter
- Submits a separate SLURM job for each hyperparameter using `sbatch`
- Passes environment variables to each job
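Before touching the cluster, the loop can be checked locally with a dry run that prints each `sbatch` command instead of submitting it (a sketch; only the `echo` is added):

```shell
#!/bin/bash
# Dry run: print the sbatch command the loop would submit for each value.
HYPERPARAMETER_A="hyp_A"
HYPERPARAMETER_B="hyp_B"
VARIABLE=4

for HYPERPARAMETER in "$HYPERPARAMETER_A" "$HYPERPARAMETER_B"
do
    echo sbatch --job-name="test" --export=VARIABLE=$VARIABLE,HYPERPARAMETER=$HYPERPARAMETER run_bash.cmd
done
```

This should print two lines, one per hyperparameter, so you can verify the `--export` string before submitting for real.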
Your Python scripts need to be structured to work with the SLURM environment. Here are the key requirements:
Your script should use `argparse` to handle command-line arguments that will be passed from the SLURM job:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Your script description")
    parser.add_argument('--dummy_variable', type=str, required=False, help='Variable description')
    parser.add_argument('--hyperparameter_value', type=str, required=False, help='Hyperparameter description')
    args = parser.parse_args()
```

Ensure your script produces output that can be captured:
- Print important information to stdout
- Write results to files for persistence
- Handle missing or None arguments gracefully
- Write output files to the current working directory
- Use relative paths when possible
- Ensure proper file permissions
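The "handle missing or None arguments gracefully" point can be sketched as follows; the argument names follow the earlier snippet, while the fallback values are assumptions, not part of the original scripts:

```python
import argparse

def parse_args(argv=None):
    # Both arguments are optional, since SLURM may not export every variable.
    parser = argparse.ArgumentParser(description="Graceful-defaults sketch")
    parser.add_argument('--dummy_variable', type=str, required=False)
    parser.add_argument('--hyperparameter_value', type=str, required=False)
    args = parser.parse_args(argv)
    # Fall back to assumed defaults instead of crashing on None downstream
    if args.dummy_variable is None:
        args.dummy_variable = "default_var"
    if args.hyperparameter_value is None:
        args.hyperparameter_value = "hyp_A"
    return args

args = parse_args([])  # simulate a job launched with no arguments
print(args.dummy_variable, args.hyperparameter_value)  # -> default_var hyp_A
```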
A complete example:

```python
import argparse

def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description="Your ML/analysis script")
    parser.add_argument('--param1', type=str, required=False, help='Parameter 1')
    parser.add_argument('--param2', type=str, required=False, help='Parameter 2')
    args = parser.parse_args()

    # Your computation logic here
    print(f"Running with param1: {args.param1}")
    print(f"Running with param2: {args.param2}")

    # Write results to file
    with open("results.txt", "w") as f:
        f.write(f"Results for {args.param1}, {args.param2}\n")

if __name__ == "__main__":
    main()
```

1. Upload all files via SFTP
2. Open a terminal console
3. Submit jobs using either method:

   Option A - Using bash (no chmod needed):

   ```bash
   bash master.sh
   ```

   Option B - Making the scripts executable first:

   ```bash
   chmod +x master.sh run_bash.cmd
   ./master.sh
   ```

4. Monitor jobs:

   ```bash
   squeue -u drlearn001
   ```

5. Check output files when jobs complete
The system will create output files with names like `test.<jobid>.<nodename>.out` containing your script's output and any error messages.
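As a quick illustration of how that filename is assembled from the `%x.%j.%N` pattern (the job ID and node name below are assumed placeholders):

```shell
# Mimic SLURM's %x.%j.%N expansion with assumed values:
# %x = job name, %j = job ID, %N = node name.
JOB_NAME="test"
JOB_ID="12345"     # assumed job ID
NODE="node01"      # assumed node name
OUT_FILE="${JOB_NAME}.${JOB_ID}.${NODE}.out"
echo "$OUT_FILE"   # prints test.12345.node01.out
```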