drl_artificial_unitelligence

To install the conda environment, run

conda env create -f environment.yml

then activate the environment with

conda activate drl_artificial_unintelligence

On first install, retrieve the additional pip dependencies by running the post_install.sh script in a bash terminal:

bash post_install.sh

To monitor training, a TensorBoard logger is included. It can be accessed by opening a bash terminal and running:

tensorboard --logdir runs --port 6006

The plots can then be viewed at http://localhost:6006/ in a local browser.

HPC Cluster Submission Guide

This guide walks you through submitting jobs to the HPC cluster using the SLURM workload manager.

1. Install Bitvise SSH Client

Download and install Bitvise SSH Client from the official website. This will be your primary tool for connecting to the cluster and transferring files.

2. Configure Connection Settings

In Bitvise SSH Client, set up your connection:

  • Host: cool.hpc.lrz.de
  • Username: drlearn001
  • Authentication: sm-7xwnZP+8s

3. Authentication Setup

Configure the authentication:

  • Password: Enter your account password
  • MFA (Multi-Factor Authentication): Use drlearn001 as the MFA token

4. Connect and Access Tools

Once connected, you'll see a sidebar on the left with two important buttons:

  • New Terminal Console: Opens a command-line interface for running commands
  • New SFTP Window: Opens a file transfer interface for uploading/downloading files

5. Upload Required Files

Use the SFTP window to upload the following essential files to your cluster home directory:

  • master.sh - The main submission script
  • run_bash.cmd - The SLURM job script
  • learning.py - Your Python script (or other computational files)

6. Understanding run_bash.cmd (SLURM Job Script)

The run_bash.cmd file is a SLURM batch script that defines how your job should be executed on the cluster. Here's what each section does:

SLURM Directives (Lines starting with #SBATCH):

  • #SBATCH -J test: Sets the job name to "test"
  • #SBATCH -o ./%x.%j.%N.out: Defines output file naming pattern
  • #SBATCH -D ./: Sets working directory to current directory
  • #SBATCH --clusters=serial: Specifies the cluster partition
  • #SBATCH --partition=serial_std: Uses standard serial partition
  • #SBATCH --mem=5000mb: Allocates 5000 MB (about 5 GB) of memory
  • #SBATCH --cpus-per-task=1: Requests 1 CPU core
  • #SBATCH --time=10:00:00: Sets maximum runtime to 10 hours
  • #SBATCH [email protected]: Email for job notifications

Environment Setup:

  • module load python/3.8.11-base: Loads Python 3.8.11
  • module load slurm_setup: Loads SLURM configuration
  • source ../venv/bin/activate: Activates Python virtual environment

Job Execution:

python learning.py --dummy_variable=${VARIABLE} --hyperparameter_value=${HYPERPARAMETER}

Runs your Python script with environment variables
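Putting the directives and setup lines above together, a complete run_bash.cmd might look like the following sketch. The &lt;your-email&gt; placeholder and the module versions are assumptions to adapt to your account and cluster:

```shell
#!/bin/bash
# Sketch of a full run_bash.cmd, assembled from the sections above.
#SBATCH -J test
#SBATCH -o ./%x.%j.%N.out
#SBATCH -D ./
#SBATCH --clusters=serial
#SBATCH --partition=serial_std
#SBATCH --mem=5000mb
#SBATCH --cpus-per-task=1
#SBATCH --time=10:00:00
#SBATCH --mail-user=<your-email>

# Environment setup: Python module, SLURM helpers, and the project venv
module load python/3.8.11-base
module load slurm_setup
source ../venv/bin/activate

# VARIABLE and HYPERPARAMETER arrive via sbatch --export (see master.sh)
python learning.py --dummy_variable=${VARIABLE} --hyperparameter_value=${HYPERPARAMETER}
```

This script is only ever executed by SLURM via sbatch, not run directly, so the #SBATCH lines are read by the scheduler even though they look like comments.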

7. Understanding master.sh (Job Submission Controller)

The master.sh script automates the submission of multiple jobs with different parameters:

HYPERPARAMETER_A="hyp_A"
HYPERPARAMETER_B="hyp_B"
VARIABLE=4

# Loop over the hyperparameter *values* (note the $ expansions)
for HYPERPARAMETER in "$HYPERPARAMETER_A" "$HYPERPARAMETER_B"
do
    sbatch --job-name="test" --export=VARIABLE=$VARIABLE,HYPERPARAMETER=$HYPERPARAMETER run_bash.cmd
done

What it does:

  • Defines two hyperparameter values (hyp_A and hyp_B)
  • Sets a variable value (4)
  • Loops through each hyperparameter
  • Submits a separate SLURM job for each hyperparameter using sbatch
  • Passes environment variables to each job
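The same pattern extends to sweeping several values at once. The sketch below is a hypothetical extension of master.sh that nests a loop over two VARIABLE values (the list 4 8 is illustrative), submitting one job per combination; SUBMIT is set to "echo sbatch" so the script can be dry-run locally, and would be plain "sbatch" on the cluster:

```shell
#!/usr/bin/env bash
# Dry-run variant: prints the sbatch command for each of the 4 combinations.
SUBMIT="echo sbatch"

HYPERPARAMETER_A="hyp_A"
HYPERPARAMETER_B="hyp_B"

for VARIABLE in 4 8
do
    for HYPERPARAMETER in "$HYPERPARAMETER_A" "$HYPERPARAMETER_B"
    do
        $SUBMIT --job-name="test" --export=VARIABLE=$VARIABLE,HYPERPARAMETER=$HYPERPARAMETER run_bash.cmd
    done
done
```

Running it with "echo sbatch" first is a cheap way to verify the parameter grid before queueing real jobs.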

8. Python Script Requirements (learning.py structure)

Your Python scripts need to be structured to work with the SLURM environment. Here are the key requirements:

Command Line Arguments:

Your script should use argparse to handle command-line arguments that will be passed from the SLURM job:

import argparse

def main():
    parser = argparse.ArgumentParser(description="Your script description")
    parser.add_argument('--dummy_variable', type=str, required=False, help='Variable description')
    parser.add_argument('--hyperparameter_value', type=str, required=False, help='Hyperparameter description')
    args = parser.parse_args()

Output Handling:

Ensure your script produces output that can be captured:

  • Print important information to stdout
  • Write results to files for persistence
  • Handle missing or None arguments gracefully

File I/O:

  • Write output files to the current working directory
  • Use relative paths when possible
  • Ensure proper file permissions
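The output-handling and file-I/O points can be sketched as follows. SLURM_JOB_ID is a standard environment variable that SLURM sets inside a running job; the "local" fallback and the write_results helper are illustrative choices, not part of the repository:

```python
import os

def write_results(param1, param2):
    # Handle missing or None arguments gracefully
    param1 = param1 if param1 is not None else "default"
    param2 = param2 if param2 is not None else "default"

    # SLURM sets SLURM_JOB_ID inside a job; fall back to "local" so the
    # same script also runs outside the cluster.
    job_id = os.environ.get("SLURM_JOB_ID", "local")

    # Relative path: the file lands in the job's working directory (-D ./)
    out_path = f"results_{job_id}.txt"
    with open(out_path, "w") as f:
        f.write(f"Results for {param1}, {param2}\n")

    # Print to stdout so it is captured in the .out file
    print(f"Wrote {out_path}")
    return out_path

if __name__ == "__main__":
    write_results(None, "hyp_A")
```

Naming output files after the job ID keeps results from parallel jobs from overwriting each other.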

Example Structure:

import argparse

def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description="Your ML/analysis script")
    parser.add_argument('--param1', type=str, required=False, help='Parameter 1')
    parser.add_argument('--param2', type=str, required=False, help='Parameter 2')
    args = parser.parse_args()
    
    # Your computation logic here
    print(f"Running with param1: {args.param1}")
    print(f"Running with param2: {args.param2}")
    
    # Write results to file
    with open("results.txt", "w") as f:
        f.write(f"Results for {args.param1}, {args.param2}\n")

if __name__ == "__main__":
    main()

Execution Steps:

  1. Upload all files via SFTP

  2. Open terminal console

  3. Submit jobs using either method:

    Option A - Using bash (no chmod needed):

    bash master.sh

    Option B - Making executable first:

    chmod +x master.sh run_bash.cmd
    ./master.sh
  4. Monitor jobs:

    squeue -u drlearn001
  5. Check output files when jobs complete

The system will create output files named after the %x.%j.%N pattern (job name, job ID, node name), i.e. test.<jobid>.<nodename>.out, containing your script's output and any error messages.
