deprecated: please see https://github.com/laszewsk/osmi-bench for a new version
- mlcommons-osmi
- Other Repos
- Authors
  - Nate Kimball
  - Gregor von Laszewski, [email protected], orcid: 0000-0001-9558-179X
- Table of contents
- 1. Running OSMI Bench on Ubuntu natively
- 2. Running on UVA Rivanna
- 2.1 Logging into Rivanna
- 2.2 Running OSMI Bench on rivanna
- 2.3 Set up a project directory and get the code
- 2.4 Set up Python Environment
- 2.5 Build Tensorflow Serving, Haproxy, and OSMI Images
- 2.6 Compile OSMI Models in Batch Jobs
- Run benchmark with cloudmesh experiment executor
- Graphing Results
- Compile OSMI Models in Interactive Jobs (avoid using)
- 1. Running OSMI Bench on a local Windows WSL
- References
Note:
- tensorflow: Python 3.10 is the latest supported version
- smartredis: Python 3.10 is the latest supported version
Hence, we will use python3.10.
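Before creating the venv, it may help to confirm that python3.10 is actually installed. This is a sketch not found in the original instructions; the package names in the comment are the usual Ubuntu ones:

```shell
# Sketch: check that python3.10 is available before creating the venv,
# since tensorflow and smartredis here support at most Python 3.10.
if command -v python3.10 >/dev/null 2>&1; then
  py_status="found $(python3.10 --version)"
else
  # on Ubuntu: sudo apt install python3.10 python3.10-venv
  py_status="missing"
fi
echo "python3.10: $py_status"
```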
First create a venv with
ubuntu>
python3.10 -m venv ~/OSMI
source ~/OSMI/bin/activate
pip install pip -U
We assume that you are in the directory where you want to install osmi and that it does not already contain a directory called osmi. Simply use ls osmi
to check. Next we set up the osmi directory and clone it from github.
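The check described above can be sketched as a small guard. Here a temporary directory stands in for the install directory you chose; against your real directory you would drop the mktemp line:

```shell
# Guard sketch: only proceed when no osmi directory exists in the target
# directory. mktemp -d stands in for your chosen install directory.
workdir=$(mktemp -d)
cd "$workdir"
if [ -e osmi ]; then
  osmi_status="exists"   # move it aside or choose another directory
else
  osmi_status="clear"    # safe to clone
fi
echo "osmi directory: $osmi_status"
```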
To get the code, we clone the github repository
https://github.com/laszewsk/osmi.git
Please execute:
ubuntu>
mkdir ./osmi
export OSMI_HOME=$(realpath "./osmi")
export OSMI=${OSMI_HOME}
git clone https://github.com/laszewsk/osmi.git
cd osmi
pip install -r target/ubuntu/requirements-ubuntu.txt
cd models
time python train.py small_lstm # ~ 6.6s on a 5950X with RTX3090
time python train.py medium_cnn # ~ 35.6s on a 5950X with RTX3090
time python train.py large_tcnn # ~ 16m58s on a 5950X with RTX3090
Note: this documentation is unclear and not tested. The documentation does this with singularity; singularity is available on the desktop, but can we use it natively and compare with the singularity performance?
echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
sudo apt-get update && sudo apt-get install tensorflow-model-server
which tensorflow_model_server
make image
The easiest way to log into Rivanna is to use ssh. However, as we are creating singularity images, we currently need to use either bihead1 or bihead2.
Please follow the documentation at
http://sciencefmhub.org/docs/tutorials/rivanna/singularity/
to set this up.
It is best to also install cloudmesh-rivanna and cloudmesh-vpn on your local machine, so that login and management of the machine are simplified.
local>
python -m venv ~/ENV3
source ~/ENV3/bin/activate
pip install cloudmesh-rivanna
pip install cloudmesh-vpn
In case you have set up the vpn client correctly, you can now activate it from the terminal, including gitbash on Windows. If this does not work, you can alternatively use the Cisco VPN GUI client and ssh to one of the biheads.
In case you followed our documentation, you will be able to say
local>
cms vpn activate
ssh b1
Furthermore, we assume that you also have the code checked out on your laptop, as we use it later to sync the results created on the supercomputer.
local>
mkdir ~/github
cd ~/github
git clone https://github.com/laszewsk/osmi.git
cd osmi
To have the same environment variables for accessing the code as on rivanna, we introduce
local>
export USER_SCRATCH=/scratch/$USER
export USER_LOCALSCRATCH=/localscratch/$USER
export BASE=$USER_SCRATCH
export CLOUDMESH_CONFIG_DIR=$BASE/.cloudmesh
export PROJECT=$BASE/osmi
export EXEC_DIR=$PROJECT/target/rivanna
This will come in handy when we rsync the results. You are now logged in to a frontend node of rivanna.
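If you want these exports to survive new terminal sessions, they can be appended to your shell startup file. The following sketch uses a temporary file as a stand-in for ~/.bashrc (adjust to your shell):

```shell
# Sketch: append the exports to a startup file so new shells inherit them.
# A temporary file stands in for ~/.bashrc here.
rcfile=$(mktemp)
cat >> "$rcfile" <<'EOF'
export USER_SCRATCH=/scratch/$USER
export USER_LOCALSCRATCH=/localscratch/$USER
export BASE=$USER_SCRATCH
export CLOUDMESH_CONFIG_DIR=$BASE/.cloudmesh
export PROJECT=$BASE/osmi
export EXEC_DIR=$PROJECT/target/rivanna
EOF
export_count=$(grep -c '^export' "$rcfile")
echo "added $export_count exports to $rcfile"
```

The quoted heredoc delimiter ('EOF') keeps $USER and the other variables unexpanded, so they are evaluated when the shell starts rather than when you append them.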
To run the OSMI benchmark, you will first need to generate the project directory with the code. We assume you are in the group
bii_dsc_community
(TODO: SOME OTHERS MISSING, COPY FROM OUR DOCUMENTATION)
so you can create singularity images on rivanna, as well as in the slurm partitions gpu and bii_gpu.
We will set up OSMI in the /scratch/$USER directory.
First you need to create the directory. The following steps simplify this and make the installation uniform.
b1>
export USER_SCRATCH=/scratch/$USER
export USER_LOCALSCRATCH=/localscratch/$USER
export BASE=$USER_SCRATCH
export CLOUDMESH_CONFIG_DIR=$BASE/.cloudmesh
export PROJECT=$BASE/osmi
export EXEC_DIR=$PROJECT/target/rivanna
mkdir -p $BASE
cd $BASE
git clone https://github.com/laszewsk/osmi.git
cd osmi
You now have the code in $PROJECT
Note: This is no longer working
OSMI runs in batch mode; this also holds for setting up the environment, for which we created an sbatch script. This has the advantage that the installation runs via the worker nodes, which is typically faster, and also guarantees that the worker node itself is used for the installation, avoiding software incompatibilities.
b1>
cd $EXEC_DIR
sbatch environment.slurm # (this may take a while)
source $BASE/ENV3/bin/activate
See: environment.slurm
Note: currently we recommend the following alternate way, which is to run these commands directly:
b1>
cd $EXEC_DIR
module load gcc/11.4.0 openmpi/4.1.4 python/3.11.4
which python
python --version
python -m venv $BASE/ENV3 # takes about 5.2s
source $BASE/ENV3/bin/activate
pip install pip -U
time pip install -r $EXEC_DIR/requirements.txt # takes about 1m21s
cms help
We created convenient singularity images for tensorflow serving, haproxy, and the code to be executed. This is done with
b1>
cd $EXEC_DIR
make images
To run some test jobs that train a model and check that things work, you can use the commands
b1>
cd $EXEC_DIR
sbatch train-small.slurm # 26.8s on a100_80GB, bi_fox_dgx
sbatch train-medium.slurm # 33.5s on a100_80GB, bi_fox_dgx
sbatch train-large.slurm # 1m 8.3s on a100_80GB, bi_fox_dgx
Set parameters in config.in.slurm
experiment:
# different gpus require different directives
directive: "a100,v100"
# batch size
batch: "1,2,4,8,16,32,64,128"
# number of gpus
ngpus: "1,2,3,4"
# number of concurrent clients
concurrency: "1,2,4,8,16"
# models
model: "small_lstm,medium_cnn,large_tcnn"
# number of repetitions of each experiment
repeat: "1,2,3,4"
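The experiment executor expands these lists into the cartesian product of all parameter values. A quick back-of-the-envelope count (a sketch, using the list sizes from the config above) shows how many jobs that generates:

```shell
# Number of values in each list of config.in.slurm above
directives=2    # a100, v100
batches=8       # 1 .. 128
ngpus=4         # 1 .. 4
concurrency=5   # 1 .. 16
models=3        # small_lstm, medium_cnn, large_tcnn
repeats=4       # 1 .. 4
total=$((directives * batches * ngpus * concurrency * models * repeats))
echo "jobs generated: $total"
```

With the values above this comes to 3840 jobs, which is worth keeping in mind before launching the full sweep.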
To run the many different jobs that are created based on config.in.slurm, you can use the following:
b1>
cd $EXEC_DIR
make project-gpu
sh jobs-project-gpu.sh
The results will be stored in the project-gpu directory.
To analyse the results, it is best to copy them to your local computer and use a jupyter notebook.
local>
cd ~/github/osmi/target/rivanna
ssh rivanna "du -sh $EXEC_DIR/project-gpu"
# figure out if you have enough space for this project on the local machine
rsync -av rivanna:$EXEC_DIR/project-gpu ./project-gpu
Now we can analyse the data with
local>
open ./analysis/analysis-simple.ipynb
Graphs are also saved in ./analysis/out.
The notebook takes the results from the cloudmesh experiment executor and produces several graphs.
Interactive jobs allow you to reserve a node on rivanna so it looks like a login node. This interactive mode is useful only during the debug phase and can serve as a convenient way to debug and to interactively experiment with running the program.
Once you know how to create jobs with a proper batch script, you will likely no longer need interactive jobs. We keep this documentation for beginners who like to experiment in interactive mode while developing batch scripts.
First, obtain an interactive job with
rivanna>
ijob -c 1 -A bii_dsc_community -p standard --time=01:00:00
To specify a particular GPU, please use:
rivanna>
export GPUS=1
# v100
rivanna> ijob -c 1 -A bii_dsc_community --partition=bii-gpu --gres=gpu:v100:$GPUS --time=01:00:00
# (or) a100
rivanna> ijob -c 1 -A bii_dsc_community --partition=bii-gpu --gres=gpu:a100:$GPUS --time=01:00:00
node>
cd $PROJECT/models
python train.py small_lstm
python train.py medium_cnn
python train.py large_tcnn
For this application there is no separate dataset.
TODO: Nate
- create an isolated new wsl environment
- use what we do in the ubuntu section, but write separate documentation, as the ubuntu native install may have different steps or issues
wsl>
python3.10 -m venv /home/$USER/OSMI
source /home/$USER/OSMI/bin/activate
python -V
pip install pip -U
To get the code, we clone the github repository. Please execute:
wsl>
export PROJECT=/home/$USER/project/
mkdir -p $PROJECT
cd $PROJECT
git clone https://github.com/laszewsk/osmi #[email protected]:laszewsk/osmi.git
cd osmi/
pip install -r $PROJECT/osmi/wsl/requirements.txt
wsl>
cd $PROJECT/osmi/wsl
make image
cd models
time python train.py small_lstm # 14.01s user 1.71s system 135% cpu 11.605 total
time python train.py medium_cnn # 109.20s user 6.84s system 407% cpu 28.481 total
time python train.py large_tcnn
cd ..
- Production Deployment of Machine-Learned Rotorcraft Surrogate Models on HPC, Wesley Brewer, Daniel Martinez, Mathew Boyer, Dylan Jude, Andy Wissink, Ben Parsons, Junqi Yin, Valentine Anantharaj, 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), DOI: 10.1109/MLHPC54614.2021.00008, https://ieeexplore.ieee.org/document/9652868 (TODO: please ask Wes what the free pdf link is; all gov organizations have one, for example as ORNL is a coauthor it must be on their site somewhere.)
- Using Rivanna for GPU usage, Gregor von Laszewski, J.P. Fleischer, https://github.com/cybertraining-dsc/reu2022/blob/main/project/hpc/rivanna-introduction.md
- Setting up a Windows computer for research, Gregor von Laszewski, J.P. Fleischer, https://github.com/cybertraining-dsc/reu2022/blob/main/project/windows-configuration.md
- Initial notes to be deleted, Nate: https://docs.google.com/document/d/1luDAAatx6ZD_9-gM5HZZLcvglLuk_OqswzAS2n_5rNA
- Gregor von Laszewski, J.P. Fleischer, Cloudmesh VPN, https://github.com/cloudmesh/cloudmesh-vpn
- Gregor von Laszewski, Cloudmesh Rivanna, https://github.com/cloudmesh/cloudmesh-rivanna
- Gregor von Laszewski, Cloudmesh Common, https://github.com/cloudmesh/cloudmesh-common
- Gregor von Laszewski, Cloudmesh Experiment Executor, https://github.com/cloudmesh/cloudmesh-ee
- Gregor von Laszewski, J.P. Fleischer, Geoffrey C. Fox, Juri Papay, Sam Jackson, Jeyan Thiyagalingam (2023). Templated Hybrid Reusable Computational Analytics Workflow Management with Cloudmesh, Applied to the Deep Learning MLCommons Cloudmask Application. eScience'23. https://github.com/cyberaide/paper-cloudmesh-cc-ieee-5-pages/raw/main/vonLaszewski-cloudmesh-cc.pdf
- Gregor von Laszewski, J.P. Fleischer, R. Knuuti, G.C. Fox, J. Kolessar, T.S. Butler, J. Fox (2023). Opportunities for enhancing MLCommons efforts while leveraging insights from educational MLCommons earthquake benchmarks efforts. Frontiers in High Performance Computing. https://doi.org/10.3389/fhpcp.2023.1233877