
Commit 21c9139

Merge pull request #1 from sameerd/master
Examples for accessing the GPU via Docker

2 parents 800651b + 8dd9dae

26 files changed (+2811, -0 lines)

docker/README.md

Lines changed: 72 additions & 0 deletions
### Using GPUs on CHTC via Docker

Docker is software that bundles programs, libraries, and dependencies into a
package called a **container**. Once built, these containers can be run on
different machines that have the Docker Engine. Programs with complex
dependencies are often packaged with Docker and made available for download
on [DockerHub](https://hub.docker.com).

The Docker Engine needs special configuration to give the software inside a
container access to a GPU. CHTC does this behind the scenes with
`nvidia-docker`. Any Docker container that is run via `nvidia-docker` must
include the Nvidia CUDA toolkit. Here we have working examples, along with
some pointers on how to find containers or build your own containers that can
access the GPU.

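If you have a machine of your own with the Nvidia drivers and the NVIDIA
Container Toolkit installed (and Docker 19.03 or later), you can sanity-check
a CUDA container locally before submitting; a quick test along these lines
should work. On CHTC itself this configuration is handled for you.

```shell
# Ask Docker for GPU access and run nvidia-smi inside the container to
# confirm the GPU is visible (requires Docker 19.03+ and the NVIDIA
# Container Toolkit on the host).
docker run --rm --gpus all nvidia/cuda:10.1-base-ubuntu18.04 nvidia-smi
```
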
### Examples

1. **Hello\_GPU**
This is a simple example to see if we can access the GPU from inside a Docker
container on CHTC. It uses the
[nvidia/cuda](https://hub.docker.com/r/nvidia/cuda) Docker image, which is a
tiny container that contains only the Nvidia CUDA toolkit.
[Click here to access this example](./hello_gpu/).

2. **Matrix Multiplication with TensorFlow (Python)**
This example uses a [TensorFlow](https://www.tensorflow.org) [Docker
container](https://hub.docker.com/r/tensorflow/tensorflow/) to benchmark
matrix multiplication on a GPU against the same multiplication on a CPU.
[Click here to access this example](./tensorflow_python/).

3. **Convolutional Neural Network with PyTorch (Python)**
This example shows how to send training and test data to the compute node
along with the script. After processing, the trained network is returned to
the submit node.
[Click here to access this example](./pytorch_python/).

### Finding containers
1. Pick a container built on a recent version of the CUDA Toolkit. Although
the toolkits are backwards compatible, the more recent the toolkit, the less
likely you are to run into problems.
2. The [Nvidia Catalog](https://ngc.nvidia.com/catalog/landing) has a good
selection of containers that use the GPU for machine learning, inference,
visualization, etc. They need to be uploaded to your own account on DockerHub
before being used. This can be done with the Docker application (see the
sketch after this list) or with the Docker Automated Builder (see below).
3. [Rocker](https://hub.docker.com/u/rocker) is a great place to find
GPU-enabled machine learning software for the [R Project for Statistical
Computing](https://www.r-project.org).

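For example, copying an image from the Nvidia Catalog to your own DockerHub
account with the Docker application would look roughly like this, where
`yourname` is a placeholder for your DockerHub username:

```shell
# Pull from Nvidia's registry, retag under your account, push to DockerHub.
docker pull nvcr.io/nvidia/pytorch:19.07-py3
docker tag nvcr.io/nvidia/pytorch:19.07-py3 yourname/pytorch:19.07-py3
docker push yourname/pytorch:19.07-py3
```
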
### Building containers
Building your own containers to access a GPU requires a bit of work and will
not be described fully here. It is best to start with a basic container that
can access the GPU and then build on top of it. The PyTorch Docker container
is built on top of Nvidia CUDA and is a [good example to follow](https://github.com/pytorch/pytorch/blob/master/docker/pytorch/Dockerfile).

```Dockerfile
FROM nvidia/cuda:10.1-base-ubuntu18.04
# ...
```

or

```Dockerfile
# Pull from Nvidia's catalog
FROM nvcr.io/nvidia/pytorch:19.07-py3

# conda is already installed, so just install packages
RUN conda install package_1 package_2 package_etc
```

Once you have a working `Dockerfile`, you need to build a Docker image with
the Docker app and then upload it to DockerHub so that CHTC can access your
container. Alternatively, you can have the DockerHub cloud build service
build it for you directly on DockerHub.
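
A minimal sketch of the manual route, assuming you are logged in to DockerHub
(`docker login`) as `yourname` and are in the directory containing your
`Dockerfile` (both the username and the image name below are placeholders):

```shell
# Build the image locally under your DockerHub account, then push it.
docker build -t yourname/my-gpu-image:latest .
docker push yourname/my-gpu-image:latest
```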

docker/hello_gpu/README.md

Lines changed: 63 additions & 0 deletions

### Running Hello GPU via Docker

This is a simple example that uses the [Nvidia CUDA container](https://hub.docker.com/r/nvidia/cuda/) from DockerHub.

### Submit file
We set `universe` and `docker_image` in the submit file so that CHTC knows to
pull the right image.

```
universe = docker
docker_image = nvidia/cuda:10.1-base-ubuntu18.04
```

We require a machine with a modern version of the CUDA driver. CUDA drivers
are usually backwards compatible, so a machine with CUDA driver version 10.1
should be able to run containers built with older versions of CUDA.
```
Requirements = (Target.CUDADriverVersion >= 10.1)
```

We also request a CPU along with the GPU.
```
request_cpus = 1
request_gpus = 1
```
[The complete submit file is available here](./hello_gpu.sub).

Submit the job with
```shell
condor_submit hello_gpu.sub
```

### Execute file
We run `nvidia-smi`, which prints diagnostic information about the GPU.
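
The executable itself is just a short shell script; the full version is in
[hello_gpu.sh](./hello_gpu.sh), and its core is simply:

```shell
#!/bin/bash
# Report where we landed, then probe the GPU.
echo "Hello CHTC from Job $1 running on `hostname`"
nvidia-smi
```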

### Output
The output should be similar to the following.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

You can see a complete list of the expected output files in the [expected output directory](./expected_output/).

docker/hello_gpu/expected_output/docker_stderror

Whitespace-only changes.

docker/hello_gpu/expected_output/hello_gpu.err.txt

Whitespace-only changes.
docker/hello_gpu/expected_output/hello_gpu.log.txt

Lines changed: 32 additions & 0 deletions
000 (8949703.000.000) 2019-08-26 21:25:48 Job submitted from host: <128.105.244.191:9618?addrs=128.105.244.191-9618&alias=submit-1.chtc.wisc.edu&noUDP&sock=schedd_2058300_2d2a_12>
...
040 (8949703.000.000) 2019-08-26 21:27:36 Started transferring input files
	Transferring to host: <128.105.245.10:9618?addrs=128.105.245.10-9618&alias=gpu2000.chtc.wisc.edu&noUDP&sock=starter_32162_c9d5_18958>
...
040 (8949703.000.000) 2019-08-26 21:27:36 Finished transferring input files
...
001 (8949703.000.000) 2019-08-26 21:27:37 Job executing on host: <128.105.245.10:9618?addrs=128.105.245.10-9618&alias=gpu2000.chtc.wisc.edu&noUDP&sock=startd_32073_fd1f_3>
...
040 (8949703.000.000) 2019-08-26 21:27:38 Started transferring output files
...
040 (8949703.000.000) 2019-08-26 21:27:38 Finished transferring output files
...
005 (8949703.000.000) 2019-08-26 21:27:39 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	1678  -  Run Bytes Sent By Job
	142  -  Run Bytes Received By Job
	1678  -  Total Bytes Sent By Job
	142  -  Total Bytes Received By Job
	Partitionable Resources :    Usage  Request  Allocated  Assigned
	   Cpus                 :     0.01        1         20
	   Disk (KB)            :       25   512000  725632508
	   Gpus                 :                 1          1  "CUDA1"
	   Ioheavy              :                 0
	   Memory (MB)          :              1024     256194

	Job terminated of its own accord at 2019-08-27T02:27:38Z.
...

docker/hello_gpu/expected_output/hello_gpu.out.txt

Lines changed: 23 additions & 0 deletions
Hello CHTC from Job 0 running on dcosta2-8949703.0-gpu2000.chtc.wisc.edu

Trying to see if nvidia/cuda can access the GPU....
Tue Aug 27 02:27:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

docker/hello_gpu/hello_gpu.sh

Lines changed: 5 additions & 0 deletions

#!/bin/bash
echo "Hello CHTC from Job $1 running on `hostname`"
echo ""
echo "Trying to see if nvidia/cuda can access the GPU...."
nvidia-smi

docker/hello_gpu/hello_gpu.sub

Lines changed: 32 additions & 0 deletions

# hello_gpu.sub
# Submit file to access the GPU via Docker

# Must set the universe to Docker
universe = docker
docker_image = nvidia/cuda:10.1-base-ubuntu18.04

# set the log, error and output files
log = hello_gpu.log.txt
error = hello_gpu.err.txt
output = hello_gpu.out.txt

# set the executable to run
executable = hello_gpu.sh
arguments = $(Process)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# We require a machine with a modern version of the CUDA driver
Requirements = (Target.CUDADriverVersion >= 10.1)

# We must request 1 CPU in addition to 1 GPU
request_cpus = 1
request_gpus = 1

# select some memory and disk space
request_memory = 1GB
request_disk = 500MB

# Tell HTCondor to run 1 instance of our job:
queue 1

(Binary file, 33.2 MB, not shown.)

docker/pytorch_python/README.md

Lines changed: 48 additions & 0 deletions

### Convolutional Neural Network with PyTorch (Python)
This example shows how to send training and test data to the compute node
along with the script. After processing, the trained network is returned to
the submit node.

### Submit file

Here the submit file stays the same as in the [Hello\_GPU](../hello_gpu/) example, with a few minor tweaks.

We set the Docker image to a version of PyTorch that is built with CUDA.
```
docker_image = pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime
```

We also need the Python script and the data transferred to the compute node.
```
transfer_input_files = main.py, MNIST_data.tar.gz
```

The rest of the submit file remains the same. We submit the job with
```shell
condor_submit pytorch_cnn.sub
```

### Execute script
The [execute shell script](./pytorch_cnn.sh) extracts the data and then calls
a Python script, [main.py](./main.py), which learns the network weights and
saves them to disk. The execute script then deletes the data directory so that
it is not returned to the submit node.

```shell
tar zxf MNIST_data.tar.gz
python main.py --save-model --epochs 20
rm -r data
```

### Output
The trained CNN is returned to us as a file,
[mnist\_cnn.pt](./expected_output/mnist_cnn.pt). There are also some output
stats on the training and test error in the [output
files](./expected_output/pytorch_cnn.out.txt).
```
Test set: Average loss: 0.0278, Accuracy: 9909/10000 (99%)
```

You can see a complete list of the expected output files in the [expected
output directory](./expected_output/).
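
To use the returned weights on the submit node or your own machine, a loading
sketch like the following should work. It assumes `main.py` follows the
standard PyTorch MNIST example: it defines a `Net` module class and saved the
weights with `torch.save(model.state_dict(), "mnist_cnn.pt")`; adjust the
class name to match the actual script.

```python
# Sketch: load the trained weights returned by the job.
# Assumes main.py defines a Net class (as in the standard PyTorch MNIST
# example) and saved the model with torch.save(model.state_dict(), ...).
import torch

from main import Net  # hypothetical import; match the class in your main.py

model = Net()
model.load_state_dict(torch.load("mnist_cnn.pt", map_location="cpu"))
model.eval()  # inference mode: disables dropout and batch-norm updates
```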
