
Commit 21c9139

Merge pull request #1 from sameerd/master
Examples for accessing the GPU via Docker

2 parents 800651b + 8dd9dae

26 files changed (+2811, -0 lines)

docker/README.md

Lines changed: 72 additions & 0 deletions
### Using GPUs on CHTC via Docker

Docker is software that bundles programs, libraries, and dependencies into a
package called a **container**. Once built, these containers can be run on
different machines that have the Docker Engine. Programs with complex
dependencies are often packaged with Docker and made available for download
on [DockerHub](https://hub.docker.com).

The Docker Engine needs special configuration to give the software inside a
container access to a GPU. CHTC does this behind the scenes with
`nvidia-docker`. Any Docker container that is run via `nvidia-docker` must
include the Nvidia CUDA toolkit. Here we have working examples, along with
some pointers on how to find containers or build your own containers that can
access the GPU.

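If you have a machine of your own with the Nvidia drivers and the NVIDIA
Container Toolkit installed (and Docker 19.03 or later), you can sanity-check
a CUDA container locally before submitting; a quick test along these lines
should work. On CHTC itself this configuration is handled for you.

```shell
# Ask Docker for GPU access and run nvidia-smi inside the container to
# confirm the GPU is visible (requires Docker 19.03+ and the NVIDIA
# Container Toolkit on the host).
docker run --rm --gpus all nvidia/cuda:10.1-base-ubuntu18.04 nvidia-smi
```
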
### Examples

1. **Hello\_GPU**
This is a simple example to see if we can access the GPU from inside a Docker
container on CHTC. It uses the
[nvidia/cuda](https://hub.docker.com/r/nvidia/cuda) Docker image, which is a
tiny container that contains only the Nvidia CUDA toolkit.
[Click here to access this example](./hello_gpu/).

2. **Matrix Multiplication with TensorFlow (Python)**
This example uses a [TensorFlow](https://www.tensorflow.org) [Docker
container](https://hub.docker.com/r/tensorflow/tensorflow/) to benchmark
matrix multiplication on a GPU against the same multiplication on a CPU.
[Click here to access this example](./tensorflow_python/).

3. **Convolutional Neural Network with PyTorch (Python)**
This example shows how to send training and test data to the compute node
along with the script. After processing, the trained network is returned to
the submit node.
[Click here to access this example](./pytorch_python/).

### Finding containers
1. Pick a container built on a recent version of the CUDA Toolkit. Although
the toolkits are backwards compatible, the more recent the toolkit, the less
likely you are to run into problems.
2. The [Nvidia Catalog](https://ngc.nvidia.com/catalog/landing) has a good
selection of containers that use the GPU for machine learning, inference,
visualization, etc. They need to be uploaded to your own account on DockerHub
before being used. This can be done with the Docker application (see the
sketch after this list) or with the Docker Automated Builder (see below).
3. [Rocker](https://hub.docker.com/u/rocker) is a great place to find
GPU-enabled machine learning software for the [R Project for Statistical
Computing](https://www.r-project.org).

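For example, copying an image from the Nvidia Catalog to your own DockerHub
account with the Docker application would look roughly like this, where
`yourname` is a placeholder for your DockerHub username:

```shell
# Pull from Nvidia's registry, retag under your account, push to DockerHub.
docker pull nvcr.io/nvidia/pytorch:19.07-py3
docker tag nvcr.io/nvidia/pytorch:19.07-py3 yourname/pytorch:19.07-py3
docker push yourname/pytorch:19.07-py3
```
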
### Building containers
Building your own containers to access a GPU requires a bit of work and will
not be described fully here. It is best to start with a basic container that
can access the GPU and then build on top of it. The PyTorch Docker container
is built on top of Nvidia CUDA and is a [good example to follow](https://github.com/pytorch/pytorch/blob/master/docker/pytorch/Dockerfile).

```Dockerfile
FROM nvidia/cuda:10.1-base-ubuntu18.04
# ...
```

or

```Dockerfile
# Pull from Nvidia's catalog
FROM nvcr.io/nvidia/pytorch:19.07-py3

# conda is already installed, so just install packages
RUN conda install package_1 package_2 package_etc
```

Once you have a working `Dockerfile`, you need to build a Docker image with
the Docker app and then upload it to DockerHub so that CHTC can access your
container. Alternatively, you can have the DockerHub cloud build service
build it for you directly on DockerHub.
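
A minimal sketch of the manual route, assuming you are logged in to DockerHub
(`docker login`) as `yourname` and are in the directory containing your
`Dockerfile` (both the username and the image name below are placeholders):

```shell
# Build the image locally under your DockerHub account, then push it.
docker build -t yourname/my-gpu-image:latest .
docker push yourname/my-gpu-image:latest
```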

docker/hello_gpu/README.md

Lines changed: 63 additions & 0 deletions

### Running Hello GPU via Docker

This is a simple example that uses the [Nvidia CUDA container](https://hub.docker.com/r/nvidia/cuda/) from DockerHub.

### Submit file
We set `universe` and `docker_image` in the submit file so that CHTC knows to
pull the right image.

```
universe = docker
docker_image = nvidia/cuda:10.1-base-ubuntu18.04
```

We require a machine with a modern version of the CUDA driver. CUDA drivers
are usually backwards compatible, so a machine with CUDA driver version 10.1
should be able to run containers built with older versions of CUDA.
```
Requirements = (Target.CUDADriverVersion >= 10.1)
```

We also request a CPU along with the GPU.
```
request_cpus = 1
request_gpus = 1
```
[The complete submit file is available here](./hello_gpu.sub).

Submit the job with
```shell
condor_submit hello_gpu.sub
```

### Execute file
We run `nvidia-smi`, which prints diagnostic information about the GPU.
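
The executable itself is just a short shell script; the full version is in
[hello_gpu.sh](./hello_gpu.sh), and its core is simply:

```shell
#!/bin/bash
# Report where we landed, then probe the GPU.
echo "Hello CHTC from Job $1 running on `hostname`"
nvidia-smi
```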

### Output
The output should be similar to the following.
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

You can see a complete list of the expected output files in the [expected output directory](./expected_output/).

docker/hello_gpu/expected_output/docker_stderror

Whitespace-only changes.

docker/hello_gpu/expected_output/hello_gpu.err.txt

Whitespace-only changes.
docker/hello_gpu/expected_output/hello_gpu.log.txt

Lines changed: 32 additions & 0 deletions
000 (8949703.000.000) 2019-08-26 21:25:48 Job submitted from host: <128.105.244.191:9618?addrs=128.105.244.191-9618&alias=submit-1.chtc.wisc.edu&noUDP&sock=schedd_2058300_2d2a_12>
...
040 (8949703.000.000) 2019-08-26 21:27:36 Started transferring input files
	Transferring to host: <128.105.245.10:9618?addrs=128.105.245.10-9618&alias=gpu2000.chtc.wisc.edu&noUDP&sock=starter_32162_c9d5_18958>
...
040 (8949703.000.000) 2019-08-26 21:27:36 Finished transferring input files
...
001 (8949703.000.000) 2019-08-26 21:27:37 Job executing on host: <128.105.245.10:9618?addrs=128.105.245.10-9618&alias=gpu2000.chtc.wisc.edu&noUDP&sock=startd_32073_fd1f_3>
...
040 (8949703.000.000) 2019-08-26 21:27:38 Started transferring output files
...
040 (8949703.000.000) 2019-08-26 21:27:38 Finished transferring output files
...
005 (8949703.000.000) 2019-08-26 21:27:39 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	1678  -  Run Bytes Sent By Job
	142  -  Run Bytes Received By Job
	1678  -  Total Bytes Sent By Job
	142  -  Total Bytes Received By Job
	Partitionable Resources :    Usage  Request  Allocated  Assigned
	   Cpus                 :     0.01        1         20
	   Disk (KB)            :       25   512000  725632508
	   Gpus                 :                 1          1  "CUDA1"
	   Ioheavy              :                 0
	   Memory (MB)          :              1024     256194

	Job terminated of its own accord at 2019-08-27T02:27:38Z.
...

docker/hello_gpu/expected_output/hello_gpu.out.txt

Lines changed: 23 additions & 0 deletions
Hello CHTC from Job 0 running on dcosta2-8949703.0-gpu2000.chtc.wisc.edu

Trying to see if nvidia/cuda can access the GPU....
Tue Aug 27 02:27:38 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   26C    P0    25W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

docker/hello_gpu/hello_gpu.sh

Lines changed: 5 additions & 0 deletions

#!/bin/bash
echo "Hello CHTC from Job $1 running on `hostname`"
echo ""
echo "Trying to see if nvidia/cuda can access the GPU...."
nvidia-smi

docker/hello_gpu/hello_gpu.sub

Lines changed: 32 additions & 0 deletions

# hello_gpu.sub
# Submit file to access the GPU via Docker

# Must set the universe to Docker
universe = docker
docker_image = nvidia/cuda:10.1-base-ubuntu18.04

# set the log, error and output files
log = hello_gpu.log.txt
error = hello_gpu.err.txt
output = hello_gpu.out.txt

# set the executable to run
executable = hello_gpu.sh
arguments = $(Process)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

# We require a machine with a modern version of the CUDA driver
Requirements = (Target.CUDADriverVersion >= 10.1)

# We must request 1 CPU in addition to 1 GPU
request_cpus = 1
request_gpus = 1

# select some memory and disk space
request_memory = 1GB
request_disk = 500MB

# Tell HTCondor to run 1 instance of our job:
queue 1

(Binary file, 33.2 MB, not shown.)

docker/pytorch_python/README.md

Lines changed: 48 additions & 0 deletions

### Convolutional Neural Network with PyTorch (Python)
This example shows how to send training and test data to the compute node
along with the script. After processing, the trained network is returned to
the submit node.

### Submit file

Here the submit file stays the same as in the [Hello\_GPU](../hello_gpu/) example, with a few minor tweaks.

We set the Docker image to a version of PyTorch that is built with CUDA.
```
docker_image = pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-runtime
```

We also need the Python script and the data transferred to the compute node.
```
transfer_input_files = main.py, MNIST_data.tar.gz
```

The rest of the submit file remains the same. We submit the job with
```shell
condor_submit pytorch_cnn.sub
```

### Execute script
The [execute shell script](./pytorch_cnn.sh) extracts the data and then calls
a Python script, [main.py](./main.py), which learns the network weights and
saves them to disk. The execute script then deletes the data directory so that
it is not returned to the submit node.

```shell
tar zxf MNIST_data.tar.gz
python main.py --save-model --epochs 20
rm -r data
```

### Output
The trained CNN is returned to us as a file,
[mnist\_cnn.pt](./expected_output/mnist_cnn.pt). There are also some output
stats on the training and test error in the [output
files](./expected_output/pytorch_cnn.out.txt).
```
Test set: Average loss: 0.0278, Accuracy: 9909/10000 (99%)
```

You can see a complete list of the expected output files in the [expected
output directory](./expected_output/).
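
To use the returned weights on the submit node or your own machine, a loading
sketch like the following should work. It assumes `main.py` follows the
standard PyTorch MNIST example: it defines a `Net` module class and saved the
weights with `torch.save(model.state_dict(), "mnist_cnn.pt")`; adjust the
class name to match the actual script.

```python
# Sketch: load the trained weights returned by the job.
# Assumes main.py defines a Net class (as in the standard PyTorch MNIST
# example) and saved the model with torch.save(model.state_dict(), ...).
import torch

from main import Net  # hypothetical import; match the class in your main.py

model = Net()
model.load_state_dict(torch.load("mnist_cnn.pt", map_location="cpu"))
model.eval()  # inference mode: disables dropout and batch-norm updates
```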
