We want to make contributing to this project as easy and transparent as
possible.
## Pull Requests
We actively welcome your pull requests.
If you're new, we encourage you to take a look at issues tagged with [good first issue](https://github.com/pytorch/examples/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22).
### For new examples
0. Create a GitHub issue proposing a new example and make sure it's substantially different from an existing one.
1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests to `run_python_examples.sh`.
3. Create a `README.md`.
4. Add a card with a brief description of your example and link to the repo to
the `docs/source/index.rst` file and build the docs by running:
```
cd docs
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
make html
```
When done working with `virtualenv`, run `deactivate`.
5. Verify that there are no issues in your doc build. You can check the preview locally
by installing [sphinx-serve](https://pypi.org/project/sphinx-serve/) and
then running `sphinx-serve -b build`.
6. Ensure your test passes locally.
7. If you haven't already, complete the Contributor License Agreement ("CLA").
8. Address any feedback in code review promptly.
## For bug fixes
1. Fork the repo and create your branch from `main`.
2. Make sure you have a GPU-enabled machine, either locally or in the cloud. `g4dn.4xlarge` is a good starting point on AWS.
3. Make your code change.
4. First, install all dependencies with `./run_python_examples.sh "install_deps"`.
5. Then make sure that `./run_python_examples.sh` passes locally by running the script end to end.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
7. Address any feedback in code review promptly.
## Contributor License Agreement ("CLA")
To accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
## License
By contributing to examples, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

`README.md`

https://pytorch.org/examples/

`pytorch/examples` is a repository showcasing examples of using [PyTorch](https://github.com/pytorch/pytorch). The goal is to provide curated, short, _high-quality_ examples with few or no dependencies that are substantially different from each other and can be emulated in your existing work.
- For tutorials: https://github.com/pytorch/tutorials
- For changes to pytorch.org: https://github.com/pytorch/pytorch.github.io
- For a general model hub: https://pytorch.org/hub/ or https://huggingface.co/models
- For recipes on how to run PyTorch in production: https://github.com/facebookresearch/recipes
- For general Q&A and support: https://discuss.pytorch.org/

`cpp/custom-dataset/README.md`

where `/path/to/libtorch` should be the path to the unzipped LibTorch distribution, which you can get from the [PyTorch homepage](https://pytorch.org/get-started/locally/).

If you see an error like `undefined reference to cv::imread(std::string const&, int)` when running the `make` command, you should build LibTorch from source using the instructions [here](https://github.com/pytorch/pytorch#from-source), and then set `CMAKE_PREFIX_PATH` to that PyTorch source directory.
The build directory should look like this:
```
.
├── custom-dataset
└── ...
```
The `info.txt` file is copied from the source directory during the build.

---

We assume you are familiar with [PyTorch](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html), the primitives it provides for [writing distributed applications](https://pytorch.org/tutorials/intermediate/dist_tuto.html) as well as training [distributed models](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
The example program in this tutorial uses the
[`torch.nn.parallel.DistributedDataParallel`](https://pytorch.org/docs/stable/nn.html#distributeddataparallel) class for training models
application but each one operates on different portions of the
training dataset.
# Application process topologies
A Distributed Data Parallel (DDP) application can be executed on
multiple nodes where each node can consist of multiple GPU
devices. Each node in turn can run multiple copies of the DDP
computational costs. In the rest of this tutorial, we assume that the
application follows this heuristic.
# Preparing and launching a DDP application
Independent of how a DDP application is launched, each process needs a
mechanism to know its global and local ranks. Once this is known, all
processes create a `ProcessGroup` that enables them to participate in collective communication operations such as AllReduce.
When the DDP application is started via `launch.py`, it passes the world size, global rank, master address and master port via environment variables and the local rank as a command-line parameter to each instance.
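
As a minimal sketch (not the repo's exact code), a worker can pick these values up as follows; the environment variable names are the standard ones used by the PyTorch launcher, and the local rank arrives as a command-line flag:

```py
import argparse
import os

# Rendezvous information passed by launch.py through the environment.
# The .get() fallbacks are only for running this sketch standalone.
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
master_port = os.environ.get("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# The local rank is passed as a command-line parameter instead.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
local_rank = parser.parse_args().local_rank
```
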
To use the launcher, an application needs to adhere to the following convention:
1. It must provide an entry-point function for a _single worker_. For example, it should not launch subprocesses using `torch.multiprocessing.spawn`.
2. It must use environment variables for initializing the process group.
For simplicity, the application can assume that each process maps to a single GPU, but in the next section we also show how a more general process-to-GPU mapping can be performed.
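
As a rough sketch under that more general assumption, each process can derive its set of GPUs from its local rank and the local world size (illustrative code, not the repo's exact implementation):

```py
import torch

def gpus_for_process(local_rank: int, local_world_size: int) -> list:
    # Split the node's visible GPUs evenly across the local processes.
    n = torch.cuda.device_count() // local_world_size
    return list(range(local_rank * n, (local_rank + 1) * n))
```
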
# Sample application
The sample DDP application in this repo is based on the "Hello, World" [DDP tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
## Argument passing convention
The DDP application takes two command-line arguments:
1. `--local_rank`: This is passed in via `launch.py`.
2. `--local_world_size`: This is passed in explicitly and is typically either $1$ or the number of GPUs per node.
The application parses these and calls the `spmd_main` entrypoint:
```py
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # The two arguments described above (defaults shown here are illustrative)
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    args = parser.parse_args()
    spmd_main(args.local_world_size, args.local_rank)
```
In `spmd_main`, the process group is initialized with just the backend (NCCL or Gloo). The rest of the information needed for rendezvous comes from environment variables set by `launch.py`:
```py
def spmd_main(local_world_size, local_rank):
    # These are the parameters used to initialize the process group
    # (MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are read from the
    # environment variables set by launch.py). `dist` is `torch.distributed`,
    # the backend can be "nccl" or "gloo" as noted above, and the rest of
    # the function is elided in this excerpt.
    dist.init_process_group(backend="nccl")
```
Given the local rank and world size, the training function, `demo_basic`, initializes the `DistributedDataParallel` model across a set of GPUs local to the node via `device_ids`:
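
The body of `demo_basic` is not reproduced in this excerpt. A minimal sketch of what such a function can look like is shown below; the toy model, loss, optimizer, and random data are illustrative stand-ins rather than the repo's exact code:

```py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    # A stand-in model; the real example defines its own small network.
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(10, 5)

    def forward(self, x):
        return self.net(x)

def demo_basic(local_world_size, local_rank):
    # Map this process to its share of the node's GPUs.
    n = torch.cuda.device_count() // local_world_size
    device_ids = list(range(local_rank * n, (local_rank + 1) * n))

    model = ToyModel().cuda(device_ids[0])
    ddp_model = DDP(model, device_ids)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # One illustrative forward/backward/step iteration on random data.
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(device_ids[0]))
    labels = torch.randn(20, 5).to(device_ids[0])
    loss_fn(outputs, labels).backward()
    optimizer.step()
```
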
Running the application via `launch.py` produces an output similar to the one shown below:
```sh
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
```
As the author of a distributed data parallel application, your code needs to be aware of two types of resources: compute nodes and the GPUs within each node. The process of setting up bookkeeping to track how the set of GPUs is mapped to the processes of your application can be tedious and error-prone. We hope that by structuring your application as shown in this example and using the launcher, the mechanics of setting up distributed training can be significantly simplified.

`distributed/rpc/batch/README.md`

This folder contains two examples for [`@rpc.functions.async_execution`](https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.functions.async_execution):
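
The two examples themselves are not reproduced in this excerpt. As background, the decorator's contract is that the decorated function returns a `torch.futures.Future`, and the RPC response is sent once that future completes; a minimal, hypothetical illustration:

```py
import torch
from torch.distributed import rpc

@rpc.functions.async_execution
def add_async(x, y):
    # Return a Future right away so the RPC thread is not blocked;
    # the caller receives its reply once the future is completed.
    fut = torch.futures.Future()
    # In real code the result is typically set later, e.g. from a callback;
    # setting it immediately keeps this sketch self-contained.
    fut.set_result(x + y)
    return fut
```
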

`distributed/rpc/parameter_server/README.md`

### RPC-based distributed training
This is a basic example of RPC-based training that uses several trainers to remotely train a model hosted on a server.
To run the example locally, run the following command for the server and for each worker you wish to spawn, in separate terminal windows:
`python rpc_parameter_server.py --world_size=WORLD_SIZE --rank=RANK`. For example, for a master node with a world size of 2, the command would be `python rpc_parameter_server.py --world_size=2 --rank=0`. The trainer can then be launched with the command `python rpc_parameter_server.py --world_size=2 --rank=1` in a separate window, and this will begin training with one server and a single trainer.
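
For orientation only, the rank-based split between the server and the trainers can be sketched as follows; the worker names and structure are illustrative rather than the example's exact code, and `MASTER_ADDR`/`MASTER_PORT` must be set in the environment before `init_rpc` is called:

```py
import torch.distributed.rpc as rpc

def run_worker(rank: int, world_size: int):
    # Rank 0 hosts the model (the parameter server); all other ranks are
    # trainers that drive training through RPC calls to the server.
    name = "parameter_server" if rank == 0 else f"trainer_{rank}"
    rpc.init_rpc(name, rank=rank, world_size=world_size)
    # ... training logic elided ...
    rpc.shutdown()  # blocks until every worker has finished its RPCs
```
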