Commit 6bd444a

Fix HelloDeepSpeed for multi-GPU runs (deepspeedai#170)
* fixed citation
* cleaned up distributed logging
* fixed potential race condition with writing to file
* added note about running on multiple GPUs
* added formatted non-ds example
1 parent 4f064fb commit 6bd444a

File tree

3 files changed: +346 -296 lines

HelloDeepSpeed/README.md

+13 -8
@@ -75,7 +75,7 @@ def test_masking_stats(tol: float = 1e-3):
 
 The main idea behind the MLM task is to get the model to fill in the blanks based on contextual clues present **both before and after** the blank. Consider, for example, the following sentence:
 
-> In the beautiful season of ____ the ____ shed their leaves.
+> In the beautiful season of ____ the ____ shed their leaves.
 
 Given the left context `season` and the right context `shed their leaves`, one can guess that the blanks are `Autumn` and `trees` respectively. This is exactly what we want the model to do: utilize both the left and right context to fill in the blanks.
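
To make the fill-in-the-blank objective concrete, here is a minimal, hypothetical sketch of MLM-style masking. It is not the tutorial's actual masking code (the function name, the `[MASK]` string and the 15% rate are assumptions); it only illustrates how training targets are produced from a plain sentence:

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Hypothetical sketch: randomly hide a fraction of the tokens.

    The model is trained to recover the original token at each masked
    position, using both the left and the right context around it."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)  # blank the token out
            labels.append(tok)         # target the model must predict
        else:
            masked.append(tok)
            labels.append(None)        # position is not scored
    return masked, labels

sentence = "In the beautiful season of Autumn the trees shed their leaves"
print(mask_tokens(sentence.split()))
```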

@@ -99,7 +99,7 @@ A Transformer model repeatedly applies a (Multi-Headed) Self-Attention block and
 3. The number of Self Attention Heads
 4. The size of the intermediate representation between the FeedForward block
 
-Check out the `create_model` function in [train_bert.py](./train_bert.py) to see how this is done in code. In this example, we create a Roberta model[3](#3)
+Check out the `create_model` function in [train_bert.py](./train_bert.py) to see how this is done in code. In this example, we create a Roberta model [[3](#3)]
 
 ---
 📌 **Note:** You can check out [[1](#1), [2](#2)] as a starting point for better understanding Transformers. Additionally, there are a number of blogs that do a nice deep dive into the workings of these models (eg: [this](https://nlp.seas.harvard.edu/2018/04/03/attention.html), [this](https://jalammar.github.io/illustrated-bert/) and [this](https://jalammar.github.io/illustrated-transformer/)).
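
For readers who want a concrete picture of what a `create_model`-style helper does, the sketch below builds a small Roberta MLM model with Hugging Face `transformers` from the four hyperparameters listed in the hunk above. This is not the exact code in train_bert.py; the helper name and the default values are placeholders:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

def create_toy_roberta(num_layers: int = 6,
                       h_dim: int = 512,
                       num_heads: int = 8,
                       intermediate_dim: int = 2048,
                       vocab_size: int = 50265) -> RobertaForMaskedLM:
    """Hypothetical sketch: map the four knobs (depth, hidden size,
    attention heads, FFN width) onto a RobertaConfig."""
    config = RobertaConfig(
        vocab_size=vocab_size,
        num_hidden_layers=num_layers,
        hidden_size=h_dim,
        num_attention_heads=num_heads,
        intermediate_size=intermediate_dim,
    )
    return RobertaForMaskedLM(config)

model = create_toy_roberta()
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```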
@@ -108,7 +108,7 @@ Check out the `create_model` function in [train_bert.py](./train_bert.py) to see
 
 ### 1.3 Training the Model
 
-In order to train the model, you can run the following command
+In order to train the model, you can run the following command
 
 ```bash
 python train_bert.py --checkpoint_dir ./experiments
@@ -171,8 +171,8 @@ ds_config = {
         }
     },
 }
-model, _, _, _ = deepspeed.initialize(model=model,
-                                      model_parameters=model.parameters(),
+model, _, _, _ = deepspeed.initialize(model=model,
+                                      model_parameters=model.parameters(),
                                       config=ds_config)
 ```
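
For context on how the engine returned by `deepspeed.initialize` is then used, here is a hedged, self-contained sketch of a single training step. The tiny `torch.nn.Linear` stand-in model and the config values are assumptions rather than the tutorial's settings, and the script is meant to be run with the `deepspeed` launcher so the distributed environment variables are already set:

```python
import deepspeed
import torch
import torch.nn.functional as F

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

net = torch.nn.Linear(10, 2)  # stand-in for the Roberta model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config=ds_config,
)

# One step: the engine owns backward() and step(), so user code has no
# explicit optimizer.zero_grad() / optimizer.step() pair.
x = torch.randn(8, 10).to(model_engine.device)
y = torch.randint(0, 2, (8,)).to(model_engine.device)
loss = F.cross_entropy(model_engine(x), y)
model_engine.backward(loss)
model_engine.step()
```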

@@ -208,6 +208,11 @@ _, client_state = model.load_checkpoint(load_dir=load_checkpoint_dir)
 checkpoint_step = client_state['checkpoint_step']
 ```
 
+---
+📌 **Note:** You may also want/need to make additional changes to your code if you run on multiple GPUs as DeepSpeed will launch multiple processes. You will want to avoid potential race conditions with creating directories or writing to file and restrict logging to a single process. Take a look at `train_bert_ds.py` for an example of how to do this.
+
+---
+
 ## 2.1 Launching Training
 
 We are now ready to launch our training! As a convenience, DeepSpeed provides its own launcher that is seamlessly compatible with clusters that provide a `/job/hostfile` containing all available machines in your job. You can now try running your model on your available GPU(s) with the command below. By default this will attempt to run distributed data-parallel (DDP) training across all available GPUs on the current machine + any external machines listed in your `/job/hostfile`. Please read [more details about the DeepSpeed launcher](https://www.deepspeed.ai/getting-started/#launching-deepspeed-training) and its assumptions on our website.
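
Since the note added above is the heart of this commit, here is a hedged sketch of the rank-0 guard pattern it describes. It is not the exact code in `train_bert_ds.py` (the helper names are made up), but the pattern, where one process creates directories and emits logs while the others wait at a barrier, is the standard way to avoid those races:

```python
import logging
import os

import torch.distributed as dist

def is_rank_0() -> bool:
    # Before torch.distributed is initialized (e.g. a single-GPU run),
    # treat the lone process as rank 0.
    return (not dist.is_initialized()) or dist.get_rank() == 0

def setup_experiment_dir(path: str) -> str:
    # Only rank 0 creates the directory; the other processes wait at the
    # barrier so none of them writes into a directory that does not exist yet.
    if is_rank_0():
        os.makedirs(path, exist_ok=True)
    if dist.is_initialized():
        dist.barrier()
    return path

# Keep chatty INFO logging to a single process per job.
logger = logging.getLogger("train_bert_ds")
logger.setLevel(logging.INFO if is_rank_0() else logging.WARNING)
```

With a guard like this, directories are created once per job rather than once per GPU, and the training log is not duplicated by every worker.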
@@ -280,8 +285,8 @@ deepspeed train_bert.py --checkpoint_dir . --num_layers 24 --h_dim 4096
 ---
 
 ## References
-> <a id="1">[1]</a>
-[Vaswani et al. Attention is all you need.
+> <a id="1">[1]</a>
+[Vaswani et al. Attention is all you need.
 In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17)](https://arxiv.org/pdf/1706.03762.pdf)
 >
 > <a id="2">[2]</a>
@@ -296,5 +301,5 @@ In Proceedings of the 31st International Conference on Neural Information Proces
 > <a id="5">[5]</a>
 [J. Ren, S. Rajbhandari, R. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, Y. He. ZeRO-Offload: Democratizing Billion-Scale Model Training. (ATC'21)](https://www.usenix.org/system/files/atc21-ren-jie.pdf)
 >
-> <a id="1">[6]</a>
+> <a id="1">[6]</a>
 [S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, Y. He. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (SC'21)](https://arxiv.org/abs/2104.07857)
