The main idea behind the MLM task is to get the model to fill in the blanks based on contextual clues present **both before and after** the blank. Consider, for example, the following sentence:
> In the beautiful season of ____ the ____ shed their leaves.
Given the left context `season` and the right context `shed their leaves`, one can guess that the blanks are `Autumn` and `trees` respectively. This is exactly what we want the model to do: utilize both the left and right context to fill in the blanks.
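To make this concrete, here is a minimal sketch of the standard BERT-style masking scheme in PyTorch. It is an illustration only (the helper name `mask_tokens` is hypothetical, and this is not the exact masking code used in `train_bert.py`): roughly 15% of tokens are selected for prediction, most of those are replaced by the `[MASK]` token, a few by a random token, and the model is trained to recover the original tokens at those positions.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int, mlm_prob: float = 0.15):
    """BERT-style masking sketch: returns (corrupted_input_ids, labels). Modifies input_ids in place."""
    labels = input_ids.clone()

    # Choose ~15% of positions for the model to predict.
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # positions we do not predict are ignored by the loss

    # 80% of the chosen positions become [MASK].
    use_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[use_mask] = mask_token_id

    # Half of the remainder (i.e. ~10% overall) become a random token; the final ~10% are left unchanged.
    use_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~use_mask
    input_ids[use_random] = torch.randint(vocab_size, input_ids.shape)[use_random]

    return input_ids, labels
```

Positions labelled `-100` are ignored by PyTorch's cross-entropy loss, so the model is only penalised on the blanks it was asked to fill in.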
A Transformer model repeatedly applies a (Multi-Headed) Self-Attention block and a FeedForward block. Creating one therefore comes down to choosing a handful of hyperparameters, among them:

3. The number of Self-Attention heads
4. The size of the intermediate representation of the FeedForward block
Check out the `create_model` function in [train_bert.py](./train_bert.py) to see how this is done in code. In this example, we create a Roberta model [[3](#3)].
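As a rough sketch of what this amounts to with the Hugging Face `transformers` library (the hyperparameter values below are placeholders, not the ones `create_model` actually uses):

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Placeholder hyperparameters -- see create_model in train_bert.py for the real ones.
config = RobertaConfig(
    vocab_size=10_000,       # size of the tokenizer vocabulary
    num_hidden_layers=6,     # number of Transformer blocks
    hidden_size=512,         # size of the hidden representation
    num_attention_heads=8,   # number of Self-Attention heads
    intermediate_size=2048,  # size of the FeedForward intermediate representation
)
model = RobertaForMaskedLM(config)  # randomly initialized, ready for MLM pre-training
```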
---
📌 **Note:** You can check out [[1](#1), [2](#2)] as a starting point for better understanding Transformers. Additionally, a number of blogs do a nice deep dive into the workings of these models (e.g., [this](https://nlp.seas.harvard.edu/2018/04/03/attention.html), [this](https://jalammar.github.io/illustrated-bert/) and [this](https://jalammar.github.io/illustrated-transformer/)).
### 1.3 Training the Model
To train the model, you can run the command below.
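The exact flags are defined by the argument parser in [train_bert.py](./train_bert.py); the invocation sketched here is only illustrative (the `--checkpoint_dir` argument is an assumption, not necessarily the script's real interface):

```bash
# Illustrative baseline run; check train_bert.py's argument parser
# for the actual flags and their defaults.
python train_bert.py --checkpoint_dir ./experiments
```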
📌 **Note:** You may also want/need to make additional changes to your code if you run on multiple GPUs, since DeepSpeed will launch multiple processes. You will want to avoid potential race conditions when creating directories or writing to files, and restrict logging to a single process. Take a look at `train_bert_ds.py` for an example of how to do this.
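One common pattern, shown here as a minimal sketch (assuming `torch.distributed` has been initialized, which the DeepSpeed launcher takes care of; the directory name is illustrative), is to guard filesystem and logging calls behind a rank check:

```python
import logging
import os

import torch.distributed as dist

def is_rank_0() -> bool:
    # With a single process, torch.distributed may not be initialized at all.
    return (not dist.is_initialized()) or dist.get_rank() == 0

# Only rank 0 emits INFO-level logs; other ranks stay quiet unless something goes wrong.
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO if is_rank_0() else logging.WARNING)

# Only rank 0 creates the output directory; the barrier makes the other
# ranks wait until it exists before they try to use it.
if is_rank_0():
    os.makedirs("experiments", exist_ok=True)  # illustrative path
if dist.is_initialized():
    dist.barrier()
```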
---
### 2.1 Launching Training
We are now ready to launch our training! As a convenience, DeepSpeed provides its own launcher that is seamlessly compatible with clusters that provide a `/job/hostfile` containing all available machines in your job. You can now try running your model on your available GPU(s) with the command below. By default, this will attempt to run distributed data-parallel (DDP) training across all available GPUs on the current machine and any external machines listed in your `/job/hostfile`. Please read [more details about the DeepSpeed launcher](https://www.deepspeed.ai/getting-started/#launching-deepspeed-training) and its assumptions on our website.
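For example, a launch on the current machine could look like the following (the script arguments are illustrative; see `train_bert_ds.py` for the real ones):

```bash
# Use all GPUs visible on this machine (plus any machines in /job/hostfile, if present).
deepspeed train_bert_ds.py --checkpoint_dir ./experiments

# Or restrict the run to a specific number of GPUs on the current machine.
deepspeed --num_gpus=2 train_bert_ds.py --checkpoint_dir ./experiments
```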
> <a id="1">[1]</a>
[A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17)](https://arxiv.org/pdf/1706.03762.pdf)
>
> <aid="2">[2]</a>
> <aid="5">[5]</a>
297
302
[J. Ren, S. Rajbhandari, R. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, Y. He. ZeRO-Offload: Democratizing Billion-Scale Model Training. (ATC'21)](https://www.usenix.org/system/files/atc21-ren-jie.pdf)
>
> <a id="6">[6]</a>
[S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, Y. He. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning (SC'21)](https://arxiv.org/abs/2104.07857)