@@ -152,6 +152,14 @@ assert torch.testing.assert_close(y, model(x))

### Speed up LLM training

+ Install LitGPT (without updating other dependencies)
+
+ ```
+ pip install --no-deps 'litgpt[all]'
+ ```
+
+ and run
+
```python
import thunder
import torch
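
# --- Editor's sketch, not part of the diff: the hunk truncates here. Going by
# the `out.sum().backward()` context shown in the next hunk header, the example
# plausibly continues along these lines; the model name and the input shape are
# assumptions, not taken from this commit.
import litgpt

with torch.device("cuda"):
    model = litgpt.GPT.from_name("Llama-3.2-1B")  # hypothetical model choice

thunder_model = thunder.compile(model)

inp = torch.ones((1, 2048), device="cuda", dtype=torch.int64)  # dummy token ids
out = thunder_model(inp)
out.sum().backward()  # matches the context line in the next hunk header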
@@ -170,6 +178,14 @@ out.sum().backward()

### Speed up HuggingFace BERT inference

+ Install Hugging Face Transformers (version `4.50.2` or above is recommended)
+
+ ```
+ pip install -U transformers
+ ```
+
+ and run
+
```python
import thunder
import torch
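
# --- Editor's sketch, not part of the diff: the model setup between this hunk
# and the next is elided. Assuming the usual transformers loading API (the
# checkpoint name below is a placeholder), it would look roughly like:
import transformers

model_name = "bert-large-uncased"  # placeholder checkpoint
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

with torch.device("cuda"):
    model = transformers.AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model.requires_grad_(False)  # inference only
    model.eval()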
@@ -188,14 +204,22 @@ with torch.device("cuda"):

inp = tokenizer(["Hello world!"], return_tensors="pt")

- thunder_model = thunder.compile(model, plugins="reduce-overhead")
+ thunder_model = thunder.compile(model)

out = thunder_model(**inp)
print(out)
```

### Speed up HuggingFace DeepSeek R1 distill inference

+ Install Hugging Face Transformers (version `4.50.2` or above is recommended)
+
+ ```
+ pip install -U transformers
+ ```
+
+ and run
+
```python
import torch
import transformers
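
# --- Editor's sketch, not part of the diff: the checkpoint setup between this
# hunk and the next is elided. A hedged guess, with a placeholder R1 distill
# checkpoint name (and the `import thunder` the later compile call implies):
import thunder

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # placeholder
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

with torch.device("cuda"):
    model = transformers.AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()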
@@ -214,9 +238,7 @@ with torch.device("cuda"):

inp = tokenizer(["Hello world! Here's a long story"], return_tensors="pt")

- thunder_model = thunder.compile(
-     model, recipe="hf-transformers", plugins="reduce-overhead"
- )
+ thunder_model = thunder.compile(model)

out = thunder_model.generate(
    **inp, do_sample=False, cache_implementation="static", max_new_tokens=100
@@ -240,7 +262,7 @@ with torch.device("cuda"):

out = model(inp)

- thunder_model = thunder.compile(model, plugins="reduce-overhead")
+ thunder_model = thunder.compile(model)

out = thunder_model(inp)
```
@@ -257,6 +279,16 @@ Thunder comes with a few plugins included out of the box, but it's easy to write new
- reduce latency with CUDAGraphs
- debugging and profiling

+ For example, to reduce CPU overheads via CUDAGraphs you can add "reduce-overhead"
+ to the `plugins=` argument of `thunder.compile`:
+
+ ```python
+ thunder_model = thunder.compile(model, plugins="reduce-overhead")
+ ```
+
+ This may or may not make a big difference. The point of Thunder is that you can easily
+ swap optimizations in and out and explore which combination works best for your setup.
+
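
A hedged illustration of that swapping (whether `plugins=` accepts a list, and the
existence of an `fp8` plugin, are assumptions here, not claims from this commit):

```python
# CUDAGraphs alone...
thunder_model = thunder.compile(model, plugins="reduce-overhead")

# ...or combined with another optimization (list support and the "fp8"
# plugin name are assumptions; check the Thunder docs)
thunder_model = thunder.compile(model, plugins=["reduce-overhead", "fp8"])
```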

## How it works

Thunder works in three stages: