
[torchax] Fix Llama 3.1 405B host memory space OOM #38

Merged 1 commit into main on Jan 15, 2025

Conversation

tengyifei (Collaborator)

This fixes #28. Currently, each graph uses >128 GiB of host RAM per TPU chip, which is not supported. The OOMing host array is `bf16[126, 2, 8192, 16384]`.

Based on the shape and https://pytorch.org/blog/high-performance-llama-2, I made an informed guess and annotated the decoder input with sharding constraints. That got rid of the OOM, and we now calculate an MFU of 28.65%.
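
For context, here is a minimal sketch of the kind of sharding-constraint annotation described above, written directly against JAX (which torchax executes on). The mesh layout, the axis names ("fsdp", "tensor"), and the partition spec are assumptions for illustration, not the exact annotation added in this PR:

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical device mesh: a single "fsdp" axis spanning all chips plus a
# trivial "tensor" axis. The real change may use a different mesh shape.
mesh = Mesh(np.array(jax.devices()).reshape(-1, 1), ("fsdp", "tensor"))

def decoder_layer(x):
    # Constrain the layout of the (batch, seq, hidden) activation so XLA keeps
    # it sharded across chips instead of materializing it fully replicated.
    # (Only takes effect inside a jit-compiled function.)
    x = jax.lax.with_sharding_constraint(
        x, NamedSharding(mesh, P("fsdp", None, "tensor")))
    # ... the actual decoder-layer computation would follow here ...
    return x
```

Per the description above, adding constraints of this kind on the decoder input was enough to eliminate the host-side OOM.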

Training output snippet:

```
program size: 0.316734 m chars
End compiling 41.968193726148456
Compile time: 41.968193726148456
Flops 447646288314368.0
GB accessed 433.546264576
0 loss 6272 bfloat16 step latency: 132.53140141186304
======
INPUT shape torch.Size([256, 8192])
1 loss 6272 bfloat16 step latency: 41.24516941001639
======
INPUT shape torch.Size([256, 8192])
2 loss 6272 bfloat16 step latency: 40.27528425701894
======
INPUT shape torch.Size([256, 8192])
3 loss 6272 bfloat16 step latency: 40.21413461607881
======
INPUT shape torch.Size([256, 8192])
4 loss 6272 bfloat16 step latency: 40.22760192491114
======
INPUT shape torch.Size([256, 8192])
5 loss 6272 bfloat16 step latency: 41.25447623594664
======
INPUT shape torch.Size([256, 8192])
6 loss 6272 bfloat16 step latency: 42.87271257000975
======
```

tengyifei requested a review from qihqi on January 15, 2025.
qihqi merged commit 94aafb4 into main on January 15, 2025.
6 checks passed.
tengyifei deleted the yifeit/torchax-405b-oom branch on January 26, 2025.
Successfully merging this pull request may close: Fix the torchax 405B host memory space OOM (#28)