chunked cross entropy loss to reduce peak memory #996

xuanzhang816 · 2025-03-20T18:29:02Z

Noticed that in the memory snapshot when running the Llama3-8B models, the cross entropy loss is causing huge spikes in peak memory:

Replace the cross entropy loss function with a chunked version can help reduce peak memory by 7GiB without affecting training throughput. This is inspired by the torchtune implementation.

Baseline run

...
[rank0]:[titan] 2025-03-20 08:19:11,539 - root - INFO - step: 30  loss:  7.7158  memory: 49.58GiB(52.20%)  tps: 5,966  tflops: 345.53  mfu: 34.94%
[rank0]:[titan] 2025-03-20 08:19:25,291 - root - INFO - step: 40  loss:  7.3677  memory: 49.58GiB(52.20%)  tps: 5,957  tflops: 345.01  mfu: 34.88%
[rank0]:[titan] 2025-03-20 08:19:39,040 - root - INFO - step: 50  loss:  7.0890  memory: 49.58GiB(52.20%)  tps: 5,959  tflops: 345.10  mfu: 34.89%
[rank0]:[titan] 2025-03-20 08:19:52,794 - root - INFO - step: 60  loss:  6.8336  memory: 49.58GiB(52.20%)  tps: 5,957  tflops: 344.99  mfu: 34.88%
...

With the changes

...
[rank0]:[titan] 2025-03-20 08:29:30,780 - root - INFO - step: 30  loss:  7.7423  memory: 42.26GiB(44.49%)  tps: 5,957  tflops: 345.01  mfu: 34.89%
[rank0]:[titan] 2025-03-20 08:29:44,554 - root - INFO - step: 40  loss:  7.3368  memory: 42.26GiB(44.49%)  tps: 5,948  tflops: 344.49  mfu: 34.83%
[rank0]:[titan] 2025-03-20 08:29:58,318 - root - INFO - step: 50  loss:  7.0574  memory: 42.26GiB(44.49%)  tps: 5,953  tflops: 344.74  mfu: 34.86%
[rank0]:[titan] 2025-03-20 08:30:12,080 - root - INFO - step: 60  loss:  6.8472  memory: 42.26GiB(44.49%)  tps: 5,953  tflops: 344.79  mfu: 34.86%
...

cc: @lessw2020 @weifengpy @wconstab @yf225

tianyu-l

Makes sense to me. Could you verify the following?

verify you get the same loss curves after fixing training.seed and setting training.deterministic=True before vs. after.
turn on TP (by default with Loss Parallel) to see if the results are still correct. For this one you need to fix the DP degree, e.g. 1st job DP2, 2nd job DP2 TP4.

awgu · 2025-03-21T14:43:11Z

Chunked cross entropy loss is not as important when you have larger TP and/or are using PP with smaller microbatch size. I wonder if it makes sense to make this configurable.

tianyu-l · 2025-03-21T20:19:49Z

Chunked cross entropy loss is not as important when you have larger TP and/or are using PP with smaller microbatch size. I wonder if it makes sense to make this configurable.

I think this is the right thing to do. Let's make it configurable, necessitating switching loss_fn to build_loss_fn in TrainSpec and add a str config for the loss. If it sounds non-trivial, I can do it in a follow-up PR -- in this PR let's add this loss, but not making it the default.

WDYT?

chunked cross entropy

f99bfc0

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 20, 2025

xuanzhang816 requested review from wconstab and weifengpy March 20, 2025 18:33

tianyu-l reviewed Mar 20, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chunked cross entropy loss to reduce peak memory #996

chunked cross entropy loss to reduce peak memory #996

xuanzhang816 commented Mar 20, 2025 •

edited

Loading

tianyu-l left a comment

awgu commented Mar 21, 2025

tianyu-l commented Mar 21, 2025

chunked cross entropy loss to reduce peak memory #996

Are you sure you want to change the base?

chunked cross entropy loss to reduce peak memory #996

Conversation

xuanzhang816 commented Mar 20, 2025 • edited Loading

tianyu-l left a comment

Choose a reason for hiding this comment

awgu commented Mar 21, 2025

tianyu-l commented Mar 21, 2025

xuanzhang816 commented Mar 20, 2025 •

edited

Loading