
Introducing experimental gradient accumulation API #8584

Open
rpsilva-aws wants to merge 3 commits into master from rpsilva_grad_acc_v2
Conversation

rpsilva-aws
Contributor

@rpsilva-aws rpsilva-aws commented Jan 16, 2025

In this PR, we introduce `experimental.gradient_accumulation`, which leverages XLA's `While` op to accumulate gradients.
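To make the baseline concrete, here is a minimal, framework-free sketch of the "traditional" gradient accumulation that the first training loop below uses: gradients from N micro-batches are summed (each scaled by 1/N) and the optimizer steps once. The function names (`grad`, `accumulated_step`) are illustrative, not part of the PR's API; with a mean loss and equal-sized micro-batches, the result matches a single full-batch step.

```python
def grad(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over the batch
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_step(w, micro_batches, lr=0.1):
    # Sum scaled micro-batch gradients, then apply one optimizer step.
    n = len(micro_batches)
    acc = 0.0
    for xs, ys in micro_batches:
        acc += grad(w, xs, ys) / n  # scale so the sum equals the full-batch mean
    return w - lr * acc  # single update after all micro-batches

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0 = 0.0

full = w0 - 0.1 * grad(w0, xs, ys)
micro = accumulated_step(w0, [(xs[:2], ys[:2]), (xs[2:], ys[2:])])
```

The equivalence of `full` and `micro` is why the two loss traces in the logs below are expected to match step for step.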

Training loop with traditional gradient accumulation
===> Preparing data..
Epoch 0 step 8 loss 1.1098170280456543
Epoch 0 step 16 loss 1.1719611883163452
Epoch 0 step 24 loss 3.453134536743164
Epoch 0 step 32 loss 2.518792152404785
Epoch 0 step 40 loss 6.67546272277832
Epoch 0 step 48 loss 4.609560012817383
Epoch 0 step 56 loss 5.953202247619629
Epoch 0 step 64 loss 1.325960636138916
Training loop with XLA's `While` gradient accumulation
===> Preparing data..
Epoch 0 step 8 loss 1.1098170280456543
Epoch 0 step 16 loss 1.1719611883163452
Epoch 0 step 24 loss 3.453134536743164
Epoch 0 step 32 loss 2.518792152404785
Epoch 0 step 40 loss 6.67546272277832
Epoch 0 step 48 loss 4.609560012817383
Epoch 0 step 56 loss 5.953202247619629
Epoch 0 step 64 loss 1.325960636138916
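The While-op formulation can be sketched the same way: the loop state carries a step counter and the accumulated gradient, the body consumes one micro-batch per iteration, and the loop runs until all micro-batches are used. The `while_loop` helper below is a plain-Python stand-in for XLA's `While`, and all names are illustrative, not the PR's actual API.

```python
def while_loop(cond, body, state):
    # Structural analogue of XLA's While: apply body while cond holds.
    while cond(*state):
        state = body(*state)
    return state

def grad_fn(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over the batch
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulate(w, micro_batches):
    # Carried state: (step index, accumulated gradient).
    n = len(micro_batches)

    def cond(step, acc):
        return step < n

    def body(step, acc):
        xs, ys = micro_batches[step]
        return step + 1, acc + grad_fn(w, xs, ys) / n

    _, acc = while_loop(cond, body, (0, 0.0))
    return acc
```

Expressing the accumulation as a single While op (rather than an unrolled Python loop) keeps the compiled XLA graph size independent of the number of micro-batches, which is the main motivation for this API.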

@rpsilva-aws rpsilva-aws marked this pull request as ready for review January 16, 2025 19:28
@rpsilva-aws
Contributor Author

@jeffhataws @tengyifei

@tengyifei
Collaborator

@rpsilva-aws do you plan on merging this into r2.6?

@rpsilva-aws
Contributor Author

@tengyifei Ideally, yes. It's perfectly fine for the 3-layer MLP, but we're seeing a small difference for the Llama runs (relative to a previous local patch set from just before some code cleanup), so we're quickly tracking down the cause.

@tengyifei
Collaborator

Okay, please aim to sort out all critical issues by Jan 21 if you're targeting 2.6, so that we can review and cherry-pick it by Jan 22. The 2.6 release is quickly drawing near, and I would like a few days to test all the builds.

@rpsilva-aws rpsilva-aws force-pushed the rpsilva_grad_acc_v2 branch 3 times, most recently from 08831d6 to 567ccb5 Compare January 21, 2025 23:25
@rpsilva-aws rpsilva-aws force-pushed the rpsilva_grad_acc_v2 branch 2 times, most recently from 4589eb2 to dfbef15 Compare January 22, 2025 01:10
4 participants