[Flux] Enable checkpointing #1195


Merged

merged 6 commits into main from flux-ci-2 on May 15, 2025

Conversation

wwwjn (Contributor) commented May 14, 2025

Context:

  1. Change flux-dev / flux-schnell model training to ~30,000 steps, based on current MAST training results
  2. Enable checkpointing. We enabled `reshard_after_forward` on `final_layer` to solve the issue described in #1167
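For background, FSDP2 sets resharding behavior per module at the point where sharding is applied, so the final-layer change likely resembles the fragment below. This is an illustrative configuration sketch, not the actual parallelize_flux.py code: the module names (`double_blocks`, `final_layer`) and `dp_mesh` are assumptions, and running it requires an initialized distributed process group.

```python
# Illustrative FSDP2 fragment (assumed names, not the real parallelize_flux.py):
# reshard the final layer's parameters after its forward pass so the
# unsharded weights are freed before backward, matching the other blocks.
from torch.distributed.fsdp import fully_shard  # torch >= 2.6; earlier versions expose it elsewhere

for block in model.double_blocks:  # hypothetical transformer blocks
    fully_shard(block, mesh=dp_mesh)
fully_shard(model.final_layer, mesh=dp_mesh, reshard_after_forward=True)
fully_shard(model, mesh=dp_mesh)  # root wrap
```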

Test

If we run the following two runs, the training loss curves should be identical with `deterministic = True`:

  1. Without checkpoint save and load, total steps = 10
  2. Save a checkpoint at step 5, then load it at step 5 and continue training

Currently, issue #1194 makes the training loss not strictly identical. To exclude the influence of #1194, we reset the seeds (by calling `set_deterministic()`) at the beginning of step 6. With that, checkpoint save/load produces an identical training loss.
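The two-run check can be sketched in a single process with plain PyTorch (no FSDP or torchtitan). Here `set_deterministic` is a hypothetical stand-in that only reseeds torch's RNG; both runs reseed at step 6, which is what excludes the un-checkpointed RNG state (#1194) from the comparison:

```python
import copy
import torch

def set_deterministic(seed: int) -> None:
    # Stand-in for torchtitan's set_deterministic(): reset RNG state
    # so both runs draw identical data/noise from this point on.
    torch.manual_seed(seed)

def new_trainer():
    # Seed before construction so both runs get identical initial weights.
    set_deterministic(0)
    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    return model, opt

def run_steps(model, opt, start: int, end: int) -> list[float]:
    losses = []
    for step in range(start, end + 1):
        if step == 6:
            # Reseed at the beginning of step 6 in *both* runs, so RNG
            # state lost across checkpoint load cannot cause divergence.
            set_deterministic(1)
        x = torch.randn(8, 4)
        loss = (model(x) - 1.0).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

# Run 1: 10 steps straight through, no checkpointing.
model, opt = new_trainer()
run1 = run_steps(model, opt, 1, 10)

# Run 2: 5 steps, "save" a checkpoint, restart, "load", continue from step 6.
model, opt = new_trainer()
run2 = run_steps(model, opt, 1, 5)
ckpt = {"model": copy.deepcopy(model.state_dict()),
        "opt": copy.deepcopy(opt.state_dict())}

model, opt = new_trainer()              # simulate a fresh process
model.load_state_dict(ckpt["model"])
opt.load_state_dict(ckpt["opt"])
run2 += run_steps(model, opt, 6, 10)

assert run1 == run2  # loss curves match step for step
```

Since all ops run deterministically on CPU with identical inputs, the per-step losses match exactly, mirroring the save/load-at-step-5 check described above.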

![Screenshot 2025-05-14 at 2 06 23 PM](https://github.com/user-attachments/assets/22882b71-378c-44fa-bd48-8a8f238aa1b0)

@wwwjn wwwjn requested a review from tianyu-l May 14, 2025 21:06
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 14, 2025
@wwwjn wwwjn requested a review from fegin May 14, 2025 21:06
fegin (Contributor) left a comment:

I suggest that you split the "fix" of parallelize_flux.py into another PR.

tianyu-l (Contributor) left a comment:

LGTM. Please address comments before merge.

wwwjn (Contributor, Author) commented May 15, 2025

> I suggest that you split the "fix" of parallelize_flux.py into another PR.

Thank you @fegin for the reminder; I created a separate PR for this. I will split later changes into separate PRs as well.

@wwwjn wwwjn merged commit 0104e39 into main May 15, 2025
6 checks passed
@tianyu-l tianyu-l deleted the flux-ci-2 branch May 15, 2025 23:42
wwwjn added a commit that referenced this pull request May 16, 2025