Add debugging instructions for "Reproducibility between runs" #1363

wwwjn · 2025-07-02T00:52:34Z

As titled. This should help anyone who hits the same issue I did understand why we need a "seed checkpoint".

docs/debugging.md

tianyu-l · 2025-07-02T04:25:54Z

docs/debugging.md

+- **Data Parallel (DP/FSDP):** All ranks use the same seed for model initialization
+- **Tensor Parallel (TP):** All TP ranks use the same seed for consistent weight sharding
+- **Pipeline Parallel (PP):** Each PP stage gets a different seed to ensure different dropout patterns
+- **Sequence Parallel:** Maintains consistent seeding across sequence-sharded dimensions


What's this? CP?
If so, for parameters, the dp_shard mesh and the cp mesh together form the FSDP sharding group. So with dp_shard=2, cp=2, the parameter sharding is the same as with dp_shard=4, cp=1.
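
To illustrate the point about the sharding group, here is a minimal sketch in plain Python, not the actual FSDP implementation. It assumes the sharding group size is `dp_shard * cp` and that a flat parameter is split into equal contiguous chunks. The helper name `fsdp_shard_bounds` and the chunking scheme are hypothetical, chosen only for illustration.

```python
def fsdp_shard_bounds(numel: int, dp_shard: int, cp: int) -> list[tuple[int, int]]:
    """Return one (start, end) index range per rank in the sharding group.

    Assumption: the dp_shard and cp meshes are combined into a single
    sharding group, so only their product affects how a parameter is split.
    """
    group_size = dp_shard * cp
    # Ceiling division so every element is covered; the last chunk may be shorter.
    chunk = -(-numel // group_size)
    return [
        (min(rank * chunk, numel), min((rank + 1) * chunk, numel))
        for rank in range(group_size)
    ]


two_by_two = fsdp_shard_bounds(1024, dp_shard=2, cp=2)
four_by_one = fsdp_shard_bounds(1024, dp_shard=4, cp=1)

# Both layouts produce the same four parameter shards.
print(two_by_two == four_by_one)
```

Under these assumptions the two configurations are indistinguishable from the parameters' point of view, which is why CP would not need a separate seeding rule for model initialization.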

docs/debugging.md

Co-authored-by: tianyu-l <[email protected]>

tianyu-l

please address remaining comments

docs/debugging.md

add debugging

7ce024b

wwwjn requested review from tianyu-l, fegin and wconstab as code owners July 2, 2025 00:52

facebook-github-bot added the CLA Signed label Jul 2, 2025

cleanr

c15a849

tianyu-l reviewed Jul 2, 2025


tianyu-l mentioned this pull request Jul 2, 2025

[WIP][RFC] Always flatten model state_dict #1347

Open

wwwjn and others added 4 commits July 2, 2025 10:06

Update docs/debugging.md

ee564cc

Co-authored-by: tianyu-l <[email protected]>

Update docs/debugging.md

bd099cb

Co-authored-by: tianyu-l <[email protected]>

Update docs/debugging.md

65b6f5b

Co-authored-by: tianyu-l <[email protected]>

fix comment

fffb3ed

wwwjn requested a review from tianyu-l July 2, 2025 18:36

tianyu-l approved these changes Jul 3, 2025


fix comments

d713312

wwwjn merged commit 4aa9fde into main Jul 3, 2025
6 of 7 checks passed
