Inconsistent loss when resuming training with a vocab size that is not divisible by world size. #1136


Open
weixuansun opened this issue Apr 23, 2025 · 4 comments


@weixuansun

Bug description

When I use a tokenizer whose vocabulary size is not divisible by the parallel (or world) size, the training loss becomes inconsistent after resuming from a checkpoint.

Versions

Can be reproduced with torch 2.6.

Reproduce:

1. Use any tokenizer with a vocabulary size that is not divisible by the parallel (or world) size.
2. Train from step 0 to step 20. [screenshot: loss curve for steps 0–20]
3. Load the step-10 checkpoint and resume training. [screenshot: loss curve after resuming]

As shown, step 11 and the following steps have inconsistent loss.
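For illustration, here is a minimal sketch of the uneven sharding that an indivisible vocab size produces (the vocab size and world size below are hypothetical examples, and chunk-style sharding of the embedding table, as with DTensor's Shard placement, is assumed):

```python
# Illustration only: an indivisible vocab size gives the last rank a smaller
# embedding shard than the others (chunk-style sharding assumed).
vocab_size = 50257  # hypothetical, e.g. a GPT-2-style tokenizer; 50257 % 8 != 0
world_size = 8      # hypothetical parallel size

chunk = -(-vocab_size // world_size)  # ceil division -> 6283 rows per full shard
shard_rows = [min(chunk, vocab_size - rank * chunk) for rank in range(world_size)]
print(shard_rows)  # [6283, 6283, 6283, 6283, 6283, 6283, 6283, 6276]
```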
@tianyu-l
Contributor

Could you try the latest PyTorch nightly? This might have been fixed by pytorch/pytorch#150490.

@unlimblue
Contributor

> Could you try the latest PyTorch nightly? This might have been fixed by pytorch/pytorch#150490.

https://github.com/pytorch/pytorch/tree/c729f7dbee3be20f04a6aeac41ae0cd5be23403d

Compiling PyTorch from the latest main branch source as of April 22nd (linked above) still results in the same issue.

@fegin
Contributor

fegin commented Apr 24, 2025

This looks like a checkpointing problem. Can you confirm 1) does this only happen after checkpoint resume? 2) does this only happen when vocab size is not divisible by world size? From your description, the answers to both questions seem to be yes, but I do want to confirm. Thanks!

@weixuansun
Author

> This looks like a checkpointing problem. Can you confirm 1) does this only happen after checkpoint resume? 2) does this only happen when vocab size is not divisible by world size? From your description, the answers to both questions seem to be yes, but I do want to confirm. Thanks!

Yes, the answers to both questions are yes.
If I pad the vocab size to be divisible, the loss stays consistent.
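In case it helps others, this is roughly the padding I mean (a minimal sketch; the helper name is mine, not an existing torchtitan API):

```python
def pad_vocab_size(vocab_size: int, world_size: int) -> int:
    """Round vocab_size up to the next multiple of world_size.

    Hypothetical helper: with a padded embedding table every rank holds an
    equally sized shard, and the loss after resuming stays consistent.
    """
    return ((vocab_size + world_size - 1) // world_size) * world_size

# Example: 50257 -> 50264 for world_size == 8. The extra rows are never
# emitted by the tokenizer, so they only cost a few unused embedding vectors.
print(pad_vocab_size(50257, 8))  # 50264
```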
