Inconsistent loss when resuming training with a vocab size that is not divisible by world size. #1136


Open
weixuansun opened this issue Apr 23, 2025 · 4 comments


@weixuansun

Bug description

When I use a tokenizer whose vocabulary size is not divisible by the parallel (or world) size, the training loss becomes inconsistent after resuming from a checkpoint.

Versions

Can be reproduced with torch 2.6.

Reproduce:

1. Use any tokenizer with a vocabulary size that is not divisible by the parallel (or world) size.
2. Train from step 0 to step 20. [screenshot: loss curve for steps 0–20]
3. Load the step-10 checkpoint and resume training. [screenshot: loss curve after resuming]

As shown, step 11 and the following steps have inconsistent loss.
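For illustration, here is a minimal sketch of the uneven sharding that an indivisible vocab size produces (the vocab size and world size below are hypothetical examples, and chunk-style sharding of the embedding table, as with DTensor's Shard placement, is assumed):

```python
# Illustration only: an indivisible vocab size gives the last rank a smaller
# embedding shard than the others (chunk-style sharding assumed).
vocab_size = 50257  # hypothetical, e.g. a GPT-2-style tokenizer; 50257 % 8 != 0
world_size = 8      # hypothetical parallel size

chunk = -(-vocab_size // world_size)  # ceil division -> 6283 rows per full shard
shard_rows = [min(chunk, vocab_size - rank * chunk) for rank in range(world_size)]
print(shard_rows)  # [6283, 6283, 6283, 6283, 6283, 6283, 6283, 6276]
```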
@tianyu-l
Contributor

Could you try the latest PyTorch nightly? This might have been fixed by pytorch/pytorch#150490.

@unlimblue
Contributor

> Could you try the latest PyTorch nightly? This might have been fixed by pytorch/pytorch#150490.

https://github.com/pytorch/pytorch/tree/c729f7dbee3be20f04a6aeac41ae0cd5be23403d

Compiling PyTorch from the latest main branch source as of April 22nd (linked above) still results in the same issue.

@fegin
Contributor

fegin commented Apr 24, 2025

This looks like a checkpointing problem. Can you confirm 1) does this only happen after checkpoint resume? 2) does this only happen when vocab size is not divisible by world size? From your description, the answers to both questions seem to be yes, but I do want to confirm. Thanks!

@weixuansun
Author

> This looks like a checkpointing problem. Can you confirm 1) does this only happen after checkpoint resume? 2) does this only happen when vocab size is not divisible by world size? From your description, the answers to both questions seem to be yes, but I do want to confirm. Thanks!

Yes, the answers to both questions are yes.
If I pad the vocab size to be divisible, the loss stays consistent.
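In case it helps others, this is roughly the padding I mean (a minimal sketch; the helper name is mine, not an existing torchtitan API):

```python
def pad_vocab_size(vocab_size: int, world_size: int) -> int:
    """Round vocab_size up to the next multiple of world_size.

    Hypothetical helper: with a padded embedding table every rank holds an
    equally sized shard, and the loss after resuming stays consistent.
    """
    return ((vocab_size + world_size - 1) // world_size) * world_size

# Example: 50257 -> 50264 for world_size == 8. The extra rows are never
# emitted by the tokenizer, so they only cost a few unused embedding vectors.
print(pad_vocab_size(50257, 8))  # 50264
```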
