How to resume from previous checkpoints trained under different node topology? #14347

@tjoymeed

Hi all,

We are pretraining the Qwen3-30B-A3B LLM using NeMo-Megatron (container Nemo:25.04, upgraded to the latest GitHub repo).

How can we resume from checkpoints that were trained under a different node topology?

Are there any constraints regarding (tp, pp, ep, cp, sp, vpp, mbs, gbs)?

For example, when resuming from a previous checkpoint, which of the above parameters can change and which cannot?

For instance, I am guessing that ep cannot change. Is that correct?

What are the concrete rules?
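To make the question concrete: independent of whether the checkpoint itself can be resharded, any new layout first has to satisfy the usual Megatron-style launch arithmetic, where the data-parallel size is `world_size / (tp * pp * cp)` and gbs must be a multiple of `mbs * dp`. The snippet below is only a minimal sketch of that arithmetic. `ParallelConfig` and both helper functions are hypothetical names, not NeMo or Megatron APIs, the numbers are made up, and ep, sp and vpp are deliberately left out because their rules are exactly what is being asked here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    """Hypothetical container for the launch parameters (not a NeMo class)."""
    world_size: int  # total number of GPUs across all nodes
    tp: int = 1      # tensor parallel size
    pp: int = 1      # pipeline parallel size
    cp: int = 1      # context parallel size
    mbs: int = 1     # micro batch size
    gbs: int = 1     # global batch size


def data_parallel_size(cfg: ParallelConfig) -> int:
    """Return dp = world_size / (tp * pp * cp), or raise if it is not an integer."""
    model_parallel = cfg.tp * cfg.pp * cfg.cp
    if cfg.world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={cfg.world_size} is not divisible by "
            f"tp*pp*cp={model_parallel}"
        )
    return cfg.world_size // model_parallel


def grad_accumulation_steps(cfg: ParallelConfig) -> int:
    """Return gbs / (mbs * dp), or raise if gbs is not a multiple of mbs * dp."""
    samples_per_micro_step = cfg.mbs * data_parallel_size(cfg)
    if cfg.gbs % samples_per_micro_step != 0:
        raise ValueError(
            f"gbs={cfg.gbs} is not divisible by mbs*dp={samples_per_micro_step}"
        )
    return cfg.gbs // samples_per_micro_step


# Hypothetical old and new layouts that keep the same global batch size.
old = ParallelConfig(world_size=64, tp=4, pp=2, cp=1, mbs=1, gbs=512)
new = ParallelConfig(world_size=32, tp=2, pp=2, cp=1, mbs=2, gbs=512)

for name, cfg in (("old", old), ("new", new)):
    print(name, "dp =", data_parallel_size(cfg),
          "accumulation steps =", grad_accumulation_steps(cfg))
```

Passing this check only means the new job can be launched with a consistent batch layout. It says nothing about whether the saved checkpoint shards can be loaded under the new tp/pp/ep values, which is the part we need clarified.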

I read somewhere that the checkpoint versioning system is very "picky", in the sense that any slight change in configuration makes the checkpoint un-resumable. Is that true?

How can I bypass the checkpoint versioning system to get maximum resumability from a previous checkpoint that was saved with a slightly different parameter combination?

Thanks!
