How to resume from previous checkpoints trained under different node topology?

Hi all,

We are pretraining the Qwen3-30B-A3B LLM with Nemo-Megatron (container Nemo:25.04, upgraded to the latest GitHub repo).

How can we resume from a previous checkpoint that was trained under a different node topology?

Are there any constraints regarding (tp, pp, ep, cp, sp, vpp, mbs, gbs)?

For example, when resuming from a previous checkpoint, which of the above parameters can change and which cannot?

My guess is that ep cannot change; is that right?

What are the concrete rules?
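
For context, the only rules I am sure of are the basic layout arithmetic, which I have written down as a minimal sketch below. The function name `check_layout` and its arguments are hypothetical (this is not a NeMo or Megatron API), and `num_experts=128` is my assumption for this model. It only checks that a single configuration is self-consistent; it says nothing about whether a checkpoint saved under one layout can be loaded under another, which is what I am asking about.

```python
def check_layout(world_size, tp, pp, cp, ep, mbs, gbs, num_experts):
    """Sanity-check a parallel layout (hypothetical helper, not a NeMo API).

    Returns the implied data-parallel size and the number of
    gradient-accumulation steps, or raises ValueError.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    dp = world_size // model_parallel

    # Each optimizer step consumes mbs samples per data-parallel rank
    # per micro-batch, so gbs must be a multiple of mbs * dp.
    if gbs % (mbs * dp) != 0:
        raise ValueError(f"gbs={gbs} is not divisible by mbs*dp={mbs * dp}")

    # Experts are split evenly across expert-parallel ranks.
    if num_experts % ep != 0:
        raise ValueError(f"num_experts={num_experts} is not divisible by ep={ep}")

    return {"dp": dp, "grad_accum_steps": gbs // (mbs * dp)}


# Example: 16 GPUs; num_experts=128 is an assumption for this model.
layout = check_layout(world_size=16, tp=2, pp=2, cp=1, ep=4,
                      mbs=1, gbs=64, num_experts=128)
print(layout)
```

Both the old and the new topology pass a check like this in our case, so the open question is purely which of these values may differ between save time and load time.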

I read somewhere that the checkpoint versioning system is very "picky", in the sense that any slight change makes the checkpoint un-resumable. Is that true?

How can I bypass the checkpoint versioning system to maximize resumability from a previous checkpoint that was saved with a slightly different parameter combination?

Thanks!

How to resume from previous checkpoints trained under different node topology? #14347
