Hi all,
We are pretraining the Qwen3-30B-A3B LLM with NeMo-Megatron (container Nemo:25.04, upgraded to the latest GitHub repo).
How can we resume from checkpoints that were trained under a different node topology?
Are there any constraints regarding (tp, pp, ep, cp, sp, vpp, mbs, gbs)?
For example, when resuming from a previous checkpoint, which of the above parameters can change and which cannot?
I am guessing that ep, for instance, cannot change?
What are the concrete rules?
I read somewhere that the checkpoint versioning system is very "picky", in the sense that any slight change makes a checkpoint un-resumable. Is that accurate?
How can I bypass the checkpoint versioning system to maximize resumability from a previous checkpoint that was saved with a slightly different parameter combination?
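To make the topology question concrete, here is a minimal sketch of the basic sanity checks that, as far as I understand, any parallel layout has to pass before checkpoint compatibility even comes into play. It is not NeMo or Megatron code: `Layout` and `check_layout` are hypothetical names, the GPU counts and parallel sizes in the example are made up, and the data-parallel size is assumed to be world_size / (tp * pp * cp). The expert count is passed in as a parameter (128 for Qwen3-30B-A3B).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical container for one parallel layout (not a NeMo/Megatron class)."""
    world_size: int  # total number of GPUs across all nodes
    tp: int = 1      # tensor parallel size
    pp: int = 1      # pipeline parallel size
    cp: int = 1      # context parallel size
    ep: int = 1      # expert parallel size
    mbs: int = 1     # micro batch size
    gbs: int = 1     # global batch size


def check_layout(layout: Layout, num_experts: int) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    model_parallel = layout.tp * layout.pp * layout.cp
    if layout.world_size % model_parallel != 0:
        problems.append("world_size is not divisible by tp * pp * cp")
        return problems  # dp is undefined, so the remaining checks are skipped

    # Assumption: data-parallel size is whatever remains after tp, pp and cp.
    dp = layout.world_size // model_parallel
    if layout.gbs % (layout.mbs * dp) != 0:
        problems.append("gbs is not divisible by mbs * dp")
    if num_experts % layout.ep != 0:
        problems.append("num_experts is not divisible by ep")
    return problems


if __name__ == "__main__":
    # Made-up example: the same run moved from 64 GPUs to 32 GPUs.
    old = Layout(world_size=64, tp=2, pp=2, ep=8, mbs=1, gbs=512)
    new = Layout(world_size=32, tp=4, pp=1, ep=4, mbs=2, gbs=512)
    for name, layout in (("old", old), ("new", new)):
        print(name, check_layout(layout, num_experts=128))
```

Both layouts in this example pass these checks, so what I am really asking is whether a checkpoint saved under `old` can be loaded under `new`, and which of the fields above are allowed to differ.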
Thanks!