**Describe the bug**

When using ZeRO++ with `zero_hpz_partition_size` set, the loss at the first step of training, or at the first step after loading a checkpoint, is abnormally high.

**To Reproduce**

Qwen3 SFT training with ZeRO stage 3 on one node with 8 GPUs, and `zero_hpz_partition_size` set to 4.

**Screenshots**

Two experiments:

* `base`: `zero_hpz_partition_size` is not set.
* `bug`: `zero_hpz_partition_size` is set to 4. A checkpoint is saved at step 50, and training is then resumed from that checkpoint.

<img width="825" height="479" alt="Image" src="https://github.com/user-attachments/assets/fbcff041-d6cd-47c0-9ddd-1b4feb8d3923" />

**Docker context**

Based on image: `ghcr.io/pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel`

* python: 3.11.13
* deepspeed: 0.16.9
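
**Config sketch**

The two runs differ only in whether `zero_hpz_partition_size` appears in the `zero_optimization` section. The snippet below is a minimal sketch of that difference, not the exact config used: `make_ds_config` is a hypothetical helper, and the micro batch size and `bf16` settings are placeholder assumptions rather than values taken from the actual runs.

```python
import json


def make_ds_config(hpz_partition_size=None):
    """Build a minimal DeepSpeed config dict (hypothetical helper).

    Only the `zero_optimization` section reflects the report; the other
    keys are placeholder assumptions so the dict is a usable config.
    """
    zero = {"stage": 3}  # ZeRO stage 3, as in the report
    if hpz_partition_size is not None:
        # ZeRO++ hpZ: size of the secondary (hierarchical) partition group
        zero["zero_hpz_partition_size"] = hpz_partition_size
    return {
        "train_micro_batch_size_per_gpu": 1,  # placeholder assumption
        "bf16": {"enabled": True},  # placeholder assumption
        "zero_optimization": zero,
    }


# `base` run: zero_hpz_partition_size not set
base_config = make_ds_config()

# `bug` run: zero_hpz_partition_size set to 4 (one node, 8 GPUs)
bug_config = make_ds_config(hpz_partition_size=4)

if __name__ == "__main__":
    print(json.dumps(bug_config, indent=2))
```

Either dict can be dumped to a JSON file and passed to the training script as the DeepSpeed config, so the only variable between the two runs is the hpZ partition size.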