[BUG] The loss of the first step is too high when zero_hpz_partition_size is set.

**Describe the bug**

When ZeRO++ is used and `zero_hpz_partition_size` is set, the loss at the first training step, or at the first step after loading a checkpoint, is much higher than expected.

**To Reproduce**

Qwen3 SFT training with ZeRO stage 3 on one node with 8 GPUs, with `zero_hpz_partition_size` set to 4.

**Screenshots**

Two experiments:

* `base`: `zero_hpz_partition_size` is not set.
* `bug`: `zero_hpz_partition_size` is set to 4. A checkpoint is saved at step 50, and training is then resumed from that checkpoint.

<img width="825" height="479" alt="Image" src="https://github.com/user-attachments/assets/fbcff041-d6cd-47c0-9ddd-1b4feb8d3923" />
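For reference, the two experiments differ only in the `zero_optimization` section of the DeepSpeed config. The following is a minimal sketch of that difference, not the exact config used in the runs: `make_ds_config` is a hypothetical helper, and the batch size and `bf16` setting are placeholder assumptions.

```python
import json


def make_ds_config(hpz_partition_size=None):
    """Build a minimal ZeRO stage 3 config dict.

    Only `stage` and `zero_hpz_partition_size` come from the report;
    every other value is a placeholder assumption.
    """
    config = {
        "train_micro_batch_size_per_gpu": 1,  # placeholder, not from the report
        "bf16": {"enabled": True},            # assumption, not from the report
        "zero_optimization": {"stage": 3},
    }
    if hpz_partition_size is not None:
        # ZeRO++ hpZ: size of the secondary (hierarchical) partition group
        config["zero_optimization"]["zero_hpz_partition_size"] = hpz_partition_size
    return config


base_config = make_ds_config()                     # `base` experiment
bug_config = make_ds_config(hpz_partition_size=4)  # `bug` experiment

print(json.dumps(bug_config, indent=2))
```

Either dict can be written to a JSON file and passed to the training script as the DeepSpeed config. Keeping everything else identical should make it easier to attribute the loss spike to `zero_hpz_partition_size` alone.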


**Docker context**
* Based on image: `ghcr.io/pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel`
* python: 3.11.13
* deepspeed: 0.16.9
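To confirm that another environment matches these versions, a small script along these lines may help. It is a sketch using only the standard library; `package_version` and `collect_versions` are hypothetical helper names. DeepSpeed's own `ds_report` command gives a fuller environment report.

```python
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_versions(packages=("torch", "deepspeed")):
    """Collect the Python version and the versions of the given packages."""
    info = {"python": sys.version.split()[0]}
    for name in packages:
        info[name] = package_version(name)
    return info


for key, value in collect_versions().items():
    print(f"{key}: {value}")
```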


[BUG] The loss of the first step is too high when zero_hpz_partition_size is set. #7606
