[BUG] The loss of the first step is too high when zero_hpz_partition_size is set. #7606

@zhengchenyu

Describe the bug

When using ZeRO++ with zero_hpz_partition_size set, the loss at the first step, or at the first step after loading a checkpoint, is too high.

To Reproduce

Qwen3 SFT training with ZeRO stage 3 on one node with 8 GPUs, with zero_hpz_partition_size set to 4.
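
For reference, the two setups can be sketched as DeepSpeed config dictionaries. This is only an illustration of the setting involved, not my exact config: `stage` and `zero_hpz_partition_size` come from the description above, while the batch size, the bf16 setting, and the helper name `with_hpz` are placeholder assumptions.

```python
import copy
import json

# Baseline: ZeRO stage 3 without hpZ secondary partitioning.
# Only "stage": 3 is taken from the report; the batch size and bf16
# values are placeholders, not the reporter's actual settings.
BASE_CONFIG = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}


def with_hpz(config, partition_size):
    """Return a copy of `config` with zero_hpz_partition_size set (hypothetical helper)."""
    cfg = copy.deepcopy(config)
    cfg["zero_optimization"]["zero_hpz_partition_size"] = partition_size
    return cfg


# Failing setup: same config, with zero_hpz_partition_size set to 4 on 8 GPUs.
BUG_CONFIG = with_hpz(BASE_CONFIG, 4)

if __name__ == "__main__":
    print(json.dumps(BUG_CONFIG, indent=2))
```

The only difference between the two dictionaries is the `zero_hpz_partition_size` key, which is what separates the base run from the bug run.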

Screenshots

Two experiments:

  • base: zero_hpz_partition_size is not set.
  • bug: zero_hpz_partition_size is set to 4. A checkpoint is saved at step 50, and training is then resumed from that checkpoint.
Image

Docker context
base image: ghcr.io/pytorch/pytorch:2.8.0-cuda12.8-cudnn9-devel
python: 3.11.13
deepspeed: 0.16.9

Metadata

Labels: bug, training