New GA fix causes training loss multiple times higher across the board (5x to 10x higher) #34263

@JianbangZ

System Info

8xH100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

After updating to the latest master branch of transformers, the training loss is multiple times higher than before (5x-10x). I tried both SFT and DPO (paired with the latest trl master), and both show the same problem.
SFT after GA fix
image

SFT before GA fix
image

Expected behavior

Training loss values should match the old values, or be lower.
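
One thing that may be worth checking is whether the jump is only a scaling difference in the logged value rather than a real regression. If the ratio between the new and old loss is close to gradient_accumulation_steps, the logged loss might be summed over micro-batches instead of averaged. This is an assumption on my part, not something confirmed in this report. The sketch below is plain Python with hypothetical names (`logged_loss`, `divide_by_steps`) and made-up loss values; it is not the Trainer's actual code.

```python
def logged_loss(micro_batch_losses, divide_by_steps):
    """Return the value a training loop would log for one optimizer step.

    Hypothetical helper for illustration only; not part of transformers or trl.
    """
    steps = len(micro_batch_losses)
    total = 0.0
    for loss in micro_batch_losses:
        # When divide_by_steps is True, each micro-batch loss is scaled by
        # 1/steps before accumulation, so the total is the mean loss.
        # When False, the raw losses are summed and the total is steps times larger.
        total += loss / steps if divide_by_steps else loss
    return total


# Eight hypothetical micro-batches, each with the same made-up loss value.
losses = [2.0] * 8

averaged = logged_loss(losses, divide_by_steps=True)
summed = logged_loss(losses, divide_by_steps=False)

# The ratio equals the number of accumulation steps when scaling is skipped.
print(averaged, summed, summed / averaged)
```

If the observed ratio does not track the accumulation setting, this explanation probably does not apply and the cause lies elsewhere.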
