New GA fix causes training loss multiple times higher across the board (5x to 10x higher)

### System Info

8xH100

### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

After updating to the latest master branch of transformers, the training loss is multiple times higher than before (5x-10x). I tried both SFT and DPO (paired with the latest trl master), and both show the same problem.

SFT after the GA (gradient accumulation) fix:
<img width="1446" alt="image" src="https://github.com/user-attachments/assets/37e099ce-1cb9-45e4-9ca8-7be701f01136">

SFT before the GA fix:
<img width="1448" alt="image" src="https://github.com/user-attachments/assets/884606fb-097d-4eeb-9125-da055dbab61b">


### Expected behavior

The training loss should be in line with the values from before the GA fix, or lower if anything.
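To make the expectation concrete, here is a minimal sketch of how a missing normalization over gradient accumulation steps could inflate the reported loss. This is a guess at a possible cause, not a confirmed diagnosis. The per-token losses are made-up numbers, the variable names are hypothetical, and none of this is the actual Trainer code.

```python
# Hypothetical per-token losses for 4 micro-batches of different lengths.
# These numbers are invented for illustration; they are not from the runs above.
micro_batches = [
    [2.0, 2.0, 2.0, 2.0],
    [1.0, 1.0],
    [3.0],
    [2.0, 2.0, 2.0],
]
ga_steps = len(micro_batches)


def mean(values):
    return sum(values) / len(values)


# Averaging each micro-batch, then averaging across accumulation steps.
old_loss = sum(mean(mb) for mb in micro_batches) / ga_steps

# Token-weighted loss: sum of all token losses divided by the total token count.
total_tokens = sum(len(mb) for mb in micro_batches)
token_weighted_loss = sum(sum(mb) for mb in micro_batches) / total_tokens

# Possible failure mode (assumption): micro-batch means are accumulated
# but never divided by the number of accumulation steps.
unnormalized_loss = sum(mean(mb) for mb in micro_batches)

print(f"mean of micro-batch means: {old_loss:.3f}")
print(f"token-weighted loss:       {token_weighted_loss:.3f}")
print(f"unnormalized sum:          {unnormalized_loss:.3f}")
```

In this toy case the two properly normalized values are close to each other, while the unnormalized sum is larger by a factor equal to the number of accumulation steps. A 5x-10x jump would fit that pattern if the accumulation step count were in that range, but I have not verified that this is what happens here.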

New GA fix causes training loss multiple times higher across the board (5x to 10x higher) #34263
