
Loss scaling is incorrect when using gradient_accumulation_steps > 1 #802

@BitPhinix

Description


🐛 Describe the bug

I fixed it for this one specific case here; sadly, I don't have the time to put in a proper PR covering all models and all cases: BitPhinix@c455526

TLDR:

Ever since huggingface/transformers#34191, Transformers expects all models that accept **kwargs to scale the loss by num_items_in_batch.

Right now, the loss is effectively multiplied by the number of gradient accumulation steps when using Liger kernels.
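
For context, here is a minimal sketch (not Liger's actual kernel code, and the function name `causal_lm_loss` is illustrative) of the scaling Transformers now expects from a model's loss path: sum the token losses and divide by `num_items_in_batch` instead of taking a per-micro-batch mean.

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, num_items_in_batch=None, ignore_index=-100):
    # Shift so that tokens < n predict token n (standard causal LM setup).
    shift_logits = logits[..., :-1, :].contiguous().view(-1, logits.size(-1))
    shift_labels = labels[..., 1:].contiguous().view(-1)

    if num_items_in_batch is None:
        # No gradient accumulation bookkeeping: a plain per-token mean is fine.
        return F.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index)

    # With gradient accumulation, sum the loss and divide by the total number of
    # non-ignored tokens across all micro-batches, which the Trainer passes in as
    # num_items_in_batch. Averaging per micro-batch instead effectively multiplies
    # the loss by the number of accumulation steps.
    loss = F.cross_entropy(
        shift_logits, shift_labels, ignore_index=ignore_index, reduction="sum"
    )
    return loss / num_items_in_batch
```

The Liger fused linear cross entropy path currently ignores `num_items_in_batch` and returns a mean over each micro-batch, which is where the extra factor comes from.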

Reproduce

No response

Versions

all
