🐛 Describe the bug
I fixed it for this one specific case here, but sadly I don't have the time to put in a proper PR covering all models and all cases: BitPhinix@c455526
TLDR: Since huggingface/transformers#34191, Transformers expects all models that accept kwargs to scale the loss by num_items_in_batch.
Right now, the loss is effectively multiplied by the number of gradient accumulation steps when using Liger kernels.
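For reference, a minimal sketch of the loss convention that transformers relies on after huggingface/transformers#34191 (this is illustrative plain PyTorch, not Liger's fused kernel, and the helper name causal_lm_loss is made up here): the loss is sum-reduced and divided by num_items_in_batch when it is provided, instead of being mean-reduced per micro-batch.

```python
import torch
import torch.nn.functional as F


def causal_lm_loss(logits, labels, num_items_in_batch=None, ignore_index=-100):
    # Shift so that tokens < n predict token n.
    shift_logits = logits[..., :-1, :].contiguous().float()
    shift_labels = labels[..., 1:].contiguous()

    # Sum-reduce; the normalizer is applied explicitly below.
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
        reduction="sum",
    )

    if num_items_in_batch is not None:
        # Normalize by the token count of the full batch (across all
        # gradient accumulation steps) so accumulation does not inflate
        # the loss.
        loss = loss / num_items_in_batch
    else:
        # Fall back to a per-micro-batch mean over non-ignored tokens.
        loss = loss / (shift_labels != ignore_index).sum().clamp(min=1)
    return loss
```

If a model path (such as the Liger patch) ignores num_items_in_batch and mean-reduces per micro-batch instead, the Trainer's gradient accumulation fix no longer applies, which is the inflated-loss behavior described above.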
Reproduce
No response
Versions
all