Feature request
Token averaging in gradient accumulation was fixed in #34191, but token averaging in DDP seems to have the same issue.
Expected behavior
With all the tokens that contribute to the loss in each step denoted $\mathcal{T}_{g,a}$ (for GPU $g$ and gradient accumulation microbatch $a$), and the loss of token $t$ denoted $\ell_t$, the equation becomes:

$$L = \frac{\sum_{g} \sum_{a} \sum_{t \in \mathcal{T}_{g,a}} \ell_t}{\sum_{g} \sum_{a} \lvert \mathcal{T}_{g,a} \rvert}$$

I believe we should average over all of these tokens at the same time to be equivalent to non-parallel training.
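A tiny numeric illustration (with made-up values) of why this matters: averaging each microbatch first and then averaging those averages differs from the global token average whenever the microbatches contain different numbers of tokens.

```python
import torch

# Toy per-token losses for two microbatches with unequal token counts
# (think: two GPUs or two gradient accumulation steps).
losses_a = torch.tensor([1.0, 1.0, 1.0, 1.0])  # 4 tokens
losses_b = torch.tensor([3.0, 3.0])            # 2 tokens

# Global token average (what non-parallel training on the full batch gives).
global_avg = torch.cat([losses_a, losses_b]).mean()    # = 1.666...

# Averaging each microbatch first, then averaging the averages
# (what per-GPU / per-microbatch averaging effectively does).
avg_of_avgs = (losses_a.mean() + losses_b.mean()) / 2  # = 2.0

print(global_avg.item(), avg_of_avgs.item())
```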
Current issue
Prior to #34191, the loss/gradients were averaged per microbatch, i.e. over $\lvert \mathcal{T}_{g,a} \rvert$ tokens. `num_items_in_batch` in #34191 refers to the token count accumulated over the gradient accumulation steps of a single GPU:

$$\mathrm{num\_items\_in\_batch}_g = \sum_{a} \lvert \mathcal{T}_{g,a} \rvert$$

So, the loss/gradients are now averaged over $\sum_{a} \lvert \mathcal{T}_{g,a} \rvert$ on each GPU, and DDP then averages the gradients across GPUs; that only matches the equation above when every GPU happens to see the same number of tokens. Can we also incorporate `num_items_in_batch` across GPUs? Something like `all_reduce(num_items_in_batch)` before dividing the loss?
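For concreteness, here is a minimal sketch of what this could look like in a DDP training step. This is not the actual Trainer code; the `microbatches` iterable, counting loss tokens via `labels != -100`, and a model that returns summed (not averaged) token losses are all assumptions for illustration.

```python
import torch
import torch.distributed as dist

def ddp_accumulation_step(model, microbatches, optimizer):
    """One optimizer step with gradient accumulation under DDP,
    normalizing by the *global* token count (sketch, not Trainer code)."""
    device = next(model.parameters()).device

    # Tokens that contribute to the loss across all local gradient
    # accumulation microbatches (the per-GPU num_items_in_batch).
    num_items_in_batch = sum(
        (mb["labels"] != -100).sum() for mb in microbatches
    ).to(device)

    # Proposal: also sum across GPUs so every rank divides by the same
    # global token count, as in non-parallel training.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_items_in_batch, op=dist.ReduceOp.SUM)

    optimizer.zero_grad()
    for mb in microbatches:
        outputs = model(**mb)  # assumption: outputs.loss is the *sum* of token losses
        loss = outputs.loss / num_items_in_batch  # divide by the global count, not the local one
        # DDP averages gradients over ranks; scaling by world_size turns that
        # average back into a sum before the global normalization above.
        if dist.is_available() and dist.is_initialized():
            loss = loss * dist.get_world_size()
        loss.backward()
    optimizer.step()
```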
Motivation
DDP training does not seem to be fully equivalent to non-parallel training.
related comments: #34191 (comment)
Your contribution
I found a fairseq implementation of this feature.