Feature request
Track only the tokens actually seen during training in `num_input_tokens_seen`, avoiding counting padding tokens
Motivation
It appears that this metric (`num_input_tokens_seen`) also includes padding tokens. If example packing is used, then it does track the "correct" number of tokens seen by the model.
However, I can think of two cases where this will not be accurate:
- In cases where packing is not used, training examples are padded to the longest sequence in the batch
- For SFT training on completions only
For the first case, a more accurate calculation would be to sum the attention mask.
For the second case, I'm not sure how this should be handled. However, we could consider counting only label tokens that are not equal to -100 (see the sketch below).
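To make the two proposals concrete, here is a minimal standalone sketch; the example tensors are hypothetical and not taken from the Trainer:

    import torch

    # Hypothetical padded batch: two sequences padded to length 6.
    attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                                   [1, 1, 1, 1, 1, 1]])
    # Hypothetical labels for completion-only SFT: prompt tokens masked with -100.
    labels = torch.tensor([[-100, -100, 5, 9, -100, -100],
                           [-100, 17, 23, 42, 8, 11]])

    # Case 1: count only real (non-padding) tokens by summing the attention mask.
    non_pad_tokens = attention_mask.sum().item()        # 10, instead of 12 with padding included

    # Case 2: count only tokens that actually contribute to the loss.
    completion_tokens = (labels != -100).sum().item()   # 7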
Your contribution
Replace lines 2248-2258 in trainer.py (v4.43.4) with the following:
# Sum the attention mask so padding tokens are not counted, then gather the count across processes.
self.state.num_input_tokens_seen += (
    torch.sum(
        self.accelerator.gather(
            torch.tensor(
                inputs['attention_mask'].sum(), device=self.args.device, dtype=torch.int64
            )
        )
    )
    .cpu()
    .item()
)
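For the completion-only case, a similar variant could count label tokens instead of summing the attention mask. This is only a sketch and assumes that `labels` is present in `inputs`:

# Count only tokens that contribute to the loss (labels != -100); assumes `labels` is in `inputs`.
self.state.num_input_tokens_seen += (
    torch.sum(
        self.accelerator.gather(
            torch.tensor(
                (inputs['labels'] != -100).sum(), device=self.args.device, dtype=torch.int64
            )
        )
    )
    .cpu()
    .item()
)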