Feature request
Track only the tokens actually seen during training in `num_input_tokens_seen`, avoiding counting padding tokens
Motivation
It appears that this metric (`num_input_tokens_seen`) also includes padding tokens. If example packing is used, then it does track the "correct" number of tokens seen by the model.
However, I can think of two cases where this will not be accurate:
- In cases where packing is not used, training examples are padded to the longest sequence in the batch
- For SFT training on completions only
For the first case, a more accurate calculation would be to sum the attention mask.
For the second case, I'm not sure how this should be handled. However, we could consider counting only label tokens that are not equal to -100 (see the sketch below).
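To make the two proposals concrete, here is a minimal standalone sketch; the example tensors are hypothetical and not taken from the Trainer:

    import torch

    # Hypothetical padded batch: two sequences padded to length 6.
    attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],
                                   [1, 1, 1, 1, 1, 1]])
    # Hypothetical labels for completion-only SFT: prompt tokens masked with -100.
    labels = torch.tensor([[-100, -100, 5, 9, -100, -100],
                           [-100, 17, 23, 42, 8, 11]])

    # Case 1: count only real (non-padding) tokens by summing the attention mask.
    non_pad_tokens = attention_mask.sum().item()        # 10, instead of 12 with padding included

    # Case 2: count only tokens that actually contribute to the loss.
    completion_tokens = (labels != -100).sum().item()   # 7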
Your contribution
Replace lines 2248-2258 in trainer.py (v4.43.4) with the following:
# Sum the attention mask so padding tokens are not counted, then gather the count across processes.
self.state.num_input_tokens_seen += (
    torch.sum(
        self.accelerator.gather(
            torch.tensor(
                inputs['attention_mask'].sum(), device=self.args.device, dtype=torch.int64
            )
        )
    )
    .cpu()
    .item()
)
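For the completion-only case, a similar variant could count label tokens instead of summing the attention mask. This is only a sketch and assumes that `labels` is present in `inputs`:

# Count only tokens that contribute to the loss (labels != -100); assumes `labels` is in `inputs`.
self.state.num_input_tokens_seen += (
    torch.sum(
        self.accelerator.gather(
            torch.tensor(
                (inputs['labels'] != -100).sum(), device=self.args.device, dtype=torch.int64
            )
        )
    )
    .cpu()
    .item()
)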