Description
Hi,
I've noticed that the tokenizer is configured with tokenizer.pad_token = tokenizer.eos_token.
This poses a significant problem for models like Llama 3, which use the eos_token (e.g., <|eot_id|>) as a semantic delimiter separating turns in chat templates. During batch processing of multi-turn dialogues, the real eos_token marking the end of a turn gets incorrectly masked out in the attention_mask.
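To make this concrete, here is a minimal sketch (not this repo's code; the model name is illustrative and assumes the Hugging Face tokenizer for a Llama 3 instruct checkpoint) showing that <|eot_id|> occurs in the middle of a multi-turn prompt, well before any padding:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # illustrative
tokenizer.pad_token = tokenizer.eos_token  # the configuration in question

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "How are you?"},
]
ids = tokenizer.apply_chat_template(messages)  # returns a list of token ids

eot = tokenizer.convert_tokens_to_ids("<|eot_id|>")
# One <|eot_id|> per completed turn (plus any default system turn) -- none of these are padding.
print(ids.count(eot))
```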
Problem:
When padding shorter sequences in a batch, the attention_mask is set to False at every position equal to pad_token_id. Because pad_token_id is the same as eos_token_id, this also masks eos_tokens that are part of the actual input, not just the padding. The result can be an attention_mask like [...True, True, False, True...], where the False corresponds to a meaningful eos_token.
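A simplified illustration of the failure mode, under the assumption that the mask is (re)built by comparing input_ids against pad_token_id (the ids other than <|eot_id|> are made up):

```python
import torch

eot_id = 128009            # <|eot_id|>; with pad_token = eos_token it is also the pad id
pad_id = eot_id

# One padded sequence: the first <|eot_id|> ends a turn, the dialogue continues,
# and the last two positions are genuine padding.
input_ids = torch.tensor([[1, 42, eot_id, 44, 45, eot_id, pad_id, pad_id]])

# Deriving the mask from pad_token_id cannot tell padding apart from real turn delimiters.
attention_mask = input_ids != pad_id
print(attention_mask[0].tolist())
# [True, True, False, True, True, False, False, False]
#               ^ a real, turn-ending <|eot_id|> gets masked out
```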
Suggested Solution:
Would you consider either adding a new, distinct padding token (and resizing the model's token embeddings accordingly), or using the pad_token that each model already defines (such as <|finetune_right_pad_id|> in Llama 3)? This would resolve the ambiguity and ensure correct attention masking during training.
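A sketch of what I have in mind (the model name and the new token string are illustrative, and <|finetune_right_pad_id|> is only available in checkpoints that reserve it):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Option A: reuse a pad token the checkpoint already reserves (no resize needed),
# provided the vocabulary actually contains it.
tokenizer.pad_token = "<|finetune_right_pad_id|>"

# Option B: add a brand-new pad token and resize the embeddings to match.
# tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
# model.resize_token_embeddings(len(tokenizer))

assert tokenizer.pad_token_id != tokenizer.eos_token_id
```

With either option, positions masked out by pad_token_id can no longer collide with real <|eot_id|> tokens in the input.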
Thanks for your great work on this project!