However, in the label_collate() function and tokenizer pipeline, the padding ID used for batched label sequences is hardcoded to 0.
You can see this here, where the padded targets are generated with pad_id = 0.
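For reference, a minimal sketch of that padding behavior (the function name and shapes here are illustrative, not the repo's exact code): variable-length label sequences are padded to the batch maximum with a hardcoded value of 0.

```python
import torch

def label_collate_sketch(labels):
    """Rough approximation of a label_collate()-style helper (name and
    signature are illustrative): pad variable-length label sequences to
    the batch maximum using a hardcoded pad value of 0."""
    max_len = max(len(seq) for seq in labels)
    padded = torch.zeros(len(labels), max_len, dtype=torch.long)  # pad_id = 0
    for i, seq in enumerate(labels):
        padded[i, : len(seq)] = torch.as_tensor(seq, dtype=torch.long)
    return padded

print(label_collate_sketch([[5, 17, 42], [7]]))
# tensor([[ 5, 17, 42],
#         [ 7,  0,  0]])
```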
This creates a mismatch:
- The embedding layer is configured to ignore blank_id (vocab_size), an ID that never actually appears in the predictor inputs.
- Meanwhile, pad_id = 0 is the value actually used for padding, but it is not masked in the embedding layer, so the padding positions receive gradients and are treated as a normal token during training.
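A small PyTorch sketch of the mismatch, assuming the "ignore" is implemented via nn.Embedding's padding_idx (the usual mechanism) and using made-up sizes and label values: the masked index (blank_id) never occurs in the batch, while index 0, which is actually used for padding, is trained like any other token.

```python
import torch
import torch.nn as nn

vocab_size = 1024                      # illustrative size
blank_id = vocab_size                  # blank appended after the regular vocab

# Embedding sized to include the blank and configured to ignore it,
# even though blank_id never shows up in the predictor inputs.
embed = nn.Embedding(vocab_size + 1, 320, padding_idx=blank_id)

# A batch padded the way the collate code does it: with 0, not blank_id.
labels = torch.tensor([[5, 17, 42],
                       [7,  0,  0]])   # second row padded with pad_id = 0

embed(labels).sum().backward()

# Row 0 of the embedding table gets gradient from the padding positions,
# while the masked row (blank_id) is never touched at all.
print(embed.weight.grad[0].abs().sum())         # non-zero: pad token is trained
print(embed.weight.grad[blank_id].abs().sum())  # zero: mask on an unused index
```

In other words, the embedding row for token 0 ends up doing double duty as both a real token and the pad filler, while the masking on blank_id has no effect.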