[bugfix] tokenizers respect padding: true with non-null max_length #1284

Open · wants to merge 4 commits into main

Conversation

dwisdom0 commented:

This commit changes the behavior of tokenizers to fix a bug when a user passes both padding: true and max_length: <some number>.

The expected behavior is described in the docs of the Python library here
https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__.padding

padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

  • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

And in the Transformers.js docs here
https://huggingface.co/docs/transformers.js/api/tokenizers#pretrainedtokenizercalltext-options--code-batchencoding-code

[options.padding] (boolean | 'max_length', defaults to false): Whether to pad the input sequences.
[options.max_length] (number): Maximum length of the returned list and optionally padding length.

Before this commit, passing

{
  padding: true,
  max_length: 512
}

or

{
  padding: 'max_length',
  max_length: 512
}

would both always pad all outputs to 512 tokens, even if the longest encoding in the batch was shorter than 512 tokens. This is the correct behavior for padding: 'max_length', but it's incorrect for padding: true.

After this change,

{
  padding: true,
  max_length: 512
}

will now pad the outputs to match the longest encoding or max_length, whichever is shorter.
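
As an illustration of the fixed behavior, here is a minimal sketch (assuming the @huggingface/transformers package name, the Xenova/bert-base-uncased checkpoint, and the Tensor dims property; none of these are part of this PR's diff):

import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

// The longest encoding in this batch is 5 tokens, well under max_length.
const { input_ids } = tokenizer(['a', 'b c d e f'], {
  padding: true,
  max_length: 512,
  add_special_tokens: false,
});

// Before this commit: input_ids.dims would be [2, 512].
// After this commit: input_ids.dims is [2, 5], i.e. padded only to the longest encoding.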

This commit also adds a test to prevent regressions.

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova (Collaborator) left a comment:

Thanks for this PR! 🤗 These are indeed valid fixes you have proposed. I checked all combinations with the Python library and noticed there were some cases missing, so I added those test cases and adjusted the logic:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Xenova/bert-base-uncased")
inputs = tokenizer(
    ["a", "b c d e f"],
    padding='max_length',
    truncation=False,
    max_length=None,
    return_tensors=None,
    add_special_tokens=False,
)

for example, should pad to the model's max length (of 512).
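
A rough Transformers.js counterpart of that case (just a sketch, assuming the @huggingface/transformers package name and that the tokenizer falls back to the model's maximum input length when max_length is not given):

import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

// padding: 'max_length' with no explicit max_length should fall back to the
// model's maximum input length (512 for this checkpoint), as in the Python library.
const { input_ids } = tokenizer(['a', 'b c d e f'], {
  padding: 'max_length',
  truncation: false,
  add_special_tokens: false,
});

// Expected after the adjusted logic: input_ids.dims is [2, 512].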


@dwisdom0 Would you mind double checking these changes too? Thanks!
