[bugfix] tokenizers respect padding: true with non-null max_length #1284

Open · wants to merge 4 commits into main

Conversation

dwisdom0 commented:

This commit changes the behavior of tokenizers to fix a bug when a user passes both padding: true and max_length: <some number>.

The expected behavior is described in the docs of the Python library here
https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__.padding

padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:

  • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

And in the Transformers.js docs here
https://huggingface.co/docs/transformers.js/api/tokenizers#pretrainedtokenizercalltext-options--code-batchencoding-code

[options.padding] (boolean | 'max_length', defaults to false): Whether to pad the input sequences.
[options.max_length] (number): Maximum length of the returned list and optionally padding length.

Before this commit, passing

{
  padding: true,
  max_length: 512
}

or

{
  padding: 'max_length',
  max_length: 512
}

would both always pad all outputs to 512 tokens, even if the longest encoding in the batch was shorter than 512 tokens. This is the correct behavior for padding: 'max_length', but it's incorrect for padding: true.

After this change,

{
  padding: true,
  max_length: 512
}

will now pad the outputs to match the longest encoding or max_length, whichever is shorter.
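
As an illustration of the fixed behavior, here is a minimal sketch (assuming the @huggingface/transformers package name, the Xenova/bert-base-uncased checkpoint, and the Tensor dims property; none of these are part of this PR's diff):

import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

// The longest encoding in this batch is 5 tokens, well under max_length.
const { input_ids } = tokenizer(['a', 'b c d e f'], {
  padding: true,
  max_length: 512,
  add_special_tokens: false,
});

// Before this commit: input_ids.dims would be [2, 512].
// After this commit: input_ids.dims is [2, 5], i.e. padded only to the longest encoding.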

This commit also adds a test to prevent regressions.

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova (Collaborator) left a comment:

Thanks for this PR! 🤗 These are indeed valid fixes you have proposed. I checked all combinations with the Python library and noticed there were some cases missing, so I added those test cases and adjusted the logic:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Xenova/bert-base-uncased")
inputs = tokenizer(
    ["a", "b c d e f"],
    padding='max_length',
    truncation=False,
    max_length=None,
    return_tensors=None,
    add_special_tokens=False,
)

for example, should pad to the model's max length (of 512).
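
A rough Transformers.js counterpart of that case (just a sketch, assuming the @huggingface/transformers package name and that the tokenizer falls back to the model's maximum input length when max_length is not given):

import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

// padding: 'max_length' with no explicit max_length should fall back to the
// model's maximum input length (512 for this checkpoint), as in the Python library.
const { input_ids } = tokenizer(['a', 'b c d e f'], {
  padding: 'max_length',
  truncation: false,
  add_special_tokens: false,
});

// Expected after the adjusted logic: input_ids.dims is [2, 512].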


@dwisdom0 Would you mind double checking these changes too? Thanks!
