
Add truncation to CLIP model #2969

Merged
merged 6 commits into UKPLab:master on Mar 24, 2025
Conversation

@MrLoh (Contributor) commented Oct 1, 2024

It seems that Sentence Transformers does not currently support truncation on CLIP models, which leads to an error when calling SentenceTransformer("sentence-transformers/clip-ViT-L-14").encode("my long text" * 26):

Token indices sequence length is longer than the specified maximum sequence length for this model (81 > 77). Running this sequence through the model will result in indexing errors

This PR adds truncation support to the CLIP model, allowing the standard Sentence Transformers max_seq_length argument to be passed in and defaulting it to processor.tokenizer.model_max_length (i.e. 77).
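
For illustration, the change amounts to forwarding truncation to the tokenizer in the CLIP module's tokenize method. A minimal sketch (attribute and method names follow the discussion; the merged code may differ in detail):

```python
def tokenize(self, texts):
    # Sketch of the fix: pass truncation=True so inputs longer than the
    # tokenizer's model_max_length (77 for CLIP) are cut off instead of
    # causing indexing errors downstream.
    return self.processor.tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
```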

@MrLoh (Contributor, Author) commented Mar 21, 2025

@tomaarsen, would you be willing to take a look at this? Happy to address any comments to get this merged.

@tomaarsen (Collaborator) commented:

Hello!

Big apologies for missing this PR when it was opened.
I agree that this is a big oversight: truncation should always be enabled. But beyond that, I'm wary of introducing a max_seq_length argument, as the Transformer module already struggles a bit with having two "sources of truth" for the tokenizer's maximum sequence length: 1) the max_seq_length in Transformer and 2) the model_max_length in the tokenizer.

Because of this, I propose to only use truncation=True (a.k.a. truncation="longest_first") without providing the maximum sequence length manually. Perhaps in the future I'll introduce this max_seq_length fully, but currently I'm not sure yet what the best approach is.

In short: the tokenizer's maximum length is now abided by, and updating the maximum sequence length is only possible by updating model.processor.tokenizer.model_max_length (I'm not 100% sure if model_max_length is the right name).

What do you think?

  • Tom Aarsen
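
For illustration, usage under this proposal would look roughly like the following (a sketch; it assumes the CLIP module is the model's first module, reachable via model[0]):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

# Over-long inputs are now truncated to the tokenizer's limit (77)
# rather than raising an indexing error.
embedding = model.encode("my long text " * 26)

# Changing the limit is only possible through the tokenizer itself:
print(model[0].processor.tokenizer.model_max_length)  # => 77
```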

@MrLoh (Contributor, Author) commented Mar 21, 2025

Thanks a lot for taking a look.

What you say generally makes sense. I am only concerned about the fact that there is a getter and setter for max_seq_length on SentenceTransformer, which operates on the first module to get and set the sequence length (see the code below) and which will error for CLIP:

```python
@property
def max_seq_length(self) -> int:
    """
    Returns the maximal input sequence length for the model. Longer inputs will be truncated.

    Returns:
        int: The maximal input sequence length.

    Example:
        ::

            from sentence_transformers import SentenceTransformer

            model = SentenceTransformer("all-mpnet-base-v2")
            print(model.max_seq_length)
            # => 384
    """
    return self._first_module().max_seq_length

@max_seq_length.setter
def max_seq_length(self, value) -> None:
    """
    Property to set the maximal input sequence length for the model. Longer inputs will be truncated.
    """
    self._first_module().max_seq_length = value
```

Would it maybe be better to add a getter and setter to this class that in turn get and set model.processor.tokenizer.model_max_length?
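
For illustration, such a property on the CLIP module could look roughly like this (a sketch; the committed implementation may differ in detail):

```python
@property
def max_seq_length(self) -> int:
    # Delegate to the tokenizer so there is a single source of truth
    # for the maximum sequence length.
    return self.processor.tokenizer.model_max_length

@max_seq_length.setter
def max_seq_length(self, value: int) -> None:
    self.processor.tokenizer.model_max_length = value
```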

@MrLoh (Contributor, Author) commented Mar 21, 2025

I pushed a commit that adds a getter and setter. Let me know what you think.

PS: I verified that model_max_length is the correct name

@tomaarsen merged commit 6539557 into UKPLab:master on Mar 24, 2025 (9 checks passed).