
Add truncation to CLIP model #2969

Merged
merged 6 commits into UKPLab:master on Mar 24, 2025
Conversation

@MrLoh (Contributor) commented Oct 1, 2024

It seems that Sentence Transformers does not currently support truncation on CLIP models, which leads to an error when calling SentenceTransformer("sentence-transformers/clip-ViT-L-14").encode("my long text" * 26):

Token indices sequence length is longer than the specified maximum sequence length for this model (81 > 77). Running this sequence through the model will result in indexing errors

This PR adds truncation support to the CLIP model, allowing the standard Sentence Transformers max_seq_length argument to be passed in and defaulting it to processor.tokenizer.model_max_length (i.e. 77).
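
For illustration, the change amounts to forwarding truncation to the tokenizer in the CLIP module's tokenize method. A minimal sketch (attribute and method names follow the discussion; the merged code may differ in detail):

```python
def tokenize(self, texts):
    # Sketch of the fix: pass truncation=True so inputs longer than the
    # tokenizer's model_max_length (77 for CLIP) are cut off instead of
    # causing indexing errors downstream.
    return self.processor.tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
```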

@MrLoh (Contributor, Author) commented Mar 21, 2025

@tomaarsen, would you be willing to take a look at this? Happy to address any comments to get this merged.

@tomaarsen (Collaborator) commented:

Hello!

Big apologies for missing this PR when it was opened.
I agree that this is a big oversight: truncation should always be enabled. But beyond that, I'm wary of introducing a max_seq_length argument, as the Transformer module already struggles a bit with having two "sources of truth" for the tokenizer's maximum sequence length: 1) the max_seq_length in Transformer and 2) the model_max_length in the tokenizer.

Because of this, I propose to only use truncation=True (a.k.a. truncation="longest_first") without providing the maximum sequence length manually. Perhaps in the future I'll introduce this max_seq_length fully, but currently I'm not sure yet what the best approach is.

In short: the tokenizer's maximum length is now abided by, and updating the maximum sequence length is only possible by updating model.processor.tokenizer.model_max_length (I'm not 100% sure if model_max_length is the right name).

What do you think?

  • Tom Aarsen
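
For illustration, usage under this proposal would look roughly like the following (a sketch; it assumes the CLIP module is the model's first module, reachable via model[0]):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

# Over-long inputs are now truncated to the tokenizer's limit (77)
# rather than raising an indexing error.
embedding = model.encode("my long text " * 26)

# Changing the limit is only possible through the tokenizer itself:
print(model[0].processor.tokenizer.model_max_length)  # => 77
```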

@MrLoh (Contributor, Author) commented Mar 21, 2025

Thanks a lot for taking a look.

What you say generally makes sense. I am only concerned about the fact that there is a getter and setter for max_seq_length on SentenceTransformer, which operates on the first module to get and set the sequence length (see the code below) and which will error for CLIP:

```python
@property
def max_seq_length(self) -> int:
    """
    Returns the maximal input sequence length for the model. Longer inputs will be truncated.

    Returns:
        int: The maximal input sequence length.

    Example:
        ::

            from sentence_transformers import SentenceTransformer

            model = SentenceTransformer("all-mpnet-base-v2")
            print(model.max_seq_length)
            # => 384
    """
    return self._first_module().max_seq_length

@max_seq_length.setter
def max_seq_length(self, value) -> None:
    """
    Property to set the maximal input sequence length for the model. Longer inputs will be truncated.
    """
    self._first_module().max_seq_length = value
```

Would it maybe be better to add a getter and setter to this class that in turn get and set model.processor.tokenizer.model_max_length?
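
For illustration, such a property on the CLIP module could look roughly like this (a sketch; the committed implementation may differ in detail):

```python
@property
def max_seq_length(self) -> int:
    # Delegate to the tokenizer so there is a single source of truth
    # for the maximum sequence length.
    return self.processor.tokenizer.model_max_length

@max_seq_length.setter
def max_seq_length(self, value: int) -> None:
    self.processor.tokenizer.model_max_length = value
```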

@MrLoh (Contributor, Author) commented Mar 21, 2025

I pushed a commit that adds a getter and setter. Let me know what you think.

PS: I verified that model_max_length is the correct name

@tomaarsen merged commit 6539557 into UKPLab:master on Mar 24, 2025 (9 checks passed).