Fix ColQwen3.5-4.5B - ColPaliEngineWrapper.encode() for multimodal datasets #4245
athrael-soju wants to merge 7 commits into embeddings-benchmark:main
Conversation
…nd image features correctly
I don’t think it’s a good idea to change the base class for ColPali models, because models other than yours are working correctly. With your implementation, some tasks would behave incorrectly (for example, Vidorev3.1), where the model would receive images and text together.
Thanks for the feedback. I'll move the encode() and encode_input() overrides into ColQwen3_5Wrapper
… handling of text and image features
Maybe with
Tested with only
It's strange that images and text have different shapes.
ColQwen3.5's image processor has a much higher minimum resolution (shortest_edge=65536 vs 3136 for ColQwen2), producing ~3340 image tokens vs ~770 text tokens. ColQwen3 avoids this by processing both modalities in a single forward pass instead of separately fusing them. At least that's what I think is the reason.
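A back-of-the-envelope sketch of why the higher `shortest_edge` floor inflates image sequence lengths. This assumes a Qwen-VL-style processor where `shortest_edge` acts as a minimum pixel budget and each image token covers a `(patch_size * merge_size)**2` pixel area; the constants and the helper name are illustrative, not the real processor API, and actual token counts also depend on the input resolution.

```python
# Illustrative constants for a Qwen2-VL-family processor (assumptions):
PATCH_SIZE = 14  # ViT patch size
MERGE_SIZE = 2   # spatial merge: 2x2 patches -> 1 image token

def min_image_tokens(min_pixels: int) -> int:
    """Approximate the smallest number of image tokens the processor can emit."""
    pixels_per_token = (PATCH_SIZE * MERGE_SIZE) ** 2  # 784 pixels per token
    return min_pixels // pixels_per_token

# ColQwen2-style floor vs the much higher ColQwen3.5-style floor:
print(min_image_tokens(3136))   # low floor: a handful of tokens
print(min_image_tokens(65536))  # much higher floor
```

The point is only that the minimum image sequence length scales with the pixel floor, so image and text batches end up with very different shapes.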
Why don't you process this similarly?
Good question. Let me test that.
…t for image-text embeddings
@Samoed Refactored ColQwen3_5Wrapper to follow the same pattern as ColQwen3Wrapper: inherits from AbsEncoder, processes text+image jointly so the shape mismatch never comes up, and clears rope_deltas before each forward pass as you suggested. Also fixed a bug where outputs[0] was silently slicing the batch dimension since ColQwen3_5.forward returns a plain tensor. Tested just now and seems ok.
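The rope_deltas reset mentioned above can be sketched as follows. `DummyModel` is a hypothetical stand-in for the real ColQwen3.5 module (only the attribute name `rope_deltas` matches the transformers convention of caching position deltas between calls); the real wrapper of course runs an actual forward pass.

```python
class DummyModel:
    """Toy stand-in for a model that caches rope_deltas across calls."""

    def __init__(self):
        self.rope_deltas = None

    def forward(self, batch):
        # Pretend the forward pass caches per-batch position deltas.
        self.rope_deltas = len(batch)
        return [x * 2 for x in batch]

def encode_batches(model, batches):
    outputs = []
    for batch in batches:
        # Deltas left over from a previous batch (with a different
        # sequence length) would corrupt position ids, so clear them
        # before every forward pass.
        model.rope_deltas = None
        outputs.append(model.forward(batch))
    return outputs
```

The design point is simply that the cache is per-batch state, so it must be invalidated whenever the batch shape can change.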
```diff
- class ColQwen3_5Wrapper(ColPaliEngineWrapper):  # noqa: N801
+ class ColQwen3_5Wrapper(AbsEncoder):  # noqa: N801
```
Maybe it would be better to inherit from ColQwen3Wrapper?
It would require overriding both _encode_inputs and get_fused_embeddings due to ColQwen3.5's different model output format and processor. It would work, but perhaps a little messier.
I think we can keep it as is, then. But you need to rerun your model.
Cool. Already re-running; will also fix linting.
Co-authored-by: Roman Solomatin <[email protected]>
Samoed left a comment
Looks good. Can you rerun tasks and submit new results? Then we can merge this
@Samoed does everything look ok for a merge?
Looks good, but you need to rerun the tasks first.
I made a change to use process_images()/process_queries() separately instead of joint processing, matching the standard ColPali eval pipeline. The previous approach fused text+image embeddings via element-wise addition, which broke with ColQwen3.5's dynamic resolution. Re-running benchmarks now and will submit in embeddings-benchmark/results#448
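For context, a minimal illustration of why element-wise fusion breaks once dynamic resolution makes the image sequence longer than the text sequence. The shapes below are illustrative (roughly the token counts discussed earlier), not real model outputs.

```python
import numpy as np

# Illustrative embedding shapes: (batch, seq_len, dim)
text_emb = np.zeros((1, 770, 128))    # ~770 text tokens
image_emb = np.zeros((1, 3340, 128))  # ~3340 image tokens at dynamic resolution

try:
    fused = text_emb + image_emb  # element-wise addition needs matching seq_len
except ValueError as e:
    print("fusion failed:", e)
```

Since 770 and 3340 cannot broadcast against each other, the addition raises, which is why the wrapper now keeps the two modalities in separate processing paths.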
Added embeddings-benchmark/results#450 with latest evals. |
Changes
- `encode()` in `ColQwen3_5Wrapper` to route by `prompt_type`: queries use text, documents use images when available.
- `encode_input()` in `ColQwen3_5Wrapper` to clear stale `rope_deltas` cache before forward passes.
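The routing rule in the first change can be sketched as a small helper. The `PromptType` enum and the `inputs` dict shape here are simplified stand-ins for the mteb interface, not its exact API.

```python
from enum import Enum

class PromptType(Enum):
    """Simplified stand-in for mteb's prompt-type distinction."""
    query = "query"
    document = "document"

def select_modality(inputs: dict, prompt_type: PromptType) -> str:
    """Queries go through the text path; documents use images when present."""
    if prompt_type == PromptType.query:
        return "text"
    if inputs.get("images"):
        return "image"
    return "text"
```

A document batch without images falls back to the text path, so text-only retrieval tasks keep working unchanged.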