
Fix ColQwen3.5-4.5B - ColPaliEngineWrapper.encode() for multimodal datasets#4245

Open
athrael-soju wants to merge 7 commits into embeddings-benchmark:main from athrael-soju:fix/colpali-encode-v2

Conversation

@athrael-soju
Contributor

@athrael-soju athrael-soju commented Mar 17, 2026

Changes

  • Override encode() in ColQwen3_5Wrapper to route by prompt_type: queries use text, documents use images when available.
  • Override encode_input() in ColQwen3_5Wrapper to clear stale rope_deltas cache before forward passes.
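The two overrides above can be sketched as follows. This is a minimal illustration of the routing and cache-clearing logic only; `PromptType`, `RoutingSketch`, and its helper methods are hypothetical stand-ins, not the actual mteb/colpali-engine API.

```python
from enum import Enum
from types import SimpleNamespace


class PromptType(Enum):
    query = "query"
    document = "document"


class RoutingSketch:
    def __init__(self):
        # Stand-in for the wrapped model; the real wrapper holds the HF model.
        self.mdl = SimpleNamespace(rope_deltas="stale")

    def encode_texts(self, texts):
        return [("text", t) for t in texts]

    def encode_images(self, images):
        return [("image", i) for i in images]

    def encode(self, inputs, prompt_type):
        texts = inputs.get("text")
        images = inputs.get("image")
        if prompt_type is PromptType.query:
            # Queries are always encoded from text.
            return self.encode_texts(texts)
        # Documents use images when the dataset provides them.
        if images is not None:
            return self.encode_images(images)
        return self.encode_texts(texts)

    def encode_input(self, batch):
        # Clear the stale rope_deltas cache before each forward pass,
        # so position deltas from a previous batch are not reused.
        self.mdl.rope_deltas = None
        return self.encode(batch["inputs"], batch["prompt_type"])
```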

Member

@Samoed Samoed left a comment

I don’t think it’s a good idea to change the base class for ColPali models, because models other than yours are working correctly. With your implementation, some tasks would behave incorrectly (for example, Vidorev3.1), where the model would receive images and text together.

@athrael-soju
Contributor Author

I don’t think it’s a good idea to change the base class for ColPali models, because models other than yours are working correctly. With your implementation, some tasks would behave incorrectly (for example, Vidorev3.1), where the model would receive images and text together.

Thanks for the feedback. I'll move the encode() and encode_input() overrides into ColQwen3_5Wrapper.

@athrael-soju athrael-soju requested a review from Samoed March 17, 2026 18:58
@Samoed
Member

Samoed commented Mar 17, 2026

Maybe with self.mdl.rope_deltas = None you won't need to change the encode function? Your implementation might produce slightly incorrect results.

@athrael-soju
Contributor Author

Maybe with self.mdl.rope_deltas = None you won't need to change the encode function? Your implementation might produce slightly incorrect results.

Tested with only rope_deltas = None plus the original encode() fusion logic, and it still crashes with the previous error: RuntimeError: The size of tensor a (3340) must match the size of tensor b (770) at non-singleton dimension 1.

@Samoed
Member

Samoed commented Mar 17, 2026

It's strange that images and text have different shapes.

@athrael-soju
Contributor Author

athrael-soju commented Mar 17, 2026

It's strange that images and text have different shapes.

ColQwen3.5's image processor has a much higher minimum resolution (shortest_edge=65536 vs 3136 for ColQwen2), producing ~3340 image tokens vs ~770 text tokens. ColQwen3 avoids this by processing both modalities in a single forward pass instead of separately fusing them.

At least that's what I think is the reason.

@Samoed
Member

Samoed commented Mar 17, 2026

@athrael-soju
Contributor Author

Yes, it is https://github.com/athrael-soju/mteb/blob/bb8ee1ffbac8f0dcc57484d751060931f39ff994/mteb/models/model_implementations/colqwen_models.py#L250-L277

Why don't you process this similarly?

Good question. Let me test that.

@athrael-soju
Contributor Author

athrael-soju commented Mar 17, 2026

@Samoed Refactored ColQwen3_5Wrapper to follow the same pattern as ColQwen3Wrapper: inherits from AbsEncoder, processes text+image jointly so the shape mismatch never comes up, and clears rope_deltas before each forward pass as you suggested. Also fixed a bug where outputs[0] was silently slicing the batch dimension since ColQwen3_5.forward returns a plain tensor.

Tested just now and seems ok.
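The outputs[0] bug mentioned above can be shown with a minimal numpy sketch (shapes are illustrative, matching the token counts from the earlier error):

```python
import numpy as np

# ColQwen3_5.forward returns a plain tensor rather than a ModelOutput
# tuple, so indexing with [0] does not select "the first field" --
# it slices away the batch dimension.
outputs = np.zeros((4, 770, 128))  # (batch, seq_len, dim)

sliced = outputs[0]                # silently keeps only the first example
assert sliced.shape == (770, 128)

embeddings = outputs               # the fix: use the returned tensor as-is
assert embeddings.shape == (4, 770, 128)
```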



- class ColQwen3_5Wrapper(ColPaliEngineWrapper):  # noqa: N801
+ class ColQwen3_5Wrapper(AbsEncoder):  # noqa: N801
Member

Maybe would be better to inherit from ColQwen3Wrapper?

Contributor Author

It would require overriding both _encode_inputs and get_fused_embeddings due to ColQwen3.5's different model output format and processor. It would work, but perhaps a little messier.

Member

I think we can keep it as it is then. But you need to rerun your model.

Contributor Author

Cool. Already running to fix linting.

@athrael-soju athrael-soju requested a review from Samoed March 17, 2026 21:58
Member

@Samoed Samoed left a comment

Looks good. Can you rerun the tasks and submit new results? Then we can merge this.

@athrael-soju
Contributor Author

@Samoed does everything look ok for a merge?

@Samoed
Member

Samoed commented Mar 17, 2026

Looks good, but you need to rerun the tasks first.

@athrael-soju
Contributor Author

Looks good, but you need to rerun the tasks first.

I made a change to use process_images()/process_queries() separately instead of joint processing, matching the standard ColPali eval pipeline. The previous approach fused text+image embeddings via element-wise addition, which broke with ColQwen3.5's dynamic resolution.

Re-running benchmarks now and will submit in embeddings-benchmark/results#448
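The shape mismatch that broke the fused approach can be reproduced with a minimal numpy sketch, using the token counts from the earlier RuntimeError (the dimension is illustrative):

```python
import numpy as np

dim = 128
image_emb = np.zeros((3340, dim))  # ~3340 image tokens from dynamic resolution
text_emb = np.zeros((770, dim))    # ~770 text tokens

# Fusing by element-wise addition requires matching sequence lengths,
# which ColQwen3.5's dynamic resolution does not guarantee:
try:
    fused = image_emb + text_emb
except ValueError:
    fused = None  # shapes cannot broadcast -> fusion is not possible

assert fused is None
```

Encoding each modality separately via process_images()/process_queries() sidesteps this entirely, since query and document embeddings are never added together.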

@athrael-soju
Contributor Author

Added embeddings-benchmark/results#450 with latest evals.
