
Fix ColQwen3.5-4.5B - ColPaliEngineWrapper.encode() for multimodal datasets#4245

Open
athrael-soju wants to merge 7 commits into embeddings-benchmark:main from athrael-soju:fix/colpali-encode-v2

Conversation

@athrael-soju
Contributor

@athrael-soju athrael-soju commented Mar 17, 2026

Changes

  • Override encode() in ColQwen3_5Wrapper to route by prompt_type: queries use text, documents use images when available.
  • Override encode_input() in ColQwen3_5Wrapper to clear stale rope_deltas cache before forward passes.
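The two overrides above can be sketched as follows. This is a minimal illustration of the routing and cache-clearing logic only; `PromptType`, `RoutingSketch`, and its helper methods are hypothetical stand-ins, not the actual mteb/colpali-engine API.

```python
from enum import Enum
from types import SimpleNamespace


class PromptType(Enum):
    query = "query"
    document = "document"


class RoutingSketch:
    def __init__(self):
        # Stand-in for the wrapped model; the real wrapper holds the HF model.
        self.mdl = SimpleNamespace(rope_deltas="stale")

    def encode_texts(self, texts):
        return [("text", t) for t in texts]

    def encode_images(self, images):
        return [("image", i) for i in images]

    def encode(self, inputs, prompt_type):
        texts = inputs.get("text")
        images = inputs.get("image")
        if prompt_type is PromptType.query:
            # Queries are always encoded from text.
            return self.encode_texts(texts)
        # Documents use images when the dataset provides them.
        if images is not None:
            return self.encode_images(images)
        return self.encode_texts(texts)

    def encode_input(self, batch):
        # Clear the stale rope_deltas cache before each forward pass,
        # so position deltas from a previous batch are not reused.
        self.mdl.rope_deltas = None
        return self.encode(batch["inputs"], batch["prompt_type"])
```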

Member

@Samoed Samoed left a comment

I don’t think it’s a good idea to change the base class for ColPali models, because models other than yours are working correctly. With your implementation, some tasks would behave incorrectly (for example, Vidorev3.1), where the model would receive images and text together.

@athrael-soju
Contributor Author

I don’t think it’s a good idea to change the base class for ColPali models, because models other than yours are working correctly. With your implementation, some tasks would behave incorrectly (for example, Vidorev3.1), where the model would receive images and text together.

Thanks for the feedback. I'll move the encode() and encode_input() overrides into ColQwen3_5Wrapper.

@athrael-soju athrael-soju requested a review from Samoed March 17, 2026 18:58
@Samoed
Member

Samoed commented Mar 17, 2026

Maybe with self.mdl.rope_deltas = None you won't need to change the encode function? Your implementation might produce slightly incorrect results.

@athrael-soju
Contributor Author

Maybe with self.mdl.rope_deltas = None you won't need to change the encode function? Your implementation might produce slightly incorrect results.

Tested with only rope_deltas = None plus the original encode() fusion logic, and it still crashes with the previous error: RuntimeError: The size of tensor a (3340) must match the size of tensor b (770) at non-singleton dimension 1.

@Samoed
Member

Samoed commented Mar 17, 2026

It's strange that images and text have different shapes.

@athrael-soju
Contributor Author

athrael-soju commented Mar 17, 2026

It's strange that images and text have different shapes.

ColQwen3.5's image processor has a much higher minimum resolution (shortest_edge=65536 vs 3136 for ColQwen2), producing ~3340 image tokens vs ~770 text tokens. ColQwen3 avoids this by processing both modalities in a single forward pass instead of separately fusing them.

At least that's what I think is the reason.

@Samoed
Member

Samoed commented Mar 17, 2026

@athrael-soju
Contributor Author

Yes, it is https://github.com/athrael-soju/mteb/blob/bb8ee1ffbac8f0dcc57484d751060931f39ff994/mteb/models/model_implementations/colqwen_models.py#L250-L277

Why don't you process this similarly?

Good question. Let me test that.

@athrael-soju
Contributor Author

athrael-soju commented Mar 17, 2026

@Samoed Refactored ColQwen3_5Wrapper to follow the same pattern as ColQwen3Wrapper: inherits from AbsEncoder, processes text+image jointly so the shape mismatch never comes up, and clears rope_deltas before each forward pass as you suggested. Also fixed a bug where outputs[0] was silently slicing the batch dimension since ColQwen3_5.forward returns a plain tensor.

Tested just now and seems ok.
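The outputs[0] bug mentioned above can be shown with a minimal numpy sketch (shapes are illustrative, matching the token counts from the earlier error):

```python
import numpy as np

# ColQwen3_5.forward returns a plain tensor rather than a ModelOutput
# tuple, so indexing with [0] does not select "the first field" --
# it slices away the batch dimension.
outputs = np.zeros((4, 770, 128))  # (batch, seq_len, dim)

sliced = outputs[0]                # silently keeps only the first example
assert sliced.shape == (770, 128)

embeddings = outputs               # the fix: use the returned tensor as-is
assert embeddings.shape == (4, 770, 128)
```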



- class ColQwen3_5Wrapper(ColPaliEngineWrapper):  # noqa: N801
+ class ColQwen3_5Wrapper(AbsEncoder):  # noqa: N801
Member

Maybe would be better to inherit from ColQwen3Wrapper?

Contributor Author

It would require overriding both _encode_inputs and get_fused_embeddings due to ColQwen3.5's different model output format and processor. It would work, but perhaps a little messier.

Member

I think we can keep it as it is then. But you need to rerun your model.

Contributor Author

Cool. Already running to fix linting.

@athrael-soju athrael-soju requested a review from Samoed March 17, 2026 21:58
Member

@Samoed Samoed left a comment

Looks good. Can you rerun the tasks and submit new results? Then we can merge this.

@athrael-soju
Contributor Author

@Samoed does everything look ok for a merge?

@Samoed
Member

Samoed commented Mar 17, 2026

Looks good, but you need to rerun the tasks first.

@athrael-soju
Contributor Author

Looks good, but you need to rerun the tasks first.

I made a change to use process_images()/process_queries() separately instead of joint processing, matching the standard ColPali eval pipeline. The previous approach fused text+image embeddings via element-wise addition, which broke with ColQwen3.5's dynamic resolution.

Re-running benchmarks now and will submit in embeddings-benchmark/results#448
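The shape mismatch that broke the fused approach can be reproduced with a minimal numpy sketch, using the token counts from the earlier RuntimeError (the dimension is illustrative):

```python
import numpy as np

dim = 128
image_emb = np.zeros((3340, dim))  # ~3340 image tokens from dynamic resolution
text_emb = np.zeros((770, dim))    # ~770 text tokens

# Fusing by element-wise addition requires matching sequence lengths,
# which ColQwen3.5's dynamic resolution does not guarantee:
try:
    fused = image_emb + text_emb
except ValueError:
    fused = None  # shapes cannot broadcast -> fusion is not possible

assert fused is None
```

Encoding each modality separately via process_images()/process_queries() sidesteps this entirely, since query and document embeddings are never added together.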

@athrael-soju
Contributor Author

Added embeddings-benchmark/results#450 with latest evals.
