
dataset: Add JinaVDR #2942


Open · wants to merge 13 commits into main
Conversation

@maximilianwerk commented Jul 22, 2025

Hey, we would like to contribute the JinaVDR benchmark to MTEB.

Our aim with the JinaVDR (Visual Document Retrieval) benchmark is to expand upon the work of these prior benchmarks by incorporating visually rich multilingual documents with complex layouts like graphs, charts, and tables (mixed with text and images), as well as adding real-world queries and questions.

The benchmarks were each run on an H100 GPU.

I experienced some OOM errors when running with large batch sizes, hence the change to the similarity function for the ColPali models.

Please find the results here: embeddings-benchmark/results#242

  • I have outlined why this dataset is filling an existing gap in mteb
  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command. It is important to reduce the batch_size in order to have enough memory to run these models (see the sketch after this list).
    • vidore/colpali-v1.2
    • vidore/colpali-v1.3
    • vidore/colqwen2.5-v0.2
    • jinaai/jina-embeddings-v4
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks): All datasets have a maximum of 1000 documents.
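For reference, a minimal sketch of how one of the new tasks could be run with a reduced batch size through mteb's Python API (the task name comes from this PR's classes, the model from the list above; the kwargs shown are one possible setup, not the only way to run it):

import mteb

# Model and task are examples from this PR; any of the listed models works the same way.
model = mteb.get_model("vidore/colpali-v1.3")
tasks = mteb.get_tasks(tasks=["JinaVDRMedicalPrescriptionsRetrieval"])

evaluation = mteb.MTEB(tasks=tasks)
# A small batch size keeps a single H100 from running out of memory during encoding.
results = evaluation.run(model, encode_kwargs={"batch_size": 8})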

@Samoed changed the title "Feat add jina vdr" to "dataset: add jina vdr" on Jul 23, 2025
        self.data_loaded = True


class JinaVDRMedicalPrescriptionsRetrieval(MultilingualTask, AbsTaskAny2AnyRetrieval):
Member:
Can you remove MultilingualTask from tasks with eval_langs=["eng-Latn"]?
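A minimal sketch of the requested change for the English-only task shown above (class name taken from the snippet; only the base classes change, the body stays as-is):

class JinaVDRMedicalPrescriptionsRetrieval(AbsTaskAny2AnyRetrieval):
    ...  # metadata and load_data unchanged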

Author (@maximilianwerk):
Since you already warn that MultilingualTask will be removed in 2.0, should I remove it from all tasks? That was something I was also wondering while creating the tasks.

Member:
No, it's required for your tasks to run correctly on main.

@KennethEnevoldsen changed the title "dataset: add jina vdr" to "dataset: Add JinaVDR" on Jul 24, 2025
@KennethEnevoldsen self-requested a review on July 25, 2025
@@ -131,7 +131,7 @@ def calculate_probs(self, text_embeddings, image_embeddings):
         return scores.softmax(dim=-1)

     def similarity(self, a, b):
-        return self.processor.score(a, b)
+        return self.processor.score(a, b, batch_size=8)
Contributor:
This should allow the user to pass different processor kwargs along.

Suggested change
-        return self.processor.score(a, b, batch_size=8)
+        return self.processor.score(a, b, **self.processor_kwargs)
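A rough sketch of how that suggestion could be wired up, assuming the wrapper accepts and stores processor kwargs at construction time (the class and constructor below are illustrative, not the actual ColPali model wrapper):

class ColPaliScoringWrapper:
    def __init__(self, processor, **processor_kwargs):
        self.processor = processor
        # e.g. {"batch_size": 8} to avoid OOM when scoring on a single H100
        self.processor_kwargs = processor_kwargs

    def similarity(self, a, b):
        # Forward user-provided kwargs instead of hard-coding batch_size=8.
        return self.processor.score(a, b, **self.processor_kwargs)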

Comment on lines +340 to +341
if not isinstance(texts, list):
    texts = list(texts)
Contributor:
What are the cases where it is not a list?

Author (@maximilianwerk):
When I used an older version of MTEB (about two months old), I sometimes got other iterables than lists. If this is fixed on the MTEB side and there is more consistency, great. I'm very happy to remove it. :)



COMMON_METADATA = {
    "description": "Retrieve associated pages according to questions or related text.",
Contributor:
Can you specify this per dataset? It needs to be detailed enough for the user to understand what the task is about and how it differs from other tasks.
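For illustration, per-dataset descriptions could still reuse COMMON_METADATA through a plain dict merge; the variable name and wording below are made up for the example, not taken from the PR:

ARXIV_QA_METADATA = {
    **COMMON_METADATA,
    "description": (
        "Retrieve arXiv paper pages whose figures, tables, and equations "
        "answer scientific questions."
    ),
}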

Author (@maximilianwerk):
I got "inspired" by the vidore2 benchmark and would prefer to keep the original datasets on huggingface as the only source of truth. When we update information there, we would always need to sync it with here, which seems like a bad idea and will rot soon, non?

    },
    eval_langs=["eng-Latn"],
    domains=["Academic"],
    license="not specified",
Contributor:
Ideally we want the license to be specified.

Author (@maximilianwerk):
I understand that. For some datasets that is hardly doable, since the original authors did not always specify the license. How would you like to proceed? Remove the task from the benchmark?


COMMON_METADATA = {
    "description": "Retrieve associated pages according to questions or related text.",
    "reference": "https://arxiv.org/abs/2506.18902",
Contributor:
While this reference makes great sense for the benchmark itself, I would probably add more detailed refs to the individual datasets (e.g. for EuropeanaItScans it seems like this repo is a better reference if I want to understand something about the data).

Author (@maximilianwerk):
These repos are always referenced in the actual Hugging Face datasets. But if you prefer, I'll duplicate that here.

"JinaVDRArxivQARetrieval",
],
),
description="Multilingual, domain-diverse and layout-rich document retrieval benchmark.",
Contributor:
It sounds like (from the paper) that this extends the existing Visual Document Retrieval benchmark, which is already implemented in MTEB. Any reason to use JinaVDRTabFQuadRetrieval instead of VidoreTabfquadRetrieval?

In addition, it would be great to expand the description, e.g., mentioning that it is an extension and indicating in what way it expands the benchmark.

Author (@maximilianwerk), Jul 26, 2025:

I will come back with a proper answer here soon. We did some automatic filtering on the queries, since some were bad (like "what can you see in this image?"), but we may have filtered a bit too aggressively.

@KennethEnevoldsen (Contributor):
What a great PR @maximilianwerk! Added a few additional comments.
