
dataset: Add JinaVDR #2942


Open · wants to merge 13 commits into main
Conversation

@maximilianwerk commented Jul 22, 2025

Hey, we would like to contribute the JinaVDR benchmark to MTEB.

Our aim with the JinaVDR (Visual Document Retrieval) benchmark is to expand upon the work of these prior benchmarks by incorporating visually rich multilingual documents with complex layouts like graphs, charts, and tables (mixed with text and images), as well as adding real-world queries and questions.

The benchmarks were each run on an H100 GPU.

I experienced some OOM errors when running with large batch sizes, hence the change to the similarity function for the ColPali models.

Please find the results here: embeddings-benchmark/results#242

  • I have outlined why this dataset is filling an existing gap in mteb
  • I have tested that the dataset runs with the mteb package.
  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb run -m {model_name} -t {task_name} command. It is important to reduce the batch_size in order to have enough memory to run these models (see the sketch after this list).
    • vidore/colpali-v1.2
    • vidore/colpali-v1.3
    • vidore/colqwen2.5-v0.2
    • jinaai/jina-embeddings-v4
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • I have considered the size of the dataset and reduced it if it is too big (2048 examples is typically large enough for most tasks): All datasets have a maximum of 1000 documents.
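For reference, a minimal sketch of how one of the new tasks could be run with a reduced batch size through mteb's Python API (the task name comes from this PR's classes, the model from the list above; the kwargs shown are one possible setup, not the only way to run it):

import mteb

# Model and task are examples from this PR; any of the listed models works the same way.
model = mteb.get_model("vidore/colpali-v1.3")
tasks = mteb.get_tasks(tasks=["JinaVDRMedicalPrescriptionsRetrieval"])

evaluation = mteb.MTEB(tasks=tasks)
# A small batch size keeps a single H100 from running out of memory during encoding.
results = evaluation.run(model, encode_kwargs={"batch_size": 8})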

@Samoed changed the title "Feat add jina vdr" to "dataset: add jina vdr" on Jul 23, 2025
        self.data_loaded = True


class JinaVDRMedicalPrescriptionsRetrieval(MultilingualTask, AbsTaskAny2AnyRetrieval):
Member:
Can you remove MultilingualTask from tasks with eval_langs=["eng-Latn"]?
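A minimal sketch of the requested change for the English-only task shown above (class name taken from the snippet; only the base classes change, the body stays as-is):

class JinaVDRMedicalPrescriptionsRetrieval(AbsTaskAny2AnyRetrieval):
    ...  # metadata and load_data unchanged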

Author (@maximilianwerk):
Since you already warn that MultilingualTask will be removed in 2.0, should I remove it from all tasks? That was something I was also wondering while creating the tasks.

Member:
No, it's required for your tasks to run correctly on main.

@KennethEnevoldsen changed the title "dataset: add jina vdr" to "dataset: Add JinaVDR" on Jul 24, 2025
@KennethEnevoldsen self-requested a review on July 25, 2025
@@ -131,7 +131,7 @@ def calculate_probs(self, text_embeddings, image_embeddings):
         return scores.softmax(dim=-1)

     def similarity(self, a, b):
-        return self.processor.score(a, b)
+        return self.processor.score(a, b, batch_size=8)
Contributor:
This should allow the user to pass different processor kwargs along.

Suggested change
-        return self.processor.score(a, b, batch_size=8)
+        return self.processor.score(a, b, **self.processor_kwargs)
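A rough sketch of how that suggestion could be wired up, assuming the wrapper accepts and stores processor kwargs at construction time (the class and constructor below are illustrative, not the actual ColPali model wrapper):

class ColPaliScoringWrapper:
    def __init__(self, processor, **processor_kwargs):
        self.processor = processor
        # e.g. {"batch_size": 8} to avoid OOM when scoring on a single H100
        self.processor_kwargs = processor_kwargs

    def similarity(self, a, b):
        # Forward user-provided kwargs instead of hard-coding batch_size=8.
        return self.processor.score(a, b, **self.processor_kwargs)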

Comment on lines +340 to +341
if not isinstance(texts, list):
    texts = list(texts)
Contributor:
What are the cases where it is not a list?

Author (@maximilianwerk):
When I used an older version of MTEB (about two months old), I sometimes got other iterables than lists. If this is fixed on the MTEB side and there is more consistency, great. I'm very happy to remove it. :)



COMMON_METADATA = {
    "description": "Retrieve associated pages according to questions or related text.",
Contributor:
Can you specify this per dataset? It needs to be detailed enough for the user to understand what the task is about and how it differs from other tasks.
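For illustration, per-dataset descriptions could still reuse COMMON_METADATA through a plain dict merge; the variable name and wording below are made up for the example, not taken from the PR:

ARXIV_QA_METADATA = {
    **COMMON_METADATA,
    "description": (
        "Retrieve arXiv paper pages whose figures, tables, and equations "
        "answer scientific questions."
    ),
}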

Author (@maximilianwerk):
I got "inspired" by the vidore2 benchmark and would prefer to keep the original datasets on huggingface as the only source of truth. When we update information there, we would always need to sync it with here, which seems like a bad idea and will rot soon, non?

    },
    eval_langs=["eng-Latn"],
    domains=["Academic"],
    license="not specified",
Contributor:
Ideally we want the license to be specified.

Author (@maximilianwerk):
I understand that. For some datasets that is hardly doable, since the original authors did not always specify the license. How would you like to proceed? Remove the task from the benchmark?


COMMON_METADATA = {
    "description": "Retrieve associated pages according to questions or related text.",
    "reference": "https://arxiv.org/abs/2506.18902",
Contributor:
While this reference makes great sense for the benchmark itself, I would probably add more detailed refs to the individual datasets (e.g. for EuropeanaItScans it seems like this repo is a better reference if I want to understand something about the data).

Author (@maximilianwerk):
These repos are always referenced in the actual Hugging Face datasets. But if you prefer, I'll duplicate that here.

"JinaVDRArxivQARetrieval",
],
),
description="Multilingual, domain-diverse and layout-rich document retrieval benchmark.",
Contributor:
It sounds like (from the paper) that this extends the existing Visual Document Retrieval benchmark, which is already implemented in MTEB. Any reason to use JinaVDRTabFQuadRetrieval instead of VidoreTabfquadRetrieval?

In addition, it would be great to expand the description, e.g., mentioning that it is an extension and indicating in what way it expands the benchmark.

Author (@maximilianwerk), Jul 26, 2025:

I will come back with a proper answer here soon. We did some automatic filtering on the queries, since some were bad (like "what can you see in this image?"), but we may have filtered a bit too aggressively.

@KennethEnevoldsen (Contributor):
What a great PR @maximilianwerk! Added a few additional comments.
