dataset: Add JinaVDR #2942
Conversation
mteb/tasks/Image/Any2AnyRetrieval/multilingual/JinaVDRBenchRetrieval.py
```python
        self.data_loaded = True


class JinaVDRMedicalPrescriptionsRetrieval(MultilingualTask, AbsTaskAny2AnyRetrieval):
```
Can you remove `MultilingualTask` from tasks with `eval_langs=["eng-Latn"]`?
Since you already warn that `MultilingualTask` will be removed in 2.0, should I remove it from all tasks? That was something I was also wondering while creating the tasks.
No, it's required for your tasks to run correctly on `main`.
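For context, a schematic sketch of the distinction under discussion. The import path and class names are assumptions for illustration, not code from the PR, and the required `TaskMetadata` declarations are elided:

```python
from mteb.abstasks import AbsTaskAny2AnyRetrieval, MultilingualTask  # assumed import path


class MultilingualVariant(MultilingualTask, AbsTaskAny2AnyRetrieval):
    """Keeps the mixin: eval_langs maps subset names to language lists,
    and the runner iterates over those per-language subsets."""


class EnglishOnlyVariant(AbsTaskAny2AnyRetrieval):
    """Drops the mixin: a flat eval_langs=["eng-Latn"] needs no
    per-language iteration."""
```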
mteb/tasks/Image/Any2AnyRetrieval/multilingual/JinaVDRBenchRetrieval.py
```diff
@@ -131,7 +131,7 @@ def calculate_probs(self, text_embeddings, image_embeddings):
         return scores.softmax(dim=-1)

     def similarity(self, a, b):
-        return self.processor.score(a, b)
+        return self.processor.score(a, b, batch_size=8)
```
This should allow the user to pass different processor kwargs along.
Suggested change:

```diff
-        return self.processor.score(a, b, batch_size=8)
+        return self.processor.score(a, b, **self.processor_kwargs)
```
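A minimal sketch of how that suggestion could be wired up, assuming the wrapper captures `processor_kwargs` at construction time; the class below is hypothetical and only illustrates the plumbing:

```python
class ScoringWrapper:
    """Hypothetical model wrapper showing the suggested kwargs plumbing."""

    def __init__(self, processor, **processor_kwargs):
        self.processor = processor
        # e.g. {"batch_size": 8}; an empty dict falls back to processor defaults
        self.processor_kwargs = processor_kwargs

    def similarity(self, a, b):
        # The user, not the task code, now decides the batch size.
        return self.processor.score(a, b, **self.processor_kwargs)
```

Usage would then look like `ScoringWrapper(processor, batch_size=8)`, keeping the OOM workaround available without hard-coding it.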
```python
if not isinstance(texts, list):
    texts = list(texts)
```
What are the cases where it is not a list?
When I used an older version of MTEB (about two months old), I sometimes got other iterables than lists. If this is fixed on the MTEB side and there is more consistency, great. I'm very happy to remove it. :)
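To illustrate the failure mode this guards against, a minimal hypothetical example (not code from the PR): one-shot iterables such as generators have no `len()` and can only be consumed once.

```python
def encode(texts):
    # Defensive copy: a generator has no len() and is exhausted after one pass.
    if not isinstance(texts, list):
        texts = list(texts)
    print(len(texts))  # would raise TypeError on a raw generator
    return [t.lower() for t in texts]


encode(t for t in ("Foo", "Bar"))  # generator input now works fine
```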
```python
COMMON_METADATA = {
    "description": "Retrieve associated pages according to questions or related text.",
```
Can you specify this per dataset? It needs to be detailed enough for the user to understand what the task is about and how it differs from other tasks.
I got "inspired" by the vidore2 benchmark and would prefer to keep the original datasets on huggingface as the only source of truth. When we update information there, we would always need to sync it with here, which seems like a bad idea and will rot soon, non?
```python
    },
    eval_langs=["eng-Latn"],
    domains=["Academic"],
    license="not specified",
```
Ideally, we want the license to be specified.
I understand that. For some datasets that is hardly doable, since the original authors did not always specify the license. How would you like to proceed? Remove the task from the benchmark?
```python
COMMON_METADATA = {
    "description": "Retrieve associated pages according to questions or related text.",
    "reference": "https://arxiv.org/abs/2506.18902",
```
While this reference makes great sense for the benchmark itself, I would probably add more detailed refs to the individual datasets (e.g. for EuropeanaItScans it seems like this repo is a better reference if I want to understand something about the data).
These repos are always referenced in the actual Hugging Face datasets. But if you prefer, I'll duplicate that here.
"JinaVDRArxivQARetrieval", | ||
], | ||
), | ||
description="Multilingual, domain-diverse and layout-rich document retrieval benchmark.", |
It sounds like (from the paper) that this extends the existing Visual Document Retrieval benchmark, which is already implemented in MTEB. Any reason to use `JinaVDRTabFQuadRetrieval` instead of `VidoreTabfquadRetrieval`?

In addition, it would be great to expand the description, e.g., mentioning that it is an extension and indicating in what way it extends the benchmark.
I will come back with a proper answer here soon. We did some automatic filtering on the queries, since some were bad (like "what can you see in this image?"), but we may have filtered a bit too aggressively.
What a great PR @maximilianwerk! Added a few additional comments.
Hey, we would like to contribute the JinaVDR benchmark to MTEB.
Our aim with the JinaVDR (Visual Document Retrieval) benchmark is to expand upon the work of these prior benchmarks by incorporating visually rich multilingual documents with complex layouts like graphs, charts, and tables (mixed with text and images), as well as adding real-world queries and questions.
The benchmarks were each run on an H100 GPU. I experienced some OOM errors when running with big batch sizes, hence the change for the ColPali models in the `similarity` function. Please find the results here: embeddings-benchmark/results#242

- I have tested that the dataset runs with the `mteb` package.
- The models were run with the `mteb run -m {model_name} -t {task_name}` command. It is important to reduce the `batch_size` in order to have enough memory to run these models.
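For anyone reproducing these runs under memory pressure, a hedged sketch of passing a smaller batch size through MTEB's Python API; the model and task names below are placeholders, and `encode_kwargs` is MTEB's general hook for this rather than anything specific to this PR:

```python
import mteb

# Placeholder names; substitute the model and task you are evaluating.
model = mteb.get_model("jinaai/jina-clip-v1")
tasks = mteb.get_tasks(tasks=["JinaVDRMedicalPrescriptionsRetrieval"])

evaluation = mteb.MTEB(tasks=tasks)
# A small batch_size keeps late-interaction scoring within GPU memory.
evaluation.run(model, encode_kwargs={"batch_size": 8})
```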