[v2] Change corpus and queries to use dataset #2885
Conversation
orionw left a comment
LGTM, thanks for the fix. I don't love that we have different types (dict and Dataset), but you're right that it doesn't make sense to force them to be datasets when we use them as dicts. And we have type annotations, so users can figure it out if they need to.
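For readers skimming the thread, a minimal sketch of the mixed typing being discussed; the field and column names are illustrative, not mteb's exact API:

```python
from datasets import Dataset

# corpus and queries are iterated in bulk, so Dataset is a good fit;
# relevant_docs and top_ranked are looked up by ID, so plain dicts are simpler.
corpus: Dataset                             # e.g. columns "id" and "text"
queries: Dataset                            # e.g. columns "id" and "text"
relevant_docs: dict[str, dict[str, int]]    # query_id -> {doc_id: relevance}
top_ranked: dict[str, list[str]]            # query_id -> ranked doc ids
```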
Can you check that this still leads to the same results? I see that you did that.
> Maybe we should generate a random embedding based on the input instead.
Yes, I have had that problem before as well. I have actually considered replacing the MockEncoder with a small model2vec model (small enough to test with, and a more valid test model).
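For illustration, loading a small model2vec model is a one-liner; the model name below is an assumption (any small static model would do), and a real test fixture would still need to match mteb's encoder interface:

```python
from model2vec import StaticModel

# A tiny static embedding model: fast enough for tests, but producing
# real, input-dependent embeddings, unlike the current mock.
model = StaticModel.from_pretrained("minishlab/potion-base-2M")
embeddings = model.encode(["a tiny but real test model"])
print(embeddings.shape)
```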
```diff
@@ -116,11 +120,13 @@ def __init__(self, **kwargs):
     def convert_v1_dataset_format_to_v2(self):
```
I am unsure how our changes influence the loading of v2 retrieval datasets (do we have any of those?). It would be great to see how that would influence the load-data setup.
How does this influence pushing datasets to the hub? Does it push in a v2-compatible manner, and do we need to update now?
Updated the function for pushing datasets.
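As a point of reference, here is a sketch of what pushing such a split could look like with the plain `datasets` API; the repo id, config names, and split layout are assumptions for illustration, not the actual mteb push code:

```python
from datasets import Dataset, DatasetDict

# Hypothetical example data standing in for a converted v2 split.
corpus = Dataset.from_dict({"id": ["d1", "d2"], "text": ["doc one", "doc two"]})
queries = Dataset.from_dict({"id": ["q1"], "text": ["a question"]})

# Push each part as its own config so it can be loaded independently.
DatasetDict({"test": corpus}).push_to_hub("my-org/my-task", config_name="corpus")
DatasetDict({"test": queries}).push_to_hub("my-org/my-task", config_name="queries")
```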
I tried to measure load time with this script:

```python
from datetime import datetime

import mteb

task = mteb.get_task("SCIDOCS")

start = datetime.now()
task.load_data()
print((datetime.now() - start).seconds)
```

For the v2 branch I got times of 10-30 seconds, and for this PR 7-20 seconds. I don't understand why there is such a huge difference between identical runs on the same branch (the dataset was preloaded locally).
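Given that spread, repeating the measurement and comparing the minimum or median per branch would make the numbers more trustworthy. A sketch, assuming a fresh task object per run so that `load_data()` isn't a cached no-op:

```python
import statistics
import time

import mteb

timings = []
for _ in range(5):
    task = mteb.get_task("SCIDOCS")  # fresh task so load_data() actually loads
    start = time.perf_counter()
    task.load_data()
    timings.append(time.perf_counter() - start)

print(f"min={min(timings):.1f}s median={statistics.median(timings):.1f}s")
```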
```python
results = sorted(
    results["1"].keys(), key=lambda x: (results["1"][x], x), reverse=True
)[:2]
```
`reverse=True`. Isn't that a bit worrying?
I tried to debug this part and found that we're handling the reranking task a bit incorrectly, because it passes a `tuple[dict]` to `corpus_to_dict`, and this ends up storing all information in the `text` field. This might be part of the issue discussed in #2933.

After making this change, I looked at the results, and they contain values like:

```python
{'18670': -10.466630935668945, '4983': -8.812150955200195, '19238': -11.240396499633789}
```

So we need to sort them in reverse.
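In isolation, the descending sort is doing the right thing here: cross-encoder scores can all be negative, and higher (less negative) still means more relevant. A self-contained version of the selection above:

```python
# Scores taken from this thread; higher means more relevant.
scores = {
    "18670": -10.466630935668945,
    "4983": -8.812150955200195,
    "19238": -11.240396499633789,
}

top2 = sorted(scores, key=lambda doc_id: (scores[doc_id], doc_id), reverse=True)[:2]
print(top2)  # ['4983', '18670']
```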
Hmm, interesting. We should maybe test for the exact values here then, to ensure that we don't change this behaviour going forward.
We of course also have to test #2933.
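Something along these lines could pin the behaviour down; `run_reranking()` is a placeholder stubbed with the values from this thread, not existing mteb code:

```python
import pytest


def run_reranking() -> dict[str, dict[str, float]]:
    # Stub standing in for the actual reranking evaluation call.
    return {"1": {"18670": -10.466630935668945, "4983": -8.812150955200195}}


def test_reranking_scores_are_stable():
    expected = {"18670": -10.466630935668945, "4983": -8.812150955200195}
    results = run_reranking()
    for doc_id, score in expected.items():
        assert results["1"][doc_id] == pytest.approx(score, rel=1e-6)
```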
Yes, I think we could create mock cross-encoders to test this more reproducibly.
We can use the same approach as LangChain's fake embeddings, which derive the seed from the text: https://github.com/langchain-ai/langchain/blob/master/libs%2Fcore%2Flangchain_core%2Fembeddings%2Ffake.py#L120
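A sketch of that idea (the class name and `encode` signature are assumptions, not mteb's actual test code): the embedding is random but fully determined by the input text, so tests become reproducible and distinct texts get distinct vectors.

```python
import hashlib

import numpy as np


class DeterministicMockEncoder:
    def __init__(self, dim: int = 32) -> None:
        self.dim = dim

    def _seed(self, text: str) -> int:
        # sha256 instead of built-in hash(): stable across interpreter runs.
        return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        return np.stack(
            [np.random.default_rng(self._seed(s)).random(self.dim) for s in sentences]
        )


encoder = DeterministicMockEncoder()
emb = encoder.encode(["query one", "query one", "query two"])
assert (emb[0] == emb[1]).all() and not (emb[0] == emb[2]).all()
```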
Great idea! Let us do that instead.
@KennethEnevoldsen can you re-review this?
# Conflicts:
#   mteb/create_dataloaders.py
#   mteb/types/_encoder_io.py
```python
eval_splits=["test"],
previous_results=tmp_file,
save_predictions=True,
co2_tracker=False,
```
Yeah, it will be good to have it off by default in v2's `evaluate`.
KennethEnevoldsen left a comment
A few minor things, generally looks very good!
# Conflicts:
#   mteb/evaluation/evaluators/dense_retrieval_exact_search.py
Fixes #2692
I've updated the retrieval split so that `corpus` and `queries` are now `Dataset` objects.

I didn't change `relevant_docs` and `top_ranked` to a dataset because we're accessing these elements by ID, and using a dict is simpler for that. However, I think we can change it if needed.

I also modified the test a bit because we're creating mock embeddings independently of the input. Previously, the corpus dict was sorted by keys and, for some reason, ended up as `[d2, d1]`, so the first embedding was used. Maybe we should generate a random embedding based on the input instead. I've tested the branches on `SCIDOCS`, and the results didn't change.
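To make the new shape concrete, here is a toy version of a retrieval split under these changes (ids and fields invented for illustration):

```python
from datasets import Dataset

# corpus and queries become Datasets; relevant_docs stays an ID-keyed dict.
corpus = Dataset.from_dict({"id": ["d1", "d2"], "text": ["doc one", "doc two"]})
queries = Dataset.from_dict({"id": ["q1"], "text": ["a question"]})
relevant_docs = {"q1": {"d1": 1}}

for row in corpus:            # bulk iteration, where a Dataset shines
    print(row["id"], row["text"])
print(relevant_docs["q1"])    # direct lookup by ID, where a dict is simpler
```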