
[v2] Change corpus and queries to use dataset #2885


Open · wants to merge 12 commits into base: v2.0.0

Conversation

@Samoed (Member) commented Jul 6, 2025

Fixes #2692

I've updated the retrieval split as follows:

class RetrievalSplitData(TypedDict):
-    corpus: dict[str, dict[str, str]]
-    queries: dict[str, str]
+    corpus: Dataset
+    queries: Dataset
    relevant_docs: Mapping[str, Mapping[str, float]]
-    instructions: dict[str, str] | None
+    instructions: Dataset | None
    top_ranked: Mapping[str, list[str]] | None

I didn't change relevant_docs and top_ranked to a Dataset because we access these elements by ID, and a dict is simpler for that (see the sketch below). However, I think we can change it if needed.
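For illustration, a minimal sketch of the difference, with made-up IDs and scores (not data from the PR):

from datasets import Dataset

# Hypothetical qrels, for illustration only.
relevant_docs = {"q1": {"d1": 1.0, "d2": 0.0}}

# With a Mapping, lookup by query ID is a direct O(1) access:
scores = relevant_docs["q1"]

# With a Dataset, the same lookup needs a scan (or a prebuilt index):
qrels = Dataset.from_dict({"query_id": ["q1"], "doc_id": ["d1"], "score": [1.0]})
matches = qrels.filter(lambda row: row["query_id"] == "q1")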

I also modified the test a bit because we’re creating mock embeddings independently of the input. Previously, the corpus dict was sorted by keys and, for some reason, ended up as [d2, d1], so the first embedding was used. Maybe we should generate a random embedding based on the input instead. I’ve tested the branches on SCIDOCS, and the results didn’t change.

@Samoed Samoed changed the title Retrieval ds Change corpus and queries to use dataset Jul 6, 2025
@Samoed Samoed added the v2 Issues and PRs related to `v2` branch label Jul 6, 2025
@Samoed Samoed changed the title Change corpus and queries to use dataset [v2] Change corpus and queries to use dataset Jul 6, 2025
@Samoed Samoed marked this pull request as ready for review July 25, 2025 14:05
@orionw (Contributor) left a comment:

LGTM, thanks for the fix. I don't love that we have different types (dict and Dataset), but you're right that it doesn't seem to make sense to force them to be datasets when we use them as dicts. And we have type annotations, so users can figure it out if they need to.

@KennethEnevoldsen (Contributor) left a comment:

Can you check that this still leads to the same results? I see that you did that.

Maybe we should generate a random embedding based on the input instead.

Yes, I have had that problem before as well. I have actually considered replacing the mock encoder with a small model2vec model (small enough to test with, and a more valid test model).
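For reference, a sketch of what that could look like, assuming model2vec's StaticModel API; the checkpoint name is illustrative:

from model2vec import StaticModel

# A small static-embedding model: fast to load and encode, so it is
# practical inside a test suite while still producing meaningful vectors.
model = StaticModel.from_pretrained("minishlab/potion-base-2M")
embeddings = model.encode(["example query", "example document"])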

@@ -116,11 +120,13 @@ def __init__(self, **kwargs):
def convert_v1_dataset_format_to_v2(self):
Contributor:

I am unsure how these changes influence the loading of v2 retrieval datasets (do we have any of those?). It would be great to see how this would influence the load-data setup.

Contributor:

How does this influence pushing datasets to the hub? Does it push in a v2-compatible manner, and do we need to update now?

@@ -116,11 +120,13 @@ def __init__(self, **kwargs):
def convert_v1_dataset_format_to_v2(self):
if not hasattr(self, "queries"):
Contributor:

Is this a way to check whether it is v2? It might be worth adding a comment on this.
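Something like the following, assuming the hasattr check really is the v1/v2 marker (the comment text is a suggestion, not code from the PR):

def convert_v1_dataset_format_to_v2(self):
    # v2-format tasks define `queries` up front; if the attribute is
    # missing, the data is still in the v1 format and needs converting.
    if not hasattr(self, "queries"):
        ...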

Comment on lines +443 to +448
cur_queries = split_data["queries"].map(
map_indexes_for_ds, fn_kwargs=dict(hf_subset=hf_subset, split=split)
)
cur_corpus = split_data["corpus"].map(
map_indexes_for_ds, fn_kwargs=dict(hf_subset=hf_subset, split=split)
)
Contributor:

Isn't there already an index? Why do we need to create a new one?

@@ -577,17 +636,24 @@ def calculate_corpus_length(


def process_docs(
-    collection: dict[str, dict[str, str] | str], hf_subset: str, split: str
+    collection: dict[str, list[str]], hf_subset: str, split: str
) -> dict[str, str]:
Contributor:

Do we still need this function? (It seems like the following supersedes it.)

@@ -62,7 +62,7 @@ def corpus_to_dict(


def create_dataloader_for_retrieval_corpus(
Contributor:

In this function we take a dataset, convert it to a dictionary, and then convert it back to a dataset. Could we avoid the round trip?
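For example, a hypothetical sketch that stays within the Dataset API; the column names are assumptions for illustration:

from datasets import Dataset

def combine_title_and_text(row: dict) -> dict:
    # Build the text field in place with `map`, without converting the
    # Dataset to a dict and back.
    title = row.get("title") or ""
    return {"text": f"{title} {row['text']}".strip()}

corpus = Dataset.from_dict({"id": ["d1"], "title": ["A title"], "text": ["A body"]})
corpus = corpus.map(combine_title_and_text)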

@@ -70,15 +71,15 @@ def __init__(

def search(
self,
-    corpus: dict[str, dict[str, str]],
-    queries: dict[str, str | list[str] | list[dict[str, str]]],
+    corpus: Dataset,
Contributor:

We could introduce two new types:

QueryDataset = Dataset
CorpusDataset = Dataset

These would differentiate between the two. It would also allow us to add some documentation on what they entail, but I am not sure if it is worth it.

@Samoed (Member, Author):

I think this would be a good addition.
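Perhaps something like this; the documented row shapes are assumptions added for illustration:

from datasets import Dataset

QueryDataset = Dataset   # rows like {"id": str, "text": str}
CorpusDataset = Dataset  # rows like {"id": str, "title": str, "text": str}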

@@ -1419,12 +1439,12 @@ class MockRerankingTask(AbsTaskRetrieval):
expected_stats = {
"test": {
"num_samples": 4,
"number_of_characters": 112,
"number_of_characters": 136,
Contributor:

Were these incorrect before?

@Samoed (Member, Author):

I've added a title to the corpus.

Comment on lines +354 to +356
results = sorted(
results["1"].keys(), key=lambda x: (results["1"][x], x), reverse=True
)[:2]
Contributor:

reverse=True. Isn't that a bit worrying?

@@ -27,7 +27,7 @@ def test_colbert_model_e2e(task: AbsTask, model: str, tmp_path: Path):
)
result = results[0]

-    assert result.scores["test"][0]["ndcg_at_1"] == 1.0
+    assert result.scores["test"][0]["ndcg_at_1"] == 0.0
Contributor:

Why would this change?

@Samoed (Member, Author) commented Jul 25, 2025:

As I said in the description, there is an issue due to the sorting and the embedding creation:

I also modified the test a bit because we’re creating mock embeddings independently of the input. Previously, the corpus dict was sorted by keys and, for some reason, ended up as [d2, d1], so the first embedding was used.

@Samoed (Member, Author) commented Jul 25, 2025:

Yes, I have had that problem before as well. I have actually considered replacing the mock encoder with a small model2vec model

We can use the approach from the LangChain fake embeddings, which set the seed from the text: https://github.com/langchain-ai/langchain/blob/master/libs%2Fcore%2Flangchain_core%2Fembeddings%2Ffake.py#L120
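A minimal sketch of that idea; the hashing scheme and dimension are illustrative, not LangChain's exact code:

import hashlib

import numpy as np

def fake_embedding(text: str, dim: int = 16) -> np.ndarray:
    # Seed the RNG from a hash of the text so the embedding depends on
    # the input content rather than on the order items are encoded in.
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).random(dim)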

@KennethEnevoldsen (Contributor) commented:

We can use the approach from the LangChain fake embeddings, which set the seed from the text: https://github.com/langchain-ai/langchain/blob/master/libs%2Fcore%2Flangchain_core%2Fembeddings%2Ffake.py#L120

Great idea! Let us do that instead.
