Different results by retriever.encode_and_retrieve and retriever.retrieve #206

@liyongkang123

Hi, I encountered an issue while evaluating contriever-msmarco on the ArguAna dataset using the official example.

When running the following code, with either one of the two retrieval calls:

results = retriever.retrieve(corpus, queries)
# or: results = retriever.encode_and_retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values, ignore_identical_ids=False)

I noticed that the results differ depending on which method (retrieve or encode_and_retrieve) is used. Specifically, the output of encode_and_retrieve may include document IDs that are identical to the query ID.

I always set ignore_identical_ids=False. With retrieve, I get the expected ndcg@10 of 44. With encode_and_retrieve, however, the ndcg@10 is much lower, at only 33.4. After comparing the results from both methods, I found that encode_and_retrieve includes the similarity of each query with itself (the document whose ID matches the query ID), which causes the discrepancy.

I would like to know how I can fix this problem, as I intend to save the embeddings and reuse them.
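
To make the problem concrete, here is a minimal sketch of a possible workaround, not a confirmed fix. It assumes that results is a nested dict mapping each query ID to a {document ID: score} dict, and the helper name drop_identical_ids is hypothetical (it is not part of the library). It removes every entry whose document ID equals the query ID before evaluation:

```python
from typing import Dict

# Assumed shape of `results`: {query_id: {doc_id: score}}
Results = Dict[str, Dict[str, float]]


def drop_identical_ids(results: Results) -> Results:
    """Return a copy of `results` without entries whose doc ID equals the query ID.

    Hypothetical helper for illustration; the input dict is left unmodified.
    """
    return {
        qid: {did: score for did, score in doc_scores.items() if did != qid}
        for qid, doc_scores in results.items()
    }


# Toy data: query "q1" also exists as a document "q1" in the corpus.
raw = {
    "q1": {"q1": 1.0, "d2": 0.8, "d3": 0.5},
    "q2": {"d1": 0.9},
}
filtered = drop_identical_ids(raw)
print(filtered)
```

The filtered dict could then be passed to retriever.evaluate in place of the raw results. Note that each affected query ends up with one fewer candidate, so retrieving one extra document per query may be needed to keep a full top-k list. Passing ignore_identical_ids=True to evaluate might have a similar effect, but I have not verified that it reproduces the retrieve numbers.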

Thank you in advance!
