Allow aggregated tasks within benchmarks #1231

Open
KennethEnevoldsen opened this issue Sep 23, 2024 · 5 comments · May be fixed by #1771

@KennethEnevoldsen
Contributor

We currently have only one aggregated task (CQADupstack); however, we can definitely imagine more in the future (e.g. for CoIR in embeddings-benchmark/leaderboard#27).

A proposed solution is to use the benchmark (they are already a group of tasks) and then allow a benchmark to be a list[task | benchmark]
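As a rough illustration of the `list[task | benchmark]` idea, here is a minimal sketch. The `Task` and `Benchmark` classes, the `flatten` method, and all task names below are hypothetical stand-ins and are not the actual mteb classes or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a single task (not the mteb class)."""

    name: str


@dataclass
class Benchmark:
    """Hypothetical benchmark whose members are tasks or nested benchmarks."""

    name: str
    tasks: list[Union[Task, "Benchmark"]] = field(default_factory=list)

    def flatten(self) -> list[Task]:
        """Return every leaf task, depth first, skipping duplicate names."""
        seen: set[str] = set()
        leaves: list[Task] = []
        for member in self.tasks:
            # A nested benchmark contributes its own leaf tasks.
            children = member.flatten() if isinstance(member, Benchmark) else [member]
            for task in children:
                if task.name not in seen:
                    seen.add(task.name)
                    leaves.append(task)
        return leaves


# Placeholder names: an aggregated group nested inside a larger benchmark.
aggregated = Benchmark("AggregatedGroup", [Task("SubtaskA"), Task("SubtaskB")])
outer = Benchmark("ExampleBenchmark", [Task("TaskC"), aggregated])
names = [task.name for task in outer.flatten()]
```

With this shape, a runner could evaluate the flattened leaf tasks while still reporting one score per nested group.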

This will require updates to MTEB.MTEB, as well as create_meta and potentially the CLI.

This approach should also solve: #1171

@Samoed
Collaborator

Samoed commented Sep 24, 2024

I think an average result across subsets could be added for multilingual datasets.

@KennethEnevoldsen
Contributor Author

KennethEnevoldsen commented Sep 24, 2024

Not entirely sure what is meant, @Samoed. Should we add it for multilingual datasets? (Isn't that already there?)

@Samoed
Collaborator

Samoed commented Sep 24, 2024

Yes, the author of the CoIR benchmark wanted an average score for the task. I believe this can be done if all subsets of the task are included in the results. This could also be implemented in the results repository. Currently, there are some tasks where the average is calculated.
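A minimal sketch of that averaging rule, assuming per-subset main scores are available as a plain dict. The function name `aggregate_main_score` and the subset names are hypothetical and not part of mteb or the results repository.

```python
from __future__ import annotations

from statistics import mean
from typing import Optional


def aggregate_main_score(
    subset_scores: dict[str, float], expected_subsets: list[str]
) -> Optional[float]:
    """Average the main score over subsets, or return None if any is missing."""
    missing = [name for name in expected_subsets if name not in subset_scores]
    if missing:
        # An average over an incomplete set of subsets would be misleading.
        return None
    return mean(subset_scores[name] for name in expected_subsets)


# Placeholder subset names and scores, for illustration only.
complete = aggregate_main_score({"subset_a": 0.5, "subset_b": 0.7}, ["subset_a", "subset_b"])
incomplete = aggregate_main_score({"subset_a": 0.5}, ["subset_a", "subset_b"])
```

Returning `None` for incomplete results is one possible choice; raising an error would be an equally reasonable design.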

@KennethEnevoldsen
Contributor Author

This seems like a quick fix (which I am more than happy to add for now), but it does not specify, within the benchmark specification in mteb, how the scores should be aggregated.

@isaac-chung
Collaborator

A proposed solution is to use the benchmark (they are already a group of tasks) and then allow a benchmark to be a list[task | benchmark]

It seems like we opted for a different approach, i.e. a new class, AggregateTask. Wouldn't the original approach simplify things a bit?
