
[feat] Add generic NanoEvaluator abstractions with NanoBEIR backward compatibility#3673

Draft
hotchpotch wants to merge 16 commits into huggingface:main from hotchpotch:nano-eval

Conversation

hotchpotch (Contributor) commented on Feb 24, 2026:

Hello!

Summary

This PR introduces a generic NanoEvaluator abstraction for sampled information-retrieval evaluation, while preserving backward compatibility for existing NanoBEIR evaluators.

NanoBEIREvaluator is very useful for large IR benchmarks because it evaluates sampled subsets at much lower cost. In the same spirit, this PR extends the approach beyond NanoBEIR so that other datasets with the same corpus/queries/qrels structure can be evaluated with the same evaluator family.

For example, this enables evaluation on datasets such as:

Details

  • Add NanoEvaluator as a generic parent evaluator for dense retrieval.
  • Refactor NanoBEIREvaluator to subclass NanoEvaluator while keeping its public API and output key conventions.
  • Add CrossEncoderNanoEvaluator and SparseNanoEvaluator as generic parents for cross-encoder and sparse settings.
  • Keep CrossEncoderNanoBEIREvaluator and SparseNanoBEIREvaluator as NanoBEIR-specific wrappers with backward-compatible behavior.
  • Add/expand examples for dense, sparse, and cross-encoder usage.
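The class relationship described in the bullets above can be sketched roughly as follows. This is a toy illustration only: apart from the two evaluator class names, all attributes, methods, and key formats here are placeholders, and the real implementation loads corpus/queries/qrels data and computes retrieval metrics rather than just formatting keys.

```python
# Toy sketch of the proposed hierarchy. The loading and metric logic is
# stubbed out; only the parent/subclass split and the key-prefix idea are shown.
class NanoEvaluator:
    """Generic sampled-IR evaluator over corpus/queries/qrels triples."""

    def __init__(self, dataset_names, name_prefix=""):
        self.dataset_names = dataset_names
        self.name_prefix = name_prefix  # controls how metric keys are named

    def metric_key(self, dataset, metric):
        # Placeholder for the output key convention, e.g. "NanoMSMARCO_cosine_ndcg@10"
        return f"{self.name_prefix}{dataset}_{metric}"


class NanoBEIREvaluator(NanoEvaluator):
    """BEIR-specific subclass: keeps the existing public API and key names."""

    def __init__(self, dataset_names):
        super().__init__(dataset_names, name_prefix="Nano")


evaluator = NanoBEIREvaluator(["MSMARCO"])
print(evaluator.metric_key("MSMARCO", "cosine_ndcg@10"))
# prints: NanoMSMARCO_cosine_ndcg@10
```

The cross-encoder and sparse variants would follow the same pattern: a generic parent plus a thin NanoBEIR-specific subclass.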

Backward Compatibility

  • Existing NanoBEIREvaluator usage remains valid.
  • Existing CrossEncoderNanoBEIREvaluator and SparseNanoBEIREvaluator usage remains valid.
  • Metric naming and expected primary metrics are preserved for NanoBEIR evaluators.
  • I also ran the sample code in NanoBEIREvaluator.py and confirmed that the resulting values are unchanged from the previous behavior.
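As a toy illustration of the compatibility claim (function names, defaults, and the dict-based config are all illustrative, not the library's actual internals): the BEIR-specific wrapper can pin the historical defaults and key prefix so that existing no-argument call sites behave exactly as before, while new callers pass arbitrary dataset ids to the generic evaluator.

```python
# Illustrative only: stand-ins for the generic evaluator and its
# backward-compatible BEIR wrapper. Names and defaults are placeholders.
DEFAULT_BEIR_DATASETS = ["msmarco", "nfcorpus", "nq"]  # placeholder subset

def make_nano_evaluator(dataset_names):
    # Generic path: caller supplies any corpus/queries/qrels dataset ids.
    return {"datasets": list(dataset_names), "prefix": ""}

def make_nano_beir_evaluator(dataset_names=None):
    # Wrapper path: same defaults and "Nano" metric-key prefix as before,
    # so existing call sites see no behavioral change.
    config = make_nano_evaluator(dataset_names or DEFAULT_BEIR_DATASETS)
    config["prefix"] = "Nano"
    return config

old_style = make_nano_beir_evaluator()  # existing usage, no arguments
print(old_style["prefix"], old_style["datasets"])
```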

This is an initial implementation to make the idea concrete. I would greatly appreciate any feedback!

@@ -0,0 +1,232 @@
from __future__ import annotations
These tests were added to cover the existing NanoBEIR behavior first, so that NanoBEIR could be refactored without breaking that behavior.

  instead of ``documents``. When using ``documents``, setting this to True will result in a more useful evaluation
  signal, but setting it to False will result in a more realistic evaluation. Defaults to True.
- batch_size (int): Batch size to compute sentence embeddings. Defaults to 64.
+ batch_size (int): Batch size to compute sentence embeddings. Defaults to 32.

The default batch size has long been 32 in the implementation, but the documentation still said 64, so I corrected the docstring to match the actual behavior.

hotchpotch marked this pull request as ready for review on February 24, 2026 12:58
hotchpotch marked this pull request as draft on February 26, 2026 23:18
hotchpotch commented:
After reviewing the changes again, I realized that although the new files are mostly copies of existing implementations, the diff shows them as pure additions.

That makes it very hard to distinguish what was already in the existing implementation from what is actually new and worth focusing on in review, so the review cost feels quite high.

I’m going to move this PR back to draft for now and try to restructure the changes so the diffs are easier to read and the review cost is lower.
