Add `--examples` Argument for Fine-Grained Task Evaluation in lm-evaluation-harness. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] #2520
base: main
Conversation
Sorry for the delay! This looks good and we look forward to further integration of PromptEval functionality :) Can you rerun the pre-commit formatter and then we can merge it?
Signed-off-by: Mírian Silva <[email protected]>
Done @StellaAthena, thanks for the review!
Signed-off-by: Mírian Silva <[email protected]>
@@ -393,7 +410,8 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        max_batch_size=args.max_batch_size,
        device=args.device,
        use_cache=args.use_cache,
-       limit=args.limit,
+       limit=limit,
`limit` and `examples` can be left undefined here.
I should also add a test for main so errors like this are caught.
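A minimal sketch of the kind of defensive initialization the comment points at; the helper name is hypothetical, and the exact semantics of how `--examples` interacts with `--limit` are assumed rather than taken from the PR:

```python
import argparse
import json
from typing import Optional, Tuple


def resolve_limit_and_examples(
    args: argparse.Namespace,
) -> Tuple[Optional[int], Optional[dict]]:
    # Always bind both names so neither can be undefined when they are
    # later passed to the evaluator (e.g. as limit=limit).
    limit = getattr(args, "limit", None)
    examples = None
    examples_path = getattr(args, "examples", None)
    if examples_path:
        # Assumption: --examples points at a JSON file mapping task names
        # to lists of example indices.
        with open(examples_path) as f:
            examples = json.load(f)
    return limit, examples
```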
), f"Elements of --examples should be in the interval [0,k-1] where k is the number of total examples. In this case, k={n}." | ||
doc_iterator = utils.create_iterator( | ||
enumerate( | ||
datasets.Dataset.from_pandas( |
should be able to use normal list methods, something like `[x for (i, x) in enumerate(self.eval_docs) if i in examples]`
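A minimal sketch of that suggestion, assuming `eval_docs` is an indexable sequence of task documents and `examples` is the list of selected indices for the task (the helper name is hypothetical):

```python
from typing import Any, Iterable, List, Sequence


def select_examples(eval_docs: Sequence[Any], examples: Iterable[int]) -> List[Any]:
    # Reviewer's suggestion: a plain enumerate + list comprehension instead of
    # round-tripping through pandas / datasets.Dataset just to filter rows.
    wanted = set(examples)  # set membership keeps the filter O(n)
    return [doc for i, doc in enumerate(eval_docs) if i in wanted]
```

The filtered list can then be fed to the existing `doc_iterator` construction in place of the `datasets.Dataset.from_pandas(...)` round trip.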
Hi! Sorry for the delay. This slipped past me. Left a couple of comments, but the logic looks good! Was thinking we could combine this with
This PR introduces a new `--examples` argument to the evaluation pipeline in `lm-evaluation-harness`, enabling users to evaluate specific examples across multiple tasks. This enhancement extends the functionality of the `--limit` argument by allowing users to control exactly which examples are included in the evaluation. Users specify the examples for each task via a JSON file containing a dictionary whose keys are task names and whose values are lists of example indices. For instance, a JSON file might look like this:
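(A sketch of such a file; the task names and indices below are purely illustrative.)

```json
{
  "hellaswag": [0, 5, 12, 42],
  "arc_easy": [3, 7, 19]
}
```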
To use this feature, save the dictionary to a file (e.g., /path/to/examples.json) and pass it on the command line, as in the sketch below. If we do not specify the examples for a task, all of its examples will be evaluated.
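A possible invocation, assuming a Hugging Face model; the model name and task list are placeholders, and the flags other than `--examples` are the harness's standard CLI options:

```bash
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-160m \
    --tasks hellaswag,arc_easy \
    --examples /path/to/examples.json
```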
This new feature has multiple applications. It allows practitioners to evaluate models on specific subsets of interest, such as critical edge cases or benchmarks. It also supports multi-prompt evaluation using PromptEval [1,2] by enabling the evaluation of a few selected examples for each prompt template, followed by performance distribution estimation. As part of the future roadmap, we plan to integrate PromptEval functionality directly into lm-evaluation-harness to provide a seamless evaluation experience.
References
[1] Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).
[2] https://github.com/felipemaiapolo/prompteval