
Add --examples Argument for Fine-Grained Task Evaluation in lm-evaluation-harness. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] #2520

Open
felipemaiapolo wants to merge 10 commits into main

Conversation

felipemaiapolo

This PR introduces a new --examples argument to the evaluation pipeline in lm-evaluation-harness, enabling users to evaluate specific examples across multiple tasks. It extends the functionality of the --limit argument by letting users control exactly which examples are included in the evaluation, rather than only how many. Users specify task examples via a JSON file containing a dictionary whose keys are task names and whose values are lists of example indices. For instance, a JSON file might look like this:

{
  "mmlu_astronomy": [0, 3, 6],
  "mmlu_anatomy": [1, 4, 7, 10],
  "mmlu_econometrics": [2, 5, 8, 11, 14]
}

To use this feature, save the dictionary to a file (e.g., /path/to/examples.json) and run a command like the following:

lm_eval \
  --model hf \
  --model_args pretrained=Qwen/Qwen1.5-0.5B \
  --tasks mmlu_astronomy,mmlu_anatomy,mmlu_econometrics \
  --device cuda:0 \
  --log_samples \
  --output_path "/path/to/output" \
  --examples "/path/to/examples.json"

If no examples are specified for a task, all of that task's examples are evaluated.
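
For reference, here is a minimal sketch of how such a file could be generated programmatically. The subset sizes and per-task totals below are illustrative values chosen for this example, not anything mandated by the PR:

import json
import random

# Illustrative subset sizes per task (hypothetical choices for this example).
subset_sizes = {"mmlu_astronomy": 3, "mmlu_anatomy": 4, "mmlu_econometrics": 5}

# Assumed total number of examples per task, used only to bound the sampled indices.
totals = {"mmlu_astronomy": 152, "mmlu_anatomy": 135, "mmlu_econometrics": 114}

rng = random.Random(0)
examples = {
    task: sorted(rng.sample(range(totals[task]), k))
    for task, k in subset_sizes.items()
}

with open("examples.json", "w") as f:
    json.dump(examples, f, indent=2)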

This new feature has multiple applications. It allows practitioners to evaluate models on specific subsets of interest, such as critical edge cases or benchmarks. It also supports multi-prompt evaluation using PromptEval [1,2] by enabling the evaluation of a few selected examples for each prompt template, followed by performance distribution estimation. As part of the future roadmap, we plan to integrate PromptEval functionality directly into lm-evaluation-harness to provide a seamless evaluation experience.
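
As a rough, hypothetical sketch of that PromptEval-style workflow (template names, file paths, and the sampling scheme are illustrative and not part of this PR), one could write a separate --examples file per prompt template and launch one harness run per file:

import json
import random

templates = ["template_a", "template_b", "template_c"]  # hypothetical prompt-template ids
total_examples = 152   # illustrative size of the task's evaluation set
per_template = 10      # evaluate only a few examples under each template

rng = random.Random(42)
for name in templates:
    subset = sorted(rng.sample(range(total_examples), per_template))
    with open(f"examples_{name}.json", "w") as f:
        json.dump({"mmlu_astronomy": subset}, f, indent=2)
    # Each file is then passed via --examples to a run configured with the
    # corresponding prompt template; the per-template scores feed PromptEval's
    # performance-distribution estimator.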

References
[1] Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).
[2] https://github.com/felipemaiapolo/prompteval

@CLAassistant

CLAassistant commented Nov 26, 2024

CLA assistant check
All committers have signed the CLA.

@StellaAthena
Member

Sorry for the delay! This looks good and we look forward to further integration of PromptEval functionality :)

Can you rerun the pre-commit formatter and then we can merge it?

@mirianfsilva

> Sorry for the delay! This looks good and we look forward to further integration of PromptEval functionality :)
>
> Can you rerun the pre-commit formatter and then we can merge it?

Done @StellaAthena, thanks for the review!

@@ -393,7 +410,8 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        max_batch_size=args.max_batch_size,
        device=args.device,
        use_cache=args.use_cache,
-       limit=args.limit,
+       limit=limit,
Contributor

@baberabb commented Jan 20, 2025


limit and examples can be left undefined here.

I should also add a test for main so errors like this are caught.
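
For illustration only, here is a self-contained sketch of the kind of guard this comment points at, binding both names before they are used; the argparse setup is a stand-in, not the harness's actual CLI code:

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--limit", type=float, default=None)
parser.add_argument("--examples", type=str, default=None)
args = parser.parse_args([])  # empty argv: neither flag supplied

# Bind both names unconditionally so later call sites never hit a NameError.
limit = args.limit
examples = None
if args.examples is not None:
    with open(args.examples) as f:
        examples = json.load(f)

print(limit, examples)  # -> None None when neither flag is given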

), f"Elements of --examples should be in the interval [0,k-1] where k is the number of total examples. In this case, k={n}."
doc_iterator = utils.create_iterator(
enumerate(
datasets.Dataset.from_pandas(

Should be able to use normal list methods; something like [x for (i, x) in enumerate(self.eval_docs) if i in examples].
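
A self-contained sketch of the filtering this suggests; eval_docs and the index list here are stand-ins rather than the harness's actual objects:

# Stand-in for a task's evaluation documents.
eval_docs = [{"question": f"q{i}"} for i in range(12)]

# Indices requested via --examples for this task.
examples = [0, 3, 6]

# Use a set for O(1) membership checks and keep original document order.
selected = set(examples)
filtered_docs = [doc for i, doc in enumerate(eval_docs) if i in selected]

assert [d["question"] for d in filtered_docs] == ["q0", "q3", "q6"]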

@baberabb
Contributor

Hi! Sorry for the delay; this slipped past me. I left a couple of comments, but the logic looks good!

I was thinking we could combine this with limit. That would make it more maintainable and allow for better backward compatibility. Thoughts? Something like: if limit is an int or float, we keep the current behavior; if it's a dict, we treat it as examples.
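
A rough sketch of that dispatch, with names and structure chosen for illustration rather than taken from the codebase:

from typing import Optional, Union

def split_limit(
    limit: Union[int, float, dict, None],
) -> tuple[Optional[Union[int, float]], Optional[dict]]:
    """Interpret a combined `limit` value as either the classic limit or an examples dict."""
    if limit is None or isinstance(limit, (int, float)):
        return limit, None          # existing --limit behavior (count or fraction)
    if isinstance(limit, dict):
        return None, limit          # task name -> list of example indices
    raise TypeError(f"Unsupported limit type: {type(limit)!r}")

print(split_limit(100))                         # (100, None)
print(split_limit({"mmlu_anatomy": [1, 4]}))    # (None, {'mmlu_anatomy': [1, 4]})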
