
Add --examples Argument for Fine-Grained Task Evaluation in lm-evaluation-harness. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] #2520

Open
felipemaiapolo wants to merge 10 commits into main

Conversation

felipemaiapolo

This PR introduces a new --examples argument to the evaluation pipeline in lm-evaluation-harness, enabling users to evaluate specific examples across multiple tasks. It extends the functionality of the --limit argument by letting users control exactly which examples are included in the evaluation, rather than only how many. Users specify task examples via a JSON file containing a dictionary whose keys are task names and whose values are lists of example indices. For instance, a JSON file might look like this:

{
  "mmlu_astronomy": [0, 3, 6],
  "mmlu_anatomy": [1, 4, 7, 10],
  "mmlu_econometrics": [2, 5, 8, 11, 14]
}

To use this feature, save the dictionary to a file (e.g., /path/to/examples.json) and run a command like the following:

lm_eval \
  --model hf \
  --model_args pretrained=Qwen/Qwen1.5-0.5B \
  --tasks mmlu_astronomy,mmlu_anatomy,mmlu_econometrics \
  --device cuda:0 \
  --log_samples \
  --output_path "/path/to/output" \
  --examples "/path/to/examples.json"

If no examples are specified for a task, all of that task's examples are evaluated.
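
For reference, here is a minimal sketch of how such a file could be generated programmatically. The subset sizes and per-task totals below are illustrative values chosen for this example, not anything mandated by the PR:

import json
import random

# Illustrative subset sizes per task (hypothetical choices for this example).
subset_sizes = {"mmlu_astronomy": 3, "mmlu_anatomy": 4, "mmlu_econometrics": 5}

# Assumed total number of examples per task, used only to bound the sampled indices.
totals = {"mmlu_astronomy": 152, "mmlu_anatomy": 135, "mmlu_econometrics": 114}

rng = random.Random(0)
examples = {
    task: sorted(rng.sample(range(totals[task]), k))
    for task, k in subset_sizes.items()
}

with open("examples.json", "w") as f:
    json.dump(examples, f, indent=2)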

This new feature has multiple applications. It allows practitioners to evaluate models on specific subsets of interest, such as critical edge cases or benchmarks. It also supports multi-prompt evaluation using PromptEval [1,2] by enabling the evaluation of a few selected examples for each prompt template, followed by performance distribution estimation. As part of the future roadmap, we plan to integrate PromptEval functionality directly into lm-evaluation-harness to provide a seamless evaluation experience.
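
As a rough, hypothetical sketch of that PromptEval-style workflow (template names, file paths, and the sampling scheme are illustrative and not part of this PR), one could write a separate --examples file per prompt template and launch one harness run per file:

import json
import random

templates = ["template_a", "template_b", "template_c"]  # hypothetical prompt-template ids
total_examples = 152   # illustrative size of the task's evaluation set
per_template = 10      # evaluate only a few examples under each template

rng = random.Random(42)
for name in templates:
    subset = sorted(rng.sample(range(total_examples), per_template))
    with open(f"examples_{name}.json", "w") as f:
        json.dump({"mmlu_astronomy": subset}, f, indent=2)
    # Each file is then passed via --examples to a run configured with the
    # corresponding prompt template; the per-template scores feed PromptEval's
    # performance-distribution estimator.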

References
[1] Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).
[2] https://github.com/felipemaiapolo/prompteval

@CLAassistant

CLAassistant commented Nov 26, 2024

CLA assistant check
All committers have signed the CLA.

@StellaAthena
Member

Sorry for the delay! This looks good and we look forward to further integration of PromptEval functionality :)

Can you rerun the pre-commit formatter and then we can merge it?

@mirianfsilva

> Sorry for the delay! This looks good and we look forward to further integration of PromptEval functionality :)
>
> Can you rerun the pre-commit formatter and then we can merge it?

Done @StellaAthena, thanks for the review!

@@ -393,7 +410,8 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
        max_batch_size=args.max_batch_size,
        device=args.device,
        use_cache=args.use_cache,
-       limit=args.limit,
+       limit=limit,
Contributor

@baberabb commented Jan 20, 2025


limit and examples can be left undefined here.

I should also add a test for main so errors like this are caught.
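
For illustration only, here is a self-contained sketch of the kind of guard this comment points at, binding both names before they are used; the argparse setup is a stand-in, not the harness's actual CLI code:

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--limit", type=float, default=None)
parser.add_argument("--examples", type=str, default=None)
args = parser.parse_args([])  # empty argv: neither flag supplied

# Bind both names unconditionally so later call sites never hit a NameError.
limit = args.limit
examples = None
if args.examples is not None:
    with open(args.examples) as f:
        examples = json.load(f)

print(limit, examples)  # -> None None when neither flag is given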

), f"Elements of --examples should be in the interval [0,k-1] where k is the number of total examples. In this case, k={n}."
doc_iterator = utils.create_iterator(
enumerate(
datasets.Dataset.from_pandas(

Should be able to use normal list methods; something like [x for (i, x) in enumerate(self.eval_docs) if i in examples].
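
A self-contained sketch of the filtering this suggests; eval_docs and the index list here are stand-ins rather than the harness's actual objects:

# Stand-in for a task's evaluation documents.
eval_docs = [{"question": f"q{i}"} for i in range(12)]

# Indices requested via --examples for this task.
examples = [0, 3, 6]

# Use a set for O(1) membership checks and keep original document order.
selected = set(examples)
filtered_docs = [doc for i, doc in enumerate(eval_docs) if i in selected]

assert [d["question"] for d in filtered_docs] == ["q0", "q3", "q6"]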

@baberabb
Contributor

Hi! Sorry for the delay; this slipped past me. I left a couple of comments, but the logic looks good!

I was thinking we could combine this with limit. That would make it more maintainable and allow for better backward compatibility. Thoughts? Something like: if limit is an int or float, we keep the current behavior; if it's a dict, we treat it as examples.
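
A rough sketch of that dispatch, with names and structure chosen for illustration rather than taken from the codebase:

from typing import Optional, Union

def split_limit(
    limit: Union[int, float, dict, None],
) -> tuple[Optional[Union[int, float]], Optional[dict]]:
    """Interpret a combined `limit` value as either the classic limit or an examples dict."""
    if limit is None or isinstance(limit, (int, float)):
        return limit, None          # existing --limit behavior (count or fraction)
    if isinstance(limit, dict):
        return None, limit          # task name -> list of example indices
    raise TypeError(f"Unsupported limit type: {type(limit)!r}")

print(split_limit(100))                         # (100, None)
print(split_limit({"mmlu_anatomy": [1, 4]}))    # (None, {'mmlu_anatomy': [1, 4]})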
