# MATH

ℹ️ This is the 0-shot variant, reproducing https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-Instruct-evals/viewer/Llama-3.1-8B-Instruct-evals__math__details?row=0

## Paper
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

NOTE: The few-shot prompting and the generated-answer extraction are based on [Minerva](https://arxiv.org/abs/2206.14858), and exact-match equivalence is computed using the `sympy` library. This requires additional dependencies, which can be installed via the `lm-eval[math]` extra.
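
As a rough illustration of what Minerva-style answer handling looks like, the sketch below pulls the last `\boxed{...}` expression out of a generation and asks `sympy` whether it is symbolically equivalent to the reference answer. This is an illustration only, not the task's actual scoring code; the helper names are made up, and the LaTeX parsing assumes the ANTLR runtime pulled in by the `lm-eval[math]` extra.

```python
# Illustrative sketch only -- not the harness's implementation.
import sympy
from sympy.parsing.latex import parse_latex  # needs the ANTLR runtime from lm-eval[math]


def last_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, answer = 0, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                break  # matching closing brace of \boxed{...}
            depth -= 1
        answer.append(ch)
    return "".join(answer)


def is_equiv(candidate, reference):
    """True if the two LaTeX answers simplify to the same value."""
    try:
        return sympy.simplify(parse_latex(candidate) - parse_latex(reference)) == 0
    except Exception:
        # Fall back to a plain string match if either answer fails to parse.
        return candidate.strip() == reference.strip()


generation = r"The probability is therefore $\boxed{\frac{1}{2}}$."
print(is_equiv(last_boxed_answer(generation), "0.5"))  # True
```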

Homepage: https://github.com/hendrycks/math


## Citation
```
@article{hendrycksmath2021,
    title={Measuring Mathematical Problem Solving With the MATH Dataset},
    author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
    journal={NeurIPS},
    year={2021}
}

@misc{2206.14858,
    title={Solving Quantitative Reasoning Problems with Language Models},
    author={Aitor Lewkowycz and Anders Andreassen and David Dohan and Ethan Dyer and Henryk Michalewski and Vinay Ramasesh and Ambrose Slone and Cem Anil and Imanol Schlag and Theo Gutman-Solo and Yuhuai Wu and Behnam Neyshabur and Guy Gur-Ari and Vedant Misra},
    year={2022},
    eprint={arXiv:2206.14858}
}
```

### Groups and Tasks

[//]: # (#### Groups)

[//]: # ()
[//]: # (- `llama_math`)

#### Tasks

- `llama_math_algebra`
- `llama_math_counting_and_prob`
- `llama_math_geometry`
- `llama_math_intermediate_algebra`
- `llama_math_num_theory`
- `llama_math_prealgebra`
- `llama_math_precalc`
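
The subtasks above can be run like any other task in the harness. Below is a minimal usage sketch via the Python API (`lm_eval.simple_evaluate`); the checkpoint, subtask selection, and batch size are illustrative only, and the `lm_eval` CLI with `--tasks` can be used equivalently.

```python
# Minimal usage sketch -- model checkpoint, subtasks, and batch size are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["llama_math_algebra", "llama_math_geometry"],
    batch_size=8,
)
print(results["results"])
```
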
### Checklist

The checklist is the following:

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
    * The implementation in the original paper is one where the model is first fine-tuned on the data. They do have a few-shot evaluation for GPT-3; however, the few-shot context used here is sourced from [Lewkowycz et al.](https://arxiv.org/abs/2206.14858). The accuracy achieved with Llama-2 models is comparable to that reported in the paper, though not identical.


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

### Variant Wishlist

- [ ] zero-shot variant