Commit 748eb47

add llama_math
1 parent da92dc8 commit 748eb47

10 files changed: +425 −0 lines changed
@@ -0,0 +1,66 @@
# MATH
ℹ️ This is the 0-shot variant, reproducing https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-Instruct-evals/viewer/Llama-3.1-8B-Instruct-evals__math__details?row=0
## Paper
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

NOTE: The few-shot context and the extraction of generated answers are based on [Minerva](https://arxiv.org/abs/2206.14858), and exact-match equivalence is calculated using the `sympy` library. This requires additional dependencies, which can be installed via the `lm-eval[math]` extra.
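As a rough illustration of what a `sympy`-based equivalence check looks like, consider the hypothetical sketch below. It is not the task's actual implementation (that lives in the task's `utils` module, which is not part of this diff), and `is_equiv` is an illustrative name:

```python
# Hypothetical sketch of sympy-based answer equivalence (illustrative only;
# the task's real logic lives in its utils module, not shown in this diff).
import sympy
from sympy.parsing.latex import parse_latex  # needs the lm-eval[math] extra

def is_equiv(gold: str, pred: str) -> bool:
    """True if two LaTeX answer strings are symbolically equal."""
    try:
        # Two expressions are equivalent when their difference simplifies to 0.
        return sympy.simplify(parse_latex(gold) - parse_latex(pred)) == 0
    except Exception:
        # Fall back to plain string comparison when LaTeX parsing fails.
        return gold.strip() == pred.strip()
```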
Homepage: https://github.com/hendrycks/math


## Citation
```
@article{hendrycksmath2021,
    title={Measuring Mathematical Problem Solving With the MATH Dataset},
    author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
    journal={NeurIPS},
    year={2021}
}

@misc{2206.14858,
    author = {Aitor Lewkowycz and Anders Andreassen and David Dohan and Ethan Dyer and Henryk Michalewski and Vinay Ramasesh and Ambrose Slone and Cem Anil and Imanol Schlag and Theo Gutman-Solo and Yuhuai Wu and Behnam Neyshabur and Guy Gur-Ari and Vedant Misra},
    title = {Solving Quantitative Reasoning Problems with Language Models},
    year = {2022},
    eprint = {arXiv:2206.14858},
}
```

### Groups and Tasks

#### Groups

- `llama_math`

#### Tasks

- `llama_math_algebra`
- `llama_math_counting_and_prob`
- `llama_math_geometry`
- `llama_math_intermediate_algebra`
- `llama_math_num_theory`
- `llama_math_prealgebra`
- `llama_math_precalc`

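The tasks above (or the whole `llama_math` group) can be run through the harness's Python API; a minimal usage sketch follows, where the model name is only an example placeholder:

```python
# Minimal sketch: running the llama_math tasks via lm-evaluation-harness.
# The pretrained model below is an example, not a requirement of the task.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["llama_math_algebra"],  # or ["llama_math"] for the whole group
)
print(results["results"])
```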
### Checklist

The checklist is the following:

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
  * The implementation in the original paper fine-tunes the model on the dataset before evaluating. The paper does include a few-shot evaluation for GPT-3; however, the few-shot context used here is sourced from [Lewkowycz et al.](https://arxiv.org/abs/2206.14858). The accuracy achieved with Llama-2 models is comparable to, though not identical to, that reported in the paper.


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

### Variant Wishlist

- [x] zero-shot variant (this implementation)
@@ -0,0 +1,25 @@
task: llama_math_algebra
dataset_path: EleutherAI/hendrycks_math
process_docs: !function utils.process_docs
dataset_name: algebra
output_type: generate_until
training_split: train
test_split: test
doc_to_text: "Solve the following math problem efficiently and clearly:\n\n- For simple problems (2 steps or fewer):\nProvide a concise solution with minimal explanation.\n\n- For complex problems (3 steps or more):\nUse this step-by-step format:\n\n## Step 1: [Concise description]\n[Brief explanation and calculations]\n\n## Step 2: [Concise description]\n[Brief explanation and calculations]\n\n...\n\nRegardless of the approach, always conclude with:\n\nTherefore, the final answer is: $\\\\boxed{answer}$. I hope it is correct.\n\nWhere [answer] is just the final number or expression that solves the problem.\n\nProblem: {{ problem }}"
process_results: !function utils.process_results
doc_to_target: "{{answer if few_shot is undefined else solution}}"
generation_kwargs:
  until:
    - "Problem:"
  max_gen_toks: 5120
  do_sample: false
  temperature: 0
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
num_fewshot: 0
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
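The `utils.process_docs` and `utils.process_results` helpers referenced by this config are not part of the diff. As a hypothetical sketch of what `process_results` has to do for this setup (pull the final `\boxed{...}` answer from the generation and score exact match against the target), something like:

```python
# Hypothetical sketch of a process_results helper (the real utils module is
# not shown in this diff); extracts the last \boxed{...} answer and scores it.
import re

def extract_boxed(text: str) -> str:
    """Return the contents of the last \\boxed{...} span (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else ""

def process_results(doc: dict, results: list) -> dict:
    # `results` holds the model generations; this task produces one per doc.
    pred = extract_boxed(results[0])
    gold = str(doc["answer"]).strip()
    return {"exact_match": 1.0 if pred == gold else 0.0}
```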
@@ -0,0 +1,3 @@
include: llama_math_algebra.yaml
dataset_name: counting_and_probability
task: llama_math_counting_and_prob

@@ -0,0 +1,3 @@
include: llama_math_algebra.yaml
dataset_name: geometry
task: llama_math_geometry

@@ -0,0 +1,3 @@
include: llama_math_algebra.yaml
dataset_name: intermediate_algebra
task: llama_math_intermediate_algebra

@@ -0,0 +1,3 @@
include: llama_math_algebra.yaml
dataset_name: number_theory
task: llama_math_num_theory

@@ -0,0 +1,3 @@
include: llama_math_algebra.yaml
dataset_name: prealgebra
task: llama_math_prealgebra

@@ -0,0 +1,3 @@
include: llama_math_algebra.yaml
dataset_name: precalculus
task: llama_math_precalc
@@ -0,0 +1,14 @@
group: llama_math
task:
  - llama_math_algebra
  - llama_math_counting_and_prob
  - llama_math_geometry
  - llama_math_intermediate_algebra
  - llama_math_num_theory
  - llama_math_prealgebra
  - llama_math_precalc
aggregate_metric_list:
  - metric: exact_match
    weight_by_size: True
metadata:
  version: 1
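`weight_by_size: True` makes the group score a micro-average: each subtask's `exact_match` is weighted by its number of documents rather than averaged uniformly. A small illustration with invented numbers:

```python
# Size-weighted (micro) average, as implied by weight_by_size: True.
def weighted_mean(scores, sizes):
    return sum(s * n for s, n in zip(scores, sizes)) / sum(sizes)

# Invented example: a large easy subtask dominates a small hard one.
print(weighted_mean([0.50, 0.30], [1000, 100]))  # -> 0.4818...
```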
