# MATH

ℹ️ This is the 0-shot variant, reproducing https://huggingface.co/datasets/meta-llama/Llama-3.1-8B-Instruct-evals/viewer/Llama-3.1-8B-Instruct-evals__math__details?row=0

## Paper
Measuring Mathematical Problem Solving With the MATH Dataset
https://arxiv.org/abs/2103.03874

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.

NOTE: The few-shot prompting and the generated-answer extraction are based on [Minerva](https://arxiv.org/abs/2206.14858), and exact-match equivalence is computed using the `sympy` library. This requires additional dependencies, which can be installed via the `lm-eval[math]` extra.
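
As a rough illustration of what Minerva-style answer handling looks like, the sketch below pulls the last `\boxed{...}` expression out of a generation and asks `sympy` whether it is symbolically equivalent to the reference answer. This is an illustration only, not the task's actual scoring code; the helper names are made up, and the LaTeX parsing assumes the ANTLR runtime pulled in by the `lm-eval[math]` extra.

```python
# Illustrative sketch only -- not the harness's implementation.
import sympy
from sympy.parsing.latex import parse_latex  # needs the ANTLR runtime from lm-eval[math]


def last_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, answer = 0, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                break  # matching closing brace of \boxed{...}
            depth -= 1
        answer.append(ch)
    return "".join(answer)


def is_equiv(candidate, reference):
    """True if the two LaTeX answers simplify to the same value."""
    try:
        return sympy.simplify(parse_latex(candidate) - parse_latex(reference)) == 0
    except Exception:
        # Fall back to a plain string match if either answer fails to parse.
        return candidate.strip() == reference.strip()


generation = r"The probability is therefore $\boxed{\frac{1}{2}}$."
print(is_equiv(last_boxed_answer(generation), "0.5"))  # True
```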

Homepage: https://github.com/hendrycks/math


## Citation
```
@article{hendrycksmath2021,
    title={Measuring Mathematical Problem Solving With the MATH Dataset},
    author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
    journal={NeurIPS},
    year={2021}
}

@misc{2206.14858,
    title={Solving Quantitative Reasoning Problems with Language Models},
    author={Aitor Lewkowycz and Anders Andreassen and David Dohan and Ethan Dyer and Henryk Michalewski and Vinay Ramasesh and Ambrose Slone and Cem Anil and Imanol Schlag and Theo Gutman-Solo and Yuhuai Wu and Behnam Neyshabur and Guy Gur-Ari and Vedant Misra},
    year={2022},
    eprint={arXiv:2206.14858}
}
```

### Groups and Tasks

[//]: # (#### Groups)

[//]: # ()
[//]: # (- `llama_math`)

#### Tasks

- `llama_math_algebra`
- `llama_math_counting_and_prob`
- `llama_math_geometry`
- `llama_math_intermediate_algebra`
- `llama_math_num_theory`
- `llama_math_prealgebra`
- `llama_math_precalc`
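
The subtasks above can be run like any other task in the harness. Below is a minimal usage sketch via the Python API (`lm_eval.simple_evaluate`); the checkpoint, subtask selection, and batch size are illustrative only, and the `lm_eval` CLI with `--tasks` can be used equivalently.

```python
# Minimal usage sketch -- model checkpoint, subtasks, and batch size are illustrative only.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["llama_math_algebra", "llama_math_geometry"],
    batch_size=8,
)
print(results["results"])
```
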
### Checklist

The checklist is the following:

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
    * The implementation in the original paper is one where the model is first fine-tuned on the data. They do have a few-shot evaluation for GPT-3; however, the few-shot context used here is sourced from [Lewkowycz et al.](https://arxiv.org/abs/2206.14858). The accuracy achieved with Llama-2 models is comparable to that reported in the paper, though not identical.


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [x] Have you noted which, if any, published evaluation setups are matched by this variant?

### Variant Wishlist

- [ ] zero-shot variant