Llama-3.1-8B-Instruct performance on minerva_math #2646

sky-fly97 · 2025-01-21T06:28:02Z

My command is:
accelerate launch -m lm_eval --model hf \
    --model_args pretrained=models/Meta-Llama-3.1-8B-Instruct \
    --tasks minerva_math \
    --batch_size auto \
    --output ./output/multi_gpu_hen \
    --num_fewshot 4
I got the results below. The official 0-shot result seems to be 51.9. Is there a problem with my setup?
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------------------------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|minerva_math | 1|none | |exact_match|↑ |0.3456|± |0.0063|
| - minerva_math_algebra | 1|none | 4|exact_match|↑ |0.4954|± |0.0145|
| - minerva_math_counting_and_prob | 1|none | 4|exact_match|↑ |0.3143|± |0.0213|
| - minerva_math_geometry | 1|none | 4|exact_match|↑ |0.2672|± |0.0202|
| - minerva_math_intermediate_algebra| 1|none | 4|exact_match|↑ |0.1395|± |0.0115|
| - minerva_math_num_theory | 1|none | 4|exact_match|↑ |0.2333|± |0.0182|
| - minerva_math_prealgebra | 1|none | 4|exact_match|↑ |0.6039|± |0.0166|
| - minerva_math_precalc | 1|none | 4|exact_match|↑ |0.1557|± |0.0155|

| Groups |Version|Filter|n-shot| Metric | |Value | |Stderr|
|------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|minerva_math| 1|none | |exact_match|↑ |0.3456|± |0.0063|


sky-fly97 · 2025-01-21T07:07:33Z

When I use hendrycks_math with 4-shot, I get 0.1538:
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|--------------------------------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|hendrycks_math | 1|none | |exact_match|↑ |0.1538|± |0.0051|
| - hendrycks_math_algebra | 1|none | 4|exact_match|↑ |0.1474|± |0.0103|
| - hendrycks_math_counting_and_prob | 1|none | 4|exact_match|↑ |0.1456|± |0.0162|
| - hendrycks_math_geometry | 1|none | 4|exact_match|↑ |0.1294|± |0.0154|
| - hendrycks_math_intermediate_algebra| 1|none | 4|exact_match|↑ |0.0941|± |0.0097|
| - hendrycks_math_num_theory | 1|none | 4|exact_match|↑ |0.1519|± |0.0155|
| - hendrycks_math_prealgebra | 1|none | 4|exact_match|↑ |0.2595|± |0.0149|
| - hendrycks_math_precalc | 1|none | 4|exact_match|↑ |0.1282|± |0.0143|

| Groups |Version|Filter|n-shot| Metric | |Value | |Stderr|
|--------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|hendrycks_math| 1|none | |exact_match|↑ |0.1538|± |0.0051|

baberabb · 2025-01-21T17:24:49Z

Hi! Using --apply_chat_template (and --fewshot_as_multiturn if you use few-shot examples) should give some improvement, but the official results also use a CoT prompt for MATH. I've added it in #2556 (as llama_math), but haven't tested it yet.
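For reference, here is a rough sketch of the original command with the two chat-related flags added. The model path and output directory are copied from the first post. `--output_path` is the full name of the flag abbreviated there as `--output`. `LM_EVAL_RUN` is a made-up variable used only in this sketch to switch between printing the command and running it. It is not expected to reproduce the official number on its own, since the CoT prompt is a separate change.

```shell
# Sketch only: paths come from the original post; adjust them to your setup.
MODEL="models/Meta-Llama-3.1-8B-Instruct"
OUT="./output/multi_gpu_hen"

# Build the argument list step by step so each flag can carry a comment.
set -- --model hf --model_args "pretrained=${MODEL}"
set -- "$@" --tasks minerva_math --num_fewshot 4
set -- "$@" --apply_chat_template     # wrap prompts in the model's chat template
set -- "$@" --fewshot_as_multiturn    # send few-shot examples as earlier chat turns
set -- "$@" --batch_size auto --output_path "${OUT}"

# Print the full command for inspection.
CMD="accelerate launch -m lm_eval $*"
echo "$CMD"

# LM_EVAL_RUN is a hypothetical switch for this sketch: set it to 1 to launch.
if [ "${LM_EVAL_RUN:-0}" = "1" ]; then
  accelerate launch -m lm_eval "$@"
fi
```

If the llama_math task from #2556 is available in your installation, it can be passed to --tasks in place of minerva_math. Whether it should be run with few-shot examples is not stated in this thread.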

baberabb · 2025-01-21T22:01:27Z

I'm getting 0.4894 for llama_math on meta-llama/Llama-3.1-8B-Instruct

sky-fly97 · 2025-01-22T03:33:25Z

> I'm getting 0.4894 for llama_math on meta-llama/Llama-3.1-8B-Instruct

Very close result! What command should I run to get this result?
