
Regarding GPQA: What do you mean by "For tasks without subtasks (e.g., GPQA, MMLU-PRO), the normalization process is straightforward"? #26

Open
sorobedio opened this issue Jul 25, 2024 · 2 comments

Comments


sorobedio commented Jul 25, 2024

If you claim that the GPQA task has no subtasks, which task is used by the leaderboard?
Here is the list of tasks associated with the GPQA task in lm-eval:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.3384 | ± 0.0337 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.3205 | ± 0.0200 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.3438 | ± 0.0225 |
@tunglamlqddb

May I ask a newbie question?
Is the final result that appears on the leaderboard the average of these three metrics?
Thank you

@sorobedio (Author)

It is not just an average. According to them, you have to perform min-max normalization on each subtask with multiple choices and then average the normalized accuracies of the subtasks. They describe it here: https://github.com/huggingface/leaderboards/blob/main/docs/source/en/open_llm_leaderboard/normalization.md
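For anyone following along, here is a minimal sketch of that procedure as I understand it from the linked doc, assuming GPQA's four-choice format gives a random-guessing baseline of 0.25 and that scores below the baseline are clipped to 0. The input values are the acc_norm numbers from the table above.

```python
# Minimal sketch of the leaderboard normalization (assumptions: 4-choice GPQA,
# random baseline = 0.25, scores below the baseline clipped to 0).

def normalize(raw_score: float, baseline: float, max_score: float = 1.0) -> float:
    """Min-max normalize a raw accuracy between the random baseline and the max score."""
    score = (raw_score - baseline) / (max_score - baseline)
    return max(score, 0.0)  # clip scores that fall below the random-guessing baseline

# acc_norm values taken from the lm-eval output above
subtask_acc_norm = {
    "leaderboard_gpqa_diamond": 0.3384,
    "leaderboard_gpqa_extended": 0.3205,
    "leaderboard_gpqa_main": 0.3438,
}

# Normalize each subtask separately, then average the normalized scores.
normalized = [normalize(v, baseline=0.25) for v in subtask_acc_norm.values()]
gpqa_score = sum(normalized) / len(normalized)
print(f"GPQA score: {gpqa_score:.4f}")  # ~0.1123, i.e. about 11.2 on a 0-100 scale
```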
