
Regarding GPQA: What do you mean by "For tasks without subtasks (e.g., GPQA, MMLU-PRO), the normalization process is straightforward"? #26

Open
sorobedio opened this issue Jul 25, 2024 · 2 comments

Comments


sorobedio commented Jul 25, 2024

If you claim that the GPQA task has no subtasks, which task is used by the leaderboard?
Here is the list of tasks associated with the GPQA task in lm-eval:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.3384 | ± 0.0337 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.3205 | ± 0.0200 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.3438 | ± 0.0225 |
@tunglamlqddb

May I ask a newbie question?
Is the final result that appears on the leaderboard the average of these three metrics?
Thank you

@sorobedio (Author)

It is not just an average. According to them, you have to perform min-max normalization on each subtask with multiple choices and then average the normalized accuracies of the subtasks. They describe it here: https://github.com/huggingface/leaderboards/blob/main/docs/source/en/open_llm_leaderboard/normalization.md
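For anyone following along, here is a minimal sketch of that procedure as I understand it from the linked doc, assuming GPQA's four-choice format gives a random-guessing baseline of 0.25 and that scores below the baseline are clipped to 0. The input values are the acc_norm numbers from the table above.

```python
# Minimal sketch of the leaderboard normalization (assumptions: 4-choice GPQA,
# random baseline = 0.25, scores below the baseline clipped to 0).

def normalize(raw_score: float, baseline: float, max_score: float = 1.0) -> float:
    """Min-max normalize a raw accuracy between the random baseline and the max score."""
    score = (raw_score - baseline) / (max_score - baseline)
    return max(score, 0.0)  # clip scores that fall below the random-guessing baseline

# acc_norm values taken from the lm-eval output above
subtask_acc_norm = {
    "leaderboard_gpqa_diamond": 0.3384,
    "leaderboard_gpqa_extended": 0.3205,
    "leaderboard_gpqa_main": 0.3438,
}

# Normalize each subtask separately, then average the normalized scores.
normalized = [normalize(v, baseline=0.25) for v in subtask_acc_norm.values()]
gpqa_score = sum(normalized) / len(normalized)
print(f"GPQA score: {gpqa_score:.4f}")  # ~0.1123, i.e. about 11.2 on a 0-100 scale
```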
