Rigorous, unbiased, and scalable LLM evaluations across diverse AI benchmarks, from GPQA Diamond to Chatbot Arena, testing all major models equally.
![BenchmarkAggregator Dashboard](https://private-user-images.githubusercontent.com/32551374/360450029-4d164e83-d527-4da9-ac43-4366d70a0f04.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkyMzM4MjMsIm5iZiI6MTczOTIzMzUyMywicGF0aCI6Ii8zMjU1MTM3NC8zNjA0NTAwMjktNGQxNjRlODMtZDUyNy00ZGE5LWFjNDMtNDM2NmQ3MGEwZjA0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjExVDAwMjUyM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWNmMmE0YTVjZGU5NGY3MTg3ZTA3OTRjZGQxNzAzMDI4NWVlNzNiNmIyNzQ3MjkzNjJlMTUxYmI4MGZhODY0NzYmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.WC3yjGfbmmm2RX--ojh3sJyV5S0R6zFdTJaawoHzY-g)
View Leaderboard | Features | Benchmarks | FAQ
The BenchmarkAggregator framework serves as a central hub, addressing the critical need for consistent model evaluation in the AI community. It compares Large Language Models (LLMs) across challenging, well-respected benchmarks in one unified location, offering a holistic and fair view of model performance. The approach balances depth of evaluation against resource constraints, keeping comparisons rigorous while remaining practical and accessible from a single, authoritative source.
| Model | Average Score |
|---|---|
| gpt-4o-2024-08-06 | 69.0 |
| claude-3.5-sonnet | 66.2 |
| gpt-4o-mini-2024-07-18 | 62.1 |
| mistral-large | 61.4 |
| llama-3.1-405b-instruct | 59.8 |
| llama-3.1-70b-instruct | 58.4 |
| claude-3-sonnet | 53.2 |
| gpt-3.5-turbo-0125 | 34.8 |
For detailed scores across all benchmarks, visit our leaderboard.
- 🏆 Incorporates top, most respected benchmarks in the AI community
- 📊 Balanced evaluation using 100 randomly drawn samples per benchmark (adjustable)
- 🔌 Quick and easy integration of new benchmarks and models (OpenRouter makes adding new models trivial)
- 📈 Holistic performance view through score averaging across diverse tasks
- ⚖️ Efficient approach balancing evaluation depth with resource constraints
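The holistic score above is produced by averaging per-benchmark scores. A minimal sketch, assuming an unweighted mean (the benchmark names and values below are illustrative, not taken from the leaderboard):

```python
# Hypothetical per-benchmark scores (0-100) for a single model;
# names and values here are illustrative only.
scores = {
    "GPQA Diamond": 53.0,
    "MMLU-Pro": 74.0,
    "Chatbot Arena": 80.0,
}

def average_score(benchmark_scores):
    """Unweighted mean across all benchmarks, rounded to one decimal place."""
    return round(sum(benchmark_scores.values()) / len(benchmark_scores), 1)

print(average_score(scores))  # 69.0
```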
📖 Learn more about each benchmark on our website
Why not run all questions for each benchmark?
Running all questions for each benchmark would be cost-prohibitive. Our approach balances comprehensive evaluation with practical resource constraints.

How are benchmark samples chosen?
The samples are randomly drawn from the larger benchmark dataset, and the same sample set is used for every model to ensure consistency and fair comparison across all evaluations.

Why are certain models like Claude 3 Opus and GPT-4 Turbo absent?
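One common way to guarantee that every model is evaluated on the same random sample is to fix the RNG seed when drawing questions. The sketch below assumes that mechanism; the repository's actual implementation may differ:

```python
import random

def draw_shared_sample(question_ids, k=100, seed=42):
    """Draw k questions reproducibly: the fixed seed means every model
    is evaluated on exactly the same subset."""
    rng = random.Random(seed)
    return rng.sample(question_ids, k)

questions = list(range(1000))  # stand-in for a benchmark's question IDs
# Two independent draws yield the identical sample set.
assert draw_shared_sample(questions) == draw_shared_sample(questions)
```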
These models are significantly more expensive to query than most others, so they are omitted for cost reasons.

How easy is it to add new benchmarks or models?
Adding new benchmarks or models is designed to be quick and efficient. Integrating an existing benchmark can take only a few minutes. For models, we use OpenRouter, which covers essentially all closed- and open-source options: simply find the model's ID on the OpenRouter website and include it in the framework.

How are the scores from Chatbot Arena calculated?
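Because OpenRouter exposes an OpenAI-compatible chat-completions API, "adding a model" can amount to changing one model ID string in the request payload. A hedged sketch (the model ID and question are examples; the framework's real request code may differ):

```python
# OpenRouter's OpenAI-compatible chat-completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model_id, question):
    """Build an OpenAI-style chat payload; swapping models only means
    swapping the model_id string found on the OpenRouter site."""
    return {
        "model": model_id,  # e.g. "openai/gpt-4o-2024-08-06" (example ID)
        "messages": [{"role": "user", "content": question}],
    }

payload = build_request("openai/gpt-4o-2024-08-06", "What is 2 + 2?")
```

POSTing `payload` to `OPENROUTER_URL` with an `Authorization: Bearer <api-key>` header would return the model's answer in the usual chat-completions format.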
The scores for Chatbot Arena are fetched directly from their website and then normalized against the values of the other models in this benchmark.

👉 View more FAQs on our website
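The normalization step in the Chatbot Arena answer above is not spelled out; a simple min-max rescaling of the fetched Arena ratings to a 0-100 range is one plausible scheme (the ratings below are invented for illustration):

```python
def normalize(raw_scores):
    """Min-max rescale raw Arena ratings so the lowest-rated model maps
    to 0 and the highest-rated maps to 100."""
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    return {m: 100 * (s - lo) / (hi - lo) for m, s in raw_scores.items()}

arena = {"model-a": 1300, "model-b": 1200, "model-c": 1100}  # invented ratings
print(normalize(arena))  # {'model-a': 100.0, 'model-b': 50.0, 'model-c': 0.0}
```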
We welcome contributions from the community! If you have any questions, suggestions, or requests, please don't hesitate to create an issue. Your input is valuable in helping us improve and expand the BenchmarkAggregator.
This project is licensed under the MIT License - see the LICENSE file for details.
We're grateful to the creators and maintainers of the benchmark datasets used in this project, as well as to OpenRouter for making model integration seamless.
Made with ❤️ by the AI community