- [Apr 23, 2025] 🎉 Arena-Hard-v2.0 is finally here! Better judges, new hard prompts, and additional eval for creative writing.
- [Oct 14, 2024] 🎉 Style Control is now supported in Arena-Hard-Auto.
Arena-Hard-Auto is an automatic evaluation tool for instruction-tuned LLMs. Among popular open-ended LLM benchmarks, Arena-Hard-Auto has the highest correlation with and separability on LMArena (Chatbot Arena) (see Paper). If you are curious to see how well your model might perform on LMArena before deploying, we recommend trying Arena-Hard-Auto's newest evaluation set, Arena-Hard-v2.0-Preview.
V2.0 contains 500 fresh, challenging real-world user queries (open-ended software engineering problems, math questions, etc.) and 250 creative writing queries sourced from Chatbot Arena. We employ automatic judges, GPT-4.1 and Gemini-2.5, as a cheaper and faster approximation of human preference.
Although both Arena-Hard-Auto and Chatbot Arena Category Hard (see Blog) employ a similar pipeline to select hard prompts, Arena-Hard-Auto uses automatic judges as a cheaper and faster approximation of human preference. Check out the BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto. In the paper we also proposed metrics, such as model separability and agreement with human preference, for evaluating benchmarks' ability to rank models (see Evaluate Benchmarks for more information and code).
Hard Prompt, Style Control, and Gemini-2.5 as Judge (Official Configuration):
Model Scores (%) CI (%)
0 o3-2025-04-16 85.9 (-0.8 / +0.9)
1 o4-mini-2025-04-16-high 79.1 (-1.4 / +1.2)
2 gemini-2.5 79.0 (-2.1 / +1.8)
3 o4-mini-2025-04-16 74.6 (-1.8 / +1.6)
4 gemini-2.5-flash 68.6 (-1.6 / +1.6)
5 o3-mini-2025-01-31-high 66.1 (-1.5 / +2.1)
6 o1-2024-12-17-high 61.0 (-2.0 / +2.1)
7 claude-3-7-sonnet-20250219-thinking-16k 59.8 (-2.0 / +1.8)
8 Qwen3-235B-A22B 58.4 (-1.9 / +2.1)
9 deepseek-r1 58.0 (-2.2 / +2.0)
10 o1-2024-12-17 55.9 (-2.2 / +1.8)
11 gpt-4.5-preview 50.0 (-1.9 / +2.0)
12 o3-mini-2025-01-31 50.0 (-0.0 / +0.0)
13 gpt-4.1 50.0 (-1.9 / +1.7)
14 gpt-4.1-mini 46.9 (-2.4 / +2.1)
15 Qwen3-32B 44.5 (-2.2 / +2.1)
16 QwQ-32B 43.5 (-2.5 / +2.1)
17 Qwen3-30B-A3B 33.9 (-1.6 / +1.5)
18 claude-3-5-sonnet-20241022 33.0 (-2.3 / +1.8)
19 s1.1-32B 22.3 (-1.7 / +1.5)
20 llama4-maverick-instruct-basic 17.2 (-1.5 / +1.2)
21 Athene-V2-Chat 16.4 (-1.4 / +1.4)
22 gemma-3-27b-it 15.0 (-1.4 / +1.0)
23 Qwen3-4B 15.0 (-1.1 / +1.5)
24 gpt-4.1-nano 13.7 (-1.1 / +1.0)
25 Llama-3.1-Nemotron-70B-Instruct-HF 10.3 (-0.8 / +1.0)
26 Qwen2.5-72B-Instruct 10.1 (-0.9 / +1.3)
27 OpenThinker2-32B 3.2 (-0.3 / +0.3)
Hard Prompt, Style Control, and GPT-4.1 as Judge (if you prefer the OpenAI API)
Model Scores (%) CI (%)
0 o3-2025-04-16 87.0 (-1.0 / +1.0)
1 o4-mini-2025-04-16-high 81.7 (-1.2 / +1.2)
2 o4-mini-2025-04-16 78.0 (-1.3 / +1.4)
3 o3-mini-2025-01-31-high 64.8 (-2.1 / +1.9)
4 o1-2024-12-17-high 58.7 (-2.3 / +2.1)
5 gpt-4.1 58.3 (-2.0 / +2.3)
6 o1-2024-12-17 50.2 (-2.2 / +1.8)
7 o3-mini-2025-01-31 50.0 (-0.0 / +0.0)
8 gemini-2.5 49.1 (-2.5 / +2.4)
9 gpt-4.1-mini 48.6 (-2.7 / +1.9)
10 deepseek-r1 48.0 (-2.6 / +2.3)
11 claude-3-7-sonnet-20250219-thinking-16k 47.0 (-1.9 / +2.3)
12 Qwen3-235B-A22B 46.7 (-1.9 / +2.4)
13 gemini-2.5-flash 45.1 (-2.7 / +2.1)
14 gpt-4.5-preview 43.0 (-1.9 / +2.2)
15 QwQ-32B 36.1 (-2.0 / +2.2)
16 Qwen3-32B 35.8 (-2.1 / +2.2)
17 Qwen3-30B-A3B 28.7 (-1.4 / +2.1)
18 claude-3-5-sonnet-20241022 25.8 (-1.7 / +1.8)
19 s1.1-32B 18.3 (-2.3 / +2.2)
20 gpt-4.1-nano 15.4 (-1.1 / +1.2)
21 Athene-V2-Chat 12.6 (-1.2 / +1.3)
22 Qwen3-4B 12.6 (-1.1 / +1.5)
23 llama4-maverick-instruct-basic 12.0 (-1.0 / +1.2)
24 gemma-3-27b-it 9.7 (-0.9 / +1.1)
25 Qwen2.5-72B-Instruct 8.0 (-0.7 / +0.9)
26 Llama-3.1-Nemotron-70B-Instruct-HF 6.8 (-0.6 / +0.8)
27 OpenThinker2-32B 2.3 (-0.2 / +0.3)
Creative Writing, Ensemble GPT-4.1 and Gemini 2.5 as Judges (Best Configuration for Creative Writing)
Model Scores (%) CI (%)
0 gemini-2.5 90.8 (-1.2 / +1.3)
1 o3-2025-04-16 88.8 (-1.1 / +1.0)
2 gemini-2.5-flash 83.9 (-1.3 / +1.4)
3 deepseek-r1 77.0 (-2.0 / +1.4)
4 Qwen3-235B-A22B 73.5 (-1.8 / +1.5)
5 gemma-3-27b-it 69.9 (-1.9 / +1.7)
6 claude-3-7-sonnet-20250219-thinking-16k 63.9 (-1.7 / +1.9)
7 gpt-4.1 61.5 (-1.9 / +1.9)
8 QwQ-32B 60.9 (-2.0 / +1.6)
9 o1-2024-12-17-high 59.9 (-2.1 / +1.7)
10 o4-mini-2025-04-16-high 58.7 (-1.8 / +1.9)
11 o1-2024-12-17 56.6 (-1.8 / +1.8)
12 o4-mini-2025-04-16 55.6 (-1.8 / +2.0)
13 Qwen3-32B 53.3 (-1.9 / +1.6)
14 gpt-4.5-preview 51.4 (-1.9 / +2.0)
15 gemini-2.0-flash-001 50.0 (-0.0 / +0.0)
16 o3-mini-2025-01-31-high 43.0 (-1.7 / +2.1)
17 Qwen3-30B-A3B 34.9 (-2.0 / +1.6)
18 gpt-4.1-mini 28.2 (-1.8 / +1.8)
19 Llama-3.1-Nemotron-70B-Instruct-HF 26.9 (-2.0 / +1.8)
20 claude-3-5-sonnet-20241022 24.2 (-1.5 / +1.5)
21 OpenThinker2-32B 23.6 (-1.5 / +1.3)
22 Athene-V2-Chat 18.1 (-1.6 / +1.5)
23 Qwen3-4B 13.2 (-1.2 / +1.2)
24 gpt-4.1-nano 10.7 (-1.1 / +1.1)
25 llama4-maverick-instruct-basic 10.5 (-1.1 / +1.0)
26 Qwen2.5-72B-Instruct 10.2 (-1.1 / +1.1)
27 s1.1-32B 8.2 (-0.9 / +0.
For older leaderboards, such as Arena-Hard-v0.1, see past-leaderboards.
git clone https://github.com/lmarena/arena-hard-auto.git
cd arena-hard-auto
pip install -r requirements.txt
pip install -r requirements-optional.txt # Optional dependencies (e.g., anthropic sdk)
We have pre-generated answers and judgments for many popular models. You can browse them with an online demo or download them (with git-lfs installed) by
> git lfs install
> git clone git@hf.co:datasets/lmarena-ai/arena-hard-auto arena-hard-data
# copy answers/judgments to the data directory
> cp -r arena-hard-data/data .
Then run
> python show_result.py
Model Scores (%) CI (%)
0 o3-2025-04-16 87.6 (-0.8 / +1.0)
1 o4-mini-2025-04-16-high 82.7 (-1.4 / +1.3)
2 o4-mini-2025-04-16 78.9 (-1.6 / +1.6)
Fill in your API endpoint in config/api_config.yaml. We support OpenAI-compatible API servers, Anthropic, Vertex AI, and more; you will find examples in config/api_config.yaml.
You may use an inference engine such as vLLM or SGLang to host your model behind an OpenAI-compatible API server.
We also include support for fast built-in inference with SGLang; see examples in config/api_config.yaml and the implementation in utils/completion.py. See misc/sglang_setup.bash for environment setup.
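As a rough sketch of what an entry can look like (the field names mirror the Bedrock example later in this README, while the endpoint sub-fields and values below are illustrative assumptions; treat the examples shipped in config/api_config.yaml as the source of truth), a model served behind a local OpenAI-compatible server might be configured as:
my-local-model:                      # hypothetical alias, referenced later in model_list
    model: my-served-model-name      # model name passed to the API server
    endpoints:                       # assumed sub-fields; null would use the provider's default endpoint
        - api_base: http://localhost:8000/v1
          api_key: empty
    api_type: openai                 # assumed value for an OpenAI-compatible endpoint
    parallel: 8                      # number of concurrent requests
    max_tokens: 16000
    temperature: 0.0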
In config/gen_answer_config.yaml, add your model name in model_list.
Run the command to generate answers:
> python gen_answer.py
A caching feature is implemented: the code will skip generating an answer when an answer/judgment for the same prompt already exists (this feature is not supported for the built-in SGLang server).
In config/arena-hard-v2.0.yaml, add your model name in model_list.
...
# Add your model below for evaluation
model_list:
- deepseek-r1
- [YOUR-MODEL-NAME]
We recommend employing GPT-4.1 as the judge for fast, stable inference. To use Gemini-2.5 instead, comment out:
judge_model: gpt-4.1
temperature: 0.0
max_tokens: 16000
and uncomment:
judge_model: gemini-2.5
temperature: 1.0
max_tokens: 32000
Run the command to generate judgments:
> python gen_judgment.py
For Ensemble-as-Judges, we suggest running inference with each judge independently; the results will be aggregated when the leaderboard is displayed (see step 4).
Judgment caching is also implemented. It will skip judgments that have already been generated or that lack one of the model answers.
Output model win rates for Arena-Hard-v2.0-Preview (Hard Prompt, Style Control, GPT-4.1 as Judge):
> python show_result.py --judge-names gpt-4.1 --control-features markdown length
Output model win rates for Arena-Hard-v2.0-Preview (Creative Writing, Ensemble GPT-4.1 and Gemini 2.5 as Judges):
> python show_result.py --judge-names gpt-4.1 gemini-2.5 --category creative_writing
You can review answers and judgment results using our gradio script (gradio>=5.25.2).
> python qa_browser.py --share
Following the newly introduced Style Control on Chatbot Arena, we release Style Control on Arena-Hard-Auto! We employ the same Style Control methods as proposed in the blog post. Please refer to the blog post for the methodology and technical background.
Before applying style control, make sure your model answers have the style attributes generated. Either pull the latest data from the Hugging Face repo or run the following script.
To add style attributes to your model answers, use add_markdown_info.py. The following command takes model answers from --dir, appends style attributes (token length, number of headers, etc.), and saves the new answers in --output-dir.
> python add_markdown_info.py --dir data/arena-hard-v0.1/model_answer --output-dir data/arena-hard-v0.1/model_answer
To control for style (token length and markdown elements), use --control-features or -f when running show_result.py.
> python show_result.py -f markdown length # style control
> python show_result.py -f markdown # control for markdown density only
> python show_result.py -f length # length control only
We outline two key properties that a benchmark aiming to approximate human preference should possess to provide meaningful comparisons between models:
- Separability: the benchmark should separate models with high confidence.
- Alignment with Human Preference: the benchmark should agree with human preference.
While previous works have focused on alignment, separability is also a crucial consideration when comparing models of similar quality (e.g., different checkpoints from the same training run). However, achieving high-confidence separability is challenging due to limitations in prompt design and inherent variances in LLM evaluations. Overly simplistic prompts fail to distinguish between models, while the randomness in human and LLM judgments leads to inconsistent predictions. As a result, it is often difficult to confidently determine if a model’s apparent performance reflects a genuine difference in capability or merely noisy observations, highlighting a need for methods to verify whether a benchmark can reliably separate similar models.
Statistical measures like Pearson (Pearson, 1895) and Spearman Correlations (Spearman, 1961), commonly used in benchmarks such as AlpacaEval (Li et al., 2023) to measure correlation to human preference ranking, may fail to adequately address model separability and ranking instability. In addition, these measures only provide a coarse signal of ranking correlation without quantifying the magnitude of performance differences between model pairs. To address these shortcomings, we develop three novel metrics: Separability with Confidence, Agreement with Confidence, and Pair Rank Brier Score.
Separability with Confidence quantifies the benchmark’s confidence by measuring its consistency in predicting the winner of a model pair across random seeds through bootstrapping. This is done by calculating the percentage of model pairs that have non-overlapping confidence intervals of their benchmark scores. A higher percentage indicates that the benchmark is more confident in distinguishing between the performance of different models, as the confidence intervals of their scores do not overlap.
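As a minimal sketch of this metric (illustrative, not the exact implementation from our paper or the colab notebook; the input format and function name are assumptions), separability with confidence can be computed by bootstrapping each model's score and counting model pairs whose confidence intervals do not overlap:
import itertools
import numpy as np

def separability_with_confidence(scores_by_model, num_rounds=1000, alpha=0.05, seed=0):
    # scores_by_model: dict mapping model name -> array of per-prompt scores (assumed format).
    rng = np.random.default_rng(seed)
    intervals = {}
    for model, scores in scores_by_model.items():
        scores = np.asarray(scores, dtype=float)
        # Resample prompts with replacement and recompute the mean score.
        boot = np.array([
            rng.choice(scores, size=len(scores), replace=True).mean()
            for _ in range(num_rounds)
        ])
        intervals[model] = (np.quantile(boot, alpha / 2),
                            np.quantile(boot, 1 - alpha / 2))
    pairs = list(itertools.combinations(intervals, 2))
    # A pair counts as separable if the two confidence intervals do not overlap.
    separable = sum(
        1 for a, b in pairs
        if intervals[a][1] < intervals[b][0] or intervals[b][1] < intervals[a][0]
    )
    return separable / len(pairs) if pairs else 0.0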
For Agreement with Confidence and Pair Rank Brier Score, please refer to Section 3 of our paper. The code for calculating these metrics can be found in this colab notebook.
We have added the capability to benchmark LLMs hosted on Amazon Bedrock with Arena-Hard. Specifically, we added support for the Amazon Bedrock Invoke API in utils/completion.py, which allows you to evaluate models hosted on Amazon Bedrock.
Currently we support the following models:
- Anthropic Models : Claude 3 Haiku, Claude 3 Sonnet, Claude 3.5 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet v2, Claude 3.7 Sonnet
- Mistral Models: Mistral 7B Instruct, Mixtral 8x7B Instruct, Mistral Large v1, Mistral Large v2, Mistral Small, Pixtral Large
- Meta Llama Models: LLaMA 3 8B Instruct, LLaMA 3 70B Instruct, LLaMA 3.1 8B Instruct, LLaMA 3.1 70B Instruct, LLaMA 3.1 405B Instruct, LLaMA 3.2 1B Instruct, LLaMA 3.2 3B Instruct, LLaMA 3.2 11B Instruct, LLaMA 3.2 90B Instruct, LLaMA 2 Chat 13B, LLaMA 2 Chat 70B
- Amazon Nova Models: Amazon Nova Lite, Amazon Nova Pro, Amazon Nova Micro, Amazon Nova Premier
- DeepSeek-R1
To add a new model hosted on Amazon Bedrock, you need to update two files: config/api_config.yaml and utils/completion.py.
Define a new entry for the model with the correct model_id, api_type, and generation parameters. Example:
aws_nova_light_v1:
    model: aws_nova_light_v1
    model_id: us.amazon.nova-lite-v1:0
    endpoints: null
    api_type: aws_nova
    parallel: 8
    max_tokens: 4096
    temperature: 0.0
Key Fields
- model: Internal alias used for referencing this config.
- model_id: Bedrock-specific model identifier.
- api_type: The API handler type; it must be registered through utils/completion.py.
- endpoints: Set to null for default Bedrock endpoint, or override with custom endpoint.
- parallel: Controls parallel inference calls (adjust for throughput).
- max_tokens: Maximum output tokens.
- temperature: Controls randomness of generation (0.0 for deterministic).
Find more examples in config/api_config_bedrock_models.yaml.
Refer to the Amazon Bedrock documentation (https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html) for model IDs and capabilities.
Create a new function decorated with @register_api("<api_type>") that defines how inputs are formatted, how requests are sent to Bedrock using boto3, and how the response is parsed.
You can use existing examples as templates:
- @register_api("aws_llama") handles LLaMA models
- @register_api("aws_nova") handles Nova models
These functions typically use helpers like create_llama3_body() or create_nova_messages() and send requests using the Bedrock invoke_model API.
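As a rough sketch of such a handler (the @register_api decorator and the existing handlers are from this repository, but the function signature, request body, and response parsing below are illustrative assumptions; use the aws_llama / aws_nova handlers in utils/completion.py as the authoritative templates):
import json
import boto3

# In practice this function would be added inside utils/completion.py, next to
# the existing handlers, where the register_api decorator is defined.
@register_api("aws_my_model")  # hypothetical api_type; must match api_config.yaml
def chat_completion_aws_my_model(model, messages, temperature, max_tokens, api_dict=None):
    # The signature above mirrors the style of the existing handlers but is an assumption.
    client = boto3.client("bedrock-runtime")

    # Format the chat messages into the request body expected by the target
    # model family (the exact schema is model-specific; consult the Bedrock docs).
    body = {
        "messages": [
            {"role": m["role"], "content": [{"text": m["content"]}]} for m in messages
        ],
        "inferenceConfig": {"temperature": temperature, "maxTokens": max_tokens},
    }

    # `model` should be the Bedrock model_id configured in api_config.yaml.
    response = client.invoke_model(modelId=model, body=json.dumps(body))
    output = json.loads(response["body"].read())

    # Response parsing is also model-family specific (e.g. a top-level
    # "generation" field vs. nested output.message.content); adjust accordingly.
    return output["output"]["message"]["content"][0]["text"]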
Pay attention to:
- The api_type in api_config.yaml, which must match the name used in the @register_api(...) decorator.
- Input formatting (e.g., prompt structure, message lists).
- Parameter mapping (temperature, max_tokens, model_id).
- Response parsing (e.g., a top-level generation field vs. nested output.message.content).
By following this two-step process, users can easily extend support to any Bedrock-hosted model that follows a compatible invocation structure. For examples, see existing handlers for Claude, LLaMA, and Amazon Nova in the repository.
Feel free to submit a PR or open up an issue!
If you want to add your model to the leaderboard, please email me the following:
- An OpenAI-compatible endpoint to your model.
- An OpenAI API key for me to run judgment inference.
Sorry for the inconvenience! Since Arena-Hard-Auto is open data, we want to avoid people cheating on our leaderboard. If we find anything suspicious, we reserve the right to not add your model to our leaderboard.
The code in this repository is developed from the papers below. Please cite them if you find the repository helpful.
@article{li2024crowdsourced,
title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline},
author={Li, Tianle and Chiang, Wei-Lin and Frick, Evan and Dunlap, Lisa and Wu, Tianhao and Zhu, Banghua and Gonzalez, Joseph E and Stoica, Ion},
journal={arXiv preprint arXiv:2406.11939},
year={2024}
}
@misc{arenahard2024,
title = {From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline},
url = {https://lmsys.org/blog/2024-04-19-arena-hard/},
author = {Tianle Li* and Wei-Lin Chiang* and Evan Frick and Lisa Dunlap and Banghua Zhu and Joseph E. Gonzalez and Ion Stoica},
month = {April},
year = {2024}
}