
Arena-Hard-Auto


News

  • [Apr 23, 2025] 🎉 Arena-Hard-v2.0 is finally here! Better judges, new hard prompts, and additional eval for creative writing.
  • [Oct 14, 2024] 🎉 Style Control is now supported in Arena-Hard-Auto.

About

Arena-Hard-Auto is an automatic evaluation tool for instruction-tuned LLMs. Among popular open-ended LLM benchmarks, Arena-Hard-Auto has the highest correlation with LMArena (Chatbot Arena) and the best separability (see Paper). If you are curious how well your model might perform on LMArena before deploying, we recommend trying Arena-Hard-Auto's newest evaluation set, Arena-Hard-v2.0-Preview.

V2.0 contains 500 fresh, challenging real-world user queries (open-ended software engineering problems, math questions, etc.) and 250 creative writing queries sourced from Chatbot Arena. We employ automatic judges, GPT-4.1 and Gemini-2.5, as a cheaper and faster approximation of human preference.

Although both Arena-Hard-Auto and Chatbot Arena Category Hard (see Blog) employ a similar pipeline to select hard prompts, Arena-Hard-Auto uses automatic judges as a cheaper and faster approximation of human preference. Check out the BenchBuilder folder for code and resources on how we curate Arena-Hard-Auto. In the paper we also proposed metrics, such as model separability and agreement with human preference, for evaluating benchmarks' ability to rank models (see Evaluate Benchmarks for more information and code).

Leaderboard

Arena-Hard-v2.0-Preview

Hard Prompt, Style Control, and Gemini-2.5 as Judge (Official Configuration):

                                      Model  Scores (%)         CI (%)
0                             o3-2025-04-16        85.9  (-0.8 / +0.9)
1                   o4-mini-2025-04-16-high        79.1  (-1.4 / +1.2)
2                                gemini-2.5        79.0  (-2.1 / +1.8)
3                        o4-mini-2025-04-16        74.6  (-1.8 / +1.6)
4                          gemini-2.5-flash        68.6  (-1.6 / +1.6)
5                   o3-mini-2025-01-31-high        66.1  (-1.5 / +2.1)
6                        o1-2024-12-17-high        61.0  (-2.0 / +2.1)
7   claude-3-7-sonnet-20250219-thinking-16k        59.8  (-2.0 / +1.8)
8                           Qwen3-235B-A22B        58.4  (-1.9 / +2.1)
9                               deepseek-r1        58.0  (-2.2 / +2.0)
10                            o1-2024-12-17        55.9  (-2.2 / +1.8)
11                          gpt-4.5-preview        50.0  (-1.9 / +2.0)
12                       o3-mini-2025-01-31        50.0  (-0.0 / +0.0)
13                                  gpt-4.1        50.0  (-1.9 / +1.7)
14                             gpt-4.1-mini        46.9  (-2.4 / +2.1)
15                                Qwen3-32B        44.5  (-2.2 / +2.1)
16                                  QwQ-32B        43.5  (-2.5 / +2.1)
17                            Qwen3-30B-A3B        33.9  (-1.6 / +1.5)
18               claude-3-5-sonnet-20241022        33.0  (-2.3 / +1.8)
19                                 s1.1-32B        22.3  (-1.7 / +1.5)
20           llama4-maverick-instruct-basic        17.2  (-1.5 / +1.2)
21                           Athene-V2-Chat        16.4  (-1.4 / +1.4)
22                           gemma-3-27b-it        15.0  (-1.4 / +1.0)
23                                 Qwen3-4B        15.0  (-1.1 / +1.5)
24                             gpt-4.1-nano        13.7  (-1.1 / +1.0)
25       Llama-3.1-Nemotron-70B-Instruct-HF        10.3  (-0.8 / +1.0)
26                     Qwen2.5-72B-Instruct        10.1  (-0.9 / +1.3)
27                         OpenThinker2-32B         3.2  (-0.3 / +0.3)

Hard Prompt, Style Control, and GPT-4.1 as Judge (if you prefer the OpenAI API):

                                      Model  Scores (%)         CI (%)
0                             o3-2025-04-16        87.0  (-1.0 / +1.0)
1                   o4-mini-2025-04-16-high        81.7  (-1.2 / +1.2)
2                        o4-mini-2025-04-16        78.0  (-1.3 / +1.4)
3                   o3-mini-2025-01-31-high        64.8  (-2.1 / +1.9)
4                        o1-2024-12-17-high        58.7  (-2.3 / +2.1)
5                                   gpt-4.1        58.3  (-2.0 / +2.3)
6                             o1-2024-12-17        50.2  (-2.2 / +1.8)
7                        o3-mini-2025-01-31        50.0  (-0.0 / +0.0)
8                                gemini-2.5        49.1  (-2.5 / +2.4)
9                              gpt-4.1-mini        48.6  (-2.7 / +1.9)
10                              deepseek-r1        48.0  (-2.6 / +2.3)
11  claude-3-7-sonnet-20250219-thinking-16k        47.0  (-1.9 / +2.3)
12                          Qwen3-235B-A22B        46.7  (-1.9 / +2.4)
13                         gemini-2.5-flash        45.1  (-2.7 / +2.1)
14                          gpt-4.5-preview        43.0  (-1.9 / +2.2)
15                                  QwQ-32B        36.1  (-2.0 / +2.2)
16                                Qwen3-32B        35.8  (-2.1 / +2.2)
17                            Qwen3-30B-A3B        28.7  (-1.4 / +2.1)
18               claude-3-5-sonnet-20241022        25.8  (-1.7 / +1.8)
19                                 s1.1-32B        18.3  (-2.3 / +2.2)
20                             gpt-4.1-nano        15.4  (-1.1 / +1.2)
21                           Athene-V2-Chat        12.6  (-1.2 / +1.3)
22                                 Qwen3-4B        12.6  (-1.1 / +1.5)
23           llama4-maverick-instruct-basic        12.0  (-1.0 / +1.2)
24                           gemma-3-27b-it         9.7  (-0.9 / +1.1)
25                     Qwen2.5-72B-Instruct         8.0  (-0.7 / +0.9)
26       Llama-3.1-Nemotron-70B-Instruct-HF         6.8  (-0.6 / +0.8)
27                         OpenThinker2-32B         2.3  (-0.2 / +0.3)

Creative Writing, Ensemble GPT-4.1 and Gemini 2.5 as Judges (Best Configuration for Creative Writing)

                                      Model  Scores (%)         CI (%)
0                                gemini-2.5        90.8  (-1.2 / +1.3)
1                             o3-2025-04-16        88.8  (-1.1 / +1.0)
2                          gemini-2.5-flash        83.9  (-1.3 / +1.4)
3                               deepseek-r1        77.0  (-2.0 / +1.4)
4                           Qwen3-235B-A22B        73.5  (-1.8 / +1.5)
5                            gemma-3-27b-it        69.9  (-1.9 / +1.7)
6   claude-3-7-sonnet-20250219-thinking-16k        63.9  (-1.7 / +1.9)
7                                   gpt-4.1        61.5  (-1.9 / +1.9)
8                                   QwQ-32B        60.9  (-2.0 / +1.6)
9                        o1-2024-12-17-high        59.9  (-2.1 / +1.7)
10                  o4-mini-2025-04-16-high        58.7  (-1.8 / +1.9)
11                            o1-2024-12-17        56.6  (-1.8 / +1.8)
12                       o4-mini-2025-04-16        55.6  (-1.8 / +2.0)
13                                Qwen3-32B        53.3  (-1.9 / +1.6)
14                          gpt-4.5-preview        51.4  (-1.9 / +2.0)
15                     gemini-2.0-flash-001        50.0  (-0.0 / +0.0)
16                  o3-mini-2025-01-31-high        43.0  (-1.7 / +2.1)
17                            Qwen3-30B-A3B        34.9  (-2.0 / +1.6)
18                             gpt-4.1-mini        28.2  (-1.8 / +1.8)
19       Llama-3.1-Nemotron-70B-Instruct-HF        26.9  (-2.0 / +1.8)
20               claude-3-5-sonnet-20241022        24.2  (-1.5 / +1.5)
21                         OpenThinker2-32B        23.6  (-1.5 / +1.3)
22                           Athene-V2-Chat        18.1  (-1.6 / +1.5)
23                                 Qwen3-4B        13.2  (-1.2 / +1.2)
24                             gpt-4.1-nano        10.7  (-1.1 / +1.1)
25           llama4-maverick-instruct-basic        10.5  (-1.1 / +1.0)
26                     Qwen2.5-72B-Instruct        10.2  (-1.1 / +1.1)
27                                 s1.1-32B         8.2  (-0.9 / +0.

For older leaderboards, such as Arena-Hard-v0.1, see past-leaderboards.

Install Dependencies

git clone https://github.com/lmarena/arena-hard-auto.git
cd arena-hard-auto
pip install -r requirements.txt
pip install -r requirements-optional.txt  # Optional dependencies (e.g., anthropic sdk)

Download dataset

We have pre-generated answers and judgments for many popular models. You can browse them with the online demo or download them (with git-lfs installed) by running:

> git lfs install
> git clone [email protected]:datasets/lmarena-ai/arena-hard-auto arena-hard-data
# copy answers/judgments to the data directory
> cp -r arena-hard-data/data . 

Then run

> python show_result.py
                                      Model  Scores (%)         CI (%)
0                             o3-2025-04-16        87.6  (-0.8 / +1.0)
1                   o4-mini-2025-04-16-high        82.7  (-1.4 / +1.3)
2                        o4-mini-2025-04-16        78.9  (-1.6 / +1.6)

Evaluate

Step 1. Set up the endpoint config to your model

Fill in your API endpoint in config/api_config.yaml. We support OpenAI-compatible API servers, Anthropic, Vertex AI, and more. You will find examples in config/api_config.yaml.

You may use an inference engine such as vLLM or SGLang to host your model behind an OpenAI-compatible API server.

We also include support for fast built-in inference with SGLang; see examples in config/api_config.yaml and the implementation in utils/completion.py. See misc/sglang_setup.bash for environment setup.
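
Once your server is up, a quick way to confirm the endpoint responds is a short call through the OpenAI SDK. This is a minimal sketch: the base URL, API key, and model name below are placeholders and should match the values in your config/api_config.yaml entry.

# Minimal sanity check against an OpenAI-compatible endpoint (placeholder values).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # your vLLM/SGLang server
    api_key="EMPTY",                      # many local servers accept any key
)

response = client.chat.completions.create(
    model="YOUR-MODEL-NAME",  # must match the name your engine serves
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)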

Step 2. Generate Model Answers

In config/gen_answer_config.yaml, add your model name in model_list.

Run the command to generate answers:

> python gen_answer.py

Caching is implemented: the code will skip generating an answer when an existing answer/judgment for the same prompt is found (this feature is not supported for the built-in SGLang server).
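
For intuition, the skip check is conceptually similar to the sketch below. This is illustrative only; the field names (e.g., uid) and file layout are assumptions rather than the exact implementation in gen_answer.py.

# Conceptual sketch of answer caching: skip prompts that already have answers.
# The "uid" field name and JSONL layout are assumptions for illustration.
import json
import os

def load_cached_uids(answer_file: str) -> set:
    """Collect the question ids that already have a generated answer."""
    if not os.path.exists(answer_file):
        return set()
    with open(answer_file) as f:
        return {json.loads(line)["uid"] for line in f if line.strip()}

def should_generate(question: dict, cached_uids: set) -> bool:
    """Generate only if this question's uid is not in the answer file yet."""
    return question["uid"] not in cached_uids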

Step 3. Generate Judgments

In config/arena-hard-v2.0.yaml, add your model name in model_list.

...
# Add your model below for evaluation
model_list:
  - deepseek-r1
  - [YOUR-MODEL-NAME]

We recommend employing GPT-4.1 as the judge for fast, stable judge inference. To use Gemini-2.5 instead, comment out:

judge_model: gpt-4.1
temperature: 0.0
max_tokens: 16000

and uncomment:

judge_model: gemini-2.5
temperature: 1.0
max_tokens: 32000

Run the command to generate judgments:

> python gen_judgment.py

For Ensemble-as-Judges, we suggest running inference with both judges independently; we will aggregate the results when displaying the leaderboard for you (see Step 4).

Judgment caching is also implemented. It will skip judgments that have already been generated or that lack one of the model answers.

Step 4. Show result

Output model win rates for Arena-Hard-v2.0-Preview (Hard Prompt, Style Control, GPT-4.1 as Judge):

> python show_result.py --judge-names gpt-4.1 --control-features markdown length

Output model win rates for Arena-Hard-v2.0-Preview (Creative Writing, Ensemble GPT-4.1 and Gemini 2.5 as Judges):

> python show_result.py --judge-names gpt-4.1 gemini-2.5 --category creative_writing

Step 5. Benchmark Viewer

You can review answers and judgment results using our Gradio script (requires gradio>=5.25.2).

> python qa_browser.py --share

Style Control

Following the newly introduced Style Control on Chatbot Arena, we now support Style Control on Arena-Hard-Auto! We employ the same Style Control method as proposed in the blogpost; please refer to it for methodology and technical background.

Before applying style control, make sure your model answers have the style attributes generated. Either pull the latest data from the Hugging Face repo or run the script below.

To add style attributes to your model answers, use add_markdown_info.py. The following command takes model answers from --dir, appends style attributes (token length, number of headers, etc.), and saves the new answers to --output-dir.

> python add_markdown_info.py --dir data/arena-hard-v0.1/model_answer --output-dir data/arena-hard-v0.1/model_answer

To control for style (token length and markdown elements), use --control-features or -f when running show_result.py.

> python show_result.py -f markdown length # style control
> python show_result.py -f markdown # control for markdown density only
> python show_result.py -f length # length control only
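
At a high level, style control augments the pairwise (Bradley-Terry-style) logistic regression with style features, so model coefficients are estimated while controlling for differences such as answer length and markdown density. The sketch below is conceptual only; the actual feature construction and normalization in show_result.py may differ.

# Conceptual sketch of style-controlled pairwise regression (simplified;
# not the exact implementation used by show_result.py).
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_scores(X_models, X_style, wins):
    """
    X_models: (n_battles, n_models) matrix with +1 for model A, -1 for model B.
    X_style:  (n_battles, n_style_features) normalized style differences
              (e.g., token-length diff, markdown-element diff) between A and B.
    wins:     (n_battles,) array with 1 if A won, 0 if B won.
    """
    X = np.hstack([X_models, X_style])
    clf = LogisticRegression(fit_intercept=False)
    clf.fit(X, wins)
    # Only the model coefficients are used for ranking; the style coefficients
    # absorb the effect of length/markdown on judge preference.
    n_models = X_models.shape[1]
    return clf.coef_[0][:n_models]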

Evaluate Benchmarks

We outline two key properties that a benchmark aiming to approximate human preference should possess to provide meaningful comparisons between models:

  1. Separability: the benchmark should separate models with high confidence.
  2. Alignment with Human Preference: the benchmark should agree with human preference.

While previous works have focused on alignment, separability is also a crucial consideration when comparing models of similar quality (e.g., different checkpoints from the same training run). However, achieving high-confidence separability is challenging due to limitations in prompt design and inherent variances in LLM evaluations. Overly simplistic prompts fail to distinguish between models, while the randomness in human and LLM judgments leads to inconsistent predictions. As a result, it is often difficult to confidently determine if a model’s apparent performance reflects a genuine difference in capability or merely noisy observations, highlighting a need for methods to verify whether a benchmark can reliably separate similar models.

Statistical measures like Pearson (Pearson, 1895) and Spearman (Spearman, 1961) correlations, commonly used in benchmarks such as AlpacaEval (Li et al., 2023) to measure correlation with human preference rankings, may fail to adequately address model separability and ranking instability. In addition, these measures only provide a coarse signal of ranking correlation without quantifying the magnitude of performance differences between model pairs. To address these shortcomings, we develop three novel metrics: Separability with Confidence, Agreement with Confidence, and Pair Rank Brier Score.

Separability with Confidence quantifies the benchmark’s confidence by measuring its consistency in predicting the winner of a model pair across random seeds through bootstrapping. This is done by calculating the percentage of model pairs that have non-overlapping confidence intervals of their benchmark scores. A higher percentage indicates that the benchmark is more confident in distinguishing between the performance of different models, as the confidence intervals of their scores do not overlap.
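
As a rough illustration, given bootstrap confidence intervals for each model's benchmark score, the metric is the fraction of model pairs whose intervals do not overlap. The snippet below is a minimal sketch, not the exact code from the notebook.

# Fraction of model pairs with non-overlapping bootstrap confidence intervals.
from itertools import combinations

def separability_with_confidence(intervals: dict) -> float:
    """intervals maps model name -> (ci_lower, ci_upper) of its benchmark score."""
    pairs = list(combinations(intervals, 2))
    separated = 0
    for a, b in pairs:
        lo_a, hi_a = intervals[a]
        lo_b, hi_b = intervals[b]
        if hi_a < lo_b or hi_b < lo_a:  # no overlap -> confidently separated
            separated += 1
    return separated / len(pairs) if pairs else 0.0

# Example with made-up intervals:
# separability_with_confidence({"m1": (60.1, 63.0), "m2": (55.2, 58.4), "m3": (57.0, 61.0)})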

For Agreement with Confidence and Pair Rank Brier Score, please refer to Section 3 of our paper. The code for calculating these metrics can be found in this Colab notebook.

Integration with Amazon Bedrock API

We have added the capability to benchmark LLMs hosted on Amazon Bedrock with Arena-Hard. Specifically, we added support for the Amazon Bedrock invoke API in utils/completion.py, which allows you to evaluate models hosted on Amazon Bedrock with Arena-Hard.

Currently we support the following models:

  1. Anthropic Models: Claude 3 Haiku, Claude 3 Sonnet, Claude 3.5 Sonnet, Claude 3 Opus, Claude 3.5 Sonnet v2, Claude 3.7 Sonnet
  2. Mistral Models: Mistral 7B Instruct, Mixtral 8x7B Instruct, Mistral Large v1, Mistral Large v2, Mistral Small, Pixtral Large
  3. Meta Llama Models: LLaMA 3 8B Instruct, LLaMA 3 70B Instruct, LLaMA 3.1 8B Instruct, LLaMA 3.1 70B Instruct, LLaMA 3.1 405B Instruct, LLaMA 3.2 1B Instruct, LLaMA 3.2 3B Instruct, LLaMA 3.2 11B Instruct, LLaMA 3.2 90B Instruct, LLaMA 2 Chat 13B, LLaMA 2 Chat 70B
  4. Amazon Nova Models: Amazon Nova Lite, Amazon Nova Pro, Amazon Nova Micro, Amazon Nova Premier
  5. DeepSeek-R1

To add a new model hosted on Amazon Bedrock, you need to update two files: config/api_config.yaml and utils/completion.py.

1. Update config/api_config.yaml

Define a new entry for the model with the correct model_id, api_type, and generation parameters.

Example:

aws_nova_light_v1:
  model: aws_nova_light_v1
  model_id: us.amazon.nova-lite-v1:0
  endpoints: null
  api_type: aws_nova
  parallel: 8
  max_tokens: 4096
  temperature: 0.0

Key Fields

  1. model: Internal alias used for referencing this config.
  2. model_id: Bedrock-specific model identifier.
  3. api_type: Must match an API type registered in utils/completion.py (see step 2 below).
  4. endpoints: Set to null for the default Bedrock endpoint, or override with a custom endpoint.
  5. parallel: Controls parallel inference calls (adjust for throughput).
  6. max_tokens: Maximum output tokens.
  7. temperature: Controls randomness of generation (0.0 for deterministic).

Find more examples in config/api_config_bedrock_models.yaml. Refer to the Amazon Bedrock documentation (https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html) for model IDs and capabilities.

2. Register a Model Handler in utils/completion.py

Create a new function decorated with @register_api("<api_type>") to define how inputs are formatted, sent to Bedrock using boto3, and how the response is parsed.

You can use existing examples as templates:

  • @register_api("aws_llama") handles LLaMA models
  • @register_api("aws_nova") handles Nova models

These functions typically use helpers like create_llama3_body() or create_nova_messages() and send requests using the Bedrock invoke_model API.
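
For orientation, a handler might look roughly like the sketch below. The function signature, decorator placement, and Nova-style request/response fields are illustrative assumptions; see the existing handlers in utils/completion.py for the actual interface.

# Illustrative sketch of a Bedrock handler (signature and fields are assumptions;
# the real handlers live in utils/completion.py and may differ).
import json
import boto3

# @register_api("aws_my_model")  # must match api_type in config/api_config.yaml
def chat_completion_aws_my_model(model_id, messages, temperature, max_tokens):
    client = boto3.client("bedrock-runtime")
    # Nova-style native request body; other model families expect different schemas.
    body = {
        "messages": [
            {"role": m["role"], "content": [{"text": m["content"]}]} for m in messages
        ],
        "inferenceConfig": {"temperature": temperature, "maxTokens": max_tokens},
    }
    response = client.invoke_model(modelId=model_id, body=json.dumps(body))
    output = json.loads(response["body"].read())
    # Nova-style responses nest the text under output.message.content;
    # LLaMA-style responses return a top-level "generation" field instead.
    return output["output"]["message"]["content"][0]["text"]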

Pay attention to:

  • The api_type in `api_config.yaml`, which must match the name used in the `@register_api(...)` decorator.
  • Input formatting (e.g., prompt structure, message lists).
  • Parameter mapping (temperature, max_tokens, model_id).
  • Response parsing (e.g., a top-level generation field vs. nested output.message.content).

By following this two-step process, users can easily extend support to any Bedrock-hosted model that follows a compatible invocation structure. For examples, see existing handlers for Claude, LLaMA, and Amazon Nova in the repository.

Community Contribution

Feel free to submit a PR or open up an issue!

If you want to add your model to the leaderboard, please email me the following:

  1. An OpenAI compatible endpoint to your model.
  2. An OpenAI API key for me to run judge inference.

Sorry for the inconvenience! Since Arena-Hard-Auto is open data, we want to avoid people cheating on our leaderboard. If we find anything suspicious, we reserve the right not to add your model to our leaderboard.

Citation

The code in this repository is developed from the papers below. Please cite them if you find the repository helpful.

@article{li2024crowdsourced,
  title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline},
  author={Li, Tianle and Chiang, Wei-Lin and Frick, Evan and Dunlap, Lisa and Wu, Tianhao and Zhu, Banghua and Gonzalez, Joseph E and Stoica, Ion},
  journal={arXiv preprint arXiv:2406.11939},
  year={2024}
}
@misc{arenahard2024,
    title = {From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline},
    url = {https://lmsys.org/blog/2024-04-19-arena-hard/},
    author = {Tianle Li* and Wei-Lin Chiang* and Evan Frick and Lisa Dunlap and Banghua Zhu and Joseph E. Gonzalez and Ion Stoica},
    month = {April},
    year = {2024}
}
