am-ELO: A Stable Framework for Arena-based LLM Evaluation

This repository provides the code for the paper "am-ELO: A Stable Framework for Arena-based LLM Evaluation" (arXiv:2505.03475), which has been accepted to ICML 2025 and will be presented as a Spotlight paper.

Requirements

Our code runs in a standard Python environment with pandas, numpy, and torch; the HuggingFace datasets library is used to download the dataset. You can set up the environment from scratch with the following steps.

First, create a conda environment:

conda create --name ELO python=3.11.5

To install requirements:

pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Structure

The code is organized as follows:

am-ELO/
│
├── data/
│   └── Chatbot.csv
│
├── dataset.py
├── model.py
├── setting.py
├── utils.py
└── main.py

Chatbot.csv is the dataset downloaded from HuggingFace.

dataset.py is the code for processing the real data and generating simulated data.

model.py is the code for the estimation models (traditional ELO, m-ELO, and am-ELO).
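
For orientation, the snippet below is a minimal sketch of the classic online ELO update that the traditional ELO baseline builds on. It is illustrative rather than the actual code in model.py, and the column names ("model_a", "model_b", "winner") are assumptions about the preprocessed CSV.

import pandas as pd

def elo_scores(logs: pd.DataFrame, k: float = 4.0, init: float = 1000.0) -> dict:
    """Classic sequential ELO update over pairwise comparison logs."""
    ratings = {}
    for _, row in logs.iterrows():
        a, b = row["model_a"], row["model_b"]
        ra, rb = ratings.get(a, init), ratings.get(b, init)
        # Expected win probability of model a under the logistic ELO model.
        ea = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        # Observed outcome: 1 if a wins, 0 if b wins, 0.5 for ties.
        sa = {"model_a": 1.0, "model_b": 0.0}.get(row["winner"], 0.5)
        ratings[a] = ra + k * (sa - ea)
        ratings[b] = rb + k * ((1.0 - sa) - (1.0 - ea))
    return ratings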

Data Downloading and Preprocessing

To obtain the dataset used in the experiments, see the Chatbot Arena dataset card on HuggingFace (lmsys/chatbot_arena_conversations).

cd data
python prepare_data.py --key=xxx
cd ..

Note: the key argument is your HuggingFace access token.
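
For reference, the block below sketches roughly what this preparation step amounts to: downloading the Chatbot Arena conversations with the HuggingFace datasets library and saving the pairwise judgments to CSV. The columns kept and the output layout of the actual prepare_data.py may differ; the field names used here come from the public dataset card.

from datasets import load_dataset

def prepare_chatbot_csv(hf_token: str, out_path: str = "Chatbot.csv") -> None:
    # The dataset is gated, so a HuggingFace access token is required.
    ds = load_dataset("lmsys/chatbot_arena_conversations", split="train", token=hf_token)
    df = ds.to_pandas()
    # Keep only the pairwise judgment fields needed for ELO fitting.
    df[["judge", "model_a", "model_b", "winner"]].to_csv(out_path, index=False)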

When the code is run, annotators with fewer than 50 annotation records are filtered out. The statistics of the filtered dataset are listed below.

Dataset                          Chatbot
#Annotators                      42
#Models                          20
#Response logs                   4,321
#Response logs per annotator     102.88
#Response logs per model         216.05
#Response logs per model pair    22.74

Run

To run the model, use the following command:

python main.py --method=m-ELO --data_path=data/Chatbot.csv --seed=2025 --device=cuda

method selects the estimation method for the ELO scores: random, ELO, m-ELO, or am-ELO.

data_path is the path to the dataset used in the experiment, e.g. data/Chatbot.csv.

seed controls the parameter initialization; the seeds used in the paper are 2023, 2024, 2025, 2026, and 2027 (a sweep over all of them is sketched below).
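
To reproduce the reported means and standard deviations, the runs can be swept over all methods and seeds. A small convenience sketch, assuming exactly the command-line flags shown above:

import subprocess

for method in ["random", "ELO", "m-ELO", "am-ELO"]:
    for seed in [2023, 2024, 2025, 2026, 2027]:
        # Each call launches one estimation run of main.py.
        subprocess.run(
            ["python", "main.py", f"--method={method}",
             "--data_path=data/Chatbot.csv", f"--seed={seed}", "--device=cuda"],
            check=True,
        )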

Result

The prediction performance of the ELO methods (Table 2):

Method    MSE               AUC
ELO       0.1238 ± 0.0031   0.7492 ± 0.0068
m-ELO     0.1234 ± 0.0029   0.7503 ± 0.0066
am-ELO    0.1208 ± 0.0034   0.7581 ± 0.0067
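
Table 2 reports the mean squared error and AUC of the predicted win probabilities against the observed outcomes. The snippet below is a hedged sketch of how these two metrics can be computed from arrays of predictions and binary outcomes using numpy alone; it is not the evaluation code in this repository, and how ties are handled there is not assumed.

import numpy as np

def mse_auc(p_win_a: np.ndarray, a_won: np.ndarray) -> tuple[float, float]:
    """p_win_a: predicted P(model_a beats model_b); a_won: 1/0 observed outcomes."""
    mse = float(np.mean((p_win_a - a_won) ** 2))
    pos, neg = p_win_a[a_won == 1], p_win_a[a_won == 0]
    # AUC = probability that a random positive is scored above a random negative,
    # counting exact ties as one half (Mann-Whitney formulation).
    diff = pos[:, None] - neg[None, :]
    auc = float((diff > 0).mean() + 0.5 * (diff == 0).mean())
    return mse, auc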

The ELO scores and the loss of the ELO methods (Figure 3):

The heatmap of the case study (Figure 4):

The loss and consistency of the ELO methods (Figure 5):

The simulation experiments (Figures 6 and 7):

Contributing

📋 We use the MIT License.
