Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework (ACL 2025 Findings)
ARJudge is a unified and robust framework for multi-faceted evaluation of language model responses. The system learns to align evaluation across different criteria by generating instruction-specific evaluation questions and providing detailed analysis before making final judgments.
- vllm=0.5.4
- transformers=4.44.2
- trl=0.9.6
- alignment-handbook
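If you install with pip, the pinned versions above can be captured in a requirements file. This is only a sketch: `alignment-handbook` is typically installed from its GitHub repository rather than PyPI, so follow that project's own installation instructions for it.

```
vllm==0.5.4
transformers==4.44.2
trl==0.9.6
```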
Download the Composite Analysis Corpus (Google Drive)
| Model Name | HF Checkpoint | Size | License |
|---|---|---|---|
| ARJudge | 🤗 kaishxu/ARJudge | 7B | Qwen2.5 |
Our model is based on Qwen2.5-7B-Instruct. Please download the base model from Hugging Face, then train the ARJudge model using the provided training script:
# Configure training parameters in configs/config.yaml
# Then run training
bash scripts/train.sh
The training script will:
- Load the Qwen2.5-7B-Instruct base model
- Fine-tune on the ARJudge training data
- Save checkpoints to the specified output directory
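As a rough illustration of the fine-tuning step, one way to turn a `{"prompt", "output"}` record (the training format shown later in this README) into a chat-style example for an SFT trainer is sketched below. The exact message layout depends on the training script, so treat this as an assumption:

```python
def to_chat_example(record):
    """Convert a {"prompt": ..., "output": ...} record into a chat-style
    example (message layout is an assumption; adjust to the actual
    training script and corpus fields)."""
    return {
        "messages": [
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }

example = to_chat_example({"prompt": "Evaluate ...", "output": "Analysis ..."})
```

A chat template (e.g. the Qwen2.5 tokenizer's) can then render such message lists into training sequences.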
Evaluate the trained model on test datasets:
# Run evaluation on Auto-J dataset
bash scripts/eval.sh
The evaluation pipeline includes:
- Analysis Generation: Generate multi-faceted analysis using the trained ARJudge model
- Judgment Refinement: Use a base model to make final comparative judgments
- Metrics Calculation: Compute evaluation metrics
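A minimal sketch of the three-stage pipeline's control flow. The logic here is placeholder only (the real stages call the trained ARJudge model and a base model), and the function names are hypothetical:

```python
def generate_analysis(instruction, response):
    """Stage 1 (placeholder): the real pipeline queries the trained
    ARJudge model for a multi-faceted analysis of the response."""
    return f"analysis of {response!r} for {instruction!r}"

def refine_judgment(analysis1, analysis2):
    """Stage 2 (placeholder): the real pipeline prompts a base model
    with both analyses and returns the preferred response, 1 or 2.
    Comparing lengths here is purely illustrative."""
    return 1 if len(analysis1) >= len(analysis2) else 2

def agreement_accuracy(predicted, gold):
    """Stage 3: fraction of judgments that match the gold winner."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```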
Key parameters in configs/config.yaml:
# Model settings
model_name_or_path: /path/to/Qwen2.5-7B-Instruct
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Training settings
learning_rate: 1.0e-05
num_train_epochs: 2
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
max_seq_length: 4096
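These settings imply an effective batch size of per-device batch × gradient accumulation steps × number of GPUs. A quick check (the GPU count is an assumption; adjust to your hardware):

```python
per_device_train_batch_size = 4   # from configs/config.yaml
gradient_accumulation_steps = 4   # from configs/config.yaml
num_gpus = 8                      # assumption: adjust to your hardware

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 128 with the values above
```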
Supported evaluation datasets include Auto-J, MT-Bench, etc.; modify the dataset setting accordingly. The data files use the following formats (an instruction–response record and a pairwise comparison record, respectively):
{
"prompt": "instruction",
"output": "response"
}
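A small validator for records in this format, assuming the corpus is stored as JSON lines (one object per line; the on-disk layout is an assumption, since the README only shows a single record):

```python
import json

REQUIRED_KEYS = {"prompt", "output"}

def validate_record(obj):
    """Check that a parsed record has the expected keys with string
    values (schema taken from the example above)."""
    return (isinstance(obj, dict)
            and REQUIRED_KEYS <= obj.keys()
            and all(isinstance(obj[k], str) for k in REQUIRED_KEYS))

def load_training_records(path):
    """Load records from a JSON-lines file and reject malformed ones."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    bad = [i for i, r in enumerate(records) if not validate_record(r)]
    if bad:
        raise ValueError(f"malformed records at lines: {bad}")
    return records
```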
{
"idx": 0,
"instruction": "instruction",
"response1": "response1",
"response2": "response2",
"winner": 1 // or 2
}
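Note that the `// or 2` comment above is only illustrative; strict JSON parsers reject comments, so actual files should store a plain integer. A sketch of a schema check for pairwise records (field names taken from the example above):

```python
REQUIRED_EVAL_KEYS = {"idx", "instruction", "response1", "response2", "winner"}

def validate_eval_record(obj):
    """Check a pairwise evaluation record: all expected keys present
    and the gold winner restricted to response 1 or 2."""
    return (isinstance(obj, dict)
            and REQUIRED_EVAL_KEYS <= obj.keys()
            and obj["winner"] in (1, 2))
```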