
🔍 Has Machine Translation Evaluation Achieved Human Parity?
The Human Reference and the Limits of Progress

Conference arXiv License: CC BY-NC-SA 4.0

Python 3.10+ Code style: black

⚙️ Setup

The code in this repo requires Python 3.10 or higher. We recommend creating a new conda environment as follows:

conda create -n human-parity-mt-eval python=3.10
conda activate human-parity-mt-eval
pip install --upgrade pip

All scripts in this repository rely on Google's WMT Metrics evaluation package (mt-metrics-eval). Clone and install it with the following commands:

git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .

Then, download the WMT Metrics evaluation datasets:

alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download  # Puts ~2G of data into $HOME/.mt-metrics-eval.
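
Optionally, you can sanity-check the download with the mt_metrics_eval Python API. The snippet below is a minimal sketch, assuming the data sits in the default $HOME/.mt-metrics-eval location and following the upstream library's documented EvalSet interface; adjust the year and language pair as needed.

# Quick sanity check that the WMT Metrics data is in place (illustrative sketch).
from mt_metrics_eval import data

# Load the WMT22 en-de evaluation set from the default download directory.
evs = data.EvalSet("wmt22", "en-de")

print("Systems:", sorted(evs.sys_names))

# Segment-level MQM gold scores, keyed by system name.
mqm_scores = evs.Scores("seg", "mqm")
print("Systems with MQM annotations:", len(mqm_scores))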

📁 Data

The data/ directory contains all the information required to reproduce the analyses presented in our paper. The structure is organized by WMT evaluation year and language pair, and includes both human annotations and automatic metric outputs.

📂 Directory tree
data
├── annotations
│   ├── wmt20
│   │   ├── en-de
│   │   │   ├── mqm-col1.pickle
│   │   │   ├── mqm-col2.pickle
│   │   │   ├── mqm-col3.pickle
│   │   │   ├── psqm-col1.pickle
│   │   │   ├── psqm-col2.pickle
│   │   │   └── psqm-col3.pickle
│   │   └── zh-en
│   │       ├── mqm-col1.pickle
│   │       ├── mqm-col2.pickle
│   │       ├── mqm-col3.pickle
│   │       ├── psqm-col1.pickle
│   │       ├── psqm-col2.pickle
│   │       └── psqm-col3.pickle
│   ├── wmt22
│   │   ├── en-de
│   │   │   ├── en-de.ESA-1.seg.score
│   │   │   ├── en-de.ESA-2.seg.score
│   │   │   ├── en-de.MQM-1.seg.score
│   │   │   ├── mqm-col1.pickle
│   │   │   ├── mqm-col2.pickle
│   │   │   └── mqm-col3.pickle
│   │   └── en-zh
│   │       ├── mqm-col1.pickle
│   │       ├── mqm-col2.pickle
│   │       └── mqm-col3.pickle
│   └── wmt23
│       ├── en-de
│       │   ├── mqm-col1_more_data.pickle
│       │   ├── mqm-col1.pickle
│       │   ├── mqm-col2_more_data.pickle
│       │   ├── mqm-col2.pickle
│       │   ├── mqm-col3_more_data.pickle
│       │   └── mqm-col3.pickle
│       └── zh-en
│           ├── mqm-col1.pickle
│           ├── mqm-col2.pickle
│           └── mqm-col3.pickle
├── metrics_info
│   ├── wmt20
│   │   └── out_paths.tsv
│   ├── wmt22
│   │   └── out_paths.tsv
│   └── wmt23
│       └── out_paths.tsv
├── metrics_outputs
│   ├── wmt20
│   │   ├── en-de
│   │   │   └── BLEURT-20
│   │   └── zh-en
│   │       └── BLEURT-20
│   ├── wmt22
│   │   ├── en-de
│   │   │   ├── CometKiwi-XL
│   │   │   ├── CometKiwi-XXL
│   │   │   ├── MetricX-23-QE-XXL
│   │   │   └── MetricX-23-XXL
│   │   └── en-zh
│   │       ├── CometKiwi-XL
│   │       ├── CometKiwi-XXL
│   │       ├── MetricX-23-QE-XXL
│   │       └── MetricX-23-XXL
│   └── wmt23
│       ├── en-de
│       │   ├── MetricX-23-QE-XXL
│       │   └── MetricX-23-XXL
│       └── zh-en
│           ├── MetricX-23-QE-XXL
│           └── MetricX-23-XXL
└── rankings
    ├── wmt20
    │   ├── en-de
    │   └── zh-en
    ├── wmt22
    │   ├── en-de
    │   └── en-zh
    ├── wmt23
    │   ├── en-de
    │   └── zh-en
    └── wmt24
        └── en-es

🧾 Description of the contents

  • 📄 annotations/
    Contains human annotations collected under MT evaluation protocols (e.g., MQM, PSQM, ESA) across multiple WMT editions and language pairs. These annotations serve as the human MT evaluators in our analysis (see the loading sketch after this list).

  • ℹ️ metrics_info/
    Stores metadata about the additional automatic metrics we included in our study (beyond those originally submitted to WMT). These metadata consist of metric names and output file paths.

  • 📈 metrics_outputs/
    Includes the actual outputs of the additional automatic metrics for each WMT year and language pair.

  • 🏆 rankings/
    Stores the final rankings of all evaluators (both automatic metrics and humans), as produced by the run_mt_meta_eval.py script.
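
The annotation .pickle files and the metric metadata in out_paths.tsv can be inspected with standard Python. The snippet below is a minimal sketch: the exact structure of the unpickled objects (and the column layout of the TSV) is an assumption on our part, so adapt the prints to what you actually find.

import csv
import pickle

# Peek at one set of human annotations (object structure assumed; inspect what you get).
with open("data/annotations/wmt22/en-de/mqm-col1.pickle", "rb") as f:
    annotations = pickle.load(f)
print(type(annotations))
if isinstance(annotations, dict):
    print("Example keys (e.g., system names):", list(annotations)[:5])

# Metadata for the additional metrics: one row per metric (name and output path).
with open("data/metrics_info/wmt22/out_paths.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)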

🏃‍♂️ Running the code

To reproduce the results presented in our paper, run the scripts/run_mt_meta_eval.py script, which performs the meta-evaluation over both automatic MT metrics and human evaluators.


📊 Reproducing Meta-Evaluation Results

WMT20

🌍 Language Pair: en-de

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt20 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt20 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt20/out_paths.tsv > data/rankings/wmt20/en-de/ranking.txt

🌍 Language Pair: zh-en

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt20 \
    --lp zh-en \
    --new-human-annotations-dir data/annotations/wmt20 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt20/out_paths.tsv > data/rankings/wmt20/zh-en/ranking.txt

WMT22

🌍 Language Pair: en-de

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt22 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt22 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt22/out_paths.tsv > data/rankings/wmt22/en-de/ranking.txt

🌍 Language Pair: en-zh

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt22 \
    --lp en-zh \
    --new-human-annotations-dir data/annotations/wmt22 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt22/out_paths.tsv > data/rankings/wmt22/en-zh/ranking.txt

WMT23

🌍 Language Pair: en-de

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt23 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt23 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt23/out_paths.tsv > data/rankings/wmt23/en-de/ranking.txt

🌍 Language Pair: zh-en

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt23 \
    --lp zh-en \
    --new-human-annotations-dir data/annotations/wmt23 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt23/out_paths.tsv > data/rankings/wmt23/zh-en/ranking.txt

WMT24

🌍 Language Pair: en-es

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt24 \
    --lp en-es \
    --gold-name mqm > data/rankings/wmt24/en-es/ranking.txt
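
If you prefer to drive the meta-evaluation from Python (for example, to loop over years and language pairs), a wrapper along the following lines should work. It is an illustrative sketch and not part of the repository: the script path and flags are copied from the commands above, and the output-directory handling is our own addition.

# Illustrative wrapper around scripts/run_mt_meta_eval.py (not part of the repository).
import subprocess
from pathlib import Path

CONFIGS = [
    # (WMT year, language pair, gold annotation name)
    ("wmt20", "en-de", "mqm-col1"),
    ("wmt20", "zh-en", "mqm-col1"),
    ("wmt22", "en-de", "mqm-col1"),
]

for year, lp, gold in CONFIGS:
    out_path = Path(f"data/rankings/{year}/{lp}/ranking.txt")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cmd = [
        "python", "scripts/run_mt_meta_eval.py",
        "--wmt-year", year,
        "--lp", lp,
        "--new-human-annotations-dir", f"data/annotations/{year}",
        "--gold-name", gold,
        "--new-metrics-path", f"data/metrics_info/{year}/out_paths.tsv",
    ]
    with out_path.open("w") as f:
        subprocess.run(cmd, stdout=f, check=True)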

Cite this work

This work has been published at ACL 2025 (Main Conference). If you use any part of it, please cite our paper as follows:

@misc{proietti2025machinetranslationevaluationachieved,
      title={Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress}, 
      author={Lorenzo Proietti and Stefano Perrella and Roberto Navigli},
      year={2025},
      eprint={2506.19571},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.19571}, 
}

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
