The code in this repository requires Python 3.10 or higher. We recommend creating a fresh conda environment:

```bash
conda create -n human-parity-mt-eval python=3.10
conda activate human-parity-mt-eval
pip install --upgrade pip
```
All scripts in this repository rely on the Google WMT Metrics evaluation repository (mt-metrics-eval). Clone and install it as follows:

```bash
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
```
Then, download the WMT Metrics evaluation datasets:

```bash
alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download  # Puts ~2G of data into $HOME/.mt-metrics-eval.
```
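As a quick sanity check that the package and data are in place, you can load one of the WMT evaluation sets from Python. This is a minimal sketch following the usage shown in the mt-metrics-eval README; the exact `EvalSet` attributes may differ across versions of that package:

```python
from mt_metrics_eval import data

# Load the WMT20 en-de evaluation set from $HOME/.mt-metrics-eval
# (assumes `mtme --download` has completed successfully).
evs = data.EvalSet("wmt20", "en-de")

# List a few of the MT systems covered by this test set.
print(len(evs.sys_names), "systems, e.g.:", sorted(evs.sys_names)[:5])
```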
The data/ directory contains all the information required to reproduce the analyses presented in our paper. The structure is organized by WMT evaluation year and language pair, and includes both human annotations and automatic metric outputs.
📂 Click to expand the directory tree
```
data
├── annotations
│   ├── wmt20
│   │   ├── en-de
│   │   │   ├── mqm-col1.pickle
│   │   │   ├── mqm-col2.pickle
│   │   │   ├── mqm-col3.pickle
│   │   │   ├── psqm-col1.pickle
│   │   │   ├── psqm-col2.pickle
│   │   │   └── psqm-col3.pickle
│   │   └── zh-en
│   │       ├── mqm-col1.pickle
│   │       ├── mqm-col2.pickle
│   │       ├── mqm-col3.pickle
│   │       ├── psqm-col1.pickle
│   │       ├── psqm-col2.pickle
│   │       └── psqm-col3.pickle
│   ├── wmt22
│   │   ├── en-de
│   │   │   ├── en-de.ESA-1.seg.score
│   │   │   ├── en-de.ESA-2.seg.score
│   │   │   ├── en-de.MQM-1.seg.score
│   │   │   ├── mqm-col1.pickle
│   │   │   ├── mqm-col2.pickle
│   │   │   └── mqm-col3.pickle
│   │   └── en-zh
│   │       ├── mqm-col1.pickle
│   │       ├── mqm-col2.pickle
│   │       └── mqm-col3.pickle
│   └── wmt23
│       ├── en-de
│       │   ├── mqm-col1_more_data.pickle
│       │   ├── mqm-col1.pickle
│       │   ├── mqm-col2_more_data.pickle
│       │   ├── mqm-col2.pickle
│       │   ├── mqm-col3_more_data.pickle
│       │   └── mqm-col3.pickle
│       └── zh-en
│           ├── mqm-col1.pickle
│           ├── mqm-col2.pickle
│           └── mqm-col3.pickle
├── metrics_info
│   ├── wmt20
│   │   └── out_paths.tsv
│   ├── wmt22
│   │   └── out_paths.tsv
│   └── wmt23
│       └── out_paths.tsv
├── metrics_outputs
│   ├── wmt20
│   │   ├── en-de
│   │   │   └── BLEURT-20
│   │   └── zh-en
│   │       └── BLEURT-20
│   ├── wmt22
│   │   ├── en-de
│   │   │   ├── CometKiwi-XL
│   │   │   ├── CometKiwi-XXL
│   │   │   ├── MetricX-23-QE-XXL
│   │   │   └── MetricX-23-XXL
│   │   └── en-zh
│   │       ├── CometKiwi-XL
│   │       ├── CometKiwi-XXL
│   │       ├── MetricX-23-QE-XXL
│   │       └── MetricX-23-XXL
│   └── wmt23
│       ├── en-de
│       │   ├── MetricX-23-QE-XXL
│       │   └── MetricX-23-XXL
│       └── zh-en
│           ├── MetricX-23-QE-XXL
│           └── MetricX-23-XXL
└── rankings
    ├── wmt20
    │   ├── en-de
    │   └── zh-en
    ├── wmt22
    │   ├── en-de
    │   └── en-zh
    ├── wmt23
    │   ├── en-de
    │   └── zh-en
    └── wmt24
        └── en-es
```
- 📄 `annotations/`: Human annotations following MT evaluation protocols (e.g., MQM, PSQM, ESA) across multiple WMT editions and language pairs. These are the human MT evaluators used in our analysis.
- ℹ️ `metrics_info/`: Metadata about the additional automatic metrics included in our study (beyond those originally submitted to WMT), namely metric names and output file paths.
- 📈 `metrics_outputs/`: The outputs of these additional automatic metrics for each WMT year and language pair.
- 🏆 `rankings/`: The final rankings of all evaluators (both automatic metrics and humans), as generated by the `run_mt_meta_eval.py` script.
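To take a quick look at one of the annotation files, you can unpickle it directly. This is a minimal sketch; the exact structure of the stored object (e.g., a DataFrame or a dictionary of segment-level scores) is an assumption to verify after loading:

```python
import pickle

# Hypothetical inspection of a WMT20 en-de MQM annotation file; the type and
# layout of the unpickled object are assumptions -- check them before use.
path = "data/annotations/wmt20/en-de/mqm-col1.pickle"
with open(path, "rb") as f:
    annotations = pickle.load(f)

print(type(annotations))
```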
To reproduce the results presented in our paper, run the `scripts/run_mt_meta_eval.py` script, which performs the meta-evaluation considering both automatic MT metrics and human evaluators.
WMT20 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt20 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt20 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt20/out_paths.tsv > data/rankings/wmt20/en-de/ranking.txt

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt20 \
    --lp zh-en \
    --new-human-annotations-dir data/annotations/wmt20 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt20/out_paths.tsv > data/rankings/wmt20/zh-en/ranking.txt
```
WMT22 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt22 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt22 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt22/out_paths.tsv > data/rankings/wmt22/en-de/ranking.txt

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt22 \
    --lp en-zh \
    --new-human-annotations-dir data/annotations/wmt22 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt22/out_paths.tsv > data/rankings/wmt22/en-zh/ranking.txt
```
WMT23 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt23 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt23 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt23/out_paths.tsv > data/rankings/wmt23/en-de/ranking.txt

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt23 \
    --lp zh-en \
    --new-human-annotations-dir data/annotations/wmt23 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt23/out_paths.tsv > data/rankings/wmt23/zh-en/ranking.txt
```
WMT24 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt24 \
    --lp en-es \
    --gold-name mqm > data/rankings/wmt24/en-es/ranking.txt
```
This work has been published at ACL 2025 (Main Conference). If you use any part of it, please cite our paper:
```bibtex
@misc{proietti2025machinetranslationevaluationachieved,
      title={Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress},
      author={Lorenzo Proietti and Stefano Perrella and Roberto Navigli},
      year={2025},
      eprint={2506.19571},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.19571},
}
```
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).