The code in this repository requires Python 3.10 or higher. We recommend creating a fresh conda environment:

```bash
conda create -n human-parity-mt-eval python=3.10
conda activate human-parity-mt-eval
pip install --upgrade pip
```
All scripts in this repository rely on the Google WMT Metrics evaluation repository (mt-metrics-eval). Clone and install it as follows:

```bash
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
```
Then, download the WMT Metrics evaluation datasets:

```bash
alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download  # Puts ~2G of data into $HOME/.mt-metrics-eval.
```
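As a quick sanity check that the package and data are in place, you can load one of the WMT evaluation sets from Python. This is a minimal sketch following the usage shown in the mt-metrics-eval README; the exact `EvalSet` attributes may differ across versions of that package:

```python
from mt_metrics_eval import data

# Load the WMT20 en-de evaluation set from $HOME/.mt-metrics-eval
# (assumes `mtme --download` has completed successfully).
evs = data.EvalSet("wmt20", "en-de")

# List a few of the MT systems covered by this test set.
print(len(evs.sys_names), "systems, e.g.:", sorted(evs.sys_names)[:5])
```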
The data/ directory contains all the information required to reproduce the analyses presented in our paper. The structure is organized by WMT evaluation year and language pair, and includes both human annotations and automatic metric outputs.
📂 Click to expand the directory tree
```
data
├── annotations
│   ├── wmt20
│   │   ├── en-de
│   │   │   ├── mqm-col1.pickle
│   │   │   ├── mqm-col2.pickle
│   │   │   ├── mqm-col3.pickle
│   │   │   ├── psqm-col1.pickle
│   │   │   ├── psqm-col2.pickle
│   │   │   └── psqm-col3.pickle
│   │   └── zh-en
│   │       ├── mqm-col1.pickle
│   │       ├── mqm-col2.pickle
│   │       ├── mqm-col3.pickle
│   │       ├── psqm-col1.pickle
│   │       ├── psqm-col2.pickle
│   │       └── psqm-col3.pickle
│   ├── wmt22
│   │   ├── en-de
│   │   │   ├── en-de.ESA-1.seg.score
│   │   │   ├── en-de.ESA-2.seg.score
│   │   │   ├── en-de.MQM-1.seg.score
│   │   │   ├── mqm-col1.pickle
│   │   │   ├── mqm-col2.pickle
│   │   │   └── mqm-col3.pickle
│   │   └── en-zh
│   │       ├── mqm-col1.pickle
│   │       ├── mqm-col2.pickle
│   │       └── mqm-col3.pickle
│   └── wmt23
│       ├── en-de
│       │   ├── mqm-col1_more_data.pickle
│       │   ├── mqm-col1.pickle
│       │   ├── mqm-col2_more_data.pickle
│       │   ├── mqm-col2.pickle
│       │   ├── mqm-col3_more_data.pickle
│       │   └── mqm-col3.pickle
│       └── zh-en
│           ├── mqm-col1.pickle
│           ├── mqm-col2.pickle
│           └── mqm-col3.pickle
├── metrics_info
│   ├── wmt20
│   │   └── out_paths.tsv
│   ├── wmt22
│   │   └── out_paths.tsv
│   └── wmt23
│       └── out_paths.tsv
├── metrics_outputs
│   ├── wmt20
│   │   ├── en-de
│   │   │   └── BLEURT-20
│   │   └── zh-en
│   │       └── BLEURT-20
│   ├── wmt22
│   │   ├── en-de
│   │   │   ├── CometKiwi-XL
│   │   │   ├── CometKiwi-XXL
│   │   │   ├── MetricX-23-QE-XXL
│   │   │   └── MetricX-23-XXL
│   │   └── en-zh
│   │       ├── CometKiwi-XL
│   │       ├── CometKiwi-XXL
│   │       ├── MetricX-23-QE-XXL
│   │       └── MetricX-23-XXL
│   └── wmt23
│       ├── en-de
│       │   ├── MetricX-23-QE-XXL
│       │   └── MetricX-23-XXL
│       └── zh-en
│           ├── MetricX-23-QE-XXL
│           └── MetricX-23-XXL
└── rankings
    ├── wmt20
    │   ├── en-de
    │   └── zh-en
    ├── wmt22
    │   ├── en-de
    │   └── en-zh
    ├── wmt23
    │   ├── en-de
    │   └── zh-en
    └── wmt24
        └── en-es
```
- 📄 `annotations/`: Human annotations following MT evaluation protocols (e.g., MQM, PSQM, ESA) across multiple WMT editions and language pairs. These are the human MT evaluators used in our analysis.
- ℹ️ `metrics_info/`: Metadata about the additional automatic metrics included in our study (beyond those originally submitted to WMT), namely metric names and output file paths.
- 📈 `metrics_outputs/`: The outputs of these additional automatic metrics for each WMT year and language pair.
- 🏆 `rankings/`: The final rankings of all evaluators (both automatic metrics and humans), as generated by the `run_mt_meta_eval.py` script.
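To take a quick look at one of the annotation files, you can unpickle it directly. This is a minimal sketch; the exact structure of the stored object (e.g., a DataFrame or a dictionary of segment-level scores) is an assumption to verify after loading:

```python
import pickle

# Hypothetical inspection of a WMT20 en-de MQM annotation file; the type and
# layout of the unpickled object are assumptions -- check them before use.
path = "data/annotations/wmt20/en-de/mqm-col1.pickle"
with open(path, "rb") as f:
    annotations = pickle.load(f)

print(type(annotations))
```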
To reproduce the results presented in our paper, run the `scripts/run_mt_meta_eval.py` script, which performs the meta-evaluation considering both automatic MT metrics and human evaluators.
WMT20 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt20 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt20 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt20/out_paths.tsv > data/rankings/wmt20/en-de/ranking.txt

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt20 \
    --lp zh-en \
    --new-human-annotations-dir data/annotations/wmt20 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt20/out_paths.tsv > data/rankings/wmt20/zh-en/ranking.txt
```
WMT22 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt22 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt22 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt22/out_paths.tsv > data/rankings/wmt22/en-de/ranking.txt

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt22 \
    --lp en-zh \
    --new-human-annotations-dir data/annotations/wmt22 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt22/out_paths.tsv > data/rankings/wmt22/en-zh/ranking.txt
```
WMT23 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt23 \
    --lp en-de \
    --new-human-annotations-dir data/annotations/wmt23 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt23/out_paths.tsv > data/rankings/wmt23/en-de/ranking.txt

python scripts/run_mt_meta_eval.py \
    --wmt-year wmt23 \
    --lp zh-en \
    --new-human-annotations-dir data/annotations/wmt23 \
    --gold-name mqm-col1 \
    --new-metrics-path data/metrics_info/wmt23/out_paths.tsv > data/rankings/wmt23/zh-en/ranking.txt
```
WMT24 (click to expand)
```bash
python scripts/run_mt_meta_eval.py \
    --wmt-year wmt24 \
    --lp en-es \
    --gold-name mqm > data/rankings/wmt24/en-es/ranking.txt
```
This work has been published at ACL 2025 (Main Conference). If you use any part of it, please cite our paper:
```bibtex
@misc{proietti2025machinetranslationevaluationachieved,
      title={Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress},
      author={Lorenzo Proietti and Stefano Perrella and Roberto Navigli},
      year={2025},
      eprint={2506.19571},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.19571},
}
```
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).