Resources for paper - QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
Accepted by EMNLP 2024.
We share the questions generated by 15 QG models, together with the annotation scores averaged over three annotators, in data/scores.xlsx; the instances grouped by passage are in data/instances.json. We also share the individual annotation results of each annotator in data/annotation result.
Example of an instance:
```json
{
  "id": "572882242ca10214002da423",
  "passage": "... The publication of a Taoist text inscribed with the name of Töregene Khatun, Ögedei's wife, ...",
  "reference": "Who was Ögedei's wife?",
  "answer": "Töregene Khatun",
  "questions": [
    {
      "prediction": "Who was the author of the Taoist text inscribed with the name of?",
      "source": "SQuAD_BART-base_finetune",
      "fluency": 3.0,
      "clarity": 2.6667,
      "conciseness": 3.0,
      "relevance": 3.0,
      "consistency": 2.0,
      "answerability": 1.0,
      "answer_consistency": 1.0
    },
    // ... 14 more questions
  ]
}
```
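The instances can be loaded programmatically with the Python standard library. A minimal sketch, assuming data/instances.json is a JSON array of objects with the fields shown above:

```python
import json

# Load the passage-level instances (assumed to be a JSON array of objects as illustrated above).
with open("data/instances.json", encoding="utf-8") as f:
    instances = json.load(f)

# Each instance bundles one passage/reference/answer with the generated questions and their scores.
for inst in instances[:1]:
    print(inst["reference"], "-", inst["answer"])
    for q in inst["questions"]:
        print(q["source"], "|", q["prediction"], "| answerability =", q["answerability"])
```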
The average annotation scores of each QG model over the 7 dimensions are shown in the table below.
Models | Flu. | Clar. | Conc. | Rel. | Cons. | Ans. | AnsC. | Avg. |
---|---|---|---|---|---|---|---|---|
M1 - Reference | 2.968 | 2.930 | 2.998 | 2.993 | 2.923 | 2.832 | 2.768 | 2.916 |
M2 - BART-base-finetune | 2.958 | 2.882 | 2.898 | 2.995 | 2.920 | 2.732 | 2.588 | 2.853 |
M3 - BART-large-finetune | 2.932 | 2.915 | 2.828 | 2.995 | 2.935 | 2.825 | 2.737 | 2.881 |
M4 - T5-base-finetune | 2.972 | 2.923 | 2.922 | 3.000 | 2.917 | 2.788 | 2.652 | 2.882 |
M5 - T5-large-finetune | 2.978 | 2.930 | 2.907 | 2.995 | 2.933 | 2.795 | 2.720 | 2.894 |
M6 - Flan-T5-base-finetune | 2.963 | 2.888 | 2.938 | 2.998 | 2.925 | 2.775 | 2.665 | 2.879 |
M7 - Flan-T5-large-finetune | 2.982 | 2.902 | 2.895 | 2.995 | 2.950 | 2.818 | 2.727 | 2.895 |
M8 - Flan-T5-XL-LoRA | 2.913 | 2.843 | 2.880 | 2.997 | 2.928 | 2.772 | 2.667 | 2.857 |
M9 - Flan-T5-XXL-LoRA | 2.938 | 2.848 | 2.907 | 3.000 | 2.943 | 2.757 | 2.678 | 2.867 |
M10 - Flan-T5-XL-fewshot | 2.975 | 2.820 | 2.985 | 2.955 | 2.908 | 2.652 | 2.193 | 2.784 |
M11 - Flan-T5-XXL-fewshot | 2.987 | 2.882 | 2.990 | 2.988 | 2.920 | 2.687 | 2.432 | 2.841 |
M12 - GPT-3.5-Turbo-fewshot | 2.972 | 2.927 | 2.858 | 2.995 | 2.955 | 2.850 | 2.335 | 2.842 |
M13 - GPT-4-Turbo-fewshot | 2.988 | 2.987 | 2.897 | 2.992 | 2.947 | 2.922 | 2.772 | 2.929 |
M14 - GPT-3.5-Turbo-zeroshot | 2.995 | 2.977 | 2.913 | 2.992 | 2.917 | 2.823 | 2.157 | 2.825 |
M15 - GPT-4-Turbo-zeroshot | 2.983 | 2.990 | 2.943 | 2.970 | 2.932 | 2.883 | 2.723 | 2.918 |
Avg. | 2.967 | 2.910 | 2.917 | 2.991 | 2.930 | 2.794 | 2.588 | |
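Per-model averages like those above can be recomputed from data/scores.xlsx. A small sketch with pandas, assuming one row per generated question with a `source` column and one column per dimension (see the column explanation further below); note that `source` encodes both the base dataset and the model, so the groups may be finer-grained than the table rows:

```python
import pandas as pd

# One row per generated question; `source` identifies the dataset/model pair.
df = pd.read_excel("data/scores.xlsx")

dimensions = ["fluency", "clarity", "conciseness", "relevance",
              "consistency", "answerability", "answer_consistency"]

# Average each dimension per source, then average across dimensions.
per_source = df.groupby("source")[dimensions].mean()
per_source["Avg."] = per_source.mean(axis=1)
print(per_source.round(3))
```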
We implemented 15 automatic metrics for re-evaluation (e.g., BLEU-4 and QRelScore).
We share the results of each metric on each generated question in data/metric_result.xlsx. Results of LLM-based metrics on answerability are in data/test_answerability.xlsx.
You can find our trained QG models on Hugging Face.
Our code supports evaluating automatic metrics, training Question Generation models, and calculating automatic metrics.
You can install our packages for question generation or automatic metrics.
- Question Generation: `pip install QGEval-qg`. For usage, please refer to https://pypi.org/project/QGEval-qg/.
- Automatic Metrics: `pip install QGEval-metrics`. For usage, please refer to https://pypi.org/project/QGEval-metrics/.
You can also download this resource and use it by following the instructions below.
The code and data for Question Generation are in qg. Train your own QG models with these steps:
- `cd ./qg`
- run `pip install -r requirements.txt` to install the required packages
- run `python process.py` to process the data
- run the code file for a specific model to train it; for example, run `python T5.py` to train your T5-based QG model (a generic sketch of what such fine-tuning involves is shown below)
Find more details in qg/readme.
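For orientation, here is a minimal, generic sketch of fine-tuning a T5-style model for question generation with Hugging Face transformers. It is not the project's qg/T5.py script; the input format (how answer and passage are concatenated) and the hyperparameters below are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import T5ForConditionalGeneration, T5TokenizerFast


class QGDataset(Dataset):
    """(source text, target question) pairs. The exact input format used by
    qg/T5.py is an assumption here."""

    def __init__(self, pairs, tokenizer, max_len=512):
        self.pairs, self.tokenizer, self.max_len = pairs, tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        enc = self.tokenizer(src, max_length=self.max_len, truncation=True,
                             padding="max_length", return_tensors="pt")
        lab = self.tokenizer(tgt, max_length=64, truncation=True,
                             padding="max_length", return_tensors="pt")
        labels = lab["input_ids"].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": labels}


tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Toy pair adapted from the instance example above; real training data comes from qg/data.
pairs = [("answer: Töregene Khatun  context: ... a Taoist text inscribed with the "
          "name of Töregene Khatun, Ögedei's wife ...",
          "Who was Ögedei's wife?")]
loader = DataLoader(QGDataset(pairs, tokenizer), batch_size=1, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(1):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
model.save_pretrained("t5-base-qg-sketch")
```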
The code for Automatic Metrics Calculation (e.g., BLEU-4) is in metrics. Calculate automatic metrics with these steps:
- prepare data; you can get the Question Generation dataset at qg/data, or prepare your own data
- `cd ./metric`
- run `pip install -r requirements.txt` to install the required packages
- run `python metrics.py` to get the evaluation results of your chosen metrics
Find more details in metrics/readme.
The code for Automatic Metrics is in metrics. Taking QRelScore as an example, you can use the QGEval benchmark to evaluate an automatic metric step by step:
- Prepare data for evaluation. You can get the QGEval dataset at data/scores.xlsx. Column explanation:
  - `passage` - the passage the question is based on
  - `reference` - the reference question
  - `answer` - the provided answer
  - `prediction` - the generated question
  - `source` - the base dataset and the model used to generate the `prediction` question
- Run automatic metrics:
  - `cd ./metric`
  - run `pip install -r requirements.txt` to install the required packages
  - run the code file for the specific metric to get its results. To get QRelScore results, run `python metrics.py`:

```python
import pandas as pd
# get_metrics is provided in metrics.py; this snippet is the relevant part of that script

# load data
data_path = 'your data path'
save_path = 'result save path'
data = pd.read_excel(data_path)

# prepare parameters
hypos = data['prediction'].tolist()
refs_list = [data['reference'].tolist()]
contexts = data['passage'].tolist()
answers = data['answer'].tolist()

# metric to use
score_names = ['QRelScore']

# run metric
res = get_metrics(hypos, refs_list, contexts, answers, score_names=score_names)

# handle results
for k, v in res.items():
    data[k] = v

# save results
data.to_excel(save_path, index=False)
print('Metrics saved to {}'.format(save_path))
```
- Calculate correlations: run `python coeff.py` to obtain the Pearson, Spearman, and Kendall correlation coefficients between the metric results and the labeled scores:

```python
import pandas as pd
# Coeff is provided in coeff.py; this snippet is the relevant part of that script

result_data_path = 'your result path'
df = pd.read_excel(result_data_path)
metrics = ['QRelScore']
# dimensions to calculate correlation with
dimensions = ['fluency', 'clarity', 'conciseness', 'relevance',
              'consistency', 'answerability', 'answer_consistency']
# calculate correlations
coeff = Coeff()
for metric in metrics:
    print(f"Correlations of {metric}")
    for dimension in dimensions:
        labels = df[dimension].to_list()
        preds = df[metric].to_list()
        per, spea, ken = coeff.apply(labels, preds)
        print(f"{dimension}: Pearson={per}, Spearman={spea}, Kendall={ken}")
    print()
```
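If you only need the coefficients and not the project's Coeff helper, the same three statistics can be computed directly with scipy.stats. A small sketch, assuming the column names of data/scores.xlsx plus the metric column written by metrics.py:

```python
import pandas as pd
from scipy.stats import kendalltau, pearsonr, spearmanr

df = pd.read_excel("your result path")  # output of metrics.py, one row per question
labels = df["answerability"]            # any of the seven annotated dimensions
preds = df["QRelScore"]                 # the metric column added by metrics.py

print("Pearson: ", pearsonr(labels, preds)[0])
print("Spearman:", spearmanr(labels, preds)[0])
print("Kendall: ", kendalltau(labels, preds)[0])
```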
More details about the code for automatic metrics are in metrics/readme.
Please cite:
@misc{fu2024qgevalbenchmarkingmultidimensionalevaluation,
title={QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation},
author={Weiping Fu and Bifan Wei and Jianxiang Hu and Zhongmin Cai and Jun Liu},
year={2024},
eprint={2406.05707},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.05707},
}