Resources for paper - QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
Accepted by EMNLP 2024.
We share the questions generated by 15 QG models, together with the annotation scores averaged over three annotators, in data/scores.xlsx; the instances grouped by passage are in data/instances.json. We also share the individual annotation results of each annotator in data/annotation result.
Example of an instance:
```json
{
  "id": "572882242ca10214002da423",
  "passage": "... The publication of a Taoist text inscribed with the name of Töregene Khatun, Ögedei's wife, ...",
  "reference": "Who was Ögedei's wife?",
  "answer": "Töregene Khatun",
  "questions": [
    {
      "prediction": "Who was the author of the Taoist text inscribed with the name of?",
      "source": "SQuAD_BART-base_finetune",
      "fluency": 3.0,
      "clarity": 2.6667,
      "conciseness": 3.0,
      "relevance": 3.0,
      "consistency": 2.0,
      "answerability": 1.0,
      "answer_consistency": 1.0
    },
    // ... 14 more questions
  ]
}
```
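The instances can be loaded programmatically with the Python standard library. A minimal sketch, assuming data/instances.json is a JSON array of objects with the fields shown above:

```python
import json

# Load the passage-level instances (assumed to be a JSON array of objects as illustrated above).
with open("data/instances.json", encoding="utf-8") as f:
    instances = json.load(f)

# Each instance bundles one passage/reference/answer with the generated questions and their scores.
for inst in instances[:1]:
    print(inst["reference"], "-", inst["answer"])
    for q in inst["questions"]:
        print(q["source"], "|", q["prediction"], "| answerability =", q["answerability"])
```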
The average annotation scores of each QG model over the 7 dimensions are shown in the table below.
Models | Flu. | Clar. | Conc. | Rel. | Cons. | Ans. | AnsC. | Avg. |
---|---|---|---|---|---|---|---|---|
M1 - Reference | 2.968 | 2.930 | 2.998 | 2.993 | 2.923 | 2.832 | 2.768 | 2.916 |
M2 - BART-base-finetune | 2.958 | 2.882 | 2.898 | 2.995 | 2.920 | 2.732 | 2.588 | 2.853 |
M3 - BART-large-finetune | 2.932 | 2.915 | 2.828 | 2.995 | 2.935 | 2.825 | 2.737 | 2.881 |
M4 - T5-base-finetune | 2.972 | 2.923 | 2.922 | 3.000 | 2.917 | 2.788 | 2.652 | 2.882 |
M5 - T5-large-finetune | 2.978 | 2.930 | 2.907 | 2.995 | 2.933 | 2.795 | 2.720 | 2.894 |
M6 - Flan-T5-base-finetune | 2.963 | 2.888 | 2.938 | 2.998 | 2.925 | 2.775 | 2.665 | 2.879 |
M7 - Flan-T5-large-finetune | 2.982 | 2.902 | 2.895 | 2.995 | 2.950 | 2.818 | 2.727 | 2.895 |
M8 - Flan-T5-XL-LoRA | 2.913 | 2.843 | 2.880 | 2.997 | 2.928 | 2.772 | 2.667 | 2.857 |
M9 - Flan-T5-XXL-LoRA | 2.938 | 2.848 | 2.907 | 3.000 | 2.943 | 2.757 | 2.678 | 2.867 |
M10 - Flan-T5-XL-fewshot | 2.975 | 2.820 | 2.985 | 2.955 | 2.908 | 2.652 | 2.193 | 2.784 |
M11 - Flan-T5-XXL-fewshot | 2.987 | 2.882 | 2.990 | 2.988 | 2.920 | 2.687 | 2.432 | 2.841 |
M12 - GPT-3.5-Turbo-fewshot | 2.972 | 2.927 | 2.858 | 2.995 | 2.955 | 2.850 | 2.335 | 2.842 |
M13 - GPT-4-Turbo-fewshot | 2.988 | 2.987 | 2.897 | 2.992 | 2.947 | 2.922 | 2.772 | 2.929 |
M14 - GPT-3.5-Turbo-zeroshot | 2.995 | 2.977 | 2.913 | 2.992 | 2.917 | 2.823 | 2.157 | 2.825 |
M15 - GPT-4-Turbo-zeroshot | 2.983 | 2.990 | 2.943 | 2.970 | 2.932 | 2.883 | 2.723 | 2.918 |
Avg. | 2.967 | 2.910 | 2.917 | 2.991 | 2.930 | 2.794 | 2.588 | |
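Per-model averages like those above can be recomputed from data/scores.xlsx. A small sketch with pandas, assuming one row per generated question with a `source` column and one column per dimension (see the column explanation further below); note that `source` encodes both the base dataset and the model, so the groups may be finer-grained than the table rows:

```python
import pandas as pd

# One row per generated question; `source` identifies the dataset/model pair.
df = pd.read_excel("data/scores.xlsx")

dimensions = ["fluency", "clarity", "conciseness", "relevance",
              "consistency", "answerability", "answer_consistency"]

# Average each dimension per source, then average across dimensions.
per_source = df.groupby("source")[dimensions].mean()
per_source["Avg."] = per_source.mean(axis=1)
print(per_source.round(3))
```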
We implemented 15 automatic metrics for re-evaluation (e.g., BLEU-4 and QRelScore).
We share the results of each metric on each generated question in data/metric_result.xlsx. Results of LLM-based metrics on answerability are in data/test_answerability.xlsx.
You can find our trained QG models on Hugging Face.
Our code supports evaluating automatic metrics, training Question Generation models, and calculating automatic metrics.
You can install our packages for question generation or automatic metrics.
- Question Generation: `pip install QGEval-qg`. For usage, please refer to https://pypi.org/project/QGEval-qg/.
- Automatic Metrics: `pip install QGEval-metrics`. For usage, please refer to https://pypi.org/project/QGEval-metrics/.
You can also download this resource and use it by following the instructions below.
The code and data for Question Generation are in qg. Train your own QG models with these steps:
- `cd ./qg`
- run `pip install -r requirements.txt` to install the required packages
- run `python process.py` to process the data
- run the code file for a specific model to train it; for example, run `python T5.py` to train your T5-based QG model (a generic sketch of what such fine-tuning involves is shown below)
Find more details in qg/readme.
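For orientation, here is a minimal, generic sketch of fine-tuning a T5-style model for question generation with Hugging Face transformers. It is not the project's qg/T5.py script; the input format (how answer and passage are concatenated) and the hyperparameters below are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import T5ForConditionalGeneration, T5TokenizerFast


class QGDataset(Dataset):
    """(source text, target question) pairs. The exact input format used by
    qg/T5.py is an assumption here."""

    def __init__(self, pairs, tokenizer, max_len=512):
        self.pairs, self.tokenizer, self.max_len = pairs, tokenizer, max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        enc = self.tokenizer(src, max_length=self.max_len, truncation=True,
                             padding="max_length", return_tensors="pt")
        lab = self.tokenizer(tgt, max_length=64, truncation=True,
                             padding="max_length", return_tensors="pt")
        labels = lab["input_ids"].squeeze(0)
        labels[labels == self.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": labels}


tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Toy pair adapted from the instance example above; real training data comes from qg/data.
pairs = [("answer: Töregene Khatun  context: ... a Taoist text inscribed with the "
          "name of Töregene Khatun, Ögedei's wife ...",
          "Who was Ögedei's wife?")]
loader = DataLoader(QGDataset(pairs, tokenizer), batch_size=1, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for epoch in range(1):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
model.save_pretrained("t5-base-qg-sketch")
```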
The code for Automatic Metrics Calculation (e.g., BLEU-4) is in metrics. Calculate automatic metrics with these steps:
- prepare data; you can get the Question Generation dataset at qg/data, or prepare your own data
- `cd ./metric`
- run `pip install -r requirements.txt` to install the required packages
- run `python metrics.py` to get the evaluation results of your chosen metrics
Find more details in metrics/readme.
The code for Automatic Metrics is in metrics. Taking QRelScore as an example, you can use the QGEval benchmark to evaluate an automatic metric step by step:
- Prepare data for evaluation. You can get the QGEval dataset at data/scores.xlsx. Column explanation:
  - `passage` - the passage the question is based on
  - `reference` - the reference question
  - `answer` - the provided answer
  - `prediction` - the generated question
  - `source` - the base dataset and the model used to generate the `prediction` question
- Run automatic metrics:
  - `cd ./metric`
  - run `pip install -r requirements.txt` to install the required packages
  - run the code file for the specific metric to get its results. To get QRelScore results, run `python metrics.py`:

```python
import pandas as pd
# get_metrics is provided in metrics.py; this snippet is the relevant part of that script

# load data
data_path = 'your data path'
save_path = 'result save path'
data = pd.read_excel(data_path)

# prepare parameters
hypos = data['prediction'].tolist()
refs_list = [data['reference'].tolist()]
contexts = data['passage'].tolist()
answers = data['answer'].tolist()

# metric to use
score_names = ['QRelScore']

# run metric
res = get_metrics(hypos, refs_list, contexts, answers, score_names=score_names)

# handle results
for k, v in res.items():
    data[k] = v

# save results
data.to_excel(save_path, index=False)
print('Metrics saved to {}'.format(save_path))
```
- Calculate correlations: run `python coeff.py` to obtain the Pearson, Spearman, and Kendall correlation coefficients between the metric results and the labeled scores:

```python
import pandas as pd
# Coeff is provided in coeff.py; this snippet is the relevant part of that script

result_data_path = 'your result path'
df = pd.read_excel(result_data_path)
metrics = ['QRelScore']
# dimensions to calculate correlation with
dimensions = ['fluency', 'clarity', 'conciseness', 'relevance',
              'consistency', 'answerability', 'answer_consistency']
# calculate correlations
coeff = Coeff()
for metric in metrics:
    print(f"Correlations of {metric}")
    for dimension in dimensions:
        labels = df[dimension].to_list()
        preds = df[metric].to_list()
        per, spea, ken = coeff.apply(labels, preds)
        print(f"{dimension}: Pearson={per}, Spearman={spea}, Kendall={ken}")
    print()
```
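If you only need the coefficients and not the project's Coeff helper, the same three statistics can be computed directly with scipy.stats. A small sketch, assuming the column names of data/scores.xlsx plus the metric column written by metrics.py:

```python
import pandas as pd
from scipy.stats import kendalltau, pearsonr, spearmanr

df = pd.read_excel("your result path")  # output of metrics.py, one row per question
labels = df["answerability"]            # any of the seven annotated dimensions
preds = df["QRelScore"]                 # the metric column added by metrics.py

print("Pearson: ", pearsonr(labels, preds)[0])
print("Spearman:", spearmanr(labels, preds)[0])
print("Kendall: ", kendalltau(labels, preds)[0])
```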
More details about the code for automatic metrics are in metrics/readme.
Please cite:
@misc{fu2024qgevalbenchmarkingmultidimensionalevaluation,
title={QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation},
author={Weiping Fu and Bifan Wei and Jianxiang Hu and Zhongmin Cai and Jun Liu},
year={2024},
eprint={2406.05707},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.05707},
}