QGEval

Resources for the paper QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation.

Accepted to EMNLP 2024.

Table of Contents

- Data
- Automatic Metrics
- QG Models
- How to Use
- Citation

Data

We share the questions generated by 15 QG models, together with the annotation scores averaged over three annotators, in data/scores.xlsx; the same instances grouped by passage are in data/instances.json. We also share each annotator's individual annotation results in data/annotation result.

Example of instances.

{
  "id": "572882242ca10214002da423",
  "passage": "... The publication of a Taoist text inscribed with the name of Töregene Khatun, Ögedei's wife, ...",
  "reference": "Who was Ögedei's wife?"
  "answer": "Töregene Khatun",
  "questions": [
      {
        "prediction": "Who was the author of the Taoist text inscribed with the name of?",
        "source": "SQuAD_BART-base_finetune",
        "fluency": 3.0,
        "clarity": 2.6667,
        "conciseness": 3.0,
        "relevance": 3.0,
        "consistency": 2.0,
        "answerability": 1.0,
        "answer_consistency": 1.0
      },
      // ... 14 more questions
  ]
}
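
For quick inspection, instances.json can be loaded with the standard json module. A minimal sketch, assuming the file is a JSON array of objects with the fields shown above:

```python
import json

# load the passage-grouped instances (assumed to be a JSON array of the objects shown above)
with open('data/instances.json', encoding='utf-8') as f:
    instances = json.load(f)

# inspect the first instance: its reference question, answer, and each model's generated question
first = instances[0]
print(first['reference'], '->', first['answer'])
for q in first['questions']:
    print(q['source'], '|', q['prediction'], '| answerability =', q['answerability'])
```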

The average annotation scores of each QG model on the 7 dimensions are shown in the table below (Flu. = Fluency, Clar. = Clarity, Conc. = Conciseness, Rel. = Relevance, Cons. = Consistency, Ans. = Answerability, AnsC. = Answer Consistency).

| Models | Flu. | Clar. | Conc. | Rel. | Cons. | Ans. | AnsC. | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| M1 - Reference | 2.968 | 2.930 | 2.998 | 2.993 | 2.923 | 2.832 | 2.768 | 2.916 |
| M2 - BART-base-finetune | 2.958 | 2.882 | 2.898 | 2.995 | 2.920 | 2.732 | 2.588 | 2.853 |
| M3 - BART-large-finetune | 2.932 | 2.915 | 2.828 | 2.995 | 2.935 | 2.825 | 2.737 | 2.881 |
| M4 - T5-base-finetune | 2.972 | 2.923 | 2.922 | 3.000 | 2.917 | 2.788 | 2.652 | 2.882 |
| M5 - T5-large-finetune | 2.978 | 2.930 | 2.907 | 2.995 | 2.933 | 2.795 | 2.720 | 2.894 |
| M6 - Flan-T5-base-finetune | 2.963 | 2.888 | 2.938 | 2.998 | 2.925 | 2.775 | 2.665 | 2.879 |
| M7 - Flan-T5-large-finetune | 2.982 | 2.902 | 2.895 | 2.995 | 2.950 | 2.818 | 2.727 | 2.895 |
| M8 - Flan-T5-XL-LoRA | 2.913 | 2.843 | 2.880 | 2.997 | 2.928 | 2.772 | 2.667 | 2.857 |
| M9 - Flan-T5-XXL-LoRA | 2.938 | 2.848 | 2.907 | 3.000 | 2.943 | 2.757 | 2.678 | 2.867 |
| M10 - Flan-T5-XL-fewshot | 2.975 | 2.820 | 2.985 | 2.955 | 2.908 | 2.652 | 2.193 | 2.784 |
| M11 - Flan-T5-XXL-fewshot | 2.987 | 2.882 | 2.990 | 2.988 | 2.920 | 2.687 | 2.432 | 2.841 |
| M12 - GPT-3.5-Turbo-fewshot | 2.972 | 2.927 | 2.858 | 2.995 | 2.955 | 2.850 | 2.335 | 2.842 |
| M13 - GPT-4-Turbo-fewshot | 2.988 | 2.987 | 2.897 | 2.992 | 2.947 | 2.922 | 2.772 | 2.929 |
| M14 - GPT-3.5-Turbo-zeroshot | 2.995 | 2.977 | 2.913 | 2.992 | 2.917 | 2.823 | 2.157 | 2.825 |
| M15 - GPT-4-Turbo-zeroshot | 2.983 | 2.990 | 2.943 | 2.970 | 2.932 | 2.883 | 2.723 | 2.918 |
| Avg. | 2.967 | 2.910 | 2.917 | 2.991 | 2.930 | 2.794 | 2.588 | |
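
The per-model averages above can be reproduced from data/scores.xlsx. A minimal sketch, assuming the sheet has one row per generated question, a 'source' column of the form dataset_model_method (as in the example instance above), and one column per dimension:

```python
import pandas as pd

# per-question scores with averaged human annotations
df = pd.read_excel('data/scores.xlsx')

dimensions = ['fluency', 'clarity', 'conciseness', 'relevance',
              'consistency', 'answerability', 'answer_consistency']

# strip the leading dataset name so rows from different base datasets that
# share a model are averaged together (assumption about the 'source' format)
df['model'] = df['source'].str.split('_', n=1).str[1]

per_model = df.groupby('model')[dimensions].mean()
per_model['Avg.'] = per_model.mean(axis=1)
print(per_model.round(3))
```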

Automatic Metrics

We implemented 15 automatic metrics for re-evaluation; they are listed below.

| Metrics | Paper | Code Link |
| --- | --- | --- |
| BLEU-4 | BLEU: a Method for Automatic Evaluation of Machine Translation | link |
| ROUGE-L | ROUGE: A Package for Automatic Evaluation of Summaries | link |
| METEOR | METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments | link |
| BERTScore | BERTScore: Evaluating Text Generation with BERT | link |
| MoverScore | MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance | link |
| BLEURT | BLEURT: Learning Robust Metrics for Text Generation | link |
| BARTScore-ref | BARTScore: Evaluating Generated Text as Text Generation | link |
| GPTScore-ref | GPTScore: Evaluate as You Desire | link |
| Q-BLEU4 | Towards a Better Metric for Evaluating Question Generation Systems | link |
| QSTS | QSTS: A Question-Sensitive Text Similarity Measure for Question Generation | link |
| BARTScore-src | BARTScore: Evaluating Generated Text as Text Generation | link |
| GPTScore-src | GPTScore: Evaluate as You Desire | link |
| QRelScore | QRelScore: Better Evaluating Generated Questions with Deeper Understanding of Context-aware Relevance | link |
| UniEval | Towards a Unified Multi-Dimensional Evaluator for Text Generation | link |
| RQUGE | RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question | link |

We share the results of each metric on each generated question in data/metric_result.xlsx. Results of LLM-based metrics on answerability are in data/test_answerability.xlsx.

QG Models

You can find our trained QG models on Hugging Face.
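
As an illustration, a fine-tuned T5-style QG checkpoint can be used with the transformers library roughly as follows. The model name and the answer/passage input format below are placeholder assumptions, not the repository's actual conventions; check qg/readme and the Hugging Face model card for the real ones.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# hypothetical checkpoint name; replace with the QG model linked above
model_name = "your-org/your-qg-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# assumed input format (answer plus passage); the repo's qg/process.py defines
# the actual format used during fine-tuning
passage = "... The publication of a Taoist text inscribed with the name of Töregene Khatun, Ögedei's wife, ..."
answer = "Töregene Khatun"
inputs = tokenizer(f"answer: {answer} context: {passage}",
                   return_tensors="pt", truncation=True, max_length=512)

# generate a question with beam search and decode it
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```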

How to Use

Our code can be used to train question generation (QG) models, to calculate automatic metrics, and to evaluate those metrics against the human annotations.

You can install the required packages for question generation or for the automatic metrics separately.

You can also download this repository and use it by following the instructions below.

Question Generation

The code and data for question generation are in qg. Train your own QG models with these steps:

  1. cd ./qg
  2. run pip install -r requirements.txt to install the required packages
  3. run python process.py to process data
  4. run the script for the specific model you want to train. For example, run python T5.py to train your T5-based QG model

Find more details in qg/readme.
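
For orientation, the fine-tuning scripts follow the standard sequence-to-sequence recipe. The sketch below is a simplified illustration using the Hugging Face Seq2SeqTrainer, not a copy of T5.py; the file paths, field names, input format, and hyperparameters are placeholder assumptions.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# placeholder path; the processed files produced by process.py live under qg/data
dataset = load_dataset("json", data_files={"train": "qg/data/train.json"})

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # assumed field names and input format; see qg/process.py for the real ones
    inputs = [f"answer: {a} context: {p}" for a, p in zip(batch["answer"], batch["passage"])]
    model_inputs = tokenizer(inputs, truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["question"], truncation=True, max_length=64)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset["train"].map(preprocess, batched=True,
                                 remove_columns=dataset["train"].column_names)

# illustrative hyperparameters only
args = Seq2SeqTrainingArguments(output_dir="qg-t5-base",
                                per_device_train_batch_size=8,
                                num_train_epochs=3,
                                learning_rate=3e-4)

trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=tokenized,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```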

Automatic Metrics Calculation

The code for automatic metrics calculation (e.g., BLEU-4) is in metrics. Calculate automatic metrics with these steps:

  1. prepare the data: you can use the question generation dataset in qg/data or prepare your own data
  2. cd ./metric
  3. run pip install -r requirements.txt to install the required packages
  4. run python metrics.py to get evaluation results for your chosen metrics

Find more details in metrics/readme.

Evaluation of Automatic Metrics

The code for evaluating automatic metrics is in metrics.

Taking QRelScore as an example, you can use the QGEval benchmark to evaluate it step by step:

  1. Prepare data for evaluation: You can get the QGEval dataset at data/scores.xlsx.

    Column Explanation
    "passage" - the passage of the question based on.
    "reference" - the reference question.
    "answer" - the provided answer.
    "prediction" - the generated question.
    "source" - the base dataset and model used to   generate the 'prediction' question.
  2. Run automatic metrics

    • cd ./metric
    • run pip install -r requirements.txt to install the required packages
    • run the specific code file to get results from automatic metrics. To get QRelScore results, run python metrics.py:
     import pandas as pd
     # get_metrics is assumed to be defined in metrics.py in this folder;
     # import it (e.g. `from metrics import get_metrics`) if you run this snippet as a separate script
     # load data
     data_path = 'your data path'
     save_path = 'result save path'
     data = pd.read_excel(data_path)
     # prepare parameters
     hypos = data['prediction'].tolist()
     refs_list = [data['reference'].tolist()]
     contexts = data['passage'].tolist()
     answers = data['answer'].tolist()
     # metric to use
     score_names = ['QRelScore']
     # run metric
     res = get_metrics(hypos, refs_list, contexts, answers, score_names=score_names)
     # handle results
     for k, v in res.items():
         data[k] = v
     # save results
     data.to_excel(save_path, index=False)
     print('Metrics saved to {}'.format(save_path))
  3. Calculate Correlations

    run python coeff.py to obtain the Pearson, Spearman, and Kendall correlation coefficients between the metric scores and the human annotation scores.

    import pandas as pd
    # Coeff is assumed to be defined in coeff.py in this folder;
    # import it (e.g. `from coeff import Coeff`) if you run this snippet as a separate script
    result_data_path = 'your result path'
    df = pd.read_excel(result_data_path)
    metrics = ['QRelScore']
    
    # dimensions to calculate correlations with
    dimensions = ['fluency', 'clarity', 'conciseness', 'relevance', 'consistency', 'answerability', 'answer_consistency']
    
    # calculate Pearson, Spearman, and Kendall coefficients
    coeff = Coeff()
    
    for metric in metrics:
        print(f"Correlations of {metric}")
        for dimension in dimensions:
            labels = df[dimension].to_list()
            preds = df[metric].to_list()
            per, spea, ken = coeff.apply(labels, preds)
            print(f"{dimension}: Pearson={per}, Spearman={spea}, Kendall={ken}")
            print()

More details about the code for automatic metrics are in metrics/readme.

Citation

Please cite:

@misc{fu2024qgevalbenchmarkingmultidimensionalevaluation,
      title={QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation}, 
      author={Weiping Fu and Bifan Wei and Jianxiang Hu and Zhongmin Cai and Jun Liu},
      year={2024},
      eprint={2406.05707},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.05707}, 
}
