Citation Failure: Definition, Analysis and Mitigation

This repository contains the code for CITECONTROL, the citation benchmark, and for CITENTION, the citation framework, both described in our paper Citation Failure: Definition, Analysis and Mitigation.

From the abstract of our paper:

Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for *citation failure*, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings.


Contact: Jan Buchmann

UKP Lab | TU Darmstadt

Installation

Core

conda create -n citecontrol python=3.12
conda activate citecontrol
pip install -r requirements.txt
conda env config vars set VLLM_WORKER_MULTIPROC_METHOD=spawn

Create a .env file:

HF_TOKEN=<hf_token>

PermissionError from NLTK: If NLTK raises a PermissionError, set the NLTK_DATA environment variable to a writable directory.
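
For example (any writable directory works; the path below is just an illustration):

mkdir -p $HOME/nltk_data
export NLTK_DATA=$HOME/nltk_data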

Flash Attention

pip install packaging
pip install ninja

Verify ninja installation

ninja --version
echo $?

The echo should print 0, i.e. ninja --version exited successfully. If it does not, uninstall and reinstall ninja: pip uninstall -y ninja && pip install ninja.

Install CUDA toolkits

conda install -c conda-forge gcc_linux-64==12.2.0
conda install -c conda-forge gxx_linux-64==12.2.0
conda install nvidia/label/cuda-12.1.0::cuda-nvcc

Install flash attention

module load cuda
export MAX_JOBS=4 # limit parallelism to avoid OOM error
pip install flash-attn==2.7.4.post1 --no-build-isolation

To use flash attention in experiments, add use_flash_attn=True to your command.
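
For example, a run command with flash attention enabled might look like this (the model and task are illustrative; see the run examples further below for the other arguments):

python run.py model=Qwen3-1.7B tasks=[squad] split_names=[test] citation_scorer_class_names=[generation] citation_scorer_model_names_or_paths=[null] use_flash_attn=True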

Data Preparation

See data/README.md

Repository Structure

├── assets/ # Images etc
├── citention/ # Code for CITENTION, see README in citention/ folder
├── config/ # hydra config files
│   ├── model/ # config files for models (LLMs)
│   ├── task/ # config files for tasks
│   └── config.yaml # main config
├── data/ # See the README in the data/ folder
├── data_preprocessing/ # Code for preprocessing raw datasets
├── evaluation/ 
│   ├── analysis.py # Aggregates results from evaluated individual predictions
│   ├── attributability.py # Attributability evaluation
│   ├── evaluator.py # Evaluates individual predictions
│   ├── metrics.py # Functions for recall, exact match etc.
│   └── response_processing.py # Code for separating statements and citations
├── models/
│   ├── base_model.py # Model base class
│   ├── citention_model.py # Wrapper around CitentionModel
│   ├── oracle.py # Oracle (returns ground truth) for validation
│   └── vllm_model.py # VLLM-based model for faster inference
├── util/
│   ├── base_classes.py # Base classes for instance, prediction etc.
│   ├── misc.py # Helper functions
│   ├── prompt_formatting.py # Builds prompts from instructions, documents, questions etc
│   └── train_utils.py # Helper functions for training
├── evaluate.py # Run this script to re-evaluate runs
├── preprocess.py # Convert raw datasets to common instance format
├── run.py # Run model on dataset(s) and evaluate
├── train_at2_attention.py # Train attention head parameters according to AT2 (Cohen-Wang et al., 2025)
├── train_bm25.py # Compute token frequency statistics for BM25
├── train_qr_head_attention.py # Train attention head parameters according to QRHead (Zhang et al., 2025)
└── train_score_combinator.py # Train score combinator

Running Models on CITECONTROL

Existing Config, Generative Citation Only

To run a model with an existing config in config/model (e.g. Qwen3-1.7B) with generative citation on the CITECONTROL test sets, use the run.py script. Each run has a unique hash that is printed to the console and logged in data/out/overview.csv. The results are printed to the console and stored under data/out/<hash>.

python run.py model=Qwen3-1.7B tasks=[squad,boolq,musique,neoqa] split_names=[test,test,test,test] citation_scorer_class_names=[generation] citation_scorer_model_names_or_paths=[null] use_vllm=True

use_vllm=True is only possible when generation is the only citation method used.

Existing Config, Combining Citation Methods

You can use any combination of citation methods by putting them into the citation_scorer_class_names list. See the README in the citention/ folder for all available methods. You also need to pass a citation_scorer_model_names_or_paths list of the same length, where each entry is either the hash of a training run (see below), the name of a trained model (e.g. "dragon"), or null.

By default, the scores of the different citation methods are averaged uniformly to obtain a final citation score for each source document. You can pass score_combinator_hash to load the weights of a trained score combinator (see below) and use a weighted average instead.
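
To make the combination explicit (this is our paraphrase of the behavior described above, not a formula taken from the code): if each of the K selected citation scorers assigns a score s_k(d) to source document d, the default final score is the uniform average (1/K) * sum_k s_k(d); a trained score combinator replaces this with a weighted sum sum_k w_k * s_k(d).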

python run.py model=Qwen3-1.7B tasks=[squad,boolq,musique,neoqa] citation_scorer_class_names=[generation,attention,attention,bm25,sbert_dual] citation_scorer_model_names_or_paths=[null,null,<hash1>,<hash2>,dragon] score_combinator_hash=<hash4>

No Existing Config

To run or train a model without an existing config in config/model, set command line arguments as follows:

python run.py tasks=[squad] model=custom model.model_name=<your_model_name> model.hf_id=<huggingface_id> ...

Importantly, separating reasoning tokens from response tokens is currently not supported, so make sure that your model does not generate reasoning tokens. If it does by default, you probably need to create a new config for it and append something like "<think>\n</think>" to the end of the prompt template.
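
A possible way to set this up (the file name and config fields below are assumptions; check the existing files in config/model/ for the actual schema):

# copy an existing model config as a starting point (assumed file name)
cp config/model/Qwen3-1.7B.yaml config/model/my_model.yaml
# edit config/model/my_model.yaml: set your Hugging Face ID and, if the model reasons by default,
# append "<think>\n</think>" to the prompt template
python run.py tasks=[squad] model=my_model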

Training Citation Methods

Some citation methods require training (e.g. to select attention heads). This repo contains scripts to train citation methods on the datasets in CITECONTROL. Their usage is similar to that of run.py. The trained parameters are stored under data/out/<hash>.

Training AT2 Attention

To train attention head parameters as described by Cohen-Wang et al. (2025), use train_at2_attention.py:

python train_at2_attention.py model=Qwen3-1.7B tasks=[squad] split_names=[train]

Training QRHead Attention

To select the best attention heads for a given train set as described by Zhang et al. (2025), use train_qr_head_attention.py:

python train_qr_head_attention.py model=Qwen3-1.7B tasks=[squad] split_names=[train]

Training BM25 (Computing Token Statistics)

To compute token statistics for BM25, use train_bm25.py:

python train_bm25.py tasks=[squad] split_names=[train]

Training Score Combinator

To train the weights for a score combinator, use train_score_combinator.py. The script can be used in two ways, depending on the value of citation_scores_hash (default: None). If you pass the hash of an existing run, the citation scores from that run are reused to optimize the parameters of the score combinator, which is faster:

python train_score_combinator.py citation_scores_hash=<hash>

If you don't have an existing run with scores, the usage of the script is similar to run.py. The script will first run the specified model on the specified datasets and then train the score combinator on the computed citation scores.

python train_score_combinator.py model=Qwen3-1.7B tasks=[squad,boolq,musique,neoqa] citation_scorer_class_names=[generation,attention,attention,bm25,sbert_dual] citation_scorer_model_names_or_paths=[null,null,<hash1>,<hash2>,dragon]

Citation

If you find this repository useful in your research, consider citing our paper:

@misc{buchmann2025citationfailuredefinitionanalysis,
      title={Citation Failure: Definition, Analysis and Efficient Mitigation}, 
      author={Jan Buchmann and Iryna Gurevych},
      year={2025},
      eprint={2510.20303},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.20303}, 
}

References

Cohen-Wang, B., Chuang, Y., & Madry, A. (2025). Learning to Attribute with Attention. ArXiv, abs/2504.13752.

Zhang, W., Yin, F., Yen, H., Chen, D., & Ye, X. (2025). Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking. ArXiv, abs/2506.09944.

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
