# CITECONTROL

This repository contains the code for CITECONTROL, the citation benchmark described in our paper [Citation Failure: Definition, Analysis and Efficient Mitigation](https://arxiv.org/abs/2510.20303).
From the abstract of our paper:

> Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for *citation failure*, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from *response failure*, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings.
Contact: Jan Buchmann
## Setup

```bash
conda create -n citecontrol python=3.12
conda activate citecontrol
pip install -r requirements.txt
conda env config vars set VLLM_WORKER_MULTIPROC_METHOD=spawn
```

Create a `.env` file:

```
HF_TOKEN=<hf_token>
```

**PermissionError from NLTK**: If you get a `PermissionError` from NLTK, set the `NLTK_DATA` environment variable to a writable directory.
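For example (the path is just an illustration; any writable directory works):

```bash
export NLTK_DATA=$HOME/nltk_data
```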
## Flash attention (optional)

```bash
pip install packaging
pip install ninja
```

Verify the ninja installation:

```bash
ninja --version
echo $?  # should return 0
```

If it does not return 0, uninstall and reinstall ninja: `pip uninstall -y ninja && pip install ninja`.

Install the CUDA toolchain:

```bash
conda install -c conda-forge gcc_linux-64==12.2.0
conda install -c conda-forge gxx_linux-64==12.2.0
conda install nvidia/label/cuda-12.1.0::cuda-nvcc
```

Install flash attention:

```bash
module load cuda
export MAX_JOBS=4  # limit parallelism to avoid OOM errors
pip install flash-attn==2.7.4.post1 --no-build-isolation
```

To use flash attention in experiments, add `use_flash_attn=True` to your command, as in the example below.
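A minimal sketch, reusing the `Qwen3-1.7B` config and the `squad` task from the examples below:

```bash
python run.py model=Qwen3-1.7B tasks=[squad] split_names=[test] citation_scorer_class_names=[generation] citation_scorer_model_names_or_paths=[null] use_flash_attn=True
```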
## Data

See `data/README.md`.
## Repository structure

```
├── assets/                      # Images etc.
├── citention/                   # Code for CITENTION, see the README in the citention/ folder
├── config/                      # hydra config files
│   ├── model/                   # config files for models (LLMs)
│   ├── task/                    # config files for tasks
│   └── config.yaml              # main config
├── data/                        # See the README in the data/ folder
├── data_preprocessing/          # Code for preprocessing raw datasets
├── evaluation/
│   ├── analysis.py              # Aggregates results from evaluated individual predictions
│   ├── attributability.py       # Attributability evaluation
│   ├── evaluator.py             # Evaluates individual predictions
│   ├── metrics.py               # Functions for recall, exact match etc.
│   └── response_processing.py   # Code for separating statements and citations
├── models/
│   ├── base_model.py            # Model base class
│   ├── citention_model.py       # Wrapper around CitentionModel
│   ├── oracle.py                # Oracle (returns ground truth) for validation
│   └── vllm_model.py            # vLLM-based model for faster inference
├── util/
│   ├── base_classes.py          # Base classes for instance, prediction etc.
│   ├── misc.py                  # Helper functions
│   ├── prompt_formatting.py     # Builds prompts from instructions, documents, questions etc.
│   └── train_utils.py           # Helper functions for training
├── evaluate.py                  # Run this script to re-evaluate runs
├── preprocess.py                # Convert raw datasets to common instance format
├── run.py                       # Run model on dataset(s) and evaluate
├── train_at2_attention.py       # Train attention head parameters according to AT2 (Cohen-Wang et al., 2025)
├── train_bm25.py                # Compute token frequency statistics for BM25
├── train_qr_head_attention.py   # Train attention head parameters according to QRHead (Zhang et al., 2025)
└── train_score_combinator.py    # Train score combinator
```

## Running a model

To run a model with an existing config in `config/model` (e.g. Qwen3-1.7B) with generative citation on the CITECONTROL test sets, use the `run.py` script. Each run has a unique hash that is printed to the console and logged in `data/out/overview.csv`. The results are printed to the console and stored under `data/out/<hash>`.
```bash
python run.py model=Qwen3-1.7B tasks=[squad,boolq,musique,neoqa] split_names=[test,test,test,test] citation_scorer_class_names=[generation] citation_scorer_model_names_or_paths=[null] use_vllm=True
```

Note that `use_vllm=True` is only possible when generation is the only citation method used.
You can use any combination of citation methods by putting them into the list `citation_scorer_class_names`; see the CITENTION repo for all available methods. You then need to pass a list `citation_scorer_model_names_or_paths` of the same length, where each entry is either the hash of a training run (see below), the name of a trained model (e.g. `dragon`), or `null`.
By default, the scores of the different citation methods are averaged uniformly to obtain a final citation score for each source document. You can pass `score_combinator_hash` to load the weights of a trained score combinator and obtain a weighted average instead.
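As a minimal sketch of the two combination modes (illustrative only; the values and method order are made up, and this is not the repo's implementation):

```python
import numpy as np

# Citation scores of one source document under three methods,
# e.g. generation, attention, bm25
scores = np.array([0.7, 0.4, 0.9])

# Default: uniform average
final_score = scores.mean()

# With score_combinator_hash: weighted average using trained combinator weights
weights = np.array([0.5, 0.2, 0.3])
final_score = weights @ scores
```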
```bash
python run.py model=Qwen3-1.7B tasks=[squad,boolq,musique,neoqa] citation_scorer_class_names=[generation,attention,attention,bm25,sbert_dual] citation_scorer_model_names_or_paths=[null,null,<hash1>,<hash2>,dragon] score_combinator_hash=<hash4>
```

To run or train a model without an existing config in `config/model`, set command line arguments as follows:
```bash
python run.py tasks=[squad] model=custom model.model_name=<your_model_name> model.hf_id=<huggingface_id> ...
```

Importantly, separating reasoning tokens from response tokens is currently not supported, so make sure that your model does not generate reasoning tokens. If it normally does, you probably need to create a new config for it and add something like `"<think>\n</think>"` at the end of the prompt template.
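A hypothetical config for such a model could look as follows (the field names and placeholders are illustrative; mirror an existing file in `config/model/` for the exact schema):

```yaml
# config/model/MyReasoningModel.yaml (hypothetical sketch)
model_name: MyReasoningModel
hf_id: my-org/my-reasoning-model
# Closing the think block at the end of the prompt template
# suppresses reasoning tokens:
prompt_template: "{instruction}\n{documents}\n{question}\n<think>\n</think>"
```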
## Training citation methods

Some citation methods require training (e.g. to select attention heads). This repo contains scripts to train citation methods on the datasets in CITECONTROL. Their usage is similar to that of `run.py`. The trained parameters are stored under `data/out/<hash>`.
To train attention head parameters as described by Cohen-Wang et al. (2025), use `train_at2_attention.py`:

```bash
python train_at2_attention.py model=Qwen3-1.7B tasks=[squad] split_names=[train]
```

To select the best attention heads for a given train set as described by Zhang et al. (2025), use `train_qr_head_attention.py`:
```bash
python train_qr_head_attention.py model=Qwen3-1.7B tasks=[squad] split_names=[train]
```

To compute token statistics for BM25, use `train_bm25.py`:
```bash
python train_bm25.py tasks=[squad] split_names=[train]
```

To train the weights for a score combinator, use `train_score_combinator.py`. This script can be used in two ways, depending on the value of `citation_scores_hash` (default is `None`). If you pass the hash of an existing run, the citation scores from that run are used to optimize the parameters of the score combinator, which results in faster runtime:
```bash
python train_score_combinator.py citation_scores_hash=<hash>
```

If you don't have an existing run with scores, the usage of the script is similar to `run.py`: the script will first run the specified model on the specified datasets and then train the score combinator on the computed citation scores.
```bash
python train_score_combinator.py model=Qwen3-1.7B tasks=[squad,boolq,musique,neoqa] citation_scorer_class_names=[generation,attention,attention,bm25,sbert_dual] citation_scorer_model_names_or_paths=[null,null,<hash1>,<hash2>,dragon]
```

## Citation

If you find this repository useful in your research, consider citing our paper:
```bibtex
@misc{buchmann2025citationfailuredefinitionanalysis,
  title={Citation Failure: Definition, Analysis and Efficient Mitigation},
  author={Jan Buchmann and Iryna Gurevych},
  year={2025},
  eprint={2510.20303},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.20303},
}
```
## References

Cohen-Wang, B., Chuang, Y., & Madry, A. (2025). Learning to Attribute with Attention. arXiv, abs/2504.13752.

Zhang, W., Yin, F., Yen, H., Chen, D., & Ye, X. (2025). Query-Focused Retrieval Heads Improve Long-Context Reasoning and Re-ranking. arXiv, abs/2506.09944.
## Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
