This repository contains all scripts and data necessary to reproduce the results from ABCD-Link.
Abstract: Understanding fine-grained relations between documents is crucial for many application domains. However, the study of automated assistance is limited by the lack of efficient methods to create training and evaluation datasets of cross-document links. To address this, we introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links in a new domain from scratch. We first generate and validate semi-synthetic datasets of interconnected documents. This data is used to perform automatic evaluation, producing a shortlist of best-performing linking approaches. These approaches are then used in an extensive human evaluation study, yielding performance estimates on natural text pairs. We apply our framework in two distinct domains -- peer review and news -- and show that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. Our framework enables systematic study of cross-document understanding across application scenarios, and the resulting novel datasets lay the foundation for numerous cross-document tasks like media framing and peer review. We make the code, data, and annotation protocols openly available.
Contact person: Serwar Basch
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
First, ensure you have Python 3.11.
Then, install the necessary requirements:
pip install -r requirements.txt
python -m spacy download en_core_web_md
For OpenAI-based inference, set your key:
export OPENAI_API_KEY=YOUR_KEY_HERE
Ensure you have access to an appropriate GPU for the LLM inference step (at least 100GB of VRAM is needed for Qwen2.5).
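A quick, optional sanity check before running inference (this snippet is not part of the pipeline; it assumes torch is available, which the local vLLM inference step requires):

import os
import torch  # pulled in by the local vLLM inference step

# Check that the OpenAI key is exported and report the total visible GPU memory.
assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY for the API-based inference step."
if torch.cuda.is_available():
    total_gb = sum(torch.cuda.get_device_properties(i).total_memory
                   for i in range(torch.cuda.device_count())) / 1e9
    print(f"Visible GPU memory: {total_gb:.0f} GB (Qwen2.5 needs roughly 100GB)")
else:
    print("No CUDA device visible")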
To reconstruct the NEWS-HE dataset, please download the SPICED dataset from Zenodo (filename: spiced.csv) and place it under ./datasets/news_he
Then you can run:
python scripts/reconstruct_dataset.py
To run all steps in sequence:
bash run.sh
This runs:
- Retrieval model inference
- Prompt generation
- LLM inference (local + API)
- Evaluation (ranked + classified)
- Calculate IAA and Acceptance Rate from Human Evaluation
- Calculate true recall rate on the subset of manually annotated data
Results are saved to:
- ./predictions/
- ./data/prompts_json/
- ./llm_results/
- ./eval_outputs/
- ./datasets/*_he
You can also run specific stages:
bash run.sh --retrieval
bash run.sh --prompts
bash run.sh --llm
bash run.sh --eval
bash run.sh --anno
bash run.sh --gold
You can adjust evaluation parameters using flags passed to run.sh, for example:
bash run.sh --eval --type=classified --metric=f1
bash run.sh --eval --type=ranked --cutoffs=1 3 5 7 10 20 --metric=recall
Supported flags:
- --type=ranked|classified|all (default: all)
- --cutoffs= (for ranked)
- --metric=precision|recall|f1
To evaluate model outputs against human-annotated gold labels:
bash run.sh --gold
This evaluates precision, recall, and F1 on:
- datasets/news_he/news_gold_labels.csv
- datasets/reviews_he/reviews_gold_labels.csv
Results are saved to:
- datasets/news_he/eval_gold_labels.json
- datasets/reviews_he/eval_gold_labels.json
Datasets:
- news_ecb
- news_synth
- reviews_synth
- reviews_f1000
Each dataset contains:
- docs.json: documents split into sentences
- <name>_links.json: ground truth cross-document sentence-level links
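As a rough illustration of how these files fit together (only the top-level file layout documented above is assumed; per-record field names are not):

import json

# Load one semi-synthetic dataset, e.g. news_ecb, and report its size.
with open("datasets/news_ecb/docs.json") as f:
    docs = json.load(f)   # documents split into sentences
with open("datasets/news_ecb/news_ecb_links.json") as f:
    links = json.load(f)  # ground truth sentence-level links
print(f"{len(docs)} document entries, {len(links)} link entries")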
For retrievers (ranked):
- Precision@k, Recall@k, F1@k
For LLMs (classified):
- Precision, Recall, F1 over entire output
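For reference, both settings boil down to set comparisons between predicted and gold links. The sketch below is illustrative only and is not the code path used by scripts/evaluate.py:

def precision_recall_f1(predicted, gold):
    # predicted, gold: sets of (source_sentence_id, target_sentence_id) link pairs.
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def metrics_at_k(ranked_candidates, gold, k):
    # ranked_candidates: candidate sentence ids for one query sentence, best first;
    # gold: set of sentence ids that are true links for that query.
    hits = len(set(ranked_candidates[:k]) & gold)
    p = hits / k
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1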
- news_he
- reviews_he
Each dataset contains:
- docs.json: documents split into sentences
- annotations.json: annotation results from the human evaluation study
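As a hypothetical example of how the annotation results can be consumed (the field name "accepted" is an assumption; the authoritative computation is in scripts/annotation_results.py):

import json

with open("datasets/news_he/annotations.json") as f:
    annotations = json.load(f)

# Hypothetical: treat each record as carrying a boolean "accepted" judgment.
accepted = sum(1 for a in annotations if a.get("accepted"))
print(f"Raw acceptance rate: {accepted / len(annotations):.2%}")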
The generate_prompts.py script uses dragon_plus as the default source for top-ranked sentences based on our experiments. The value is hardcoded to ensure reproducibility of our results.
project-root/
│
├── datasets/
│ ├── news_ecb/
│ │ ├── docs.json
│ │ └── news_ecb_links.json
│ └── ...
│
├── data/ # Input artifacts for prompt generation
│ ├── positive_examples.json
│ └── prompts_json/ # All generated prompt files
│
├── predictions/ # Retriever output path
│
├── llm_results/ # LLM classification output path
│
├── eval_outputs/ # Metrics and evaluation output path
│
├── retrieval/ # Retriever scripts
│ ├── __init__.py
│ ├── scorers.py
│ ├── models.py
│ └── utils.py
│
├── prompts/ # Prompt construction scripts
│ ├── __init__.py
│ ├── builder.py
│ └── generate_prompts.py
│
├── llm_inference/ # LLM scripts
│ ├── __init__.py
│ ├── chat_utils.py # Shared prompt-building and vLLM setup
│ ├── executor.py # Local vLLM inference (Phi-4, Qwen)
│ ├── openai_utils.py
│ └── openai_executor.py # GPT-4o inference
│
├── scripts/ # Executable scripts
│ ├── run_retrievals.py # Runs all retrieval models on all datasets
│ ├── run_llm_inference.py # Runs all prompts through all LLMs
│ ├── annotation_results.py # Calculates agreement and acceptance rates on annotations
│ ├── evaluate_gold_labels.py
│ └── evaluate.py
│
├── requirements.txt
└── README.md
Please use the following citation:
@misc{basch2025abcdlink,
title={ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links},
author={Serwar Basch and Ilia Kuznetsov and Tom Hope and Iryna Gurevych},
year={2025},
eprint={2509.01387},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.01387},
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.