
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling


Description:

This repository includes the training, inference, and evaluation code used in our paper, SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling.

We introduce a principled framework for single-pass alignment and step-level annotation for automatic process supervision. Process Reward Models (SPARE-PRMs) trained with the proposed annotation scheme outperform baselines such as Self-Consistency and ORM-weighted aggregation on four datasets spanning mathematical, question-answering, and spatial reasoning tasks. The annotation scheme is also competitive with tree-search-based annotation methods while being more computationally efficient.

Abstract: Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determining their accuracy with explicit reasoning in a single generation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only ~16% of training samples compared to human-labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS-based methods while offering 2.3x speedup in terms of total token count. Manual analysis reveals complementary precision-recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.

Installation

Create a conda / mamba / venv virtual environment and install the dependencies in requirements.txt. E.g.:

mamba create -n spare python  # include python so that pip installs into this env
mamba activate spare
pip install -r requirements.txt

Running the experiments

The parameters of the experiments are specified in their respective config files:

config/
├── eval-config.yaml
├── infer-config.yaml         # infer config for solution generation
├── infer-rm-config.yaml      # infer config for solution scoring using Reward Model (RM)
├── private-config.yaml
├── train-po-config.yaml      # train config for preference-optimization (PO)
├── train-sft-config.yaml     # train config for simple supervised fine-tuning (SFT)
└── train-tc-rm-config.yaml   # train config for Token Classification (TC) based Reward Model (RM)

Private API keys, such as those for using OpenAI models or logging via the Neptune API, can be provided in the private-config.yaml file.
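
For illustration, here is a minimal sketch of how such a file could be loaded in Python and exposed to the respective libraries. The key names (openai_api_key, neptune_api_token) are hypothetical placeholders; check private-config.yaml for the actual schema used by this repo:

import os
import yaml  # pyyaml

# NOTE: the key names below are hypothetical examples, not necessarily this repo's schema.
with open("config/private-config.yaml") as f:
    private_cfg = yaml.safe_load(f)

# expose the keys via the environment variables these libraries read
os.environ["OPENAI_API_KEY"] = private_cfg["openai_api_key"]
os.environ["NEPTUNE_API_TOKEN"] = private_cfg["neptune_api_token"]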

To run a desired task, e.g. the token classification based reward model (tc-rm), execute the following command:

python train_rm.py # uses the default location of train-tc-rm-config.yaml
# OR alternatively
python train_rm.py --config my-train-tc-rm-config.yaml

Trained SPARE-PRM models based on Qwen2.5-3b and Llama-3-8b are provided for direct use on the Hugging Face Hub (UKPLab/Qwen2.5-3b-spare-prm-math and UKPLab/Llama-3-8b-spare-prm-math, respectively). Sample code to use them is given below:

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM
import torch

incorrect_token = "-"
correct_token = "+"
step_tag = " ки" # space in the beginning required for correct Llama tokenization

# tokenizer = AutoTokenizer.from_pretrained("UKPLab/Llama-3-8b-spare-prm-math")
tokenizer = AutoTokenizer.from_pretrained("UKPLab/Qwen2.5-3b-spare-prm-math")

step_target_ids = tokenizer.convert_tokens_to_ids([incorrect_token, correct_token])
step_tag_id = tokenizer.encode(step_tag)[-1] 

device = "cuda:0"
# model = AutoModelForCausalLM.from_pretrained("UKPLab/Llama-3-8b-spare-prm-math").to(device).eval()
model = AutoModelForCausalLM.from_pretrained("UKPLab/Qwen2.5-3b-spare-prm-math").to(device).eval()

# include this system instruction unchanged, as it was used during PRM training.
instruction = "You are an expert at solving challenging math problems spanning across various categories and difficulties such as Algebra, Number Theory, Geometry, Counting and Probability, Precalculus etc. For a given math problem, your task is to generate a step-by-step reasoning-based solution providing an answer to the question. Identify the correct concepts, formulas and heuristics that needs to be applied and then derive the contents of the reasoning steps from the given contexts and accurate calculations from the previous reasoning steps."
question = "Yann and Camille go to a restaurant. </S>\nIf there are 10 items on the menu, and each orders one dish, how many different combinations of meals can Yann and Camille order if they refuse to order the same dish? (It does matter who orders what---Yann ordering chicken and Camille ordering fish is different from Yann ordering fish and Camille ordering chicken.)"
correct_generation = "Let's think step by step.\nYann can order 1 of the 10 dishes. ки\nWhen he picks a dish, there are 9 left for Camille to choose from. ки\nThus, there are $10\\cdot 9=\\boxed{90}$ possible combinations.\nHence, the answer is 90. ки\n"
incorrect_generation = "Let's think step by step.\nWithout any restrictions, Yann and Camille could both order the same dish out of the 10 options, for a total of $10 \\cdot 9$ dishes. ки\nHowever, since Yann orders one of the 9 dishes that Camille didn't order (and vice versa), the number of possible combinations becomes $10 \\cdot 9 - 8 = \\boxed{72}$.\nHence, the answer is 72. ки\n"

for generation in (correct_generation, incorrect_generation):
    message = [
        dict(role="system", content=instruction),
        dict(role="user", content=question),
        dict(role="assistant", content=generation),
    ]

    input_ids = tokenizer.apply_chat_template(message, tokenize=True, return_tensors="pt").to(device)

    with torch.no_grad():
        logits = model(input_ids).logits[:,:,step_target_ids]
        scores = logits.softmax(dim=-1)[:,:,1] # correct_token at index 1 in the step_target_ids  
        step_scores = scores[input_ids == step_tag_id]
        print(step_scores)
        
# tensor([0.9617, 0.9487, 0.8938]) - correct_generation (Llama-3-8b-spare-prm-math)
# tensor([0.5794, 0.4910]) - incorrect_generation (Llama-3-8b-spare-prm-math)
# tensor([0.8710, 0.9163, 0.9786]) - correct_generation (Qwen2.5-3b-spare-prm-math)
# tensor([0.3292, 0.5288]) - incorrect_generation (Qwen2.5-3b-spare-prm-math)
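
To rank or aggregate multiple candidate generations with the PRM, the per-step scores above need to be reduced to a single solution-level score. Below is a minimal sketch assuming two common reductions, product and minimum of step scores; the choice of reduction here is illustrative, not necessarily the one used in the paper's experiments:

import torch

def solution_score(step_scores: torch.Tensor, how: str = "prod") -> float:
    # reduce per-step correctness probabilities to one solution-level score
    if how == "prod":  # probability that all steps are correct (independence assumption)
        return step_scores.prod().item()
    if how == "min":   # solution is only as good as its weakest step
        return step_scores.min().item()
    raise ValueError(f"unknown reduction: {how}")

# e.g. rerank the two generations scored above (Llama-3-8b step scores)
print(solution_score(torch.tensor([0.9617, 0.9487, 0.8938])))  # ~0.8155
print(solution_score(torch.tensor([0.5794, 0.4910])))          # ~0.2845

The higher-scoring generation would be selected in best-of-n reranking, or the scores can be used to weight answers during aggregation.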

Contact person: Md Imbesat Hassan Rizvi

UKP Lab | TU Darmstadt

Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.

Cite

If you use this repository, our trained SPARE-PRM models, or our work, please cite:

@misc{rizvi2025sparesinglepassannotationreferenceguided,
      title={SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling}, 
      author={Md Imbesat Hassan Rizvi and Xiaodan Zhu and Iryna Gurevych},
      year={2025},
      eprint={2506.15498},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.15498}, 
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
