This repository contains the source code and experimental setup for the master's thesis,
"Evaluation of Sparse Autoencoder-based Refusal Features in LLMs: A Dataset-dependence Study".
The core objective of this research is to investigate how the training data of an SAE influences its ability to identify, represent, and control refusal behaviour in (base) LLMs. Our findings show that SAE feature discovery depends strongly on both the training data and the intervention layer: a mix of pre-training and instruction-style data yields the most effective and steerable refusal-related features, in line with previous and concurrent work.
We follow a three-stage pipeline; in simplified form:

1. Assemble two corpora, SmolLM2 pre-training data (PRE) and LMSys-Chat instruction data (INS), plus mixtures at 30:70, 50:50, and 70:30 ratios, and compute dataset statistics (e.g. context length, type–token ratio).
2. Select promising transformer layers via linear refusal probes, then train Top-k SAEs on those layers under varied expansion factors.
3. Identify refusal-related latents via effect-size ranking (Cohen's d).
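The dataset statistics in step 1 can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization; the function name and return keys are illustrative, not the repository's actual implementation:

```python
from collections import Counter

def dataset_stats(texts):
    """Compute simple corpus statistics: mean context length (in
    whitespace tokens) and type-token ratio (unique / total tokens)."""
    tokens = [tok for t in texts for tok in t.split()]
    counts = Counter(tokens)
    n_total = len(tokens)
    ttr = len(counts) / n_total if n_total else 0.0
    mean_len = n_total / len(texts) if texts else 0.0
    return {"mean_context_length": mean_len, "type_token_ratio": ttr}

# Toy usage: 6 tokens total, 5 unique ("the" repeats).
stats = dataset_stats(["the cat sat", "the dog ran"])
```

In practice the same counts would be computed with the model's own tokenizer rather than `str.split`, since context length budgets are defined in model tokens.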
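Step 2's layer selection can be sketched as a linear probe. Below is a minimal numpy logistic-regression probe run on synthetic activations; in the real pipeline the inputs would be residual-stream activations from SmolLM2 for refused vs. complied prompts, and all names here are illustrative:

```python
import numpy as np

def probe_accuracy(acts, labels, epochs=500, lr=0.1):
    """Fit a logistic-regression probe by gradient descent and return
    training accuracy: a rough measure of how linearly decodable the
    refusal label is from this layer's activations."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))  # sigmoid
        grad = p - labels                          # dLoss/dlogit
        w -= lr * acts.T @ grad / n
        b -= lr * grad.mean()
    preds = (acts @ w + b > 0).astype(int)
    return (preds == labels).mean()

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))       # stand-in for residual-stream activations
labels = (acts[:, 0] > 0).astype(int)   # synthetic, linearly decodable "refusal" label
acc = probe_accuracy(acts, labels)
```

Layers whose probes score highest would then be the candidate hook points for SAE training; held-out accuracy (rather than training accuracy) would be used in practice.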
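The Top-k SAE forward pass in step 2 can be sketched in numpy as follows. The shapes, names, and toy weights are illustrative assumptions; the repository's actual architecture and training code may differ:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=32):
    """Top-k SAE forward pass: encode, keep only the k largest
    pre-activations per sample (others zeroed, ReLU on survivors),
    then decode back to the residual-stream dimension."""
    pre = x @ W_enc + b_enc                          # (n, n_latents)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]   # top-k latent indices per row
    latents = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = latents @ W_dec + b_dec                  # (n, d_model)
    return latents, recon

# Toy usage with expansion factor 8 (n_latents = 8 * d_model).
rng = np.random.default_rng(0)
d_model, n_latents, k = 16, 128, 8
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(size=(d_model, n_latents)) * 0.1
W_dec = rng.normal(size=(n_latents, d_model)) * 0.1
latents, recon = topk_sae_forward(x, W_enc, np.zeros(n_latents),
                                  W_dec, np.zeros(d_model), k=k)
```

The expansion factor (latents per model dimension) and k are the main knobs varied in the sweeps; training would minimize the reconstruction error between `recon` and `x`.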
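Step 3's effect-size ranking can be sketched as below, assuming per-prompt SAE latent activations are available as arrays for refusal and non-refusal prompts; the function names are illustrative:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def rank_latents(acts_refusal, acts_other, top=10):
    """Rank SAE latents by |Cohen's d| between refusal and
    non-refusal activations; return the top indices and their d values."""
    d = np.array([cohens_d(acts_refusal[:, j], acts_other[:, j])
                  for j in range(acts_refusal.shape[1])])
    order = np.argsort(-np.abs(d))
    return order[:top], d[order[:top]]

# Toy usage: latent 3 is given a large mean shift on refusal prompts.
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 8))
ref[:, 3] += 5.0
other = rng.normal(size=(100, 8))
top_idx, top_d = rank_latents(ref, other, top=3)
```

Latents with the largest |d| would be the candidates for interpretation and steering interventions.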
Experiments are limited to small open models (SmolLM2-135M); results may not transfer to frontier LLMs. The evaluation benchmarks (AdvBench, Alpaca, MMLU, HellaSwag) cover only part of the safety landscape. SAE-based steering remains fragile, showing feature saturation, dataset sensitivity, and a strong dependence on the choice of hook layer.
Most experiments were run on TU Wien’s HPC cluster (A100 80GB GPUs). Smaller runs are possible on consumer GPUs using reduced token budgets.
If you use this code or datasets, please cite:
@mastersthesis{kerl2025saerefusal,
title={Evaluation of Sparse Autoencoder-based Refusal Features in LLMs: A Dataset-dependence Study},
author={Tilman Kerl},
school={Technische Universit{\"a}t Wien},
year={2025}
}
This project is licensed under the MIT License - see the LICENSE.md file for details.
(C) 2024-2025 Tilman Kerl