# XLLM: Mind the Inconspicuous — Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries

XLLM is the official implementation of the paper *Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries*. This repository provides tools and scripts for probing, attacking, and evaluating the refusal boundaries of large language models (LLMs) using several attack methods, including GCG, SURE, ICA, and GPTFuzzer.
**Note:** Running the code requires a GPU with at least 40 GB of memory for LLM inference and gradient computation (especially for GCG).
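As a quick check that your GPU meets this requirement (assuming PyTorch with CUDA support is installed):

```python
# Sanity check for the ~40 GB GPU-memory requirement (assumes PyTorch
# with CUDA support is installed).
import torch

assert torch.cuda.is_available(), "No CUDA device found"
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.1f} GB total memory")
if total_gb < 40:
    print("Warning: GCG gradient computation may run out of memory.")
```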
## Installation

- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd XLLM
  ```

- (Recommended) Create and activate a virtual environment:

  ```bash
  conda create -n xllm python=3.8
  conda activate xllm
  ```

- Install the required packages (if `requirements.txt` is missing, please add it or specify the dependencies manually):

  ```bash
  pip install -r requirements.txt
  ```
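As a post-install sanity check, the snippet below verifies that the core stack imports. The package list is an assumption (PyTorch, Hugging Face `transformers`, `pandas`); the import statements under `Experiments/` are the authoritative reference.

```python
# Environment sanity check. The packages below are assumptions about the
# core dependencies; consult the imports in Experiments/ for the real list.
import pandas
import torch
import transformers

print("torch", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("pandas", pandas.__version__)
```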
## Project Structure

- `Scripts/` — Shell scripts to run the main experiments (GCG, SURE, ICA, GPTFuzzer).
- `Experiments/` — Python scripts for running and evaluating experiments.
- `BOOST/` — Core attack implementations and utilities:
  - `Attack_ICA/`, `Attack_GPTFuzzer/`, `Attack_GCG/` — Attack methods and helpers.
  - `utils/` — Utility modules (datasets, templates, constants).
- `Dataset/` — Datasets for probing and evaluation (harmful, harmless, AdvBench, etc.).
- `Probing/` — Probing results and screenshots.
- `.gitignore`, `LICENSE.txt`, `README.md` — Standard project files.
## Running Experiments

All main run scripts are in the `Scripts/` folder. For example:

```bash
bash ./Scripts/run_GCG.sh
bash ./Scripts/run_SURE.sh
bash ./Scripts/run_GPTFuzzer.sh
bash ./Scripts/run_ICA.sh
```

Results are saved to the output location specified by each script (create a `Logs/` directory if needed).
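If you prefer driving the runs from Python, a minimal launcher sketch (not part of the repository, assuming only the four script names above) could look like this:

```python
# Minimal launcher sketch (not part of the repository): creates the Logs/
# directory mentioned above, then runs each attack script in sequence.
import pathlib
import subprocess

pathlib.Path("Logs").mkdir(exist_ok=True)  # some scripts may expect this

for script in ("run_GCG.sh", "run_SURE.sh", "run_GPTFuzzer.sh", "run_ICA.sh"):
    print(f"=== {script} ===")
    subprocess.run(["bash", f"./Scripts/{script}"], check=True)
```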
You can also run or modify the experiment scripts in the `Experiments/` folder directly:

- `ica_exp.py`
- `sure_exp.py`
- `gcg_exp.py`
- `fuzzer_exp.py`
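For orientation, here is a self-contained toy sketch of the greedy coordinate gradient (GCG) loop that `gcg_exp.py` builds on. It swaps the LLM loss for a small differentiable stand-in so the control flow (gradient over a one-hot relaxation, top-k candidates, greedy swap) is visible without loading a model; it illustrates the technique and is not the repository's implementation.

```python
# Toy GCG sketch: optimize a discrete token "suffix" against a stand-in
# differentiable loss. In the real attack, loss_fn would be the LLM's loss
# on a target completion given prompt + suffix.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, SUFFIX_LEN, DIM, TOP_K = 50, 8, 16, 8
embedding = torch.randn(VOCAB, DIM)   # frozen toy embedding table
target = torch.randn(DIM)             # stand-in for the attack objective

def loss_fn(one_hot):
    # Toy objective: pull the mean suffix embedding toward `target`.
    return ((one_hot @ embedding).mean(0) - target).pow(2).sum()

suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))  # random initial suffix

for step in range(50):
    # 1) Gradient of the loss w.r.t. a one-hot relaxation of the suffix.
    one_hot = F.one_hot(suffix, VOCAB).float().requires_grad_(True)
    loss = loss_fn(one_hot)
    loss.backward()

    # 2) Per position, the TOP_K tokens whose one-hot direction most
    #    decreases the loss (largest negative gradient).
    candidates = (-one_hot.grad).topk(TOP_K, dim=1).indices

    # 3) Greedily evaluate single-token swaps; keep the best one found.
    best_loss, best_suffix = loss.item(), suffix
    for pos in range(SUFFIX_LEN):
        for tok in candidates[pos]:
            trial = suffix.clone()
            trial[pos] = tok
            trial_loss = loss_fn(F.one_hot(trial, VOCAB).float()).item()
            if trial_loss < best_loss:
                best_loss, best_suffix = trial_loss, trial
    suffix = best_suffix

print("final toy loss:", round(best_loss, 4))
```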
## Datasets

The `Dataset/` folder contains:

- `harmful.csv`, `harmful_targets.csv`
- `Advbench_391_harm.csv`, `Advbench_391_harmless.csv`
- `fuzzer_seed.csv`, `Advbench.csv`
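To get a feel for the data before running anything, the short peek below loads a few of the CSVs; no column schema is assumed, so inspect the printed columns rather than relying on this sketch:

```python
# Peek at the probing/evaluation CSVs. No column names are assumed;
# the printed headers show the actual schema of each file.
import pandas as pd

for name in ("harmful.csv", "Advbench.csv", "fuzzer_seed.csv"):
    df = pd.read_csv(f"Dataset/{name}")
    print(name, df.shape, list(df.columns))
```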
Probing screenshots and results are provided in the `Probing/` folder (e.g., `screenshots.pdf`).
## Citation

If you use XLLM in your work, please cite our paper:

```bibtex
@misc{yu2025mindinconspicuousrevealinghidden,
      title={Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries},
      author={Jiahao Yu and Haozheng Luo and Jerry Yao-Chieh Hu and Wenbo Guo and Han Liu and Xinyu Xing},
      year={2025},
      eprint={2405.20653},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}
```
## Acknowledgments

This project is inspired by and builds upon various open-source works in LLM safety and evaluation.