# XLLM: Mind the Inconspicuous — Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries

XLLM is the official implementation of the paper *Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries*. This repository provides tools and scripts for probing, attacking, and evaluating the refusal boundaries of large language models (LLMs) using several attack methods, including GCG, SURE, ICA, and GPTFuzzer.
**Note:** Running the code requires a GPU with at least 40 GB of memory for LLM inference and gradient computation (especially for GCG).
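As a quick check that your GPU meets this requirement (assuming PyTorch with CUDA support is installed):

```python
# Sanity check for the ~40 GB GPU-memory requirement (assumes PyTorch
# with CUDA support is installed).
import torch

assert torch.cuda.is_available(), "No CUDA device found"
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024**3
print(f"{props.name}: {total_gb:.1f} GB total memory")
if total_gb < 40:
    print("Warning: GCG gradient computation may run out of memory.")
```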
## Installation

- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd XLLM
  ```

- (Recommended) Create and activate a virtual environment:

  ```bash
  conda create -n xllm python=3.8
  conda activate xllm
  ```

- Install the required packages (if `requirements.txt` is missing, please add it or specify the dependencies manually):

  ```bash
  pip install -r requirements.txt
  ```
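As a post-install sanity check, the snippet below verifies that the core stack imports. The package list is an assumption (PyTorch, Hugging Face `transformers`, `pandas`); the import statements under `Experiments/` are the authoritative reference.

```python
# Environment sanity check. The packages below are assumptions about the
# core dependencies; consult the imports in Experiments/ for the real list.
import pandas
import torch
import transformers

print("torch", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("transformers", transformers.__version__)
print("pandas", pandas.__version__)
```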
## Project Structure

- `Scripts/` — Shell scripts to run the main experiments (GCG, SURE, ICA, GPTFuzzer).
- `Experiments/` — Python scripts for running and evaluating experiments.
- `BOOST/` — Core attack implementations and utilities:
  - `Attack_ICA/`, `Attack_GPTFuzzer/`, `Attack_GCG/` — Attack methods and helpers.
  - `utils/` — Utility modules (datasets, templates, constants).
- `Dataset/` — Datasets for probing and evaluation (harmful, harmless, AdvBench, etc.).
- `Probing/` — Probing results and screenshots.
- `.gitignore`, `LICENSE.txt`, `README.md` — Standard project files.
## Running Experiments

All main run scripts are in the `Scripts/` folder. For example:

```bash
bash ./Scripts/run_GCG.sh
bash ./Scripts/run_SURE.sh
bash ./Scripts/run_GPTFuzzer.sh
bash ./Scripts/run_ICA.sh
```

Results are saved to the output location specified by each script (create a `Logs/` directory if needed).
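If you prefer driving the runs from Python, a minimal launcher sketch (not part of the repository, assuming only the four script names above) could look like this:

```python
# Minimal launcher sketch (not part of the repository): creates the Logs/
# directory mentioned above, then runs each attack script in sequence.
import pathlib
import subprocess

pathlib.Path("Logs").mkdir(exist_ok=True)  # some scripts may expect this

for script in ("run_GCG.sh", "run_SURE.sh", "run_GPTFuzzer.sh", "run_ICA.sh"):
    print(f"=== {script} ===")
    subprocess.run(["bash", f"./Scripts/{script}"], check=True)
```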
You can also run or modify the experiment scripts in the `Experiments/` folder directly:

- `ica_exp.py`
- `sure_exp.py`
- `gcg_exp.py`
- `fuzzer_exp.py`
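For orientation, here is a self-contained toy sketch of the greedy coordinate gradient (GCG) loop that `gcg_exp.py` builds on. It swaps the LLM loss for a small differentiable stand-in so the control flow (gradient over a one-hot relaxation, top-k candidates, greedy swap) is visible without loading a model; it illustrates the technique and is not the repository's implementation.

```python
# Toy GCG sketch: optimize a discrete token "suffix" against a stand-in
# differentiable loss. In the real attack, loss_fn would be the LLM's loss
# on a target completion given prompt + suffix.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

VOCAB, SUFFIX_LEN, DIM, TOP_K = 50, 8, 16, 8
embedding = torch.randn(VOCAB, DIM)   # frozen toy embedding table
target = torch.randn(DIM)             # stand-in for the attack objective

def loss_fn(one_hot):
    # Toy objective: pull the mean suffix embedding toward `target`.
    return ((one_hot @ embedding).mean(0) - target).pow(2).sum()

suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))  # random initial suffix

for step in range(50):
    # 1) Gradient of the loss w.r.t. a one-hot relaxation of the suffix.
    one_hot = F.one_hot(suffix, VOCAB).float().requires_grad_(True)
    loss = loss_fn(one_hot)
    loss.backward()

    # 2) Per position, the TOP_K tokens whose one-hot direction most
    #    decreases the loss (largest negative gradient).
    candidates = (-one_hot.grad).topk(TOP_K, dim=1).indices

    # 3) Greedily evaluate single-token swaps; keep the best one found.
    best_loss, best_suffix = loss.item(), suffix
    for pos in range(SUFFIX_LEN):
        for tok in candidates[pos]:
            trial = suffix.clone()
            trial[pos] = tok
            trial_loss = loss_fn(F.one_hot(trial, VOCAB).float()).item()
            if trial_loss < best_loss:
                best_loss, best_suffix = trial_loss, trial
    suffix = best_suffix

print("final toy loss:", round(best_loss, 4))
```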
## Datasets

The `Dataset/` folder contains:

- `harmful.csv`, `harmful_targets.csv`
- `Advbench_391_harm.csv`, `Advbench_391_harmless.csv`
- `fuzzer_seed.csv`, `Advbench.csv`
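To get a feel for the data before running anything, the short peek below loads a few of the CSVs; no column schema is assumed, so inspect the printed columns rather than relying on this sketch:

```python
# Peek at the probing/evaluation CSVs. No column names are assumed;
# the printed headers show the actual schema of each file.
import pandas as pd

for name in ("harmful.csv", "Advbench.csv", "fuzzer_seed.csv"):
    df = pd.read_csv(f"Dataset/{name}")
    print(name, df.shape, list(df.columns))
```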
Probing screenshots and results are provided in the `Probing/` folder (e.g., `screenshots.pdf`).
## Citation

If you use XLLM in your work, please cite our paper:

```bibtex
@misc{yu2025mindinconspicuousrevealinghidden,
      title={Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries},
      author={Jiahao Yu and Haozheng Luo and Jerry Yao-Chieh Hu and Wenbo Guo and Han Liu and Xinyu Xing},
      year={2025},
      eprint={2405.20653},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
}
```
## Acknowledgments

This project is inspired by and builds upon various open-source works in LLM safety and evaluation.