ML-based Prompt Injection Detector

Note: I decided to do this project about Prompt Injection that is an LLM attack from the OWASP Top 10 LLM vulnerabilities due to personal and professional interests in the area of Large Language Models. This decision was approved by the professor.

A machine learning system that leverages transformer-based architecture and traditional ML approaches to detect and prevent prompt injection attacks against AI language models.

Overview and Idea for project

This system implements a comparative approach using two models:

DistilBERT: A transformer-based neural network for deep learning-based detection
Random Forest: A traditional machine learning approach using engineered linguistic features

The system learns patterns from data to identify potential prompt injection attacks, offering better resistance to obfuscation techniques compared to rule-based approaches.

Features

Dual-model comparison (DistilBERT vs Random Forest)
Comprehensive feature extraction for linguistic patterns
Support for both pre-trained and custom datasets
Detailed performance metrics and analysis
Real-time injection detection
Extensible architecture for multiple classifiers
Project Structure

Most of the code is inside the src folder. main.py is the entrypoint to the system.

.
├── src/
│   ├── classifiers/
|   |   └── features/
│   |       └── feature_extractor.py
│   │   ├── random_forest_classifier.py
│   │   └── distill_bert_classifier.py
│   ├── dataset_generation/
│       └── dataset_generator.py
│   
├── main.py


## Installation

The project includes:
- `pyproject.toml` with all required dependencies
- Devcontainer configuration for cross-platform compatibility (Windows/Linux)

### Quick Start

# Install dependencies using poetry
poetry install

# Alternatively, use pip
pip install -r requirements.txt

Usage

Running the Detector

# Run with DistilBERT classifier
python main.py --classifier_type distillbert

# Run with Random Forest classifier
python main.py --classifier_type randomforest

Command Line Arguments

--classifier_type: Choose between 'randomforest' or 'distillbert' (default: randomforest)
--pdfs: Enable PDF analysis [experimental] (default: False)

Datasets

The system uses two complementary datasets:

1. HuggingFace Dataset

Source: deepset/prompt-injections
Pre-labeled collection of benign and malicious prompts
Used for baseline training and evaluation

2. Custom Generated Dataset

Generate custom training data using:

python src/dataset_generation/dataset_generator.py

The custom dataset generator creates:

Benign Prompts

Natural language queries across multiple domains
Technical questions (programming, databases, APIs)
Business inquiries (project management, analysis)
Educational content (learning, research methods)
Varied templates and complexity levels

Malicious Prompts

Implements sophisticated injection techniques:

Role/behavior manipulation attempts
System command injections
Security constraint bypasses
Context manipulation
Hidden character obfuscation
Emotional manipulation patterns

Model Performance

The system evaluates models using:

Accuracy, Precision, Recall, F1-Score
Confusion Matrix Analysis

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
.devcontainer		.devcontainer
src		src
.gitignore		.gitignore
README.md		README.md
main.py		main.py
original_README.md		original_README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
report.pdf		report.pdf
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ML-based Prompt Injection Detector

Overview and Idea for project

Features

Project Structure

Usage

Running the Detector

Command Line Arguments

Datasets

1. HuggingFace Dataset

2. Custom Generated Dataset

Benign Prompts

Malicious Prompts

Model Performance

About

Uh oh!

Releases

Packages

Languages

detiuaveiro/aas-malware-icbaptista

Folders and files

Latest commit

History

Repository files navigation

ML-based Prompt Injection Detector

Overview and Idea for project

Features

Project Structure

Usage

Running the Detector

Command Line Arguments

Datasets

1. HuggingFace Dataset

2. Custom Generated Dataset

Benign Prompts

Malicious Prompts

Model Performance

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages