Note: I decided to do this project about Prompt Injection that is an LLM attack from the OWASP Top 10 LLM vulnerabilities due to personal and professional interests in the area of Large Language Models. This decision was approved by the professor.
A machine learning system that leverages transformer-based architecture and traditional ML approaches to detect and prevent prompt injection attacks against AI language models.
This system implements a comparative approach using two models:
- DistilBERT: A transformer-based neural network for deep learning-based detection
- Random Forest: A traditional machine learning approach using engineered linguistic features
The system learns patterns from data to identify potential prompt injection attacks, offering better resistance to obfuscation techniques compared to rule-based approaches.
-
Dual-model comparison (DistilBERT vs Random Forest)
-
Comprehensive feature extraction for linguistic patterns
-
Support for both pre-trained and custom datasets
-
Detailed performance metrics and analysis
-
Real-time injection detection
-
Extensible architecture for multiple classifiers
Most of the code is inside the src folder. main.py is the entrypoint to the system.
.
├── src/
│ ├── classifiers/
| | └── features/
│ | └── feature_extractor.py
│ │ ├── random_forest_classifier.py
│ │ └── distill_bert_classifier.py
│ ├── dataset_generation/
│ └── dataset_generator.py
│
├── main.py
## Installation
The project includes:
- `pyproject.toml` with all required dependencies
- Devcontainer configuration for cross-platform compatibility (Windows/Linux)
### Quick Start
# Install dependencies using poetry
poetry install
# Alternatively, use pip
pip install -r requirements.txt
# Run with DistilBERT classifier
python main.py --classifier_type distillbert
# Run with Random Forest classifier
python main.py --classifier_type randomforest
--classifier_type
: Choose between 'randomforest' or 'distillbert' (default: randomforest)--pdfs
: Enable PDF analysis [experimental] (default: False)
The system uses two complementary datasets:
- Source: deepset/prompt-injections
- Pre-labeled collection of benign and malicious prompts
- Used for baseline training and evaluation
Generate custom training data using:
python src/dataset_generation/dataset_generator.py
The custom dataset generator creates:
- Natural language queries across multiple domains
- Technical questions (programming, databases, APIs)
- Business inquiries (project management, analysis)
- Educational content (learning, research methods)
- Varied templates and complexity levels
Implements sophisticated injection techniques:
- Role/behavior manipulation attempts
- System command injections
- Security constraint bypasses
- Context manipulation
- Hidden character obfuscation
- Emotional manipulation patterns
The system evaluates models using:
- Accuracy, Precision, Recall, F1-Score
- Confusion Matrix Analysis