Skip to content

detiuaveiro/aas-malware-icbaptista

 
 

Repository files navigation

ML-based Prompt Injection Detector

Note: I decided to do this project about Prompt Injection that is an LLM attack from the OWASP Top 10 LLM vulnerabilities due to personal and professional interests in the area of Large Language Models. This decision was approved by the professor.

A machine learning system that leverages transformer-based architecture and traditional ML approaches to detect and prevent prompt injection attacks against AI language models.

Overview and Idea for project

This system implements a comparative approach using two models:

  • DistilBERT: A transformer-based neural network for deep learning-based detection
  • Random Forest: A traditional machine learning approach using engineered linguistic features

The system learns patterns from data to identify potential prompt injection attacks, offering better resistance to obfuscation techniques compared to rule-based approaches.

Features

  • Dual-model comparison (DistilBERT vs Random Forest)

  • Comprehensive feature extraction for linguistic patterns

  • Support for both pre-trained and custom datasets

  • Detailed performance metrics and analysis

  • Real-time injection detection

  • Extensible architecture for multiple classifiers

  • Project Structure

Most of the code is inside the src folder. main.py is the entrypoint to the system.

.
├── src/
│   ├── classifiers/
|   |   └── features/
│   |       └── feature_extractor.py
│   │   ├── random_forest_classifier.py
│   │   └── distill_bert_classifier.py
│   ├── dataset_generation/
│       └── dataset_generator.py
│   
├── main.py


## Installation

The project includes:
- `pyproject.toml` with all required dependencies
- Devcontainer configuration for cross-platform compatibility (Windows/Linux)

### Quick Start

# Install dependencies using poetry
poetry install

# Alternatively, use pip
pip install -r requirements.txt

Usage

Running the Detector

# Run with DistilBERT classifier
python main.py --classifier_type distillbert

# Run with Random Forest classifier
python main.py --classifier_type randomforest

Command Line Arguments

  • --classifier_type: Choose between 'randomforest' or 'distillbert' (default: randomforest)
  • --pdfs: Enable PDF analysis [experimental] (default: False)

Datasets

The system uses two complementary datasets:

1. HuggingFace Dataset

  • Source: deepset/prompt-injections
  • Pre-labeled collection of benign and malicious prompts
  • Used for baseline training and evaluation

2. Custom Generated Dataset

Generate custom training data using:

python src/dataset_generation/dataset_generator.py

The custom dataset generator creates:

Benign Prompts

  • Natural language queries across multiple domains
  • Technical questions (programming, databases, APIs)
  • Business inquiries (project management, analysis)
  • Educational content (learning, research methods)
  • Varied templates and complexity levels

Malicious Prompts

Implements sophisticated injection techniques:

  • Role/behavior manipulation attempts
  • System command injections
  • Security constraint bypasses
  • Context manipulation
  • Hidden character obfuscation
  • Emotional manipulation patterns

Model Performance

The system evaluates models using:

  • Accuracy, Precision, Recall, F1-Score
  • Confusion Matrix Analysis

About

aas24-aas-malware-aas-malware-pe created by GitHub Classroom

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%