A transformer-based classifier that distinguishes human-authored from AI-generated text, fine-tuned on extensive academic and cross-domain datasets. Built on DistilBERT and augmented with linguistic features, perplexity measures, and sentiment analysis for robust interpretability. The transformer was fine-tuned on a MacBook Air M2 with 8 GB of memory; text entries were truncated to 256 tokens to fit within those computing resources, and fine-tuning took roughly 24 hours.
This project leverages DistilBERT fine-tuned on ~23K labeled examples from the Learning to Rewrite (L2R) dataset and Kaggle's 2024 LLM text-detection competition (AIDE dataset). The data spans domains such as academic research, business, and legal documents. The model combines sophisticated linguistic indicators with ensemble-based cross-validation for reliable identification of AI-generated content.
- Transformer-based Detection: DistilBERT fine-tuned specifically for AI-text discrimination.
- Advanced Linguistic Indicators:
  - Readability: Flesch-Kincaid, average sentence length.
  - Structural Features: Lexical diversity, burstiness (sentence-length variance), function-word and stop-word ratios.
  - Semantic Indicators: TF-IDF cosine similarity, approximate edit distance.
- Perplexity Scoring: DistilGPT-2-based perplexity (higher indicates human-like complexity).
- Sentiment Analysis: VADER sentiment (positive, neutral, negative thresholds).
- Multi-format File Analysis: Text, PDF, LaTeX, DOCX support.
- CLI & PDF Reporting: Command-line interface and comprehensive PDF reports detailing overall and sentence-level analyses.
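Several of the structural indicators above reduce to simple token statistics. A minimal sketch of three of them; the stop-word list and regexes here are illustrative simplifications, not the project's exact implementation:

```python
import re
import statistics

# A small illustrative stop-word list; the real detector presumably uses a fuller set.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def structural_features(text: str) -> dict:
    """Compute lexical diversity, burstiness, and stop-word ratio for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    sentence_lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    return {
        # Type-token ratio: unique tokens / total tokens.
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
        # Burstiness: variance of sentence lengths; human text tends to vary more.
        "burstiness": statistics.pvariance(sentence_lengths) if len(sentence_lengths) > 1 else 0.0,
        "stop_word_ratio": sum(t in STOP_WORDS for t in tokens) / len(tokens) if tokens else 0.0,
    }
```

These per-text features are what the report surfaces alongside the transformer's score, which is what makes the predictions interpretable.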
- Total Size: 23,782 examples
- Data Sources:
  - AIDE (AI Detection for Essays, Kaggle competition): a mix of 1.3k student-written essays and essays generated by a variety of LLMs.
  - Learning to Rewrite (L2R), from "Learning to Rewrite: Generalized Detection of LLM-Generated Text": an extensive dataset of ~23k AI- and human-written entries spanning domains such as academic research, legal, business, and creative writing.
- Preprocessing:
  - Text normalization, markup removal (LaTeX/Markdown), and tokenization.
  - Computation of linguistic and statistical features for interpretability.
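The normalization step can be sketched roughly as follows; the regexes are illustrative simplifications of real LaTeX/Markdown stripping, not the project's actual cleaning pipeline:

```python
import re

def normalize(text: str) -> str:
    """Strip common LaTeX/Markdown markup and collapse whitespace."""
    text = re.sub(r"\\[a-zA-Z]+\{([^}]*)\}", r"\1", text)  # \emph{word} -> word
    text = re.sub(r"[*_`#>]+", "", text)                    # Markdown emphasis/headers
    text = re.sub(r"\s+", " ", text)                        # collapse runs of whitespace
    return text.strip()
```

Stripping markup before tokenization matters because formatting characters would otherwise leak into the token stream and skew features such as lexical diversity.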
- Base Model: Hugging Face distilbert-base-uncased
- Training Configuration:
  - Epochs: 3 per fold
  - Batch Size: 8 (optimized for the Apple Silicon M2 GPU)
  - Learning Rate: 4.67e-5 with a linear decay scheduler
  - Optimizer: AdamW
  - Hardware Acceleration: Apple Metal Performance Shaders (MPS)
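The linear decay schedule amounts to scaling the base learning rate down to zero over training. A sketch of the math (the run itself presumably used the Hugging Face scheduler; step counts here are illustrative, and no warmup is assumed):

```python
BASE_LR = 4.67e-5  # base learning rate from the training configuration

def linear_decay_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Learning rate after `step` optimizer updates under linear decay to zero."""
    return base_lr * max(0.0, 1.0 - step / total_steps)
```

So halfway through training the optimizer is stepping at roughly 2.3e-5, and the final updates are near zero, which helps the fine-tune settle.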
- Cross-Validation Strategy: 5-fold cross-validation for robust evaluation, ensuring generalizability and providing a solid foundation for ensemble predictions.
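The 5-fold split and ensemble averaging can be sketched with plain index arithmetic; the project presumably uses a library splitter (and may stratify by label), so this only shows the mechanics:

```python
def k_fold_indices(n: int, k: int = 5):
    """Yield (train_indices, val_indices) for k contiguous folds over n examples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def ensemble_probability(fold_probs: list[float]) -> float:
    """Average the per-fold models' probabilities for one example."""
    return sum(fold_probs) / len(fold_probs)
```

Each fold trains its own DistilBERT checkpoint on 4/5 of the data; at inference time, averaging the five models' probabilities gives the ensemble prediction.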
Model performance is rigorously assessed via:
- Accuracy: Overall classification correctness.
- Precision: Reliability of positive predictions (AI-generated text).
- Recall: Proportion of actual AI texts accurately classified.
- F1-score: Balance between precision and recall.
- Confusion Matrix: Visualization of True Positives, False Positives, True Negatives, and False Negatives.
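These metrics follow directly from the confusion-matrix counts. A sketch of the standard formulas, checked against the AI-generated row of the aggregated matrix reported below (note that metrics pooled over all folds can differ by a few hundredths of a point from the fold-averaged figures in the table):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Per-class precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# AI-generated class, pooled across folds: 15,955 true positives,
# 1,098 false positives (human texts flagged as AI), 330 false negatives.
p, r, f = precision_recall_f1(tp=15955, fp=1098, fn=330)
```

The pooled recall of ~97.97% matches the table; pooled precision (~93.56%) lands slightly above the fold-averaged 93.44%.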
Aggregated Cross-Validation Performance (5-folds):
| Metric | Human-Written (%) | AI-Generated (%) |
|---|---|---|
| Precision | 94.60 | 93.44 |
| Recall | 83.91 | 97.97 |
| F1-score | 88.91 | 95.65 |

Overall Accuracy: 93.82%
(Metrics averaged over 5-fold cross-validation.)
Aggregated confusion matrix across 5-fold cross-validation:
| Actual \ Predicted | Human-Written | AI-Generated |
|---|---|---|
| Human-Written | 5,816 (83.91%) | 1,098 (16.09%) |
| AI-Generated | 330 (2.03%) | 15,955 (97.97%) |
The confusion matrix shows excellent recall on AI-generated texts; reducing false positives (human texts misclassified as AI-generated) remains an area of ongoing work.
Rapid analysis of a document:

    python ai_text_detector.py "/path/to/file.pdf"