A transformer-based classifier that distinguishes human-authored from AI-generated text, fine-tuned on extensive academic and cross-domain datasets. Built on DistilBERT and augmented with linguistic features, perplexity measures, and sentiment analysis for robust interpretability. The transformer was fine-tuned on a MacBook Air M2 with 8 GB of memory; text entries were truncated to 256 tokens to fit within those computing resources, and fine-tuning took roughly 24 hours.
This project leverages DistilBERT fine-tuned on ~23K labeled examples from the Learning to Rewrite (L2R) dataset and Kaggle's 2024 LLM text-detection competition (AIDE dataset). The data spans domains such as academic research, business, and legal documents. The model combines sophisticated linguistic indicators with ensemble-based cross-validation for reliable identification of AI-generated content.
- Transformer-based Detection: DistilBERT fine-tuned specifically for AI-text discrimination.
- Advanced Linguistic Indicators:
  - Readability: Flesch-Kincaid, average sentence length.
  - Structural Features: Lexical diversity, burstiness (sentence-length variance), function-word and stop-word ratios.
  - Semantic Indicators: TF-IDF cosine similarity, approximate edit distance.
- Perplexity Scoring: DistilGPT-2-based perplexity (higher indicates human-like complexity).
- Sentiment Analysis: VADER sentiment (positive, neutral, negative thresholds).
- Multi-format File Analysis: Text, PDF, LaTeX, DOCX support.
- CLI & PDF Reporting: Command-line interface and comprehensive PDF reports detailing overall and sentence-level analyses.
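Several of the structural indicators above reduce to simple token statistics. A minimal sketch of three of them; the stop-word list and regexes here are illustrative simplifications, not the project's exact implementation:

```python
import re
import statistics

# A small illustrative stop-word list; the real detector presumably uses a fuller set.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def structural_features(text: str) -> dict:
    """Compute lexical diversity, burstiness, and stop-word ratio for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    sentence_lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    return {
        # Type-token ratio: unique tokens / total tokens.
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
        # Burstiness: variance of sentence lengths; human text tends to vary more.
        "burstiness": statistics.pvariance(sentence_lengths) if len(sentence_lengths) > 1 else 0.0,
        "stop_word_ratio": sum(t in STOP_WORDS for t in tokens) / len(tokens) if tokens else 0.0,
    }
```

These per-text features are what the report surfaces alongside the transformer's score, which is what makes the predictions interpretable.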
- Total Size: 23,782 examples
- Data Sources:
  - AIDE (AI Detection for Essays, Kaggle competition): a mix of 1.3k student-written essays and essays generated by a variety of LLMs.
  - Learning to Rewrite (L2R), from "Learning to Rewrite: Generalized Detection of LLM-Generated Text": an extensive dataset of ~23k AI- and human-written entries spanning domains such as academic research, legal, business, and creative writing.
- Preprocessing:
  - Text normalization, markup removal (LaTeX/Markdown), and tokenization.
  - Computation of linguistic and statistical features for interpretability.
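The normalization step can be sketched roughly as follows; the regexes are illustrative simplifications of real LaTeX/Markdown stripping, not the project's actual cleaning pipeline:

```python
import re

def normalize(text: str) -> str:
    """Strip common LaTeX/Markdown markup and collapse whitespace."""
    text = re.sub(r"\\[a-zA-Z]+\{([^}]*)\}", r"\1", text)  # \emph{word} -> word
    text = re.sub(r"[*_`#>]+", "", text)                    # Markdown emphasis/headers
    text = re.sub(r"\s+", " ", text)                        # collapse runs of whitespace
    return text.strip()
```

Stripping markup before tokenization matters because formatting characters would otherwise leak into the token stream and skew features such as lexical diversity.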
- Base Model: Hugging Face distilbert-base-uncased
- Training Configuration:
  - Epochs: 3 per fold
  - Batch Size: 8 (optimized for the Apple Silicon M2 GPU)
  - Learning Rate: 4.67e-5 with a linear decay scheduler
  - Optimizer: AdamW
  - Hardware Acceleration: Apple Metal Performance Shaders (MPS)
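The linear decay schedule amounts to scaling the base learning rate down to zero over training. A sketch of the math (the run itself presumably used the Hugging Face scheduler; step counts here are illustrative, and no warmup is assumed):

```python
BASE_LR = 4.67e-5  # base learning rate from the training configuration

def linear_decay_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Learning rate after `step` optimizer updates under linear decay to zero."""
    return base_lr * max(0.0, 1.0 - step / total_steps)
```

So halfway through training the optimizer is stepping at roughly 2.3e-5, and the final updates are near zero, which helps the fine-tune settle.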
- Cross-Validation Strategy: 5-fold cross-validation for robust evaluation, ensuring generalizability and providing a solid foundation for ensemble predictions.
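The 5-fold split and ensemble averaging can be sketched with plain index arithmetic; the project presumably uses a library splitter (and may stratify by label), so this only shows the mechanics:

```python
def k_fold_indices(n: int, k: int = 5):
    """Yield (train_indices, val_indices) for k contiguous folds over n examples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def ensemble_probability(fold_probs: list[float]) -> float:
    """Average the per-fold models' probabilities for one example."""
    return sum(fold_probs) / len(fold_probs)
```

Each fold trains its own DistilBERT checkpoint on 4/5 of the data; at inference time, averaging the five models' probabilities gives the ensemble prediction.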
Model performance is rigorously assessed via:
- Accuracy: Overall classification correctness.
- Precision: Reliability of positive predictions (AI-generated text).
- Recall: Proportion of actual AI texts accurately classified.
- F1-score: Balance between precision and recall.
- Confusion Matrix: Visualization of True Positives, False Positives, True Negatives, and False Negatives.
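These metrics follow directly from the confusion-matrix counts. A sketch of the standard formulas, checked against the AI-generated row of the aggregated matrix reported below (note that metrics pooled over all folds can differ by a few hundredths of a point from the fold-averaged figures in the table):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Per-class precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# AI-generated class, pooled across folds: 15,955 true positives,
# 1,098 false positives (human texts flagged as AI), 330 false negatives.
p, r, f = precision_recall_f1(tp=15955, fp=1098, fn=330)
```

The pooled recall of ~97.97% matches the table; pooled precision (~93.56%) lands slightly above the fold-averaged 93.44%.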
Aggregated Cross-Validation Performance (5-folds):
| Metric | Human-Written (%) | AI-Generated (%) |
|---|---|---|
| Precision | 94.60 | 93.44 |
| Recall | 83.91 | 97.97 |
| F1-score | 88.91 | 95.65 |

Overall Accuracy: 93.82%
(Metrics averaged over 5-fold cross-validation.)
Aggregated confusion matrix across 5-fold cross-validation:
| Actual \ Predicted | Human-Written | AI-Generated |
|---|---|---|
| Human-Written | 5,816 (83.91%) | 1,098 (16.09%) |
| AI-Generated | 330 (2.03%) | 15,955 (97.97%) |
The confusion matrix shows excellent recall on AI-generated texts; reducing false positives (human texts misclassified as AI-generated) remains an area of ongoing work.
Rapid analysis of a document:

    python ai_text_detector.py "/path/to/file.pdf"