This project presents an Anomaly-Based Intrusion Detection System (AIDS) tailored for IoT environments using the CICIoT23 dataset. It leverages a Random Forest classifier to distinguish between benign and malicious IoT network traffic, achieving 99.76% overall accuracy with exceptional precision and recall.
The system automates every stage — from data preparation and training to live anomaly detection and report generation.
CICIoT23 is a comprehensive IoT dataset developed by the Canadian Institute for Cybersecurity (CIC) for network intrusion research. It contains both benign traffic and multiple IoT attack types such as DDoS, Brute-force, and Scanning.
- Source: Canadian Institute for Cybersecurity
- Data Format: Multiple CSV files
- Attributes: 80+ flow-based and statistical features
- Objective: Classify IoT network flows as Benign or Anomalous
Model binaries (.pkl files) are not included in this repository due to GitHub file-size limitations.
To regenerate them, run the train_ids.py script.
This will create:
rf_ids_model.pkl— trained Random Forest IDS modelids_scaler.pkl— scaler object used for preprocessing
- Merge multiple CSVs into unified datasets for training, testing, and validation.
- Clean, normalize, and encode the data.
- Output:
merged_train.csv,merged_test.csv,merged_validation.csv.
- Scale the datasets and train a RandomForestClassifier using scikit-learn.
- Evaluate performance using accuracy, precision, recall, F1-score, and confusion matrix.
- Output:
rf_ids_model.pkl(model) andids_scaler.pkl(scaler).
- Load the trained model and scaler to analyze new IoT traffic CSVs.
- Detect anomalous vs benign flows automatically.
- Generate reports in both CSV and HTML formats.
- Output:
anomalous_flows_report.csvandanomalous_flows_report.html.
flowchart LR
%% ===== DATA & PREPROCESSING MODULE =====
A[Raw CICIoT23 Dataset] --> B[Data Preprocessing Module]
B --> B1[Cleaning and Normalization]
B --> B2[Encoding Categorical Features]
B --> B3[Feature Scaling]
B --> C[Preprocessed Datasets]
%% ===== MODEL TRAINING MODULE =====
C --> D[Random Forest Training Module]
D --> D1[Model Evaluation Engine]
D --> E[rf_ids_model.pkl]
B --> F[ids_scaler.pkl]
%% ===== LIVE DETECTION MODULE =====
H[New IoT Traffic CSV] --> G[Live Detection Module]
E --> G
F --> G
G --> I[Flow Classification]
I --> J[Report Generator]
J --> K1[anomalous_flows_report.csv]
J --> K2[anomalous_flows_report.html]
| Metric | Value |
|---|---|
| Accuracy | 99.76% |
| Precision (Anomaly) | 0.9990 |
| Recall (Anomaly) | 0.9985 |
| F1-Score (Anomaly) | 0.9988 |
- Fully automated ML pipeline from preprocessing to prediction
- High-accuracy Random Forest classifier
- Supports new CSV-based IoT traffic analysis
- Feature importance visualization for explainability
- Modular and reusable design
- Language: Python 3.10+
- Libraries:
pandas,numpy,scikit-learn,matplotlib,joblib - Environment: Kaggle / Jupyter Notebook / Local Python
- Outputs:
.pklmodel, metrics file, anomaly reports
If you use this dataset or reference this work, please cite:
Dataset: Sharafaldin, I., Lashkari, A. H., & Ghorbani, A. A. (2023). CICIoT2023: A realistic IoT dataset for intrusion detection research. Canadian Institute for Cybersecurity.
Alexander P.B.
Cybersecurity Researcher & Penetration Tester
Red Team & IoT Security Specialist
📧 Reach out via GitHub for research collaborations.