Commit df17fd1

Merge pull request #14 from stratosphereips/harpo_datasets
Harpo datasets
2 parents 102e365 + 134f0a1 commit df17fd1

44 files changed: +6209 -3025 lines

alert_summary/CLAUDE.md

Lines changed: 60 additions & 1 deletion
@@ -55,9 +55,12 @@ python3 slips_dag_generator.py sample_logs/slips.log --per-analysis --include-th
 # Compact format (default) - single line per event
 python3 slips_dag_generator.py sample_logs/test_data.log --all-ips --compact
 
-# Minimal format - bullet-point summaries
+# Minimal format - bullet-point summaries (high priority events only)
 python3 slips_dag_generator.py sample_logs/test_data.log --all-ips --minimal
 
+# Comprehensive format - bullet-point summaries showing ALL evidence types
+python3 slips_dag_generator.py sample_logs/test_data.log --all-ips --comprehensive
+
 # Pattern analysis - attack phase breakdown
 python3 slips_dag_generator.py sample_logs/test_data.log --all-ips --pattern
 
@@ -81,6 +84,9 @@ python3 slips_dag_generator.py sample_logs/test_data.log --all-ips --json
 
 # Include summary statistics
 python3 slips_dag_generator.py sample_logs/test_data.log --all-ips --summary
+
+# Merge similar evidence events for cleaner output
+python3 slips_dag_generator.py sample_logs/slips.log --per-analysis --comprehensive --merge-evidence
 ```
 
 ### Common Workflows
@@ -112,6 +118,59 @@ python3 slips_dag_generator.py sample_logs/slips.log --per-analysis --compact --
 python3 slips_dag_generator.py sample_logs/slips.log --per-analysis --output alert_analysis.txt
 ```
 
+## Evidence Merging
+
+### Overview
+The `--merge-evidence` flag consolidates similar evidence events within each analysis for cleaner, more concise output.
+
+### How It Works
+- Groups similar evidence by type and target (same port for scans, same IP for connections)
+- Aggregates statistics (total targets, packets, threat levels)
+- Preserves highest threat level and confidence score
+- Shows time range and burst count for merged events
+
+### Examples
+
+#### Without Merging (verbose output)
+```bash
+python3 slips_dag_generator.py sample_logs/slips.log --per-analysis --comprehensive
+• 23:27 - Port scans: 443/TCP→45, 443/TCP→30, 80/TCP→5, 80/TCP→15, 443/TCP→20, 443/TCP→5, 80/TCP→30, 8080/TCP→6 (total: 156 hosts)
+• Evidence: 11 events in analysis
+```
+
+#### With Merging (consolidated output)
+```bash
+python3 slips_dag_generator.py sample_logs/slips.log --per-analysis --comprehensive --merge-evidence
+• 23:27 - Port scans: 8080/TCP→6, 80→50, 443→100 (total: 156 hosts)
+• Evidence: 6 events in analysis
+```
+
+### Benefits
+- **Cleaner Output**: Reduces clutter from repetitive similar events
+- **Better Pattern Recognition**: Easier to identify attack strategies
+- **Preserved Accuracy**: All statistical data is maintained through aggregation
+- **Configurable**: Optional feature that preserves existing behavior when disabled
+
+### Merging Rules
+- **Port Scans**: Grouped by port/protocol, targets and packets summed
+- **C&C Channels**: Grouped by destination IP, connection attempts counted
+- **Blacklisted IPs**: Grouped by destination IP, attempts aggregated
+- **Private IP Connections**: Grouped by destination IP and port
+- **Other Events**: Grouped by exact event type
+
+### Compatible Modes
+Works with all output formats and analysis modes:
+```bash
+# IP-based analysis with merging
+python3 slips_dag_generator.py sample_logs/slips.log 192.168.1.113 --minimal --merge-evidence
+
+# Per-analysis mode with merging
+python3 slips_dag_generator.py sample_logs/slips.log --per-analysis --pattern --merge-evidence
+
+# All IPs with merging
+python3 slips_dag_generator.py sample_logs/slips.log --all-ips --compact --merge-evidence
+```
+
 ## Log Format Support
 
 The parser handles multiple Slips log formats:
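The merging rules added above amount to a group-by-key aggregation. Below is a minimal sketch of that idea in Python; it is not the tool's actual implementation, and every field name (`type`, `port`, `proto`, `dst_ip`, `targets`, `packets`, `threat_level`, `confidence`, `timestamp`) is an assumption for illustration, with threat levels assumed numeric.

```python
from collections import defaultdict

def merge_evidence(events):
    """Group similar evidence events and aggregate their statistics."""
    groups = defaultdict(list)
    for ev in events:
        if ev["type"] == "port_scan":
            key = ("port_scan", ev["port"], ev["proto"])   # group by port/protocol
        elif ev["type"] in ("cc_channel", "blacklisted_ip"):
            key = (ev["type"], ev["dst_ip"])               # group by destination IP
        elif ev["type"] == "private_ip_connection":
            key = (ev["type"], ev["dst_ip"], ev["port"])   # group by destination IP and port
        else:
            key = (ev["type"],)                            # fall back to exact event type
        groups[key].append(ev)

    merged = []
    for key, evs in groups.items():
        merged.append({
            "type": key[0],
            "burst_count": len(evs),                              # how many events were merged
            "targets": sum(e.get("targets", 0) for e in evs),     # summed across the group
            "packets": sum(e.get("packets", 0) for e in evs),
            "threat_level": max(e["threat_level"] for e in evs),  # keep the highest
            "confidence": max(e["confidence"] for e in evs),
            "time_range": (min(e["timestamp"] for e in evs),
                           max(e["timestamp"] for e in evs)),
        })
    return merged
```

Each merged group can then be rendered as one consolidated bullet, as in the `--merge-evidence` example output above.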

alert_summary/DATASET_REPORT.md

Lines changed: 168 additions & 0 deletions
# Network Event Summarization Dataset for Slips IDS

## Table of Contents

- [1. Task description](#1-task-description)
- [2. Limitations](#2-limitations)
  - [Hardware Constraints](#hardware-constraints)
  - [Scope Constraints](#scope-constraints)
- [3. Dataset Generation Workflow](#3-dataset-generation-workflow)
  - [Stage 1: Incident Sampling](#stage-1-incident-sampling)
  - [Stage 2: Structural Analysis](#stage-2-structural-analysis)
  - [Stage 3: Multi-Model LLM Analysis](#stage-3-multi-model-llm-analysis)
  - [Stage 4: Dataset Correlation](#stage-4-dataset-correlation)
  - [Dataset Extension](#dataset-extension)
  - [Workflow Diagram](#workflow-diagram)
  - [Event Grouping Strategy](#event-grouping-strategy)
  - [Additional Optimizations](#additional-optimizations)
  - [Dataset Structure](#dataset-structure)

## 1. Task description

Develop a dataset for network security event summarization, to be integrated with the Slips Immune system and optimized for deployment on low-resource hardware such as the Raspberry Pi 5. The dataset will be used to fine-tune compact language models that generate concise, actionable summaries of security incidents from raw Slips alert data, enabling real-time threat analysis in resource-constrained environments.

## 2. Limitations

### Hardware Constraints
- **Platform**: Raspberry Pi 5 with limited RAM and processing power
- **Model Size**: Only small language models (1.5B-3B parameters) are viable on the target hardware
- **Real-time Processing**: Hitting the target of 10-15 seconds per incident on the RPi5 with Ollama requires aggressive token optimization

### Scope Constraints
- **Alert Format**: Analysis is currently limited to the Slips alert format; generalizing to other IDS outputs requires format adaptation
- **Token Budget**: Input and output tokens must be minimized to enable real-time inference on resource-constrained hardware (~2000 tokens max)
- **Output Constraints**: Summaries must be concise (150-300 tokens) while maintaining security context

## 3. Dataset Generation Workflow

The dataset generation process consists of four stages, each implemented as a Python script with a shell wrapper that simplifies execution, handles argument validation, and automates file naming. This modular design enables flexible experimentation with different models and configurations while maintaining reproducibility.

**Detailed documentation**: See [README_dataset_workflow.md](README_dataset_workflow.md) for complete pipeline specifications and advanced usage.

### Stage 1: Incident Sampling
Extract security incidents from Slips `alerts.json` logs with category labels (Malware/Normal):

```bash
./sample_dataset.sh 20 my_dataset --category malware --seed 42
```

**Output**: `my_dataset.jsonl` (JSONL format with incidents and events)

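For orientation, one line of `my_dataset.jsonl` might look like the sketch below. The field names follow the metadata listed later under "Dataset Structure", but they are illustrative assumptions, not the verified schema.

```python
# Hypothetical shape of a single my_dataset.jsonl line (one incident per line).
# All field names and the ID convention are illustrative assumptions.
sample_incident = {
    "incident_id": "2024-06-01_192.168.1.113_tw3",  # assumed ID convention
    "category": "malware",                          # sampling label: Malware/Normal
    "source_ip": "192.168.1.113",
    "timewindow": 3,
    "threat_level": "high",
    "events": [                                     # raw Slips alert events
        {"timestamp": "2024-06-01T23:27:00", "description": "Port scan on 443/TCP"},
        # ...remaining events for this incident
    ],
}
```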
### Stage 2: Structural Analysis
Generate a DAG-based chronological analysis of incident events:

```bash
./generate_dag_analysis.sh my_dataset.jsonl
```

**Output**: `my_dataset.dag.json` (incident metadata + event timeline)

### Stage 3: Multi-Model LLM Analysis
Query multiple language models with optimized prompts:

```bash
# GPT-4o-mini (baseline)
./generate_llm_analysis.sh my_dataset.jsonl --model gpt-4o-mini \
    --group-events --behavior-analysis

# Qwen2.5:3b (target model)
./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:3b \
    --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis

# Qwen2.5:1.5b (minimal model)
./generate_llm_analysis.sh my_dataset.jsonl --model qwen2.5:1.5b \
    --base-url http://10.147.20.102:11434/v1 --group-events --behavior-analysis
```

**Outputs**: Model-specific JSON files with `summary` and `behavior_analysis` fields

### Stage 4: Dataset Correlation
Merge all analyses into a unified dataset keyed by incident ID:

```bash
python3 correlate_incidents.py my_dataset.*.json \
    --jsonl my_dataset.jsonl -o final_dataset.json
```

**Output**: `final_dataset.json` (consolidated dataset with all analyses)

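Conceptually, this stage is a join on the incident ID. A minimal sketch of that idea follows; the real `correlate_incidents.py` may use different field names and options.

```python
import glob  # used in the example call below
import json

def correlate(analysis_paths, jsonl_path):
    """Merge per-model analysis files into one record per incident,
    joined on an assumed 'incident_id' key."""
    dataset = {}
    # Seed records from the sampled incidents (one JSON object per line).
    with open(jsonl_path) as f:
        for line in f:
            incident = json.loads(line)
            dataset[incident["incident_id"]] = incident
    # Fold each analysis file (DAG output, per-model LLM output) into its incident.
    for path in analysis_paths:
        with open(path) as f:
            for record in json.load(f):
                dataset.setdefault(record["incident_id"], {}).update(record)
    return list(dataset.values())

# Mirrors the shell command above:
# merged = correlate(glob.glob("my_dataset.*.json"), "my_dataset.jsonl")
```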
### Dataset Extension

To expand existing datasets without regeneration, use `merge_datasets.py` to combine multiple correlated datasets with automatic deduplication:

```bash
# Generate new samples with a different seed
./sample_dataset.sh 20 extension --category malware --seed 99

# Run the full analysis pipeline on the extension
./generate_dag_analysis.sh extension.jsonl
./generate_llm_analysis.sh extension.jsonl --model qwen2.5:3b --group-events --behavior-analysis

# Correlate the extension data
python3 correlate_incidents.py extension.*.json --jsonl extension.jsonl -o extension_dataset.json

# Merge with the existing dataset (removes duplicates by incident_id)
python3 merge_datasets.py final_dataset.json extension_dataset.json -o final_dataset_v2.json
```

This approach enables incremental dataset growth while maintaining consistency across all analysis fields.

### Workflow Diagram

```
Raw Slips Logs (alerts.json)

[sample_dataset.py] → incidents.jsonl

 ├─→ [alert_dag_parser.py] → incidents.dag.json
 ├─→ [alert_dag_parser_llm.py + GPT-4o-mini] → incidents.llm.gpt-4o-mini.json
 ├─→ [alert_dag_parser_llm.py + Qwen2.5:3b] → incidents.llm.qwen2.5.json
 └─→ [alert_dag_parser_llm.py + Qwen2.5:1.5b] → incidents.llm.qwen2.5.1.5b.json

[correlate_incidents.py] → final_dataset.json
```

### Event Grouping Strategy

The `--group-events` optimization reduces token count through pattern normalization; a code sketch of the normalization and grouping follows this list:

1. **Pattern Normalization**: Replaces variable components in event descriptions with placeholders
   - IPv4 addresses → `<IP>`
   - Port numbers → `<PORT>` (handles formats: `443/TCP`, `port: 80`)
   - Standalone numbers → `<NUM>`

2. **Pattern-Based Grouping**: Groups events with identical normalized patterns
   - Example: "Connection to 192.168.1.5:443" + "Connection to 10.0.2.15:443" → single pattern "Connection to `<IP>`:`<PORT>`"
   - Preserves count, time range, and sample values (first 5 unique IPs/ports) per group

3. **Token Reduction**:
   - 103 events: 3,522 → 976 tokens (72% reduction)
   - 4,604 events: ~50,000 → 1,897 tokens (96% reduction)

4. **Information Loss Analysis**:
   - **Lost**: Individual timestamps (only ranges), complete IP/port lists (max 5 samples), exact event sequence, duplicate frequency tracking
   - **Retained**: Semantic patterns, event counts, representative samples, temporal context, protocol details, attack patterns
   - **Impact**: Small incidents lose ~28% of tokens; large incidents lose ~90-95%, mostly repetitive data
   - **Justification**: Enables LLM summarization on the RPi5; the alternative is being unable to process large incidents at all

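A compact sketch of the normalization-and-grouping idea, assuming each event carries `description` and `timestamp` fields; the exact placeholder rules in the real `--group-events` implementation may differ:

```python
import re

# Order matters: IPs are replaced first so their octets are not caught by <NUM>.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PORT_RE = re.compile(r"\b\d+/(?:TCP|UDP)\b|(?<=port: )\d+|(?<=:)\d+\b")
NUM_RE = re.compile(r"\b\d+\b")

def normalize(description):
    """Collapse variable components of an event description into placeholders."""
    s = IP_RE.sub("<IP>", description)
    s = PORT_RE.sub("<PORT>", s)
    return NUM_RE.sub("<NUM>", s)

def group_events(events, max_samples=5):
    """Group events by normalized pattern, keeping the count, the time range,
    and up to max_samples sample descriptions per pattern."""
    groups = {}
    for ev in events:
        g = groups.setdefault(normalize(ev["description"]), {
            "count": 0, "samples": [],
            "first": ev["timestamp"], "last": ev["timestamp"],
        })
        g["count"] += 1
        g["first"] = min(g["first"], ev["timestamp"])
        g["last"] = max(g["last"], ev["timestamp"])
        if ev["description"] not in g["samples"] and len(g["samples"]) < max_samples:
            g["samples"].append(ev["description"])
    return groups

# "Connection to 192.168.1.5:443" and "Connection to 10.0.2.15:443"
# both normalize to "Connection to <IP>:<PORT>" and land in the same group.
```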
### Additional Optimizations

**Dual-Prompt Analysis** (`--behavior-analysis`): Generates both severity-filtered summaries and structured technical flow analysis, providing richer training signals for model fine-tuning.

**Severity Filtering Strategy**: The dual-prompt approach filters evidence by severity to manage token budgets (sketched after this list):
- Prioritizes high-threat evidence in summaries for focused incident assessment
- May omit low-confidence events to reduce token consumption
- Is balanced by generating both severity-filtered summaries and comprehensive behavior analysis
- Trade-off: summaries stay concise enough for resource-constrained deployment, while the behavior analysis preserves complete incident coverage

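A minimal sketch of that filtering step, assuming ranked threat-level labels and a per-event confidence field; the labels and thresholds are illustrative, not the pipeline's actual defaults:

```python
# Assumed ranking of threat-level labels so they can be compared.
THREAT_ORDER = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}

def filter_for_summary(events, min_threat="medium", min_confidence=0.5):
    """Keep high-threat, high-confidence evidence for the summary prompt;
    the behavior-analysis prompt still sees the full event list."""
    return [
        ev for ev in events
        if THREAT_ORDER.get(ev["threat_level"], 0) >= THREAT_ORDER[min_threat]
        and ev.get("confidence", 1.0) >= min_confidence
    ]
```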
**Multi-Model Evaluation**: Compares GPT-4o (quality baseline), GPT-4o-mini, Qwen2.5:3b (target deployment), and Qwen2.5:1.5b (minimal viable model) to assess performance-resource trade-offs.

### Dataset Structure

Each incident in the final dataset contains the following fields; a hypothetical record shape is sketched after this list:
- **Metadata**: incident_id, category, source_ip, timewindow, threat_level
- **DAG Analysis**: Chronological event timeline with threat scores
- **LLM Summaries**: Model-specific severity assessments
- **Behavior Analysis**: Structured network flow descriptions

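Putting these pieces together, one consolidated entry might look like the sketch below; the key names are assumptions derived from the fields listed above, not the verified schema.

```python
# Hypothetical shape of one final_dataset.json entry (key names assumed).
consolidated_incident = {
    # Metadata
    "incident_id": "2024-06-01_192.168.1.113_tw3",
    "category": "malware",
    "source_ip": "192.168.1.113",
    "timewindow": 3,
    "threat_level": "high",
    # DAG analysis: chronological event timeline with threat scores
    "dag_analysis": [
        {"time": "23:27", "event": "Port scan 443/TCP", "threat_score": 0.8},
    ],
    # Model-specific severity summaries
    "summaries": {
        "gpt-4o-mini": "High-severity port scanning against 156 hosts ...",
        "qwen2.5:3b": "...",
        "qwen2.5:1.5b": "...",
    },
    # Structured network flow descriptions
    "behavior_analysis": {
        "gpt-4o-mini": "...",
    },
}
```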
Token efficiency enables deployment on the Raspberry Pi 5 while maintaining security analysis quality suitable for real-time intrusion detection.
