| 1 | +# Network Event Cause & Risk Analysis Dataset for Slips IDS |
| 2 | + |
| 3 | +## Table of Contents |
| 4 | + |
| 5 | +- [1. Task Description](#1-task-description) |
| 6 | +- [2. Relationship to Summarization Workflow](#2-relationship-to-summarization-workflow) |
| 7 | +- [3. Dataset Generation Workflow](#3-dataset-generation-workflow) |
| 8 | + - [Workflow Overview](#workflow-overview) |
| 9 | + - [Stage 3: Multi-Model Cause & Risk Analysis](#stage-3-multi-model-cause--risk-analysis) |
| 10 | + - [Stage 4: Dataset Correlation](#stage-4-dataset-correlation) |
| 11 | + - [Dataset Structure](#dataset-structure) |
| 12 | +- [4. Use Cases and Applications](#4-use-cases-and-applications) |
| 13 | + |
| 14 | +## 1. Task Description |
| 15 | + |
| 16 | +Develop a dataset for **root cause analysis and risk assessment** of network security incidents from Slips IDS alerts. This complementary workflow focuses on structured security analysis rather than event summarization, providing: |
| 17 | + |
| 18 | +1. **Cause Analysis** - Categorized incident attribution (Malicious Activity / Legitimate Activity / Misconfigurations) |
| 19 | +2. **Risk Assessment** - Structured evaluation (Risk Level / Justification / Business Impact / Likelihood of Malicious Activity / Investigation Priority) |
| 20 | + |
| 21 | +**Target Deployment**: Same hardware constraints as the [summarization workflow](DATASET_REPORT.md#2-limitations) (Raspberry Pi 5, 1.5B-3B parameter models). |
| 22 | + |
| 23 | +## 2. Relationship to Summarization Workflow |
| 24 | + |
| 25 | +Both workflows share identical **Stages 1-2** (incident sampling and DAG generation) but diverge in the LLM analysis and correlation stages (Stages 3-4): |
| 26 | + |
| 27 | +| Aspect | Summarization Workflow | Risk Analysis Workflow | |
| 28 | +|--------|------------------------|------------------------| |
| 29 | +| **Documentation** | [DATASET_REPORT.md](DATASET_REPORT.md) | This document | |
| 30 | +| **Detailed Guide** | [README_dataset_summary_workflow.md](README_dataset_summary_workflow.md) | [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md) | |
| 31 | +| **Analysis Script** | `generate_llm_analysis.sh` | `generate_cause_risk_analysis.sh` | |
| 32 | +| **Correlation Script** | `correlate_incidents.py` | `correlate_risks.py` | |
| 33 | +| **Output Fields** | `summary` + `behavior_analysis` | `cause_analysis` + `risk_assessment` | |
| 34 | +| **LLM Prompts** | 2 per incident (event summarization + behavior patterns) | 2 per incident (cause attribution + risk scoring) | |
| 35 | +| **Primary Use Case** | Incident timeline reconstruction, behavior pattern identification | Root cause analysis, threat prioritization, SOC decision support | |
| 36 | + |
| 37 | +**Recommendation**: Generate both datasets from the same sampled incidents to enable comparative analysis and multi-task model training. |
| 38 | + |
| 39 | +## 3. Dataset Generation Workflow |
| 40 | + |
| 41 | +### Workflow Overview |
| 42 | + |
| 43 | +**Stages 1-2** (Sampling + DAG): See [DATASET_REPORT.md §3](DATASET_REPORT.md#3-dataset-generation-workflow) - identical to summarization workflow. |
| 44 | + |
| 45 | +**Quick commands:** |
| 46 | +```bash |
| 47 | +# Stage 1: Sample 100 incidents |
| 48 | +./sample_dataset.sh 100 my_dataset --seed 42 |
| 49 | + |
| 50 | +# Stage 2: Generate DAG analysis |
| 51 | +./generate_dag_analysis.sh datasets/my_dataset.jsonl |
| 52 | +``` |
| 53 | + |
| 54 | +### Stage 3: Multi-Model Cause & Risk Analysis |
| 55 | + |
| 56 | +Query each LLM with two prompts per incident, one for cause attribution and one for risk assessment: |
| 57 | + |
| 58 | +```bash |
| 59 | +# GPT-4o-mini (recommended baseline) |
| 60 | +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ |
| 61 | + --model gpt-4o-mini --group-events |
| 62 | + |
| 63 | +# Qwen2.5:3b (target deployment model) |
| 64 | +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl \ |
| 65 | + --model qwen2.5:3b \ |
| 66 | + --base-url http://10.147.20.102:11434/v1 --group-events |
| 67 | +``` |
| 68 | + |
| 69 | +**Output Structure** (per incident): |
| 70 | +```json |
| 71 | +{ |
| 72 | + "cause_analysis": "**Possible Causes:**\n\n**1. Malicious Activity:**\n• Port scanning indicates reconnaissance...\n\n**2. Legitimate Activity:**\n• Could be network monitoring tools...\n\n**3. Misconfigurations:**\n• Firewall allowing unrestricted scanning...\n\n**Conclusion:** Most likely malicious reconnaissance activity.", |
| 73 | + |
| 74 | + "risk_assessment": "**Risk Level:** High\n\n**Justification:** Active scanning + C2 connections...\n\n**Business Impact:** Potential data breach or service disruption...\n\n**Likelihood of Malicious Activity:** High - Systematic attack pattern...\n\n**Investigation Priority:** Immediate - Block source IP and investigate." |
| 75 | +} |
| 76 | +``` |
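
Because both output fields are Markdown strings rather than nested JSON, downstream tooling has to recover the individual risk fields from the text. The following is a minimal sketch of one way to do that, assuming the model keeps the `**Label:** value` layout shown in the example above. The function name `parse_risk_assessment` and the regular expression are illustrative and are not part of the repository scripts.

```python
import re

# Field labels copied from the example output; real model output may deviate.
RISK_FIELDS = (
    "Risk Level",
    "Justification",
    "Business Impact",
    "Likelihood of Malicious Activity",
    "Investigation Priority",
)

# Matches "**Label:** value", where value runs until the next bold label or end of text.
_FIELD_RE = re.compile(
    r"\*\*([^*:]+):\*\*[ \t]*(.*?)(?=\n\s*\*\*[^*:]+:\*\*|\Z)",
    re.DOTALL,
)


def parse_risk_assessment(text: str) -> dict[str, str]:
    """Split a risk_assessment string into a {label: value} mapping.

    Labels that the model omitted are simply absent from the result.
    """
    found = {label.strip(): value.strip() for label, value in _FIELD_RE.findall(text)}
    return {label: found[label] for label in RISK_FIELDS if label in found}
```

Smaller models may drop or rename labels, so callers should treat a missing key as a signal to inspect the raw text instead of assuming a default.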
| 77 | + |
| 78 | +### Stage 4: Dataset Correlation |
| 79 | + |
| 80 | +Merge all analyses (DAG + LLM cause/risk assessments) by incident ID: |
| 81 | + |
| 82 | +```bash |
| 83 | +python3 correlate_risks.py datasets/my_dataset.*.json \ |
| 84 | + --jsonl datasets/my_dataset.jsonl \ |
| 85 | + -o datasets/final_dataset_risk.json |
| 86 | +``` |
| 87 | + |
| 88 | +### Dataset Structure |
| 89 | + |
| 90 | +The final output contains the merged analyses, with one `cause_risk_*` entry per model: |
| 91 | + |
| 92 | +```json |
| 93 | +{ |
| 94 | + "total_incidents": 100, |
| 95 | + "incidents": [ |
| 96 | + { |
| 97 | + "incident_id": "uuid", |
| 98 | + "category": "Malware", |
| 99 | + "source_ip": "192.168.1.113", |
| 100 | + "timewindow": "5", |
| 101 | + "timeline": "2024-04-05 16:53:07 to 16:53:50", |
| 102 | + "threat_level": 15.36, |
| 103 | + "event_count": 4604, |
| 104 | + "dag_analysis": "• 16:53 - 222 horizontal port scans [HIGH]\n...", |
| 105 | + "cause_risk_gpt_4o_mini": { |
| 106 | + "cause_analysis": "**1. Malicious Activity:** Reconnaissance scanning...", |
| 107 | + "risk_assessment": "**Risk Level:** High\n**Justification:**..." |
| 108 | + }, |
| 109 | + "cause_risk_gpt_4o": { ... }, |
| 110 | + "cause_risk_qwen2_5": { ... } |
| 111 | + } |
| 112 | + ] |
| 113 | +} |
| 114 | +``` |
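
Since each incident carries one `cause_risk_*` entry per model, a natural first check is whether the models agree on the risk level. The sketch below assumes the structure shown above and that every `risk_assessment` string contains a `**Risk Level:**` label; the helper names (`load_dataset`, `risk_levels_by_model`, `disagreements`) are hypothetical.

```python
import json
import re

# Assumes the "**Risk Level:** <word>" layout from the example output.
_RISK_LEVEL_RE = re.compile(r"\*\*Risk Level:\*\*[ \t]*([A-Za-z]+)")


def load_dataset(path: str) -> dict:
    """Read a correlated dataset such as datasets/final_dataset_risk.json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def risk_levels_by_model(incident: dict) -> dict[str, str]:
    """Map each cause_risk_* entry of one incident to its stated risk level."""
    levels = {}
    for key, value in incident.items():
        if not key.startswith("cause_risk_") or not isinstance(value, dict):
            continue
        match = _RISK_LEVEL_RE.search(value.get("risk_assessment") or "")
        levels[key.removeprefix("cause_risk_")] = match.group(1) if match else "unknown"
    return levels


def disagreements(dataset: dict) -> list[str]:
    """Return the IDs of incidents whose models report different risk levels."""
    flagged = []
    for incident in dataset.get("incidents", []):
        if len(set(risk_levels_by_model(incident).values())) > 1:
            flagged.append(incident["incident_id"])
    return flagged
```

Calling `disagreements(load_dataset("datasets/final_dataset_risk.json"))` would then list the incidents worth reviewing by hand before they are used as training targets.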
| 115 | + |
| 116 | +**Key differences from summarization dataset**: |
| 117 | +- `cause_risk_*` fields replace `llm_*` fields |
| 118 | +- Structured 3-category cause analysis (vs. free-form summary) |
| 119 | +- 5-field risk assessment framework (vs. behavior flow description) |
| 120 | + |
| 121 | +## 4. Use Cases and Applications |
| 122 | + |
| 123 | +### Security Operations Center (SOC) |
| 124 | +- **Automated Triage**: Risk level + investigation priority for alert queue sorting |
| 125 | +- **Incident Attribution**: Distinguish malicious attacks from misconfigurations |
| 126 | +- **Resource Allocation**: Business impact assessment for team assignments |
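
As an illustration of the automated triage use case, the sketch below orders an alert queue by investigation priority and then by risk level. It assumes both values have already been extracted into plain fields named `priority` and `risk_level`, and the ranking vocabularies are assumptions: this document only shows "High" and "Immediate" as example values.

```python
# Assumed vocabularies; adjust them to the labels the models actually emit.
PRIORITY_ORDER = {"immediate": 0, "high": 1, "medium": 2, "low": 3}
RISK_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def triage_key(alert: dict) -> tuple[int, int]:
    """Sort key: investigation priority first, risk level second.

    Unrecognized or missing labels rank after every known label.
    """
    priority = PRIORITY_ORDER.get(str(alert.get("priority", "")).lower(), len(PRIORITY_ORDER))
    risk = RISK_ORDER.get(str(alert.get("risk_level", "")).lower(), len(RISK_ORDER))
    return (priority, risk)


def sort_queue(alerts: list[dict]) -> list[dict]:
    """Return a new list with the most urgent alerts first (stable sort)."""
    return sorted(alerts, key=triage_key)
```

Placing unknown labels last keeps malformed model output from jumping the queue, although a production setup would probably route such alerts to manual review.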
| 127 | + |
| 128 | +### Model Training Applications |
| 129 | +- **Classification Tasks**: Train models to categorize incidents (malicious/legitimate/misconfiguration) |
| 130 | +- **Risk Scoring**: Fine-tune models for threat level prediction |
| 131 | +- **Decision Support**: Generate actionable recommendations (block/monitor/investigate) |
| 132 | + |
| 133 | +### Dataset Comparison |
| 134 | +Use both workflows together: |
| 135 | +- **Summarization**: "What happened?" (temporal sequences, behavior patterns) |
| 136 | +- **Risk Analysis**: "Why did it happen?" + "How urgent?" (attribution, prioritization) |
| 137 | + |
| 138 | +**Combined Training Strategy**: |
| 139 | +```bash |
| 140 | +# Generate both datasets from same incidents |
| 141 | +./generate_llm_analysis.sh datasets/my_dataset.jsonl --model qwen2.5:3b --group-events --behavior-analysis |
| 142 | +./generate_cause_risk_analysis.sh datasets/my_dataset.jsonl --model qwen2.5:3b --group-events |
| 143 | + |
| 144 | +# Correlate separately |
| 145 | +python3 correlate_incidents.py datasets/my_dataset.*.json --jsonl datasets/my_dataset.jsonl -o summary_dataset.json |
| 146 | +python3 correlate_risks.py datasets/my_dataset.*.json --jsonl datasets/my_dataset.jsonl -o risk_dataset.json |
| 147 | + |
| 148 | +# Multi-task training: merge the two datasets and train a single model on both tasks |
| 149 | +``` |
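
The final merge step is left open above. One possible approach, sketched here under assumptions, is to join `summary_dataset.json` and `risk_dataset.json` on `incident_id`. The sketch assumes both files share the top-level `incidents` list and that the summarization fields use an `llm_` key prefix; `merge_datasets` is a hypothetical helper, not a script in the repository.

```python
def merge_datasets(summary: dict, risk: dict) -> dict:
    """Join two correlated datasets on incident_id, keeping only shared incidents."""
    summary_by_id = {inc["incident_id"]: inc for inc in summary.get("incidents", [])}
    merged = []
    for risk_inc in risk.get("incidents", []):
        summary_inc = summary_by_id.get(risk_inc["incident_id"])
        if summary_inc is None:
            continue  # incident is missing from the summarization dataset
        combined = dict(risk_inc)  # shallow copy; the input datasets are not modified
        # Carry over the model-specific summarization fields (assumed llm_* prefix).
        combined.update({k: v for k, v in summary_inc.items() if k.startswith("llm_")})
        merged.append(combined)
    return {"total_incidents": len(merged), "incidents": merged}
```

Each merged incident then holds the shared metadata, the `cause_risk_*` entries, and the `llm_*` entries side by side, which makes it straightforward to emit one training example per task from the same record.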
| 150 | + |
| 151 | +--- |
| 152 | + |
| 153 | +**For detailed implementation**: See [README_dataset_risk_workflow.md](README_dataset_risk_workflow.md) |
| 154 | +**For workflow comparison**: See [WORKFLOWS_OVERVIEW.md](WORKFLOWS_OVERVIEW.md) (if available) |
| 155 | +**For evaluation methods**: See [LLM_EVALUATION_GUIDE.md](LLM_EVALUATION_GUIDE.md) |