Phalanx SRE is an AI-powered, multi-agent operational intelligence platform inspired by the architectures of Palantir Foundry and Gotham. It fuses network telemetry, service topology, and multi-hop packet forensics into a unified operational graph for Site Reliability Engineers operating in high-stakes environments.
Unlike traditional SRE toolkits, Phalanx doesn’t just show you problems — it:
- Explains them in plain English
- Visualizes them in real time
- Recommends (or executes) the optimal response
- Predicts issues before they impact SLAs
- Unified Ontology Layer – Service, network, and business context modeled as one graph
- Operational Pipelines – Chainable transforms for metrics, traces, and captures
- Digital Twin Simulation – What-if scenarios for outages, capacity spikes, or routing changes
- Governance Framework – Role-based and trust-weighted decision approvals
- Data Fusion – Combine telemetry, packet forensics, incident history, and deploy metadata
- Entity Resolution – Merge multi-hop packet captures into a single, coherent TCP/UDP conversation
- Link & Graph Analysis – Visualize operational relationships and dependencies
- Live Investigative War Room – Real-time graph and packet ladder during incidents
- Case Journaling – AI-generated incident narratives with supporting evidence
- Access Controls & Redaction – Fine-grained visibility for sensitive data
- Multi-Hop Packet Capture & Stitching – Distributed capture agents at every hop in the path
- AI-Driven Diagnosis – ML + rules + LLM explanations for root cause
- Asymmetry & Path MTU Detection – Identify routing changes, PMTUD blackholes, and MSS mismatches
- Dynamic SLO Prioritization – Rank incidents by business impact, not just technical severity
- Trust-Weighted Remediation – Multi-agent voting before applying risky changes
- Predictive Ops Planning – Forecast SLO burn and resource needs
- Automation First – If it’s manual twice, it’s automated once
| Capability | Traditional SRE Stack | Phalanx SRE Advantage |
|---|---|---|
| Data Sources | Fragmented (metrics, logs, traces, packets all in different tools) | Unified ontology linking metrics, packets, deploys, incidents, and topology |
| Incident Diagnosis | Manual triage via multiple dashboards | AI-driven cause analysis with multi-hop packet evidence |
| Cross-Domain Correlation | Ad hoc scripting | Native entity graph + operational pipelines |
| Governance | None or tool-specific | Trust-weighted, role-aware decision engine |
| Visualization | Isolated charts | Real-time operational graph + ladder diagrams |
| Remediation | Manual or brittle scripts | Agentic automation with guardrails and approvals |
| Predictive Modeling | Limited to capacity planning | Full digital twin simulation of services and network paths |
1️⃣ Cross-Domain Incident Causality Link metrics, deploy history, packet anomalies, and topology to pinpoint root cause in seconds.
2️⃣ Trust-Weighted Automated Remediation Multi-agent voting before rolling back deploys or shifting traffic.
3️⃣ Dynamic SLO Prioritization Rank and respond based on business impact.
4️⃣ Incident War Room Live battlemap with hop-by-hop KPIs, AI explanations, and actionable insights.
5️⃣ Predictive Ops Planning Run what-if outage simulations and auto-generate playbooks.
phalanx-sre/
│
├── phalanx_agents/ # AGNTCY-powered battlefield units
│ ├── capture_unit.py # Scoped multi-hop packet capture
│ ├── stitch_unit.py # PCAP stitching + seq/ack alignment
│ ├── intel_unit.py # Runs rules, ML, and LLM diagnosis
│ ├── advisor_unit.py # Suggests/remediates with trust voting
│ └── viz_unit.py # Updates battlemap & intel panels
│
├── signal_capture/ # Network intelligence gathering
│ ├── filters.py # 5-tuple BPF filters
│ ├── pcap_rotate.sh # tcpdump/dumpcap rotation
│ └── ingest_api.py # Optional FastAPI packet ingest
│
├── intel_core/ # Brain of the platform
│ ├── features.py # Extract RTT, loss, reordering, MSS
│ ├── rules.py # PMTUD, asymmetry, think-time, congestion
│ ├── model.py # Optional ML classifier
│ └── llm_explainer.py # Plain-English incident narrative
│
├── battlemap/ # Visual command center
│ ├── ladder_diagram.py # Conversation timeline
│ ├── topology_map.py # Hop-to-hop graph with KPI heat
│ └── root_cause_panel.py # Primary cause + recommendations
│
├── campaign_scenarios/ # 5 Palantir-style SRE use cases
│ ├── 01_cross_domain_causality.md
│ ├── 02_trust_weighted_remediation.md
│ ├── 03_dynamic_slo_prioritization.md
│ ├── 04_incident_war_room.md
│ └── 05_predictive_ops_planning.md
│
├── ui/ # Human interface
│ ├── app.py # Streamlit dashboard entry
│ ├── components/ # Shared UI parts
│ └── demo_mode.py # Synthetic data for LinkedIn demo
│
├── command_structure/ # Governance & rules of engagement
│ └── trust_policy.json # Multi-agent consensus settings
│
├── examples/
│ ├── demo_topology.yaml # Sample battlefield topology
│ ├── flows.yaml # Example mission flows
│ └── sample_pcaps/ # Demo captures for instant replay
│
├── docs/
│ ├── README.md # This file
│ ├── DESIGN.md # Architecture & inspiration
│ ├── PRIVACY.md # Capture safety/redaction policy
│ └── USE_CASES.md # Expanded Palantir-style use cases
│
└── requirements.txt
- Packet Capture & Parsing:
tcpdump,dumpcap,pyshark,scapy - Visualization:
Streamlit,Plotly,PyVis,D3.js - Graph Modeling:
networkx,Neo4j(optional) - ML/AI:
LightGBMorXGBoostfor classification;LangChainorOpenAIfor LLM explanations - Automation & Agents: AGNTCY-style multi-agent orchestration
- Backend/API:
FastAPI+ WebSocket events
- Scaffold repository (✅ Done)
- Implement AI-driven multi-hop packet capture & stitching
- Add feature extraction + rules engine
- Integrate LLM-based incident explanation
- Build real-time battlemap UI
- Create demo dataset for instant GitHub/LinkedIn replay
- Implement all 5 Palantir-inspired SRE scenarios in demo mode
- Publish PoC and document results
Phalanx SRE is a proof-of-concept to show what’s possible when operational telemetry, packet forensics, and agentic AI are unified in one platform. It’s a love letter to data fusion, coordinated response, and predictive ops and a challenge to rethink how SRE should work.