Skip to content

Phalanx - Coordinated intelligence for the modern SRE battlefield. Palantir-inspired (Foundry/Gotham) open PoC that fuses service topology, multi-hop packet forensics, and agentic AI into a single operational platform.

Notifications You must be signed in to change notification settings

keewillidevnet/phalanx-sre

Repository files navigation

🛡️ Phalanx SRE – Coordinated Intelligence for the Modern SRE Battlefield

Phalanx SRE is an AI-powered, multi-agent operational intelligence platform inspired by the architectures of Palantir Foundry and Gotham. It fuses network telemetry, service topology, and multi-hop packet forensics into a unified operational graph for Site Reliability Engineers operating in high-stakes environments.

Unlike traditional SRE toolkits, Phalanx doesn’t just show you problems — it:

  • Explains them in plain English
  • Visualizes them in real time
  • Recommends (or executes) the optimal response
  • Predicts issues before they impact SLAs

Planned Capabilities

Foundry-Inspired

  • Unified Ontology Layer – Service, network, and business context modeled as one graph
  • Operational Pipelines – Chainable transforms for metrics, traces, and captures
  • Digital Twin Simulation – What-if scenarios for outages, capacity spikes, or routing changes
  • Governance Framework – Role-based and trust-weighted decision approvals
  • Data Fusion – Combine telemetry, packet forensics, incident history, and deploy metadata

Gotham-Inspired

  • Entity Resolution – Merge multi-hop packet captures into a single, coherent TCP/UDP conversation
  • Link & Graph Analysis – Visualize operational relationships and dependencies
  • Live Investigative War Room – Real-time graph and packet ladder during incidents
  • Case Journaling – AI-generated incident narratives with supporting evidence
  • Access Controls & Redaction – Fine-grained visibility for sensitive data

Core SRE Intelligence

  • Multi-Hop Packet Capture & Stitching – Distributed capture agents at every hop in the path
  • AI-Driven Diagnosis – ML + rules + LLM explanations for root cause
  • Asymmetry & Path MTU Detection – Identify routing changes, PMTUD blackholes, and MSS mismatches
  • Dynamic SLO Prioritization – Rank incidents by business impact, not just technical severity
  • Trust-Weighted Remediation – Multi-agent voting before applying risky changes
  • Predictive Ops Planning – Forecast SLO burn and resource needs
  • Automation First – If it’s manual twice, it’s automated once

Why Phalanx SRE Surpasses Traditional Toolkits

Capability Traditional SRE Stack Phalanx SRE Advantage
Data Sources Fragmented (metrics, logs, traces, packets all in different tools) Unified ontology linking metrics, packets, deploys, incidents, and topology
Incident Diagnosis Manual triage via multiple dashboards AI-driven cause analysis with multi-hop packet evidence
Cross-Domain Correlation Ad hoc scripting Native entity graph + operational pipelines
Governance None or tool-specific Trust-weighted, role-aware decision engine
Visualization Isolated charts Real-time operational graph + ladder diagrams
Remediation Manual or brittle scripts Agentic automation with guardrails and approvals
Predictive Modeling Limited to capacity planning Full digital twin simulation of services and network paths

5 Palantir-Inspired SRE Use Cases

1️⃣ Cross-Domain Incident Causality Link metrics, deploy history, packet anomalies, and topology to pinpoint root cause in seconds.

2️⃣ Trust-Weighted Automated Remediation Multi-agent voting before rolling back deploys or shifting traffic.

3️⃣ Dynamic SLO Prioritization Rank and respond based on business impact.

4️⃣ Incident War Room Live battlemap with hop-by-hop KPIs, AI explanations, and actionable insights.

5️⃣ Predictive Ops Planning Run what-if outage simulations and auto-generate playbooks.


Repository Structure

phalanx-sre/
│
├── phalanx_agents/                 # AGNTCY-powered battlefield units
│   ├── capture_unit.py              # Scoped multi-hop packet capture
│   ├── stitch_unit.py               # PCAP stitching + seq/ack alignment
│   ├── intel_unit.py                # Runs rules, ML, and LLM diagnosis
│   ├── advisor_unit.py              # Suggests/remediates with trust voting
│   └── viz_unit.py                  # Updates battlemap & intel panels
│
├── signal_capture/                  # Network intelligence gathering
│   ├── filters.py                   # 5-tuple BPF filters
│   ├── pcap_rotate.sh               # tcpdump/dumpcap rotation
│   └── ingest_api.py                 # Optional FastAPI packet ingest
│
├── intel_core/                      # Brain of the platform
│   ├── features.py                  # Extract RTT, loss, reordering, MSS
│   ├── rules.py                     # PMTUD, asymmetry, think-time, congestion
│   ├── model.py                     # Optional ML classifier
│   └── llm_explainer.py              # Plain-English incident narrative
│
├── battlemap/                       # Visual command center
│   ├── ladder_diagram.py            # Conversation timeline
│   ├── topology_map.py              # Hop-to-hop graph with KPI heat
│   └── root_cause_panel.py          # Primary cause + recommendations
│
├── campaign_scenarios/              # 5 Palantir-style SRE use cases
│   ├── 01_cross_domain_causality.md
│   ├── 02_trust_weighted_remediation.md
│   ├── 03_dynamic_slo_prioritization.md
│   ├── 04_incident_war_room.md
│   └── 05_predictive_ops_planning.md
│
├── ui/                              # Human interface
│   ├── app.py                       # Streamlit dashboard entry
│   ├── components/                  # Shared UI parts
│   └── demo_mode.py                  # Synthetic data for LinkedIn demo
│
├── command_structure/               # Governance & rules of engagement
│   └── trust_policy.json             # Multi-agent consensus settings
│
├── examples/
│   ├── demo_topology.yaml            # Sample battlefield topology
│   ├── flows.yaml                    # Example mission flows
│   └── sample_pcaps/                 # Demo captures for instant replay
│
├── docs/
│   ├── README.md                     # This file
│   ├── DESIGN.md                     # Architecture & inspiration
│   ├── PRIVACY.md                    # Capture safety/redaction policy
│   └── USE_CASES.md                  # Expanded Palantir-style use cases
│
└── requirements.txt

Technology Stack (Planned)

  • Packet Capture & Parsing: tcpdump, dumpcap, pyshark, scapy
  • Visualization: Streamlit, Plotly, PyVis, D3.js
  • Graph Modeling: networkx, Neo4j (optional)
  • ML/AI: LightGBM or XGBoost for classification; LangChain or OpenAI for LLM explanations
  • Automation & Agents: AGNTCY-style multi-agent orchestration
  • Backend/API: FastAPI + WebSocket events

Roadmap to PoC

  • Scaffold repository (✅ Done)
  • Implement AI-driven multi-hop packet capture & stitching
  • Add feature extraction + rules engine
  • Integrate LLM-based incident explanation
  • Build real-time battlemap UI
  • Create demo dataset for instant GitHub/LinkedIn replay
  • Implement all 5 Palantir-inspired SRE scenarios in demo mode
  • Publish PoC and document results

Call to Action

Phalanx SRE is a proof-of-concept to show what’s possible when operational telemetry, packet forensics, and agentic AI are unified in one platform. It’s a love letter to data fusion, coordinated response, and predictive ops and a challenge to rethink how SRE should work.

About

Phalanx - Coordinated intelligence for the modern SRE battlefield. Palantir-inspired (Foundry/Gotham) open PoC that fuses service topology, multi-hop packet forensics, and agentic AI into a single operational platform.

Resources

Stars

Watchers

Forks

Packages

No packages published