
🚀 Inference-in-a-Box: Enterprise AI/ML Platform Demo

📋 Quick Navigation: 🎯 Goals & Vision • 🚀 Getting Started • 📖 Usage Guide • 🏗️ Architecture • 🤖 AI Assistant • 🎭 Demos

A complete, production-ready inference platform that demonstrates enterprise-grade AI/ML model serving using modern cloud-native technologies. This platform combines Envoy AI Gateway, Istio service mesh, KServe serverless model serving, and comprehensive observability to create a robust, scalable, and secure inference-as-a-service solution.

🎯 What You're Building

Inference-in-a-Box is a comprehensive demonstration of how modern organizations can deploy AI/ML models at enterprise scale with:

  • 🔒 Zero-Trust Security - Automatic mTLS encryption, fine-grained authorization, and compliance-ready audit logging
  • ⚡ Serverless Inference - Auto-scaling from zero to N instances based on traffic demand
  • 🌐 Multi-Tenant Architecture - Secure isolation between different teams, projects, and customers
  • 📊 Enterprise Observability - Full-stack monitoring, distributed tracing, and AI-specific metrics
  • 🚪 Unified AI Gateway - Envoy AI Gateway as the primary entry point with JWT authentication and intelligent routing
  • 🎛️ Traffic Management - Canary deployments, A/B testing, and intelligent routing

πŸ—οΈ Platform Architecture

graph TB
    subgraph "Inference-in-a-Box Cluster"
        subgraph "Tier-1 Gateway Layer (Primary Entry Point)"
            EG[Envoy Gateway]
            EAG[Envoy AI Gateway]
            AUTH[JWT Authentication]
            RL[Rate Limiting]
        end
        
        subgraph "Tier-2 Service Mesh Layer"
            IC[Istiod]
            IG[Istio Gateway]
            MTLS[mTLS Encryption]
        end
        
        subgraph "Multi-Tenant Model Serving"
            subgraph "Tenant A"
                KS1[sklearn-iris]
                IS1[Istio Sidecar]
            end
            subgraph "Tenant B"
                KS2[Reserved]
                IS2[Istio Sidecar]
            end
            subgraph "Tenant C"
                KS3[pytorch-resnet]
                IS3[Istio Sidecar]
            end
        end
        
        subgraph "Serverless Infrastructure"
            KC[KServe Controller]
            KN[Knative Serving]
            CM[Cert Manager]
        end
        
        subgraph "Observability Stack"
            P[Prometheus]
            G[Grafana]
            K[Kiali]
            AM[AlertManager]
        end
    end
    
    subgraph "External"
        CLIENT[AI Client Apps]
        MODELS[Model Registry]
    end
    
    %% Primary Traffic Flow (Tier-1 → Tier-2)
    CLIENT -->|HTTP/REST| EAG
    EAG -->|JWT Validation| AUTH
    AUTH -->|Authenticated| RL
    RL -->|Rate Limited| IG
    IG -->|mTLS Routing| KS1
    IG -->|mTLS Routing| KS3
    
    %% Gateway Integration
    EG -->|Controls| EAG
    IC -->|Manages| IG
    IC -->|Enables| MTLS
    
    %% Model Serving Infrastructure
    KC -->|Manages| KN
    KN -->|Serves| KS1
    KN -->|Serves| KS2
    KN -->|Serves| KS3
    
    %% External Model Sources
    MODELS -->|Deploys| KS1
    MODELS -->|Deploys| KS2
    MODELS -->|Deploys| KS3
    
    %% Observability Flow
    KS1 -->|Metrics| P
    KS2 -->|Metrics| P
    KS3 -->|Metrics| P
    IC -->|Mesh Metrics| P
    P -->|Data| G
    P -->|Data| K
    
    %% Styling
    classDef tier1 fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef tier2 fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef models fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef observability fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    
    class EG,EAG,AUTH,RL tier1
    class IC,IG,MTLS tier2
    class KS1,KS2,KS3,KC,KN,CM models
    class P,G,K,AM observability

πŸ› οΈ Technology Stack

Core Platform Components

  • 🐳 Kind - Local Kubernetes cluster for development and testing
  • 🚪 Envoy Gateway - Cloud-native API gateway with advanced routing capabilities
  • 🤖 Envoy AI Gateway - AI-specific gateway with JWT authentication, model routing, and OpenAI API compatibility
  • 🕸️ Istio - Service mesh for security, traffic management, and observability
  • 🦾 KServe - Kubernetes-native model serving with auto-scaling
  • 🌊 Knative - Serverless framework for event-driven applications
  • 🔐 Cert Manager - Automated certificate management

Observability & Monitoring

  • 📈 Prometheus - Metrics collection and alerting
  • 📊 Grafana - Visualization and dashboards
  • 🔍 Jaeger - Distributed tracing
  • 🗺️ Kiali - Service mesh visualization
  • 🚨 AlertManager - Alert routing and management

AI/ML Support

  • 🧠 TensorFlow Serving - TensorFlow model serving
  • 🔥 PyTorch Serve - PyTorch model serving
  • ⚡ Scikit-learn - Traditional ML model serving
  • 🤗 Hugging Face - Transformer model support
  • 🌐 OpenAI API - Compatible endpoints for LLM serving (vLLM, TGI, etc.)

🎯 Key Features Demonstrated

🔒 Enterprise Security

sequenceDiagram
    participant Client
    participant Gateway as Envoy AI Gateway
    participant Auth as Authentication
    participant Istio as Istio Proxy
    participant Model as KServe Model
    
    Client->>Gateway: Inference Request
    Gateway->>Auth: Validate JWT/API Key
    Auth-->>Gateway: Authentication Result
    Gateway->>Istio: Forward Request (mTLS)
    Istio->>Model: Secure Request
    Model-->>Istio: Inference Response
    Istio-->>Gateway: Secure Response
    Gateway-->>Client: Final Response
    
    Note over Gateway,Model: All communication encrypted with mTLS
    Note over Auth: RBAC policies enforced
    Note over Istio: Zero-trust networking
  • Zero-trust networking with automatic mTLS between all services
  • Multi-tenant isolation with namespace-based security boundaries
  • RBAC and authentication with JWT/API key validation
  • Audit logging for compliance requirements (GDPR, HIPAA, SOC 2)
  • Certificate management with automatic rotation
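
For illustration, the mesh-wide strict mTLS described above is the kind of policy an Istio PeerAuthentication resource enforces. A minimal sketch, assuming the policy lives in the istio-system root namespace so it applies to every workload in the mesh:

# Hedged example: enforce STRICT mTLS mesh-wide (names are illustrative)
cat <<EOF | kubectl apply -f -
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace => policy applies mesh-wide
spec:
  mtls:
    mode: STRICT             # reject any plaintext service-to-service traffic
EOF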

⚡ AI/ML Model Serving

graph LR
    subgraph "Model Lifecycle"
        MR[Model Registry] --> KS[KServe Controller]
        KS --> KN[Knative Serving]
        KN --> POD[Model Pods]
    end
    
    subgraph "Auto-scaling"
        POD --> AS[Auto-scaler]
        AS --> |Scale Up| POD
        AS --> |Scale to Zero| ZERO[No Pods]
        ZERO --> |Cold Start| POD
    end
    
    subgraph "Traffic Management"
        CANARY[Canary Deploy]
        AB[A/B Testing]
        BLUE[Blue/Green]
    end
    
    POD --> CANARY
    POD --> AB
    POD --> BLUE
  • Serverless auto-scaling from zero to N instances based on demand
  • Multi-framework support (Scikit-learn, PyTorch, TensorFlow, Hugging Face)
  • OpenAI API compatibility with automatic protocol translation for LLMs
  • AI Gateway routing with model-aware header-based routing (x-ai-eg-model)
  • Canary deployments for gradual model rollouts
  • A/B testing with intelligent traffic splitting
  • Model versioning and rollback capabilities
  • Resource optimization with GPU/CPU scheduling
  • Protocol translation between OpenAI and KServe formats
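
The canary bullets above map onto KServe's built-in canary support. A hedged sketch of a gradual rollout, assuming the serving.kserve.io/v1beta1 API and a placeholder storage URI:

# Sketch: route 10% of traffic to a new model revision (placeholder values)
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: tenant-a
spec:
  predictor:
    canaryTrafficPercent: 10          # 10% to the latest revision, 90% to the previous one
    minReplicas: 0                    # allow scale-to-zero
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://my-bucket/model-v2   # placeholder model location
EOF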

🌐 Multi-Tenancy & Governance

  • Workspace isolation with dedicated namespaces per tenant
  • Resource quotas and governance policies
  • Separate observability scopes for each tenant
  • Independent lifecycle management and deployment schedules
  • Cost tracking and chargeback mechanisms
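
The quota bullet above is plain Kubernetes; a minimal sketch of a per-tenant ResourceQuota (all values are placeholders):

# Illustrative per-tenant quota; adjust limits to your cluster
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    count/inferenceservices.serving.kserve.io: "10"   # cap InferenceServices per tenant
EOF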

📊 Comprehensive Observability

graph TB
    subgraph "Metrics Pipeline"
        ISTIO[Istio Metrics] --> PROM[Prometheus]
        KSERVE[KServe Metrics] --> PROM
        CUSTOM[Custom AI Metrics] --> PROM
        PROM --> GRAF[Grafana Dashboards]
    end
    
    subgraph "Tracing Pipeline"
        REQ[Request Traces] --> JAEGER[Jaeger]
        SPANS[Service Spans] --> JAEGER
        JAEGER --> ANALYSIS[Trace Analysis]
    end
    
    subgraph "Logging Pipeline"
        LOGS[Application Logs] --> LOKI[Loki]
        AUDIT[Audit Logs] --> LOKI
        LOKI --> GRAF
    end
    
    subgraph "Alerting"
        PROM --> ALERT[AlertManager]
        ALERT --> SLACK[Slack/Email]
    end
  • End-to-end distributed tracing across the entire inference pipeline
  • AI-specific metrics including inference latency, throughput, and accuracy
  • Business metrics for cost optimization and resource planning
  • SLA monitoring with automated alerting
  • Unified dashboards for operational visibility

🖥️ Management Service

The Management Service is a comprehensive web-based platform for managing AI/ML model inference operations. It provides both a REST API and an intuitive React-based web interface for complete model lifecycle management.

🎯 Key Features

Model Publishing & Management

  • One-click model publishing with configurable external access
  • Public hostname configuration (default: api.router.inference-in-a-box)
  • Update published models - modify rate limits, paths, and hostnames
  • Multi-tenant model isolation with namespace-based security
  • Automatic model type detection (Traditional vs OpenAI-compatible)
  • OpenAI API compatibility for LLM models (vLLM, TGI, etc.)

Rate Limiting & Traffic Control

  • Per-model rate limiting with configurable requests per minute/hour
  • Token-based rate limiting for OpenAI-compatible models
  • Burst limit configuration for handling traffic spikes
  • Dynamic rate limit updates without republishing models

External Access & Routing

  • Configurable public hostnames for external model access
  • Custom path routing for model endpoints
  • Automatic gateway configuration (Envoy AI Gateway + Istio)
  • SSL/TLS termination with automatic certificate management

Security & Authentication

  • JWT-based authentication with tenant isolation
  • API key management for external access
  • API key rotation with zero-downtime updates
  • Admin and tenant-level permissions

Model Testing & Validation

  • Interactive inference testing directly from the UI
  • Support for both traditional and OpenAI-style testing
  • Real-time response visualization
  • Custom DNS resolution for cluster-internal testing
  • Automatic JWT token generation for test requests
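
The custom DNS resolution bullet above can be reproduced from the command line with curl's --resolve flag, which pins the public hostname to a local address. A sketch, assuming the gateway is port-forwarded to localhost:8080 and my-model is published under /models/my-model:

# Test a published model without cluster DNS (hostname and path are placeholders)
curl --resolve api.router.inference-in-a-box:8080:127.0.0.1 \
     -H "Authorization: Bearer $JWT_TOKEN" \
     http://api.router.inference-in-a-box:8080/models/my-model/predict \
     -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'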

🔧 Technical Architecture

graph TB
    subgraph "Management Service Stack"
        UI[React Frontend] --> API[Go Backend]
        API --> K8S[Kubernetes API]
        API --> GATEWAY[Gateway Configuration]
        API --> STORAGE[Model Metadata]
    end
    
    subgraph "Model Publishing Flow"
        PUBLISH[Publish Model] --> VALIDATE[Validate Config]
        VALIDATE --> GATEWAY_CONFIG[Create Gateway Routes]
        GATEWAY_CONFIG --> RATE_LIMIT[Setup Rate Limiting]
        RATE_LIMIT --> API_KEY[Generate API Key]
        API_KEY --> DOCS[Generate Documentation]
    end
    
    subgraph "External Access"
        CLIENT[External Client] --> ENVOY[Envoy AI Gateway]
        ENVOY --> ISTIO[Istio Gateway]
        ISTIO --> KSERVE[KServe Model]
    end

📋 API Endpoints

Model Management

  • GET /api/models - List all models
  • POST /api/models - Create new model
  • GET /api/models/{name} - Get model details
  • PUT /api/models/{name} - Update model configuration
  • DELETE /api/models/{name} - Delete model

Model Publishing

  • POST /api/models/{name}/publish - Publish model for external access
  • PUT /api/models/{name}/publish - Update published model configuration
  • GET /api/models/{name}/publish - Get published model details
  • DELETE /api/models/{name}/publish - Unpublish model
  • GET /api/published-models - List all published models

API Key Management

  • POST /api/models/{name}/publish/rotate-key - Rotate API key
  • POST /api/validate-api-key - Validate API key (for gateway)
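
A hedged example of the rotation endpoint above; the .apiKey field in the response is an assumption, so inspect the raw JSON if your build returns a different shape:

# Rotate the key for a published model and capture the new value
export API_KEY=$(curl -s -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  http://localhost:8085/api/models/my-model/publish/rotate-key | jq -r '.apiKey')  # .apiKey is assumed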

Admin Operations

  • GET /api/admin/system - System information
  • GET /api/admin/tenants - Tenant management
  • POST /api/admin/kubectl - Execute kubectl commands

🌐 Web Interface Access

# Access the Management Service UI
kubectl port-forward svc/management-service 8085:80

# Open in browser
open http://localhost:8085

🔗 Publishing Workflow Example

Admin Authentication & Setup

# Admin login and get JWT token
export ADMIN_TOKEN=$(curl -s -X POST -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "password"}' \
  http://localhost:8085/api/admin/login | jq -r '.token')

# Verify login
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  http://localhost:8085/api/admin/system

Model Publishing Workflow

# 1. Create a model
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model", "framework": "sklearn", "storageUri": "s3://my-bucket/model"}' \
  http://localhost:8085/api/models

# 2. Publish model with custom hostname
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "tenantId": "tenant-a",
      "publicHostname": "api.router.inference-in-a-box",
      "externalPath": "/models/my-model",
      "rateLimiting": {
        "requestsPerMinute": 100,
        "requestsPerHour": 5000
      }
    }
  }' \
  http://localhost:8085/api/models/my-model/publish

# 3. Update published model configuration
curl -X PUT -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "tenantId": "tenant-a",
      "publicHostname": "api.router.inference-in-a-box",
      "rateLimiting": {
        "requestsPerMinute": 200,
        "requestsPerHour": 10000
      }
    }
  }' \
  http://localhost:8085/api/models/my-model/publish

# 4. Access published model externally
curl -H "X-API-Key: $API_KEY" \
  https://api.router.inference-in-a-box/models/my-model/predict \
  -d '{"input": "sample data"}'

Complete Admin API Demo

For a comprehensive example of all admin API operations, use the provided script:

# Run the complete admin API demo
./scripts/admin-api-example.sh

This script demonstrates:

  • Admin authentication
  • System information retrieval
  • Model and tenant management
  • Model publishing workflow
  • External API testing
  • kubectl command execution

🚀 Quick Start

Prerequisites

Ensure you have the following tools installed:

# Required tools
docker --version          # Docker 20.10+
kind --version           # Kind 0.20+
kubectl version --client  # kubectl 1.24+
helm version             # Helm 3.12+
curl --version           # curl (any recent version)
jq --version             # jq 1.6+

# Optional but recommended
istioctl version         # Istio CLI (auto-installed by bootstrap)

System Requirements

  • Memory: Minimum 8GB RAM (16GB recommended for full observability stack)
  • CPU: 4+ cores recommended
  • Disk: 20GB+ free space for container images
  • OS: macOS, Linux, or Windows with WSL2

One-Command Bootstrap

# Clone the repository
git clone https://github.com/smarunich/inference-in-a-box.git
cd inference-in-a-box

# Bootstrap the entire platform (takes 10-15 minutes)
./scripts/bootstrap.sh

# Run demo scenarios
./scripts/demo.sh

# Access the platform (run these in separate terminals)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
kubectl port-forward -n envoy-gateway-system svc/envoy-ai-gateway 8080:80 &
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
kubectl port-forward -n monitoring svc/kiali 20001:20001 &
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
kubectl port-forward -n default svc/management-service 8085:80 &

echo "πŸŽ‰ Platform is ready!"
echo "πŸ€– AI Gateway (Primary Entry): http://localhost:8080"
echo "πŸ“Š Grafana: http://localhost:3000 (admin/prom-operator)"
echo "πŸ“ˆ Prometheus: http://localhost:9090"
echo "πŸ—ΊοΈ Kiali: http://localhost:20001"
echo "πŸ” Jaeger: http://localhost:16686"
echo "πŸ”§ Management UI: http://localhost:8085"
echo ""
echo "πŸ’‘ All AI/ML requests go through the AI Gateway first!"
echo "   The AI Gateway handles JWT auth and routes to Istio Gateway"

Step-by-Step Setup

# 1. Create Kind cluster
./scripts/clusters/create-kind-cluster.sh

# 2. Install core infrastructure
./scripts/install/install-envoy-gateway.sh
./scripts/install/install-istio.sh
./scripts/install/install-kserve.sh
./scripts/install/install-observability.sh

# 3. Deploy sample models
./scripts/models/deploy-samples.sh

# 4. Configure security and policies
./scripts/security/setup-policies.sh

# 5. Run tests
./scripts/test/run-tests.sh

Features Demonstrated

🔒 Enterprise Security

  • Zero-trust networking with automatic mTLS
  • Multi-tenant isolation with workspace boundaries
  • RBAC and authentication policies
  • Certificate management and rotation

🎯 AI/ML Model Serving

  • Multiple ML frameworks (TensorFlow, PyTorch, Scikit-learn)
  • Auto-scaling from zero to N instances
  • Canary deployments and A/B testing
  • Model versioning and rollback

🌐 Traffic Management

  • Intelligent routing and load balancing
  • Circuit breaking and failover
  • Rate limiting and throttling
  • Geographic routing simulation

📊 Observability

  • Distributed tracing across the inference pipeline
  • Custom metrics for AI workloads
  • Unified logging and monitoring
  • SLA tracking and alerting

🏢 Multi-Tenancy

  • Namespace-based tenant isolation
  • Resource quotas and governance
  • Separate observability scopes
  • Independent lifecycle management

Directory Structure

inference-in-a-box/
├── README.md
├── scripts/
│   ├── bootstrap.sh
│   ├── cleanup.sh
│   ├── demo.sh
│   └── clusters/
│       ├── create-kind-cluster.sh
│       └── setup-networking.sh
├── configs/
│   ├── clusters/
│   │   └── cluster.yaml
│   ├── envoy-gateway/
│   │   ├── gatewayclass.yaml
│   │   ├── ai-gateway.yaml
│   │   ├── httproute.yaml
│   │   ├── ai-backends.yaml
│   │   ├── security-policies.yaml
│   │   └── rate-limiting.yaml
│   ├── istio/
│   │   ├── installation.yaml
│   │   ├── gateway.yaml
│   │   └── virtual-services/
│   ├── kserve/
│   │   ├── installation.yaml
│   │   └── models/
│   ├── envoy-ai-gateway/
│   │   └── configuration.yaml
│   └── observability/
│       ├── prometheus.yaml
│       └── grafana/
├── models/
│   ├── sklearn-iris/
│   ├── tensorflow-mnist/
│   └── pytorch-resnet/
├── examples/
│   ├── inference-requests/
│   ├── security-policies/
│   └── traffic-scenarios/
└── docs/
    ├── architecture.md
    ├── deployment-guide.md
    └── troubleshooting.md


🎭 Demo Scenarios

1. 🔒 Security & Authentication Demo

sequenceDiagram
    participant User
    participant Gateway
    participant Auth
    participant Model
    
    User->>Gateway: Request with JWT
    Gateway->>Auth: Validate Token
    Auth-->>Gateway: Authorized
    Gateway->>Model: Forward Request (mTLS)
    Model-->>Gateway: Inference Result
    Gateway-->>User: Secure Response

2. ⚡ Auto-scaling Demo

# The demo script automatically generates load through the AI Gateway
./scripts/demo.sh
# Select option 2 for auto-scaling demo

# Watch pods scale from 0 to N
watch "kubectl get pods -n tenant-a -l serving.kserve.io/inferenceservice=sklearn-iris"

3. 🚦 Canary Deployment Demo

# The demo script creates a canary deployment for sklearn-iris
./scripts/demo.sh
# Select option 3 for canary deployment demo

# Monitor traffic split
kubectl get virtualservice -n tenant-a

4. 🌐 Multi-Tenant Isolation Demo

# The demo script shows tenant isolation and resource boundaries
./scripts/demo.sh
# Select option 4 for multi-tenant isolation demo

# Verify isolation
kubectl get networkpolicies -A

📊 Monitoring & Observability

Real-time Dashboards

graph LR
    subgraph "Grafana Dashboards"
        OVERVIEW["πŸ“Š Platform Overview"]
        MODELS["πŸ€– Model Performance"]
        SECURITY["πŸ”’ Security Metrics"]
        BUSINESS["πŸ’° Business KPIs"]
    end
    
    subgraph "Data Sources"
        PROM["πŸ“ˆ Prometheus"]
        JAEGER["πŸ” Jaeger"]
        ISTIO["πŸ•ΈοΈ Istio Metrics"]
        KSERVE["πŸ€– KServe Metrics"]
    end
    
    PROM --> OVERVIEW
    PROM --> MODELS
    ISTIO --> SECURITY
    KSERVE --> BUSINESS
    JAEGER --> MODELS

Key Metrics Tracked

  • 🎯 Model Performance: Inference latency, throughput, accuracy
  • ⚡ Infrastructure: CPU/Memory usage, auto-scaling events
  • 🔒 Security: Authentication failures, policy violations
  • 💰 Business: Cost per inference, tenant usage, SLA compliance
  • 🌐 Network: Request rates, error rates, circuit breaker events
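
These metrics can be queried ad hoc against the port-forwarded Prometheus (localhost:9090). The metric names below follow the KServe alert rules shown next and standard Istio mesh metrics; treat them as assumptions if your component versions label things differently:

# p95 inference latency over the last 5 minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(kserve_request_duration_seconds_bucket[5m]))' \
  | jq '.data.result'

# request rate per tenant namespace from Istio mesh metrics
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(istio_requests_total[5m])) by (destination_workload_namespace)' \
  | jq '.data.result'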

Alert Configuration

# Example alert rules
groups:
- name: inference.rules
  rules:
  - alert: HighInferenceLatency
    expr: histogram_quantile(0.95, rate(kserve_request_duration_seconds_bucket[5m])) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High inference latency detected"
      
  - alert: ModelDown
    expr: up{job="kserve-model"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Model service is down"

🌐 Traffic Flow Architecture

Tier-1/Tier-2 Gateway Design

The platform implements a two-tier gateway architecture where external traffic first hits the Envoy AI Gateway (Tier-1) and then flows to the Istio Gateway (Tier-2) for service mesh routing:

sequenceDiagram
    participant Client as 🖥️ Client Apps
    participant EAG as 🤖 AI Gateway (Tier-1)
    participant Auth as 🔐 JWT Auth
    participant IG as 🕸️ Istio Gateway (Tier-2)
    participant Model as 🎯 Model Service
    
    Client->>EAG: HTTP/REST Request
    EAG->>Auth: Validate JWT Token
    Auth-->>EAG: Token Valid (tenant-x)
    EAG->>EAG: Apply Rate Limits
    EAG->>EAG: Extract Model Name
    EAG->>IG: Route to Service Mesh
    IG->>Model: mTLS Encrypted Request
    Model-->>IG: Inference Response
    IG-->>EAG: Response via Service Mesh
    EAG-->>Client: Final Response

Primary Access Patterns

  1. 🎯 AI Model Inference: Client → AI Gateway → JWT Auth → Rate Limiting → Istio Gateway → Model Service
  2. 📊 Observability: Client → AI Gateway → Istio Gateway → Monitoring Services
  3. 🔧 Management: Client → AI Gateway → Istio Gateway → Admin Services

Gateway Responsibilities

🚀 Tier-1: Envoy AI Gateway (Primary Entry Point)

  • Authentication: JWT token validation with JWKS
  • Authorization: Tenant-based access control
  • Rate Limiting: Per-tenant and global limits
  • AI Protocol: OpenAI-compatible API transformation
  • Routing: Model-aware intelligent routing

πŸ•ΈοΈ Tier-2: Istio Gateway (Service Mesh)

  • mTLS: Service-to-service encryption
  • Load Balancing: Traffic distribution
  • Circuit Breaking: Fault tolerance
  • Observability: Metrics and tracing
  • Service Discovery: Dynamic routing
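
The circuit-breaking responsibility above is typically expressed as an Istio DestinationRule. A minimal sketch with placeholder host and thresholds:

# Hedged example: connection pooling + outlier ejection for one model service
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: sklearn-iris-circuit-breaker
  namespace: tenant-a
spec:
  host: sklearn-iris-predictor.tenant-a.svc.cluster.local   # placeholder service host
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # shed load beyond this queue depth
    outlierDetection:
      consecutive5xxErrors: 5          # eject a backend after 5 straight 5xx responses
      interval: 30s
      baseEjectionTime: 60s
EOF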

🚪 AI Gateway Features

JWT Authentication & Authorization

  • Tenant-specific JWT validation with dedicated JWKS endpoints
  • Automatic claim extraction to request headers for downstream services
  • Multi-provider support for different authentication sources
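
In Envoy Gateway terms, the JWT behavior above usually lives in a SecurityPolicy. A sketch under assumed names: the route, JWKS URL, and claim mapping are placeholders, not the repository's actual configuration (see configs/envoy-gateway/security-policies.yaml for that):

# Hedged sketch: per-tenant JWT validation with claim-to-header extraction
cat <<EOF | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: tenant-a-jwt
  namespace: envoy-gateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: tenant-a-route                # placeholder route name
  jwt:
    providers:
    - name: tenant-a-issuer
      remoteJWKS:
        uri: http://jwt-server.default.svc.cluster.local/.well-known/jwks.json   # placeholder JWKS endpoint
      claimToHeaders:
      - claim: tenant                   # copy the tenant claim...
        header: x-tenant                # ...into a header downstream routing can match
EOF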

Intelligent Routing

  • Model-aware routing based on x-ai-eg-model header
  • Header-based tenant routing for multi-tenant isolation
  • Fallback routing to Istio Gateway for non-AI traffic
  • EnvoyExtensionPolicy for external AI processing

Rate Limiting & Traffic Management

  • Per-tenant rate limiting with configurable limits
  • Global rate limiting for platform protection
  • Circuit breaker patterns for resilience
  • Retry policies with exponential backoff
  • Token-based limiting for LLM models
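
Per-tenant limits like these map onto an Envoy Gateway BackendTrafficPolicy. A sketch with placeholder route name and limits:

# Hedged example: 100 requests/minute for tenant-a, matched on the x-tenant header
cat <<EOF | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: tenant-a-ratelimit
  namespace: envoy-gateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: tenant-a-route                # placeholder route name
  rateLimit:
    type: Global
    global:
      rules:
      - clientSelectors:
        - headers:
          - name: x-tenant
            value: tenant-a
        limit:
          requests: 100
          unit: Minute
EOF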

Security & Compliance

  • CORS support for web applications
  • TLS termination at the edge
  • Security headers injection
  • Audit logging for compliance requirements

OpenAI API Compatibility

  • Automatic protocol translation from OpenAI to KServe format
  • Support for chat completions (/v1/chat/completions)
  • Support for completions (/v1/completions)
  • Support for embeddings (/v1/embeddings)
  • Model-specific routing with x-ai-eg-model header
  • Compatible with popular LLM servers (vLLM, TGI, Ollama, etc.)

Example API Usage

# All requests go through the AI Gateway first (Tier-1 Entry Point)
export AI_GATEWAY_URL="http://localhost:8080"
export JWT_TOKEN="<your-jwt-token>"

# Traditional model request to sklearn model (tenant-a)
curl -H "Authorization: Bearer $JWT_TOKEN" \
     -H "x-tenant: tenant-a" \
     -H "x-ai-eg-model: sklearn-iris" \
     $AI_GATEWAY_URL/v1/models/sklearn-iris:predict \
     -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'

# OpenAI-compatible chat completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
     -H "x-tenant: tenant-a" \
     -H "x-ai-eg-model: llama-3-8b" \
     $AI_GATEWAY_URL/v1/chat/completions \
     -d '{
       "model": "llama-3-8b",
       "messages": [
         {"role": "user", "content": "Hello, how are you?"}
       ],
       "temperature": 0.7
     }'

# OpenAI-compatible completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
     -H "x-tenant: tenant-a" \
     -H "x-ai-eg-model: gpt-j-6b" \
     $AI_GATEWAY_URL/v1/completions \
     -d '{
       "model": "gpt-j-6b",
       "prompt": "The quick brown fox",
       "max_tokens": 50
     }'

# The AI Gateway handles:
# 1. JWT validation and tenant authorization
# 2. Rate limiting and traffic management  
# 3. Model routing based on headers
# 4. OpenAI protocol transformation
# 5. Forwarding to Istio Gateway (Tier-2)

🚀 Getting Started

Quick Start Guide

  1. Prerequisites: Ensure Docker, Kind, kubectl, and Helm are installed
  2. Bootstrap: Run ./scripts/bootstrap.sh (takes 10-15 minutes)
  3. Access Services: Use the port-forward commands above
  4. Run Demos: Execute ./scripts/demo.sh for interactive scenarios
  5. Get JWT Tokens: Run ./scripts/get-jwt-tokens.sh for authentication

Development Workflow

  • Management Service: See management/README.md for Go backend + React frontend development
  • Configuration: Kubernetes configs in configs/ directory
  • Automation: Deployment scripts in scripts/ directory

📚 Documentation & Learning Resources

🎯 Essential Reading

  • GOALS.md - 🎯 Project vision, goals, and strategic impact
  • Getting Started Guide - 🚀 Step-by-step installation and bootstrap
  • Usage Guide - 📖 API usage and service access patterns

πŸ—οΈ Architecture & Design

  • Architecture Guide - 🏗️ Technical system design and patterns
  • CLAUDE.md - 🤖 AI assistant deployment guidance and commands

🔧 Operations & Management

  • Management Service Guide - 🖥️ Management API, publishing workflow, and UI reference
  • Model Publishing Guide - 📦 Publishing models for external access

🎭 Demonstrations & Examples

  • Demo Guide - 🎭 Interactive demo scenarios and walkthroughs

📖 Learning Path Recommendations

For Platform Engineers

  1. Start with GOALS.md to understand the vision and target state
  2. Follow Getting Started Guide for hands-on deployment
  3. Deep dive into Architecture Guide for technical patterns
  4. Use CLAUDE.md for AI-assisted operations

For AI/ML Engineers

  1. Read GOALS.md to understand AI/ML capabilities
  2. Quick start with Getting Started Guide
  3. Explore Model Publishing Guide for model deployment
  4. Reference Management Service Guide for API usage

For DevOps Teams

  1. Start with GOALS.md for operational understanding
  2. Follow Getting Started Guide for deployment
  3. Study Usage Guide for service management patterns
  4. Use Demo Guide for scenario testing

For Students & Educators

  1. Begin with GOALS.md for learning objectives
  2. Work through Getting Started Guide hands-on
  3. Explore Demo Guide for practical scenarios
  4. Reference Architecture Guide for deep understanding

🔧 Troubleshooting

Common Issues

  • Gateway not ready: Check kubectl get gateway -n envoy-gateway-system
  • JWT validation fails: Verify JWKS endpoint is accessible with kubectl get pods -n default -l app=jwt-server
  • Rate limiting: Check rate limit policies and quotas
  • Model not accessible: Verify model is ready with kubectl get inferenceservice --all-namespaces
  • Port conflicts: Ensure ports 3000, 8080, 8085, 9090, 16686, 20001 are available

Quick Verification

🔧 Detailed Troubleshooting: For comprehensive troubleshooting steps, see the Usage Guide

# Check overall cluster health
kubectl get pods --all-namespaces | grep -v Running

# Verify AI Gateway is ready
kubectl get pods -n envoy-gateway-system

# Check sample models are deployed
kubectl get inferenceservice --all-namespaces

Cleanup

# Complete cleanup
./scripts/cleanup.sh

# Or manual cleanup
kind delete cluster --name inference-in-a-box

πŸ“ Version Information

🔧 Source of Truth: All infrastructure component versions are defined in scripts/bootstrap.sh

Infrastructure Components

  • Istio: v1.26.2
  • KServe: v0.15.2
  • Knative: v1.18.1
  • Envoy Gateway: v1.4.2
  • Envoy AI Gateway: v0.2.1 (with EnvoyExtensionPolicy)
  • Cert Manager: v1.18.1
  • Prometheus Stack: v75.6.0
  • Grafana: v12.0.2
  • Jaeger: v3.4.1
  • Kiali: v2.11.0

Runtime Components

  • Go: v1.21 (management service backend)
  • Node.js: v18 (management service UI, JWT server)
  • React: v18.2.0 (management service frontend)
  • OpenAI API: Compatible with OpenAI SDK v1.x

🤝 Contributing

This is a demonstration project showcasing enterprise AI/ML deployment patterns. For questions or improvements, please refer to the documentation or create an issue.
