
🚀 Inference-in-a-Box: Enterprise AI/ML Platform Demo

📋 Quick Navigation: 🎯 Goals & Vision • 🚀 Getting Started • 📖 Usage Guide • 🏗️ Architecture • 🤖 AI Assistant • 🎭 Demos

A complete, production-ready inference platform that demonstrates enterprise-grade AI/ML model serving using modern cloud-native technologies. This platform combines Envoy AI Gateway, Istio service mesh, KServe serverless model serving, and comprehensive observability to create a robust, scalable, and secure inference-as-a-service solution.

🎯 What You're Building

Inference-in-a-Box is a comprehensive demonstration of how modern organizations can deploy AI/ML models at enterprise scale with:

  • 🔒 Zero-Trust Security - Automatic mTLS encryption, fine-grained authorization, and compliance-ready audit logging
  • ⚡ Serverless Inference - Auto-scaling from zero to N instances based on traffic demand
  • 🌐 Multi-Tenant Architecture - Secure isolation between different teams, projects, and customers
  • 📊 Enterprise Observability - Full-stack monitoring, distributed tracing, and AI-specific metrics
  • 🚪 Unified AI Gateway - Envoy AI Gateway as the primary entry point with JWT authentication and intelligent routing
  • 🎛️ Traffic Management - Canary deployments, A/B testing, and intelligent routing

πŸ—οΈ Platform Architecture

graph TB
    subgraph "Inference-in-a-Box Cluster"
        subgraph "Tier-1 Gateway Layer (Primary Entry Point)"
            EG[Envoy Gateway]
            EAG[Envoy AI Gateway]
            AUTH[JWT Authentication]
            RL[Rate Limiting]
        end
        
        subgraph "Tier-2 Service Mesh Layer"
            IC[Istiod]
            IG[Istio Gateway]
            MTLS[mTLS Encryption]
        end
        
        subgraph "Multi-Tenant Model Serving"
            subgraph "Tenant A"
                KS1[sklearn-iris]
                IS1[Istio Sidecar]
            end
            subgraph "Tenant B"
                KS2[Reserved]
                IS2[Istio Sidecar]
            end
            subgraph "Tenant C"
                KS3[pytorch-resnet]
                IS3[Istio Sidecar]
            end
        end
        
        subgraph "Serverless Infrastructure"
            KC[KServe Controller]
            KN[Knative Serving]
            CM[Cert Manager]
        end
        
        subgraph "Observability Stack"
            P[Prometheus]
            G[Grafana]
            K[Kiali]
            AM[AlertManager]
        end
    end
    
    subgraph "External"
        CLIENT[AI Client Apps]
        MODELS[Model Registry]
    end
    
    %% Primary Traffic Flow (Tier-1 → Tier-2)
    CLIENT -->|HTTP/REST| EAG
    EAG -->|JWT Validation| AUTH
    AUTH -->|Authenticated| RL
    RL -->|Rate Limited| IG
    IG -->|mTLS Routing| KS1
    IG -->|mTLS Routing| KS3
    
    %% Gateway Integration
    EG -->|Controls| EAG
    IC -->|Manages| IG
    IC -->|Enables| MTLS
    
    %% Model Serving Infrastructure
    KC -->|Manages| KN
    KN -->|Serves| KS1
    KN -->|Serves| KS2
    KN -->|Serves| KS3
    
    %% External Model Sources
    MODELS -->|Deploys| KS1
    MODELS -->|Deploys| KS2
    MODELS -->|Deploys| KS3
    
    %% Observability Flow
    KS1 -->|Metrics| P
    KS2 -->|Metrics| P
    KS3 -->|Metrics| P
    IC -->|Mesh Metrics| P
    P -->|Data| G
    P -->|Data| K
    
    %% Styling
    classDef tier1 fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef tier2 fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef models fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef observability fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    
    class EG,EAG,AUTH,RL tier1
    class IC,IG,MTLS tier2
    class KS1,KS2,KS3,KC,KN,CM models
    class P,G,K,AM observability

πŸ› οΈ Technology Stack

Core Platform Components

  • 🐳 Kind - Local Kubernetes cluster for development and testing
  • 🚪 Envoy Gateway - Cloud-native API gateway with advanced routing capabilities
  • 🤖 Envoy AI Gateway - AI-specific gateway with JWT authentication, model routing, and OpenAI API compatibility
  • 🕸️ Istio - Service mesh for security, traffic management, and observability
  • 🦾 KServe - Kubernetes-native model serving with auto-scaling
  • 🌊 Knative - Serverless framework for event-driven applications
  • 🔐 Cert Manager - Automated certificate management

Observability & Monitoring

  • 📈 Prometheus - Metrics collection and alerting
  • 📊 Grafana - Visualization and dashboards
  • 🔍 Jaeger - Distributed tracing
  • 🗺️ Kiali - Service mesh visualization
  • 🚨 AlertManager - Alert routing and management

AI/ML Support

  • 🧠 TensorFlow Serving - TensorFlow model serving
  • 🔥 PyTorch Serve - PyTorch model serving
  • ⚡ Scikit-learn - Traditional ML model serving
  • 🤗 Hugging Face - Transformer model support
  • 🌐 OpenAI API - Compatible endpoints for LLM serving (vLLM, TGI, etc.)

🎯 Key Features Demonstrated

🔒 Enterprise Security

sequenceDiagram
    participant Client
    participant Gateway as Envoy AI Gateway
    participant Auth as Authentication
    participant Istio as Istio Proxy
    participant Model as KServe Model
    
    Client->>Gateway: Inference Request
    Gateway->>Auth: Validate JWT/API Key
    Auth-->>Gateway: Authentication Result
    Gateway->>Istio: Forward Request (mTLS)
    Istio->>Model: Secure Request
    Model-->>Istio: Inference Response
    Istio-->>Gateway: Secure Response
    Gateway-->>Client: Final Response
    
    Note over Gateway,Model: All communication encrypted with mTLS
    Note over Auth: RBAC policies enforced
    Note over Istio: Zero-trust networking
  • Zero-trust networking with automatic mTLS between all services
  • Multi-tenant isolation with namespace-based security boundaries
  • RBAC and authentication with JWT/API key validation
  • Audit logging for compliance requirements (GDPR, HIPAA, SOC 2)
  • Certificate management with automatic rotation
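
For illustration, the mesh-wide strict mTLS described above is the kind of policy an Istio PeerAuthentication resource enforces. A minimal sketch, assuming the policy lives in the istio-system root namespace so it applies to every workload in the mesh:

# Hedged example: enforce STRICT mTLS mesh-wide (names are illustrative)
cat <<EOF | kubectl apply -f -
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace => policy applies mesh-wide
spec:
  mtls:
    mode: STRICT             # reject any plaintext service-to-service traffic
EOF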

⚡ AI/ML Model Serving

graph LR
    subgraph "Model Lifecycle"
        MR[Model Registry] --> KS[KServe Controller]
        KS --> KN[Knative Serving]
        KN --> POD[Model Pods]
    end
    
    subgraph "Auto-scaling"
        POD --> AS[Auto-scaler]
        AS --> |Scale Up| POD
        AS --> |Scale to Zero| ZERO[No Pods]
        ZERO --> |Cold Start| POD
    end
    
    subgraph "Traffic Management"
        CANARY[Canary Deploy]
        AB[A/B Testing]
        BLUE[Blue/Green]
    end
    
    POD --> CANARY
    POD --> AB
    POD --> BLUE
  • Serverless auto-scaling from zero to N instances based on demand
  • Multi-framework support (Scikit-learn, PyTorch, TensorFlow, Hugging Face)
  • OpenAI API compatibility with automatic protocol translation for LLMs
  • AI Gateway routing with model-aware header-based routing (x-ai-eg-model)
  • Canary deployments for gradual model rollouts
  • A/B testing with intelligent traffic splitting
  • Model versioning and rollback capabilities
  • Resource optimization with GPU/CPU scheduling
  • Protocol translation between OpenAI and KServe formats
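
The canary bullets above map onto KServe's built-in canary support. A hedged sketch of a gradual rollout, assuming the serving.kserve.io/v1beta1 API and a placeholder storage URI:

# Sketch: route 10% of traffic to a new model revision (placeholder values)
cat <<EOF | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: tenant-a
spec:
  predictor:
    canaryTrafficPercent: 10          # 10% to the latest revision, 90% to the previous one
    minReplicas: 0                    # allow scale-to-zero
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://my-bucket/model-v2   # placeholder model location
EOF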

🌐 Multi-Tenancy & Governance

  • Workspace isolation with dedicated namespaces per tenant
  • Resource quotas and governance policies
  • Separate observability scopes for each tenant
  • Independent lifecycle management and deployment schedules
  • Cost tracking and chargeback mechanisms
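
The quota bullet above is plain Kubernetes; a minimal sketch of a per-tenant ResourceQuota (all values are placeholders):

# Illustrative per-tenant quota; adjust limits to your cluster
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    count/inferenceservices.serving.kserve.io: "10"   # cap InferenceServices per tenant
EOF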

📊 Comprehensive Observability

graph TB
    subgraph "Metrics Pipeline"
        ISTIO[Istio Metrics] --> PROM[Prometheus]
        KSERVE[KServe Metrics] --> PROM
        CUSTOM[Custom AI Metrics] --> PROM
        PROM --> GRAF[Grafana Dashboards]
    end
    
    subgraph "Tracing Pipeline"
        REQ[Request Traces] --> JAEGER[Jaeger]
        SPANS[Service Spans] --> JAEGER
        JAEGER --> ANALYSIS[Trace Analysis]
    end
    
    subgraph "Logging Pipeline"
        LOGS[Application Logs] --> LOKI[Loki]
        AUDIT[Audit Logs] --> LOKI
        LOKI --> GRAF
    end
    
    subgraph "Alerting"
        PROM --> ALERT[AlertManager]
        ALERT --> SLACK[Slack/Email]
    end
  • End-to-end distributed tracing across the entire inference pipeline
  • AI-specific metrics including inference latency, throughput, and accuracy
  • Business metrics for cost optimization and resource planning
  • SLA monitoring with automated alerting
  • Unified dashboards for operational visibility

🖥️ Management Service

The Management Service is a comprehensive web-based platform for managing AI/ML model inference operations. It provides both a REST API and an intuitive React-based web interface for complete model lifecycle management.

🎯 Key Features

Model Publishing & Management

  • One-click model publishing with configurable external access
  • Public hostname configuration (default: api.router.inference-in-a-box)
  • Update published models - modify rate limits, paths, and hostnames
  • Multi-tenant model isolation with namespace-based security
  • Automatic model type detection (Traditional vs OpenAI-compatible)
  • OpenAI API compatibility for LLM models (vLLM, TGI, etc.)

Rate Limiting & Traffic Control

  • Per-model rate limiting with configurable requests per minute/hour
  • Token-based rate limiting for OpenAI-compatible models
  • Burst limit configuration for handling traffic spikes
  • Dynamic rate limit updates without republishing models

External Access & Routing

  • Configurable public hostnames for external model access
  • Custom path routing for model endpoints
  • Automatic gateway configuration (Envoy AI Gateway + Istio)
  • SSL/TLS termination with automatic certificate management

Security & Authentication

  • JWT-based authentication with tenant isolation
  • API key management for external access
  • API key rotation with zero-downtime updates
  • Admin and tenant-level permissions

Model Testing & Validation

  • Interactive inference testing directly from the UI
  • Support for both traditional and OpenAI-style testing
  • Real-time response visualization
  • Custom DNS resolution for cluster-internal testing
  • Automatic JWT token generation for test requests
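
The custom DNS resolution bullet above can be reproduced from the command line with curl's --resolve flag, which pins the public hostname to a local address. A sketch, assuming the gateway is port-forwarded to localhost:8080 and my-model is published under /models/my-model:

# Test a published model without cluster DNS (hostname and path are placeholders)
curl --resolve api.router.inference-in-a-box:8080:127.0.0.1 \
     -H "Authorization: Bearer $JWT_TOKEN" \
     http://api.router.inference-in-a-box:8080/models/my-model/predict \
     -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'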

🔧 Technical Architecture

graph TB
    subgraph "Management Service Stack"
        UI[React Frontend] --> API[Go Backend]
        API --> K8S[Kubernetes API]
        API --> GATEWAY[Gateway Configuration]
        API --> STORAGE[Model Metadata]
    end
    
    subgraph "Model Publishing Flow"
        PUBLISH[Publish Model] --> VALIDATE[Validate Config]
        VALIDATE --> GATEWAY_CONFIG[Create Gateway Routes]
        GATEWAY_CONFIG --> RATE_LIMIT[Setup Rate Limiting]
        RATE_LIMIT --> API_KEY[Generate API Key]
        API_KEY --> DOCS[Generate Documentation]
    end
    
    subgraph "External Access"
        CLIENT[External Client] --> ENVOY[Envoy AI Gateway]
        ENVOY --> ISTIO[Istio Gateway]
        ISTIO --> KSERVE[KServe Model]
    end

📋 API Endpoints

Model Management

  • GET /api/models - List all models
  • POST /api/models - Create new model
  • GET /api/models/{name} - Get model details
  • PUT /api/models/{name} - Update model configuration
  • DELETE /api/models/{name} - Delete model

Model Publishing

  • POST /api/models/{name}/publish - Publish model for external access
  • PUT /api/models/{name}/publish - Update published model configuration
  • GET /api/models/{name}/publish - Get published model details
  • DELETE /api/models/{name}/publish - Unpublish model
  • GET /api/published-models - List all published models

API Key Management

  • POST /api/models/{name}/publish/rotate-key - Rotate API key
  • POST /api/validate-api-key - Validate API key (for gateway)
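
A hedged example of the rotation endpoint above; the .apiKey field in the response is an assumption, so inspect the raw JSON if your build returns a different shape:

# Rotate the key for a published model and capture the new value
export API_KEY=$(curl -s -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  http://localhost:8085/api/models/my-model/publish/rotate-key | jq -r '.apiKey')  # .apiKey is assumed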

Admin Operations

  • GET /api/admin/system - System information
  • GET /api/admin/tenants - Tenant management
  • POST /api/admin/kubectl - Execute kubectl commands

🌐 Web Interface Access

# Access the Management Service UI
kubectl port-forward svc/management-service 8085:80

# Open in browser
open http://localhost:8085

🔗 Publishing Workflow Example

Admin Authentication & Setup

# Admin login and get JWT token
export ADMIN_TOKEN=$(curl -s -X POST -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "password"}' \
  http://localhost:8085/api/admin/login | jq -r '.token')

# Verify login
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
  http://localhost:8085/api/admin/system

Model Publishing Workflow

# 1. Create a model
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-model", "framework": "sklearn", "storageUri": "s3://my-bucket/model"}' \
  http://localhost:8085/api/models

# 2. Publish model with custom hostname
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "tenantId": "tenant-a",
      "publicHostname": "api.router.inference-in-a-box",
      "externalPath": "/models/my-model",
      "rateLimiting": {
        "requestsPerMinute": 100,
        "requestsPerHour": 5000
      }
    }
  }' \
  http://localhost:8085/api/models/my-model/publish

# 3. Update published model configuration
curl -X PUT -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "config": {
      "tenantId": "tenant-a",
      "publicHostname": "api.router.inference-in-a-box",
      "rateLimiting": {
        "requestsPerMinute": 200,
        "requestsPerHour": 10000
      }
    }
  }' \
  http://localhost:8085/api/models/my-model/publish

# 4. Access published model externally
curl -H "X-API-Key: $API_KEY" \
  https://api.router.inference-in-a-box/models/my-model/predict \
  -d '{"input": "sample data"}'

Complete Admin API Demo

For a comprehensive example of all admin API operations, use the provided script:

# Run the complete admin API demo
./scripts/admin-api-example.sh

This script demonstrates:

  • Admin authentication
  • System information retrieval
  • Model and tenant management
  • Model publishing workflow
  • External API testing
  • kubectl command execution

🚀 Quick Start

Prerequisites

Ensure you have the following tools installed:

# Required tools
docker --version          # Docker 20.10+
kind --version           # Kind 0.20+
kubectl version --client  # kubectl 1.24+
helm version             # Helm 3.12+
curl --version           # curl (any recent version)
jq --version             # jq 1.6+

# Optional but recommended
istioctl version         # Istio CLI (auto-installed by bootstrap)

System Requirements

  • Memory: Minimum 8GB RAM (16GB recommended for full observability stack)
  • CPU: 4+ cores recommended
  • Disk: 20GB+ free space for container images
  • OS: macOS, Linux, or Windows with WSL2

One-Command Bootstrap

# Clone the repository
git clone https://github.com/smarunich/inference-in-a-box.git
cd inference-in-a-box

# Bootstrap the entire platform (takes 10-15 minutes)
./scripts/bootstrap.sh

# Run demo scenarios
./scripts/demo.sh

# Access the platform (run these in separate terminals)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
kubectl port-forward -n envoy-gateway-system svc/envoy-ai-gateway 8080:80 &
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
kubectl port-forward -n monitoring svc/kiali 20001:20001 &
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
kubectl port-forward -n default svc/management-service 8085:80 &

echo "πŸŽ‰ Platform is ready!"
echo "πŸ€– AI Gateway (Primary Entry): http://localhost:8080"
echo "πŸ“Š Grafana: http://localhost:3000 (admin/prom-operator)"
echo "πŸ“ˆ Prometheus: http://localhost:9090"
echo "πŸ—ΊοΈ Kiali: http://localhost:20001"
echo "πŸ” Jaeger: http://localhost:16686"
echo "πŸ”§ Management UI: http://localhost:8085"
echo ""
echo "πŸ’‘ All AI/ML requests go through the AI Gateway first!"
echo "   The AI Gateway handles JWT auth and routes to Istio Gateway"

Step-by-Step Setup

# 1. Create Kind cluster
./scripts/clusters/create-kind-cluster.sh

# 2. Install core infrastructure
./scripts/install/install-envoy-gateway.sh
./scripts/install/install-istio.sh
./scripts/install/install-kserve.sh
./scripts/install/install-observability.sh

# 3. Deploy sample models
./scripts/models/deploy-samples.sh

# 4. Configure security and policies
./scripts/security/setup-policies.sh

# 5. Run tests
./scripts/test/run-tests.sh

Features Demonstrated

🔒 Enterprise Security

  • Zero-trust networking with automatic mTLS
  • Multi-tenant isolation with workspace boundaries
  • RBAC and authentication policies
  • Certificate management and rotation

🎯 AI/ML Model Serving

  • Multiple ML frameworks (TensorFlow, PyTorch, Scikit-learn)
  • Auto-scaling from zero to N instances
  • Canary deployments and A/B testing
  • Model versioning and rollback

🌐 Traffic Management

  • Intelligent routing and load balancing
  • Circuit breaking and failover
  • Rate limiting and throttling
  • Geographic routing simulation

📊 Observability

  • Distributed tracing across the inference pipeline
  • Custom metrics for AI workloads
  • Unified logging and monitoring
  • SLA tracking and alerting

🏢 Multi-Tenancy

  • Namespace-based tenant isolation
  • Resource quotas and governance
  • Separate observability scopes
  • Independent lifecycle management

Directory Structure

inference-in-a-box/
├── README.md
├── scripts/
│   ├── bootstrap.sh
│   ├── cleanup.sh
│   ├── demo.sh
│   └── clusters/
│       ├── create-kind-cluster.sh
│       └── setup-networking.sh
├── configs/
│   ├── clusters/
│   │   └── cluster.yaml
│   ├── envoy-gateway/
│   │   ├── gatewayclass.yaml
│   │   ├── ai-gateway.yaml
│   │   ├── httproute.yaml
│   │   ├── ai-backends.yaml
│   │   ├── security-policies.yaml
│   │   └── rate-limiting.yaml
│   ├── istio/
│   │   ├── installation.yaml
│   │   ├── gateway.yaml
│   │   └── virtual-services/
│   ├── kserve/
│   │   ├── installation.yaml
│   │   └── models/
│   ├── envoy-ai-gateway/
│   │   └── configuration.yaml
│   └── observability/
│       ├── prometheus.yaml
│       └── grafana/
├── models/
│   ├── sklearn-iris/
│   ├── tensorflow-mnist/
│   └── pytorch-resnet/
├── examples/
│   ├── inference-requests/
│   ├── security-policies/
│   └── traffic-scenarios/
└── docs/
    ├── architecture.md
    ├── deployment-guide.md
    └── troubleshooting.md


🎭 Demo Scenarios

1. 🔒 Security & Authentication Demo

sequenceDiagram
    participant User
    participant Gateway
    participant Auth
    participant Model
    
    User->>Gateway: Request with JWT
    Gateway->>Auth: Validate Token
    Auth-->>Gateway: Authorized
    Gateway->>Model: Forward Request (mTLS)
    Model-->>Gateway: Inference Result
    Gateway-->>User: Secure Response

2. ⚡ Auto-scaling Demo

# The demo script automatically generates load through the AI Gateway
./scripts/demo.sh
# Select option 2 for auto-scaling demo

# Watch pods scale from 0 to N
watch "kubectl get pods -n tenant-a -l serving.kserve.io/inferenceservice=sklearn-iris"

3. 🚦 Canary Deployment Demo

# The demo script creates a canary deployment for sklearn-iris
./scripts/demo.sh
# Select option 3 for canary deployment demo

# Monitor traffic split
kubectl get virtualservice -n tenant-a

4. 🌐 Multi-Tenant Isolation Demo

# The demo script shows tenant isolation and resource boundaries
./scripts/demo.sh
# Select option 4 for multi-tenant isolation demo

# Verify isolation
kubectl get networkpolicies -A

📊 Monitoring & Observability

Real-time Dashboards

graph LR
    subgraph "Grafana Dashboards"
        OVERVIEW["πŸ“Š Platform Overview"]
        MODELS["πŸ€– Model Performance"]
        SECURITY["πŸ”’ Security Metrics"]
        BUSINESS["πŸ’° Business KPIs"]
    end
    
    subgraph "Data Sources"
        PROM["πŸ“ˆ Prometheus"]
        JAEGER["πŸ” Jaeger"]
        ISTIO["πŸ•ΈοΈ Istio Metrics"]
        KSERVE["πŸ€– KServe Metrics"]
    end
    
    PROM --> OVERVIEW
    PROM --> MODELS
    ISTIO --> SECURITY
    KSERVE --> BUSINESS
    JAEGER --> MODELS

Key Metrics Tracked

  • 🎯 Model Performance: Inference latency, throughput, accuracy
  • ⚡ Infrastructure: CPU/Memory usage, auto-scaling events
  • 🔒 Security: Authentication failures, policy violations
  • 💰 Business: Cost per inference, tenant usage, SLA compliance
  • 🌐 Network: Request rates, error rates, circuit breaker events
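
These metrics can be queried ad hoc against the port-forwarded Prometheus (localhost:9090). The metric names below follow the KServe alert rules shown next and standard Istio mesh metrics; treat them as assumptions if your component versions label things differently:

# p95 inference latency over the last 5 minutes
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(kserve_request_duration_seconds_bucket[5m]))' \
  | jq '.data.result'

# request rate per tenant namespace from Istio mesh metrics
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(istio_requests_total[5m])) by (destination_workload_namespace)' \
  | jq '.data.result'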

Alert Configuration

# Example alert rules
groups:
- name: inference.rules
  rules:
  - alert: HighInferenceLatency
    expr: histogram_quantile(0.95, rate(kserve_request_duration_seconds_bucket[5m])) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High inference latency detected"
      
  - alert: ModelDown
    expr: up{job="kserve-model"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Model service is down"

🌐 Traffic Flow Architecture

Tier-1/Tier-2 Gateway Design

The platform implements a two-tier gateway architecture where external traffic first hits the Envoy AI Gateway (Tier-1) and then flows to the Istio Gateway (Tier-2) for service mesh routing:

sequenceDiagram
    participant Client as 🖥️ Client Apps
    participant EAG as 🤖 AI Gateway (Tier-1)
    participant Auth as 🔐 JWT Auth
    participant IG as 🕸️ Istio Gateway (Tier-2)
    participant Model as 🎯 Model Service
    
    Client->>EAG: HTTP/REST Request
    EAG->>Auth: Validate JWT Token
    Auth-->>EAG: Token Valid (tenant-x)
    EAG->>EAG: Apply Rate Limits
    EAG->>EAG: Extract Model Name
    EAG->>IG: Route to Service Mesh
    IG->>Model: mTLS Encrypted Request
    Model-->>IG: Inference Response
    IG-->>EAG: Response via Service Mesh
    EAG-->>Client: Final Response

Primary Access Patterns

  1. 🎯 AI Model Inference: Client → AI Gateway → JWT Auth → Rate Limiting → Istio Gateway → Model Service
  2. 📊 Observability: Client → AI Gateway → Istio Gateway → Monitoring Services
  3. 🔧 Management: Client → AI Gateway → Istio Gateway → Admin Services

Gateway Responsibilities

🚀 Tier-1: Envoy AI Gateway (Primary Entry Point)

  • Authentication: JWT token validation with JWKS
  • Authorization: Tenant-based access control
  • Rate Limiting: Per-tenant and global limits
  • AI Protocol: OpenAI-compatible API transformation
  • Routing: Model-aware intelligent routing

πŸ•ΈοΈ Tier-2: Istio Gateway (Service Mesh)

  • mTLS: Service-to-service encryption
  • Load Balancing: Traffic distribution
  • Circuit Breaking: Fault tolerance
  • Observability: Metrics and tracing
  • Service Discovery: Dynamic routing
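
The circuit-breaking responsibility above is typically expressed as an Istio DestinationRule. A minimal sketch with placeholder host and thresholds:

# Hedged example: connection pooling + outlier ejection for one model service
cat <<EOF | kubectl apply -f -
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: sklearn-iris-circuit-breaker
  namespace: tenant-a
spec:
  host: sklearn-iris-predictor.tenant-a.svc.cluster.local   # placeholder service host
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 100   # shed load beyond this queue depth
    outlierDetection:
      consecutive5xxErrors: 5          # eject a backend after 5 straight 5xx responses
      interval: 30s
      baseEjectionTime: 60s
EOF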

🚪 AI Gateway Features

JWT Authentication & Authorization

  • Tenant-specific JWT validation with dedicated JWKS endpoints
  • Automatic claim extraction to request headers for downstream services
  • Multi-provider support for different authentication sources
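
In Envoy Gateway terms, the JWT behavior above usually lives in a SecurityPolicy. A sketch under assumed names: the route, JWKS URL, and claim mapping are placeholders, not the repository's actual configuration (see configs/envoy-gateway/security-policies.yaml for that):

# Hedged sketch: per-tenant JWT validation with claim-to-header extraction
cat <<EOF | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: tenant-a-jwt
  namespace: envoy-gateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: tenant-a-route                # placeholder route name
  jwt:
    providers:
    - name: tenant-a-issuer
      remoteJWKS:
        uri: http://jwt-server.default.svc.cluster.local/.well-known/jwks.json   # placeholder JWKS endpoint
      claimToHeaders:
      - claim: tenant                   # copy the tenant claim...
        header: x-tenant                # ...into a header downstream routing can match
EOF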

Intelligent Routing

  • Model-aware routing based on x-ai-eg-model header
  • Header-based tenant routing for multi-tenant isolation
  • Fallback routing to Istio Gateway for non-AI traffic
  • EnvoyExtensionPolicy for external AI processing

Rate Limiting & Traffic Management

  • Per-tenant rate limiting with configurable limits
  • Global rate limiting for platform protection
  • Circuit breaker patterns for resilience
  • Retry policies with exponential backoff
  • Token-based limiting for LLM models
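
Per-tenant limits like these map onto an Envoy Gateway BackendTrafficPolicy. A sketch with placeholder route name and limits:

# Hedged example: 100 requests/minute for tenant-a, matched on the x-tenant header
cat <<EOF | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: tenant-a-ratelimit
  namespace: envoy-gateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: tenant-a-route                # placeholder route name
  rateLimit:
    type: Global
    global:
      rules:
      - clientSelectors:
        - headers:
          - name: x-tenant
            value: tenant-a
        limit:
          requests: 100
          unit: Minute
EOF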

Security & Compliance

  • CORS support for web applications
  • TLS termination at the edge
  • Security headers injection
  • Audit logging for compliance requirements

OpenAI API Compatibility

  • Automatic protocol translation from OpenAI to KServe format
  • Support for chat completions (/v1/chat/completions)
  • Support for completions (/v1/completions)
  • Support for embeddings (/v1/embeddings)
  • Model-specific routing with x-ai-eg-model header
  • Compatible with popular LLM servers (vLLM, TGI, Ollama, etc.)

Example API Usage

# All requests go through the AI Gateway first (Tier-1 Entry Point)
export AI_GATEWAY_URL="http://localhost:8080"
export JWT_TOKEN="<your-jwt-token>"

# Traditional model request to sklearn model (tenant-a)
curl -H "Authorization: Bearer $JWT_TOKEN" \
     -H "x-tenant: tenant-a" \
     -H "x-ai-eg-model: sklearn-iris" \
     $AI_GATEWAY_URL/v1/models/sklearn-iris:predict \
     -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'

# OpenAI-compatible chat completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
     -H "x-tenant: tenant-a" \
     -H "x-ai-eg-model: llama-3-8b" \
     $AI_GATEWAY_URL/v1/chat/completions \
     -d '{
       "model": "llama-3-8b",
       "messages": [
         {"role": "user", "content": "Hello, how are you?"}
       ],
       "temperature": 0.7
     }'

# OpenAI-compatible completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
     -H "x-tenant: tenant-a" \
     -H "x-ai-eg-model: gpt-j-6b" \
     $AI_GATEWAY_URL/v1/completions \
     -d '{
       "model": "gpt-j-6b",
       "prompt": "The quick brown fox",
       "max_tokens": 50
     }'

# The AI Gateway handles:
# 1. JWT validation and tenant authorization
# 2. Rate limiting and traffic management  
# 3. Model routing based on headers
# 4. OpenAI protocol transformation
# 5. Forwarding to Istio Gateway (Tier-2)

🚀 Getting Started

Quick Start Guide

  1. Prerequisites: Ensure Docker, Kind, kubectl, and Helm are installed
  2. Bootstrap: Run ./scripts/bootstrap.sh (takes 10-15 minutes)
  3. Access Services: Use the port-forward commands above
  4. Run Demos: Execute ./scripts/demo.sh for interactive scenarios
  5. Get JWT Tokens: Run ./scripts/get-jwt-tokens.sh for authentication

Development Workflow

  • Management Service: See management/README.md for Go backend + React frontend development
  • Configuration: Kubernetes configs in configs/ directory
  • Automation: Deployment scripts in scripts/ directory

📚 Documentation & Learning Resources

🎯 Essential Reading

  • GOALS.md - 🎯 Project vision, goals, and strategic impact
  • Getting Started Guide - 🚀 Step-by-step installation and bootstrap
  • Usage Guide - 📖 API usage and service access patterns

πŸ—οΈ Architecture & Design

  • Architecture Guide - 🏗️ Technical system design and patterns
  • CLAUDE.md - 🤖 AI assistant deployment guidance and commands

🔧 Operations & Management

  • Management Service Guide - 🖥️ Management API, publishing workflow, and UI reference
  • Model Publishing Guide - 📦 Publishing models for external access

🎭 Demonstrations & Examples

  • Demo Guide - 🎭 Interactive demo scenarios and walkthroughs

📖 Learning Path Recommendations

For Platform Engineers

  1. Start with GOALS.md to understand the vision and target state
  2. Follow Getting Started Guide for hands-on deployment
  3. Deep dive into Architecture Guide for technical patterns
  4. Use CLAUDE.md for AI-assisted operations

For AI/ML Engineers

  1. Read GOALS.md to understand AI/ML capabilities
  2. Quick start with Getting Started Guide
  3. Explore Model Publishing Guide for model deployment
  4. Reference Management Service Guide for API usage

For DevOps Teams

  1. Start with GOALS.md for operational understanding
  2. Follow Getting Started Guide for deployment
  3. Study Usage Guide for service management patterns
  4. Use Demo Guide for scenario testing

For Students & Educators

  1. Begin with GOALS.md for learning objectives
  2. Work through Getting Started Guide hands-on
  3. Explore Demo Guide for practical scenarios
  4. Reference Architecture Guide for deep understanding

🔧 Troubleshooting

Common Issues

  • Gateway not ready: Check kubectl get gateway -n envoy-gateway-system
  • JWT validation fails: Verify JWKS endpoint is accessible with kubectl get pods -n default -l app=jwt-server
  • Rate limiting: Check rate limit policies and quotas
  • Model not accessible: Verify model is ready with kubectl get inferenceservice --all-namespaces
  • Port conflicts: Ensure ports 3000, 8080, 8085, 9090, 16686, 20001 are available

Quick Verification

🔧 Detailed Troubleshooting: For comprehensive troubleshooting steps, see the Usage Guide

# Check overall cluster health
kubectl get pods --all-namespaces | grep -v Running

# Verify AI Gateway is ready
kubectl get pods -n envoy-gateway-system

# Check sample models are deployed
kubectl get inferenceservice --all-namespaces

Cleanup

# Complete cleanup
./scripts/cleanup.sh

# Or manual cleanup
kind delete cluster --name inference-in-a-box

πŸ“ Version Information

🔧 Source of Truth: All infrastructure component versions are defined in scripts/bootstrap.sh

Infrastructure Components

  • Istio: v1.26.2
  • KServe: v0.15.2
  • Knative: v1.18.1
  • Envoy Gateway: v1.4.2
  • Envoy AI Gateway: v0.2.1 (with EnvoyExtensionPolicy)
  • Cert Manager: v1.18.1
  • Prometheus Stack: v75.6.0
  • Grafana: v12.0.2
  • Jaeger: v3.4.1
  • Kiali: v2.11.0

Runtime Components

  • Go: v1.21 (management service backend)
  • Node.js: v18 (management service UI, JWT server)
  • React: v18.2.0 (management service frontend)
  • OpenAI API: Compatible with OpenAI SDK v1.x

🤝 Contributing

This is a demonstration project showcasing enterprise AI/ML deployment patterns. For questions or improvements, please refer to the documentation or create an issue.
