π Quick Navigation: π― Goals & Vision β’ π Getting Started β’ π Usage Guide β’ ποΈ Architecture β’ π€ AI Assistant β’ π Demos
A complete, production-ready inference platform that demonstrates enterprise-grade AI/ML model serving using modern cloud-native technologies. This platform combines Envoy AI Gateway, Istio service mesh, KServe serverless model serving, and comprehensive observability to create a robust, scalable, and secure inference-as-a-service solution.
Inference-in-a-Box is a comprehensive demonstration of how modern organizations can deploy AI/ML models at enterprise scale with:
- π Zero-Trust Security - Automatic mTLS encryption, fine-grained authorization, and compliance-ready audit logging
- β‘ Serverless Inference - Auto-scaling from zero to N instances based on traffic demand
- π Multi-Tenant Architecture - Secure isolation between different teams, projects, and customers
- π Enterprise Observability - Full-stack monitoring, distributed tracing, and AI-specific metrics
- πͺ Unified AI Gateway - Envoy AI Gateway as the primary entry point with JWT authentication and intelligent routing
- ποΈ Traffic Management - Canary deployments, A/B testing, and intelligent routing
graph TB
subgraph "Inference-in-a-Box Cluster"
subgraph "Tier-1 Gateway Layer (Primary Entry Point)"
EG[Envoy Gateway]
EAG[Envoy AI Gateway]
AUTH[JWT Authentication]
RL[Rate Limiting]
end
subgraph "Tier-2 Service Mesh Layer"
IC[Istiod]
IG[Istio Gateway]
MTLS[mTLS Encryption]
end
subgraph "Multi-Tenant Model Serving"
subgraph "Tenant A"
KS1[sklearn-iris]
IS1[Istio Sidecar]
end
subgraph "Tenant B"
KS2[Reserved]
IS2[Istio Sidecar]
end
subgraph "Tenant C"
KS3[pytorch-resnet]
IS3[Istio Sidecar]
end
end
subgraph "Serverless Infrastructure"
KC[KServe Controller]
KN[Knative Serving]
CM[Cert Manager]
end
subgraph "Observability Stack"
P[Prometheus]
G[Grafana]
K[Kiali]
AM[AlertManager]
end
end
subgraph "External"
CLIENT[AI Client Apps]
MODELS[Model Registry]
end
%% Primary Traffic Flow (Tier-1 β Tier-2)
CLIENT -->|HTTP/REST| EAG
EAG -->|JWT Validation| AUTH
AUTH -->|Authenticated| RL
RL -->|Rate Limited| IG
IG -->|mTLS Routing| KS1
IG -->|mTLS Routing| KS3
%% Gateway Integration
EG -->|Controls| EAG
IC -->|Manages| IG
IC -->|Enables| MTLS
%% Model Serving Infrastructure
KC -->|Manages| KN
KN -->|Serves| KS1
KN -->|Serves| KS2
KN -->|Serves| KS3
%% External Model Sources
MODELS -->|Deploys| KS1
MODELS -->|Deploys| KS2
MODELS -->|Deploys| KS3
%% Observability Flow
KS1 -->|Metrics| P
KS2 -->|Metrics| P
KS3 -->|Metrics| P
IC -->|Mesh Metrics| P
P -->|Data| G
P -->|Data| K
%% Styling
classDef tier1 fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
classDef tier2 fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef models fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef observability fill:#fff3e0,stroke:#f57c00,stroke-width:2px
class EG,EAG,AUTH,RL tier1
class IC,IG,MTLS tier2
class KS1,KS2,KS3,KC,KN,CM models
class P,G,K,AM observability
- π³ Kind - Local Kubernetes cluster for development and testing
- πͺ Envoy Gateway - Cloud-native API gateway with advanced routing capabilities
- π€ Envoy AI Gateway - AI-specific gateway with JWT authentication, model routing, and OpenAI API compatibility
- πΈοΈ Istio - Service mesh for security, traffic management, and observability
- π¦Ύ KServe - Kubernetes-native model serving with auto-scaling
- π Knative - Serverless framework for event-driven applications
- π Cert Manager - Automated certificate management
- π Prometheus - Metrics collection and alerting
- π Grafana - Visualization and dashboards
- π Jaeger - Distributed tracing
- πΊοΈ Kiali - Service mesh visualization
- π¨ AlertManager - Alert routing and management
- π§ TensorFlow Serving - TensorFlow model serving
- π₯ PyTorch Serve - PyTorch model serving
- β‘ Scikit-learn - Traditional ML model serving
- π€ Hugging Face - Transformer model support
- π OpenAI API - Compatible endpoints for LLM serving (vLLM, TGI, etc.)
sequenceDiagram
participant Client
participant Gateway as Envoy AI Gateway
participant Auth as Authentication
participant Istio as Istio Proxy
participant Model as KServe Model
Client->>Gateway: Inference Request
Gateway->>Auth: Validate JWT/API Key
Auth-->>Gateway: Authentication Result
Gateway->>Istio: Forward Request (mTLS)
Istio->>Model: Secure Request
Model-->>Istio: Inference Response
Istio-->>Gateway: Secure Response
Gateway-->>Client: Final Response
Note over Gateway,Model: All communication encrypted with mTLS
Note over Auth: RBAC policies enforced
Note over Istio: Zero-trust networking
- Zero-trust networking with automatic mTLS between all services
- Multi-tenant isolation with namespace-based security boundaries
- RBAC and authentication with JWT/API key validation
- Audit logging for compliance requirements (GDPR, HIPAA, SOC 2)
- Certificate management with automatic rotation
graph LR
subgraph "Model Lifecycle"
MR[Model Registry] --> KS[KServe Controller]
KS --> KN[Knative Serving]
KN --> POD[Model Pods]
end
subgraph "Auto-scaling"
POD --> AS[Auto-scaler]
AS --> |Scale Up| POD
AS --> |Scale to Zero| ZERO[No Pods]
ZERO --> |Cold Start| POD
end
subgraph "Traffic Management"
CANARY[Canary Deploy]
AB[A/B Testing]
BLUE[Blue/Green]
end
POD --> CANARY
POD --> AB
POD --> BLUE
- Serverless auto-scaling from zero to N instances based on demand
- Multi-framework support (Scikit-learn, PyTorch, TensorFlow, Hugging Face)
- OpenAI API compatibility with automatic protocol translation for LLMs
- AI Gateway routing with model-aware header-based routing (x-ai-eg-model)
- Canary deployments for gradual model rollouts
- A/B testing with intelligent traffic splitting
- Model versioning and rollback capabilities
- Resource optimization with GPU/CPU scheduling
- Protocol translation between OpenAI and KServe formats
- Workspace isolation with dedicated namespaces per tenant
- Resource quotas and governance policies
- Separate observability scopes for each tenant
- Independent lifecycle management and deployment schedules
- Cost tracking and chargeback mechanisms
graph TB
subgraph "Metrics Pipeline"
ISTIO[Istio Metrics] --> PROM[Prometheus]
KSERVE[KServe Metrics] --> PROM
CUSTOM[Custom AI Metrics] --> PROM
PROM --> GRAF[Grafana Dashboards]
end
subgraph "Tracing Pipeline"
REQ[Request Traces] --> JAEGER[Jaeger]
SPANS[Service Spans] --> JAEGER
JAEGER --> ANALYSIS[Trace Analysis]
end
subgraph "Logging Pipeline"
LOGS[Application Logs] --> LOKI[Loki]
AUDIT[Audit Logs] --> LOKI
LOKI --> GRAF
end
subgraph "Alerting"
PROM --> ALERT[AlertManager]
ALERT --> SLACK[Slack/Email]
end
- End-to-end distributed tracing across the entire inference pipeline
- AI-specific metrics including inference latency, throughput, and accuracy
- Business metrics for cost optimization and resource planning
- SLA monitoring with automated alerting
- Unified dashboards for operational visibility
The Management Service is a comprehensive web-based platform for managing AI/ML model inference operations. It provides both a REST API and an intuitive React-based web interface for complete model lifecycle management.
- One-click model publishing with configurable external access
- Public hostname configuration (default:
api.router.inference-in-a-box
) - Update published models - modify rate limits, paths, and hostnames
- Multi-tenant model isolation with namespace-based security
- Automatic model type detection (Traditional vs OpenAI-compatible)
- OpenAI API compatibility for LLM models (vLLM, TGI, etc.)
- Per-model rate limiting with configurable requests per minute/hour
- Token-based rate limiting for OpenAI-compatible models
- Burst limit configuration for handling traffic spikes
- Dynamic rate limit updates without republishing models
- Configurable public hostnames for external model access
- Custom path routing for model endpoints
- Automatic gateway configuration (Envoy AI Gateway + Istio)
- SSL/TLS termination with automatic certificate management
- JWT-based authentication with tenant isolation
- API key management for external access
- API key rotation with zero-downtime updates
- Admin and tenant-level permissions
- Interactive inference testing directly from the UI
- Support for both traditional and OpenAI-style testing
- Real-time response visualization
- Custom DNS resolution for cluster-internal testing
- Automatic JWT token generation for test requests
graph TB
subgraph "Management Service Stack"
UI[React Frontend] --> API[Go Backend]
API --> K8S[Kubernetes API]
API --> GATEWAY[Gateway Configuration]
API --> STORAGE[Model Metadata]
end
subgraph "Model Publishing Flow"
PUBLISH[Publish Model] --> VALIDATE[Validate Config]
VALIDATE --> GATEWAY_CONFIG[Create Gateway Routes]
GATEWAY_CONFIG --> RATE_LIMIT[Setup Rate Limiting]
RATE_LIMIT --> API_KEY[Generate API Key]
API_KEY --> DOCS[Generate Documentation]
end
subgraph "External Access"
CLIENT[External Client] --> ENVOY[Envoy AI Gateway]
ENVOY --> ISTIO[Istio Gateway]
ISTIO --> KSERVE[KServe Model]
end
GET /api/models
- List all modelsPOST /api/models
- Create new modelGET /api/models/{name}
- Get model detailsPUT /api/models/{name}
- Update model configurationDELETE /api/models/{name}
- Delete model
POST /api/models/{name}/publish
- Publish model for external accessPUT /api/models/{name}/publish
- Update published model configurationGET /api/models/{name}/publish
- Get published model detailsDELETE /api/models/{name}/publish
- Unpublish modelGET /api/published-models
- List all published models
POST /api/models/{name}/publish/rotate-key
- Rotate API keyPOST /api/validate-api-key
- Validate API key (for gateway)
GET /api/admin/system
- System informationGET /api/admin/tenants
- Tenant managementPOST /api/admin/kubectl
- Execute kubectl commands
# Access the Management Service UI
kubectl port-forward svc/management-service 8085:80
# Open in browser
open http://localhost:8085
# Admin login and get JWT token
export ADMIN_TOKEN=$(curl -s -X POST -H "Content-Type: application/json" \
-d '{"username": "admin", "password": "password"}' \
http://localhost:8085/api/admin/login | jq -r '.token')
# Verify login
curl -H "Authorization: Bearer $ADMIN_TOKEN" \
http://localhost:8085/api/admin/system
# 1. Create a model
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "my-model", "framework": "sklearn", "storageUri": "s3://my-bucket/model"}' \
http://localhost:8085/api/models
# 2. Publish model with custom hostname
curl -X POST -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"publicHostname": "api.router.inference-in-a-box",
"externalPath": "/models/my-model",
"rateLimiting": {
"requestsPerMinute": 100,
"requestsPerHour": 5000
}
}
}' \
http://localhost:8085/api/models/my-model/publish
# 3. Update published model configuration
curl -X PUT -H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"config": {
"tenantId": "tenant-a",
"publicHostname": "api.router.inference-in-a-box",
"rateLimiting": {
"requestsPerMinute": 200,
"requestsPerHour": 10000
}
}
}' \
http://localhost:8085/api/models/my-model/publish
# 4. Access published model externally
curl -H "X-API-Key: $API_KEY" \
https://api.router.inference-in-a-box/models/my-model/predict \
-d '{"input": "sample data"}'
For a comprehensive example of all admin API operations, use the provided script:
# Run the complete admin API demo
./scripts/admin-api-example.sh
This script demonstrates:
- Admin authentication
- System information retrieval
- Model and tenant management
- Model publishing workflow
- External API testing
- kubectl command execution
Ensure you have the following tools installed:
# Required tools
docker --version # Docker 20.10+
kind --version # Kind 0.20+
kubectl version --client # kubectl 1.24+
helm version # Helm 3.12+
curl --version # curl (any recent version)
jq --version # jq 1.6+
# Optional but recommended
istioctl version # Istio CLI (auto-installed by bootstrap)
- Memory: Minimum 8GB RAM (16GB recommended for full observability stack)
- CPU: 4+ cores recommended
- Disk: 20GB+ free space for container images
- OS: macOS, Linux, or Windows with WSL2
# Clone the repository
git clone <repository-url>
cd inference-in-a-box
# Bootstrap the entire platform (takes 10-15 minutes)
./scripts/bootstrap.sh
# Run demo scenarios
./scripts/demo.sh
# Access the platform (run these in separate terminals)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
kubectl port-forward -n envoy-gateway-system svc/envoy-ai-gateway 8080:80 &
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 &
kubectl port-forward -n monitoring svc/kiali 20001:20001 &
kubectl port-forward -n monitoring svc/jaeger-query 16686:16686 &
kubectl port-forward -n default svc/management-service 8085:80 &
echo "π Platform is ready!"
echo "π€ AI Gateway (Primary Entry): http://localhost:8080"
echo "π Grafana: http://localhost:3000 (admin/prom-operator)"
echo "π Prometheus: http://localhost:9090"
echo "πΊοΈ Kiali: http://localhost:20001"
echo "π Jaeger: http://localhost:16686"
echo "π§ Management UI: http://localhost:8085"
echo ""
echo "π‘ All AI/ML requests go through the AI Gateway first!"
echo " The AI Gateway handles JWT auth and routes to Istio Gateway"
# 1. Create Kind cluster
./scripts/clusters/create-kind-cluster.sh
# 2. Install core infrastructure
./scripts/install/install-envoy-gateway.sh
./scripts/install/install-istio.sh
./scripts/install/install-kserve.sh
./scripts/install/install-observability.sh
# 3. Deploy sample models
./scripts/models/deploy-samples.sh
# 4. Configure security and policies
./scripts/security/setup-policies.sh
# 5. Run tests
./scripts/test/run-tests.sh
- Zero-trust networking with automatic mTLS
- Multi-tenant isolation with workspace boundaries
- RBAC and authentication policies
- Certificate management and rotation
- Multiple ML frameworks (TensorFlow, PyTorch, Scikit-learn)
- Auto-scaling from zero to N instances
- Canary deployments and A/B testing
- Model versioning and rollback
- Intelligent routing and load balancing
- Circuit breaking and failover
- Rate limiting and throttling
- Geographic routing simulation
- Distributed tracing across the inference pipeline
- Custom metrics for AI workloads
- Unified logging and monitoring
- SLA tracking and alerting
- Namespace-based tenant isolation
- Resource quotas and governance
- Separate observability scopes
- Independent lifecycle management
inference-in-a-box/
βββ README.md
βββ scripts/
β βββ bootstrap.sh
β βββ cleanup.sh
β βββ demo.sh
β βββ clusters/
β βββ create-kind-cluster.sh
β βββ setup-networking.sh
βββ configs/
β βββ clusters/
β β βββ cluster.yaml
β βββ envoy-gateway/
β β βββ gatewayclass.yaml
β β βββ ai-gateway.yaml
β β βββ httproute.yaml
β β βββ ai-backends.yaml
β β βββ security-policies.yaml
β β βββ rate-limiting.yaml
β βββ istio/
β β βββ installation.yaml
β β βββ gateway.yaml
β β βββ virtual-services/
β βββ kserve/
β β βββ installation.yaml
β β βββ models/
β βββ envoy-ai-gateway/
β β βββ configuration.yaml
β βββ observability/
β βββ prometheus.yaml
β βββ grafana/
βββ models/
β βββ sklearn-iris/
β βββ tensorflow-mnist/
β βββ pytorch-resnet/
βββ examples/
β βββ inference-requests/
β βββ security-policies/
β βββ traffic-scenarios/
βββ docs/
βββ architecture.md
βββ deployment-guide.md
βββ troubleshooting.md
- Docker Desktop or equivalent
- kubectl
- kind
- helm
- curl
- jq
sequenceDiagram
participant User
participant Gateway
participant Auth
participant Model
User->>Gateway: Request with JWT
Gateway->>Auth: Validate Token
Auth-->>Gateway: Authorized
Gateway->>Model: Forward Request (mTLS)
Model-->>Gateway: Inference Result
Gateway-->>User: Secure Response
# The demo script automatically generates load through the AI Gateway
./scripts/demo.sh
# Select option 2 for auto-scaling demo
# Watch pods scale from 0 to N
watch "kubectl get pods -n tenant-a -l serving.kserve.io/inferenceservice=sklearn-iris"
# The demo script creates a canary deployment for sklearn-iris
./scripts/demo.sh
# Select option 3 for canary deployment demo
# Monitor traffic split
kubectl get virtualservice -n tenant-a
# The demo script shows tenant isolation and resource boundaries
./scripts/demo.sh
# Select option 4 for multi-tenant isolation demo
# Verify isolation
kubectl get networkpolicies -A
graph LR
subgraph "Grafana Dashboards"
OVERVIEW["π Platform Overview"]
MODELS["π€ Model Performance"]
SECURITY["π Security Metrics"]
BUSINESS["π° Business KPIs"]
end
subgraph "Data Sources"
PROM["π Prometheus"]
JAEGER["π Jaeger"]
ISTIO["πΈοΈ Istio Metrics"]
KSERVE["π€ KServe Metrics"]
end
PROM --> OVERVIEW
PROM --> MODELS
ISTIO --> SECURITY
KSERVE --> BUSINESS
JAEGER --> MODELS
- π― Model Performance: Inference latency, throughput, accuracy
- β‘ Infrastructure: CPU/Memory usage, auto-scaling events
- π Security: Authentication failures, policy violations
- π° Business: Cost per inference, tenant usage, SLA compliance
- π Network: Request rates, error rates, circuit breaker events
# Example alert rules
groups:
- name: inference.rules
rules:
- alert: HighInferenceLatency
expr: histogram_quantile(0.95, rate(kserve_request_duration_seconds_bucket[5m])) > 1
for: 2m
labels:
severity: warning
annotations:
summary: "High inference latency detected"
- alert: ModelDown
expr: up{job="kserve-model"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Model service is down"
The platform implements a two-tier gateway architecture where external traffic first hits the Envoy AI Gateway (Tier-1) and then flows to the Istio Gateway (Tier-2) for service mesh routing:
sequenceDiagram
participant Client as π₯οΈ Client Apps
participant EAG as π€ AI Gateway (Tier-1)
participant Auth as π JWT Auth
participant IG as πΈοΈ Istio Gateway (Tier-2)
participant Model as π― Model Service
Client->>EAG: HTTP/REST Request
EAG->>Auth: Validate JWT Token
Auth-->>EAG: Token Valid (tenant-x)
EAG->>EAG: Apply Rate Limits
EAG->>EAG: Extract Model Name
EAG->>IG: Route to Service Mesh
IG->>Model: mTLS Encrypted Request
Model-->>IG: Inference Response
IG-->>EAG: Response via Service Mesh
EAG-->>Client: Final Response
- π― AI Model Inference:
Client β AI Gateway β JWT Auth β Rate Limiting β Istio Gateway β Model Service
- π Observability:
Client β AI Gateway β Istio Gateway β Monitoring Services
- π§ Management:
Client β AI Gateway β Istio Gateway β Admin Services
- Authentication: JWT token validation with JWKS
- Authorization: Tenant-based access control
- Rate Limiting: Per-tenant and global limits
- AI Protocol: OpenAI-compatible API transformation
- Routing: Model-aware intelligent routing
- mTLS: Service-to-service encryption
- Load Balancing: Traffic distribution
- Circuit Breaking: Fault tolerance
- Observability: Metrics and tracing
- Service Discovery: Dynamic routing
- Tenant-specific JWT validation with dedicated JWKS endpoints
- Automatic claim extraction to request headers for downstream services
- Multi-provider support for different authentication sources
- Model-aware routing based on x-ai-eg-model header
- Header-based tenant routing for multi-tenant isolation
- Fallback routing to Istio Gateway for non-AI traffic
- EnvoyExtensionPolicy for external AI processing
- Per-tenant rate limiting with configurable limits
- Global rate limiting for platform protection
- Circuit breaker patterns for resilience
- Retry policies with exponential backoff
- Token-based limiting for LLM models
- CORS support for web applications
- TLS termination at the edge
- Security headers injection
- Audit logging for compliance requirements
- Automatic protocol translation from OpenAI to KServe format
- Support for chat completions (
/v1/chat/completions
) - Support for completions (
/v1/completions
) - Support for embeddings (
/v1/embeddings
) - Model-specific routing with x-ai-eg-model header
- Compatible with popular LLM servers (vLLM, TGI, Ollama, etc.)
# All requests go through the AI Gateway first (Tier-1 Entry Point)
export AI_GATEWAY_URL="http://localhost:8080"
export JWT_TOKEN="<your-jwt-token>"
# Traditional model request to sklearn model (tenant-a)
curl -H "Authorization: Bearer $JWT_TOKEN" \
-H "x-tenant: tenant-a" \
-H "x-ai-eg-model: sklearn-iris" \
$AI_GATEWAY_URL/v1/models/sklearn-iris:predict \
-d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'
# OpenAI-compatible chat completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
-H "x-tenant: tenant-a" \
-H "x-ai-eg-model: llama-3-8b" \
$AI_GATEWAY_URL/v1/chat/completions \
-d '{
"model": "llama-3-8b",
"messages": [
{"role": "user", "content": "Hello, how are you?"}
],
"temperature": 0.7
}'
# OpenAI-compatible completion request
curl -H "Authorization: Bearer $JWT_TOKEN" \
-H "x-tenant: tenant-a" \
-H "x-ai-eg-model: gpt-j-6b" \
$AI_GATEWAY_URL/v1/completions \
-d '{
"model": "gpt-j-6b",
"prompt": "The quick brown fox",
"max_tokens": 50
}'
# The AI Gateway handles:
# 1. JWT validation and tenant authorization
# 2. Rate limiting and traffic management
# 3. Model routing based on headers
# 4. OpenAI protocol transformation
# 5. Forwarding to Istio Gateway (Tier-2)
- Prerequisites: Ensure Docker, Kind, kubectl, and Helm are installed
- Bootstrap: Run
./scripts/bootstrap.sh
(takes 10-15 minutes) - Access Services: Use the port-forward commands above
- Run Demos: Execute
./scripts/demo.sh
for interactive scenarios - Get JWT Tokens: Run
./scripts/get-jwt-tokens.sh
for authentication
- Management Service: See management/README.md for Go backend + React frontend development
- Configuration: Kubernetes configs in configs/ directory
- Automation: Deployment scripts in scripts/ directory
- GOALS.md - π― Project vision, goals, and strategic impact
- Getting Started Guide - π Step-by-step installation and bootstrap
- Usage Guide - π API usage and service access patterns
- Architecture Guide - ποΈ Technical system design and patterns
- CLAUDE.md - π€ AI assistant deployment guidance and commands
- Management Service Guide - π§ Complete API reference and web UI
- Model Publishing Guide - π Publishing workflow and best practices
- Demo Guide - π Interactive demonstrations and scenarios
- Examples Directory - π‘ Sample configurations and use cases
- Start with GOALS.md to understand the vision and target state
- Follow Getting Started Guide for hands-on deployment
- Deep dive into Architecture Guide for technical patterns
- Use CLAUDE.md for AI-assisted operations
- Read GOALS.md to understand AI/ML capabilities
- Quick start with Getting Started Guide
- Explore Model Publishing Guide for model deployment
- Reference Management Service Guide for API usage
- Start with GOALS.md for operational understanding
- Follow Getting Started Guide for deployment
- Study Usage Guide for service management patterns
- Use Demo Guide for scenario testing
- Begin with GOALS.md for learning objectives
- Work through Getting Started Guide hands-on
- Explore Demo Guide for practical scenarios
- Reference Architecture Guide for deep understanding
- Gateway not ready: Check
kubectl get gateway -n envoy-gateway-system
- JWT validation fails: Verify JWKS endpoint is accessible with
kubectl get pods -n default -l app=jwt-server
- Rate limiting: Check rate limit policies and quotas
- Model not accessible: Verify model is ready with
kubectl get inferenceservice --all-namespaces
- Port conflicts: Ensure ports 3000, 8080, 8085, 9090, 16686, 20001 are available
π§ Detailed Troubleshooting: For comprehensive troubleshooting steps, see Usage Guide
# Check overall cluster health
kubectl get pods --all-namespaces | grep -v Running
# Verify AI Gateway is ready
kubectl get pods -n envoy-gateway-system
# Check sample models are deployed
kubectl get inferenceservice --all-namespaces
# Complete cleanup
./scripts/cleanup.sh
# Or manual cleanup
kind delete cluster --name inference-in-a-box
π§ Source of Truth: All infrastructure component versions are defined in
scripts/bootstrap.sh
- Istio: v1.26.2
- KServe: v0.15.2
- Knative: v1.18.1
- Envoy Gateway: v1.4.2
- Envoy AI Gateway: v0.2.1 (with EnvoyExtensionPolicy)
- Cert Manager: v1.18.1
- Prometheus Stack: v75.6.0
- Grafana: v12.0.2
- Jaeger: v3.4.1
- Kiali: v2.11.0
- Go: v1.21 (management service backend)
- Node.js: v18 (management service UI, JWT server)
- React: v18.2.0 (management service frontend)
- OpenAI API: Compatible with OpenAI SDK v1.x
This is a demonstration project showcasing enterprise AI/ML deployment patterns. For questions or improvements, please refer to the documentation or create an issue.