| status | maintainer | last_updated | tags |
|---|---|---|---|
| Active | pacoxu | 2025-12-01 | ai-infrastructure, kubernetes, learning-path, landscape |
中文版 (Chinese) | English
Welcome to the AI-Infra repository! This project provides a curated landscape and structured learning path for engineers building and operating modern AI infrastructure, especially in the Kubernetes and cloud-native ecosystem.
This landscape visualizes key components across the AI Infrastructure stack, mapped by:
- Horizontal Axis (X):
  - Left: Prototype / Early-stage projects
  - Right: Kernel & Runtime maturity
- Vertical Axis (Y):
  - Bottom: Infrastructure Layer (Kernel/Runtime)
  - Top: Application Layer (AI/Inference)
The goal is to demystify the evolving AI Infra stack and guide engineers on where to focus their learning.
- AI-Infra Landscape
- Learning Path for AI Infra Engineers
- RoadMap
- Contributing
- References
- Conferences
- License
- Kubernetes Overview
- Kubernetes Learning Plan
- Pod Lifecycle
- Pod Startup Speed
- GPU Pod Cold Start
- Scheduling Optimization
- Workload Isolation
- Dynamic Resource Allocation (DRA)
- DRA Performance Testing
- NVIDIA GPU Operator
- Node Resource Interface (NRI)
- Large-Scale Clusters (130K+ Nodes)
- Inference Overview
- Model Architectures
- LoRA: Low-Rank Adaptation
- AIBrix Platform
- OME Platform
- Serverless AI Inference
- Model Switching & Dynamic Scheduling
- Prefill-Decode Disaggregation
- Caching Strategies
- Memory & Context DB
- Large-Scale MoE Models
- Model Lifecycle Management
- Performance Testing
- Training Overview
- Transformers
- PyTorch Ecosystem
- Pre-Training
- Parallelism Strategies
- Kubeflow Training
- ArgoCD for GitOps
- Blog Overview
- KCD Hangzhou: Observability at Scale | 中文版 (Chinese)
- Kubernetes Safe Upgrade and Rollback | 中文版 (Chinese)
- JobSet In-Place Restart: 92% Faster | 中文版 (Chinese)
- cgroup v2 Migration Guide | 中文版 (Chinese)
- Gang Scheduling in Kubernetes v1.35 | 中文版 (Chinese)
- AWS 10K Node EKS Ultra Scale Clusters | 中文版 (Chinese)
- Inference Orchestration Solutions | 中文版 (Chinese)
Legend:
- Dashed outlines = Early stage or under exploration
- Labels on right = Functional categories
Inspired by Shohei Ohtani's goal achievement methodology, this chart outlines the key practices and habits for becoming a successful Cloud Native AI Infrastructure Architect. The chart is organized into nine core pillars: Kubernetes Core Skills, AI Workloads & GPU, AI Platform Architecture, Industry Influence, Architecture Vision, Technical Leadership, Self-Management, Family Time, and Long-term Thinking.
| Read source code 2 nights/week | Understand new WG project updates | Follow key new KEP implementations | GPU CUDA | Reservation and Backfill | Model Switching | Inference Orchestration | Training Fault Recovery | Consider multi-tenant isolation solutions |
|---|---|---|---|---|---|---|---|---|
| DRA + NRI | Kubernetes Core Skills | Security Upgrades | KV-cache / Prefill-Decode Summary | AI Workloads & GPU | Cold Start/Warm Pool | Cluster AutoScaler | AI Platform Architecture | Topology Management |
| API Server & ETCD Performance | Agent Sandbox | Self-healing Exploration | Track new models & operator trends | TPU/NPU etc. | Acceleration Solutions | Co-Evolving | Public vs Private Cloud Differences | |
| AI-Infra Repo Roadmap Maintenance | 2–3 Conference Talks/Year | Publish one technical long-form article monthly | Kubernetes Core Skills | AI Workloads & GPU | AI Platform Architecture | Multi-dimensional Cost Evaluation | Performance Quantification/Optimization | SLA Stability |
| English Proficiency (Blog) | Industry Influence | Conformance Certification | Industry Influence | Cloud Native AI Infra on Kubernetes Lead | Architecture Vision | Multi-cluster Solutions | Architecture Vision | Ultra-large Scale |
| Community Contributions | CNCF Ambassador | | Technical Leadership | Self-Management | Family Time | Think about 3-year evolution roadmap | Agentic / Model Ecosystem Trends | |
| Drive cross-company collaboration tasks | Learn to disagree gently but clearly | Cross-department Influence Enhancement | Ensure 7–8 hours of sleep | Exercise 3 times per week to maintain fitness | Quarterly OKR/Monthly Review/Top 5 Things | 1h daily quality time with daughter | Monthly date/long talk with spouse | Support spouse's personal time/interests |
| Mentor Core Contributors Team Building | Technical Leadership | Long-term Thinking | Control information input & screen time | Self-Management | Long vacation to prevent burn-out | Plan holidays/anniversaries in advance | Family Time | Daughter growth records / Quarterly review |
| Cross-project Dependency Governance, Architecture Coordination | Governance | Reading + Knowledge Base Accumulation | Reduce sugary drinks | | | Quarterly family travel budget & planning | Reserve time for family & rest | Annual family activity with parents |
Core Kubernetes components and container runtime fundamentals. Skip this section if using managed Kubernetes services.
- Key Components:
  - Core: Kubernetes, CRI, containerd, KubeVirt
  - Networking: CNI (focus: RDMA, specialized devices)
  - Storage: CSI (focus: checkpointing, model caching, data management)
  - Tools: KWOK (GPU node mocking), Helm (package management)
- Learning Topics:
  - Container lifecycle & runtime internals
  - Kubernetes scheduler architecture
  - Resource allocation & GPU management (see the GPU inventory sketch below)
  - For detailed guides, see Kubernetes Guide
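GPU resource accounting is easy to poke at directly. Below is a minimal sketch, assuming the official `kubernetes` Python client and NVIDIA's device plugin (which advertises the `nvidia.com/gpu` extended resource), that inventories allocatable GPUs per node:

```python
# Sketch: list schedulable GPUs per node with the Kubernetes Python
# client. Assumes kubeconfig (or in-cluster) access and the NVIDIA
# device plugin exposing the "nvidia.com/gpu" extended resource.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")  # values are strings
    if gpus != "0":
        print(f"{node.metadata.name}: {gpus} GPU(s) allocatable")
```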
Advanced scheduling, workload orchestration, and device management for AI workloads in Kubernetes clusters.
- Key Areas:
  - Batch Scheduling: Kueue, Volcano, Koordinator, Godel, YuniKorn (Kubernetes WG Batch)
  - GPU Scheduling: HAMi, NVIDIA KAI Scheduler, NVIDIA Grove
  - GPU Management: NVIDIA GPU Operator, NVIDIA DRA Driver, Device Plugins
  - Workload Management: LWS (LeaderWorkerSet), Pod Groups, Gang Scheduling
  - Device Management: DRA, NRI (Kubernetes WG Device Management)
  - Checkpoint/Restore: GPU checkpoint/restore for fault tolerance and migration (NVIDIA cuda-checkpoint, AMD AMDGPU plugin via CRIU)
- Learning Topics:
  - Job vs. pod scheduling strategies (binpack, spread, DRF); see the toy scoring sketch below
  - Queue management & SLOs
  - Multi-model & multi-tenant scheduling
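As a toy illustration of the binpack vs. spread strategies above (not a real scheduler plugin), the sketch below scores candidate nodes for a pod requesting one GPU; node names and capacities are made up:

```python
# Toy node scoring for a one-GPU pod. binpack consolidates work onto
# busy nodes (keeping whole nodes free for large jobs); spread prefers
# the emptiest nodes (reducing contention and blast radius).
def score(free_gpus: int, total_gpus: int, strategy: str) -> float:
    used_fraction = 1 - free_gpus / total_gpus
    return used_fraction if strategy == "binpack" else 1 - used_fraction

nodes = {"node-a": (1, 8), "node-b": (7, 8)}  # (free, total) GPUs, made up
for strategy in ("binpack", "spread"):
    best = max(nodes, key=lambda n: score(*nodes[n], strategy))
    print(f"{strategy}: schedule on {best}")
# binpack -> node-a (mostly full); spread -> node-b (mostly empty)
```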
See Kubernetes Guide for comprehensive coverage of pod lifecycle, scheduling optimization, workload isolation, and resource management. Detailed guides: Kubernetes Learning Plan | Pod Lifecycle | Pod Startup Speed | Scheduling Optimization | Isolation | DRA | DRA Performance Testing | NVIDIA GPU Operator | NRI
LLM inference engines, platforms, and optimization techniques for efficient model serving at scale.
- Key Topics:
  - Model architectures (Llama 3/4, Qwen 3, DeepSeek-V3, Flux)
  - Efficient transformer inference (KV Cache, FlashAttention, CUDA Graphs); see the KV-cache sketch below
  - LLM serving and orchestration platforms
  - Serverless AI inference (Knative, AWS SageMaker, cloud platforms)
  - Multi-accelerator optimization
  - MoE (Mixture of Experts) architectures
  - Model lifecycle management (cold-start, sleep mode, offloading)
  - AI agent memory and context management
  - Performance testing and benchmarking
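To make the KV-cache idea concrete, here is a toy single-head attention decode loop in NumPy; it is a didactic sketch with made-up dimensions, not how vLLM or FlashAttention implement it:

```python
# Toy attention decode loop showing what a KV cache buys: each new
# token reuses the keys/values of all earlier tokens instead of
# recomputing them from the full sequence.
import numpy as np

d = 64
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)
k_cache, v_cache = [], []                 # grows by one entry per decoded token

def decode_step(x):                       # x: (d,) hidden state of the newest token
    k_cache.append(x @ W_k)               # project K/V once, then keep them
    v_cache.append(x @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ x / np.sqrt(d)           # query projection omitted for brevity
    attn = np.exp(scores - scores.max())  # numerically stable softmax
    attn /= attn.sum()
    return attn @ V                       # attention output for the new token

for _ in range(4):                        # decode 4 tokens
    out = decode_step(np.random.randn(d))
```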
- RoadMap: See Inference Guide for comprehensive coverage of engines (vLLM, SGLang, Triton, TGI), platforms (Dynamo, AIBrix, OME, llmaz, Kthena, KServe), serverless solutions (Knative, AWS SageMaker), and deep-dive topics: Model Architectures | AIBrix | Serverless | P/D Disaggregation | Caching | Memory/Context DB | MoE Models | Model Lifecycle | Performance Testing
- Projects to Learn:
  - AI Gateway:
    - Gateway API Inference Extension
    - Envoy AI Gateway
    - Istio
    - KGateway: previously known as Gloo
    - DaoCloud knoway
    - Higress: Alibaba
    - Kong
    - Semantic Router: vLLM Project
  - Agentic Workflow:
    - Dify
    - KAgent: CNCF Sandbox
    - Dagger
    - kube-agentic-networking: Agentic networking policies and governance for agents and tools in Kubernetes
  - Serverless:
    - Knative: serverless solution; see, for instance, the Llama Stack use case
- Learning Topics:
  - API orchestration for LLMs
  - Prompt routing and A/B testing (see the routing sketch below)
  - RAG workflows, vector DB integration
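As a hypothetical, deliberately simplified illustration of prompt routing (real gateways such as the vLLM Semantic Router classify with embeddings, not keywords), consider the sketch below; all model names are placeholders:

```python
# Hypothetical keyword-based prompt router: send specialized or hard
# prompts to a suitable backend, everything else to a cheap default.
ROUTES = {
    "code": "qwen-coder",        # hypothetical coding-tuned backend
    "prove": "deepseek-r1",      # hypothetical reasoning backend
}
DEFAULT_MODEL = "llama-3-8b"     # hypothetical cheap default

def route(prompt: str) -> str:
    lowered = prompt.lower()
    for keyword, model in ROUTES.items():
        if keyword in lowered:
            return model
    return DEFAULT_MODEL

print(route("Write code to parse a JSON file"))   # -> qwen-coder
print(route("What is the capital of France?"))    # -> llama-3-8b
```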
Distributed training of large AI models on Kubernetes with fault tolerance, gang scheduling, and efficient resource management.
- Key Topics:
- Transformers: Standardizing model definitions across the PyTorch ecosystem
- PyTorch ecosystem and accelerator integration (DeepSpeed, vLLM, NPU/HPU/XPU)
- Distributed training strategies (data/model/pipeline parallelism); see the DDP sketch below
- Gang scheduling and job queueing
- Fault tolerance and checkpointing
- GPU error detection and recovery
- Training efficiency metrics (ETTR, MFU)
- GitOps workflows for training management
- Storage optimization for checkpoints
- Pre-training large language models (MoE, DeepSeek-V3, Llama 4)
- Scaling experiments and cluster setup (AMD MI325)
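As a minimal sketch of the simplest strategy above, data parallelism, here is a single-node PyTorch DDP training loop; the model and loss are placeholders, and the file name in the launch command is hypothetical:

```python
# Minimal single-node data-parallel sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
# Assumes NCCL-capable GPUs.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # torchrun sets rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node, set by torchrun
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 128, device=local_rank)  # each rank sees its own data shard
    loss = model(x).square().mean()              # dummy loss
    opt.zero_grad()
    loss.backward()                              # DDP all-reduces gradients here
    opt.step()

dist.destroy_process_group()
```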
See Training Guide for comprehensive coverage of training operators (Kubeflow, Volcano, Kueue), ML platforms (Kubeflow Pipelines, Argo Workflows), GitOps (ArgoCD), fault tolerance strategies, ByteDance's training optimization framework, and industry best practices. Detailed guides: Transformers | PyTorch Ecosystem | Pre-Training | Parallelism Strategies | Kubeflow | ArgoCD
Comprehensive monitoring, metrics, and observability across the AI infrastructure stack for production operations.
- Key Topics:
- Infrastructure monitoring: GPU utilization, memory, temperature, power
- Inference metrics: TTFT, TPOT, ITL, throughput, request latency (see the sketch below)
- Scheduler observability: Queue depth, scheduling latency, resource allocation
- LLM application tracing: Request traces, prompt performance, model quality
- Cost optimization: Resource utilization analysis and right-sizing
- Multi-tenant monitoring: Per-tenant metrics and fair-share enforcement
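These latency metrics are straightforward to derive from token arrival timestamps. The sketch below computes TTFT (time to first token), TPOT (time per output token), and ITL (inter-token latency) for one streaming request, using made-up timestamps; real harnesses aggregate these across many requests:

```python
# Sketch: streaming-latency metrics from one request's token timestamps.
def inference_metrics(request_start: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_start                         # time to first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    tpot = sum(itls) / len(itls)                                  # time per output token
    return {"ttft_s": ttft, "tpot_s": tpot, "max_itl_s": max(itls)}

print(inference_metrics(0.0, [0.35, 0.40, 0.46, 0.51]))
# {'ttft_s': 0.35, 'tpot_s': 0.0533..., 'max_itl_s': 0.06}
```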
See Observability Guide for comprehensive coverage of GPU monitoring (DCGM, Prometheus), inference metrics (OpenLLMetry, Langfuse, OpenLit), scheduler observability (Kueue, Volcano), distributed tracing (DeepFlow), and LLM evaluation platforms (TruLens, Deepchecks).
- Featured Tools:
  - OpenTelemetry-native: OpenLit, OpenLLMetry
  - LLM platforms: Langfuse, TruLens
  - Model validation: Deepchecks
  - Network tracing: DeepFlow
  - Infrastructure: Okahu
- Projects to Learn:
  - Model Spec: CNCF Sandbox
  - ImageVolume
For planned features, upcoming topics, and discussion on what may or may not be included in this repository, please see the RoadMap.
The roadmap has been updated to focus on the AI Native era (2025-2035), addressing key challenges including:
- AI Native Platform: Model/Agent as first-class citizens
- Resource Scheduling: DRA, heterogeneous computing, topology awareness
- Runtime Evolution: Container + WASM + Nix + Agent Runtime
- Platform Engineering 2.0: IDP + AI SRE + Security + Cost + Compliance
- Security & Supply Chain: Full-chain governance of AI assets
- Open Source & Ecosystem: Upstream collaboration in AI Infra
We welcome contributions to improve this landscape and path! Whether it's a new project, learning material, or diagram update — please open a PR or issue.
- CNCF Landscape
- Awesome LLMOps
- CNCF TAG Workloads Foundation
- CNCF TAG Infrastructure
- CNCF AI Initiative
- Kubernetes WG AI Gateway
- Kubernetes WG AI Conformance
- Kubernetes WG AI Integration
If you have some resources about AI Infra, please share them in #8.
Here are some key conferences in the AI Infra space:
- AI_dev: for instance, AI_dev EU 2025
- PyTorch Conference by PyTorch Foundation
- KubeCon+CloudNativeCon AI+ML Track, for instance, KubeCon NA 2025 and co-located events Cloud Native + Kubernetes AI Day
- AICon in China by QCon.
- GOSIM (Global Open-Source Innovation Meetup): for instance, GOSIM Hangzhou 2025
Apache License 2.0.
This repo is inspired by the rapidly evolving AI Infra stack and aims to help engineers navigate and master it.
