| status | maintainer | last_updated | tags |
|---|---|---|---|
| Active | pacoxu | 2025-12-01 | ai-infrastructure, kubernetes, learning-path, landscape |
中文版 (Chinese) | English
Welcome to the AI-Infra repository! This project provides a curated landscape and structured learning path for engineers building and operating modern AI infrastructure, especially in the Kubernetes and cloud-native ecosystem.
This landscape visualizes key components across the AI Infrastructure stack, mapped by:
- Horizontal Axis (X):
  - Left: Prototype / Early-stage projects
  - Right: Kernel & Runtime maturity
- Vertical Axis (Y):
  - Bottom: Infrastructure Layer (Kernel/Runtime)
  - Top: Application Layer (AI/Inference)
The goal is to demystify the evolving AI Infra stack and guide engineers on where to focus their learning.
- AI-Infra Landscape
- Learning Path for AI Infra Engineers
- RoadMap
- Contributing
- References
- Conferences
- License
- Kubernetes Overview
- Kubernetes Learning Plan
- Pod Lifecycle
- Pod Startup Speed
- GPU Pod Cold Start
- Scheduling Optimization
- Workload Isolation
- Dynamic Resource Allocation (DRA)
- DRA Performance Testing
- NVIDIA GPU Operator
- Node Resource Interface (NRI)
- Large-Scale Clusters (130K+ Nodes)
- Inference Overview
- Model Architectures
- LoRA: Low-Rank Adaptation
- AIBrix Platform
- OME Platform
- Serverless AI Inference
- Model Switching & Dynamic Scheduling
- Prefill-Decode Disaggregation
- Caching Strategies
- Memory & Context DB
- Large-Scale MoE Models
- Model Lifecycle Management
- Performance Testing
- Training Overview
- Transformers
- PyTorch Ecosystem
- Pre-Training
- Parallelism Strategies
- Kubeflow Training
- ArgoCD for GitOps
- Blog Overview
- KCD Hangzhou: Observability at Scale | 中文版 (Chinese)
- Kubernetes Safe Upgrade and Rollback | 中文版 (Chinese)
- JobSet In-Place Restart: 92% Faster | 中文版 (Chinese)
- cgroup v2 Migration Guide | 中文版 (Chinese)
- Gang Scheduling in Kubernetes v1.35 | 中文版 (Chinese)
- AWS 10K Node EKS Ultra Scale Clusters | 中文版 (Chinese)
- Inference Orchestration Solutions | 中文版 (Chinese)
Legend:
- Dashed outlines = Early stage or under exploration
- Labels on right = Functional categories
Inspired by Shohei Ohtani's goal achievement methodology, this chart outlines the key practices and habits for becoming a successful Cloud Native AI Infrastructure Architect. The chart is organized into nine core pillars: Kubernetes Core Skills, AI Workloads & GPU, AI Platform Architecture, Industry Influence, Architecture Vision, Technical Leadership, Self-Management, Family Time, and Long-term Thinking.
| Read source code 2 nights/week | Understand new WG project updates | Follow key new KEP implementations | GPU CUDA | Reservation and Backfill | Model Switching | Inference Orchestration | Training Fault Recovery | Consider multi-tenant isolation solutions |
|---|---|---|---|---|---|---|---|---|
| DRA + NRI | Kubernetes Core Skills | Security Upgrades | KV-cache / Prefill-Decode Summary | AI Workloads & GPU | Cold Start/Warm Pool | Cluster AutoScaler | AI Platform Architecture | Topology Management |
| API Server & ETCD Performance | Agent Sandbox | Self-healing Exploration | Track new models & operator trends | TPU/NPU etc. | Acceleration Solutions | Co-Evolving | Public vs Private Cloud Differences | |
| AI-Infra Repo Roadmap Maintenance | 2–3 Conference Talks/Year | Publish one technical long-form article monthly | Kubernetes Core Skills | AI Workloads & GPU | AI Platform Architecture | Multi-dimensional Cost Evaluation | Performance Quantification/Optimization | SLA Stability |
| English Proficiency (Blog) | Industry Influence | Conformance Certification | Industry Influence | Cloud Native AI Infra on Kubernetes Lead | Architecture Vision | Multi-cluster Solutions | Architecture Vision | Ultra-large Scale |
| Community Contributions | CNCF Ambassador | | Technical Leadership | Self-Management | Family Time | Think about 3-year evolution roadmap | Agentic / Model Ecosystem Trends | |
| Drive cross-company collaboration tasks | Learn to disagree gently but clearly | Cross-department Influence Enhancement | Ensure 7–8 hours of sleep | Exercise 3 times per week to maintain fitness | Quarterly OKR/Monthly Review/Top 5 Things | 1h daily quality time with daughter | Monthly date/long talk with spouse | Support spouse's personal time/interests |
| Mentor Core Contributors Team Building | Technical Leadership | Long-term Thinking | Control information input & screen time | Self-Management | Long vacation to prevent burn-out | Plan holidays/anniversaries in advance | Family Time | Daughter growth records / Quarterly review |
| Cross-project Dependency Governance, Architecture Coordination | Governance | Reading + Knowledge Base Accumulation | Reduce sugary drinks | | | Quarterly family travel budget & planning | Reserve time for family & rest | Annual family activity with parents |
Core Kubernetes components and container runtime fundamentals. Skip this section if using managed Kubernetes services.
- Key Components:
  - Core: Kubernetes, CRI, containerd, KubeVirt
  - Networking: CNI (focus: RDMA, specialized devices)
  - Storage: CSI (focus: checkpointing, model caching, data management)
  - Tools: KWOK (GPU node mocking), Helm (package management)
- Learning Topics:
  - Container lifecycle & runtime internals
  - Kubernetes scheduler architecture
  - Resource allocation & GPU management (see the GPU inventory sketch below)
  - For detailed guides, see Kubernetes Guide
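GPU resource accounting is easy to poke at directly. Below is a minimal sketch, assuming the official `kubernetes` Python client and NVIDIA's device plugin (which advertises the `nvidia.com/gpu` extended resource), that inventories allocatable GPUs per node:

```python
# Sketch: list schedulable GPUs per node with the Kubernetes Python
# client. Assumes kubeconfig (or in-cluster) access and the NVIDIA
# device plugin exposing the "nvidia.com/gpu" extended resource.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")  # values are strings
    if gpus != "0":
        print(f"{node.metadata.name}: {gpus} GPU(s) allocatable")
```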
Advanced scheduling, workload orchestration, and device management for AI workloads in Kubernetes clusters.
- Key Areas:
  - Batch Scheduling: Kueue, Volcano, Koordinator, Godel, YuniKorn (Kubernetes WG Batch)
  - GPU Scheduling: HAMi, NVIDIA KAI Scheduler, NVIDIA Grove
  - GPU Management: NVIDIA GPU Operator, NVIDIA DRA Driver, Device Plugins
  - Workload Management: LWS (LeaderWorkerSet), Pod Groups, Gang Scheduling
  - Device Management: DRA, NRI (Kubernetes WG Device Management)
  - Checkpoint/Restore: GPU checkpoint/restore for fault tolerance and migration (NVIDIA cuda-checkpoint, AMD AMDGPU plugin via CRIU)
- Learning Topics:
  - Job vs. pod scheduling strategies (binpack, spread, DRF); see the toy scoring sketch below
  - Queue management & SLOs
  - Multi-model & multi-tenant scheduling
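As a toy illustration of the binpack vs. spread strategies above (not a real scheduler plugin), the sketch below scores candidate nodes for a pod requesting one GPU; node names and capacities are made up:

```python
# Toy node scoring for a one-GPU pod. binpack consolidates work onto
# busy nodes (keeping whole nodes free for large jobs); spread prefers
# the emptiest nodes (reducing contention and blast radius).
def score(free_gpus: int, total_gpus: int, strategy: str) -> float:
    used_fraction = 1 - free_gpus / total_gpus
    return used_fraction if strategy == "binpack" else 1 - used_fraction

nodes = {"node-a": (1, 8), "node-b": (7, 8)}  # (free, total) GPUs, made up
for strategy in ("binpack", "spread"):
    best = max(nodes, key=lambda n: score(*nodes[n], strategy))
    print(f"{strategy}: schedule on {best}")
# binpack -> node-a (mostly full); spread -> node-b (mostly empty)
```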
See Kubernetes Guide for comprehensive coverage of pod lifecycle, scheduling optimization, workload isolation, and resource management. Detailed guides: Kubernetes Learning Plan | Pod Lifecycle | Pod Startup Speed | Scheduling Optimization | Isolation | DRA | DRA Performance Testing | NVIDIA GPU Operator | NRI
LLM inference engines, platforms, and optimization techniques for efficient model serving at scale.
- Key Topics:
  - Model architectures (Llama 3/4, Qwen 3, DeepSeek-V3, Flux)
  - Efficient transformer inference (KV Cache, FlashAttention, CUDA Graphs); see the KV-cache sketch below
  - LLM serving and orchestration platforms
  - Serverless AI inference (Knative, AWS SageMaker, cloud platforms)
  - Multi-accelerator optimization
  - MoE (Mixture of Experts) architectures
  - Model lifecycle management (cold-start, sleep mode, offloading)
  - AI agent memory and context management
  - Performance testing and benchmarking
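To make the KV-cache idea concrete, here is a toy single-head attention decode loop in NumPy; it is a didactic sketch with made-up dimensions, not how vLLM or FlashAttention implement it:

```python
# Toy attention decode loop showing what a KV cache buys: each new
# token reuses the keys/values of all earlier tokens instead of
# recomputing them from the full sequence.
import numpy as np

d = 64
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)
k_cache, v_cache = [], []                 # grows by one entry per decoded token

def decode_step(x):                       # x: (d,) hidden state of the newest token
    k_cache.append(x @ W_k)               # project K/V once, then keep them
    v_cache.append(x @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ x / np.sqrt(d)           # query projection omitted for brevity
    attn = np.exp(scores - scores.max())  # numerically stable softmax
    attn /= attn.sum()
    return attn @ V                       # attention output for the new token

for _ in range(4):                        # decode 4 tokens
    out = decode_step(np.random.randn(d))
```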
- RoadMap: See Inference Guide for comprehensive coverage of engines (vLLM, SGLang, Triton, TGI), platforms (Dynamo, AIBrix, OME, llmaz, Kthena, KServe), serverless solutions (Knative, AWS SageMaker), and deep-dive topics: Model Architectures | AIBrix | Serverless | P/D Disaggregation | Caching | Memory/Context DB | MoE Models | Model Lifecycle | Performance Testing
- Projects to Learn:
  - AI Gateway:
    - Gateway API Inference Extension
    - Envoy AI Gateway
    - Istio
    - KGateway: previously known as Gloo
    - DaoCloud knoway
    - Higress: Alibaba
    - Kong
    - Semantic Router: vLLM Project
  - Agentic Workflow:
    - Dify
    - KAgent: CNCF Sandbox
    - Dagger
    - kube-agentic-networking: Agentic networking policies and governance for agents and tools in Kubernetes
  - Serverless:
    - Knative: serverless solution; see, for instance, the Llama Stack use case
- Learning Topics:
  - API orchestration for LLMs
  - Prompt routing and A/B testing (see the routing sketch below)
  - RAG workflows, vector DB integration
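As a hypothetical, deliberately simplified illustration of prompt routing (real gateways such as the vLLM Semantic Router classify with embeddings, not keywords), consider the sketch below; all model names are placeholders:

```python
# Hypothetical keyword-based prompt router: send specialized or hard
# prompts to a suitable backend, everything else to a cheap default.
ROUTES = {
    "code": "qwen-coder",        # hypothetical coding-tuned backend
    "prove": "deepseek-r1",      # hypothetical reasoning backend
}
DEFAULT_MODEL = "llama-3-8b"     # hypothetical cheap default

def route(prompt: str) -> str:
    lowered = prompt.lower()
    for keyword, model in ROUTES.items():
        if keyword in lowered:
            return model
    return DEFAULT_MODEL

print(route("Write code to parse a JSON file"))   # -> qwen-coder
print(route("What is the capital of France?"))    # -> llama-3-8b
```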
Distributed training of large AI models on Kubernetes with fault tolerance, gang scheduling, and efficient resource management.
- Key Topics:
- Transformers: Standardizing model definitions across the PyTorch ecosystem
- PyTorch ecosystem and accelerator integration (DeepSpeed, vLLM, NPU/HPU/XPU)
- Distributed training strategies (data/model/pipeline parallelism); see the DDP sketch below
- Gang scheduling and job queueing
- Fault tolerance and checkpointing
- GPU error detection and recovery
- Training efficiency metrics (ETTR, MFU)
- GitOps workflows for training management
- Storage optimization for checkpoints
- Pre-training large language models (MoE, DeepSeek-V3, Llama 4)
- Scaling experiments and cluster setup (AMD MI325)
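As a minimal sketch of the simplest strategy above, data parallelism, here is a single-node PyTorch DDP training loop; the model and loss are placeholders, and the file name in the launch command is hypothetical:

```python
# Minimal single-node data-parallel sketch with PyTorch DDP.
# Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
# Assumes NCCL-capable GPUs.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # torchrun sets rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node, set by torchrun
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 128, device=local_rank)  # each rank sees its own data shard
    loss = model(x).square().mean()              # dummy loss
    opt.zero_grad()
    loss.backward()                              # DDP all-reduces gradients here
    opt.step()

dist.destroy_process_group()
```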
See Training Guide for comprehensive coverage of training operators (Kubeflow, Volcano, Kueue), ML platforms (Kubeflow Pipelines, Argo Workflows), GitOps (ArgoCD), fault tolerance strategies, ByteDance's training optimization framework, and industry best practices. Detailed guides: Transformers | PyTorch Ecosystem | Pre-Training | Parallelism Strategies | Kubeflow | ArgoCD
Comprehensive monitoring, metrics, and observability across the AI infrastructure stack for production operations.
- Key Topics:
- Infrastructure monitoring: GPU utilization, memory, temperature, power
- Inference metrics: TTFT, TPOT, ITL, throughput, request latency (see the sketch below)
- Scheduler observability: Queue depth, scheduling latency, resource allocation
- LLM application tracing: Request traces, prompt performance, model quality
- Cost optimization: Resource utilization analysis and right-sizing
- Multi-tenant monitoring: Per-tenant metrics and fair-share enforcement
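These latency metrics are straightforward to derive from token arrival timestamps. The sketch below computes TTFT (time to first token), TPOT (time per output token), and ITL (inter-token latency) for one streaming request, using made-up timestamps; real harnesses aggregate these across many requests:

```python
# Sketch: streaming-latency metrics from one request's token timestamps.
def inference_metrics(request_start: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_start                         # time to first token
    itls = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies
    tpot = sum(itls) / len(itls)                                  # time per output token
    return {"ttft_s": ttft, "tpot_s": tpot, "max_itl_s": max(itls)}

print(inference_metrics(0.0, [0.35, 0.40, 0.46, 0.51]))
# {'ttft_s': 0.35, 'tpot_s': 0.0533..., 'max_itl_s': 0.06}
```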
See Observability Guide for comprehensive coverage of GPU monitoring (DCGM, Prometheus), inference metrics (OpenLLMetry, Langfuse, OpenLit), scheduler observability (Kueue, Volcano), distributed tracing (DeepFlow), and LLM evaluation platforms (TruLens, Deepchecks).
- Featured Tools:
  - OpenTelemetry-native: OpenLit, OpenLLMetry
  - LLM platforms: Langfuse, TruLens
  - Model validation: Deepchecks
  - Network tracing: DeepFlow
  - Infrastructure: Okahu
- Projects to Learn:
  - Model Spec: CNCF Sandbox
  - ImageVolume
For planned features, upcoming topics, and discussion on what may or may not be included in this repository, please see the RoadMap.
The roadmap has been updated to focus on the AI Native era (2025-2035), addressing key challenges including:
- AI Native Platform: Model/Agent as first-class citizens
- Resource Scheduling: DRA, heterogeneous computing, topology awareness
- Runtime Evolution: Container + WASM + Nix + Agent Runtime
- Platform Engineering 2.0: IDP + AI SRE + Security + Cost + Compliance
- Security & Supply Chain: Full-chain governance of AI assets
- Open Source & Ecosystem: Upstream collaboration in AI Infra
We welcome contributions to improve this landscape and path! Whether it's a new project, learning material, or diagram update — please open a PR or issue.
- CNCF Landscape
- Awesome LLMOps
- CNCF TAG Workloads Foundation
- CNCF TAG Infrastructure
- CNCF AI Initiative
- Kubernetes WG AI Gateway
- Kubernetes WG AI Conformance
- Kubernetes WG AI Integration
If you have some resources about AI Infra, please share them in #8.
Here are some key conferences in the AI Infra space:
- AI_dev: for instance, AI_dev EU 2025
- PyTorch Conference by PyTorch Foundation
- KubeCon+CloudNativeCon AI+ML Track, for instance, KubeCon NA 2025 and co-located events Cloud Native + Kubernetes AI Day
- AICon in China by QCon.
- GOSIM (Global Open-Source Innovation Meetup): for instance, GOSIM Hangzhou 2025
Apache License 2.0.
This repo is inspired by the rapidly evolving AI Infra stack and aims to help engineers navigate and master it.
