Deploy Qwen2.5 0.5B Large Language Model on Amazon EKS using vLLM, Terraform, and Helm with cost-effective t3.micro spot instances.
This project provides a minimal, cost-effective setup to run Qwen2.5 0.5B on AWS EKS:
- Model: Qwen2.5-0.5B-Instruct (CPU-only, no GPU required)
- Infrastructure: EKS cluster with Karpenter autoscaling
- Compute: t3.micro spot instances (~$0.005/hour)
- Automation: Infrastructure as Code with Terraform and Helm
Karpenter is deployed as a Kubernetes controller that observes unschedulable pods and provisions optimal EC2 instances. Unlike Cluster Autoscaler or managed node groups, Karpenter evaluates instance types dynamically and can provision nodes in <30 seconds without ASG warm-up delays.
Key Advantages:
- Right-sizing: Evaluates all compatible instance types, selects cheapest fit
- Spot-first: Prefers Spot capacity (up to 90% savings) with On-Demand fallback
- Zero-min scaling: Can scale to zero worker nodes, unlike managed node groups
- Fast provisioning: Direct EC2 API calls, no ASG lifecycle hooks
- Consolidation: Actively binpacks and terminates underutilized nodes
Location: eks/modules/eks/addons.tf
```hcl
enable_karpenter = true

karpenter = {
  chart_version       = "1.1.2"
  repository_username = data.aws_ecrpublic_authorization_token.token.user_name
  repository_password = data.aws_ecrpublic_authorization_token.token.password
}
```

Deployment Details:
- Chart: Deployed via Helm using the `aws-ia/eks-blueprints-addons` module
- Namespace: `karpenter` (created by the Helm chart)
- Scheduling: Controller runs on Fargate (not on Karpenter-managed nodes)
- Image: Public ECR (`public.ecr.aws/karpenter/karpenter`); requires a `us-east-1` provider for the auth token
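ECR Public only issues auth tokens from `us-east-1`, regardless of where the cluster runs, which is why the module wires in a dedicated provider. As a rough sketch of what that token is equivalent to (a hypothetical manual login, not something the Terraform module requires you to run):

```bash
# Hypothetical manual equivalent of the data source: ECR Public only issues
# auth tokens in us-east-1, regardless of the cluster's region.
aws ecr-public get-login-password --region us-east-1 | \
  helm registry login public.ecr.aws --username AWS --password-stdin
```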
Controller IAM Role (created by aws-ia/eks-blueprints-addons):
- Trust: IRSA (IAM Roles for Service Accounts) via OIDC provider
- Permissions: EC2 instance creation, termination, Describe APIs, Launch Templates
- Access Entry: `aws_eks_access_entry.karpenter_node_access_entry` grants cluster API access
Node IAM Role (karpenter_node configuration):
- Base Policy: Standard EKS node policy (EKS CNI, container registry pull, CloudWatch logs)
- Additional Policies: `AmazonSSMManagedInstanceCore` (SSM Session Manager access)
- Role Name: Uses the `cluster_name` prefix (no random suffix per config)
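To sanity-check the IRSA wiring after deployment, you can read the role ARN annotation off the controller's service account (assuming the chart's default service account name, `karpenter`):

```bash
# Show the IAM role ARN bound to the Karpenter controller via IRSA
kubectl get serviceaccount karpenter -n karpenter \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```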
Purpose: Defines EC2 instance configuration (AMI, subnet, security groups, IAM role, tags)
Location: eks/modules/eks/karpenter/default-nodeclass.yaml
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${cluster_name}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${cluster_name}
```

Key Fields:
- `amiSelectorTerms`: Dynamic AMI selection (latest AL2 alias)
- `subnetSelectorTerms`: Tag-based subnet discovery (private subnets only)
- `securityGroupSelectorTerms`: Tag-based security group discovery
- `role`: References the `karpenter_node` IAM role name
Purpose: Defines scheduling constraints, instance requirements, limits, and disruption policies
Location: eks/modules/eks/karpenter/freetier-nodepool.yaml
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: freetier-cpu
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-type
          operator: In
          values: ["t3.micro"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: 1 # Maximum total CPU across all nodes in this pool
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```

Configuration Details:
| Field | Value | Rationale |
|---|---|---|
| `instance-type` | `t3.micro` | Free-tier eligible, 2 vCPU, 1 GB RAM |
| `capacity-type` | `spot` | Up to 90% cost savings vs On-Demand |
| `arch` | `amd64` | x86_64 architecture |
| `limits.cpu` | `1` | Prevents the cluster from scaling beyond budget |
| `consolidationPolicy` | `WhenEmpty` | Only consolidates when a node is empty (safer for development) |
| `consolidateAfter` | `30s` | Quick scale-down to minimize idle costs |
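Actual Spot savings fluctuate by region and time. If you want to check the current t3.micro Spot price before relying on the estimate above, a quick AWS CLI query (the region shown is an example) looks like:

```bash
# Recent t3.micro Spot prices (adjust --region to match your deployment)
aws ec2 describe-spot-price-history \
  --instance-types t3.micro \
  --product-descriptions "Linux/UNIX" \
  --max-items 5 \
  --region us-west-1
```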
```mermaid
sequenceDiagram
    participant Pod as Pod (Pending)
    participant Scheduler as kube-scheduler
    participant Controller as Karpenter Controller
    participant NodePool as NodePool CR
    participant EC2Class as EC2NodeClass CR
    participant EC2API as AWS EC2 API
    participant Node as EC2 Instance

    Pod->>Scheduler: Create/Pending
    Scheduler->>Scheduler: Evaluate nodeSelector/taints/affinity
    Scheduler->>Pod: Mark Unschedulable (NoNodes)
    Controller->>Pod: Watch Event (PodUnschedulable)
    Controller->>NodePool: List matching NodePools
    NodePool->>Controller: Return requirements/limits
    Controller->>EC2Class: Get subnet/SG/AMI config
    EC2Class->>Controller: Return selector terms
    Controller->>EC2API: Evaluate instance types (price/compatibility)
    Controller->>EC2API: CreateFleet/LaunchInstance (Spot preferred)
    EC2API->>Node: Instance Launching
    Node->>Scheduler: Node Ready (kubelet registered)
    Scheduler->>Pod: Bind to Node
    Pod->>Pod: Running
```
Technical Steps:
1. Pod Creation: Pod resource created with `nodeSelector: {instanceType: freetier}` or matching taints
2. Scheduler Evaluation: kube-scheduler evaluates existing nodes, finds none match -> `Unschedulable`
3. Controller Watch: Karpenter watches Pod events via informer, detects `Unschedulable` with reason `NoNodes`
4. NodePool Matching: Controller evaluates NodePool `requirements` against pod `requests`/`nodeSelector`
5. Instance Selection:
   - Queries EC2 pricing/availability APIs
   - Filters by `requirements` (instance-type, capacity-type, arch, zone)
   - Selects the cheapest compatible instance (Spot if available)
6. EC2 Provisioning:
   - Uses `EC2NodeClass` to resolve subnet/SG/AMI
   - Calls `RunInstances` or `CreateFleet` with user-data for the bootstrap script
   - Tags the instance with `karpenter.sh/nodepool`, `karpenter.sh/discovery`
7. Node Registration: Bootstrap script installs kubelet, joins the cluster via the EKS API
8. Pod Binding: Scheduler assigns the pod to the new node, kubelet starts the container
Provisioning Time: Typically 20-30 seconds from pending pod to running (faster than ASG warm-up ~2-5 minutes)
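To see this flow end to end, you can create a throwaway pod that targets the pool via the `karpenter.sh/nodepool` node label and watch Karpenter bring up a node for it. The pod below is purely illustrative and not part of this project's manifests:

```bash
# Throwaway pod pinned to the freetier-cpu pool via the node label Karpenter
# sets on nodes it provisions; it stays Pending until a matching node exists.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: karpenter-test
spec:
  nodeSelector:
    karpenter.sh/nodepool: freetier-cpu
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: 200m
          memory: 64Mi
EOF

# Watch the NodeClaim appear (Ctrl+C to stop), then clean up
kubectl get nodeclaims -w
kubectl delete pod karpenter-test
```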
Consolidation Policy: WhenEmpty
- Only consolidates nodes with zero pods (safer for workloads with no PDB)
- After the `consolidateAfter` duration, evicts pods, drains the node, and terminates the instance
Alternative Policies:
- `WhenEmptyOrUnderutilized` (named `WhenUnderutilized` before Karpenter v1): Consolidates even with pods present (requires a PodDisruptionBudget)
- `Never`: Disables consolidation (not recommended for cost savings)
Interruption Handling:
- Spot interruptions trigger `NodeClaim` deletion → pods rescheduled → new node provisioned
- PodDisruptionBudget protects against voluntary evictions (consolidation), not Spot interruptions
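If you later scale the model to multiple replicas, a minimal PodDisruptionBudget keeps consolidation from evicting everything at once. The sketch below assumes the workload labels and namespace used later in this README (`app=qwen2.5-0.5b` in namespace `qwen2.5-0.5b`):

```bash
# Minimal PDB sketch: keeps at least one pod available during voluntary
# disruptions such as Karpenter consolidation (does not stop Spot reclaims).
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: qwen-pdb
  namespace: qwen2.5-0.5b
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: qwen2.5-0.5b
EOF
```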
Subnet Discovery:
- Subnets must be tagged: `karpenter.sh/discovery: <cluster_name>`
- Terraform tags private subnets automatically
- Karpenter provisions nodes in discovered subnets across AZs for HA
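You can confirm which subnets Karpenter will discover by querying the tag directly (replace `my-qwen-cluster` with your cluster name):

```bash
# List the subnets carrying the Karpenter discovery tag for this cluster
aws ec2 describe-subnets \
  --filters "Name=tag:karpenter.sh/discovery,Values=my-qwen-cluster" \
  --query "Subnets[].{Id:SubnetId,Az:AvailabilityZone,Cidr:CidrBlock}" \
  --output table
```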
Security Groups:
- Tagged: `karpenter.sh/discovery: <cluster_name>`
- Uses the cluster security group + node security group (if created)
- Allows traffic: control plane ↔ nodes, nodes ↔ nodes (pod networking)
Monitoring:
```bash
# Controller logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f

# Node claims (provisioned nodes)
kubectl get nodeclaims

# NodePool status
kubectl get nodepools freetier-cpu -o yaml

# Provisioning metrics (if Prometheus enabled)
karpenter_provisioner_nodepool_limit{...} # CPU limit
karpenter_provisioner_nodepool_usage{...} # Current usage
```

Troubleshooting:
- No nodes provisioned: Check NodePool `limits.cpu`, Spot capacity in the region, and IAM permissions
- Wrong instance type: Verify NodePool `requirements` match pod `requests`
- Nodes terminate immediately: Check `consolidationPolicy`, pod count, and PDBs
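When something goes wrong, the NodeClaim events usually spell out the reason (Spot capacity, IAM, or limits). Two commands that help narrow it down:

```bash
# Inspect NodeClaims for launch or registration failures
kubectl describe nodeclaims

# Recent events across namespaces, newest last, filtered for Karpenter
kubectl get events -A --sort-by=.lastTimestamp | grep -i karpenter
```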
This deployment uses a standard AWS VPC architecture with public and private subnets across multiple Availability Zones:
```mermaid
flowchart TB
    Internet((Internet)):::ext
    EKSAPI[EKS API Endpoint<br/>Public]:::cp

    subgraph VPC[VPC - IPv4 Only]
        IGW[Internet Gateway]:::net
        subgraph AZA[Availability Zone A]
            PubA[Public Subnet A<br/>NAT Gateway]:::net
            PrivA[Private Subnet A<br/>Karpenter Nodes<br/>Fargate Pods]:::node
        end
        subgraph AZB[Availability Zone B]
            PubB[Public Subnet B<br/>Optional: ALB/NLB]:::lb
            PrivB[Private Subnet B<br/>Karpenter Nodes<br/>Fargate Pods]:::node
        end
    end

    Internet --> EKSAPI
    EKSAPI -->|TLS + IAM Auth| VPC
    PubA --> IGW
    PubB --> IGW
    PrivA -->|0.0.0.0/0| PubA
    PrivB -->|0.0.0.0/0| PubA

    classDef cp fill:#e8f0fe,stroke:#3b82f6,color:#1e3a8a
    classDef net fill:#eef2ff,stroke:#6366f1,color:#3730a3
    classDef node fill:#ecfdf5,stroke:#10b981,color:#065f46
    classDef lb fill:#fde68a,stroke:#d97706,color:#78350f
    classDef ext fill:#f1f5f9,stroke:#94a3b8,color:#334155
```
Network Configuration:
- VPC: Single VPC spanning 2 Availability Zones for high availability
- Public Subnets: Host NAT Gateway and optional Load Balancers
- Private Subnets: Host Karpenter-provisioned EC2 nodes and Fargate pods
- Egress: All outbound traffic from private subnets routes through NAT Gateway
- Ingress: EKS API endpoint is public (configured via Terraform)
- IPv4 Only: This deployment uses IPv4 addressing only. Dual-stack (IPv4/IPv6) networking is currently out of scope.
Key Networking Notes:
- Karpenter nodes are provisioned in private subnets tagged for Karpenter discovery
- System pods (CoreDNS, Karpenter controller) run on Fargate
- Workload pods run on Karpenter-provisioned EC2 nodes
- Single NAT Gateway reduces costs but creates a single point of egress
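To confirm that node egress really leaves through the single NAT Gateway, you can check the public IP seen from inside a throwaway pod; every pod on the private subnets should report the same Elastic IP (this check is illustrative, not part of the deployment):

```bash
# All egress from the private subnets should surface as the NAT Gateway's EIP
kubectl run egress-check --rm -it --restart=Never \
  --image=curlimages/curl --command -- curl -s https://checkip.amazonaws.com
```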
- AWS Account with appropriate permissions
- AWS CLI installed and configured
- Terraform >= 1.13 installed
- kubectl installed
- Helm 3.x installed
.
├── eks/
│ ├── deploy/clusters/dev/ # EKS cluster deployment
│ └── modules/eks/ # EKS Terraform module
├── qwen-vllm/
│ ├── deploy/ # Qwen model deployment
│ └── modules/qwen/ # Qwen Terraform module
└── genai-app/ # Chatbot application
1. Navigate to the EKS deployment directory:

   ```bash
   cd eks/deploy/clusters/dev
   ```

2. Configure your cluster:

   Edit `config.yaml` and update:
   - `aws.account_id`: Your AWS account ID
   - `aws.region`: Your AWS region (e.g., `us-west-1`)
   - `eks.name`: Your desired cluster name (default: `yuklia-sbx-inference`)
   - `default_tags`: Your account details

   Example `config.yaml`:

   ```yaml
   aws:
     account_id: "123456789012"
     region: us-west-1
   eks:
     name: "my-qwen-cluster"
     team: "mlops"
     gpu_mng:
       enable: false # Disabled for CPU-only Qwen model
   default_tags:
     aws_account_id: "123456789012"
     aws_account_name: "my-account"
     project: "Qwen2.5 Educational"
   ```

3. Initialize Terraform:

   ```bash
   terraform init
   ```

4. Review the deployment plan:

   ```bash
   terraform plan -out=eks-plan.json
   ```

   This will show you what resources will be created:
   - VPC with public/private subnets
   - EKS cluster
   - Karpenter for node autoscaling
   - Internet Gateway and NAT Gateway
   - Security groups and IAM roles

5. Deploy the EKS cluster:

   ```bash
   terraform apply eks-plan.json
   ```

   ⏱️ This takes approximately 15-20 minutes to complete.

6. Configure kubectl:

   After the cluster is created, configure kubectl to access it:

   ```bash
   aws eks update-kubeconfig \
     --region us-west-1 \
     --name my-qwen-cluster
   ```

   Replace `us-west-1` and `my-qwen-cluster` with your values.

7. Verify the cluster:

   ```bash
   # Check cluster nodes
   kubectl get nodes

   # Check Karpenter
   kubectl get pods -n karpenter

   # Verify Karpenter nodepools
   kubectl get nodepools
   kubectl get ec2nodeclasses
   ```

   You should see:
   - Karpenter pods running
   - `freetier-cpu` nodepool configured for t3.micro spot instances
   - `default` EC2NodeClass available
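At this point the `freetier-cpu` pool usually has zero nodes, which is expected. If you want to watch Karpenter react when the model is deployed in Step 2, keep one of these watches running in a second terminal:

```bash
# Watch Karpenter create a NodeClaim for the model pod (Ctrl+C to stop)...
kubectl get nodeclaims -w

# ...or watch the node itself register with the cluster
kubectl get nodes -l karpenter.sh/nodepool=freetier-cpu -w
```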
1. Navigate to the model deployment directory:

   ```bash
   cd ../../../../qwen-vllm/deploy
   ```

2. Configure the model deployment:

   Edit `config_qwen2.5.yaml` and update:
   - `eks.name`: Must match the EKS cluster name from Step 1
   - `aws.region`: Must match the region from Step 1
   - `default_tags.aws_account_id`: Your AWS account ID

   Example:

   ```yaml
   aws:
     region: "us-west-1"
   eks:
     name: "my-qwen-cluster" # Must match EKS cluster name
   model_size: "0.5b"
   model_name: "Qwen/Qwen2.5-0.5B-Instruct"
   resources:
     limits_cpu: 1
     requests_cpu: 1
     limits_memory: "900Mi"
     requests_memory: "800Mi"
   volumes:
     size: "512Mi"
   default_tags:
     aws_account_id: "123456789012"
     aws_account_name: "my-account"
     project: "Qwen2.5 Educational"
   ```

3. Initialize Terraform:

   ```bash
   terraform init
   ```

4. Deploy the model:

   ```bash
   terraform plan -out=qwen-plan.json
   terraform apply qwen-plan.json
   ```

5. Monitor the deployment:

   ```bash
   # Watch pod status
   kubectl get pods -n qwen2.5-0.5b -w

   # Check logs (wait for pod to be running)
   kubectl logs -n qwen2.5-0.5b -l app=qwen2.5-0.5b -f
   ```

   The model will:
   - Download from HuggingFace (~200MB)
   - Load into memory
   - Start the vLLM server

   ⏱️ This takes 5-10 minutes depending on download speed.

6. Verify the model is ready:

   Look for this in the logs:

   ```
   INFO: Started server process
   INFO: Uvicorn running on http://0.0.0.0:8000
   ```

7. Port forward to access the API:

   ```bash
   kubectl port-forward svc/qwen2.5-0.5b-service 8000:8000 -n qwen2.5-0.5b
   ```

8. Test with curl:

   ```bash
   curl -X POST http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "Qwen/Qwen2.5-0.5B-Instruct",
       "messages": [{"role": "user", "content": "Is life possible on Enceladus moon?"}]
     }'
   ```

9. Test with the web app (optional):

   ```bash
   cd ../../genai-app
   python3 -m venv .venv
   source .venv/bin/activate
   pip install -r requirements.txt
   python app.py
   ```

   Then open http://localhost:7860 in your browser.
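While the port-forward is active, the other OpenAI-compatible endpoints exposed by vLLM are handy for a quick health check and to confirm the served model name used in the request above:

```bash
# Basic liveness check of the vLLM server
curl -s http://localhost:8000/health

# List the models served by the OpenAI-compatible API
curl -s http://localhost:8000/v1/models
```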
- EKS Control Plane: ~$0.10/hour (charged 24/7 when cluster exists)
- t3.micro Spot Instance: ~$0.004-0.005/hour (only when model is running)
- NAT Gateway: ~$0.045/hour + data transfer (only during deployment/inference)
- Data Transfer: Minimal for small model
Total: Approximately $0.15-0.20/hour when running (≈ $0.10 control plane + $0.005 Spot instance + $0.045 NAT Gateway, plus data transfer), and ~$0.10/hour when idle (just the control plane).
To remove all resources and avoid charges:
1. Delete the model deployment:

   ```bash
   cd qwen-vllm/deploy
   terraform destroy
   ```

2. Delete the EKS cluster:

   ```bash
   cd ../../eks/deploy/clusters/dev
   terraform destroy
   ```
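Once both destroys complete, it is worth double-checking that nothing billable is left behind (the region shown is an example):

```bash
# Confirm the cluster is gone
aws eks list-clusters --region us-west-1

# Confirm no stray EC2 instances are still running
aws ec2 describe-instances --region us-west-1 \
  --filters "Name=instance-state-name,Values=pending,running" \
  --query "Reservations[].Instances[].{Id:InstanceId,Type:InstanceType}" \
  --output table
```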