The goal of this project is to explore how we can use Spot Instances to reduce the cost of running Kubernetes clusters.
-
Amazon EC2 Spot Instances offer spare compute capacity available in the AWS cloud at steep discounts compared to On-Demand instances. Spot Instances enable you to optimize your costs on the AWS cloud and scale your application's throughput up to 10X for the same budget. The Spot price is determined by trends in supply and demand and the amount of On-Demand capacity on a particular instance size, family, Availability Zone, and AWS Region.
-
Kubernetes is a portable, extensible open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.
-
Developers typically run Kubernetes clusters on On-Demand EC2 instances, since they are risk-free and easier to manage than Spot Instances.
-
However, using On-Demand instances makes running an application significantly more expensive.
-
Using Amazon’s Elastic Kubernetes Service (EKS), one can manage Spot Instances, but Amazon charges roughly $150 per month for running the control plane, in addition to the cost of the nodes that host the application.
-
This additional cost can be avoided by administering the Kubernetes cluster yourself.
-
We created a controller that runs in an infinite loop and sits on top of the Kubernetes cluster.
-
This controller performs these actions:
- Manages an application that runs on Spot Instances within a Kubernetes cluster in a cost-effective way
- Dynamically optimizes the number of Spot Instances
- Maintains the Service Level Agreement (SLA) for the application
-
-
It is intended for cluster admins and developers who use a Kubernetes cluster and want to optimize their infrastructure cost for various applications or projects.
-
It does not target non-expert users who are unfamiliar with Kubernetes or AWS.
- kops helps to create, destroy, and upgrade clusters from the command line
-
- Prometheus collects metrics from configured targets at given intervals and displays the results
-
The application is the user-facing part of an online shop that sells socks.
-
It is intended to aid the demonstration and testing of microservices and cloud native technologies.
- The application is deployed on a Kubernetes cluster with On-Demand and Spot Instances
- Prometheus pulls metrics from the application
- The controller takes threshold values as inputs
- The controller then compares the threshold values with the metrics from Prometheus and makes decisions
- These decisions invoke actions that are performed on the cluster through kops
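The controller's Prometheus pull can be sketched as below. The endpoint path is Prometheus's standard instant-query HTTP API; the metric name and the Prometheus address are hypothetical placeholders, not the project's actual values:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL builds a Prometheus instant-query URL for the given PromQL
// expression, URL-escaping the expression so it survives as a query parameter.
func buildQueryURL(prometheusAddr, promql string) string {
	return fmt.Sprintf("http://%s/api/v1/query?query=%s",
		prometheusAddr, url.QueryEscape(promql))
}

func main() {
	// Example: request rate over the last minute for the front-end service
	// (metric name is an assumption for illustration).
	q := buildQueryURL("<prometheus-external-ip>:9090",
		`sum(rate(request_duration_seconds_count{name="front-end"}[1m]))`)
	fmt.Println(q)
}
```

The controller would issue an HTTP GET against this URL on each loop iteration and decode the JSON response before comparing it to the thresholds.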
-
A service level agreement, or SLA, is a common term for formal service commitments that are made to customers by service providers.
-
The target SLA for our test application (Sock-shop) is set to have availability of 99% (“two nines”)
-
Availability is defined by these Service Level Indicators (SLIs):
- Throughput: more than 96%
- Error rate: less than 4%
- Latency: less than 1.5 seconds
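The three SLIs above can be checked together with a one-line predicate; the function name and signature are our own sketch, with thresholds taken straight from the SLA:

```go
package main

import "fmt"

// sliOK reports whether all three SLIs are met: throughput above 96%,
// error rate below 4%, and latency below 1.5 seconds.
func sliOK(throughputPct, errorRatePct, latencySeconds float64) bool {
	return throughputPct > 96 && errorRatePct < 4 && latencySeconds < 1.5
}

func main() {
	fmt.Println(sliOK(98.5, 1.2, 0.9)) // healthy: true
	fmt.Println(sliOK(95.0, 1.2, 0.9)) // throughput breach: false
}
```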
-
-
The budget is an SLA quota: the amount of time our application is allowed to perform below the set thresholds.
-
The budget is replenished after a fixed time interval.
-
Whenever the test application performs below a threshold, the corresponding time is deducted from the budget.
-
For the test application, we have defined the budget to be 15 minutes per day.
-
Our test application requires a minimum of 2 nodes to be On-Demand instances.
-
Based on network traffic (load), our algorithm considers three cases:
- High network traffic (more than 600 requests per second)
- Medium network traffic (between 100 and 600 requests per second)
- Low network traffic (less than 100 requests per second)
-
-
If network_traffic > 600:
    If application performance is low for 10 mins:
        If budget available:
            Spin up a Spot Instance and scale up the application
        Else:
            Spin up an On-Demand Instance and scale up the application
    Else:
        Application is running smoothly, do nothing

If 100 < network_traffic < 600:
    If application performance is high for 180 mins:
        If budget available:
            Scale down, delete an On-Demand Instance, spin up a Spot Instance, and scale up
        Else:
            Budget not available, do nothing

If network_traffic < 100:
    If application performance is stable for 180 mins:
        If budget available:
            Scale down
            If On-Demand nodes > 2:
                Delete an On-Demand node
            Else:
                Delete a Spot Instance
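The decision logic above can be condensed into a single function. This is a sketch: the `Decision` names, parameter names, and signature are ours, and the medium-traffic branch summarizes the four-step Spot migration as one decision:

```go
package main

import "fmt"

// Decision names the action the controller takes; the names are illustrative.
type Decision string

const (
	ScaleUpSpot     Decision = "spin up Spot, scale up"
	ScaleUpOnDemand Decision = "spin up On-Demand, scale up"
	MoveToSpot      Decision = "scale down, replace On-Demand with Spot, scale up"
	ScaleDown       Decision = "scale down, delete a node"
	DoNothing       Decision = "do nothing"
)

// decide implements the three traffic cases: rps is the request rate;
// perfLowMins, perfHighMins, and perfStableMins are how long the application
// has been in each performance state; budgetLeft says whether SLA budget remains.
func decide(rps float64, perfLowMins, perfHighMins, perfStableMins int, budgetLeft bool) Decision {
	switch {
	case rps > 600: // high traffic
		if perfLowMins >= 10 {
			if budgetLeft {
				return ScaleUpSpot
			}
			return ScaleUpOnDemand
		}
	case rps >= 100: // medium traffic (100–600 rps)
		if perfHighMins >= 180 && budgetLeft {
			return MoveToSpot
		}
	default: // low traffic (< 100 rps)
		if perfStableMins >= 180 && budgetLeft {
			return ScaleDown
		}
	}
	return DoNothing
}

func main() {
	fmt.Println(decide(700, 12, 0, 0, true))
}
```

Whether `ScaleDown` deletes an On-Demand node or a Spot Instance then depends on whether more than 2 On-Demand nodes remain, as in the pseudocode.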
-
Before we spin up a Spot instance, we have to set a max bidding price.
-
Spot Instances can be taken away at any time for the following reasons:
- Price: the Spot price is greater than your maximum bidding price.
- Capacity: if there are not enough unused EC2 instances to meet the demand for Spot Instances, Amazon EC2 interrupts Spot Instances. The order in which the instances are interrupted is determined by Amazon EC2.
-
-
The controller constantly checks the ‘to be terminated’ flag of the Spot Instances.
-
If to_be_terminated = True:
    Scale down the application
    Drain the Spot Instance
    If budget available:
        Spin up a Spot Instance
    Else:
        Spin up an On-Demand Instance
-
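The ‘to be terminated’ check can be sketched against EC2's instance-metadata service, which serves `/latest/meta-data/spot/instance-action` with HTTP 200 once a Spot interruption is scheduled and 404 otherwise. The helper names here are our own, not the project's actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// spotTerminationURL is EC2's instance-metadata endpoint for Spot
// interruption notices; it only resolves from inside an EC2 instance.
const spotTerminationURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

// toBeTerminated interprets the metadata response: 200 means an
// interruption is scheduled, 404 means it is not.
func toBeTerminated(statusCode int) bool {
	return statusCode == http.StatusOK
}

// checkSpot polls the metadata endpoint once; request errors (e.g. when not
// running on EC2) are treated as "not terminating".
func checkSpot(client *http.Client) bool {
	resp, err := client.Get(spotTerminationURL)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return toBeTerminated(resp.StatusCode)
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	if checkSpot(client) {
		fmt.Println("drain node and replace it")
	} else {
		fmt.Println("no interruption scheduled")
	}
}
```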
-
The controller reads a config file that contains these parameters:
- The SLA of the application provided by the client, such as the required availability
- The external IP of the Prometheus deployment
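A minimal sketch of such a config file; all field names below are illustrative, not the controller's actual schema:

```yaml
# Hypothetical layout of the controller's config file.
sla:
  availability: 99          # percent ("two nines")
  throughput_min: 96        # percent
  error_rate_max: 4         # percent
  latency_max_seconds: 1.5
budget_minutes_per_day: 15
prometheus_external_ip: <prometheus-external-ip>
```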
- Sprint 1:
- Set up the prerequisites (AWS-CLI, kops, Kubectl)
- Created a Kubernetes cluster using kops
- Mock EC2 spot instances that simulated termination from AWS
- Sprint 2:
- Tested multiple demo applications on minikube
- Finalized Sock-shop as the demo application
- Sprint 3:
- Used Jmeter to perform load testing on the application deployed to the cluster
- Observed the impact on application due to scaling pods or nodes
- Obtained Kubernetes metrics using a Go-based controller
- Sprint 4:
- Monitored metrics using prometheus
- Defined a basic SLA for the application
- Created and deleted on demand and spot instances from the Go controller
- Sprint 5:
- Defined the complete SLA for the project with the SLIs
- Created the algorithm to make decisions
- We started the application with 2 On-Demand nodes of type t3.medium
- We scaled up the cluster using only on-demand nodes in one case, and using only spot instances in the second case.
- We observed that the savings increased as we increased the number of nodes.
- Use the following commands to create an S3 bucket for storing kops cluster data
aws s3 mb s3://<bucket-name>
export KOPS_STATE_STORE=s3://<bucket-name>
- Use the following command to create a cluster
kops create cluster <cluster-name> --zones us-east-1a --yes --master-size=<machine-type> --node-size=<machine-type> --node-count=<count>
Note: <cluster-name>.k8s.local creates a cluster without DNS. Set <machine-type> and <count> to the desired options; we set <machine-type> => t3.medium and <count> => 2 for our test application.
- Create a new instance group for Spot Instances
kops create ig <instance-group-name>
- Edit the yaml file that opens in vim as follows
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-01T20:54:19Z
  labels:
    kops.k8s.io/cluster: <cluster-name>
  name: <instance-group-name>
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: <machine-type>
  maxPrice: <max-bidding-price>
  maxSize: 0
  minSize: 0
  nodeLabels:
    kops.k8s.io/instancegroup: <instance-group-name>
  role: Node
  subnets:
  - us-east-1a
Edit both minSize and maxSize to 0, as we don't want the cluster to begin on Spot Instances. This is done to make sure that the cluster is set up without any risk of nodes being deleted while deploying the application and the metrics. Add a new parameter maxPrice: <max-bidding-price> above maxSize, setting <max-bidding-price> to the maximum price you want to bid for a Spot Instance. Set <machine-type> to the desired option; we used t3.medium for our test application.
- Use the kops update command to update the cluster
kops update cluster <cluster-name> --yes
- Use kops validate to check if cluster is ready
kops validate cluster
- Clone the Sock shop repository
git clone https://github.com/microservices-demo/microservices-demo.git
- Deploy the sock-shop application along with Prometheus to your cluster using the following commands
cd microservices-demo/deploy/kubernetes/
kubectl create namespace sock-shop
kubectl apply -f complete-demo.yaml
kubectl create -f ./manifests-monitoring
- Expose the deployments
kubectl expose deployment front-end --type=LoadBalancer -n sock-shop --name=front-end-deployment
kubectl expose deployment prometheus-deployment --type=LoadBalancer -n monitoring --name=prometheus-deployment
- Note down the exposed external-ip of the application
kubectl get service -n sock-shop
You will see an output similar to this example:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
carts ClusterIP 100.70.116.237 <none> 80/TCP 1d
carts-db ClusterIP 100.70.224.159 <none> 27017/TCP 1d
catalogue ClusterIP 100.65.171.188 <none> 80/TCP 1d
catalogue-db ClusterIP 100.65.237.207 <none> 3306/TCP 1d
front-end LoadBalancer 100.64.107.5 a471afe9b6ce911e9b71e02004e2872a-463552420.us-east-1.elb.amazonaws.com 8079:30416/TCP 14h
orders ClusterIP 100.65.108.161 <none> 80/TCP 1d
orders-db ClusterIP 100.67.156.105 <none> 27017/TCP 1d
payment ClusterIP 100.69.170.8 <none> 80/TCP 1d
queue-master ClusterIP 100.66.55.164 <none> 80/TCP 1d
rabbitmq ClusterIP 100.70.49.108 <none> 5672/TCP 1d
shipping ClusterIP 100.68.54.197 <none> 80/TCP 1d
user ClusterIP 100.66.202.190 <none> 80/TCP 1d
user-db ClusterIP 100.64.14.164 <none> 27017/TCP 1d
- Note down the exposed external-ip of prometheus
kubectl get service -n monitoring
You will see an output similar to this example:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-state-metrics ClusterIP 100.64.7.247 <none> 8080/TCP 1d
prometheus NodePort 100.68.102.152 <none> 9090:31090/TCP 1d
prometheus-deployment LoadBalancer 100.64.119.122 af1b1f0726c5011e9b71e02004e2872a-1377931422.us-east-1.elb.amazonaws.com 9090:30544/TCP 1d
View the website using http://<sock-shop-external-ip>:8079
View Prometheus Graphs using http://<prometheus-external-ip>:9090
- Edit the Prometheus endpoint in the Go program using the external IP of your Prometheus deployment
- Store the kops scripts inside the bin folder of your Go installation
- Run the program