Calculate cpu and mem savings in tortoise #427
base: main
Conversation
pkg/metrics/metrics.go (Outdated)
 NetCPURequest = prometheus.NewGaugeVec(prometheus.GaugeOpts{
 	Name: "net_cpu_request",
 	Help: "net cpu request (millicore) that tortoises actually applys",
-}, []string{"tortoise_name", "namespace", "container_name", "controller_name", "controller_kind"})
+}, []string{"tortoise_name", "namespace", "container_name", "kube_deployment", "controller_kind"})
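For context, here is a minimal sketch (not code from this PR) of how a GaugeVec with these labels is typically registered and updated with client_golang; the helper name and the call site are hypothetical:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// netCPURequest mirrors the declaration discussed above; it is registered once
// so the gauge is exposed on the /metrics endpoint.
var netCPURequest = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "net_cpu_request",
	Help: "net cpu request (millicore) that tortoise actually applies",
}, []string{"tortoise_name", "namespace", "container_name", "controller_name", "controller_kind"})

func init() {
	prometheus.MustRegister(netCPURequest)
}

// RecordNetCPURequest is a hypothetical helper: it would be called from the
// reconciliation loop with the per-container net change in millicores.
func RecordNetCPURequest(tortoise, ns, container, controller, kind string, millicores float64) {
	netCPURequest.WithLabelValues(tortoise, ns, container, controller, kind).Set(millicores)
}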
Why do we have to change it? I prefer controller_name, which allows us to support other kinds of resources (ReplicaSet, etc.) in the future and is also consistent with the GKE metrics.
Hmm, but which controller is it referring to? In the code it actually refers to the deployment name, so I just thought controller_name was a little confusing. But given what you mentioned about the breaking change, we could leave it as is, since there's no specific need to change it.
Yeah, I remember I initially wanted to use owner_name, but I changed it to controller_name on second thought because controller_name is used in the GKE metrics. Generally speaking, it's much easier to reuse the same label names as other metrics wherever possible, for when you build a dashboard etc. in Datadog (you can aggregate on the common labels).
So, make a complaint to GKE 😅
Help: "net hpa maxReplicas that tortoises actually applys to hpa", | ||
}, []string{"tortoise_name", "namespace", "hpa_name", "kube_deployment"}) | ||
NetHPAMinReplicasCPUCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{ | ||
Name: "net_hpa_minreplicas_cpu_cores", |
What does "net" mean?
The overall change, i.e. if we go from 5 cores to 3 cores, the change is -2 cores.
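In other words (a tiny illustrative sketch, not code from the PR; it assumes "net" means the new value minus the old value, so a reduction shows up as a negative number):

package main

import "fmt"

// netChange returns the "net" value described above: the newly applied CPU
// allocation minus the previous one, so a reduction is negative.
func netChange(oldCores, newCores float64) float64 {
	return newCores - oldCores
}

func main() {
	fmt.Println(netChange(5, 3)) // going from 5 cores to 3 cores: -2
}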
Also, you must be careful about the fact that changing metrics is a breaking change, which we must avoid.
If you think you might have to change the metrics in the future, make sure you mention that in the metric description (i.e. Help), like Help: "..... This metric is in the alpha stage and may be changed incompatibly in future releases."
Especially since you released v1 recently, we need to be careful about breaking changes in general.
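For example, a hypothetical declaration (not from this PR) carrying that kind of disclaimer in its Help text could look like this; the metric name and labels are illustrative only:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric declaration with a breaking-change disclaimer in Help.
var netHPAMinReplicasCPUCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "net_hpa_minreplicas_cpu_cores",
	Help: "net CPU (cores) implied by the minReplicas that tortoise applies. " +
		"This metric is in the alpha stage and may be changed incompatibly in future releases.",
}, []string{"tortoise_name", "namespace", "hpa_name", "kube_deployment"})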
I did not know this. I will make the changes.
@sanposhiho Regarding the release version, does this mean we have to release v2 in this case, since there are some changes to the metrics?
Let's just go ahead. Please mention this change in the release notes when you release.
@sanposhiho Made the changes, PTAL.
Can you elaborate on the use case a bit more? I mean, how would those metrics benefit you? For example, net_hpa_minreplicas_cpu_cores shows the difference in CPU allocation under the assumption that HPA always keeps the replicas at minReplicas, which only occasionally happens. For another example, net_hpa_maxreplicas_cpu_cores shows the difference in CPU allocation under the assumption that HPA always keeps the replicas at maxReplicas, which should never happen.
I'm not sure how worthwhile those values would be.
In the first place, though, why are you trying to measure them within tortoise? Why not just directly check the allocated CPU or memory on each service that adopts tortoise, if you just want to see the cost reduction?
Yes, I do acknowledge the first point about the assumptions, as this was something I brought up as well. However, it is difficult to calculate accurate cost savings from tortoise compared to a service purely using HPA. Directly checking allocated CPU or memory does not tell us whether tortoise is actually helping to reduce costs. Therefore we need to show cost savings based on decisions made by tortoise, such as changing max/min replicas. I agree it's not an accurate amount, especially on the maxReplicas side, but with regard to minReplicas, it should only decrease when there is significant underutilization, in which case I do think it can be used for calculating cost savings by tortoise. Do you think we should remove the net maxReplicas changes?
> it is difficult to calculate accurate cost savings from tortoise compared to a service purely using HPA

I mean, how would net_hpa_minreplicas_cpu_cores help then? Again, it's a super rare situation that HPA always keeps the replicas at minReplicas, and hence if you see -2 cores in net_hpa_minreplicas_cpu_cores, in most cases that doesn't mean tortoise made a cost saving of 2 cores. I'm not sure how this "2 cores" helps you understand the cost saving.
Also, this "net" value is simply old value - new value
, right? Then, except very first reconciliation, old value is also the value calculated from tortoise. It's not the value that the service owner used within their HPA before adopting Tortoise.
Another thing, tortoise dynamically changes the minReplicas based on the time.
https://github.com/mercari/tortoise/blob/main/docs/horizontal.md#minreplicas
So, the graph of net_hpa_minreplicas_cpu_cores
would just show increasing values during the time towards the peak time, and decreasing values during the time towards the off-peak time. Tortoise's principal is not just putting as low as possible value on minReplicas, but is putting the value to be a safe guard as this section describes.
How would those values benefit you for the cost saving calculation?
Oh, my bad, I thought the recommendation algorithm also changed the HPA min/max replicas based on utilization, but looking at the code it seems it does not. Then it doesn't make sense to have cost-saving metrics for min/max replicas, since those seem to be more about reliability. I will revert the changes on min/max replicas.
I was somehow under the impression that if utilization was low and minReplicas was at, say, 10, tortoise would decrease minReplicas, but that's not the case.
> if utilization was low and minReplicas was at, say, 10, tortoise would decrease minReplicas, but that's not the case.

You could be right, depending on the scenario.
If last week's replica count at the same time on the same day was 20 and the utilization is low today, then tortoise keeps minReplicas at 10 (1/2 * last week), because most likely something weird is happening right now (e.g., an upstream service like the gateway is down) and it might be dangerous to reduce minReplicas.
But let's say that next week it's again 10 with low utilization. That probably means this is a new trend and the service now gets less traffic (e.g., one upstream service stopped calling this service, Mercari got unpopular somehow, etc.).
Tortoise checks last week's value, which is 10, so it changes minReplicas to 5 (again 1/2 * last week).
So that's how tortoise deals with the scenario where a too-high minReplicas leaves the service underutilized. A rough sketch of that idea follows.
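This is an illustrative sketch, not the actual tortoise code (see docs/horizontal.md for the real behaviour); it only captures the "half of last week's replica count at the same time" idea described above:

package main

import (
	"fmt"
	"math"
)

// suggestMinReplicas halves the replica count observed at the same time on the
// same weekday last week. The real implementation has additional safeguards.
func suggestMinReplicas(lastWeekReplicas int32) int32 {
	return int32(math.Ceil(float64(lastWeekReplicas) / 2))
}

func main() {
	// Week 1: last week ran 20 replicas at this time, so minReplicas stays at 10
	// even if utilization is low today.
	fmt.Println(suggestMinReplicas(20)) // 10

	// Week 2: last week (now the new trend) ran only 10 replicas, so minReplicas drops to 5.
	fmt.Println(suggestMinReplicas(10)) // 5
}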
What this PR does / why we need it:
Calculates CPU and MEM savings based on min/max replica changes.
Which issue(s) this PR fixes:
Fixes #411
Special notes for your reviewer: