
Calculate cpu and mem savings in tortoise #427

Open · wants to merge 6 commits into main

Conversation

randytqwjp (Collaborator):

What this PR does / why we need it:

Calculates CPU and MEM savings based on min/max replica changes.

Which issue(s) this PR fixes:

Fixes #411

Special notes for your reviewer:

randytqwjp requested a review from sanposhiho as a code owner · January 8, 2025 14:23

```diff
 NetCPURequest = prometheus.NewGaugeVec(prometheus.GaugeOpts{
 	Name: "net_cpu_request",
 	Help: "net cpu request (millicore) that tortoises actually applys",
-}, []string{"tortoise_name", "namespace", "container_name", "controller_name", "controller_kind"})
+}, []string{"tortoise_name", "namespace", "container_name", "kube_deployment", "controller_kind"})
```
sanposhiho (Collaborator):

Why do we have to change it? I prefer controller_name, which allows us to support other kinds of resources (ReplicaSet etc.) in the future, and is also consistent with GKE metrics.

randytqwjp (Collaborator, Author):

Hmm, but which controller is it referring to? In the code it actually refers to the deployment name; I just thought controller_name was a little confusing. But given the breaking-change concern you mentioned, we could leave it as is, since there's no specific need to change it.

sanposhiho (Collaborator) · Jan 10, 2025:

Yeah, I remember I initially wanted to use owner_name, but changed it to controller_name on second thought, because controller_name is used in GKE metrics. Generally speaking, it's much easier to reuse the same label names as other metrics wherever possible, for when you build a dashboard in Datadog etc. (you can aggregate on the common labels).

sanposhiho (Collaborator):

So, make a complaint to GKE 😅

```go
	Help: "net hpa maxReplicas that tortoises actually applys to hpa",
}, []string{"tortoise_name", "namespace", "hpa_name", "kube_deployment"})
NetHPAMinReplicasCPUCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "net_hpa_minreplicas_cpu_cores",
```
sanposhiho (Collaborator):

What does "net" mean?

randytqwjp (Collaborator, Author):

The overall change: e.g., going from 5 cores to 3 cores, the change is -2 cores.
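
In code terms, a minimal sketch of that definition (assuming a hypothetical recordNetCPURequest helper around the NetCPURequest gauge shown above; not the actual PR wiring):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// recordNetCPURequest is a hypothetical helper (not from the PR) that
// captures the "net" definition above: the signed difference between the
// newly applied CPU request and the previously applied one, so going from
// 5000m to 3000m records -2000m.
func recordNetCPURequest(oldMilli, newMilli int64, labels prometheus.Labels) {
	NetCPURequest.With(labels).Set(float64(newMilli - oldMilli))
}
```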

sanposhiho (Collaborator) left a comment:

Also, you must be careful about the fact that changing metrics is a breaking change, which we must avoid.
If you think you might have to change a metric in the future, make sure you mention that in the metric description (i.e., Help), like Help: "..... This metric is in the alpha stage and may be changed in a breaking way in future releases."

Especially since you released v1 recently, we need to be careful about breaking changes in general.
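
For instance, a sketch of how that note could look on one of the new metrics (illustrative only; the exact Help wording and labels are up to this PR):

```go
NetHPAMinReplicasCPUCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "net_hpa_minreplicas_cpu_cores",
	Help: "net change in CPU cores implied by minReplicas changes. " +
		"This metric is in the alpha stage and may be changed in a breaking way in future releases.",
}, []string{"tortoise_name", "namespace", "hpa_name", "kube_deployment"})
```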

randytqwjp (Collaborator, Author):

> Also, you must be careful about the fact that changing metrics is a breaking change that we must avoid. [...] Especially since you released v1 recently, we need to be careful about breaking changes in general.

I did not know this. I will make the changes.

randytqwjp (Collaborator, Author):

@sanposhiho Regarding the release version, does this mean we have to release v2 in this case, because there are some changes to the metrics?

sanposhiho (Collaborator) commented Jan 15, 2025:

Let's just go ahead with v1.1.0 in this case; releasing v2 just for those metrics is too much for us. I just wanted to prevent the same thing from happening again in the future.
(Although, as you said, if we followed compatibility strictly, releasing v2 would be the most suitable option. So keep it in mind as a learning: there are many projects, especially in the OSS world, that follow semantic versioning very strictly.)

Please mention this change in the release note when you release v1.1.0. That would help users (including us) see which breaking changes are introduced in which version. (Of course, we should try to avoid breaking changes as much as possible in the first place.)
Also, like I said earlier, let's make sure we have a clear note on each metric in the future if the metric could be changed in a breaking way later.

randytqwjp (Collaborator, Author):

@sanposhiho Made the changes, PTAL.

sanposhiho (Collaborator) · Jan 17, 2025:

Can you elaborate on the use case a bit more? I mean, how would those metrics benefit you? For example, net_hpa_minreplicas_cpu_cores shows the difference in CPU allocation under the assumption that HPA always keeps the replicas at minReplicas, which only occasionally happens. For another example, net_hpa_maxreplicas_cpu_cores shows the difference in CPU allocation under the assumption that HPA always keeps the replicas at maxReplicas, which should never happen.
I'm not sure how worthwhile those values would be.

In the first place, though, why are you trying to measure them within tortoise? Why not just directly check the allocated CPU or memory on each service that adopts tortoise, if you just want to see the cost reduction?

randytqwjp (Collaborator, Author):

Yes, I acknowledge the first point on the assumptions, as this was something I brought up as well. However, it is difficult to calculate accurate cost savings from tortoise compared to a service purely using HPA. Directly checking allocated CPU or memory does not tell us whether tortoise is actually helping to reduce costs or not. Therefore we need to show cost savings based on decisions made by tortoise, such as changing max/min replicas. I agree it's not an accurate amount, especially on the maxReplicas side, but minReplicas should only decrease when there is significant underutilization, in which case I do think it can be used for calculating the cost savings from tortoise. Do you think we should remove the net maxReplicas changes?

sanposhiho (Collaborator):

> it is difficult to calculate accurate cost savings from tortoise as compared to a service purely using HPA

I mean, how would net_hpa_minreplicas_cpu_cores help then? Again, it's a super rare situation that HPA always keeps the replicas at minReplicas, and hence if you see -2 cores in net_hpa_minreplicas_cpu_cores, in most cases that doesn't mean tortoise achieved a cost saving of 2 cores. I'm not sure how this "2 cores" helps you understand the cost saving.

sanposhiho (Collaborator) · Jan 17, 2025:

Also, this "net" value is simply old value - new value, right? Then, except for the very first reconciliation, the old value is also a value calculated by tortoise. It's not the value that the service owner used in their HPA before adopting tortoise.

Another thing: tortoise dynamically changes minReplicas based on the time of day.
https://github.com/mercari/tortoise/blob/main/docs/horizontal.md#minreplicas
So the graph of net_hpa_minreplicas_cpu_cores would just show increasing values as we head towards the peak time, and decreasing values as we head towards the off-peak time. Tortoise's principle is not to put as low a value as possible on minReplicas, but to set the value as a safeguard, as that section describes.
How would those values benefit you for the cost-saving calculation?

randytqwjp (Collaborator, Author):

Oh, my bad. I thought the recommendation algorithm also changed the HPA min/max replicas based on utilization, but looking at the code, it seems it does not. Then it doesn't make sense to have cost-saving metrics for min/max replicas, since those are more about reliability. I will revert the changes on min/max replicas.

randytqwjp (Collaborator, Author):

I was somehow under the impression that if utilization was low and minReplicas was at, say, 10, tortoise would then decrease minReplicas, but that's not the case.

sanposhiho (Collaborator):

> if utilization was low and minReplicas was at, say, 10, tortoise would then decrease minReplicas

You could be right, depending on the scenario.
If last week's replica count at the same day and time was 20 and the utilization is low today, then tortoise keeps minReplicas at 10 (1/2 × last week), because most likely something weird is happening right now (e.g., an upstream service like the gateway is down) and it might be dangerous to reduce minReplicas.
But let's say that next week it's again 10 with low utilization. That probably means there's a new trend and this service now gets smaller traffic (e.g., one upstream service stopped calling this service, Mercari got unpopular somehow, etc.).
Tortoise checks the last week's value, which is 10, so it changes minReplicas to 5 (again 1/2 × last week).

So that's how tortoise deals with a scenario where a too-high minReplicas leaves the service underutilized.
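
A heavily simplified sketch of that weekly-halving behaviour (illustration only; the real algorithm, with its safeguards, lives in the recommender and docs/horizontal.md):

```go
// nextMinReplicas is a hypothetical, heavily simplified illustration of the
// behaviour described above: minReplicas for a given time slot is derived
// from the replica count observed at the same time last week, halved.
//
//	week 1: last week ran 20 replicas -> minReplicas = 10
//	        (today's low utilization is treated as an anomaly, not a trend)
//	week 2: last week ran 10 replicas -> minReplicas = 5
//	        (the lower traffic persisted, so the floor follows it down)
func nextMinReplicas(lastWeekSameTimeReplicas int32) int32 {
	m := lastWeekSameTimeReplicas / 2
	if m < 1 {
		m = 1 // illustrative lower bound; not the real implementation's
	}
	return m
}
```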
