
Calculate cpu and mem savings in tortoise #427

Open · wants to merge 6 commits into main

Conversation

randytqwjp (Collaborator):

What this PR does / why we need it:

Calculates CPU and MEM savings based on min/max replica changes.

Which issue(s) this PR fixes:

Fixes #411

Special notes for your reviewer:

randytqwjp requested a review from sanposhiho as a code owner · January 8, 2025 14:23

```diff
 NetCPURequest = prometheus.NewGaugeVec(prometheus.GaugeOpts{
 	Name: "net_cpu_request",
 	Help: "net cpu request (millicore) that tortoises actually applys",
-}, []string{"tortoise_name", "namespace", "container_name", "controller_name", "controller_kind"})
+}, []string{"tortoise_name", "namespace", "container_name", "kube_deployment", "controller_kind"})
```
sanposhiho (Collaborator):

Why do we have to change it? I prefer controller_name, which allows us to support other kinds of resources (ReplicaSet etc.) in the future, and is also consistent with GKE metrics.

randytqwjp (Collaborator, Author):

Hmm, but which controller is it referring to? In the code it actually refers to the deployment name; I just thought controller_name was a little confusing. But given the breaking-change concern you mentioned, we could leave it as is, since there's no specific need to change it.

sanposhiho (Collaborator) · Jan 10, 2025:

Yeah, I remember I initially wanted to use owner_name, but changed it to controller_name on second thought, because controller_name is used in GKE metrics. Generally speaking, it's much easier to reuse the same label names as other metrics wherever possible, for when you build a dashboard in Datadog etc. (you can aggregate on the common labels).

sanposhiho (Collaborator):

So, make a complaint to GKE 😅

```go
	Help: "net hpa maxReplicas that tortoises actually applys to hpa",
}, []string{"tortoise_name", "namespace", "hpa_name", "kube_deployment"})
NetHPAMinReplicasCPUCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "net_hpa_minreplicas_cpu_cores",
```
sanposhiho (Collaborator):

What does "net" mean?

randytqwjp (Collaborator, Author):

The overall change: e.g., going from 5 cores to 3 cores, the change is -2 cores.
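
In code terms, a minimal sketch of that definition (assuming a hypothetical recordNetCPURequest helper around the NetCPURequest gauge shown above; not the actual PR wiring):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// recordNetCPURequest is a hypothetical helper (not from the PR) that
// captures the "net" definition above: the signed difference between the
// newly applied CPU request and the previously applied one, so going from
// 5000m to 3000m records -2000m.
func recordNetCPURequest(oldMilli, newMilli int64, labels prometheus.Labels) {
	NetCPURequest.With(labels).Set(float64(newMilli - oldMilli))
}
```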

sanposhiho (Collaborator) left a comment:

Also, you must be careful about the fact that changing metrics is a breaking change, which we must avoid.
If you think you might have to change a metric in the future, make sure you mention that in the metric description (i.e., Help), like Help: "..... This metric is in the alpha stage and may be changed in a breaking way in future releases."

Especially since you released v1 recently, we need to be careful about breaking changes in general.
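
For instance, a sketch of how that note could look on one of the new metrics (illustrative only; the exact Help wording and labels are up to this PR):

```go
NetHPAMinReplicasCPUCores = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "net_hpa_minreplicas_cpu_cores",
	Help: "net change in CPU cores implied by minReplicas changes. " +
		"This metric is in the alpha stage and may be changed in a breaking way in future releases.",
}, []string{"tortoise_name", "namespace", "hpa_name", "kube_deployment"})
```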

randytqwjp (Collaborator, Author):

> Also, you must be careful about the fact that changing metrics is a breaking change that we must avoid. [...] Especially since you released v1 recently, we need to be careful about breaking changes in general.

I did not know this. I will make the changes.

randytqwjp (Collaborator, Author):

@sanposhiho Regarding the release version, does this mean we have to release v2 in this case, because there are some changes to the metrics?

sanposhiho (Collaborator) commented Jan 15, 2025:

Let's just go ahead with v1.1.0 in this case; releasing v2 just for those metrics is too much for us. I just wanted to prevent the same thing from happening again in the future.
(Although, as you said, if we followed compatibility strictly, releasing v2 would be the most suitable option. So keep it in mind as a learning: there are many projects, especially in the OSS world, that follow semantic versioning very strictly.)

Please mention this change in the release note when you release v1.1.0. That would help users (including us) see which breaking changes are introduced in which version. (Of course, we should try to avoid breaking changes as much as possible in the first place.)
Also, like I said earlier, let's make sure we have a clear note on each metric in the future if the metric could be changed in a breaking way later.

randytqwjp (Collaborator, Author):

@sanposhiho Made the changes, PTAL.

sanposhiho (Collaborator) · Jan 17, 2025:

Can you elaborate on the use case a bit more? I mean, how would those metrics benefit you? For example, net_hpa_minreplicas_cpu_cores shows the difference in CPU allocation under the assumption that HPA always keeps the replicas at minReplicas, which only occasionally happens. For another example, net_hpa_maxreplicas_cpu_cores shows the difference in CPU allocation under the assumption that HPA always keeps the replicas at maxReplicas, which should never happen.
I'm not sure how worthwhile those values would be.

In the first place, though, why are you trying to measure them within tortoise? Why not just directly check the allocated CPU or memory on each service that adopts tortoise, if you just want to see the cost reduction?

randytqwjp (Collaborator, Author):

Yes, I acknowledge the first point on the assumptions, as this was something I brought up as well. However, it is difficult to calculate accurate cost savings from tortoise compared to a service purely using HPA. Directly checking allocated CPU or memory does not tell us whether tortoise is actually helping to reduce costs or not. Therefore we need to show cost savings based on decisions made by tortoise, such as changing max/min replicas. I agree it's not an accurate amount, especially on the maxReplicas side, but minReplicas should only decrease when there is significant underutilization, in which case I do think it can be used for calculating the cost savings from tortoise. Do you think we should remove the net maxReplicas changes?

sanposhiho (Collaborator):

> it is difficult to calculate accurate cost savings from tortoise as compared to a service purely using HPA

I mean, how would net_hpa_minreplicas_cpu_cores help then? Again, it's a super rare situation that HPA always keeps the replicas at minReplicas, and hence if you see -2 cores in net_hpa_minreplicas_cpu_cores, in most cases that doesn't mean tortoise achieved a cost saving of 2 cores. I'm not sure how this "2 cores" helps you understand the cost saving.

sanposhiho (Collaborator) · Jan 17, 2025:

Also, this "net" value is simply old value - new value, right? Then, except for the very first reconciliation, the old value is also a value calculated by tortoise. It's not the value that the service owner used in their HPA before adopting tortoise.

Another thing: tortoise dynamically changes minReplicas based on the time of day.
https://github.com/mercari/tortoise/blob/main/docs/horizontal.md#minreplicas
So the graph of net_hpa_minreplicas_cpu_cores would just show increasing values as we head towards the peak time, and decreasing values as we head towards the off-peak time. Tortoise's principle is not to put as low a value as possible on minReplicas, but to set the value as a safeguard, as that section describes.
How would those values benefit you for the cost-saving calculation?

randytqwjp (Collaborator, Author):

Oh, my bad. I thought the recommendation algorithm also changed the HPA min/max replicas based on utilization, but looking at the code, it seems it does not. Then it doesn't make sense to have cost-saving metrics for min/max replicas, since those are more about reliability. I will revert the changes on min/max replicas.

randytqwjp (Collaborator, Author):

I was somehow under the impression that if utilization was low and minReplicas was at, say, 10, tortoise would then decrease minReplicas, but that's not the case.

sanposhiho (Collaborator):

> if utilization was low and minReplicas was at, say, 10, tortoise would then decrease minReplicas

You could be right, depending on the scenario.
If last week's replica count at the same day and time was 20 and the utilization is low today, then tortoise keeps minReplicas at 10 (1/2 × last week), because most likely something weird is happening right now (e.g., an upstream service like the gateway is down) and it might be dangerous to reduce minReplicas.
But let's say that next week it's again 10 with low utilization. That probably means there's a new trend and this service now gets smaller traffic (e.g., one upstream service stopped calling this service, Mercari got unpopular somehow, etc.).
Tortoise checks the last week's value, which is 10, so it changes minReplicas to 5 (again 1/2 × last week).

So that's how tortoise deals with a scenario where a too-high minReplicas leaves the service underutilized.
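
A heavily simplified sketch of that weekly-halving behaviour (illustration only; the real algorithm, with its safeguards, lives in the recommender and docs/horizontal.md):

```go
// nextMinReplicas is a hypothetical, heavily simplified illustration of the
// behaviour described above: minReplicas for a given time slot is derived
// from the replica count observed at the same time last week, halved.
//
//	week 1: last week ran 20 replicas -> minReplicas = 10
//	        (today's low utilization is treated as an anomaly, not a trend)
//	week 2: last week ran 10 replicas -> minReplicas = 5
//	        (the lower traffic persisted, so the floor follows it down)
func nextMinReplicas(lastWeekSameTimeReplicas int32) int32 {
	m := lastWeekSameTimeReplicas / 2
	if m < 1 {
		m = 1 // illustrative lower bound; not the real implementation's
	}
	return m
}
```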
