fix(metrics): implement resources managed and sync metrics #11605

Open
wants to merge 1 commit into
base: main
from

Conversation

dhaifley
Contributor

@dhaifley dhaifley commented Jul 7, 2025

Description

Motivation:

  • The current control-plane metrics focus on KRT collection internals and do not always represent user-facing resources in a way that is easy to understand.
  • There should be easy-to-use metrics showing how many resources are currently being synced, and sync duration metrics should cover the full latency of all sync operations for a resource, not only the processing time of the KRT transform function.
  • XDS snapshot metrics were exposed only as a label on the larger KRT collection metrics; they are useful on their own and should be easier to access and interpret.

What changed:

  • kgateway_collection_* metrics have been removed and replaced by the new kgateway_resources_* subsystem metrics. The instrumentation of KRT collection metrics is still used, but the metrics generated are more user-friendly in terms of naming and labels.
  • These metrics let users track syncs started and syncs completed for each gateway resource type, and view histogram metrics for sync duration. They cover the entire process for a resource, from the start of any translations involved until the status for the resource has been updated.
  • XDS snapshot metrics have been renamed to kgateway_xds_snapshot_* to clarify that these metrics relate to individual resource counts and transform times for each unique client XDS snapshot.
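
As a rough illustration of the started / completed / duration lifecycle described above, here is a minimal stdlib-only sketch. It is not the PR's implementation: `ResourceLabels`, `SyncRecorder`, `StartSync`, and `EndSync` are hypothetical names, and plain maps stand in for the real counter and histogram metrics.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// ResourceLabels is a hypothetical label set used only in this sketch.
type ResourceLabels struct {
	Gateway, Namespace, Resource string
}

// SyncRecorder is an in-memory stand-in for the sync metrics.
type SyncRecorder struct {
	mu        sync.Mutex
	now       func() time.Time // injectable clock
	pending   map[ResourceLabels]time.Time
	started   map[ResourceLabels]int
	completed map[ResourceLabels]int
	durations map[ResourceLabels][]time.Duration
}

func NewSyncRecorder() *SyncRecorder {
	return &SyncRecorder{
		now:       time.Now,
		pending:   map[ResourceLabels]time.Time{},
		started:   map[ResourceLabels]int{},
		completed: map[ResourceLabels]int{},
		durations: map[ResourceLabels][]time.Duration{},
	}
}

// StartSync counts a started sync. Only the earliest start time is kept, so
// the measured duration spans the whole sync rather than its last step.
func (r *SyncRecorder) StartSync(l ResourceLabels) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.started[l]++
	if _, ok := r.pending[l]; !ok {
		r.pending[l] = r.now()
	}
}

// EndSync is meant to be called once the resource status has been updated.
// It reports false when no sync was pending for the given labels.
func (r *SyncRecorder) EndSync(l ResourceLabels) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	start, ok := r.pending[l]
	if !ok {
		return false
	}
	delete(r.pending, l)
	r.completed[l]++
	r.durations[l] = append(r.durations[l], r.now().Sub(start))
	return true
}

// Counts returns the started and completed totals for a label set.
func (r *SyncRecorder) Counts(l ResourceLabels) (started, completed int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.started[l], r.completed[l]
}

func main() {
	r := NewSyncRecorder()
	l := ResourceLabels{Gateway: "gw", Namespace: "default", Resource: "HTTPRoute"}
	r.StartSync(l)                    // translation begins
	time.Sleep(10 * time.Millisecond) // stand-in for translation and status update
	r.EndSync(l)                      // status has been written
	started, completed := r.Counts(l)
	fmt.Println(started, completed, r.durations[l])
}
```

Keeping only the earliest pending start time is one possible way to make the duration cover every step of a sync; the PR may handle overlapping syncs differently.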

Related issues:

Change Type

/kind bug_fix
/kind cleanup

Changelog

NONE

@github-actions github-actions bot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/fix Categorizes issue or PR as related to a bug. release-note-none labels Jul 7, 2025
@dhaifley dhaifley force-pushed the resources_sync_metrics branch from 9b721b9 to 811b6bc Compare July 7, 2025 21:28
Help: "Current number of gateway resources managed by the collection",
},
[]string{"namespace", "resource"},
[]string{"name", "namespace", "resource"},
Contributor

Should we use a label that describes gatewayName a bit more?

Contributor Author

Agreed. In most cases the name label is used for the gateway name, changing the label to "gateway" would be helpful on several metrics.

return st[0], true
}

return time.Now(), false
Contributor

Maybe this is more explicit?

Suggested change
- return time.Now(), false
+ return time.Time{}, false

Contributor Author

Yes, that was a holdover from before this function returned a bool indicating whether it was found. It was a safer default, but this will be changed now that this function is used differently.
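
To make the suggestion concrete, the following is a small sketch of a lookup that returns the zero `time.Time` together with `false` when nothing is found. `syncKey`, `startTimes`, and `lookupStartTime` are hypothetical names for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// syncKey and startTimes are hypothetical names used only in this sketch.
type syncKey struct {
	Gateway, Namespace, Resource string
}

var startTimes = map[syncKey]time.Time{}

// lookupStartTime returns the recorded start time and true, or the zero
// time.Time and false when no start time exists for the key.
func lookupStartTime(k syncKey) (time.Time, bool) {
	if st, ok := startTimes[k]; ok {
		return st, true
	}
	return time.Time{}, false
}

func main() {
	k := syncKey{Gateway: "gw", Namespace: "default", Resource: "HTTPRoute"}
	if _, ok := lookupStartTime(k); !ok {
		fmt.Println("no sync in progress")
	}
	startTimes[k] = time.Now()
	if st, ok := lookupStartTime(k); ok {
		fmt.Println("elapsed:", time.Since(st))
	}
}
```

With the zero value, a caller that ignores the boolean gets an obviously invalid timestamp (`IsZero()` reports true) instead of a plausible-looking "now" that would yield a near-zero duration.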

@@ -108,10 +130,77 @@ func (m *nullTranslatorMetricsRecorder) TranslationStart() func(error) {
return func(err error) {}
}

// IncTranslationsTotal increments the total translations counter.
func IncResourcesSyncsStartedTotal(resourceName string, labels ResourceMetricLabels) {
Contributor

resourceName does not appear as a metric label to reduce cardinality?

Contributor Author

Correct. gateway, namespace, resource kind is the level of cardinality for these metrics. We could increase this to the level of the individual resource names, but I don't know if we will need that yet.
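
The cardinality point can be illustrated with a stdlib-only sketch. The field names on `ResourceMetricLabels` are assumed here (the thread names only gateway, namespace, and resource kind), and `incSyncsStarted` is a hypothetical stand-in in which a map takes the place of the real counter.

```go
package main

import "fmt"

// ResourceMetricLabels mirrors the label set named in the thread; the field
// names are an assumption made for this sketch.
type ResourceMetricLabels struct {
	Gateway, Namespace, Resource string
}

// syncsStarted stands in for a counter vector: one series per label set.
var syncsStarted = map[ResourceMetricLabels]int{}

// incSyncsStarted accepts the resource name but deliberately does not use it
// as a label, so all resources of one kind share a single series.
func incSyncsStarted(resourceName string, labels ResourceMetricLabels) {
	_ = resourceName
	syncsStarted[labels]++
}

func main() {
	labels := ResourceMetricLabels{Gateway: "gw", Namespace: "default", Resource: "HTTPRoute"}
	for _, name := range []string{"route-a", "route-b", "route-c"} {
		incSyncsStarted(name, labels)
	}
	// Three routes, one series.
	fmt.Println(len(syncsStarted), syncsStarted[labels]) // prints: 1 3
}
```

Adding the resource name as a label would instead create one series per route, which is the increase in cardinality the reply defers until it is needed.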

@dhaifley dhaifley force-pushed the resources_sync_metrics branch 2 times, most recently from f8bb0a0 to 72b00e8 Compare July 8, 2025 14:12
Contributor

@sheidkamp sheidkamp left a comment

Still poking around, and I know it's in draft, but some initial feedback.

Overall looking great!

metricsRecorder := metrics.NewTranslatorMetricsRecorder("TranslateHTTPRoute")
defer metricsRecorder.TranslationStart()(nil)

metrics.IncResourcesSyncsStartedTotal(routeInfo.GetName(), metrics.ResourceMetricLabels{
Contributor

I think this is getting called more times than expected. I'm not sure whether the note in the function header that this function is invoked recursively explains it.

Contributor Author

Thanks. Taking a look at this...

@sheidkamp
Contributor

(also, the failing unit test is a metrics one)

@dhaifley dhaifley force-pushed the resources_sync_metrics branch 9 times, most recently from e51b96d to 01ef855 Compare July 8, 2025 23:20
Help: "Current number of resources managed by the collection",
Subsystem: resourcesSubsystem,
Name: "managed",
Help: "Gateway resources currently managed",
Contributor

I liked the phrasing of the original Help text better (current number of...)

Contributor Author

I have changed back the help text on this, and updated the other metrics to use consistent phrasing.

metrics.HistogramOpts{
Subsystem: snapshotSubsystem,
Name: "transform_duration_seconds",
Help: "XDS snapshot transform duration",
Contributor

Can we add a comment that the default bucket layout works fine here (and at other Histograms) after testing? Otherwise we can try to take a stab at overriding that with a better layout.

Contributor Author

I have added explicit Buckets assignments for all histograms (even if they use metrics.DefaultBuckets). Some histogram bucket sets may still need further refinement.
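
To show why the bucket layout matters, here is a stdlib-only sketch of how observations land in histogram buckets. `defaultBuckets` reproduces the default boundaries of the Prometheus Go client (in seconds); the `histogram` type and the `slowBuckets` layout are hypothetical and are not the project's `metrics` package.

```go
package main

import (
	"fmt"
	"sort"
)

// defaultBuckets mirrors the Prometheus Go client's default upper bounds,
// in seconds.
var defaultBuckets = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

// histogram is a simplified stand-in: one count per bucket, plus a final
// slot for observations above the largest bound (+Inf).
type histogram struct {
	buckets []float64 // upper bounds, ascending
	counts  []uint64
}

func newHistogram(buckets []float64) *histogram {
	return &histogram{buckets: buckets, counts: make([]uint64, len(buckets)+1)}
}

// observe increments the first bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	i := sort.SearchFloat64s(h.buckets, v)
	h.counts[i]++
}

func main() {
	// Hypothetical layout for slower operations; an assumption, not a tested value.
	slowBuckets := []float64{1, 5, 15, 30, 60}

	def, slow := newHistogram(defaultBuckets), newHistogram(slowBuckets)
	for _, seconds := range []float64{0.003, 0.2, 30} {
		def.observe(seconds)
		slow.observe(seconds)
	}
	fmt.Println("default:", def.counts)
	fmt.Println("slow:   ", slow.counts)
}
```

With the default layout, the 30-second observation can only be counted in the +Inf slot, so any resolution above 10 seconds is lost; that is the kind of case where an overridden layout would help.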

newSTs := []ResourceSyncStartTime{}
res := []ResourceSyncStartTime{}

if xdsSnapshot {
Contributor

Could this nested loop become a problem at scale, especially with startTimes.Lock()? I can imagine the number of namespaces to be quite high in some environments.

Contributor Author

The nested for loop was not needed and has been removed. There may be further ways to optimize the process used in EndResourceSync().

@dhaifley dhaifley force-pushed the resources_sync_metrics branch 2 times, most recently from 17c16ec to 99fa0a3 Compare July 9, 2025 14:25
gateway := snapWrap.ResourceName()
namespace := "unknown"

pks := strings.SplitN(gateway, "~", 5)
Contributor

This logic is in here a few times... move to a function?

Contributor Author

Agreed. This has been updated.
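
A helper along these lines might look like the sketch below. The `"<prefix>~<namespace>~<name>"` layout is an assumption made purely for illustration, since the snippet above shows only the `"~"` separator and the `"unknown"` default; `parseGatewayResourceName` is a hypothetical name, not the PR's helper.

```go
package main

import (
	"fmt"
	"strings"
)

// parseGatewayResourceName extracts the gateway name and namespace from an
// xDS snapshot resource name. It ASSUMES a "<prefix>~<namespace>~<name>"
// layout; the real format may differ.
func parseGatewayResourceName(resourceName string) (gateway, namespace string) {
	// Fall back to the raw name and an "unknown" namespace when the name
	// does not have the assumed shape.
	gateway, namespace = resourceName, "unknown"

	parts := strings.SplitN(resourceName, "~", 5)
	if len(parts) >= 3 {
		namespace, gateway = parts[1], parts[2]
	}
	return gateway, namespace
}

func main() {
	fmt.Println(parseGatewayResourceName("example-prefix~default~my-gateway"))
	fmt.Println(parseGatewayResourceName("no-separators"))
}
```

Centralising the split in one function means the fallback behaviour is defined once rather than at every call site.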

@dhaifley dhaifley force-pushed the resources_sync_metrics branch 2 times, most recently from 58af234 to 17c5cab Compare July 9, 2025 14:57
Contributor

@sheidkamp sheidkamp left a comment

couple more comments

gwNames = append(gwNames, string(pr.Name))
}

ResourceMetricEventHandler(o, "HTTPRoute", gwNames)
Contributor

Should there be a wrapped version of this function that handles extracting the names?

Contributor Author

I have updated this to use a helper function, and the code in setup.go is much simpler. However, GetResourceMetricEventHandler() will need to be updated if any new types need to use this function to return a krt.Event handler for metrics.
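
The trade-off mentioned here, that each new type needs its own case, can be seen in a simplified sketch. All types below (`ParentRef`, `HTTPRoute`, `TCPRoute`, `Event`) are hypothetical stand-ins rather than the Gateway API or krt types, and `resourceMetricHandler` is not the PR's `GetResourceMetricEventHandler()`.

```go
package main

import "fmt"

// Hypothetical minimal types for this sketch only.
type ParentRef struct{ Name string }
type HTTPRoute struct{ Parents []ParentRef }
type TCPRoute struct{ Parents []ParentRef }

// Event is a simplified stand-in for a collection event.
type Event struct{ Object any }

// parentNames extracts the gateway names from a list of parent references.
func parentNames(refs []ParentRef) []string {
	names := make([]string, 0, len(refs))
	for _, r := range refs {
		names = append(names, r.Name)
	}
	return names
}

// resourceMetricHandler returns an event handler that works out the resource
// kind and gateway names itself, so call sites no longer extract them.
func resourceMetricHandler(record func(kind string, gateways []string)) func(Event) {
	return func(e Event) {
		switch o := e.Object.(type) {
		case HTTPRoute:
			record("HTTPRoute", parentNames(o.Parents))
		case TCPRoute:
			record("TCPRoute", parentNames(o.Parents))
		default:
			// Unsupported types are ignored until a case is added for them.
		}
	}
}

func main() {
	handle := resourceMetricHandler(func(kind string, gateways []string) {
		fmt.Println(kind, gateways)
	})
	handle(Event{Object: HTTPRoute{Parents: []ParentRef{{Name: "gw-a"}, {Name: "gw-b"}}}})
	handle(Event{Object: "unsupported"}) // silently skipped
}
```

The type switch keeps call sites short, at the cost of a central function that has to be edited whenever a new resource type should emit metrics.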

@@ -27,4 +29,30 @@ func (s *ProxyTranslator) syncXds(
// a default initial fetch timeout
// snap.MakeConsistent()
s.xdsCache.SetSnapshot(ctx, proxyKey, snap)

gateway, namespace := getGatewayFromXDSSnapshotResourceName(snapWrap.ResourceName())
Contributor

Would there be any downsides to putting these metrics calculations in a goroutine so they can run async and let the translators get back to translating? (comment probably applies anywhere we're doing metrics)

Contributor Author

This is an excellent question that Sam also brought up on the previous metrics PR. This is important given that these metrics require calculations and contain some loops. I'm not sure spawning a separate goroutine for each calculation would work best, but perhaps passing the info to calculate metrics in a channel to a worker goroutine would be helpful?

Contributor Author

What I've done so far (see the next lines) is consolidate all of the work to update the actual metrics within the tmetrics.EndResourceSync() call. This can make it easier to pass this work into a channel to be handled by one or more worker goroutines. It also cleans up the instrumentation where EndResourceSync() is used.
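
The channel-plus-worker idea floated in this thread could look roughly like the sketch below. It is an illustration under assumptions rather than the PR's code: `syncResult` and `metricsWorker` are hypothetical names, and maps stand in for the real metrics.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// syncResult is a hypothetical message describing one finished sync.
type syncResult struct {
	Gateway, Namespace, Resource string
	Duration                     time.Duration
}

// metricsWorker updates metric state on its own goroutine so that callers
// on the translation path only pay for a channel send.
type metricsWorker struct {
	ch      chan syncResult
	wg      sync.WaitGroup
	mu      sync.Mutex
	totals  map[string]int
	elapsed map[string]time.Duration
}

func newMetricsWorker(buffer int) *metricsWorker {
	w := &metricsWorker{
		ch:      make(chan syncResult, buffer),
		totals:  map[string]int{},
		elapsed: map[string]time.Duration{},
	}
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for r := range w.ch {
			w.mu.Lock()
			w.totals[r.Resource]++
			w.elapsed[r.Resource] += r.Duration
			w.mu.Unlock()
		}
	}()
	return w
}

// Report enqueues a result without blocking. It returns false when the
// buffer is full, in which case the sample is dropped.
func (w *metricsWorker) Report(r syncResult) bool {
	select {
	case w.ch <- r:
		return true
	default:
		return false
	}
}

// Close stops accepting results and waits for queued ones to be processed.
func (w *metricsWorker) Close() {
	close(w.ch)
	w.wg.Wait()
}

// Total returns how many syncs were recorded for a resource kind.
func (w *metricsWorker) Total(resource string) int {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.totals[resource]
}

func main() {
	w := newMetricsWorker(16)
	for i := 0; i < 3; i++ {
		w.Report(syncResult{Gateway: "gw", Namespace: "default", Resource: "HTTPRoute", Duration: 5 * time.Millisecond})
	}
	w.Close()
	fmt.Println(w.Total("HTTPRoute"))
}
```

A single worker avoids spawning a goroutine per calculation, as the reply above prefers. The non-blocking send trades completeness for latency: under sustained load samples are dropped rather than slowing translation, which may or may not be acceptable for these metrics.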

@dhaifley dhaifley force-pushed the resources_sync_metrics branch 4 times, most recently from 536419d to c6dba87 Compare July 9, 2025 19:52
@dhaifley dhaifley force-pushed the resources_sync_metrics branch from c6dba87 to 8716b85 Compare July 9, 2025 19:53
@dhaifley dhaifley marked this pull request as ready for review July 9, 2025 19:59
@dhaifley dhaifley force-pushed the resources_sync_metrics branch from 8716b85 to cb6149a Compare July 9, 2025 20:33
@dhaifley dhaifley force-pushed the resources_sync_metrics branch from cb6149a to 72daed9 Compare July 9, 2025 21:17
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/fix Categorizes issue or PR as related to a bug. release-note-none
3 participants