-
Notifications
You must be signed in to change notification settings - Fork 532
fix(metrics): implement resources managed and sync metrics #11605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
9b721b9
to
811b6bc
Compare
Help: "Current number of gateway resources managed by the collection", | ||
}, | ||
[]string{"namespace", "resource"}, | ||
[]string{"name", "namespace", "resource"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use a label that describes gatewayName a bit more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. In most cases the name label is used for the gateway name, changing the label to "gateway" would be helpful on several metrics.
return st[0], true | ||
} | ||
|
||
return time.Now(), false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is more explicit?
return time.Now(), false | |
return time.Time{}, false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was a holdover from before this function returned a bool indicating whether it was found. It was a safer default, but this will be changed now that this function is used differently.
@@ -108,10 +130,77 @@ func (m *nullTranslatorMetricsRecorder) TranslationStart() func(error) { | |||
return func(err error) {} | |||
} | |||
|
|||
// IncTranslationsTotal increments the total translations counter. | |||
func IncResourcesSyncsStartedTotal(resourceName string, labels ResourceMetricLabels) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resourceName
does not appear as a metric label to reduce cardinality?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. gateway, namespace, resource kind is the level of cardinality for these metrics. We could increase this to the level of the individual resource names, but I don't know if we will need that yet.
f8bb0a0
to
72b00e8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still poking around, and I know its in draft, but some initial feedback.
Overall looking great!
metricsRecorder := metrics.NewTranslatorMetricsRecorder("TranslateHTTPRoute") | ||
defer metricsRecorder.TranslationStart()(nil) | ||
|
||
metrics.IncResourcesSyncsStartedTotal(routeInfo.GetName(), metrics.ResourceMetricLabels{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is getting called more times than expected. Not sure if the this function is recursively invoked
in the function header explains it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Taking a look at this...
(also failing test unit test is a metrics one) |
e51b96d
to
01ef855
Compare
Help: "Current number of resources managed by the collection", | ||
Subsystem: resourcesSubsystem, | ||
Name: "managed", | ||
Help: "Gateway resources currently managed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I liked the phrasing of the original Help text better (current number of...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed back the help text on this, and updated the other metrics to use consistent phrasing.
metrics.HistogramOpts{ | ||
Subsystem: snapshotSubsystem, | ||
Name: "transform_duration_seconds", | ||
Help: "XDS snapshot transform duration", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment that the default bucket layout works fine here (and at other Histograms) after testing? Otherwise we can try to take a stab at overriding that with a better layout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added explicit Buckets assignments for all histograms (even if they use metrics.DefaultBuckets). Some histogram bucket sets may still need further refinement.
newSTs := []ResourceSyncStartTime{} | ||
res := []ResourceSyncStartTime{} | ||
|
||
if xdsSnapshot { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this nested loop become a problem at scale, especially with startTimes.Lock()
? I can imagine the number of namespaces to be quite high in some environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nested for loop was not needed and has been removed. There may be further ways to optimize the process used in EndResourceSync().
17c16ec
to
99fa0a3
Compare
gateway := snapWrap.ResourceName() | ||
namespace := "unknown" | ||
|
||
pks := strings.SplitN(gateway, "~", 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is in here a few times... move to a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This has been updated.
58af234
to
17c5cab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple more comments
gwNames = append(gwNames, string(pr.Name)) | ||
} | ||
|
||
ResourceMetricEventHandler(o, "HTTPRoute", gwNames) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a wrapped version of this function that handles extracting the names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated this to use a helper function and the code in setup.go is much simpler. But, GetResourceMetricEventHandler() will need to be updated if any new types need to used this function to return a krt.Event handler for metrics.
@@ -27,4 +29,30 @@ func (s *ProxyTranslator) syncXds( | |||
// a default initial fetch timeout | |||
// snap.MakeConsistent() | |||
s.xdsCache.SetSnapshot(ctx, proxyKey, snap) | |||
|
|||
gateway, namespace := getGatewayFromXDSSnapshotResourceName(snapWrap.ResourceName()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be any downsides to putting these metrics calculations in a goroutine so they can run async and let the translators get back to translating? (comment probably applies anywhere we're doing metrics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent question that Sam also brought up on the previous metrics PR. This is important given that these metrics require calculations and contain some loops. I'm not sure spawning a separate goroutine for each calculation would work best, but perhaps passing the info to calculate metrics in a channel to a worker goroutine would be helpful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I've done so far (see the next lines) is consolodate all of work to update the actual metrics within the tmetrics.EndResourceSync() call. This can make it easier to pass this work into a channel to be handled by one or more worker goroutines. And, it cleans up the instrumentation where EndResourceSync() is used.
536419d
to
c6dba87
Compare
c6dba87
to
8716b85
Compare
8716b85
to
cb6149a
Compare
Signed-off-by: David Haifley <[email protected]>
cb6149a
to
72daed9
Compare
Description
Motivation:
What changed:
Related issues:
Change Type
Changelog