Support prometheus metrics #73
|
Hi @7ing. Thanks for your PR. I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
munnerz
left a comment
This looks great, thanks Jing 🙌 the integration and unit tests here make it far easier to review confidently!
My main questions/concerns are around the construction logic in the metrics subpackage, which I think we need to decouple from net.Listener (and allow more flexibility for projects that already have their own prometheus.Registry they'd like to re-use).
metrics/metrics.go
Outdated
| } |
| |
| // NewServer registers Prometheus metrics and returns a new Prometheus metrics HTTP server. |
| func (m *Metrics) NewServer(ln net.Listener) *http.Server { |
This isn't called outside of test cases, and I guess that is by design as it is expected that the corresponding implementation should call NewServer on metrics.Metrics to register their Listener.
Could you possibly update the example/ implementation in the root of this repository to demonstrate how to actually add the /metrics endpoint? I also wonder if the Managed should be extended to be able to auto-serve this endpoint in cases where a user doesn't need to provide their own listener (but does want metrics to be served).
What if we renamed this to Register(*prometheus.Registry) rather than tying the net.Listener logic into the registration?
We can always have/find some kind of csihelpers.Handle(*http.Server, *prometheus.Registry) function elsewhere then, which needn't be opinionated about csi-lib.
> Could you possibly update the example/ implementation in the root of this repository to demonstrate how to actually add the /metrics endpoint?
Yes, it is there, and it has been updated to match the recent changes:
https://github.com/7ing/csi-lib/blob/abf15631238fa809d10d9c206178c414d9495b4f/test/integration/metrics_test.go#L83-L95
Thanks for the suggestions. I have switched to a DefaultHandler instead, which can serve as a reference implementation of an HTTP handler.
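To illustrate the decoupling discussed in this thread, here is a minimal sketch of an application mounting a metrics handler on its own mux, so that the library never has to own a net.Listener. It uses only the Go standard library. The names metricsHandler, newMetricsMux and scrape are hypothetical, the handler is a stand-in (the actual DefaultHandler signature is not shown in this conversation), and the sample value 0 is illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// metricsHandler is a hypothetical stand-in for the handler a metrics
// package would hand back to the application. It writes one hard-coded
// sample in the Prometheus text exposition format.
func metricsHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, "certmanager_csi_managed_volume_count 0")
	})
}

// newMetricsMux mounts the handler on /metrics. The application, not the
// library, decides which listener and server settings to use afterwards.
func newMetricsMux(h http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", h)
	return mux
}

// scrape issues an in-process GET against the handler and returns the
// status code and response body.
func scrape(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newMetricsMux(metricsHandler())
	code, body := scrape(mux, "/metrics")
	fmt.Print(code, " ", body)
}
```

In a real driver the mux would be passed to an http.Server created by the application, which keeps listener setup (address, TLS, timeouts) outside the metrics package.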
|
@munnerz Thank you for your valuable input. Sorry it took so long to make the change. I believe this version addresses most of your concerns. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
test/integration/metrics_test.go
Outdated
| // Should expose that CertificateRequest as ready with expiry and renewal time |
| // node="f56fd9f8b" is the hash value of "test-node" defined in driver_testing.go |
| expectedOutputTemplate := `# HELP certmanager_csi_certificate_request_expiration_timestamp_seconds The date after which the certificate request expires. Expressed as a Unix Epoch Time. |
Will certmanager_csi_... be the prefix for the metrics when exposed in https://github.com/cert-manager/csi-driver?
Can we make sure they are all using similar names? Does this happen automatically?
Could you provide an example of before and after for https://github.com/cert-manager/csi-driver with these changes applied?
Thank you @inteon for your review.
Yes, certmanager_csi_... will be the prefix for the metrics when used in the csi-driver. It is part of the definition in:
https://github.com/7ing/csi-lib/blob/b9186bad5b6f9af9bc93dbd28571d2d9219700e6/metrics/metrics.go#L27-L31
We chose this name based on the cert-manager definition: https://github.com/cert-manager/cert-manager/blob/5e09ef6c0552df0bde64746c735cb1ff324b6261/pkg/metrics/metrics.go#L44
All cert-manager controllers use the certmanager_... prefix.
https://github.com/cert-manager/csi-driver does not currently have any cert-manager-related metrics; it only serves Kubernetes component metrics, such as CPU and memory. That is why this PR exists. The test file shows the expected output for the cert-manager metrics (in addition to the Kubernetes metrics, depending on driver configuration).
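As a rough illustration of how the certmanager_csi_ prefix falls out of the namespace and subsystem constants, the sketch below joins the parts with underscores using only the standard library. The helper fqName is hypothetical; it is meant to mirror the behaviour of prometheus.BuildFQName in client_golang (empty parts skipped, empty name yielding an empty result), which is an assumption about that library rather than something shown in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// These two values are the ones quoted from metrics/metrics.go in this PR.
const (
	namespace = "certmanager"
	subsystem = "csi"
)

// fqName is a hypothetical helper that joins the non-empty parts with
// underscores. An empty name yields an empty result.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	fmt.Println(fqName(namespace, subsystem, "certificate_request_ready_status"))
	fmt.Println(fqName(namespace, subsystem, "managed_volume_count"))
}
```

Because every metric is built from the same two constants, all metrics defined this way share the prefix automatically, which is the point made in the reply above.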
|
@7ing We have made some major dependency upgrades in this module. Are you able to rebase your PR, preparing for another round of review? Sorry for the inconvenience and for the delays in review. 😒 |
|
/retest |
|
@erikgb done with rebase |
erikgb
left a comment
This is great stuff! Thanks @7ing! I did my first pass on this PR now, and I think I would prefer using a Prometheus collector, so that we avoid explicitly adding and removing metrics that are based on API resources. Please take a look and let me know what you think! It's not a blocker, but a pattern we are adopting in cert-manager projects nowadays.
|
Updated the certificate request metrics implementation to use the collector pattern, to align with the cert-manager project. Basically, the collector periodically scans the CertificateRequest shared informer (for expiry) and the host metadata files (for renewal time), then records the metrics. |
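The idea behind the collector pattern can be modelled without the Prometheus client: samples are derived from the current state of the source of truth each time they are collected, so nothing has to be deleted by hand when a resource goes away. The sketch below is a library-free model only. The request and store types, the collect function and the name label are hypothetical; a real collector would implement the prometheus.Collector interface (Describe and Collect) instead of returning strings.

```go
package main

import (
	"fmt"
	"sort"
)

// request is a hypothetical, trimmed-down view of a CertificateRequest.
type request struct {
	NotAfter int64 // expiry as Unix seconds
}

// store stands in for an informer-backed lister keyed by request name.
type store map[string]request

// collect derives one sample line per request from the store's current
// contents. Requests that no longer exist simply produce no sample.
func collect(s store) []string {
	names := make([]string, 0, len(s))
	for name := range s {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order

	out := make([]string, 0, len(names))
	for _, name := range names {
		out = append(out, fmt.Sprintf(
			"certmanager_csi_certificate_request_expiration_timestamp_seconds{name=%q} %d",
			name, s[name].NotAfter))
	}
	return out
}

func main() {
	s := store{
		"a": {NotAfter: 1700000000},
		"b": {NotAfter: 1800000000},
	}
	fmt.Println(len(collect(s)), "samples before delete")

	// Removing the request from the source is enough; no metric cleanup call.
	delete(s, "a")
	fmt.Println(len(collect(s)), "samples after delete")
}
```

Compared with an add/remove approach, there is no separate metric state that can drift out of sync with the API resources.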
Pull Request Overview
This PR adds Prometheus metrics support to the CSI library to expose operational metrics about certificate management. The implementation includes metrics for certificate request expiration timestamps, ready status, renewal times, driver issue call counts, and managed volume/certificate counts.
Key changes:
- Added comprehensive metrics collection system with Prometheus integration
- Implemented certificate request collector for tracking certificate lifecycle metrics
- Added metrics tracking to the manager for monitoring issue calls and errors
- Created extensive test coverage for metrics functionality
Reviewed Changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/util/testutil.go | Exported function to support metrics testing |
| test/integration/metrics_test.go | Added comprehensive integration test for metrics server functionality |
| test/driver/driver_testing.go | Added metrics support to test driver infrastructure |
| metrics/metrics.go | Core metrics implementation with Prometheus registry and HTTP handler |
| metrics/certificaterequest_test.go | Unit tests for certificate request metrics collection |
| metrics/certificaterequest_collector.go | Prometheus collector for certificate request lifecycle metrics |
| manager/manager.go | Integrated metrics tracking into certificate issuance operations |
| go.mod | Added prometheus client dependency |
|
/retest-required |
|
@erikgb @hjoshi123 I believe this one is close to the final version. Let me know if you have further questions. Thanks. |
|
Any chance we could add this feature to the next release? |
|
Hi! I'll ask if someone from CyberArk can review this PR. |
|
@7ing, I feel really sorry for the delays in the review of this PR. Sorry! 😢 What worries me most about this (huge) PR is the UX/interface. Are you able to create a draft PR in the csi-driver to see how this can be used in an application? We can always fix bugs in follow-up PRs, but breaking changes in APIs are something we try to avoid. |
This PR demonstrates csi-lib metrics feature in cert-manager/csi-lib#73 Signed-off-by: Jing Liu <[email protected]>
|
@erikgb we have example-csi in this repo demonstrating how to integrate the metrics handler. Similarly, I have added a draft PR in csi-driver to demonstrate how to use it. cert-manager/csi-driver#502. Hopefully that resolves the concern. Thanks. |
Following metrics added:
certmanager_csi_certificate_request_expiration_timestamp_seconds
certmanager_csi_certificate_request_ready_status
certmanager_csi_certificate_request_renewal_timestamp_seconds
certmanager_csi_driver_issue_call_count
certmanager_csi_driver_issue_error_count
certmanager_csi_managed_certificate_count
certmanager_csi_managed_volume_count
fixes: cert-manager#60
Signed-off-by: Jing Liu <[email protected]>
fixes: cert-manager#60 Signed-off-by: Jing Liu <[email protected]>
Signed-off-by: Jing Liu <[email protected]>
|
/retest-required |
Pull Request Overview
Copilot reviewed 12 out of 14 changed files in this pull request and generated 4 comments.
| @@ -0,0 +1,124 @@ | |||
| /* | |||
| Copyright 2024 The cert-manager Authors. | |||
Copilot
AI
Nov 8, 2025
Corrected copyright year from 2024 to 2025 to match the actual year of the file creation (metrics/certificaterequest_collector.go shows 2025).
| Copyright 2024 The cert-manager Authors. | |
| Copyright 2025 The cert-manager Authors. |
| @@ -0,0 +1,450 @@ | |||
| /* | |||
| Copyright 2024 The cert-manager Authors. | |||
Copilot
AI
Nov 8, 2025
Corrected copyright year from 2024 to 2025 to match the actual year consistent with other new files in this PR.
| Copyright 2024 The cert-manager Authors. | |
| Copyright 2025 The cert-manager Authors. |
| func (m *Metrics) SetupCertificateRequestCollector(nodeId string, metadataReader storage.MetadataReader, certificateRequestLister cmlisters.CertificateRequestLister) { |
| m.certificateRequestCollector = NewCertificateRequestCollector(internalapiutil.HashIdentifier(nodeId), metadataReader, certificateRequestLister) |
| m.registry.MustRegister(m.certificateRequestCollector) |
| } |
Copilot
AI
Nov 8, 2025
The SetupCertificateRequestCollector method re-registers a collector without unregistering the previous one, which will cause a panic from Prometheus. The collector is already registered in the New function at line 97. This method appears unused in the codebase and should likely be removed, or it should first unregister the old collector before registering a new one.
| func (m *Metrics) SetupCertificateRequestCollector(nodeId string, metadataReader storage.MetadataReader, certificateRequestLister cmlisters.CertificateRequestLister) { | |
| m.certificateRequestCollector = NewCertificateRequestCollector(internalapiutil.HashIdentifier(nodeId), metadataReader, certificateRequestLister) | |
| m.registry.MustRegister(m.certificateRequestCollector) | |
| } |
I agree with Copilot here. This function appears superfluous after the latest changes.
| // It should not be used full-stop :). |
| doNotUse_CallOnEachIssue func() |
| |
| // metrics is used to expose Prometheus |
Copilot
AI
Nov 8, 2025
The comment is incomplete. It should be: 'metrics is used to expose Prometheus metrics' or 'metrics is used for Prometheus metrics collection'.
| // metrics is used to expose Prometheus | |
| // metrics is used to expose Prometheus metrics |
| // metrics is used to expose Prometheus | |
| // metrics is used for Prometheus metrics collection |
erikgb
left a comment
@7ing, you have been waiting for a long time. Sorry about that! 🙈 This codebase is still a bit unfamiliar to me, so I was hoping for someone else to also review it. I have tried to focus on the UX in my review, so we avoid future breaking changes if something needs to be fixed. Let me know what you think!
I really think this feature is needed, so one way or the other we should try to get it merged soon.
| // things like the number of times issue() is called. | ||
| // No thread safety is added around this field, and it MUST NOT be used for any implementation logic. | ||
| // It should not be used full-stop :). | ||
| doNotUse_CallOnEachIssue func() |
Can/should this field be removed now, or is this a follow-up? Ref.
Lines 382 to 385 in f3a2df0
| // TODO: remove this code and replace with actual metrics support | |
| if m.doNotUse_CallOnEachIssue != nil { | |
| m.doNotUse_CallOnEachIssue() | |
| } |
| const ( | ||
| // Namespace is the namespace for csi-lib metric names | ||
| namespace = "certmanager" | ||
| subsystem = "csi" |
Could it make sense to make the subsystem configurable from the application using csi-lib, with a good default? I am just thinking how this might appear if two CSI drivers using this lib are running in the same environment. It probably doesn't make much sense, but if two drivers using this lib are producing the same metrics, I suggest dropping the driver_ prefix from the metrics name (commented below).
| prometheus.CounterOpts{ | ||
| Namespace: namespace, | ||
| Subsystem: subsystem, | ||
| Name: "driver_issue_call_count_total", |
| Name: "driver_issue_call_count_total", | |
| Name: "issue_requests_total", |
If we want to keep driver_ in the full metrics name, I suggest including it in the subsystem part above.
Naming is hard, but I tried to read https://prometheus.io/docs/practices/naming/, and I think "call" is a bit code-native. Also count/total is almost the same.
| prometheus.CounterOpts{ | ||
| Namespace: namespace, | ||
| Subsystem: subsystem, | ||
| Name: "driver_issue_error_count_total", |
| Name: "driver_issue_error_count_total", | |
| Name: "issue_errors_total", |
If we want to keep driver_ in the full metrics name, I suggest including it in the subsystem part above.
Naming is hard, but I tried to read https://prometheus.io/docs/practices/naming/, and I think count/total is almost the same.
certmanager_csi_certificate_request_expiration_timestamp_seconds
certmanager_csi_certificate_request_ready_status
certmanager_csi_certificate_request_renewal_timestamp_seconds
certmanager_csi_driver_issue_call_count_total
certmanager_csi_driver_issue_error_count_total
certmanager_csi_managed_certificate_count_total
certmanager_csi_managed_volume_count_total
fixes: #60
CyberArk tracker: VC-46169