Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog #24743
base: master
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This pull request adds certificate authentication validation capabilities to the telemetry watchdog health check system. The feature introduces a new probe that tests both positive and negative certificate authentication scenarios by making gNMI calls with different certificate configurations to verify that authentication is working correctly.
Key changes:
- Added a new TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED environment variable to control certificate probe functionality (enabled by default)
- Implemented dual certificate probes: one with "bad" credentials expected to fail, and one with "good" credentials expected to succeed
- Added certificate-related constants for CA certificates, server certificates, keys, and target names for both test scenarios
    let mut res_bad = run_gnmi_for_xpath(&xpath_rc, port, &bad_sec, DEFAULT_BAD_TNAME, timeout, "SHOW");
    if res_bad.success {
        res_bad.success = false;
        let msg = "Expected FAILURE with BAD cert but command SUCCEEDED".to_string();
        res_bad.error = Some(match res_bad.error.take() {
            Some(existing) => format!("{existing}; {msg}"),
            None => msg,
        });
        http_status = "HTTP/1.1 500 Internal Server Error";
    }
    cmd_results.push(res_bad);
Copilot
AI
Dec 4, 2025
The bad certificate probe will always report success: false in the health status output (either from actual rejection or from manual inversion), which makes it difficult to distinguish between expected behavior (cert correctly rejected) and unexpected behavior (cert incorrectly accepted) when reviewing the JSON output. Consider adding context to the xpath field (e.g., "reboot-cause/history [BAD_CERT_TEST]") or the error message to clarify that failures are expected for this probe.
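One way to act on this suggestion is sketched below. It is only an illustration: `ProbeResult`, `BAD_CERT_TAG`, and `annotate_expected_reject` are hypothetical names, and the watchdog's real result type may have different fields. The sketch tags the xpath and rewrites the error text so that an expected rejection and an unexpected acceptance read differently in the JSON output, while `success` stays `false` in both cases as in the PR.

```rust
/// Hypothetical stand-in for the watchdog's per-command result (assumption).
#[derive(Debug, Clone, PartialEq)]
struct ProbeResult {
    xpath: String,
    success: bool,
    error: Option<String>,
}

/// Hypothetical tag appended to the xpath of the expected-failure probe.
const BAD_CERT_TAG: &str = "[BAD_CERT_TEST]";

/// Annotates the result of the bad-certificate probe.
/// `success` is always reported as false, matching the PR's behavior;
/// the error text says whether the outcome was the expected one.
fn annotate_expected_reject(raw: ProbeResult) -> ProbeResult {
    let xpath = format!("{} {}", raw.xpath, BAD_CERT_TAG);
    let error = if raw.success {
        // The server accepted credentials it should have rejected.
        "Expected FAILURE with BAD cert but command SUCCEEDED".to_string()
    } else {
        // The server rejected the credentials, which is the desired outcome.
        match raw.error {
            Some(e) => format!("expected rejection: {e}"),
            None => "expected rejection".to_string(),
        }
    };
    ProbeResult { xpath, success: false, error: Some(error) }
}
```

With this shape, a reader of the health output can tell the two cases apart from the error string alone, without knowing that the probe inverts its result.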
    // BAD (expected fail) probe
    const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";

    // GOOD (expected success) probe
    const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
    const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
    const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
    const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
Copilot
AI
Dec 4, 2025
The hardcoded certificate paths and target names (DEFAULT_BAD_CA, DEFAULT_BAD_CERT, etc.) appear to be environment-specific and may not exist on all systems. Consider making these configurable via environment variables (similar to how other probes can be configured) or Redis configuration to make the probe more flexible across different deployments.
    - // BAD (expected fail) probe
    - const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    - const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    - const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    - const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";
    - // GOOD (expected success) probe
    - const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
    - const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
    - const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
    - const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
    + // BAD (expected fail) probe environment variable names
    + const BAD_CA_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_CA";
    + const BAD_CERT_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_CERT";
    + const BAD_KEY_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_KEY";
    + const BAD_TNAME_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_TNAME";
    + // GOOD (expected success) probe environment variable names
    + const GOOD_CA_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_CA";
    + const GOOD_CERT_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_CERT";
    + const GOOD_KEY_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_KEY";
    + const GOOD_TNAME_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_TNAME";
    + // BAD (expected fail) probe defaults
    + const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    + const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    + const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    + const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";
    + // GOOD (expected success) probe defaults
    + const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
    + const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
    + const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
    + const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
    + // Helper functions to get BAD/GOOD probe values from env or default
    + fn get_bad_ca() -> String {
    +     std::env::var(BAD_CA_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_CA.to_string())
    + }
    + fn get_bad_cert() -> String {
    +     std::env::var(BAD_CERT_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_CERT.to_string())
    + }
    + fn get_bad_key() -> String {
    +     std::env::var(BAD_KEY_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_KEY.to_string())
    + }
    + fn get_bad_tname() -> String {
    +     std::env::var(BAD_TNAME_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_TNAME.to_string())
    + }
    + fn get_good_ca() -> String {
    +     std::env::var(GOOD_CA_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_CA.to_string())
    + }
    + fn get_good_cert() -> String {
    +     std::env::var(GOOD_CERT_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_CERT.to_string())
    + }
    + fn get_good_key() -> String {
    +     std::env::var(GOOD_KEY_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_KEY.to_string())
    + }
    + fn get_good_tname() -> String {
    +     std::env::var(GOOD_TNAME_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_TNAME.to_string())
    + }
    // 1) BAD cert: expect failure
    // 2) GOOD cert: expect success
    if is_cert_probe_enabled() {
        let xpath_rc = "reboot-cause/history";
Copilot
AI
Dec 4, 2025
[nitpick] The variable name xpath_rc is ambiguous. While it appears to stand for "reboot-cause", the abbreviation "rc" could also mean "return code" or "release candidate". Consider using a more descriptive name like xpath_reboot_cause for clarity.
    // BAD (expected fail) probe
    const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";

    // GOOD (expected success) probe
    const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
    const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
    const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
    const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
Copilot
AI
Dec 4, 2025
[nitpick] The naming convention "BAD" and "GOOD" for certificate test constants could be misinterpreted. "BAD" doesn't mean the certificate is malformed or invalid—it means we expect authentication to fail with these credentials. Consider more descriptive names like DEFAULT_UNAUTHORIZED_CA and DEFAULT_AUTHORIZED_CA (or _REJECTED_ and _ACCEPTED_) to clarify the intended test behavior rather than the quality of the certificates themselves.
    - // BAD (expected fail) probe
    - const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    - const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    - const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    - const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";
    - // GOOD (expected success) probe
    - const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
    - const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
    - const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
    - const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
    + // UNAUTHORIZED (expected fail) probe
    + const DEFAULT_UNAUTHORIZED_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    + const DEFAULT_UNAUTHORIZED_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    + const DEFAULT_UNAUTHORIZED_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    + const DEFAULT_UNAUTHORIZED_TNAME: &str = "server.ndastreaming.ap.gbl";
    + // AUTHORIZED (expected success) probe
    + const DEFAULT_AUTHORIZED_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
    + const DEFAULT_AUTHORIZED_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
    + const DEFAULT_AUTHORIZED_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
    + const DEFAULT_AUTHORIZED_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
    // BAD (expected fail) probe
    const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";
Copilot
AI
Dec 4, 2025
[nitpick] The comment "BAD (expected fail) probe" could be clearer. Consider expanding it to explain WHY this probe is expected to fail (e.g., "BAD cert probe: uses unauthorized credentials to verify authentication rejection works correctly"). This would help future maintainers understand the testing strategy.
    fn is_cert_probe_enabled() -> bool {
        match env::var(CERT_PROBE_ENV_VAR) {
            Ok(v) if v.eq_ignore_ascii_case("false") => false,
            _ => true, // default enabled
        }
    }
Copilot
AI
Dec 4, 2025
The cert probe is enabled by default and relies on hardcoded file paths (e.g., /etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem). If these files don't exist in all deployment environments, the probe will fail on every health check, which could cause monitoring alerts. Consider either: 1) making the probe default to disabled (like the serial number probe), or 2) checking if the certificate files exist before running the probe, or 3) documenting the required file locations clearly.
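Options 1 and 2 from this comment can be sketched together as follows. This is a rough illustration under stated assumptions: `parse_enabled` and `missing_files` are hypothetical helpers that do not appear in the PR, and whether the probe should default to enabled or disabled is left as a parameter rather than decided here.

```rust
use std::path::Path;

/// Pure parser for the enable flag. Only "true" and "false" (case-insensitive)
/// are recognized; anything else, including an unset variable, yields `default`.
fn parse_enabled(value: Option<&str>, default: bool) -> bool {
    match value.map(str::trim) {
        Some(v) if v.eq_ignore_ascii_case("true") => true,
        Some(v) if v.eq_ignore_ascii_case("false") => false,
        _ => default,
    }
}

/// Returns the entries of `required` that are not existing regular files,
/// so the caller can skip the probe (or report a clear reason) up front.
fn missing_files<'a>(required: &[&'a str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|p| !Path::new(*p).is_file())
        .collect()
}
```

A caller might run the probe only when `parse_enabled(...)` is true and `missing_files(...)` returns an empty list, and otherwise record the missing paths in the health output instead of letting every health check fail.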
    let timeout = read_timeout();
    let target_name = get_target_name();

    // Certificate probes on reboot-cause/history API
Copilot
AI
Dec 4, 2025
[nitpick] The comment describes "Certificate probes on reboot-cause/history API" but doesn't explain the purpose or rationale. Consider adding context about why this specific endpoint is used for certificate testing (e.g., "Tests certificate validation using the reboot-cause/history endpoint as a representative gNMI query").
    - // Certificate probes on reboot-cause/history API
    + // Certificate probes using the reboot-cause/history API.
    + // This endpoint is used as a representative gNMI query to test certificate validation.
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
    // 1) BAD cert: expect failure
    // 2) GOOD cert: expect success
    if is_cert_probe_enabled() {
        let xpath_rc = "reboot-cause/history";
Copilot
AI
Dec 5, 2025
[nitpick] Variable naming inconsistency: xpath_rc is used for the reboot-cause xpath, but other similar variables in the codebase use more descriptive names like xpath_sn (line 443). Consider renaming to something more descriptive like xpath_reboot_cause or reboot_cause_xpath for better clarity and consistency with the naming pattern used elsewhere in the function.
    res_bad.error = Some(match res_bad.error.take() {
        Some(existing) => format!("{existing}; {msg}"),
        None => msg,
    });
Copilot
AI
Dec 5, 2025
[nitpick] The error message construction could be simplified and made clearer. Instead of using take() and matching on Some(existing), consider a more straightforward approach: set res_bad.error = Some(msg) if there's no existing error, or prepend the message if there is one. Alternatively, since this is checking for an unexpected success, the original error (if any) from a successful command is likely not relevant - the probe succeeded when it should have failed, so replacing rather than appending might be more appropriate.
    - res_bad.error = Some(match res_bad.error.take() {
    -     Some(existing) => format!("{existing}; {msg}"),
    -     None => msg,
    - });
    + res_bad.error = Some(msg);
    const CERT_PROBE_ENV_VAR: &str = "TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED";

    // BAD (expected fail) probe
    const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
    const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
    const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
    const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";

    // GOOD (expected success) probe
    const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
    const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
    const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
    const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
Copilot
AI
Dec 5, 2025
Missing documentation for the new certificate probe feature. Consider adding a comment block before the constants (lines 47-59) explaining the purpose of these certificate probes, what "BAD" and "GOOD" mean in this context, and when/why this feature should be enabled or disabled. This would help future maintainers understand the testing strategy being implemented.
    // Certificate probes on reboot-cause/history API
    // 1) BAD cert: expect failure
    // 2) GOOD cert: expect success
Copilot
AI
Dec 5, 2025
[nitpick] The comment states "Certificate probes on reboot-cause/history API" but doesn't explain the testing strategy. The comment should clarify that this is testing certificate validation by: 1) Attempting to connect with known invalid certificates (expecting failure) and 2) Attempting to connect with valid certificates (expecting success). This helps readers understand that this is a negative/positive test pair for certificate authentication.
    - // Certificate probes on reboot-cause/history API
    - // 1) BAD cert: expect failure
    - // 2) GOOD cert: expect success
    + // Certificate validation probes on reboot-cause/history API.
    + // This tests certificate authentication with a negative/positive test pair:
    + // 1) Attempt to connect with a known invalid (BAD) certificate and expect failure (negative test).
    + // 2) Attempt to connect with a known valid (GOOD) certificate and expect success (positive test).