Conversation

@FengPan-Frank
Contributor

Why I did it

Work item tracking
  • Microsoft ADO (number only):

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Copilot AI review requested due to automatic review settings December 4, 2025 12:42
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This pull request adds certificate authentication validation capabilities to the telemetry watchdog health check system. The feature introduces a new probe that tests both positive and negative certificate authentication scenarios by making gNMI calls with different certificate configurations to verify that authentication is working correctly.

Key changes:

  • Added a new TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED environment variable to control certificate probe functionality (enabled by default)
  • Implemented dual certificate probes: one with "bad" credentials expected to fail, and one with "good" credentials expected to succeed
  • Added certificate-related constants for CA certificates, server certificates, keys, and target names for both test scenarios
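
The dual-probe idea in the bullets above can be sketched as a negative/positive pair whose raw results are compared against what each probe is expected to observe. This is a minimal sketch under stated assumptions, not the PR's code: `ProbeOutcome`, `interpret`, `run_cert_probe_pair`, and the `run` closure (standing in for the real gNMI call) are hypothetical names.

```rust
/// Result of one probe after comparing it with its expectation.
/// Hypothetical type for this sketch; the PR uses its own result struct.
#[derive(Debug, Clone, PartialEq)]
struct ProbeOutcome {
    label: &'static str,
    passed: bool,
    detail: Option<String>,
}

/// Compare what the command reported (`raw_success`) with what the
/// probe is supposed to observe (`expect_success`).
fn interpret(label: &'static str, raw_success: bool, expect_success: bool) -> ProbeOutcome {
    let passed = raw_success == expect_success;
    let detail = if passed {
        None
    } else {
        let word = |ok: bool| if ok { "SUCCESS" } else { "FAILURE" };
        Some(format!(
            "expected {} but command reported {}",
            word(expect_success),
            word(raw_success)
        ))
    };
    ProbeOutcome { label, passed, detail }
}

/// Run the negative probe first, then the positive one.
/// `run` stands in for the real gNMI call and returns its raw success flag.
fn run_cert_probe_pair<F>(mut run: F) -> Vec<ProbeOutcome>
where
    F: FnMut(&str) -> bool,
{
    let bad = interpret("bad-cert", run("bad"), false);
    let good = interpret("good-cert", run("good"), true);
    vec![bad, good]
}

/// The check counts as healthy only if every probe matched its expectation.
fn all_passed(outcomes: &[ProbeOutcome]) -> bool {
    outcomes.iter().all(|o| o.passed)
}
```

Calling `run_cert_probe_pair(|which| which == "good")` models a server that rejects the bad credentials and accepts the good ones, so `all_passed` returns true; a server that accepts everything fails the negative probe.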

Comment on lines +418 to +428
let mut res_bad = run_gnmi_for_xpath(&xpath_rc, port, &bad_sec, DEFAULT_BAD_TNAME, timeout, "SHOW");
if res_bad.success {
    res_bad.success = false;
    let msg = "Expected FAILURE with BAD cert but command SUCCEEDED".to_string();
    res_bad.error = Some(match res_bad.error.take() {
        Some(existing) => format!("{existing}; {msg}"),
        None => msg,
    });
    http_status = "HTTP/1.1 500 Internal Server Error";
}
cmd_results.push(res_bad);

Copilot AI Dec 4, 2025

The bad certificate probe will always report success: false in the health status output (either from actual rejection or from manual inversion), which makes it difficult to distinguish between expected behavior (cert correctly rejected) and unexpected behavior (cert incorrectly accepted) when reviewing the JSON output. Consider adding context to the xpath field (e.g., "reboot-cause/history [BAD_CERT_TEST]") or the error message to clarify that failures are expected for this probe.
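
One way to act on this suggestion is to tag the probe entry before it is pushed to the results. The sketch below is an assumption-laden illustration, not the PR's implementation: `CommandResult` is a minimal stand-in that only mirrors the `xpath`, `success`, and `error` fields mentioned in this thread, and `annotate_bad_cert_result` is a hypothetical helper.

```rust
/// Minimal stand-in for the watchdog's per-command result.
/// Only the fields discussed in this thread are modelled; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
struct CommandResult {
    xpath: String,
    success: bool,
    error: Option<String>,
}

/// Tag the bad-cert probe entry so a reader of the JSON can tell an expected
/// rejection from an unexpected acceptance. Returns true when the probe
/// behaved as expected (the bad cert was rejected).
fn annotate_bad_cert_result(res: &mut CommandResult) -> bool {
    let rejected = !res.success;
    res.xpath = format!("{} [BAD_CERT_TEST]", res.xpath);
    let note = if rejected {
        "EXPECTED: bad cert was rejected"
    } else {
        "UNEXPECTED: bad cert was accepted"
    };
    // Put the note first and keep any existing error text after it.
    res.error = Some(match res.error.take() {
        Some(existing) => format!("{note}; {existing}"),
        None => note.to_string(),
    });
    // As in the snippet above, this entry never reports success on its own.
    res.success = false;
    rejected
}
```

The boolean return value lets the caller decide whether to switch the HTTP status to the 500 line, while the tagged `xpath` and prefixed `error` make the JSON entry self-explanatory.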

Comment on lines +49 to +60
// BAD (expected fail) probe
const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";

// GOOD (expected success) probe
const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";

Copilot AI Dec 4, 2025

The hardcoded certificate paths and target names (DEFAULT_BAD_CA, DEFAULT_BAD_CERT, etc.) appear to be environment-specific and may not exist on all systems. Consider making these configurable via environment variables (similar to how other probes can be configured) or Redis configuration to make the probe more flexible across different deployments.

Suggested change
- // BAD (expected fail) probe
- const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
- const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
- const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
- const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";
- // GOOD (expected success) probe
- const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
- const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
- const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
- const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
+ // BAD (expected fail) probe environment variable names
+ const BAD_CA_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_CA";
+ const BAD_CERT_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_CERT";
+ const BAD_KEY_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_KEY";
+ const BAD_TNAME_ENV_VAR: &str = "TELEMETRY_WATCHDOG_BAD_TNAME";
+ // GOOD (expected success) probe environment variable names
+ const GOOD_CA_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_CA";
+ const GOOD_CERT_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_CERT";
+ const GOOD_KEY_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_KEY";
+ const GOOD_TNAME_ENV_VAR: &str = "TELEMETRY_WATCHDOG_GOOD_TNAME";
+ // BAD (expected fail) probe defaults
+ const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
+ const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
+ const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
+ const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";
+ // GOOD (expected success) probe defaults
+ const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
+ const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
+ const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
+ const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
+ // Helper functions to get BAD/GOOD probe values from env or default
+ fn get_bad_ca() -> String {
+     std::env::var(BAD_CA_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_CA.to_string())
+ }
+ fn get_bad_cert() -> String {
+     std::env::var(BAD_CERT_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_CERT.to_string())
+ }
+ fn get_bad_key() -> String {
+     std::env::var(BAD_KEY_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_KEY.to_string())
+ }
+ fn get_bad_tname() -> String {
+     std::env::var(BAD_TNAME_ENV_VAR).unwrap_or_else(|_| DEFAULT_BAD_TNAME.to_string())
+ }
+ fn get_good_ca() -> String {
+     std::env::var(GOOD_CA_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_CA.to_string())
+ }
+ fn get_good_cert() -> String {
+     std::env::var(GOOD_CERT_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_CERT.to_string())
+ }
+ fn get_good_key() -> String {
+     std::env::var(GOOD_KEY_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_KEY.to_string())
+ }
+ fn get_good_tname() -> String {
+     std::env::var(GOOD_TNAME_ENV_VAR).unwrap_or_else(|_| DEFAULT_GOOD_TNAME.to_string())
+ }
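
The eight near-identical getters in this suggestion could also be collapsed into one helper. The following is a rough sketch of that alternative, not part of the suggestion itself: `CertProbeConfig`, `resolve_with`, `env_or`, and `bad_probe_config` are hypothetical names, and treating a blank value as unset is an added assumption (the getters above would return the blank value).

```rust
use std::env;

/// Credentials for one certificate probe (hypothetical grouping for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct CertProbeConfig {
    ca: String,
    cert: String,
    key: String,
    target_name: String,
}

/// Pick the looked-up value if it is present and non-blank, else the default.
/// The lookup is passed in so the logic can be exercised without touching
/// the process environment.
fn resolve_with<F>(lookup: F, name: &str, default: &str) -> String
where
    F: Fn(&str) -> Option<String>,
{
    lookup(name)
        .filter(|v| !v.trim().is_empty())
        .unwrap_or_else(|| default.to_string())
}

/// Environment-backed variant of `resolve_with`.
fn env_or(name: &str, default: &str) -> String {
    resolve_with(|k| env::var(k).ok(), name, default)
}

/// Build the BAD probe settings from the variable names proposed above,
/// falling back to the defaults already present in the PR.
fn bad_probe_config() -> CertProbeConfig {
    CertProbeConfig {
        ca: env_or("TELEMETRY_WATCHDOG_BAD_CA", "/etc/sonic/telemetry/dsmsroot.cer"),
        cert: env_or(
            "TELEMETRY_WATCHDOG_BAD_CERT",
            "/etc/sonic/telemetry/streamingtelemetryserver.cer",
        ),
        key: env_or(
            "TELEMETRY_WATCHDOG_BAD_KEY",
            "/etc/sonic/telemetry/streamingtelemetryserver.key",
        ),
        target_name: env_or("TELEMETRY_WATCHDOG_BAD_TNAME", "server.ndastreaming.ap.gbl"),
    }
}
```

A matching `good_probe_config` would follow the same pattern with the `GOOD` names and defaults. Grouping the four values in one struct also keeps the CA, certificate, key, and target name of a probe from being mixed across probes.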

// 1) BAD cert: expect failure
// 2) GOOD cert: expect success
if is_cert_probe_enabled() {
    let xpath_rc = "reboot-cause/history";

Copilot AI Dec 4, 2025

[nitpick] The variable name xpath_rc is ambiguous. While it appears to stand for "reboot-cause", the abbreviation "rc" could also mean "return code" or "release candidate". Consider using a more descriptive name like xpath_reboot_cause for clarity.

Comment on lines +49 to +59
// BAD (expected fail) probe
const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";

// GOOD (expected success) probe
const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";

Copilot AI Dec 4, 2025

[nitpick] The naming convention "BAD" and "GOOD" for certificate test constants could be misinterpreted. "BAD" doesn't mean the certificate is malformed or invalid—it means we expect authentication to fail with these credentials. Consider more descriptive names like DEFAULT_UNAUTHORIZED_CA and DEFAULT_AUTHORIZED_CA (or _REJECTED_ and _ACCEPTED_) to clarify the intended test behavior rather than the quality of the certificates themselves.

Suggested change
- // BAD (expected fail) probe
- const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
- const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
- const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
- const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";
- // GOOD (expected success) probe
- const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
- const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
- const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
- const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";
+ // UNAUTHORIZED (expected fail) probe
+ const DEFAULT_UNAUTHORIZED_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
+ const DEFAULT_UNAUTHORIZED_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
+ const DEFAULT_UNAUTHORIZED_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
+ const DEFAULT_UNAUTHORIZED_TNAME: &str = "server.ndastreaming.ap.gbl";
+ // AUTHORIZED (expected success) probe
+ const DEFAULT_AUTHORIZED_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
+ const DEFAULT_AUTHORIZED_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
+ const DEFAULT_AUTHORIZED_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
+ const DEFAULT_AUTHORIZED_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";

Comment on lines +49 to +53
// BAD (expected fail) probe
const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";

Copilot AI Dec 4, 2025

[nitpick] The comment "BAD (expected fail) probe" could be clearer. Consider expanding it to explain WHY this probe is expected to fail (e.g., "BAD cert probe: uses unauthorized credentials to verify authentication rejection works correctly"). This would help future maintainers understand the testing strategy.

Comment on lines +320 to +325
fn is_cert_probe_enabled() -> bool {
    match env::var(CERT_PROBE_ENV_VAR) {
        Ok(v) if v.eq_ignore_ascii_case("false") => false,
        _ => true, // default enabled
    }
}

Copilot AI Dec 4, 2025

The cert probe is enabled by default and relies on hardcoded file paths (e.g., /etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem). If these files don't exist in all deployment environments, the probe will fail on every health check, which could cause monitoring alerts. Consider either: 1) making the probe default to disabled (like the serial number probe), or 2) checking if the certificate files exist before running the probe, or 3) documenting the required file locations clearly.
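
Option 2 could look roughly like the sketch below, which skips the probe with a reason when any required file is absent instead of failing the health check. This is only an illustration under stated assumptions: `ProbeDecision`, `missing_files`, and `decide_cert_probe` are hypothetical names, and whether a skipped probe should be reported as healthy is a policy choice the PR would have to make.

```rust
use std::path::Path;

/// What the watchdog should do with the cert probe on this run
/// (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum ProbeDecision {
    Run,
    Skip(String),
}

/// Return the subset of `paths` that are not regular files.
fn missing_files<'a>(paths: &[&'a str]) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| !Path::new(p).is_file())
        .collect()
}

/// Run the probe only when it is enabled and every required file is present.
fn decide_cert_probe(enabled: bool, required: &[&str]) -> ProbeDecision {
    if !enabled {
        return ProbeDecision::Skip("cert probe disabled".to_string());
    }
    let missing = missing_files(required);
    if missing.is_empty() {
        ProbeDecision::Run
    } else {
        ProbeDecision::Skip(format!("missing files: {}", missing.join(", ")))
    }
}
```

In the watchdog, `required` would be the CA, certificate, and key paths of both probes, and `enabled` would come from `is_cert_probe_enabled()`. Recording the skip reason in the output keeps a missing credential visible without turning it into a recurring alert.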

let timeout = read_timeout();
let target_name = get_target_name();

// Certificate probes on reboot-cause/history API

Copilot AI Dec 4, 2025

[nitpick] The comment describes "Certificate probes on reboot-cause/history API" but doesn't explain the purpose or rationale. Consider adding context about why this specific endpoint is used for certificate testing (e.g., "Tests certificate validation using the reboot-cause/history endpoint as a representative gNMI query").

Suggested change
- // Certificate probes on reboot-cause/history API
+ // Certificate probes using the reboot-cause/history API.
+ // This endpoint is used as a representative gNMI query to test certificate validation.

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI review requested due to automatic review settings December 5, 2025 03:05
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


// 1) BAD cert: expect failure
// 2) GOOD cert: expect success
if is_cert_probe_enabled() {
    let xpath_rc = "reboot-cause/history";

Copilot AI Dec 5, 2025

[nitpick] Variable naming inconsistency: xpath_rc is used for the reboot-cause xpath, but other similar variables in the codebase use more descriptive names like xpath_sn (line 443). Consider renaming to something more descriptive like xpath_reboot_cause or reboot_cause_xpath for better clarity and consistency with the naming pattern used elsewhere in the function.

Comment on lines +422 to +425
res_bad.error = Some(match res_bad.error.take() {
    Some(existing) => format!("{existing}; {msg}"),
    None => msg,
});

Copilot AI Dec 5, 2025

[nitpick] The error message construction could be simplified and made clearer. Instead of using take() and matching on Some(existing), consider a more straightforward approach: set res_bad.error = Some(msg) if there's no existing error, or prepend the message if there is one. Alternatively, since this is checking for an unexpected success, the original error (if any) from a successful command is likely not relevant - the probe succeeded when it should have failed, so replacing rather than appending might be more appropriate.

Suggested change
- res_bad.error = Some(match res_bad.error.take() {
-     Some(existing) => format!("{existing}; {msg}"),
-     None => msg,
- });
+ res_bad.error = Some(msg);
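
The two behaviours being compared here can be put side by side. This is a small sketch with hypothetical helper names (`append_error`, `replace_error`); it only illustrates the difference in what ends up in the `error` field and is not code from the PR.

```rust
/// Current behaviour: keep any existing error text and append the new message.
fn append_error(existing: Option<String>, msg: &str) -> Option<String> {
    Some(match existing {
        Some(prev) => format!("{prev}; {msg}"),
        None => msg.to_string(),
    })
}

/// Suggested behaviour: discard whatever was there and keep only the new message.
fn replace_error(_existing: Option<String>, msg: &str) -> Option<String> {
    Some(msg.to_string())
}
```

The two differ only when the command succeeded yet still carried error text. Appending preserves that text for debugging, while replacing yields a shorter and more predictable message.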

Comment on lines +47 to +59
const CERT_PROBE_ENV_VAR: &str = "TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED";

// BAD (expected fail) probe
const DEFAULT_BAD_CA: &str = "/etc/sonic/telemetry/dsmsroot.cer";
const DEFAULT_BAD_CERT: &str = "/etc/sonic/telemetry/streamingtelemetryserver.cer";
const DEFAULT_BAD_KEY: &str = "/etc/sonic/telemetry/streamingtelemetryserver.key";
const DEFAULT_BAD_TNAME: &str = "server.ndastreaming.ap.gbl";

// GOOD (expected success) probe
const DEFAULT_GOOD_CA: &str = "/etc/sonic/credentials/AME_ROOT_CERTIFICATE.pem";
const DEFAULT_GOOD_CERT: &str = "/etc/sonic/credentials/sonick8sclient2.crt";
const DEFAULT_GOOD_KEY: &str = "/etc/sonic/credentials/sonick8sclient2.key";
const DEFAULT_GOOD_TNAME: &str = "SonicK8sDashboard.NETWORK-test-bl6p.bl6p.ap.gbl";

Copilot AI Dec 5, 2025

Missing documentation for the new certificate probe feature. Consider adding a comment block before the constants (lines 47-59) explaining the purpose of these certificate probes, what "BAD" and "GOOD" mean in this context, and when/why this feature should be enabled or disabled. This would help future maintainers understand the testing strategy being implemented.

Comment on lines +406 to +408
// Certificate probes on reboot-cause/history API
// 1) BAD cert: expect failure
// 2) GOOD cert: expect success

Copilot AI Dec 5, 2025

[nitpick] The comment states "Certificate probes on reboot-cause/history API" but doesn't explain the testing strategy. The comment should clarify that this is testing certificate validation by: 1) Attempting to connect with known invalid certificates (expecting failure) and 2) Attempting to connect with valid certificates (expecting success). This helps readers understand that this is a negative/positive test pair for certificate authentication.

Suggested change
- // Certificate probes on reboot-cause/history API
- // 1) BAD cert: expect failure
- // 2) GOOD cert: expect success
+ // Certificate validation probes on reboot-cause/history API.
+ // This tests certificate authentication with a negative/positive test pair:
+ // 1) Attempt to connect with a known invalid (BAD) certificate and expect failure (negative test).
+ // 2) Attempt to connect with a known valid (GOOD) certificate and expect success (positive test).

2 participants