[Cosmos] Health checks spec for gateway endpoints #3866
tvaron3 wants to merge 11 commits into Azure:release/azure_data_cosmos-previews
Conversation
Spec for proactive endpoint health checking in the driver, addressing Azure#3846. Key design:

- `HealthProbe` trait abstraction makes it trivial to swap `GetDatabaseAccount` for the upcoming gateway health API
- `EndpointHealthMonitor` runs background sweeps with concurrent probing across all regional endpoints
- Integrates with `AccountEndpointStateStore` from the Transport Pipeline Spec for automatic routing updates
- Configurable via `HealthCheckOptions` and env var overrides

Co-authored-by: Copilot <[email protected]>
- Replace `reason: String` with structured `ProbeFailure` (status code, sub-status)
- Use transport-layer timeout instead of `tokio::time::timeout` wrapper
- Remove `probe_on_startup` toggle — always probe on startup (non-blocking)
- Remove custom `probe_with_retry`; use `ClientRetryPolicy` with probe-specific config
- Add `ExecutionContext::HealthCheckProbe` variant separate from PPCB
- Remove all env var overrides (§5.3) — config is programmatic-only
- Startup probe runs in background, does not block client creation
- Resolve Open Questions #1 (non-blocking) and #2 (keep enabled for emulator)

Co-authored-by: Copilot <[email protected]>
Remove branch name from spec header to avoid cspell failure on username. Add RAII to cosmos dictionary. Co-authored-by: Copilot <[email protected]>
Update spec to match the pipeline architecture established in PR Azure#3875 (driver-transport step 01):

- Replace old azure_core `Pipeline`/`Context`/`Policy` API with `execute_transport_pipeline()` and `get_metadata_http_client()`
- Replace `CosmosEndpoint` (non-existent) with `AccountEndpoint`
- Replace `Arc<dyn Authorization>` with `Credential` enum
- Replace `ClientRetryPolicy` references with actual retry architecture (probe-level loop + transport 429 retry)
- Use `CosmosStatus` instead of raw u16/u32 in `ProbeFailure`
- Mark Step 2 dependencies (`AccountEndpointStateStore` etc.) as pseudocode with prominent callout
- Fix `mark_endpoint_available` description (both functions new)
- Update migration plan to reference new pipeline functions
- Defer per-request timeout (uses client-level metadata timeout)

Co-authored-by: Copilot <[email protected]>
Pull request overview
Adds a new design/spec document for proactive gateway endpoint health checking in azure_data_cosmos_driver, describing a background probing system that feeds into the endpoint routing state so operations can avoid unhealthy regions.
Changes:
- Added a full health checks specification covering probe abstraction, scheduling, retries, state integration, and configuration.
- Updated the Cosmos dictionary to allow the term “RAII”.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos_driver/docs/HEALTH_CHECKS_SPEC.md | New spec describing the proposed health-check architecture and integration with routing state. |
| sdk/cosmos/.dict.txt | Adds “RAII” to the spellcheck dictionary. |
- Remove `probe_timeout` (deferred to PR Azure#3871 per-request timeout)
- Rename `max_probe_retries` to `max_probe_attempts`, fix off-by-one semantics with exclusive range (3 = 3 total attempts); primitives for runtime-agnostic design
- Store `CancellationToken` in `EndpointHealthMonitor`, implement `Drop` to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint `mark_available`/`mark_unavailable` calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars, non-blocking startup, no optional blocking)

Co-authored-by: Copilot <[email protected]>
analogrelay left a comment
Some initial thoughts. It's a great start and I don't see any fundamental issues!
> ### Design Constraint: Swappable Probe API
>
> The Cosmos DB gateway team is building a dedicated health endpoint. Until that ships, we use
Is it possible for us to detect when this endpoint is available? Ideally, we'd like to just start using it when it becomes available instead of requiring users to update their SDK
The discussed approach was for the service team to create the endpoint (which might initially just redirect to getaccount internally) - but even that might take a few weeks until it is deployed
But we expect to have the endpoint at the expected URL by GA? If so, then we don't need the trait/enum at all. We can just change the endpoint we use when it's fully deployed.
Let us double-check - I don't think there is a clear ETA - and deployment in non-public clouds usually takes 3+ months after public clouds - so, by the GA timeline we probably cannot rely yet on the endpoint being there.
Another option is a discovery approach, such as Get Account Properties returning a healthCheckEndpoints collection or similar. Since we're going to run Get Account Properties before initializing the driver (as part of async initialization) we could check for that value and use it if present, otherwise fall back to just using Get Account Properties itself.
Basically, I'm just trying to see if we can make this Just Start Working ™️ when the service rolls it out instead of requiring an SDK update.
IMO the discovery approach via a health endpoint is over-engineering and also requires service deployments. My 2 cents - the pragmatic approach is what Tomas is proposing in the PR. If we want something unified by GA to avoid the eventual SDK-update necessity, we need to push the service team to accelerate providing the health endpoint. Everything else just over-complicates for some very short-term gain. I am fine with requiring an SDK update as long as it does not change any public API, because I expect customers using the GA version to upgrade relatively quickly initially anyway.
> When the dedicated gateway health endpoint ships, the migration is:
>
> 1. Create `GatewayHealthProbe` implementing `HealthProbe`:
It would be ideal if the gateway health probe endpoints could be returned by Get Account Properties or something similar. That way we can select an appropriate strategy during driver initialization
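The selection logic discussed in this thread might look roughly like the following sketch. `ProbeStrategy`, `select_probe_strategy`, and the idea of a `healthCheckEndpoints` collection in Get Account Properties are all hypothetical — nothing here is an agreed service contract:

```rust
/// Hypothetical discovery-based probe selection: if Get Account Properties
/// ever advertises dedicated health endpoints, prefer them; otherwise fall
/// back to using Get Account Properties itself as the probe.
#[derive(Debug, PartialEq)]
enum ProbeStrategy {
    /// Dedicated gateway health endpoints advertised by the service.
    GatewayHealth(Vec<String>),
    /// Fallback: reuse Get Account Properties as the probe.
    GetDatabaseAccount,
}

/// `health_check_endpoints` would come from the account-properties response
/// fetched during async driver initialization (field name is an assumption).
fn select_probe_strategy(health_check_endpoints: Option<Vec<String>>) -> ProbeStrategy {
    match health_check_endpoints {
        Some(eps) if !eps.is_empty() => ProbeStrategy::GatewayHealth(eps),
        _ => ProbeStrategy::GetDatabaseAccount,
    }
}
```

Because the strategy is chosen per initialization, the driver would pick up the new endpoint on the next client creation after the service rolls it out, with no SDK update.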
FabianMeiswinkel left a comment
Deep Review — Health Checks Spec for Gateway Endpoints
Well-structured spec with solid design thinking. The HealthProbe trait abstraction and DOP-aligned pure function approach for state mutations are highlights. Reviewed against the Transport Pipeline Spec and Python SDK implementation for cross-spec consistency and parity.
13 findings: 1 blocking, 7 recommendations, 3 suggestions, 2 observations.
Key Issues
- 🔴 Retry logic gap — §3.5 defines the 3-attempt retry loop, but §3.6's concurrent probing calls `probe.probe(&ep)` only once per endpoint.
- 🟡 Cross-spec gaps — `mark_endpoint_available` and the `AccountEndpoint` type naming are not aligned with the Transport Pipeline Spec.
- 🟡 Underspecified — the `backoff_delay()` algorithm and the `HealthMonitorConfig` ↔ `HealthCheckOptions` mapping.
- 🟡 Architecture — the health monitor bypasses the `LocationStateStore.apply()` single-writer model; the independent sweep loop diverges from the Python SDK's piggybacked approach (intentional improvement, but should be documented).
See inline comments for full details on each finding.
Major design changes from Ashley and Fabian's review:

- Replace `HealthProbe` trait with enum (DOP guidance)
- Use `execute_operation` for probes (reuses full pipeline)
- Route results through `LocationEffect`/`LocationStateStore` (single-writer model, no snapshot+swap lost updates)
- Use `BackgroundTaskManager` for drop-based cancellation (no `CancellationToken`, no `JoinHandle`)
- Use `FuturesUnordered` for concurrent probing (process results as they arrive, runtime-agnostic)
- Carry `azure_core::Error` in `ProbeFailure` (defer formatting)
- Track `consecutive_failures` per endpoint across sweeps
- Add jitter to sweep interval (thundering herd prevention)
- Define backoff algorithm (500ms/2x/3min cap/25% jitter)
- Inline retry loop in `probe_with_retry` helper
- Use `CosmosEndpoint` (Transport Spec canonical type)
- Remove public `HealthCheckOptions` (all config internal)
- Remove `HealthMonitorConfig` (collapsed, hardcoded defaults)
- Remove `account_metadata_cache` (unused param)
- Remove expiration-based recovery (health checks own it)
- Add healthy-by-default design principle
- Add Python SDK divergence documentation
- Global endpoint: topology only, never probed
- Fix ASCII diagram alignment
- Use `FaultInjection` for testing (not `MockHealthProbe`)
- Logging: `warn!` failures, `info!` recovery, no routine OK

Co-authored-by: Copilot <[email protected]>
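The backoff parameters named in this commit (500ms initial, 2x growth, 3-minute cap, 25% jitter) and the inlined retry loop could be sketched as below. The function names echo the commit message; the jitter-as-parameter shape is an assumption made to keep the sketch deterministic, not the spec's actual signature:

```rust
use std::time::Duration;

/// Backoff per the stated algorithm: 500 ms initial delay, doubling each
/// attempt, capped at 3 minutes. `jitter` is a caller-supplied fraction in
/// [0.0, 0.25); a real implementation would draw it from an RNG.
fn backoff_delay(attempt: u32, jitter: f64) -> Duration {
    const INITIAL_MS: u64 = 500;
    const CAP_MS: u64 = 180_000; // 3 minutes

    let base = INITIAL_MS.saturating_mul(1u64 << attempt.min(20)).min(CAP_MS);
    Duration::from_millis((base as f64 * (1.0 + jitter)) as u64)
}

/// Inlined retry loop: up to `max_attempts` total probes of one endpoint.
/// The async sleep between failures is elided; real code would await the
/// computed delay before the next attempt.
fn probe_with_retry<E>(
    max_attempts: u32,
    mut probe: impl FnMut() -> Result<(), E>,
) -> Result<(), E> {
    let mut last_err = None;
    for attempt in 0..max_attempts {
        match probe() {
            Ok(()) => return Ok(()),
            Err(e) => {
                let _delay = backoff_delay(attempt, 0.0); // sleep(_delay) in real code
                last_err = Some(e);
            }
        }
    }
    Err(last_err.expect("max_attempts must be > 0"))
}
```

Note the exclusive range `0..max_attempts` gives exactly `max_attempts` total attempts, matching the "3 = 3 total attempts" semantics fixed in an earlier commit.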
Upstream PR Azure#3824 moved .dict.txt words into .cspell.json. Accept deletion of .dict.txt and add our new words (RAII, deprioritize, bursty) to sdk/cosmos/.cspell.json. Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Add blank lines before lists and fenced code blocks per MD031/MD032 rules. Co-authored-by: Copilot <[email protected]>
Add Markdown Linting section to AGENTS.md with markdownlint command and add step 6 to the pre-completion checklist for .md files. Also update stale .dict.txt reference to .cspell.json (migrated in PR Azure#3824). Co-authored-by: Copilot <[email protected]>
Rebuilt architecture overview diagram using ASCII characters with programmatically verified uniform line widths. Both markdownlint and cspell pass clean. Co-authored-by: Copilot <[email protected]>
Summary
Spec for proactive gateway endpoint health checking in `azure_data_cosmos_driver`. Addresses #3846.

Design
The health check system probes regional gateway endpoints in the background to detect unavailability and recovery, feeding results into the endpoint routing system so requests avoid unhealthy regions.
Key components
- `HealthProbe` trait — swappable probe abstraction. The current implementation uses `GetDatabaseAccount` (lightweight metadata read). When the dedicated gateway health API ships, only the probe implementation changes — orchestration, scheduling, state management, and config remain untouched.
- `GetDatabaseAccountProbe` — reuses the existing `CosmosTransport` metadata pipeline for authenticated `GET /` calls with per-probe timeout.
- `EndpointHealthMonitor` — background `tokio` task that sweeps all regional endpoints concurrently every 5 minutes (configurable). Each probe has exponential backoff retry (3 attempts, 500ms initial delay).
- `AccountEndpointStateStore` from the Transport Pipeline Spec (§4.4) — healthy probes remove endpoints from the unavailable set; failed probes add them. No changes needed in `execute_operation` — routing automatically respects the updated state.
- `HealthCheckOptions` — public configuration with env var overrides (`AZURE_COSMOS_HEALTH_CHECK_*`).

Python SDK alignment
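The trait-based abstraction the summary describes could look roughly like this. Only `HealthProbe`, `GetDatabaseAccountProbe`, and `GatewayHealthProbe` come from the spec; the method signatures, paths, and `ProbeFailure` shape are illustrative (and later review moved the design from a trait to an enum):

```rust
/// Illustrative probe-failure shape (the spec uses structured status codes).
#[derive(Debug)]
struct ProbeFailure {
    status: u16,
    sub_status: Option<u32>,
}

trait HealthProbe {
    /// Request path this probe issues against a regional endpoint.
    fn probe_path(&self, endpoint: &str) -> String;
    /// Probe one endpoint; Ok(()) means healthy. A real implementation
    /// performs an authenticated HTTP call via the transport pipeline.
    fn probe(&self, endpoint: &str) -> Result<(), ProbeFailure>;
}

/// Current probe: lightweight metadata read (`GET /`).
struct GetDatabaseAccountProbe;

impl HealthProbe for GetDatabaseAccountProbe {
    fn probe_path(&self, endpoint: &str) -> String {
        format!("{endpoint}/")
    }
    fn probe(&self, _endpoint: &str) -> Result<(), ProbeFailure> {
        Ok(()) // stub: real code executes the metadata request
    }
}

/// Future probe once the gateway health API ships; the `/health` path is
/// an assumption, not a confirmed service contract.
struct GatewayHealthProbe;

impl HealthProbe for GatewayHealthProbe {
    fn probe_path(&self, endpoint: &str) -> String {
        format!("{endpoint}/health")
    }
    fn probe(&self, _endpoint: &str) -> Result<(), ProbeFailure> {
        Ok(())
    }
}
```

Swapping implementations leaves the monitor's orchestration code untouched, which is the point of the abstraction.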
Defaults match the Python SDK: 5-minute refresh interval, 3 probe retries with 500ms initial backoff, concurrent probing of read + write endpoints. The partition-level circuit breaker is out of scope (covered separately in Transport Pipeline Spec §4.5).
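As a rough sketch, the Python-parity defaults above could be captured in a config struct like the following. The struct and field names are assumptions for illustration (later review actually removed the public options type in favor of hardcoded internal defaults):

```rust
use std::time::Duration;

/// Illustrative defaults mirroring the Python SDK parity described above.
#[derive(Debug, Clone, PartialEq)]
struct HealthCheckDefaults {
    refresh_interval: Duration,
    max_probe_attempts: u32,
    initial_backoff: Duration,
}

impl Default for HealthCheckDefaults {
    fn default() -> Self {
        Self {
            refresh_interval: Duration::from_secs(300),  // 5-minute sweep
            max_probe_attempts: 3,                       // 3 total attempts
            initial_backoff: Duration::from_millis(500), // first retry delay
        }
    }
}
```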
Open questions
5 questions flagged for discussion in the spec (startup blocking vs background, emulator behavior, metadata refresh piggybacking, gateway API contract, options visibility).
Files changed
- `sdk/cosmos/azure_data_cosmos_driver/docs/HEALTH_CHECKS_SPEC.md` — full spec with component design, integration points, configuration, testing strategy, and 10-step migration plan