Skip to content

[Cosmos] Health checks spec for gateway endpoints#3866

Closed
tvaron3 wants to merge 11 commits intoAzure:release/azure_data_cosmos-previewsfrom
tvaron3:tvaron3/healthChecksSpec
Closed

[Cosmos] Health checks spec for gateway endpoints#3866
tvaron3 wants to merge 11 commits intoAzure:release/azure_data_cosmos-previewsfrom
tvaron3:tvaron3/healthChecksSpec

Conversation

@tvaron3
Copy link
Member

@tvaron3 tvaron3 commented Mar 5, 2026

Summary

Spec for proactive gateway endpoint health checking in azure_data_cosmos_driver. Addresses #3846.

Design

The health check system probes regional gateway endpoints in the background to detect unavailability and recovery, feeding results into the endpoint routing system so requests avoid unhealthy regions.

Key components

  • HealthProbe trait — swappable probe abstraction. Current implementation uses GetDatabaseAccount (lightweight metadata read). When the dedicated gateway health API ships, only the probe implementation changes — orchestration, scheduling, state management, and config remain untouched.
  • GetDatabaseAccountProbe — reuses the existing CosmosTransport metadata pipeline for authenticated GET / calls with per-probe timeout.
  • EndpointHealthMonitor — background tokio task that sweeps all regional endpoints concurrently every 5 minutes (configurable). Each probe has exponential backoff retry (3 attempts, 500ms initial delay).
  • Endpoint selection — probes all read endpoints + first write endpoint (deduplicated), matching the Python SDK pattern.
  • State integration — writes to the same AccountEndpointStateStore from the Transport Pipeline Spec (§4.4). Healthy probes remove endpoints from the unavailable set; failed probes add them. No changes needed in execute_operation — routing automatically respects the updated state.
  • Startup probing — optional blocking sweep during driver init to verify endpoints before accepting requests.
  • HealthCheckOptions — public configuration with env var overrides (AZURE_COSMOS_HEALTH_CHECK_*).

Python SDK alignment

Defaults match the Python SDK: 5-minute refresh interval, 3 probe retries with 500ms initial backoff, concurrent probing of read + write endpoints. The partition-level circuit breaker is out of scope (covered separately in Transport Pipeline Spec §4.5).

Open questions

5 questions flagged for discussion in the spec (startup blocking vs background, emulator behavior, metadata refresh piggybacking, gateway API contract, options visibility).

Files changed

  • sdk/cosmos/azure_data_cosmos_driver/docs/HEALTH_CHECKS_SPEC.md — full spec with component design, integration points, configuration, testing strategy, and 10-step migration plan

Spec for proactive endpoint health checking in the driver,
addressing Azure#3846. Key design:

- HealthProbe trait abstraction makes it trivial to swap
  GetDatabaseAccount for the upcoming gateway health API
- EndpointHealthMonitor runs background sweeps with concurrent
  probing across all regional endpoints
- Integrates with AccountEndpointStateStore from the Transport
  Pipeline Spec for automatic routing updates
- Configurable via HealthCheckOptions and env var overrides

Co-authored-by: Copilot <[email protected]>
- Replace reason:String with structured ProbeFailure (status code, sub-status)
- Use transport-layer timeout instead of tokio::time::timeout wrapper
- Remove probe_on_startup toggle — always probe on startup (non-blocking)
- Remove custom probe_with_retry; use ClientRetryPolicy with probe-specific config
- Add ExecutionContext::HealthCheckProbe variant separate from PPCB
- Remove all env var overrides (§5.3) — config is programmatic-only
- Startup probe runs in background, does not block client creation
- Resolve Open Questions #1 (non-blocking) and #2 (keep enabled for emulator)

Co-authored-by: Copilot <[email protected]>
@tvaron3 tvaron3 self-assigned this Mar 9, 2026
Remove branch name from spec header to avoid cspell failure
on username. Add RAII to cosmos dictionary.

Co-authored-by: Copilot <[email protected]>
@tvaron3 tvaron3 linked an issue Mar 9, 2026 that may be closed by this pull request
Update spec to match the pipeline architecture established in
PR Azure#3875 (driver-transport step 01):

- Replace old azure_core Pipeline/Context/Policy API with
  execute_transport_pipeline() and get_metadata_http_client()
- Replace CosmosEndpoint (non-existent) with AccountEndpoint
- Replace Arc<dyn Authorization> with Credential enum
- Replace ClientRetryPolicy references with actual retry
  architecture (probe-level loop + transport 429 retry)
- Use CosmosStatus instead of raw u16/u32 in ProbeFailure
- Mark Step 2 dependencies (AccountEndpointStateStore etc.)
  as pseudocode with prominent callout
- Fix mark_endpoint_available description (both functions new)
- Update migration plan to reference new pipeline functions
- Defer per-request timeout (uses client-level metadata timeout)

Co-authored-by: Copilot <[email protected]>
@tvaron3 tvaron3 marked this pull request as ready for review March 9, 2026 20:10
@tvaron3 tvaron3 requested a review from a team as a code owner March 9, 2026 20:10
Copilot AI review requested due to automatic review settings March 9, 2026 20:10
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new design/spec document for proactive gateway endpoint health checking in azure_data_cosmos_driver, describing a background probing system that feeds into the endpoint routing state so operations can avoid unhealthy regions.

Changes:

  • Added a full health checks specification covering probe abstraction, scheduling, retries, state integration, and configuration.
  • Updated the Cosmos dictionary to allow the term “RAII”.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
sdk/cosmos/azure_data_cosmos_driver/docs/HEALTH_CHECKS_SPEC.md New spec describing the proposed health-check architecture and integration with routing state.
sdk/cosmos/.dict.txt Adds “RAII” to the spellcheck dictionary.

- Remove probe_timeout (deferred to PR Azure#3871 per-request timeout)
- Rename max_probe_retries to max_probe_attempts, fix off-by-one
  semantics with exclusive range (3 = 3 total attempts)
  primitives for runtime-agnostic design
- Store CancellationToken in EndpointHealthMonitor, implement
  Drop to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint mark_available/
  mark_unavailable calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars,
  non-blocking startup, no optional blocking)

Co-authored-by: Copilot <[email protected]>
Copy link
Member

@analogrelay analogrelay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial thoughts. It's a great start and I don't see any fundamental issues!


### Design Constraint: Swappable Probe API

The Cosmos DB gateway team is building a dedicated health endpoint. Until that ships, we use
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for us to detect when this endpoint is available? Ideally, we'd like to just start using it when it becomes available instead of requiring users to update their SDK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discussed approach was for the service team to create the endpoint (which might initially just redirect to getaccount internally) - but even that might take a few weeks until it is deployed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we expect to have the endpoint at the expected URL by GA? If so, then we don't need the trait/enum at all. We can just change the endpoint we use when it's fully deployed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us double-check - I don't think there is a clear ETA - and deployment in non-public clouds usually takes 3+ months after publci clouds - so, by GA timeline we probabyl cannot rely yet on the endpoint being there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is a discovery approach, such as Get Account Properties returning a healthCheckEndpoints collection or similar. Since we're going to run Get Account Properties before initializing the driver (as part of async initialization) we could check for that value and use it if present, otherwise fall back to just using Get Account Properties itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, I'm just trying to see if we can make this Just Start Working ™️ when the service rolls it out instead of requiring an SDK update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO discovery approach via health endpoit is over-engineering and requires also service deployments. My 2 cents - the pragmatic approach is what Tomas is proposing in the PR - if we want something unified by GA to avoid teh eventual SDK update enecssity we need to push service team to accelerate providing the health endpoint. Everything else just over-complicates for some very short-term gain. I am fine with requiring an SDK update as long as it does not change any public API because I expect customers using GA version to upgarde relatively quickly initially anyway.


When the dedicated gateway health endpoint ships, the migration is:

1. Create `GatewayHealthProbe` implementing `HealthProbe`:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be ideal if the gateway health probe endpoints could be returned by Get Account Properties or something similar. That way we can select an appropriate strategy during driver initialization

@github-project-automation github-project-automation bot moved this from Todo to Changes Requested in CosmosDB Go/Rust Crew Mar 9, 2026
Copy link
Member

@FabianMeiswinkel FabianMeiswinkel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep Review — Health Checks Spec for Gateway Endpoints

Well-structured spec with solid design thinking. The HealthProbe trait abstraction and DOP-aligned pure function approach for state mutations are highlights. Reviewed against the Transport Pipeline Spec and Python SDK implementation for cross-spec consistency and parity.

13 findings: 1 blocking, 7 recommendations, 3 suggestions, 2 observations.

Key Issues

  1. 🔴 Retry logic gap — §3.5 defines the 3-attempt retry loop but §3.6's concurrent probing calls probe.probe(&ep) only once per endpoint.
  2. 🟡 Cross-spec gapsmark_endpoint_available and AccountEndpoint type naming not aligned with Transport Pipeline Spec.
  3. 🟡 Underspecifiedbackoff_delay() algorithm, HealthMonitorConfigHealthCheckOptions mapping.
  4. 🟡 Architecture — Health monitor bypasses LocationStateStore.apply() single-writer model; independent sweep loop diverges from Python SDK's piggybacked approach (intentional improvement, but should be documented).

See inline comments for full details on each finding.

tvaron3 and others added 6 commits March 13, 2026 10:39
Major design changes from Ashley and Fabian's review:

- Replace HealthProbe trait with enum (DOP guidance)
- Use execute_operation for probes (reuses full pipeline)
- Route results through LocationEffect/LocationStateStore
  (single-writer model, no snapshot+swap lost updates)
- Use BackgroundTaskManager for drop-based cancellation
  (no CancellationToken, no JoinHandle)
- Use FuturesUnordered for concurrent probing (process
  results as they arrive, runtime-agnostic)
- Carry azure_core::Error in ProbeFailure (defer formatting)
- Track consecutive_failures per endpoint across sweeps
- Add jitter to sweep interval (thundering herd prevention)
- Define backoff algorithm (500ms/2x/3min cap/25% jitter)
- Inline retry loop in probe_with_retry helper
- Use CosmosEndpoint (Transport Spec canonical type)
- Remove public HealthCheckOptions (all config internal)
- Remove HealthMonitorConfig (collapsed, hardcoded defaults)
- Remove account_metadata_cache (unused param)
- Remove expiration-based recovery (health checks own it)
- Add healthy-by-default design principle
- Add Python SDK divergence documentation
- Global endpoint: topology only, never probed
- Fix ASCII diagram alignment
- Use FaultInjection for testing (not MockHealthProbe)
- Logging: warn! failures, info! recovery, no routine OK

Co-authored-by: Copilot <[email protected]>
Upstream PR Azure#3824 moved .dict.txt words into .cspell.json.
Accept deletion of .dict.txt and add our new words (RAII,
deprioritize, bursty) to sdk/cosmos/.cspell.json.

Co-authored-by: Copilot <[email protected]>
Add blank lines before lists and fenced code blocks per
MD031/MD032 rules.

Co-authored-by: Copilot <[email protected]>
Add Markdown Linting section to AGENTS.md with markdownlint
command and add step 6 to the pre-completion checklist for
.md files. Also update stale .dict.txt reference to .cspell.json
(migrated in PR Azure#3824).

Co-authored-by: Copilot <[email protected]>
Rebuilt architecture overview diagram using ASCII characters
with programmatically verified uniform line widths. Both
markdownlint and cspell pass clean.

Co-authored-by: Copilot <[email protected]>
@tvaron3 tvaron3 closed this Mar 13, 2026
@github-project-automation github-project-automation bot moved this from Changes Requested to Done in CosmosDB Go/Rust Crew Mar 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Cosmos The azure_cosmos crate

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Cosmos Driver: Health checks for gateway endpoints

4 participants