[Cosmos] Health checks spec for gateway endpoints #3866
tvaron3 wants to merge 11 commits into Azure:release/azure_data_cosmos-previews
Conversation
Spec for proactive endpoint health checking in the driver, addressing Azure#3846. Key design:

- `HealthProbe` trait abstraction makes it trivial to swap `GetDatabaseAccount` for the upcoming gateway health API
- `EndpointHealthMonitor` runs background sweeps with concurrent probing across all regional endpoints
- Integrates with `AccountEndpointStateStore` from the Transport Pipeline Spec for automatic routing updates
- Configurable via `HealthCheckOptions` and env var overrides

Co-authored-by: Copilot <[email protected]>
- Replace `reason: String` with structured `ProbeFailure` (status code, sub-status)
- Use transport-layer timeout instead of `tokio::time::timeout` wrapper
- Remove `probe_on_startup` toggle — always probe on startup (non-blocking)
- Remove custom `probe_with_retry`; use `ClientRetryPolicy` with probe-specific config
- Add `ExecutionContext::HealthCheckProbe` variant separate from PPCB
- Remove all env var overrides (§5.3) — config is programmatic-only
- Startup probe runs in background, does not block client creation
- Resolve Open Questions #1 (non-blocking) and #2 (keep enabled for emulator)

Co-authored-by: Copilot <[email protected]>
Remove branch name from spec header to avoid cspell failure on username. Add RAII to cosmos dictionary. Co-authored-by: Copilot <[email protected]>
Update spec to match the pipeline architecture established in PR Azure#3875 (driver-transport step 01):

- Replace old azure_core `Pipeline`/`Context`/`Policy` API with `execute_transport_pipeline()` and `get_metadata_http_client()`
- Replace `CosmosEndpoint` (non-existent) with `AccountEndpoint`
- Replace `Arc<dyn Authorization>` with `Credential` enum
- Replace `ClientRetryPolicy` references with actual retry architecture (probe-level loop + transport 429 retry)
- Use `CosmosStatus` instead of raw u16/u32 in `ProbeFailure`
- Mark Step 2 dependencies (`AccountEndpointStateStore` etc.) as pseudocode with prominent callout
- Fix `mark_endpoint_available` description (both functions new)
- Update migration plan to reference new pipeline functions
- Defer per-request timeout (uses client-level metadata timeout)

Co-authored-by: Copilot <[email protected]>
Pull request overview
Adds a new design/spec document for proactive gateway endpoint health checking in azure_data_cosmos_driver, describing a background probing system that feeds into the endpoint routing state so operations can avoid unhealthy regions.
Changes:
- Added a full health checks specification covering probe abstraction, scheduling, retries, state integration, and configuration.
- Updated the Cosmos dictionary to allow the term “RAII”.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos_driver/docs/HEALTH_CHECKS_SPEC.md | New spec describing the proposed health-check architecture and integration with routing state. |
| sdk/cosmos/.dict.txt | Adds “RAII” to the spellcheck dictionary. |
- Remove `probe_timeout` (deferred to PR Azure#3871 per-request timeout)
- Rename `max_probe_retries` to `max_probe_attempts`, fix off-by-one semantics with exclusive range (3 = 3 total attempts); primitives for runtime-agnostic design
- Store `CancellationToken` in `EndpointHealthMonitor`, implement `Drop` to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint `mark_available`/`mark_unavailable` calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars, non-blocking startup, no optional blocking)

Co-authored-by: Copilot <[email protected]>
analogrelay left a comment
Some initial thoughts. It's a great start and I don't see any fundamental issues!
> ### Design Constraint: Swappable Probe API
>
> The Cosmos DB gateway team is building a dedicated health endpoint. Until that ships, we use
Is it possible for us to detect when this endpoint is available? Ideally, we'd like to just start using it when it becomes available instead of requiring users to update their SDK
The discussed approach was for the service team to create the endpoint (which might initially just redirect to getaccount internally) - but even that might take a few weeks until it is deployed
But we expect to have the endpoint at the expected URL by GA? If so, then we don't need the trait/enum at all. We can just change the endpoint we use when it's fully deployed.
Let us double-check - I don't think there is a clear ETA - and deployment in non-public clouds usually takes 3+ months after public clouds - so, by the GA timeline we probably cannot rely yet on the endpoint being there.
Another option is a discovery approach, such as Get Account Properties returning a healthCheckEndpoints collection or similar. Since we're going to run Get Account Properties before initializing the driver (as part of async initialization) we could check for that value and use it if present, otherwise fall back to just using Get Account Properties itself.
Basically, I'm just trying to see if we can make this Just Start Working ™️ when the service rolls it out instead of requiring an SDK update.
IMO the discovery approach via a health endpoint is over-engineering and also requires service deployments. My 2 cents - the pragmatic approach is what Tomas is proposing in the PR. If we want something unified by GA to avoid the eventual SDK-update necessity, we need to push the service team to accelerate providing the health endpoint. Everything else just over-complicates for some very short-term gain. I am fine with requiring an SDK update as long as it does not change any public API, because I expect customers using the GA version to upgrade relatively quickly initially anyway.
> When the dedicated gateway health endpoint ships, the migration is:
>
> 1. Create `GatewayHealthProbe` implementing `HealthProbe`:
It would be ideal if the gateway health probe endpoints could be returned by Get Account Properties or something similar. That way we can select an appropriate strategy during driver initialization
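The selection logic discussed in this thread might look roughly like the following sketch. `ProbeStrategy`, `select_probe_strategy`, and the idea of a `healthCheckEndpoints` collection in Get Account Properties are all hypothetical — nothing here is an agreed service contract:

```rust
/// Hypothetical discovery-based probe selection: if Get Account Properties
/// ever advertises dedicated health endpoints, prefer them; otherwise fall
/// back to using Get Account Properties itself as the probe.
#[derive(Debug, PartialEq)]
enum ProbeStrategy {
    /// Dedicated gateway health endpoints advertised by the service.
    GatewayHealth(Vec<String>),
    /// Fallback: reuse Get Account Properties as the probe.
    GetDatabaseAccount,
}

/// `health_check_endpoints` would come from the account-properties response
/// fetched during async driver initialization (field name is an assumption).
fn select_probe_strategy(health_check_endpoints: Option<Vec<String>>) -> ProbeStrategy {
    match health_check_endpoints {
        Some(eps) if !eps.is_empty() => ProbeStrategy::GatewayHealth(eps),
        _ => ProbeStrategy::GetDatabaseAccount,
    }
}
```

Because the strategy is chosen per initialization, the driver would pick up the new endpoint on the next client creation after the service rolls it out, with no SDK update.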
FabianMeiswinkel left a comment
Deep Review — Health Checks Spec for Gateway Endpoints
Well-structured spec with solid design thinking. The HealthProbe trait abstraction and DOP-aligned pure function approach for state mutations are highlights. Reviewed against the Transport Pipeline Spec and Python SDK implementation for cross-spec consistency and parity.
13 findings: 1 blocking, 7 recommendations, 3 suggestions, 2 observations.
Key Issues
- 🔴 Retry logic gap — §3.5 defines the 3-attempt retry loop, but §3.6's concurrent probing calls `probe.probe(&ep)` only once per endpoint.
- 🟡 Cross-spec gaps — `mark_endpoint_available` and the `AccountEndpoint` type naming are not aligned with the Transport Pipeline Spec.
- 🟡 Underspecified — the `backoff_delay()` algorithm and the `HealthMonitorConfig` ↔ `HealthCheckOptions` mapping.
- 🟡 Architecture — the health monitor bypasses the `LocationStateStore.apply()` single-writer model; the independent sweep loop diverges from the Python SDK's piggybacked approach (intentional improvement, but should be documented).
See inline comments for full details on each finding.
Major design changes from Ashley and Fabian's review:

- Replace `HealthProbe` trait with enum (DOP guidance)
- Use `execute_operation` for probes (reuses full pipeline)
- Route results through `LocationEffect`/`LocationStateStore` (single-writer model, no snapshot+swap lost updates)
- Use `BackgroundTaskManager` for drop-based cancellation (no `CancellationToken`, no `JoinHandle`)
- Use `FuturesUnordered` for concurrent probing (process results as they arrive, runtime-agnostic)
- Carry `azure_core::Error` in `ProbeFailure` (defer formatting)
- Track `consecutive_failures` per endpoint across sweeps
- Add jitter to sweep interval (thundering herd prevention)
- Define backoff algorithm (500ms/2x/3min cap/25% jitter)
- Inline retry loop in `probe_with_retry` helper
- Use `CosmosEndpoint` (Transport Spec canonical type)
- Remove public `HealthCheckOptions` (all config internal)
- Remove `HealthMonitorConfig` (collapsed, hardcoded defaults)
- Remove `account_metadata_cache` (unused param)
- Remove expiration-based recovery (health checks own it)
- Add healthy-by-default design principle
- Add Python SDK divergence documentation
- Global endpoint: topology only, never probed
- Fix ASCII diagram alignment
- Use `FaultInjection` for testing (not `MockHealthProbe`)
- Logging: `warn!` failures, `info!` recovery, no routine OK

Co-authored-by: Copilot <[email protected]>
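The backoff parameters named in this commit (500ms initial, 2x growth, 3-minute cap, 25% jitter) and the inlined retry loop could be sketched as below. The function names echo the commit message; the jitter-as-parameter shape is an assumption made to keep the sketch deterministic, not the spec's actual signature:

```rust
use std::time::Duration;

/// Backoff per the stated algorithm: 500 ms initial delay, doubling each
/// attempt, capped at 3 minutes. `jitter` is a caller-supplied fraction in
/// [0.0, 0.25); a real implementation would draw it from an RNG.
fn backoff_delay(attempt: u32, jitter: f64) -> Duration {
    const INITIAL_MS: u64 = 500;
    const CAP_MS: u64 = 180_000; // 3 minutes

    let base = INITIAL_MS.saturating_mul(1u64 << attempt.min(20)).min(CAP_MS);
    Duration::from_millis((base as f64 * (1.0 + jitter)) as u64)
}

/// Inlined retry loop: up to `max_attempts` total probes of one endpoint.
/// The async sleep between failures is elided; real code would await the
/// computed delay before the next attempt.
fn probe_with_retry<E>(
    max_attempts: u32,
    mut probe: impl FnMut() -> Result<(), E>,
) -> Result<(), E> {
    let mut last_err = None;
    for attempt in 0..max_attempts {
        match probe() {
            Ok(()) => return Ok(()),
            Err(e) => {
                let _delay = backoff_delay(attempt, 0.0); // sleep(_delay) in real code
                last_err = Some(e);
            }
        }
    }
    Err(last_err.expect("max_attempts must be > 0"))
}
```

Note the exclusive range `0..max_attempts` gives exactly `max_attempts` total attempts, matching the "3 = 3 total attempts" semantics fixed in an earlier commit.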
Upstream PR Azure#3824 moved .dict.txt words into .cspell.json. Accept deletion of .dict.txt and add our new words (RAII, deprioritize, bursty) to sdk/cosmos/.cspell.json. Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Add blank lines before lists and fenced code blocks per MD031/MD032 rules. Co-authored-by: Copilot <[email protected]>
Add Markdown Linting section to AGENTS.md with markdownlint command and add step 6 to the pre-completion checklist for .md files. Also update stale .dict.txt reference to .cspell.json (migrated in PR Azure#3824). Co-authored-by: Copilot <[email protected]>
Rebuilt architecture overview diagram using ASCII characters with programmatically verified uniform line widths. Both markdownlint and cspell pass clean. Co-authored-by: Copilot <[email protected]>
Summary
Spec for proactive gateway endpoint health checking in `azure_data_cosmos_driver`. Addresses #3846.

Design
The health check system probes regional gateway endpoints in the background to detect unavailability and recovery, feeding results into the endpoint routing system so requests avoid unhealthy regions.
Key components
- `HealthProbe` trait — swappable probe abstraction. The current implementation uses `GetDatabaseAccount` (lightweight metadata read). When the dedicated gateway health API ships, only the probe implementation changes — orchestration, scheduling, state management, and config remain untouched.
- `GetDatabaseAccountProbe` — reuses the existing `CosmosTransport` metadata pipeline for authenticated `GET /` calls with per-probe timeout.
- `EndpointHealthMonitor` — background `tokio` task that sweeps all regional endpoints concurrently every 5 minutes (configurable). Each probe has exponential backoff retry (3 attempts, 500ms initial delay).
- `AccountEndpointStateStore` from the Transport Pipeline Spec (§4.4) — healthy probes remove endpoints from the unavailable set; failed probes add them. No changes needed in `execute_operation` — routing automatically respects the updated state.
- `HealthCheckOptions` — public configuration with env var overrides (`AZURE_COSMOS_HEALTH_CHECK_*`).

Python SDK alignment
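The trait-based abstraction the summary describes could look roughly like this. Only `HealthProbe`, `GetDatabaseAccountProbe`, and `GatewayHealthProbe` come from the spec; the method signatures, paths, and `ProbeFailure` shape are illustrative (and later review moved the design from a trait to an enum):

```rust
/// Illustrative probe-failure shape (the spec uses structured status codes).
#[derive(Debug)]
struct ProbeFailure {
    status: u16,
    sub_status: Option<u32>,
}

trait HealthProbe {
    /// Request path this probe issues against a regional endpoint.
    fn probe_path(&self, endpoint: &str) -> String;
    /// Probe one endpoint; Ok(()) means healthy. A real implementation
    /// performs an authenticated HTTP call via the transport pipeline.
    fn probe(&self, endpoint: &str) -> Result<(), ProbeFailure>;
}

/// Current probe: lightweight metadata read (`GET /`).
struct GetDatabaseAccountProbe;

impl HealthProbe for GetDatabaseAccountProbe {
    fn probe_path(&self, endpoint: &str) -> String {
        format!("{endpoint}/")
    }
    fn probe(&self, _endpoint: &str) -> Result<(), ProbeFailure> {
        Ok(()) // stub: real code executes the metadata request
    }
}

/// Future probe once the gateway health API ships; the `/health` path is
/// an assumption, not a confirmed service contract.
struct GatewayHealthProbe;

impl HealthProbe for GatewayHealthProbe {
    fn probe_path(&self, endpoint: &str) -> String {
        format!("{endpoint}/health")
    }
    fn probe(&self, _endpoint: &str) -> Result<(), ProbeFailure> {
        Ok(())
    }
}
```

Swapping implementations leaves the monitor's orchestration code untouched, which is the point of the abstraction.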
Defaults match the Python SDK: 5-minute refresh interval, 3 probe retries with 500ms initial backoff, concurrent probing of read + write endpoints. The partition-level circuit breaker is out of scope (covered separately in Transport Pipeline Spec §4.5).
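As a rough sketch, the Python-parity defaults above could be captured in a config struct like the following. The struct and field names are assumptions for illustration (later review actually removed the public options type in favor of hardcoded internal defaults):

```rust
use std::time::Duration;

/// Illustrative defaults mirroring the Python SDK parity described above.
#[derive(Debug, Clone, PartialEq)]
struct HealthCheckDefaults {
    refresh_interval: Duration,
    max_probe_attempts: u32,
    initial_backoff: Duration,
}

impl Default for HealthCheckDefaults {
    fn default() -> Self {
        Self {
            refresh_interval: Duration::from_secs(300),  // 5-minute sweep
            max_probe_attempts: 3,                       // 3 total attempts
            initial_backoff: Duration::from_millis(500), // first retry delay
        }
    }
}
```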
Open questions
5 questions flagged for discussion in the spec (startup blocking vs background, emulator behavior, metadata refresh piggybacking, gateway API contract, options visibility).
Files changed
- `sdk/cosmos/azure_data_cosmos_driver/docs/HEALTH_CHECKS_SPEC.md` — full spec with component design, integration points, configuration, testing strategy, and 10-step migration plan