
[Internal] Add dynamic timeout escalation spec #3871

Draft
tvaron3 wants to merge 9 commits into Azure:release/azure_data_cosmos-previews from tvaron3:tvaron3/dynamicTimeouts

Conversation


@tvaron3 tvaron3 commented Mar 5, 2026

Dynamic Timeout Escalation Spec

Adds DYNAMIC_TIMEOUTS_SPEC.md to sdk/cosmos/azure_data_cosmos_driver/docs/ describing a design for escalating connection and request timeouts on transport retries, rather than using a fixed timeout for every attempt.

Motivation

When a request times out and is retried with the same timeout, it often fails again. Escalating the timeout on each retry gives the operation a better chance of succeeding without making the initial attempt unnecessarily slow.

Key Design Decisions

  • Data plane request timeouts: 6s → 10s → 65s
  • Data plane connection timeouts: 1s → 3s → 5s
  • Metadata request timeouts: 5s → 10s → 20s (aligned with Java SDK DatabaseAccount pattern)
  • Metadata connection timeouts: 1s → 3s → 5s
  • Max transport retries: increased from 1 → 2 (3 total attempts to match ladder length)
  • Fixed defaults: the escalation ladder is not user-configurable; existing ConnectionPoolOptions min/max bounds still clamp the effective values
  • E2E deadline: end-to-end operation deadline still takes precedence over per-attempt ladder values
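As a sketch, the per-attempt selection from these ladders might look like the following. This is illustrative only: the constant and function names (`DATA_PLANE_REQUEST_LADDER`, `request_timeout_for_attempt`) are not the actual driver API, and the values simply restate the ladders listed above.

```rust
use std::time::Duration;

// Illustrative ladders from the spec (values as listed above); these
// names are hypothetical, not identifiers from the driver.
const DATA_PLANE_REQUEST_LADDER: [Duration; 3] = [
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];
const METADATA_REQUEST_LADDER: [Duration; 3] = [
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];

/// Returns the request timeout for a given retry attempt (0-based),
/// clamping to the last tier once the ladder is exhausted.
fn request_timeout_for_attempt(ladder: &[Duration], attempt: usize) -> Duration {
    ladder[attempt.min(ladder.len() - 1)]
}

fn main() {
    // First attempt uses the fast tier; attempts past the ladder end
    // stay at the final tier.
    assert_eq!(
        request_timeout_for_attempt(&DATA_PLANE_REQUEST_LADDER, 0),
        Duration::from_secs(6)
    );
    assert_eq!(
        request_timeout_for_attempt(&DATA_PLANE_REQUEST_LADDER, 5),
        Duration::from_secs(65)
    );
}
```

The effective value would still be clamped by the `ConnectionPoolOptions` bounds and capped by the end-to-end operation deadline, per the decisions above.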

Cross-SDK Alignment

The Java SDK already implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.

Deferred

  • Query plan timeout escalation (not yet implemented)
  • Address refresh timeout escalation (direct mode not yet implemented)

Open Questions

See §10 of the spec for discussion on per-attempt connection timeout mechanism, hedging interaction, and diagnostics reporting.

Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing
escalating connection and request timeouts on transport retries.

Co-authored-by: Copilot <[email protected]>
@tvaron3 tvaron3 linked an issue Mar 9, 2026 that may be closed by this pull request
tvaron3 and others added 2 commits March 9, 2026 12:07
Resolve 3 of 4 open questions based on review feedback:

- Commit to Context-based approach (Option A) for delivering
  per-attempt timeouts, removing Options B/C. No changes needed
  in azure_core or typespec_client_core.
- Decide hedged attempts always start at tier 0 of the timeout
  ladder since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext
  for observability.

Co-authored-by: Copilot <[email protected]>
Update the dynamic timeout spec to reflect the new pipeline
architecture from PR Azure#3887 (Driver Transport step 02):

- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with
  the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage
  loop in operation_pipeline.rs
- Change timeout delivery from Context type-map to TransportRequest
  field, matching the new pipeline data flow
- Update goal from 'increase retry count' to 'leverage existing
  retry budget' since max_failover_retries already defaults to 3

Co-authored-by: Copilot <[email protected]>
override the step durations. The existing `ConnectionPoolOptions` min/max bounds still act as
clamping limits (see [§7](#7-interaction-with-connectionpooloptions-bounds)).
- **Query plan timeout escalation**: Deferred until query plan execution is implemented.
- **Address refresh timeout escalation**: Deferred until direct mode is implemented.
Member Author


Direct mode won't be implemented

tvaron3 and others added 3 commits March 9, 2026 13:20
Propose adding an optional per-request timeout field to
typespec_client_core::http::Request with set_timeout()/timeout()
methods. The reqwest HttpClient implementation applies it via
reqwest::RequestBuilder::timeout().

For connection timeout escalation, the reqwest client is built
with the maximum ladder value (5s) and the per-request timeout
provides the tighter overall bound on each attempt.

Co-authored-by: Copilot <[email protected]>
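A minimal sketch of the proposed per-request timeout field, assuming the shape described in this commit message. The `Request` struct here is a stand-in: the real `typespec_client_core::http::Request` carries URL, method, headers, and body, all elided, and `set_timeout()`/`timeout()` are the proposed methods, not shipped API.

```rust
use std::time::Duration;

// Stand-in for typespec_client_core::http::Request with only the
// proposed per-request timeout field.
#[derive(Default)]
struct Request {
    timeout: Option<Duration>,
}

impl Request {
    /// Proposed setter: the retry loop stamps the ladder value for the
    /// current attempt here before dispatch.
    fn set_timeout(&mut self, timeout: Duration) {
        self.timeout = Some(timeout);
    }

    /// Proposed getter: the reqwest HttpClient implementation would
    /// apply this via reqwest::RequestBuilder::timeout() when present.
    fn timeout(&self) -> Option<Duration> {
        self.timeout
    }
}

fn main() {
    let mut req = Request::default();
    assert_eq!(req.timeout(), None); // no override => client default applies
    req.set_timeout(Duration::from_secs(10));
    assert_eq!(req.timeout(), Some(Duration::from_secs(10)));
}
```

Under this proposal the reqwest client itself is built with the maximum ladder value, and the per-request value provides the tighter bound on each attempt, as the commit message describes.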
Remove all connection timeout escalation from the spec.
Connection timeouts remain static (configured via
ConnectionPoolOptions). Only request timeouts are escalated
per retry attempt.

Co-authored-by: Copilot <[email protected]>
Connection timeouts use a failure-rate adaptive model instead
of per-retry escalation:

- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors connection failure rate
- On sustained failures, create new HttpClient instances with
  5s connect_timeout (one-time persistent transition)
- No azure_core changes needed — entirely internal to the
  sharded transport

Co-authored-by: Copilot <[email protected]>
tvaron3 added a commit to tvaron3/azure-sdk-for-rust that referenced this pull request Mar 9, 2026
- Remove probe_timeout (deferred to PR Azure#3871 per-request timeout)
- Rename max_probe_retries to max_probe_attempts, fix off-by-one
  semantics with exclusive range (3 = 3 total attempts)
  primitives for runtime-agnostic design
- Store CancellationToken in EndpointHealthMonitor, implement
  Drop to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint mark_available/
  mark_unavailable calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars,
  non-blocking startup, no optional blocking)

Co-authored-by: Copilot <[email protected]>
Member Author

@tvaron3 tvaron3 left a comment


PR Deep Review — §10 Adaptive Connection Timeout

Focused review of the adaptive connection timeout section. 3 comments (1 recommendation, 2 suggestions).

reads) have their own ladders? For now, a single metadata ladder is proposed. Query plan and
address refresh are deferred.

2. **Connection failure rate threshold**: What failure rate should trigger the connection timeout
Member Author


🟡 Recommendation · Completeness: Unresolved Design Parameters

Failure rate threshold and window are open questions that affect correctness

The threshold (>50%? >N consecutive?) and window (last 20 attempts? last 30s?) are listed as open questions, but they directly determine whether the adaptive mechanism triggers at all or triggers too aggressively. These aren’t implementation details — they’re design decisions that change the observable behavior.

A threshold that’s too low causes false positives on transient blips (one bad connection in a burst of pool expansion triggers permanent 5s timeout). A threshold that’s too high means the mechanism never activates in practice.

Suggested fix: Resolve these before merging the spec. Propose concrete defaults (e.g., “>3 consecutive connection failures within a 30s window”) and document the rationale, even if the values are expected to be tuned later.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

2. **Threshold check**: When the connection failure rate exceeds a configurable threshold (e.g.,
>50% of recent connection attempts fail), the transport transitions to the elevated timeout.
3. **Create new clients**: New `HttpClient` instances are created with `connect_timeout = 5s`
via the `HttpClientFactory`. Existing shards with the old timeout continue serving inflight
Member Author


🟢 Suggestion · Design: Transition Latency

“Exactly-once transition” means inflight requests on old shards keep failing at 1s during escalation

When the transport decides to escalate, new HttpClient instances get 5s but existing shards continue with 1s until they drain naturally. If the network issue is affecting all connections, requests on old shards will keep timing out at 1s until those shards are reclaimed by the health sweep.

The transition latency depends entirely on how fast old shards drain — which could be many seconds if they have active connections.

Consider whether existing shards should be actively replaced (e.g., marked unhealthy immediately so the health sweep reclaims them on its next pass) rather than waiting for natural drain.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.


```rust
// In ShardedHttpTransport, on health sweep or after connection failure:
if connection_failure_rate > CONNECT_TIMEOUT_ESCALATION_THRESHOLD {
    // … (excerpt truncated in the quoted context)
}
```
Member Author


🟢 Suggestion · Design: Scope Ambiguity

Per-endpoint vs global scope for failure rate tracking is unspecified

The pseudocode shows connection_failure_rate as a single value, but doesn’t specify whether it’s tracked per-endpoint or globally across all endpoints.

If one regional endpoint has connectivity issues (e.g., a specific Azure region), should ALL endpoints get the 5s timeout? Per-endpoint tracking would be more precise (only the slow region gets 5s) but adds complexity. Global tracking is simpler but could penalize healthy regions.

The spec should state the scope explicitly. Given that ShardedHttpTransport already manages shards per-endpoint, per-endpoint tracking would be a natural fit.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

tvaron3 and others added 2 commits March 16, 2026 12:23
- Remove direct mode non-goal (won't be implemented)
- Resolve connection failure threshold: >3 consecutive failures
  per endpoint triggers 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for
  immediate reclamation by health sweep
- Explicitly state per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <[email protected]>
- ShardedHttpTransport is no longer 'future integration' — it
  exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix HttpClientConfig connect_timeout claim — connect_timeout
  is sourced from ConnectionPoolOptions::max_connect_timeout()
  inside the factory build() method, not as a config field
- Reference existing per-request timeout mechanism in transport
  pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <[email protected]>
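The resolved adaptive model (>3 consecutive connection failures per endpoint triggers a one-time 1s → 5s escalation, with per-endpoint tracking scope) can be sketched as below. All names (`ConnectTimeoutTracker`, `EndpointConnectState`, the constants) are illustrative, not identifiers from the driver.

```rust
use std::collections::HashMap;
use std::time::Duration;

// Illustrative values from the resolved spec decisions.
const BASE_CONNECT_TIMEOUT: Duration = Duration::from_secs(1);
const ELEVATED_CONNECT_TIMEOUT: Duration = Duration::from_secs(5);
const ESCALATION_THRESHOLD: u32 = 3; // >3 consecutive failures escalates

#[derive(Default)]
struct EndpointConnectState {
    consecutive_failures: u32,
    escalated: bool, // one-time persistent transition
}

/// Per-endpoint failure tracking, matching the resolved scope decision.
struct ConnectTimeoutTracker {
    endpoints: HashMap<String, EndpointConnectState>,
}

impl ConnectTimeoutTracker {
    fn new() -> Self {
        Self { endpoints: HashMap::new() }
    }

    fn record_failure(&mut self, endpoint: &str) {
        let state = self.endpoints.entry(endpoint.to_string()).or_default();
        state.consecutive_failures += 1;
        if state.consecutive_failures > ESCALATION_THRESHOLD {
            // From here on, new HttpClient instances for this endpoint
            // would be built with the elevated connect timeout, and old
            // shards marked unhealthy for reclamation by the health sweep.
            state.escalated = true;
        }
    }

    fn record_success(&mut self, endpoint: &str) {
        if let Some(state) = self.endpoints.get_mut(endpoint) {
            // Success resets the streak, but not a completed escalation.
            state.consecutive_failures = 0;
        }
    }

    fn connect_timeout(&self, endpoint: &str) -> Duration {
        match self.endpoints.get(endpoint) {
            Some(s) if s.escalated => ELEVATED_CONNECT_TIMEOUT,
            _ => BASE_CONNECT_TIMEOUT,
        }
    }
}

fn main() {
    let mut tracker = ConnectTimeoutTracker::new();
    for _ in 0..4 {
        tracker.record_failure("west-region"); // hypothetical endpoint key
    }
    // Only the failing endpoint escalates; others keep the 1s default.
    assert_eq!(tracker.connect_timeout("west-region"), ELEVATED_CONNECT_TIMEOUT);
    assert_eq!(tracker.connect_timeout("east-region"), BASE_CONNECT_TIMEOUT);
    tracker.record_success("west-region");
    // Escalation is persistent even after a later success.
    assert_eq!(tracker.connect_timeout("west-region"), ELEVATED_CONNECT_TIMEOUT);
}
```

Keeping the tracker entirely inside the sharded transport matches the spec's claim that no azure_core changes are needed for the connection-timeout side.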

Labels

Cosmos The azure_cosmos crate

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Cosmos: Add Support for Gateway Timeouts

1 participant