
[Internal] Add dynamic timeout escalation spec #3871

Draft
tvaron3 wants to merge 9 commits into Azure:release/azure_data_cosmos-previews from tvaron3:tvaron3/dynamicTimeouts

Conversation


@tvaron3 tvaron3 commented Mar 5, 2026

Dynamic Timeout Escalation Spec

Adds DYNAMIC_TIMEOUTS_SPEC.md to sdk/cosmos/azure_data_cosmos_driver/docs/ describing a design for escalating connection and request timeouts on transport retries, rather than using a fixed timeout for every attempt.

Motivation

When a request times out and is retried with the same timeout, it often fails again. Escalating the timeout on each retry gives the operation a better chance of succeeding without making the initial attempt unnecessarily slow.

Key Design Decisions

  • Data plane request timeouts: 6s → 10s → 65s
  • Data plane connection timeouts: 1s → 3s → 5s
  • Metadata request timeouts: 5s → 10s → 20s (aligned with Java SDK DatabaseAccount pattern)
  • Metadata connection timeouts: 1s → 3s → 5s
  • Max transport retries: increased from 1 → 2 (3 total attempts to match ladder length)
  • Fixed defaults: the escalation ladder is not user-configurable; existing ConnectionPoolOptions min/max bounds still clamp the effective values
  • E2E deadline: end-to-end operation deadline still takes precedence over per-attempt ladder values
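As a sketch, the per-attempt selection from these ladders might look like the following. This is illustrative only: the constant and function names (`DATA_PLANE_REQUEST_LADDER`, `request_timeout_for_attempt`) are not the actual driver API, and the values simply restate the ladders listed above.

```rust
use std::time::Duration;

// Illustrative ladders from the spec (values as listed above); these
// names are hypothetical, not identifiers from the driver.
const DATA_PLANE_REQUEST_LADDER: [Duration; 3] = [
    Duration::from_secs(6),
    Duration::from_secs(10),
    Duration::from_secs(65),
];
const METADATA_REQUEST_LADDER: [Duration; 3] = [
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];

/// Returns the request timeout for a given retry attempt (0-based),
/// clamping to the last tier once the ladder is exhausted.
fn request_timeout_for_attempt(ladder: &[Duration], attempt: usize) -> Duration {
    ladder[attempt.min(ladder.len() - 1)]
}

fn main() {
    // First attempt uses the fast tier; attempts past the ladder end
    // stay at the final tier.
    assert_eq!(
        request_timeout_for_attempt(&DATA_PLANE_REQUEST_LADDER, 0),
        Duration::from_secs(6)
    );
    assert_eq!(
        request_timeout_for_attempt(&DATA_PLANE_REQUEST_LADDER, 5),
        Duration::from_secs(65)
    );
}
```

The effective value would still be clamped by the `ConnectionPoolOptions` bounds and capped by the end-to-end operation deadline, per the decisions above.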

Cross-SDK Alignment

The Java SDK already implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.

Deferred

  • Query plan timeout escalation (not yet implemented)
  • Address refresh timeout escalation (direct mode not yet implemented)

Open Questions

See §10 of the spec for discussion on per-attempt connection timeout mechanism, hedging interaction, and diagnostics reporting.

Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing
escalating connection and request timeouts on transport retries.

Co-authored-by: Copilot <[email protected]>
@tvaron3 tvaron3 linked an issue Mar 9, 2026 that may be closed by this pull request
tvaron3 and others added 2 commits March 9, 2026 12:07
Resolve 3 of 4 open questions based on review feedback:

- Commit to Context-based approach (Option A) for delivering
  per-attempt timeouts, removing Options B/C. No changes needed
  in azure_core or typespec_client_core.
- Decide hedged attempts always start at tier 0 of the timeout
  ladder since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext
  for observability.

Co-authored-by: Copilot <[email protected]>
Update the dynamic timeout spec to reflect the new pipeline
architecture from PR Azure#3887 (Driver Transport step 02):

- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with
  the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage
  loop in operation_pipeline.rs
- Change timeout delivery from Context type-map to TransportRequest
  field, matching the new pipeline data flow
- Update goal from 'increase retry count' to 'leverage existing
  retry budget' since max_failover_retries already defaults to 3

Co-authored-by: Copilot <[email protected]>
override the step durations. The existing `ConnectionPoolOptions` min/max bounds still act as
clamping limits (see [§7](#7-interaction-with-connectionpooloptions-bounds)).
- **Query plan timeout escalation**: Deferred until query plan execution is implemented.
- **Address refresh timeout escalation**: Deferred until direct mode is implemented.
Member Author


Direct mode won't be implemented

tvaron3 and others added 3 commits March 9, 2026 13:20
Propose adding an optional per-request timeout field to
typespec_client_core::http::Request with set_timeout()/timeout()
methods. The reqwest HttpClient implementation applies it via
reqwest::RequestBuilder::timeout().

For connection timeout escalation, the reqwest client is built
with the maximum ladder value (5s) and the per-request timeout
provides the tighter overall bound on each attempt.

Co-authored-by: Copilot <[email protected]>
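A minimal sketch of the proposed per-request timeout field, assuming the shape described in this commit message. The `Request` struct here is a stand-in: the real `typespec_client_core::http::Request` carries URL, method, headers, and body, all elided, and `set_timeout()`/`timeout()` are the proposed methods, not shipped API.

```rust
use std::time::Duration;

// Stand-in for typespec_client_core::http::Request with only the
// proposed per-request timeout field.
#[derive(Default)]
struct Request {
    timeout: Option<Duration>,
}

impl Request {
    /// Proposed setter: the retry loop stamps the ladder value for the
    /// current attempt here before dispatch.
    fn set_timeout(&mut self, timeout: Duration) {
        self.timeout = Some(timeout);
    }

    /// Proposed getter: the reqwest HttpClient implementation would
    /// apply this via reqwest::RequestBuilder::timeout() when present.
    fn timeout(&self) -> Option<Duration> {
        self.timeout
    }
}

fn main() {
    let mut req = Request::default();
    assert_eq!(req.timeout(), None); // no override => client default applies
    req.set_timeout(Duration::from_secs(10));
    assert_eq!(req.timeout(), Some(Duration::from_secs(10)));
}
```

Under this proposal the reqwest client itself is built with the maximum ladder value, and the per-request value provides the tighter bound on each attempt, as the commit message describes.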
Remove all connection timeout escalation from the spec.
Connection timeouts remain static (configured via
ConnectionPoolOptions). Only request timeouts are escalated
per retry attempt.

Co-authored-by: Copilot <[email protected]>
Connection timeouts use a failure-rate adaptive model instead
of per-retry escalation:

- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors connection failure rate
- On sustained failures, create new HttpClient instances with
  5s connect_timeout (one-time persistent transition)
- No azure_core changes needed — entirely internal to the
  sharded transport

Co-authored-by: Copilot <[email protected]>
tvaron3 added a commit to tvaron3/azure-sdk-for-rust that referenced this pull request Mar 9, 2026
- Remove probe_timeout (deferred to PR Azure#3871 per-request timeout)
- Rename max_probe_retries to max_probe_attempts, fix off-by-one
  semantics with exclusive range (3 = 3 total attempts)
  primitives for runtime-agnostic design
- Store CancellationToken in EndpointHealthMonitor, implement
  Drop to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint mark_available/
  mark_unavailable calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars,
  non-blocking startup, no optional blocking)

Co-authored-by: Copilot <[email protected]>
Member Author

@tvaron3 tvaron3 left a comment


PR Deep Review — §10 Adaptive Connection Timeout

Focused review of the adaptive connection timeout section. 3 comments (1 recommendation, 2 suggestions).

reads) have their own ladders? For now, a single metadata ladder is proposed. Query plan and
address refresh are deferred.

2. **Connection failure rate threshold**: What failure rate should trigger the connection timeout
Member Author


🟡 Recommendation · Completeness: Unresolved Design Parameters

Failure rate threshold and window are open questions that affect correctness

The threshold (>50%? >N consecutive?) and window (last 20 attempts? last 30s?) are listed as open questions, but they directly determine whether the adaptive mechanism triggers at all or triggers too aggressively. These aren’t implementation details — they’re design decisions that change the observable behavior.

A threshold that’s too low causes false positives on transient blips (one bad connection in a burst of pool expansion triggers permanent 5s timeout). A threshold that’s too high means the mechanism never activates in practice.

Suggested fix: Resolve these before merging the spec. Propose concrete defaults (e.g., “>3 consecutive connection failures within a 30s window”) and document the rationale, even if the values are expected to be tuned later.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

2. **Threshold check**: When the connection failure rate exceeds a configurable threshold (e.g.,
>50% of recent connection attempts fail), the transport transitions to the elevated timeout.
3. **Create new clients**: New `HttpClient` instances are created with `connect_timeout = 5s`
via the `HttpClientFactory`. Existing shards with the old timeout continue serving inflight
Member Author


🟢 Suggestion · Design: Transition Latency

“Exactly-once transition” means inflight requests on old shards keep failing at 1s during escalation

When the transport decides to escalate, new HttpClient instances get 5s but existing shards continue with 1s until they drain naturally. If the network issue is affecting all connections, requests on old shards will keep timing out at 1s until those shards are reclaimed by the health sweep.

The transition latency depends entirely on how fast old shards drain — which could be many seconds if they have active connections.

Consider whether existing shards should be actively replaced (e.g., marked unhealthy immediately so the health sweep reclaims them on its next pass) rather than waiting for natural drain.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.


```rust
// In ShardedHttpTransport, on health sweep or after connection failure:
if connection_failure_rate > CONNECT_TIMEOUT_ESCALATION_THRESHOLD {
    // … (excerpt truncated in the quoted context)
}
```
Member Author


🟢 Suggestion · Design: Scope Ambiguity

Per-endpoint vs global scope for failure rate tracking is unspecified

The pseudocode shows connection_failure_rate as a single value, but doesn’t specify whether it’s tracked per-endpoint or globally across all endpoints.

If one regional endpoint has connectivity issues (e.g., a specific Azure region), should ALL endpoints get the 5s timeout? Per-endpoint tracking would be more precise (only the slow region gets 5s) but adds complexity. Global tracking is simpler but could penalize healthy regions.

The spec should state the scope explicitly. Given that ShardedHttpTransport already manages shards per-endpoint, per-endpoint tracking would be a natural fit.


⚠️ AI-generated review — may be incorrect. Agree? → resolve the conversation. Disagree? → reply with your reasoning.

tvaron3 and others added 2 commits March 16, 2026 12:23
- Remove direct mode non-goal (won't be implemented)
- Resolve connection failure threshold: >3 consecutive failures
  per endpoint triggers 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for
  immediate reclamation by health sweep
- Explicitly state per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <[email protected]>
- ShardedHttpTransport is no longer 'future integration' — it
  exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix HttpClientConfig connect_timeout claim — connect_timeout
  is sourced from ConnectionPoolOptions::max_connect_timeout()
  inside the factory build() method, not as a config field
- Reference existing per-request timeout mechanism in transport
  pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <[email protected]>
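The resolved adaptive model (>3 consecutive connection failures per endpoint triggers a one-time 1s → 5s escalation, with per-endpoint tracking scope) can be sketched as below. All names (`ConnectTimeoutTracker`, `EndpointConnectState`, the constants) are illustrative, not identifiers from the driver.

```rust
use std::collections::HashMap;
use std::time::Duration;

// Illustrative values from the resolved spec decisions.
const BASE_CONNECT_TIMEOUT: Duration = Duration::from_secs(1);
const ELEVATED_CONNECT_TIMEOUT: Duration = Duration::from_secs(5);
const ESCALATION_THRESHOLD: u32 = 3; // >3 consecutive failures escalates

#[derive(Default)]
struct EndpointConnectState {
    consecutive_failures: u32,
    escalated: bool, // one-time persistent transition
}

/// Per-endpoint failure tracking, matching the resolved scope decision.
struct ConnectTimeoutTracker {
    endpoints: HashMap<String, EndpointConnectState>,
}

impl ConnectTimeoutTracker {
    fn new() -> Self {
        Self { endpoints: HashMap::new() }
    }

    fn record_failure(&mut self, endpoint: &str) {
        let state = self.endpoints.entry(endpoint.to_string()).or_default();
        state.consecutive_failures += 1;
        if state.consecutive_failures > ESCALATION_THRESHOLD {
            // From here on, new HttpClient instances for this endpoint
            // would be built with the elevated connect timeout, and old
            // shards marked unhealthy for reclamation by the health sweep.
            state.escalated = true;
        }
    }

    fn record_success(&mut self, endpoint: &str) {
        if let Some(state) = self.endpoints.get_mut(endpoint) {
            // Success resets the streak, but not a completed escalation.
            state.consecutive_failures = 0;
        }
    }

    fn connect_timeout(&self, endpoint: &str) -> Duration {
        match self.endpoints.get(endpoint) {
            Some(s) if s.escalated => ELEVATED_CONNECT_TIMEOUT,
            _ => BASE_CONNECT_TIMEOUT,
        }
    }
}

fn main() {
    let mut tracker = ConnectTimeoutTracker::new();
    for _ in 0..4 {
        tracker.record_failure("west-region"); // hypothetical endpoint key
    }
    // Only the failing endpoint escalates; others keep the 1s default.
    assert_eq!(tracker.connect_timeout("west-region"), ELEVATED_CONNECT_TIMEOUT);
    assert_eq!(tracker.connect_timeout("east-region"), BASE_CONNECT_TIMEOUT);
    tracker.record_success("west-region");
    // Escalation is persistent even after a later success.
    assert_eq!(tracker.connect_timeout("west-region"), ELEVATED_CONNECT_TIMEOUT);
}
```

Keeping the tracker entirely inside the sharded transport matches the spec's claim that no azure_core changes are needed for the connection-timeout side.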

Labels

Cosmos The azure_cosmos crate

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Cosmos: Add Support for Gateway Timeouts

1 participant