[Internal] Add dynamic timeout escalation spec #3871
tvaron3 wants to merge 9 commits into Azure:release/azure_data_cosmos-previews from tvaron3/dynamicTimeouts
Conversation
Add DYNAMIC_TIMEOUTS_SPEC.md to the driver docs describing escalating connection and request timeouts on transport retries. Co-authored-by: Copilot <[email protected]>
Resolve 3 of 4 open questions based on review feedback:
- Commit to Context-based approach (Option A) for delivering per-attempt timeouts, removing Options B/C. No changes needed in azure_core or typespec_client_core.
- Decide hedged attempts always start at tier 0 of the timeout ladder since they target independent regional endpoints.
- Record effective per-attempt timeout values in DiagnosticsContext for observability.

Co-authored-by: Copilot <[email protected]>
Update the dynamic timeout spec to reflect the new pipeline architecture from PR Azure#3887 (Driver Transport step 02):
- Replace MAX_TRANSPORT_RETRIES/TransportRetry references with the new FailoverRetry/SessionRetry model and failover_retry_count
- Move retry loop references from cosmos_driver.rs to the 7-stage loop in operation_pipeline.rs
- Change timeout delivery from Context type-map to TransportRequest field, matching the new pipeline data flow
- Update goal from 'increase retry count' to 'leverage existing retry budget' since max_failover_retries already defaults to 3

Co-authored-by: Copilot <[email protected]>
> override the step durations. The existing `ConnectionPoolOptions` min/max bounds still act as
> clamping limits (see [§7](#7-interaction-with-connectionpooloptions-bounds)).
> - **Query plan timeout escalation**: Deferred until query plan execution is implemented.
> - **Address refresh timeout escalation**: Deferred until direct mode is implemented.

Direct mode won't be implemented
Propose adding an optional per-request timeout field to typespec_client_core::http::Request with set_timeout()/timeout() methods. The reqwest HttpClient implementation applies it via reqwest::RequestBuilder::timeout(). For connection timeout escalation, the reqwest client is built with the maximum ladder value (5s) and the per-request timeout provides the tighter overall bound on each attempt. Co-authored-by: Copilot <[email protected]>
Remove all connection timeout escalation from the spec. Connection timeouts remain static (configured via ConnectionPoolOptions). Only request timeouts are escalated per retry attempt. Co-authored-by: Copilot <[email protected]>
Connection timeouts use a failure-rate adaptive model instead of per-retry escalation:
- Start at 1s (sufficient for any cloud/datacenter network)
- ShardedHttpTransport monitors connection failure rate
- On sustained failures, create new HttpClient instances with 5s connect_timeout (one-time persistent transition)
- No azure_core changes needed — entirely internal to the sharded transport

Co-authored-by: Copilot <[email protected]>
- Remove probe_timeout (deferred to PR Azure#3871 per-request timeout)
- Rename max_probe_retries to max_probe_attempts; fix off-by-one semantics with exclusive range (3 = 3 total attempts)
- … primitives for runtime-agnostic design
- Store CancellationToken in EndpointHealthMonitor, implement Drop to cancel background task on shutdown
- Replace snapshot+swap with per-endpoint mark_available/mark_unavailable calls to avoid lost concurrent updates
- Update PR description to match current spec (no env vars, non-blocking startup, no optional blocking)

Co-authored-by: Copilot <[email protected]>
…om/Azure/azure-sdk-for-rust into tvaron3/dynamicTimeouts
tvaron3 left a comment:
PR Deep Review — §10 Adaptive Connection Timeout
Focused review of the adaptive connection timeout section. 3 comments (1 recommendation, 2 suggestions).
> reads) have their own ladders? For now, a single metadata ladder is proposed. Query plan and
> address refresh are deferred.
>
> 2. **Connection failure rate threshold**: What failure rate should trigger the connection timeout
🟡 Recommendation · Completeness: Unresolved Design Parameters
Failure rate threshold and window are open questions that affect correctness
The threshold (>50%? >N consecutive?) and window (last 20 attempts? last 30s?) are listed as open questions, but they directly determine whether the adaptive mechanism triggers at all or triggers too aggressively. These aren’t implementation details — they’re design decisions that change the observable behavior.
A threshold that’s too low causes false positives on transient blips (one bad connection in a burst of pool expansion triggers permanent 5s timeout). A threshold that’s too high means the mechanism never activates in practice.
Suggested fix: Resolve these before merging the spec. Propose concrete defaults (e.g., “>3 consecutive connection failures within a 30s window”) and document the rationale, even if the values are expected to be tuned later.
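The proposed default (">3 consecutive connection failures within a 30s window") could be tracked with a small counter; the sketch below is illustrative — `FailureTracker` and its fields are hypothetical names, not the driver's API.

```rust
use std::time::{Duration, Instant};

/// Illustrative tracker for ">N consecutive connection failures within a
/// sliding window". Names and defaults (3 failures, 30s) are hypothetical.
struct FailureTracker {
    threshold: u32,
    window: Duration,
    consecutive: u32,
    first_failure: Option<Instant>,
}

impl FailureTracker {
    fn new(threshold: u32, window: Duration) -> Self {
        Self { threshold, window, consecutive: 0, first_failure: None }
    }

    /// A successful connection resets the streak, so an isolated blip
    /// during pool expansion never triggers escalation.
    fn record_success(&mut self) {
        self.consecutive = 0;
        self.first_failure = None;
    }

    /// Returns true when escalation should trigger: more than `threshold`
    /// consecutive failures, all inside the window.
    fn record_failure(&mut self, now: Instant) -> bool {
        match self.first_failure {
            Some(start) if now.duration_since(start) <= self.window => {
                self.consecutive += 1;
            }
            _ => {
                // The streak restarts once the window has lapsed.
                self.first_failure = Some(now);
                self.consecutive = 1;
            }
        }
        self.consecutive > self.threshold
    }
}
```

A consecutive-count-with-window rule like this avoids both failure modes named above: the success reset suppresses false positives, and the low absolute count (3) keeps the mechanism from never firing.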
> 2. **Threshold check**: When the connection failure rate exceeds a configurable threshold (e.g.,
>    >50% of recent connection attempts fail), the transport transitions to the elevated timeout.
> 3. **Create new clients**: New `HttpClient` instances are created with `connect_timeout = 5s`
>    via the `HttpClientFactory`. Existing shards with the old timeout continue serving inflight
🟢 Suggestion · Design: Transition Latency
“Exactly-once transition” means inflight requests on old shards keep failing at 1s during escalation
When the transport decides to escalate, new HttpClient instances get 5s but existing shards continue with 1s until they drain naturally. If the network issue is affecting all connections, requests on old shards will keep timing out at 1s until those shards are reclaimed by the health sweep.
The transition latency depends entirely on how fast old shards drain — which could be many seconds if they have active connections.
Consider whether existing shards should be actively replaced (e.g., marked unhealthy immediately so the health sweep reclaims them on its next pass) rather than waiting for natural drain.
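Active replacement could look roughly like the sketch below; `Shard`, `ShardedTransport`, and the escalation flow are hypothetical stand-ins for the sharded transport's internals, not its real types.

```rust
use std::time::Duration;

// Hypothetical stand-ins for the sharded transport's internals.
struct Shard {
    connect_timeout: Duration,
    healthy: bool,
}

struct ShardedTransport {
    shards: Vec<Shard>,
    escalated: bool,
}

impl ShardedTransport {
    /// One-time escalation: new shards are built with the elevated connect
    /// timeout, and existing shards are marked unhealthy so the health sweep
    /// reclaims them on its next pass instead of letting them drain while
    /// still failing at the old 1s timeout.
    fn escalate_connect_timeout(&mut self) {
        if self.escalated {
            return; // exactly-once transition
        }
        self.escalated = true;
        for shard in &mut self.shards {
            shard.healthy = false; // eligible for immediate reclamation
        }
        // Shards created from here on use the elevated timeout.
        self.shards.push(Shard {
            connect_timeout: Duration::from_secs(5),
            healthy: true,
        });
    }
}
```

Marking old shards unhealthy bounds the transition latency by the sweep interval rather than by natural connection drain.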
> ```rust
> // In ShardedHttpTransport, on health sweep or after connection failure:
> if connection_failure_rate > CONNECT_TIMEOUT_ESCALATION_THRESHOLD {
>     // … (remainder of the escalation body truncated in the review excerpt)
> ```
🟢 Suggestion · Design: Scope Ambiguity
Per-endpoint vs global scope for failure rate tracking is unspecified
The pseudocode shows connection_failure_rate as a single value, but doesn’t specify whether it’s tracked per-endpoint or globally across all endpoints.
If one regional endpoint has connectivity issues (e.g., a specific Azure region), should ALL endpoints get the 5s timeout? Per-endpoint tracking would be more precise (only the slow region gets 5s) but adds complexity. Global tracking is simpler but could penalize healthy regions.
The spec should state the scope explicitly. Given that ShardedHttpTransport already manages shards per-endpoint, per-endpoint tracking would be a natural fit.
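Per-endpoint tracking might keep one counter per endpoint key, so only the affected region escalates; the map shape and names below are illustrative, not the driver's API.

```rust
use std::collections::HashMap;

/// Illustrative per-endpoint failure counters: only the endpoint whose
/// streak crosses the threshold escalates; healthy regions are untouched.
struct PerEndpointFailures {
    threshold: u32,
    consecutive: HashMap<String, u32>,
}

impl PerEndpointFailures {
    fn new(threshold: u32) -> Self {
        Self { threshold, consecutive: HashMap::new() }
    }

    /// A success clears this endpoint's streak only.
    fn record_success(&mut self, endpoint: &str) {
        self.consecutive.remove(endpoint);
    }

    /// Returns true when this endpoint (and only this endpoint) should
    /// transition to the elevated connect timeout.
    fn record_failure(&mut self, endpoint: &str) -> bool {
        let count = self.consecutive.entry(endpoint.to_string()).or_insert(0);
        *count += 1;
        *count > self.threshold
    }
}
```

Keying by endpoint matches the existing per-endpoint shard management, so a connectivity problem in one region never forces the 5s timeout on healthy regions.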
- Remove direct mode non-goal (won't be implemented)
- Resolve connection failure threshold: >3 consecutive failures per endpoint triggers 1s to 5s escalation
- Old shards are actively marked unhealthy on escalation for immediate reclamation by health sweep
- Explicitly state per-endpoint failure tracking scope
- Remove resolved open questions 2 and 3

Co-authored-by: Copilot <[email protected]>
- ShardedHttpTransport is no longer 'future integration' — it exists in sharded_transport.rs with per-shard health sweeps
- Fix HttpClientFactory::create() → build()
- Fix HttpClientConfig connect_timeout claim — connect_timeout is sourced from ConnectionPoolOptions::max_connect_timeout() inside the factory build() method, not as a config field
- Reference existing per-request timeout mechanism in transport pipeline (azure_core::sleep racing the HTTP future)

Co-authored-by: Copilot <[email protected]>
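The per-request mechanism referenced above races a sleep against the HTTP future. The snippet below is a synchronous, std-only analogue of that race for illustration — the real pipeline does this asynchronously with `azure_core::sleep`, and `race_with_timeout` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Synchronous analogue of racing a per-attempt timeout against an HTTP
/// request: whichever side finishes first wins. Illustrative only; the
/// driver does this asynchronously, not with threads.
fn race_with_timeout<T: Send + 'static>(
    work: impl FnOnce() -> T + Send + 'static,
    timeout: Duration,
) -> Result<T, &'static str> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send result is ignored if the receiver has already timed out.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout).map_err(|_| "attempt timed out")
}
```

Because the bound is applied around each attempt rather than the whole operation, an escalating ladder can hand a different `timeout` value to every retry without touching the client's connect timeout.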
Dynamic Timeout Escalation Spec
Adds `DYNAMIC_TIMEOUTS_SPEC.md` to `sdk/cosmos/azure_data_cosmos_driver/docs/` describing a design for escalating connection and request timeouts on transport retries, rather than using a fixed timeout for every attempt.

Motivation
When a request times out and is retried with the same timeout, it often fails again. Escalating the timeout on each retry gives the operation a better chance of succeeding without making the initial attempt unnecessarily slow.
Key Design Decisions
- … (`DatabaseAccount` pattern)
- `ConnectionPoolOptions` min/max bounds still clamp the effective values

Cross-SDK Alignment
The Java SDK already implements timeout escalation for gateway/metadata HTTP calls (QueryPlan: 0.5s→5s→10s, DatabaseAccount: 5s→10s→20s). This spec extends the pattern to both data plane and metadata requests in the Rust driver.
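The Java ladders cited above can be expressed as tiered arrays whose selected value is clamped by configured pool bounds; the constants come from the PR description, while the helper and clamping are an illustrative sketch, not the driver's API.

```rust
use std::time::Duration;

// Ladder values from the Java SDK, as cited in the PR description.
const QUERY_PLAN_LADDER: [Duration; 3] = [
    Duration::from_millis(500),
    Duration::from_secs(5),
    Duration::from_secs(10),
];
const DATABASE_ACCOUNT_LADDER: [Duration; 3] = [
    Duration::from_secs(5),
    Duration::from_secs(10),
    Duration::from_secs(20),
];

/// Tier selection (illustrative): attempt 0 uses tier 0, later attempts
/// climb the ladder and stay at the top tier once it is exhausted; the
/// result is then clamped to ConnectionPoolOptions-style min/max bounds.
/// Assumes a non-empty ladder.
fn timeout_for_attempt(
    ladder: &[Duration],
    attempt: usize,
    min: Duration,
    max: Duration,
) -> Duration {
    let tier = attempt.min(ladder.len() - 1);
    ladder[tier].clamp(min, max)
}
```

Note how clamping interacts with the ladder: with a 1s minimum, the query plan ladder's 0.5s first tier is raised to 1s before use.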
Deferred
Open Questions
See §10 of the spec for discussion on per-attempt connection timeout mechanism, hedging interaction, and diagnostics reporting.