-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Bail out from barriers when barriers hit 410 Lease Not Found.
#47232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ilFastBarrierLeaseNotFound
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Lease Not Found.
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…ilFastBarrierLeaseNotFound # Conflicts: # sdk/cosmos/azure-cosmos/CHANGELOG.md
|
/azp run java - cosmos - tests |
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements fail-fast behavior for barrier requests that encounter 410 "Lease Not Found" errors in Strong Consistency scenarios. The key improvement is avoiding unnecessary retries when barriers hit persistent lease-not-found errors by performing a final barrier check on the primary replica after a forced address refresh.
Key Changes:
- Added logic in
QuorumReaderandConsistencyWriterto detectAvoidQuorumSelectionExceptionscenarios during barrier flows and bail out with appropriate errors (503 for reads, 408 for writes) - Removed conditional logic in
StoreReaderthat previously excluded barrier requests from fail-fast handling - Added comprehensive test coverage with new
multi-region-strongtest profile andExitFromConsistencyLayerTests
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| QuorumReader.java | Added fail-fast logic for read barriers encountering 410 errors, with primary replica validation |
| ConsistencyWriter.java | Added fail-fast logic for write barriers encountering 410 errors, with primary replica validation |
| StoreReader.java | Removed barrier request exclusion from fail-fast exception handling |
| StoreResponse.java | Added setHeaderValue method for test response manipulation |
| RntbdTransportClient.java | Added copy constructor for test infrastructure |
| ExitFromConsistencyLayerTests.java | New comprehensive test suite validating barrier bailout behavior |
| StoreResponseInterceptorUtils.java | Test utilities for injecting controlled failures during barrier flows |
| RntbdTransportClientWithStoreResponseInterceptor.java | Test wrapper enabling response interception |
| ReflectionUtils.java | Added helper method to access StoreReader from ConsistencyWriter |
| TestSuiteBase.java | Added multi-region-strong test group registration |
| live-platform-matrix.json | Added test configuration for multi-region strong consistency accounts |
| multi-region-strong.xml | TestNG suite configuration for new test group |
| pom.xml | Added Maven profile for multi-region-strong tests |
| CHANGELOG.md | Documented the optimization in Other Changes section |
...e-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResponse.java
Outdated
Show resolved
Hide resolved
...s/src/main/java/com/azure/cosmos/implementation/directconnectivity/RntbdTransportClient.java
Outdated
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/StoreResponseInterceptorUtils.java
Outdated
Show resolved
Hide resolved
...e-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResponse.java
Outdated
Show resolved
Hide resolved
...re-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/QuorumReader.java
Show resolved
Hide resolved
...smos/src/main/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriter.java
Show resolved
Hide resolved
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/ExitFromConsistencyLayerTests.java
Outdated
Show resolved
Hide resolved
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| </plugins> | ||
| </build> | ||
| </profile> | ||
| <profile> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[blocking]: wire up Netty bytebuf allocation tracking configs. cc: @FabianMeiswinkel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has been merged now - so, should be easy to copy the same settings
…into failFastBarrierLeaseNotFound # Conflicts: # sdk/cosmos/azure-cosmos-tests/pom.xml # sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java # sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResponse.java # sdk/cosmos/live-platform-matrix.json
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…into failFastBarrierLeaseNotFound
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run java - cosmos - ci |
|
Azure Pipelines successfully started running 1 pipeline(s). |
1 similar comment
|
Azure Pipelines successfully started running 1 pipeline(s). |
| final long targetGlobalCommittedLSN, | ||
| ReadMode readMode) { | ||
| ReadMode readMode, | ||
| MutableVolatile<CosmosException> cosmosExceptionValueHolder, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an opinonated NIT - so, feel free to ignore.
I would not have used MutableVolatile here - it is safe - and actually slightly more efficient than AtomicReference which I would have preferred. Why I would have used AtomicReference - we used it in many other places - to me consistency and simplicity outweighs the nitty efficieny improvement of MutableVolatile because we don't strictly need atmoic composite operations like INcrement or comapreAndSet
FabianMeiswinkel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Thanks!
Description
The objective of this pull request is to fail document operations when their enclosed barrier requests hit a 410
Lease Not Found. This will help with quicker cross region retries for reads and also curb resource utilization by bailing out of aggressive barrier retries on the client side.Approach
For reads:
Lease Not Found, one lastHeadrequest is performed on the primary replica post a force address refresh. If theHeadrequest on the primary also returns a 410Lease Not Foundthen enclosing request (a non-write request) is failed with a 503Lease Not Found. If a barrier from primary replica cannot be satisfied, the barrier attainment loop is reentered.sequenceDiagram autonumber participant Caller participant QuorumReader participant StoreReader participant BarrierHelper as BarrierRequestHelper Note over Caller,StoreReader: readQuorumAsync Flow Caller->>+QuorumReader: readQuorumAsync(entity, readQuorum, includePrimary, readMode) QuorumReader->>QuorumReader: Check if timeout elapsed QuorumReader->>QuorumReader: ensureQuorumSelectedStoreResponse() alt quorumSelectedStoreResponse is null QuorumReader->>+StoreReader: readMultipleReplicaAsync(entity, includePrimary, readQuorum) StoreReader-->>-QuorumReader: List<StoreResult> responses QuorumReader->>QuorumReader: Collect storeResponses for logging rect rgb(255, 200, 200) Note over QuorumReader,Caller: ⚠️ AvoidQuorumSelectionException Path alt Has AvoidQuorumSelectionException QuorumReader-->>Caller: ReadQuorumResult(QuorumNotPossibleInCurrentRegion) else responseCount < readQuorum QuorumReader-->>Caller: ReadQuorumResult(QuorumNotSelected) else isQuorumMet() returns true QuorumReader->>QuorumReader: Select response with max LSN QuorumReader-->>Caller: ReadQuorumResult(QuorumMet) else Need barrier (quorum not met) QuorumReader->>QuorumReader: Select readLsn, globalCommittedLSN, storeResult end end else quorumSelectedStoreResponse exists QuorumReader->>QuorumReader: Use cached readLsn, globalCommittedLSN, storeResult end Note over QuorumReader,BarrierHelper: Barrier Required Path QuorumReader->>+BarrierHelper: createAsync(entity, readLsn, globalCommittedLSN) BarrierHelper-->>-QuorumReader: barrierRequest QuorumReader->>QuorumReader: waitForReadBarrierAsync(barrierRequest, false, readQuorum, readLsn, globalCommittedLSN) Note over QuorumReader,StoreReader: waitForReadBarrierAsync Flow (Initial Loop) loop maxNumberOfReadBarrierReadRetries times QuorumReader->>QuorumReader: Check if timeout elapsed QuorumReader->>+StoreReader: readMultipleReplicaAsync(barrierRequest, allowPrimary, readQuorum, forceReadAll=true) StoreReader-->>-QuorumReader: List<StoreResult> responses rect rgb(255, 200, 200) Note over QuorumReader,StoreReader: ⚠️ AvoidQuorumSelectionException Path alt Has AvoidQuorumSelectionException QuorumReader->>QuorumReader: isBarrierMeetPossibleInPresenceOfAvoidQuorumSelectionException() QuorumReader->>QuorumReader: performBarrierOnPrimaryAndDetermineIfBarrierCanBeSatisfied() QuorumReader->>+StoreReader: readPrimaryAsync(entity, requiresValidLsn=true) StoreReader-->>-QuorumReader: StoreResult from primary alt Primary has required LSN and GlobalCommittedLSN QuorumReader->>QuorumReader: Set bailFromReadBarrierLoop=true, cosmosException=null Note over QuorumReader: Barrier Met - Exit Loop else Primary doesn't have required LSN QuorumReader->>QuorumReader: Set bailFromReadBarrierLoop=true, cosmosException=SERVICE_UNAVAILABLE Note over QuorumReader: Fail Fast - Exit Loop end else Responses have sufficient LSN QuorumReader->>QuorumReader: Check: count(lsn >= readBarrierLsn) >= readQuorum QuorumReader->>QuorumReader: Check: maxGlobalCommittedLsn >= targetGlobalCommittedLSN (if applicable) alt Both conditions met Note over QuorumReader: Barrier Met - Exit Loop else Conditions not met QuorumReader->>QuorumReader: Update maxGlobalCommittedLsn QuorumReader->>QuorumReader: Decrement readBarrierRetryCount alt readBarrierRetryCount == 0 Note over QuorumReader: Single-region retries exhausted else readBarrierRetryCount > 0 QuorumReader->>QuorumReader: Delay delayBetweenReadBarrierCallsInMs Note over QuorumReader: Continue retry loop end end end end end Note over QuorumReader,StoreReader: Multi-Region Barrier (if targetGlobalCommittedLSN > 0) alt targetGlobalCommittedLSN > 0 AND initial barrier failed loop maxBarrierRetriesForMultiRegion times QuorumReader->>QuorumReader: Check if timeout elapsed QuorumReader->>+StoreReader: readMultipleReplicaAsync(barrierRequest, allowPrimary, readQuorum, forceReadAll=true) StoreReader-->>-QuorumReader: List<StoreResult> responses rect rgb(255, 200, 200) Note over QuorumReader,StoreReader: ⚠️ AvoidQuorumSelectionException Path alt Has AvoidQuorumSelectionException QuorumReader->>QuorumReader: isBarrierMeetPossibleInPresenceOfAvoidQuorumSelectionException() Note over QuorumReader: Same as above else Multi-region barrier conditions met QuorumReader->>QuorumReader: Check: count(lsn >= readBarrierLsn) >= readQuorum QuorumReader->>QuorumReader: Check: maxGlobalCommittedLsn >= targetGlobalCommittedLSN alt Both conditions met Note over QuorumReader: Global Strong Barrier Met else Conditions not met QuorumReader->>QuorumReader: Update maxGlobalCommittedLsn QuorumReader->>QuorumReader: Decrement readBarrierRetryCountMultiRegion alt readBarrierRetryCountMultiRegion == 0 Note over QuorumReader: Multi-region retries exhausted else readBarrierRetryCountMultiRegion > maxShortBarrierRetriesForMultiRegion QuorumReader->>QuorumReader: Delay barrierRetryIntervalInMsForMultiRegion else QuorumReader->>QuorumReader: Delay shortBarrierRetryIntervalInMsForMultiRegion end end end end end end QuorumReader->>QuorumReader: waitForReadBarrierAsync returns Boolean alt bailOnBarrier AND cosmosException != null QuorumReader-->>Caller: ReadQuorumResult(QuorumNotPossibleInCurrentRegion, failFastException) else waitForReadBarrier == false QuorumReader-->>Caller: ReadQuorumResult(QuorumSelected, readLsn, globalCommittedLSN) else waitForReadBarrier == true QuorumReader-->>-Caller: ReadQuorumResult(QuorumMet, readLsn, globalCommittedLSN) endFor writes:
Lease Not Found, one lastHeadrequest is performed on the primary replica post a force address refresh. If theHeadrequest on the primary also returns a 410Lease Not Foundthen the enclosing request (a write request) is failed with a 408Lease Not Found. If a barrier from primary replica cannot be satisfied, the barrier attainment loop is reentered.sequenceDiagram autonumber participant Caller participant ConsistencyWriter participant StoreReader participant BarrierHelper as BarrierRequestHelper Note over Caller,StoreReader: barrierForGlobalStrong Flow Caller->>+ConsistencyWriter: barrierForGlobalStrong(request, response) ConsistencyWriter->>ConsistencyWriter: Check if ReplicatedResourceClient.isGlobalStrongEnabled() ConsistencyWriter->>ConsistencyWriter: Check isGlobalStrongRequest(request, response) alt Not a Global Strong Request ConsistencyWriter-->>Caller: Return original response else Is Global Strong Request ConsistencyWriter->>ConsistencyWriter: getLsnAndGlobalCommittedLsn(response) alt lsn == -1 OR globalCommittedLsn == -1 ConsistencyWriter-->>Caller: Error: GoneException (SERVER_GENERATED_410) else Valid LSN values ConsistencyWriter->>ConsistencyWriter: Cache response in request.requestContext.globalStrongWriteResponse ConsistencyWriter->>ConsistencyWriter: Set request.requestContext.globalCommittedSelectedLSN = lsn ConsistencyWriter->>ConsistencyWriter: Set forceRefreshAddressCache = false alt globalCommittedLsn >= lsn Note over ConsistencyWriter: No barrier needed - already globally committed ConsistencyWriter-->>Caller: Return globalStrongWriteResponse else globalCommittedLsn < lsn Note over ConsistencyWriter: Barrier Required - Write region ahead of read regions ConsistencyWriter->>+BarrierHelper: createAsync(request, null, globalCommittedSelectedLSN) BarrierHelper-->>-ConsistencyWriter: barrierRequest ConsistencyWriter->>ConsistencyWriter: waitForWriteBarrierAsync(barrierRequest, globalCommittedSelectedLSN) Note over ConsistencyWriter,StoreReader: waitForWriteBarrierAsync Flow loop MAX_NUMBER_OF_WRITE_BARRIER_READ_RETRIES times ConsistencyWriter->>ConsistencyWriter: Check if timeout elapsed ConsistencyWriter->>+StoreReader: readMultipleReplicaAsync(barrierRequest, allowPrimary=true, readQuorum=1) StoreReader-->>-ConsistencyWriter: List<StoreResult> responses rect rgb(255, 200, 200) Note over ConsistencyWriter,StoreReader: ⚠️ AvoidQuorumSelectionException Path alt Has AvoidQuorumSelectionException ConsistencyWriter->>ConsistencyWriter: isBarrierMeetPossibleInPresenceOfAvoidQuorumSelectionException() ConsistencyWriter->>ConsistencyWriter: performBarrierOnPrimaryAndDetermineIfBarrierCanBeSatisfied() ConsistencyWriter->>ConsistencyWriter: Set forceRefreshAddressCache = true ConsistencyWriter->>+StoreReader: readPrimaryAsync(entity, requiresValidLsn=true) StoreReader-->>-ConsistencyWriter: StoreResult from primary alt Primary has required GlobalCommittedLSN ConsistencyWriter->>ConsistencyWriter: Set bailFromWriteBarrierLoop=true, cosmosException=null Note over ConsistencyWriter: Barrier Met - Exit Loop else Primary doesn't have required GlobalCommittedLSN ConsistencyWriter->>ConsistencyWriter: Set bailFromWriteBarrierLoop=true ConsistencyWriter->>ConsistencyWriter: Set cosmosException=REQUEST_TIMEOUT with original subStatusCode Note over ConsistencyWriter: Fail Fast - Exit Loop end else Check responses for barrier satisfaction alt Any response.globalCommittedLSN >= selectedGlobalCommittedLsn Note over ConsistencyWriter: Barrier Met - Return true else No sufficient globalCommittedLSN ConsistencyWriter->>ConsistencyWriter: Update maxGlobalCommittedLsnReceived ConsistencyWriter->>ConsistencyWriter: Set forceRefreshAddressCache = false ConsistencyWriter->>ConsistencyWriter: Decrement writeBarrierRetryCount alt writeBarrierRetryCount == 0 Note over ConsistencyWriter: All retries exhausted else writeBarrierRetryCount > MAX_SHORT_BARRIER_RETRIES ConsistencyWriter->>ConsistencyWriter: Delay DELAY_BETWEEN_WRITE_BARRIER_CALLS_IN_MS (30ms) Note over ConsistencyWriter: Continue retry loop else writeBarrierRetryCount <= MAX_SHORT_BARRIER_RETRIES ConsistencyWriter->>ConsistencyWriter: Delay SHORT_BARRIER_RETRY_INTERVAL_IN_MS (10ms) Note over ConsistencyWriter: Continue retry loop (short interval) end end end end end ConsistencyWriter->>ConsistencyWriter: waitForWriteBarrierAsync returns Boolean alt Barrier satisfied (true) ConsistencyWriter-->>Caller: Return globalStrongWriteResponse else Barrier not satisfied (false) alt cosmosException != null ConsistencyWriter-->>Caller: Error: cosmosException (REQUEST_TIMEOUT) else No exception ConsistencyWriter-->>Caller: Error: GoneException (GLOBAL_STRONG_WRITE_BARRIER_NOT_MET) end end end end end deactivate ConsistencyWriterNOTE
This PR does not bail out barrier requests when a quorum of document requests could not be achieved and read is being performed on purely the primary replica (
QuorumNotSelectedflow).Fixes
#46135
All SDK Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines