
Address Refresh Cross Regional Retries #4979

Open
NaluTripician opened this issue Jan 23, 2025 · 0 comments · May be fixed by #5017
Labels: bug (Something isn't working)


Address Refresh Cross Regional Retries

Following the 12/26 outage of SCUS, analysis of the impact on several customers revealed a gap in the SDK's retry logic, specifically in the cross-regional retries for address refresh calls. This document outlines the impact of the gap and the proposed solution.

Impact

During a regional outage, if the SDK attempts an address refresh call while the primary region is down, it currently does not retry the call in the secondary region. This is because the SDK throws a task cancelled exception when the address refresh times out, and it currently has no logic to catch this exception and treat it as a timeout. For other timeouts, the SDK wraps the exception in a 503 and, once it reaches the RetryLayer, the ClientRetryPolicy retries the call in the secondary region if one is available.

Proposed Solution

The proposed solution is to catch the OperationCanceledException, as well as any 410s (Timeouts), and wrap them in a 503. This allows the ClientRetryPolicy to retry the call in the secondary region if one is available.
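
To make the intended control flow concrete, here is a minimal, language-neutral sketch in Python. It is not the SDK's code. Every name in it (`OperationCanceledError`, `GoneError`, `ServiceUnavailableError`, `refresh_addresses`, `refresh_with_regional_retry`) is a hypothetical stand-in, and the loop over regions only approximates what the ClientRetryPolicy would do.

```python
class OperationCanceledError(Exception):
    """Hypothetical stand-in for the cancellation raised when a refresh times out."""


class GoneError(Exception):
    """Hypothetical stand-in for a 410 response."""
    status_code = 410


class ServiceUnavailableError(Exception):
    """Hypothetical stand-in for a 503 response."""
    status_code = 503


def refresh_addresses(fetch, region):
    """Run one address refresh and surface cancellations and 410s as a 503."""
    try:
        return fetch(region)
    except (OperationCanceledError, GoneError) as exc:
        # Wrapping is the proposed change: the retry layer only fails over on 503.
        raise ServiceUnavailableError(f"address refresh failed in {region}") from exc


def refresh_with_regional_retry(fetch, regions):
    """Simplified stand-in for the retry policy: on a 503, try the next region."""
    if not regions:
        raise ValueError("no regions configured")
    last_error = None
    for region in regions:
        try:
            return refresh_addresses(fetch, region)
        except ServiceUnavailableError as exc:
            last_error = exc  # remember the failure and move to the next region
    raise last_error
```

With a `fetch` callable that raises `OperationCanceledError` for the primary region and returns addresses for the secondary, `refresh_with_regional_retry(fetch, ["primary", "secondary"])` returns the secondary region's addresses. Without the wrapping step the cancellation would propagate past the retry loop untouched, which is the gap described above.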

Impact of the Solution

This change might affect behavior when the SDK is used with the compute gateway. Further investigation is needed to determine the impact. One possible mitigation is to gate the new behavior behind a flag that must be set to enable it. This flag would be internal and not accessible to external customers.
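
One way such a gate could look is sketched below in Python, for illustration only. The issue does not say how the flag would be exposed, so the environment variable and its name (`EXAMPLE_ADDRESS_REFRESH_503_ENABLED`) are assumptions, as are the helper and exception names. The sketch defaults to the current behavior and only wraps the cancellation when the flag is explicitly enabled.

```python
import os

# Hypothetical flag name; the issue does not specify how the flag is exposed.
FLAG_NAME = "EXAMPLE_ADDRESS_REFRESH_503_ENABLED"


class OperationCanceledError(Exception):
    """Hypothetical stand-in for the cancellation raised when a refresh times out."""


class ServiceUnavailableError(Exception):
    """Hypothetical stand-in for a 503 response."""
    status_code = 503


def is_wrap_enabled(environ=None):
    """Return True only when the flag is explicitly set to '1' or 'true'."""
    env = os.environ if environ is None else environ
    return env.get(FLAG_NAME, "false").strip().lower() in ("1", "true")


def refresh_addresses(fetch, region, wrap_enabled):
    """Run one address refresh; wrap cancellations in a 503 only if enabled."""
    try:
        return fetch(region)
    except OperationCanceledError as exc:
        if not wrap_enabled:
            raise  # flag off: keep the current behavior unchanged
        raise ServiceUnavailableError(f"address refresh failed in {region}") from exc
```

Defaulting to off keeps the compute gateway path untouched until the investigation is complete; the trade-off is that the fix gives no protection until the flag is turned on.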

Testing

Testing will be done with the FaultInjection library, which needs metadata request support added before this fix can be tested. See #4795 for more information.
