Componentize RetryingClient and RetryingRpcClient
#6292
Conversation
- bundle parameters of private methods into a `RetryingContext`
- extract attempt execution
- improve naming of variables
force-pushed 577e9d5 to 4aaa678
Codecov Report

❌ Patch coverage is

```
@@             Coverage Diff              @@
##               main    #6292      +/-   ##
============================================
- Coverage     74.46%   74.05%    -0.41%
- Complexity    22234    23072      +838
============================================
  Files          1963     2068      +105
  Lines         82437    86556     +4119
  Branches      10764    11419      +655
============================================
+ Hits          61385    64101     +2716
- Misses        15918    16945     +1027
- Partials       5134     5510      +376
```
```java
final RetryConfig<HttpResponse> config = retryingContext.config();
final ClientRequestContext ctx = retryingContext.ctx();
final HttpRequestDuplicator reqDuplicator = retryingContext.reqDuplicator();
final HttpRequest req = retryingContext.req();
final HttpResponse res = retryingContext.res();
```
Re-allocating the fields to the local variables looks redundant.
What do you think of moving doExecute0() to RetryingContext and directly execute the method?
```java
new RetryingContext(ctx, mappedRetryConfig(ctx),
                    req, reqDuplicator, res, resFuture).execute();
```
> Re-allocating the fields to the local variables looks redundant.
Yes, they are redundant; their sole purpose is to shorten the field accesses to RetryingContext.
> What do you think of moving doExecute0() to RetryingContext and directly execute the method?
...together with all the helper methods (e.g. handleAggregatedResponse), right? Let me try it out and see how it looks.
> ...together with all the helper methods (e.g. handleAggregatedResponse), right? Let me try it out and see how it looks.
👍 I had a similar thought, though it wasn’t very specific. I imagined RetryingClient spawns a new RetryContext and delegates it to execute requests and gather results.
I think I understand where you're heading. I also believe that encapsulating a retry attempt and strongly hiding all internals is a good idea. I've pushed a design. Instead of moving everything into RetryingContext, I kept the constant/global data in RetryingContext and moved everything attempt-related into a new class, Attempt.
Please let me know what you think @ikhoon. FYI: I didn't strive for green tests, just enough to have a discussion.
I like the current approach.
- Would you move `RetryingContext` and `Attempt` to top-level classes?
- Some methods, such as `abort()`, which only access the fields of `RetryingContext`, may be moved into `RetryingContext`.
cc @jrhee17. Thanks for the review, I am also curious what you think about the two points above.
What about deriving an interface RetryingContext and provide HttpRetryingContext and RpcRetryingContext for the respective clients?
I understand the current functionality of RetryingContext seems to be to 1) schedule retry attempts 2) handle the decision of a retry attempt.
```java
class RetryingContext {
    CompletableFuture<Boolean> init();
    RetryAttempt newRetryAttempt();
    void commit(RetryAttempt attempt);
    void abort(Throwable cause);
}
```

I think the above basic functionalities can also be used from RetryingRpcClient.
I'm unsure if inheritance or composition is best, but overall I agree that unifying the logic between the two retrying clients seems useful for implementing hedging.
Another thought I had is to unify state(ctx) with RetryingContext so RetryingContext knows about the deadline and last backoff. For hedging later on, it would also know about all the pending attempts to be able to properly commit and abort when the winning attempt is determined. With that, we no longer have two places of global state.
I agree there is no need for maintaining two states at different locations, and I agree State can be a local field of RetryingContext instead of ctx#attr.
Sorry for the late review. I agree with refactoring RetryingClient logic into separate classes, such as HttpRetryingContext, but I will check whether adding a lock is truly necessary. I may add some commits myself or leave comments on that.
- RetryingClient is broken, just for discussion purposes
…ngContext - put RetryingContext and RetryAttempt into separate classes
force-pushed f51b78c to b31c646
- also do some cleanup in RetryAttempt
- provide a copy of response with content, otherwise we will have a double subscription
jrhee17 left a comment
Thanks for the proposal, left some thoughts on the direction of the PR
```java
}

CompletableFuture<Void> execute() {
    assert state == State.INITIALIZED;
```
I prefer we remove the state-related assertions because:
- Assertions that run asynchronously will be mostly swallowed, so users won't be able to know why an assertion failed most of the time.
- The additional state bookkeeping isn't thread-safe. However, I think certain methods (e.g. `RetryContext#abort`) can be called from multiple threads, which makes it difficult to keep track. I imagine this will become more of an issue when hedging is introduced.
Personally, I find asserts invaluable as they
- document the (otherwise implicit!) assumptions we have about a piece of code
- let the program crash early at the point of violation, without letting the program run in a corrupt state

I am wondering about this matter, as in the codebase I can find several occurrences of asserts in async settings (e.g. ) and actually also in combination with the state machine pattern:
> The assertions run asynchronously will be mostly swallowed, so users won't be able to know why an assertion failed most of the time.
I'm unsure what you mean by swallowing errors, could you expand on that?
> The additional state bookkeeping isn't thread-safe. However, I think certain methods (e.g. RetryContext#abort) can be called from multiple threads which makes it difficult to keep track. I imagine this will become more of an issue when hedging is introduced.
Indeed, I still need to check synchronization-safety, but this is not due to the state bookkeeping, right?

On that point: I am very curious what your/Armeria's stance is on synchronization in the framework, given Armeria's thread model. As an alternative to using synchronization primitives, we could use thread confinement, enforcing that only a specific event loop is executing a specific piece of code. I imagine for RetryingContext this would mean that solely ctx.eventLoop() is allowed to execute code, whereas for an attempt it is attemptCtx.eventLoop(). Then, every time we need to do a switch, we would want to reschedule on the appropriate event loop. For example, when the attempt is resolving its completeness promise, we reschedule from the attempt event loop to the main event loop:
```java
attempt.whenComplete().handle((unused, unexpectedAttemptCause) -> {
    if (unexpectedAttemptCause != null) {
        assert attempt.state() == RetryAttempt.State.ABORTED;
        rctx.abort(unexpectedAttemptCause);
        return null;
    }
    ...
});
```

I would confine the handler to the original request event loop like so:

```java
attempt.whenComplete().handleAsync((unused, unexpectedAttemptCause) -> {
    if (unexpectedAttemptCause != null) {
        assert attempt.state() == RetryAttempt.State.ABORTED;
        rctx.abort(unexpectedAttemptCause);
        return null;
    }
    ...
}, rctx.ctx().eventLoop());
```
> Personally, I find asserts invaluable as they
> document the (otherwise implicit!) assumptions we have about a piece of code
> let the program crash early at the point of violation without letting the program run in a corrupt state
>
> I am wondering about this matter as in the codebase I can find several occurrences of asserts in async settings (e.g. ) and actually also in combination with state machine pattern:
Sorry, let me rephrase. I'm fine with introducing states in general. I don't think RetryAttempt needs to maintain a separate state as it is relatively simple.
> I'm unsure what you mean by swallowing errors, could you expand on that?
I was thinking of the following pattern which seems to be used often in this class.
```java
CompletableFuture cf = new CompletableFuture();
someAsyncMethod().handle((val, cause) -> {
    // the returned cf won't reflect the failed assertion
    assert state == State.Expected; // assertion without a try..catch
    cf.complete(val);
});
return cf;
```

Unless the thread defines an UncaughtExceptionHandler, I guessed that users won't be able to receive a signal that an assertion failed.
I see what you mean with the assertions. I will keep an eye on that and assure that the assertion errors are properly bubbling up into the futures.
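One common way to make such failures visible is to catch inside the handler and complete the future exceptionally. The sketch below illustrates only the general `CompletableFuture` pattern; the class and helper names are made up and are not part of the Armeria code under discussion:

```java
import java.util.concurrent.CompletableFuture;

class AssertionPropagation {
    // Hypothetical helper: runs a state check inside a handle() callback and
    // routes any Throwable (including an AssertionError) into the returned
    // future instead of letting handle() swallow it.
    static CompletableFuture<Integer> checkedHandle(CompletableFuture<Integer> upstream,
                                                    boolean stateIsValid) {
        final CompletableFuture<Integer> cf = new CompletableFuture<>();
        upstream.handle((val, cause) -> {
            try {
                if (!stateIsValid) {
                    // stands in for a failed `assert state == State.Expected`
                    throw new AssertionError("unexpected state");
                }
                cf.complete(val);
            } catch (Throwable t) {
                cf.completeExceptionally(t); // now bubbles up to cf's consumers
            }
            return null;
        });
        return cf;
    }
}
```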
> On that point: I am very curious what your/Armeria's stance is on synchronization in the framework, given Armeria's thread model. [...]
What are your/Armeria's thoughts on that matter? For example, do I see it correctly that the call to HttpResponse.abort MUST be made by the event loop the response is associated with, as the response handlers expect that?
> Indeed I still need to check synchronization-safety but this is not due to the state bookkeeping, right?
I was simply pointing out that:
- The Armeria constructs being used (i.e. `RequestLog#endRequest`, `HttpResponse#abort`, etc.) are already guarded by primitive synchronizations, e.g. `armeria/core/src/main/java/com/linecorp/armeria/common/stream/DefaultStreamMessage.java`, line 197 in 7aef07a: `private void abort0(@Nullable Throwable cause) {`
- On the other hand, `RetryingContext#state` doesn't seem to be guarded by any synchronizations.

Due to the introduction of a new state instead of using the previously existing constructs (RequestLog, HttpResponse), we may need to introduce a new synchronization method.
> we could use thread confinement/enforcing that only a specific eventloop is executing a specific piece of code
My intuition is that, given that this logic is on the request path, it would be nice if we can minimize event loop rescheduling to reduce latency, as done in the other constructs mentioned above (since event loops are shared with other requests).
Having said this, I'm not 100% sure which synchronization method is best before actually going through the implementation.
```java
    ABORTED
}

private State state;
```
While I like the idea of introducing RetryAttempt, I imagine the functionality of this class to:
- Represent a single request execution. Hence, the `ctx` and `res` are fixed and final.
- `RetryAttempt` is responsible for storing the result of an execution. Bookkeeping for the child log (`ctx.log`) is done in this class as well.

On the other hand, I don't think RetryAttempt necessarily should be responsible for enqueueing a request. What do you think of allowing RetryingContext to have more responsibility on scheduling retry requests?
e.g. RetryingContext can be responsible for enqueueing a new request

```java
class RetryingClient
    private void doExecuteAttempt(RetryingContext rctx) {
        ...
        rctx.newRetryAttempt().handle((retryAttempt, cause) -> {...})
```

Where the method signature could look something like the following:

```java
class RetryingContext {
    CompletableFuture<RetryAttempt> newRetryAttempt() {
        ....
        HttpResponse res = executeAttemptRequest(ctx, number, delegate);
        ....
    }
```

And RetryAttempt can contain final fields:
```java
class RetryAttempt {
    ...
    private final ClientRequestContext ctx;
    private final HttpResponse res;
    @Nullable
    private final HttpResponse truncatedRes;
    @Nullable
    private final Throwable cause;
```
private final Throwable cause;This way, I think the responsibility of each class could become more clear - and we can worry less about race conditions.
Let me know what you think.
This might cut the INITIALIZED state from a RetryAttempt, as it then starts as EXECUTED, so I like your idea. Let me implement that.
```java
// The request or response has been aborted by the client before it receives a response,
// so stop retrying.
if (req.whenComplete().isCompletedExceptionally()) {
    state = State.COMPLETING;
    req.whenComplete().handle((unused, cause) -> {
        abort(cause);
        return null;
    });
    return true;
}

if (res.isComplete()) {
    state = State.COMPLETING;
    res.whenComplete().handle((result, cause) -> {
        final Throwable abortCause;
        if (cause != null) {
            abortCause = cause;
        } else {
            abortCause = AbortedStreamException.get();
        }
        abort(abortCause);
        return null;
    });
    return true;
}
```
Given that these blocks are simply adding a callback to req.whenComplete() / res.whenComplete(), I wonder if this can be moved to init() instead. The fact that this is called inside isCompleted(), which seems to simply return a state, was a little surprising to me.
…edRequest to AbstractRetryingClient - also move out setting the deadline, saving deadlineTimeNanos() from RetriedRequest
force-pushed f35c70a to 2f4b698
force-pushed cb5f41b to 588761b
force-pushed 588761b to df10849
force-pushed 26d9797 to c5b883b
```java
retryConfig != null ? retryConfig
                    : requireNonNull(retryMapping.get(ctx, req),
                                     "retryMapping.get() returned null");
final RetryContext rctx = newRetryContext(unwrap(), ctx, req, config);
```
I feel like this PR is attempting to solve three problems: 1) improving readability 2) unifying http/rpc logic 3) refactoring for hedging.
While I like the proposed end-design, given that the previous RetryingClient was already pretty complex and most users are using this feature, it is difficult (at least for me) to review and ensure that there aren't any regressions due to the large amount of changes.
What do you think of focusing on 3) refactoring for hedging specifically in a single PR? (it could be this PR or a separate PR)
Specifically, I like the idea of having a dedicated RetryContext. The RetryContext (or RetryScheduler/RetryExecutor) would be responsible for scheduling retries and completing the overall retry attempt.
```java
class ExecutionResult {
    private final RetryDecision retryDecision;
    @Nullable
    private final Throwable cause;
    @Nullable
    private final HttpResponse originalRes;
    @Nullable
    private final ClientRequestContext derivedCtx;
    ...
}

class RetryContext {
    // Pretty much the same as the current `RetryingClient#doExecute0`
    CompletableFuture<ExecutionResult> doExecute()

    // Functions the same as `RetryingClient#handleException`
    void handleException(Throwable cause)

    // Calls https://github.com/line/armeria/blob/1051e9650432a90f0d4ee8e296f6d37166ec4710/core/src/main/java/com/linecorp/armeria/client/retry/RetryingClient.java#L541-L544
    void complete(HttpResponse res)
}

private void doExecute0(RetryContext retryContext) {
    final CompletableFuture<ExecutionResult> executionRes = retryContext.doExecute();
    executionRes.thenAccept(executionResult -> {
        handleRetryDecision(executionResult, retryContext);
    });
}
```

This would also minimize diffs so that other maintainers can easily review the changeset.
Later on when implementing hedging, I think a single ExecutionResult can be selected, and the rest can be cancelled using ExecutionResult#derivedCtx#cancel (possibly the scheduled future can also be added to the ExecutionResult and could be cancelled as well).
Once the above is implemented and merged, I think RetryContext can then be further abstracted to RetriedRequest, RetryScheduler like done in this PR. Let me know what you think.
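The "select one winner, cancel the rest" idea above can be sketched with plain CompletableFutures standing in for the attempts; this is an illustration of the pattern only, and `HedgingSelect`/`selectFirst` are made-up names, not the proposed Armeria API:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class HedgingSelect {
    // Completes the returned future with the first attempt that finishes and
    // cancels all remaining attempts (analogous to derivedCtx.cancel() on the
    // pending ExecutionResults).
    static CompletableFuture<String> selectFirst(List<CompletableFuture<String>> attempts) {
        final CompletableFuture<String> winner = new CompletableFuture<>();
        for (CompletableFuture<String> attempt : attempts) {
            attempt.thenAccept(result -> {
                if (winner.complete(result)) {
                    // Only the first completion gets here; cancel the losers.
                    // cancel() on the already-completed winner is a no-op.
                    attempts.forEach(a -> a.cancel(false));
                }
            });
        }
        return winner;
    }
}
```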
Hey, thank you for your proposal @jrhee17.
I fear that during the discussions and changes, we shifted toward a slightly broader goal than initially intended. In the beginning, if you remember, I focused on readability, specifically by removing parameter passing in the client. That led me to a solution very similar to what you are suggesting now. Back then, I tried to minimize refactoring overhead. However, as @ikhoon noted, this left the roles (scheduling, execution, controlling, etc.) mixed inside RetryContext, which I now see as the main problem of the retrying classes.
Because of that, splitting the PR into two or more smaller PRs — first introducing RetryContext bundling and then extracting components — would not help much, as most of the complexity comes from the refactoring itself.
I agree this PR might be challenging to review due to its length. However, since the retrying code is now extracted into clear components (or so I hope 😄), it should be possible to review step by step.
To assist reviewing, I updated the PR description to clearly state the goals and included a small component diagram and descriptions in bottom-up fashion.
For the review, I suggest starting with the control flow — AbstractRetryingClient and how it interacts with the interfaces.
After that, I would be glad to get feedback on how to simplify the API further.
If you agree with the high-level API, you can then look at the implementation in any order (DefaultRetryScheduler, RetryCounter, Retried*Request, Retry*Attempt).
For RetriedHttpRequest and RetryingHttpAttempt, it is best to have the old RetryingClient open for comparison, and for RetriedRpcRequest and RetryingRpcAttempt, compare with RetryingRpcClient.
Except for minor changes, they should contain the same code, just reorganized differently.
Can I do anything else to help with the review, @jrhee17?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see.
I think each design has its pros and cons. I agree the as-is code is difficult to reason about due to the large number of parameters and complex functionality. However, I also think we didn't have to worry about concurrency (because everything was just passed via parameters), so at least the flow was clear.
With the proposed design, I think the fact that each component is maintaining a separate state - and the effort made to synchronize the state across different components is what is making it difficult for me to review.
I'm curious if there is any way state management can be isolated to a single component.
Back then, I tried to minimize refactoring overhead.
If a simple refactoring effort is done first, I thought at the very least hedging (which we are all excited about) can gain some traction - paving ways for further refactoring which I am all for as well.
Having said this, this is just my opinion so I'm curious of your and other maintainers' opinions as well.
OK, I thought about how to best split this refactoring and came up with a battle plan you can find below, while having this PR as a "north star" of where we want to go. Technically, we can implement hedging already after PR 2 [1], so we may not need to go the full way. Let me know if it makes sense to you @jrhee17
[1]: Because we need to have a central place dealing with racing attempts trying to schedule the next retry task.
PR 1 - Extract counting state and remove methods from
Thanks for understanding.
I doubt there are any users actually rolling their own
Sounds good to me.
Just to make sure we're on the same page, I understood
👍
👍
I didn't really understand the motivation of this, but I assume this is due to different handling of

Also, just to make sure I understand the idea correctly - I'm curious where the actual client call (

Overall, I have no objection with step 1. Also, for quick iterations I don't think adding tests is necessary as long as mechanical refactoring is done without a different threading model.
It would be really nice to decide early that we can make it private so that we can just forget about the effects of refactoring on users that might inherit from this class. Wdyt @ikhoon; is making
At this step, RetryAttempt will just be a value class containing attempt-related fields, correct. Extracting the execution logic is something for PR 3.
It is part of the overall process of removing the inheritance bond between
Similar to . cc @jrhee17
I see, I'm not particularly pro or against keeping/removing
I guess I imagined that given e.g.

```java
val retryCtx = new RetryContext(ctx, reqDuplicator, req, res, responseFuture, unwrap())
...
// pretty much does the same thing as the current `RetryingClient#doExecute0`
retryCtx.execute()
// and gives an easy segue to hedging
scheduleAtRate(() -> retryCtx.execute())
```

Having said this, I understood your idea as follows (which I'm also fine with):

```java
private void doExecute0(RetryContext retryCtx) {
    // access parameters like retryCtx.originalCtx, retryCtx.reqDuplicator, etc. as needed
    // while keeping the current structure
}

// later on for hedging
scheduleAtRate(doExecute0(retryCtx))
```
First PR ready for review: #6411

Motivation
In the context of implementing hedging in #6252, I found it hard to understand and extend `AbstractRetryingClient`, `RetryingClient`, and `RetryingRpcClient`.

Currently, `AbstractRetryingClient` manages `ctx.attr(STATE)`, while `Retrying(Rpc)Client` maintains a "backpack state" by passing parameters around internal methods. The roles and differences between these two states are unclear.

Additionally, `AbstractRetryingClient` splits delay calculation and scheduling into separate methods, even though they are tightly coupled, which makes safe use of this API difficult.

Lastly, since most of the code resides in `Retrying(Rpc)Client`, it is hard to focus on one aspect of the retry process, as there is insufficient encapsulation.

Altogether, this makes it hard to extend (e.g., with hedging or retry throttling) and customize (e.g., with a custom retry scheduler) retrying in Armeria.
Modifications

tl;dr

We split up retrying into the following components:

Here `AbstractRetryingClient` stays as the central component, being the driver of retrying. It interacts with `RetriedRequest` to execute, abort and commit `Retry*Attempt`s. It uses a dedicated `RetryScheduler` to schedule while respecting the request deadline and potential backoffs received from remote peers. All components are retrieved in a `RetryContext` from the `Retry(Rpc)Client`s through `newRetryContext`, which acts as a central extension point in which users are free to pick the implementation for the respective interfaces (dashed in the diagram).

Components

Let me describe the roles of the components bottom-up. More detailed explanations can be found in the doc comments.
Retry(Rpc|Http)Attempt

Encapsulates a single RPC/HTTP attempt, from its execution up to the point where its response is decided upon by the `RetryRule`. The resulting `RetryDecision` is returned to the `Retried(Rpc)Request`, which has full ownership of the `Retry(Rpc|Http)Attempt`.

After returning the `RetryDecision`, a `Retry(Rpc|Http)Attempt` can either be aborted or committed. Aborting an attempt discards its response, as it is not selected as the final response. Committing an attempt marks it as selected as the final response. A full state diagram can be found here.
Retried(Rpc)Request

Based on the original request, it provides methods to execute, commit, and abort `Retry(Rpc|Http)Attempt`s. It has full ownership of all attempts and does not expose them. In particular, if a user wants to abort or commit an attempt, they do so by specifying the attempt number.

RetryCounter

Tracks the number of attempts made, both in total and per backoff.
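As a rough illustration of the "total and per backoff" bookkeeping, consider the sketch below. The class and method names are my assumptions for illustration, not the actual `RetryCounter` API in the PR:

```java
// Illustrative only: counts attempts overall and per currently-selected
// backoff, restarting the per-backoff count whenever the backoff changes.
final class RetryCounterSketch {
    private int totalAttempts;
    private int attemptsWithCurrentBackoff;
    private Object currentBackoff;

    void recordAttempt(Object backoff) {
        totalAttempts++;
        if (backoff != currentBackoff) {
            // A different backoff was selected; restart the per-backoff count.
            currentBackoff = backoff;
            attemptsWithCurrentBackoff = 0;
        }
        attemptsWithCurrentBackoff++;
    }

    int totalAttempts() { return totalAttempts; }
    int attemptsWithCurrentBackoff() { return attemptsWithCurrentBackoff; }
}
```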
RetryScheduler

Schedules retries on the retry event loop while respecting the response deadline and the backoff intervals returned by endpoints.
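The core arithmetic such a scheduler has to do can be sketched as follows. This is only an illustration of the deadline/backoff reconciliation described above; the names and exact semantics are assumptions, not the `RetryScheduler` interface from the PR:

```java
// Illustrative: picks the next retry delay, preferring a server-provided
// hint (e.g. a Retry-After style backoff) over the locally computed backoff,
// and gives up (-1) when waiting would exceed the remaining response deadline.
final class RetryDelays {
    static long nextDelayMillis(long backoffDelayMillis,
                                long serverBackoffMillis, // < 0 when absent
                                long remainingUntilDeadlineMillis) {
        final long delay = serverBackoffMillis >= 0 ? serverBackoffMillis
                                                    : backoffDelayMillis;
        return delay <= remainingUntilDeadlineMillis ? delay : -1;
    }
}
```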
AbstractRetryingClient

Calls `newRetryContext` from `Retrying(Rpc)Client` to receive all components required to control the retry process. It uses them to:
RetryContext

A simple, immutable, record-like class used by `AbstractRetryingClient` to hold all components required for retry control. Together with `newRetryContext`, it allows users to fully customize the execution and scheduling parts of the retry process. In the future, responsibilities inside `AbstractRetryingClient` could be further split into a new component, which could also be injected via `RetryContext`.

Concurrency Policy
Using locks in these components significantly increased the difficulty of verifying correctness and liveness properties.
Instead, I chose thread confinement: all logic, except attempt execution, runs on a single event loop, the "retry event loop". This approach is nice for users wanting to implement components themselves, as they do not need to worry about internal synchronization.

Performance-wise, I don't expect any noteworthy downsides, as retries are not made at a high rate that could cause contention. However, when an attempt is completing, it will first switch to the retry loop before it gets committed, which adds scheduling latency. I feel this is not critical, as I have seen multiple such reschedules in Armeria code, but let me know what you think of this.
Result
Retrying in Armeria is now componentized, making it easier to understand and extend.