
Conversation

Contributor

@schiemon schiemon commented Jun 25, 2025

Motivation

In the context of implementing hedging in #6252, I found it hard to understand and extend AbstractRetryingClient, RetryingClient, and RetryingRpcClient.
Currently, AbstractRetryingClient manages ctx.attr(STATE), while Retrying(Rpc)Client maintains a "backpack state" by passing parameters around internal methods. The roles and differences between these two states are unclear.

Additionally, AbstractRetryingClient splits delay calculation and scheduling into separate methods, even though they are tightly coupled, which makes safe use of this API difficult.
Lastly, since most of the code resides in Retrying(Rpc)Client, it is hard to focus on one aspect of the retry process, as there is insufficient encapsulation.

Altogether, this makes it hard to extend (e.g., with hedging or retry throttling) and customize (e.g., with a custom retry scheduler) retrying in Armeria.

Modifications

tl;dr

We split up retrying into the following components:

[Component diagram]

Here, AbstractRetryingClient remains the central component driving retrying. It interacts with RetriedRequest to execute, abort, and commit Retry*Attempts. It uses a dedicated RetryScheduler to schedule retries while respecting the request deadline and any backoff received from remote peers. All components are obtained as a RetryContext from the Retry(Rpc)Clients through newRetryContext, which acts as the central extension point where users are free to pick the implementation for each interface (dashed in the diagram).

Components

Let me describe the roles of the components bottom-up. More detailed explanations can be found in the doc comments.

Retry(Rpc|Http)Attempt

Encapsulates a single RPC/HTTP attempt, from its execution up to the point where its response is decided upon by the RetryRule.
The resulting RetryDecision is returned to the Retried(Rpc)Request, which has full ownership of the Retry(Rpc|Http)Attempt.
After returning the RetryDecision, a Retry(Rpc|Http)Attempt can either be aborted or committed.
Aborting an attempt discards its response, as it is not selected as the final response.
Committing an attempt marks it as selected as the final response. A full state diagram can be found
here.
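The lifecycle described above can be sketched as a small state set. This is purely illustrative; the state names are assumed from the description, not taken from the actual classes:

```java
// Illustrative attempt lifecycle (state names assumed, not Armeria's API).
enum AttemptState {
    EXECUTING,   // request sent, awaiting the RetryRule's decision
    DECIDED,     // RetryDecision returned to the owning request
    COMMITTED,   // selected as the final response
    ABORTED;     // discarded; its response is not used

    boolean isTerminal() {
        return this == COMMITTED || this == ABORTED;
    }
}
```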

Retried(Rpc)Request

Based on the original request, it provides methods to execute, commit, and abort Retry(Rpc|Http)Attempts. It has full ownership of all attempts and does not expose them. In particular, if a user wants to abort or commit an attempt, they do so by specifying the attempt number.
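A minimal sketch of the "ownership by attempt number" idea (class and method names are mine, not the PR's actual API; attempt state is reduced to a string for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the request owns its attempts and callers refer to them
// only by attempt number; the attempt objects themselves are never exposed.
final class RetriedRequestSketch {
    private final Map<Integer, String> attemptStates = new HashMap<>();
    private int nextAttemptNo;

    /** Starts a new attempt and returns its number. */
    int execute() {
        final int no = nextAttemptNo++;
        attemptStates.put(no, "EXECUTING");
        return no;
    }

    /** Marks the attempt as selected for the final response. */
    void commit(int attemptNo) {
        attemptStates.put(attemptNo, "COMMITTED");
    }

    /** Discards the attempt's response. */
    void abort(int attemptNo) {
        attemptStates.put(attemptNo, "ABORTED");
    }

    String stateOf(int attemptNo) {
        return attemptStates.get(attemptNo);
    }
}
```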

RetryCounter

Tracks the number of attempts made, both in total and per backoff.
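A minimal hypothetical implementation of this idea (names are mine, not Armeria's API): switching to a new backoff resets the per-backoff counter while the total keeps growing.

```java
// Hypothetical sketch of a counter tracking attempts in total and per backoff.
final class SimpleRetryCounter {
    private int totalAttempts;
    private int attemptsWithCurrentBackoff;
    private Object currentBackoff;

    void onAttemptStarted(Object backoff) {
        if (backoff != currentBackoff) {
            currentBackoff = backoff;
            attemptsWithCurrentBackoff = 0; // counting restarts for the new backoff
        }
        totalAttempts++;
        attemptsWithCurrentBackoff++;
    }

    int totalAttempts() {
        return totalAttempts;
    }

    int attemptsWithCurrentBackoff() {
        return attemptsWithCurrentBackoff;
    }
}
```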

RetryScheduler

Schedules retries on the retry event loop while respecting the response deadline and the backoff intervals returned by endpoints.
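The deadline handling such a scheduler needs can be reduced to one clamping rule: the next delay is the backoff interval, but never more than the time remaining until the response deadline. A sketch under that assumption (method and parameter names are mine, not Armeria's API):

```java
import java.util.concurrent.TimeUnit;

// Hedged sketch: clamp the backoff delay so no retry is scheduled past the deadline.
final class RetryDelays {
    private RetryDelays() {}

    static long nextDelayMillis(long backoffMillis, long deadlineNanos, long nowNanos) {
        // Remaining budget until the response deadline, floored at zero.
        final long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - nowNanos);
        return Math.min(backoffMillis, Math.max(remainingMillis, 0));
    }
}
```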

AbstractRetryingClient

Calls newRetryContext from Retrying(Rpc)Client to receive all components required to control the retry process.
It uses them to:

  • execute attempts,
  • decide to commit or abort attempts based on attempt execution results,
  • abort the retry process when outside conditions occur (e.g., the original request completes early), and to
  • complete the original response.
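The driver loop behind these bullets can be sketched abstractly. This is illustrative control flow only, not the real AbstractRetryingClient; the retry rule is reduced to a predicate over the attempt number:

```java
import java.util.function.IntPredicate;

// Illustrative driver: run attempts until the rule accepts one (commit)
// or the retry budget is exhausted (the whole process is aborted).
final class RetryDriverSketch {
    private RetryDriverSketch() {}

    /** Returns the committed attempt number, or -1 if all attempts were aborted. */
    static int run(int maxAttempts, IntPredicate ruleAccepts) {
        for (int attemptNo = 1; attemptNo <= maxAttempts; attemptNo++) {
            if (ruleAccepts.test(attemptNo)) {
                return attemptNo; // commit: this attempt becomes the final response
            }
            // abort this attempt; the scheduler would pick the next retry time here
        }
        return -1; // retry budget exhausted
    }
}
```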

RetryContext

A simple, immutable, record-like class used by AbstractRetryingClient to hold all components required for retry control.
Together with newRetryContext, it allows users to fully customize the execution and scheduling parts of the retry process.
In the future, responsibilities inside AbstractRetryingClient could be further split into a new component, which could also be injected via RetryContext.
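The shape of such a holder can be sketched as a record plus a factory, matching the "record-like class" and "newRetryContext extension point" described above (all names and fields here are assumed for illustration):

```java
// Hypothetical sketch: an immutable bundle of the collaborators the driver needs,
// produced by a factory method that implementations can override to swap components.
record RetryContextSketch(Object retriedRequest, Object scheduler, Object counter) {}

interface RetryContextFactory {
    RetryContextSketch newRetryContext();
}
```

A caller supplying a custom scheduler would simply return a context holding its own implementation from the factory.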

Concurrency Policy

Using locks in these components significantly increased the difficulty of verifying correctness and liveness properties.
Instead, I chose thread confinement: all logic except attempt execution runs on a single event loop, the "retry event loop". This approach is convenient for users who want to implement components themselves, as they do not need to worry about internal synchronization.
Performance-wise, I don't expect any noteworthy downsides, as retries are not made at a rate high enough to cause contention. However, when an attempt completes, it first switches to the retry event loop before it is committed, which adds scheduling latency. I don't consider this critical, as I have seen multiple such reschedules elsewhere in the Armeria code, but let me know what you think of this.
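The confinement idea can be demonstrated with a plain single-threaded executor standing in for the retry event loop (a sketch under that assumption; not the PR's actual classes): all mutable state is only ever touched on that one thread, so no locks are needed, and other threads observe results through futures.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of thread confinement: every mutation is funneled through one
// single-threaded executor, so the field needs no synchronization.
final class ConfinedState {
    private final ExecutorService retryLoop = Executors.newSingleThreadExecutor();
    private int committedAttempts; // confined: only accessed on retryLoop

    CompletableFuture<Integer> commitOneAttempt() {
        final CompletableFuture<Integer> result = new CompletableFuture<>();
        retryLoop.execute(() -> result.complete(++committedAttempts));
        return result;
    }

    void shutdown() {
        retryLoop.shutdown();
    }
}
```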

Result

Retrying in Armeria is now componentized, making it easier to understand and extend.

schiemon added 2 commits June 25, 2025 13:46
- bundle parameters of private methods into a `RetryingContext`
- extract attempt execution
- improve naming of variables
- extract attempt execution
- improve naming of variables
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from 577e9d5 to 4aaa678 Compare June 25, 2025 11:58
@schiemon schiemon changed the title Clean up RetryingClient Clean up RetryingClient and RetryingRpcClient Jun 25, 2025
@schiemon schiemon marked this pull request as ready for review June 25, 2025 12:13
@schiemon schiemon mentioned this pull request Jun 25, 2025
6 tasks
@ikhoon ikhoon added the cleanup label Jun 26, 2025

codecov bot commented Jun 26, 2025

Codecov Report

❌ Patch coverage is 74.32905% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.05%. Comparing base (8150425) to head (edfc33b).
⚠️ Report is 197 commits behind head on main.

Files with missing lines Patch % Lines
...inecorp/armeria/client/retry/HttpRetryAttempt.java 72.32% 31 Missing and 31 partials ⚠️
...linecorp/armeria/client/retry/RpcRetryAttempt.java 56.55% 29 Missing and 24 partials ⚠️
...ecorp/armeria/client/retry/RetriedHttpRequest.java 73.38% 7 Missing and 30 partials ⚠️
...necorp/armeria/client/retry/RetriedRpcRequest.java 70.96% 6 Missing and 21 partials ⚠️
...p/armeria/client/retry/AbstractRetryingClient.java 84.09% 7 Missing and 7 partials ⚠️
...corp/armeria/client/retry/DefaultRetryCounter.java 65.51% 7 Missing and 3 partials ⚠️
...rp/armeria/client/retry/DefaultRetryScheduler.java 90.72% 3 Missing and 6 partials ⚠️
...m/linecorp/armeria/internal/client/ClientUtil.java 80.55% 4 Missing and 3 partials ⚠️
.../armeria/client/retry/AbortedAttemptException.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6292      +/-   ##
============================================
- Coverage     74.46%   74.05%   -0.41%     
- Complexity    22234    23072     +838     
============================================
  Files          1963     2068     +105     
  Lines         82437    86556    +4119     
  Branches      10764    11419     +655     
============================================
+ Hits          61385    64101    +2716     
- Misses        15918    16945    +1027     
- Partials       5134     5510     +376     


Comment on lines 272 to 276
final RetryConfig<HttpResponse> config = retryingContext.config();
final ClientRequestContext ctx = retryingContext.ctx();
final HttpRequestDuplicator reqDuplicator = retryingContext.reqDuplicator();
final HttpRequest req = retryingContext.req();
final HttpResponse res = retryingContext.res();
Contributor

Re-allocating the fields to the local variables looks redundant.
What do you think of moving doExecute0() to RetryingContext and executing the method directly?

new RetryingContext(ctx, mappedRetryConfig(ctx),
                    req, reqDuplicator, res, resFuture).execute();

Contributor Author
@schiemon schiemon Jun 26, 2025

Re-allocating the fields to the local variables looks redundant.

Yes, they are redundant; their sole purpose is to shorten the field accesses on RetryingContext.

What do you think of moving doExecute0() to RetryingContext and executing the method directly?

...together with all the helper methods (e.g. handleAggregatedResponse), right? Let me try it out and see how it looks.

Contributor

...together with all the helper methods (e.g. handleAggregatedResponse), right? Let me try it out and see how it looks.

👍 I had a similar thought, though it wasn’t very specific. I imagined RetryingClient spawns a new RetryContext and delegates it to execute requests and gather results.

Contributor Author
@schiemon schiemon Jun 26, 2025

I think I understand where you're heading. I also believe that encapsulating a retry attempt and strongly hiding all internals is a good idea. I've pushed a design. Instead of moving everything into RetryingContext, I kept the constant/global data in RetryingContext and moved everything attempt-related into a new class, Attempt.
Please let me know what you think @ikhoon. FYI: I didn't strive for green tests, just enough to have a discussion.

Contributor

I like the current approach.

  • Would you move RetryingContext and Attempt to top-level classes?
  • Some methods, such as abort(), which only access the fields of RetryingContext, may be moved into RetryingContext.

Contributor Author

cc @jrhee17. Thanks for the review, I am also curious what you think about the two points above.

Contributor Author
@schiemon schiemon Jul 10, 2025

@jrhee17 @ikhoon bump :)

Contributor

What about deriving an interface RetryingContext and provide HttpRetryingContext and RpcRetryingContext for the respective clients?

I understand the current functionality of RetryingContext seems to be to 1) schedule retry attempts 2) handle the decision of a retry attempt.

interface RetryingContext {
  CompletableFuture<Boolean> init();

  RetryAttempt newRetryAttempt();

  void commit(RetryAttempt attempt);
  void abort(Throwable cause);
}

I think the above basic functionalities can also be used from RetryingRpcClient.

I'm unsure if inheritance or composition is best, but overall I agree that unifying the logic between the two retrying clients seems useful for implementing hedging

Contributor

Another thought I had is to unify state(ctx) with RetryingContext so RetryingContext knows about the deadline and last backoff. For hedging later on, it would also know about all the pending attempts to be able to properly commit and abort when the winning attempt is determined. With that, we no longer have two places of global state.

I agree there is no need for maintaining two states at different locations and I agree State can be a local field of RetryingContext instead of ctx#attr.

Contributor

Sorry for the late review. I agree with refactoring the RetryingClient logic into separate classes, such as HttpRetryingContext, but I will check whether adding a lock is truly necessary. I may add some commits myself or leave comments on that.

schiemon added 3 commits June 26, 2025 18:17
- RetryingClient is broken, just for discussion purposes
…ngContext

- put RetryingContext and RetryAttempt into separate classes
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from f51b78c to b31c646 Compare June 27, 2025 08:37
Contributor
@jrhee17 jrhee17 left a comment

Thanks for the proposal, left some thoughts on the direction of the PR

}

CompletableFuture<Void> execute() {
assert state == State.INITIALIZED;
Contributor

I prefer we remove the state-related assertions because:

  1. The assertions, running asynchronously, will mostly be swallowed, so users won't be able to know why an assertion failed most of the time.
  2. The additional state bookkeeping isn't thread-safe. However, I think certain methods (e.g. RetryContext#abort) can be called from multiple threads, which makes it difficult to keep track. I imagine this will become more of an issue when hedging is introduced.

Contributor Author
@schiemon schiemon Jul 3, 2025

Personally, I find asserts invaluable as they

  1. document the (otherwise implicit!) assumptions we have about a piece of code
  2. let the program crash early at the point of violation without letting the program run in a corrupt state

I am wondering about this matter as in the codebase I can find several occurrences of asserts in async settings (e.g. ) and actually also in combination with state machine pattern:

[Screenshot of asserts in existing Armeria code]

The assertions, running asynchronously, will mostly be swallowed, so users won't be able to know why an assertion failed most of the time.

I'm unsure what you mean by swallowing errors; could you expand on that?

The additional state bookkeeping isn't thread-safe. However, I think certain methods (e.g. RetryContext#abort) can be called from multiple threads, which makes it difficult to keep track. I imagine this will become more of an issue when hedging is introduced.

Indeed, I still need to check synchronization safety, but this is not due to the state bookkeeping, right?
On that point: I am very curious what your/Armeria's stance is on synchronization in the framework, given Armeria's thread model. Because, as an alternative to synchronization primitives, we could use thread confinement: enforcing that only a specific event loop executes a specific piece of code. I imagine for RetryingContext this would mean that solely ctx.eventLoop() is allowed to execute code, whereas for an attempt it is attemptCtx.eventLoop(). Then, every time we need to do a switch, we would reschedule on the appropriate event loop. For example, when the attempt resolves its completeness promise, we would reschedule from the attempt event loop to the main event loop:

        attempt.whenComplete().handle((unused, unexpectedAttemptCause) -> {
            if (unexpectedAttemptCause != null) {
                assert attempt.state() == RetryAttempt.State.ABORTED;
                rctx.abort(unexpectedAttemptCause);
                return null;
            }
            ...
        });

I would confine the handler to the original request event loop like so:

        attempt.whenComplete().handleAsync((unused, unexpectedAttemptCause) -> {
            if (unexpectedAttemptCause != null) {
                assert attempt.state() == RetryAttempt.State.ABORTED;
                rctx.abort(unexpectedAttemptCause);
                return null;
            }
            ...
        }, rctx.ctx().eventLoop());

Contributor

Personally, I find asserts invaluable as they

document the (otherwise implicit!) assumptions we have about a piece of code
let the program crash early at the point of violation without letting the program run in a corrupt state
I am wondering about this matter as in the codebase I can find several occurrences of asserts in async settings (e.g. ) and actually also in combination with state machine pattern:

Sorry, let me rephrase. I'm fine with introducing states in general. I don't think RetryAttempt needs to maintain a separate state as it is relatively simple.

I'm unsure what you mean by swallowing errors; could you expand on that?

I was thinking of the following pattern which seems to be used often in this class.

CompletableFuture<V> cf = new CompletableFuture<>();
someAsyncMethod().handle((val, cause) -> {
    // the returned cf won't reflect the failed assertion
    assert state == State.Expected; // assertion without a try..catch
    cf.complete(val);
    return null;
});
return cf;

Unless the thread defines an UncaughtExceptionHandler, I guessed that users won't be able to receive a signal that an assertion failed.
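One way to keep such failures visible is to run the check inside a try..catch and route anything it throws, including an AssertionError, into the returned future. A sketch of that pattern (the helper and its names are mine, not from the PR):

```java
import java.util.concurrent.CompletableFuture;

// Sketch: propagate callback failures into the returned future instead of losing them.
final class Futures {
    private Futures() {}

    static <V> CompletableFuture<V> guarded(CompletableFuture<V> source, Runnable stateCheck) {
        final CompletableFuture<V> cf = new CompletableFuture<>();
        source.handle((val, cause) -> {
            try {
                if (cause != null) {
                    cf.completeExceptionally(cause);
                } else {
                    stateCheck.run(); // may throw, e.g. an AssertionError
                    cf.complete(val);
                }
            } catch (Throwable t) {
                cf.completeExceptionally(t); // the failed check now surfaces via cf
            }
            return null;
        });
        return cf;
    }
}
```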

Contributor Author

I see what you mean with the assertions. I will keep an eye on that and ensure that assertion errors properly bubble up into the futures.

Contributor Author

On that point: I am very curious what your/Armerias stance is on synchronization in the framework, given Armeria's thread-model. [...]

What are your/Armeria's thoughts on that matter? For example, do I see it correctly, that the call to HttpResponse.abort MUST be done by the eventloop the response is associated with as the response handlers expect that?

Contributor

Indeed, I still need to check synchronization safety, but this is not due to the state bookkeeping, right?

I was simply pointing out that:

  1. The armeria constructs being used (i.e. RequestLog#endRequest, HttpResponse#abort etc..) are already guarded by primitive synchronizations.
    e.g.

    private void abort0(@Nullable Throwable cause) {

  2. On the other hand, RetryingContext#state doesn't seem to be guarded by any synchronizations.

Due to the introduction of a new state instead of using the previously existing constructs (RequestLog, HttpResponse), we may need to introduce a new synchronization method.

we could use thread confinement: enforcing that only a specific event loop executes a specific piece of code

My intuition is that given that this logic is on the request path, it would be nice if we can minimize event loop rescheduling to minimize latency like done in the other constructs mentioned above (since event loops are shared with other requests).

Having said this, I'm not 100% sure which synchronization method is best before actually going through the implementation.

ABORTED
}

private State state;
Contributor

While I like the idea of introducing RetryAttempt, I imagine the functionality of this class to:

  • Represent a single request execution. Hence, the ctx and res are fixed and final.
  • RetryAttempt is responsible for storing the result of an execution. Bookkeeping for the child log (ctx.log) is done in this class as well.

On the other hand, I don't think RetryAttempt necessarily should be responsible for enqueueing a request. What do you think of allowing RetryingContext to have more responsibility on scheduling retry requests?

e.g. RetryingContext can be responsible for enqueueing a new request

class RetryingClient
    private void doExecuteAttempt(RetryingContext rctx) {
...
       rctx.newRetryAttempt().handle((retryAttempt, cause) -> {...})

Where the method signature could look something like the following:

class RetryingContext {
    CompletableFuture<RetryAttempt> newRetryAttempt() {
....
        HttpResponse res = executeAttemptRequest(ctx, number, delegate);
....
    }

And RetryAttempt can contain final fields

class RetryAttempt {
...
    private final ClientRequestContext ctx;
    private final HttpResponse res;
    @Nullable
    private final HttpResponse truncatedRes;
    @Nullable
    private final Throwable cause;

This way, I think the responsibility of each class could become more clear - and we can worry less about race conditions.
Let me know what you think.

Contributor Author
@schiemon schiemon Jul 3, 2025

This might cut the INITIALIZED state from a RetryAttempt, as it would then start as EXECUTED, so I like your idea. Let me implement that.


Comment on lines 208 to 232
// The request or response has been aborted by the client before it receives a response,
// so stop retrying.
if (req.whenComplete().isCompletedExceptionally()) {
    state = State.COMPLETING;
    req.whenComplete().handle((unused, cause) -> {
        abort(cause);
        return null;
    });
    return true;
}

if (res.isComplete()) {
    state = State.COMPLETING;
    res.whenComplete().handle((result, cause) -> {
        final Throwable abortCause;
        if (cause != null) {
            abortCause = cause;
        } else {
            abortCause = AbortedStreamException.get();
        }
        abort(abortCause);
        return null;
    });
    return true;
}
Contributor

Given that these blocks simply add a callback to req.whenComplete() and res.whenComplete(), I wonder if this can be moved to init() instead. The fact that this is called inside isCompleted(), which seems to simply return a state, was a little surprising to me.


…edRequest to AbstractRetryingClient

- also move out setting the deadline, saving deadlineTimeNanos() from RetriedRequest
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from f35c70a to 2f4b698 Compare September 9, 2025 09:23
@schiemon schiemon force-pushed the tidy-up-retrying-client branch 3 times, most recently from cb5f41b to 588761b Compare September 11, 2025 09:41
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from 588761b to df10849 Compare September 11, 2025 10:07
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from 26d9797 to c5b883b Compare September 12, 2025 12:47
Contributor Author

schiemon commented Sep 13, 2025

Currently, the CI fails because of coverage. Once you, @ikhoon and @jrhee17, agree on the current architecture and concurrency policy, I will add tests to the RetryingClients, RetriedRequests, and corresponding attempts, and also update the PR description.

@schiemon schiemon requested a review from jrhee17 September 13, 2025 14:45
retryConfig != null ? retryConfig
: requireNonNull(retryMapping.get(ctx, req),
"retryMapping.get() returned null");
final RetryContext rctx = newRetryContext(unwrap(), ctx, req, config);
Contributor

I feel like this PR is attempting to solve three problems: 1) improving readability 2) unifying http/rpc logic 3) refactoring for hedging.
While I like the proposed end-design, given that the previous RetryingClient was already pretty complex and most users are using this feature, it is difficult (at least for me) to review and ensure that there aren't any regressions due to the large amount of changes.

What do you think of focusing on 3) refactoring for hedging specifically in a single PR? (it could be this PR or a separate PR)

Specifically, I like the idea of having a dedicated RetryContext. The RetryContext (or RetryScheduler/RetryExecutor) would be responsible for scheduling retries and completing the overall retry attempt.

class ExecutionResult {
    private final RetryDecision retryDecision;
    @Nullable
    private final Throwable cause;
    @Nullable
    private final HttpResponse originalRes;
    @Nullable
    private final ClientRequestContext derivedCtx;
...
}

class RetryContext {
    // Pretty much the same as the current `RetryingClient#doExecute0`
    CompletableFuture<ExecutionResult> doExecute()
    // Functions the same as `RetryingClient#handleException`
    void handleException(Throwable cause)
    // Calls https://github.com/line/armeria/blob/1051e9650432a90f0d4ee8e296f6d37166ec4710/core/src/main/java/com/linecorp/armeria/client/retry/RetryingClient.java#L541-L544
    void complete(HttpResponse res)
}

private void doExecute0(RetryContext retryContext) {
    final CompletableFuture<ExecutionResult> executionRes = retryContext.doExecute();
    executionRes.thenAccept(executionResult -> {
        handleRetryDecision(executionResult, retryContext);
    });
}

This would also minimize diffs so that other maintainers can easily review the changeset.

Later on when implementing hedging, I think a single ExecutionResult can be selected, and the rest can be cancelled using ExecutionResult#derivedCtx#cancel (possibly the scheduled future can also be added to the ExecutionResult and could be cancelled as well).

Once the above is implemented and merged, I think RetryContext can then be further abstracted to RetriedRequest, RetryScheduler like done in this PR. Let me know what you think.

Contributor Author
@schiemon schiemon Sep 17, 2025

Hey, thank you for your proposal @jrhee17.
I fear that during the discussions and changes, we shifted toward a slightly broader goal than initially intended. In the beginning, if you remember, I focused on readability, specifically by removing parameter passing in the client. That led me to a solution very similar to what you are suggesting now. Back then, I tried to minimize refactoring overhead. However, as @ikhoon noted, this left the roles (scheduling, execution, controlling, etc.) mixed inside RetryContext, which I now see as the main problem of the retrying classes.
Because of that, splitting the PR into two or more smaller PRs — first introducing RetryContext bundling and then extracting components — would not help much, as most of the complexity comes from the refactoring itself.

I agree this PR might be challenging to review due to its length. However, since the retrying code is now extracted into clear components (or so I hope 😄), it should be possible to review step by step.
To assist reviewing, I updated the PR description to clearly state the goals and included a small component diagram and descriptions in a bottom-up fashion.

For the review, I suggest starting with the control flow — AbstractRetryingClient and how it interacts with the interfaces.
After that, I would be glad to get feedback on how to simplify the API further.
If you agree with the high-level API, you can then look at the implementation in any order (DefaultRetryScheduler, RetryCounter, Retried*Request, Retry*Attempt).

For RetriedHttpRequest and RetryingHttpAttempt, it is best to have the old RetryingClient open for comparison, and for RetriedRpcRequest and RetryingRpcAttempt, compare with RetryingRpcClient.
Except for minor changes, they should contain the same code, just reorganized differently.

Can I do anything else to help with the review, @jrhee17?

Contributor

I see.

I think each design has its pros and cons. I agree the as-is code is difficult to reason about due to the large number of parameters and complex functionality. However, I also think we didn't have to worry about concurrency (because everything was just passed via parameters), so at least the flow was clear.

With the proposed design, the fact that each component maintains a separate state - and the effort needed to synchronize that state across different components - is what makes it difficult for me to review.
I'm curious whether there is any way state management can be isolated to a single component.

Back then, I tried to minimize refactoring overhead.

If a simple refactoring effort is done first, I thought at the very least hedging (which we are all excited about) can gain some traction - paving ways for further refactoring which I am all for as well.

Having said this, this is just my opinion so I'm curious of your and other maintainers' opinions as well.

Contributor Author

@schiemon schiemon Sep 18, 2025

OK, I thought about how to best split this refactoring and came up with a battle plan, which you can find below, keeping this PR as a "north star" for where we want to go. Technically, we can already implement hedging after PR 2 [1], so we may not need to go the full way. Let me know if it makes sense to you @jrhee17

[1]: Because we need to have a central place dealing with racing attempts trying to schedule the next retry task.

@schiemon schiemon changed the title Clean up RetryingClient and RetryingRpcClient Componentize RetryingClient and RetryingRpcClient Sep 17, 2025
@schiemon schiemon requested a review from jrhee17 September 17, 2025 13:17
@schiemon
Contributor Author

schiemon commented Sep 18, 2025

PR 1 – Extract counting state and remove methods from AbstractRetryingClient. Shorten method signatures

  • Make AbstractRetryingClient private.
  • Extract lastBackoff, currentAttemptNoWithLastBackoff, and totalAttemptNo from State into RetryCounter.
  • Extract derivedCtx and response into RetryAttempt<O>.
  • Put the RetryCounter, the remaining parts of State (deadlineNanos), and ctx, retryConfig, rootReqDuplicator, originalReq, originalRes, originalResFuture into RetryContext<I, O>.
  • Move setResponseTimeout(ClientRequestContext ctx) into RetryContext.
  • Move onRetryingComplete into Retrying(Rpc)Client.
  • Change the signature of all methods of Retrying(Rpc)Client and AbstractRetryingClient to accept only RetryContext and RetryAttempt. For example:
private void handleRetryDecision(@Nullable RetryDecision decision, ClientRequestContext ctx,
                                 ClientRequestContext derivedCtx, HttpRequestDuplicator rootReqDuplicator,
                                 HttpRequest originalReq, HttpResponse returnedRes,
                                 CompletableFuture<HttpResponse> future, HttpResponse originalRes) { ... }

becomes

private void handleRetryDecision(RetryContext<HttpRequest, HttpResponse> rctx,
                                 RetryAttempt<HttpResponse> attempt) { ... }

and

getNextDelay(ClientRequestContext ctx, Backoff backoff)

becomes

getNextDelay(RetryContext<HttpRequest, HttpResponse> rctx, Backoff backoff)
  • Remove getTotalAttempts, newDerivedContext, retryRuleWithContent (unused), retryRule (unused), mappedRetryConfig (it is passed into doExecute) and getNextDelay(ClientRequestContext ctx, Backoff backoff).

Result:

  • Counting is now encapsulated in RetryCounter.
  • RetryContext<I, O> and RetryAttempt<O> are immutable classes holding the complete state.
  • All methods are simplified as parameters can be collapsed into RetryContext or RetryAttempt.
  • AbstractRetryingClient is left with execute, doExecute, scheduleNextRetry, and
    getNextDelay(ClientRequestContext ctx, Backoff backoff, long millisAfterFromServer).
  • The retry logic in Retrying(Rpc)Client stays untouched except for changed references to state.
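
The PR 1 shape could be sketched roughly as follows. The class names (RetryCounter, RetryContext, RetryAttempt) follow the plan above, but the fields and accessors here are simplified placeholders rather than Armeria's real ClientRequestContext, HttpRequestDuplicator, and friends:

```java
// Rough sketch of the planned PR 1 value classes; the field set is a
// simplified placeholder, not Armeria's actual types.
final class RetryCounter {
    private int totalAttemptNo;

    // Counts every attempt, successful or not.
    int nextTotalAttemptNo() {
        return ++totalAttemptNo;
    }

    int totalAttemptNo() {
        return totalAttemptNo;
    }
}

// Request-level state; collapses the long parameter lists of the
// current Retrying(Rpc)Client methods into a single object.
final class RetryContext<I, O> {
    private final RetryCounter counter = new RetryCounter();
    private final long deadlineNanos;
    private final I originalReq;

    RetryContext(long deadlineNanos, I originalReq) {
        this.deadlineNanos = deadlineNanos;
        this.originalReq = originalReq;
    }

    RetryCounter counter() { return counter; }
    long deadlineNanos() { return deadlineNanos; }
    I originalReq() { return originalReq; }
}

// Immutable attempt-level state: what one attempt produced.
final class RetryAttempt<O> {
    private final O response;

    RetryAttempt(O response) {
        this.response = response;
    }

    O response() { return response; }
}
```

With value classes of this shape, a method like handleRetryDecision(rctx, attempt) only needs two arguments instead of eight.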

PR 2 – Extract scheduling

  • Move scheduleNextRetry and getNextDelay into RetryScheduler (see this PR).
  • Move RetryScheduler into RetryContext.
  • Change doExecute to doExecute(RetryContext rctx).
  • Move getRetryAfterMillis to ClientUtils.
  • After calls to getRetryAfterMillis add calls to rctx.scheduler().addMinimumBackoffMillis.
  • Change calls to scheduleNextRetry to RetryScheduler.schedule.
  • Move ARMERIA_RETRY_COUNT to ClientUtils.
  • Move mapping and retryConfig from AbstractRetryingClient to Retrying(Rpc)Client.
  • Remove AbstractRetryingClient.

Result:

  • Scheduling is now encapsulated in RetryScheduler.
  • AbstractRetryingClient is removed.
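
For orientation, a minimal sketch of what the extracted RetryScheduler could look like. The method names (schedule, addMinimumBackoffMillis) come from the plan above; the ScheduledExecutorService-backed implementation and the deadline check are illustrative assumptions, not the actual Armeria code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface RetryScheduler {
    // Raises the floor for the next delay, e.g. after getRetryAfterMillis
    // parsed a Retry-After header from the server.
    void addMinimumBackoffMillis(long millis);

    // Schedules the next attempt; returns false if it would miss the deadline.
    boolean schedule(Runnable retryTask, long backoffMillis);
}

final class SimpleRetryScheduler implements RetryScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                final Thread t = new Thread(r);
                t.setDaemon(true); // don't keep the JVM alive
                return t;
            });
    private final long deadlineNanos;
    private long minimumBackoffMillis;

    SimpleRetryScheduler(long responseTimeoutMillis) {
        deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(responseTimeoutMillis);
    }

    @Override
    public synchronized void addMinimumBackoffMillis(long millis) {
        minimumBackoffMillis = Math.max(minimumBackoffMillis, millis);
    }

    @Override
    public synchronized boolean schedule(Runnable retryTask, long backoffMillis) {
        final long delayMillis = Math.max(backoffMillis, minimumBackoffMillis);
        if (System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis) > deadlineNanos) {
            return false; // the next attempt would exceed the response deadline
        }
        executor.schedule(retryTask, delayMillis, TimeUnit.MILLISECONDS);
        return true;
    }
}
```

Centralizing both the backoff floor and the deadline check in one component is what makes the calling code in Retrying(Rpc)Client safe to use without a required method-call order.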

PR 3 – Extract execution

PR 4 – Extract control

"We can predict everything, except the future"

@jrhee17
Contributor

jrhee17 commented Sep 22, 2025

Thanks for understanding.

  • Make AbstractRetryingClient private.

I doubt there are any users actually rolling their own RetryingClient. Having said this, this class was added a long time ago, so the other maintainers may know more about whether this should remain public/without breaking changes or not.

  • Extract lastBackoff, currentAttemptNoWithLastBackoff, and totalAttemptNo from State into RetryCounter.

Sounds good to me.

  • Extract derivedCtx and response into RetryAttempt<O>.

Just to make sure we're on the same page, I understood RetryAttempt to be a value object which holds the result of each retry attempt. (Instead of the current semantics where RetryAttempt is responsible for the entire lifecycle of initializing/scheduling the retry attempt.) Let me know if I'm misunderstanding.

  • Put the RetryCounter, the remaining parts of State (deadlineNanos), and ctx, retryConfig, rootReqDuplicator, originalReq, originalRes, originalResFuture into RetryContext<I, O>.

👍

  • Move setResponseTimeout(ClientRequestContext ctx) into RetryContext.

👍

  • Move onRetryingComplete into Retrying(Rpc)Client.

I didn't really understand the motivation of this, but I assume this is due to different handling of onRetryingComplete for http and rpc.

Also, just to make sure I understand the idea correctly - I'm curious where the actual client call (executeWithFallback) will be made (in RetryingClient or in RetryContext)

Overall, I have no objection to step 1. Also, for quick iteration I don't think adding tests is necessary as long as the mechanical refactoring is done without a different threading model.

@schiemon
Contributor Author

schiemon commented Sep 22, 2025

I doubt there are any users actually rolling their own RetryingClient. Having said this, this class was added a long time ago, so the other maintainers may know more about whether this should remain public/without breaking changes or not.

It would be really nice to decide early that we can make it private so that we can just forget about the effects of refactoring on users that might inherit from this class. Wdyt @ikhoon; is making AbstractRetryingClient private a problem?

Just to make sure we're on the same page, I understood RetryAttempt to be a value object which holds the result of each retry attempt. (Instead of the current semantics where RetryAttempt is responsible for the entire lifecycle of initializing/scheduling the retry attempt.) Let me know if I'm misunderstanding.

At this step, RetryAttempt will just be a value class containing attempt-related fields, correct. Extracting the execution logic is something for PR 3.

I didn't really understand the motivation of this, but I assume this is due to different handling of onRetryingComplete for http and rpc.

It is part of the overall process of removing the inheritance bond between Retrying(Rpc)Client and AbstractRetryingClient and moving towards composition via smaller, testable components. If we want users to build their own retrying client, then exposing clear-cut components is easier to understand than exposing methods that must be called in a specific order. If you look at PR 2, where I plan to extract scheduling, I would also like to remove AbstractRetryingClient as a whole, as there will be nothing left in it (which is reasonable, since AbstractRetryingClient no longer owns any state after that PR).
But you are also correct that, at the latest when refactoring execution, we need to move that method into the executing component, which owns the retry finalization process.

Also, just to make sure I understand the idea correctly - I'm curious where the actual client call (executeWithFallback) will be made (in RetryingClient or in RetryContext)

Similar to RetryAttempt, RetryContext will be a value class holding request-level fields together to prevent parameter explosion inside the Retrying(Rpc)Client. The execution logic (this includes executeWithFallback) should stay inside the Retrying(Rpc)Clients. As mentioned above, refactoring attempt execution should be tackled in a later PR. However, even after PR 4 I expect that RetryContext will still be a value class just holding the components together (RetryScheduler, RetryCounter, etc.) while some other component is responsible for execution, e.g. RetriedRequest as we have now, or RetryExecutor, or something else.

.cc @jrhee17

@jrhee17
Contributor

jrhee17 commented Sep 24, 2025

It is part of the overall process of removing the inheritance bond between Retrying(Rpc)Client and AbstractRetryingClient and move towards composition via smaller, testable components

I see, I'm not particularly pro or against keeping/removing AbstractRetryingClient, so sounds fine to me.
I understood this is an issue for part 2.

The execution logic (this includes executeWithFallback) should stay inside the Retrying(Rpc)Clients. As mentioned above, refactoring attempt execution should be tackled in a later PR. However even after PR 4 I expect that RetryContext will still be a value class just holding together all the components (RetryScheduler, RetryCounter, etc.)

I guess I imagined that, given RetryContext would hold everything related to the original request/response (ctx, retryConfig, rootReqDuplicator, originalReq, originalRes, originalResFuture), to preserve the current shape, execution would be done in the RetryContext as well.

e.g.

final var retryCtx = new RetryContext(ctx, reqDuplicator, req, res, responseFuture, unwrap());
...
// pretty much does the same thing as the current `RetryingClient#doExecute0`
retryCtx.execute();
// and gives an easy segue to hedging
scheduleAtRate(() -> retryCtx.execute());

Having said this, I understood your idea as follows (which I'm also fine with):

private void doExecute0(RetryContext retryCtx) {
  // access parameters like retryCtx.originalCtx, retryCtx.reqDuplicator, etc. as needed
  // while keeping the current structure
}

// later on for hedging
scheduleAtRate(() -> doExecute0(retryCtx))
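
The second reading can be made concrete with a small self-contained sketch. RetryContextSketch, doExecute0, and the scheduleAtFixedRate-based hedging loop below are simplified stand-ins for the eventual Armeria code, just to show that this shape gives hedging a single entry point to re-trigger:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the planned RetryContext; only what this sketch needs.
final class RetryContextSketch {
    final AtomicInteger attempts = new AtomicInteger();
    final String originalReq;

    RetryContextSketch(String originalReq) {
        this.originalReq = originalReq;
    }
}

final class HedgingSketch {
    // Plays the role of RetryingClient#doExecute0: reads all state from rctx.
    static void doExecute0(RetryContextSketch rctx) {
        rctx.attempts.incrementAndGet();
        // ... the actual executeWithFallback(rctx.originalReq) call would go here ...
    }

    // Hedging later becomes a fixed-rate loop over the same entry point.
    static void hedge(RetryContextSketch rctx, ScheduledExecutorService executor, long rateMillis) {
        executor.scheduleAtFixedRate(() -> doExecute0(rctx), 0, rateMillis, TimeUnit.MILLISECONDS);
    }
}
```

Because each attempt only needs the shared context object, issuing a hedged attempt is the same call as issuing a retry, which is what makes this shape attractive as a stepping stone.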

@schiemon
Contributor Author

First PR ready for review: #6411

@schiemon schiemon closed this Oct 15, 2025