
Conversation

Contributor

@schiemon schiemon commented Jun 25, 2025

Motivation

In the context of implementing hedging in #6252, I found it hard to understand and extend AbstractRetryingClient, RetryingClient, and RetryingRpcClient.
Currently, AbstractRetryingClient manages ctx.attr(STATE), while Retrying(Rpc)Client maintains a "backpack state" by passing parameters around internal methods. The roles and differences between these two states are unclear.

Additionally, AbstractRetryingClient splits delay calculation and scheduling into separate methods, even though they are tightly coupled, which makes safe use of this API difficult.
Lastly, since most of the code resides in Retrying(Rpc)Client, it is hard to focus on one aspect of the retry process, as there is insufficient encapsulation.

Altogether, this makes it hard to extend (e.g., with hedging or retry throttling) and customize (e.g., with a custom retry scheduler) retrying in Armeria.

Modifications

tl;dr

We split up retrying into the following components:

[Component diagram]

Here, AbstractRetryingClient remains the central component driving retrying. It interacts with RetriedRequest to execute, abort, and commit Retry*Attempts. It uses a dedicated RetryScheduler to schedule retries while respecting the request deadline and any backoff received from remote peers. All components are obtained as a RetryContext from the Retry(Rpc)Clients through newRetryContext, which acts as the central extension point where users are free to pick the implementation for each interface (dashed in the diagram).

Components

Let me describe the roles of the components bottom-up. More detailed explanations can be found in the doc comments.

Retry(Rpc|Http)Attempt

Encapsulates a single RPC/HTTP attempt, from its execution up to the point where its response is decided upon by the RetryRule.
The resulting RetryDecision is returned to the Retried(Rpc)Request, which has full ownership of the Retry(Rpc|Http)Attempt.
After returning the RetryDecision, a Retry(Rpc|Http)Attempt can either be aborted or committed.
Aborting an attempt discards its response, as it is not selected as the final response.
Committing an attempt marks it as selected as the final response. A full state diagram can be found
here.
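The lifecycle described above can be sketched as a small state set. This is purely illustrative; the state names are assumed from the description, not taken from the actual classes:

```java
// Illustrative attempt lifecycle (state names assumed, not Armeria's API).
enum AttemptState {
    EXECUTING,   // request sent, awaiting the RetryRule's decision
    DECIDED,     // RetryDecision returned to the owning request
    COMMITTED,   // selected as the final response
    ABORTED;     // discarded; its response is not used

    boolean isTerminal() {
        return this == COMMITTED || this == ABORTED;
    }
}
```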

Retried(Rpc)Request

Based on the original request, it provides methods to execute, commit, and abort Retry(Rpc|Http)Attempts. It has full ownership of all attempts and does not expose them. In particular, if a user wants to abort or commit an attempt, they do so by specifying the attempt number.
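A minimal sketch of the "ownership by attempt number" idea (class and method names are mine, not the PR's actual API; attempt state is reduced to a string for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the request owns its attempts and callers refer to them
// only by attempt number; the attempt objects themselves are never exposed.
final class RetriedRequestSketch {
    private final Map<Integer, String> attemptStates = new HashMap<>();
    private int nextAttemptNo;

    /** Starts a new attempt and returns its number. */
    int execute() {
        final int no = nextAttemptNo++;
        attemptStates.put(no, "EXECUTING");
        return no;
    }

    /** Marks the attempt as selected for the final response. */
    void commit(int attemptNo) {
        attemptStates.put(attemptNo, "COMMITTED");
    }

    /** Discards the attempt's response. */
    void abort(int attemptNo) {
        attemptStates.put(attemptNo, "ABORTED");
    }

    String stateOf(int attemptNo) {
        return attemptStates.get(attemptNo);
    }
}
```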

RetryCounter

Tracks the number of attempts made, both in total and per backoff.
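A minimal hypothetical implementation of this idea (names are mine, not Armeria's API): switching to a new backoff resets the per-backoff counter while the total keeps growing.

```java
// Hypothetical sketch of a counter tracking attempts in total and per backoff.
final class SimpleRetryCounter {
    private int totalAttempts;
    private int attemptsWithCurrentBackoff;
    private Object currentBackoff;

    void onAttemptStarted(Object backoff) {
        if (backoff != currentBackoff) {
            currentBackoff = backoff;
            attemptsWithCurrentBackoff = 0; // counting restarts for the new backoff
        }
        totalAttempts++;
        attemptsWithCurrentBackoff++;
    }

    int totalAttempts() {
        return totalAttempts;
    }

    int attemptsWithCurrentBackoff() {
        return attemptsWithCurrentBackoff;
    }
}
```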

RetryScheduler

Schedules retries on the retry event loop while respecting the response deadline and the backoff intervals returned by endpoints.
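The deadline handling such a scheduler needs can be reduced to one clamping rule: the next delay is the backoff interval, but never more than the time remaining until the response deadline. A sketch under that assumption (method and parameter names are mine, not Armeria's API):

```java
import java.util.concurrent.TimeUnit;

// Hedged sketch: clamp the backoff delay so no retry is scheduled past the deadline.
final class RetryDelays {
    private RetryDelays() {}

    static long nextDelayMillis(long backoffMillis, long deadlineNanos, long nowNanos) {
        // Remaining budget until the response deadline, floored at zero.
        final long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadlineNanos - nowNanos);
        return Math.min(backoffMillis, Math.max(remainingMillis, 0));
    }
}
```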

AbstractRetryingClient

Calls newRetryContext from Retrying(Rpc)Client to receive all components required to control the retry process.
It uses them to:

  • execute attempts,
  • decide to commit or abort attempts based on attempt execution results,
  • abort the retry process when outside conditions occur (e.g., the original request completes early), and to
  • complete the original response.
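The driver loop behind these bullets can be sketched abstractly. This is illustrative control flow only, not the real AbstractRetryingClient; the retry rule is reduced to a predicate over the attempt number:

```java
import java.util.function.IntPredicate;

// Illustrative driver: run attempts until the rule accepts one (commit)
// or the retry budget is exhausted (the whole process is aborted).
final class RetryDriverSketch {
    private RetryDriverSketch() {}

    /** Returns the committed attempt number, or -1 if all attempts were aborted. */
    static int run(int maxAttempts, IntPredicate ruleAccepts) {
        for (int attemptNo = 1; attemptNo <= maxAttempts; attemptNo++) {
            if (ruleAccepts.test(attemptNo)) {
                return attemptNo; // commit: this attempt becomes the final response
            }
            // abort this attempt; the scheduler would pick the next retry time here
        }
        return -1; // retry budget exhausted
    }
}
```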

RetryContext

A simple, immutable, record-like class used by AbstractRetryingClient to hold all components required for retry control.
Together with newRetryContext, it allows users to fully customize the execution and scheduling parts of the retry process.
In the future, responsibilities inside AbstractRetryingClient could be further split into a new component, which could also be injected via RetryContext.
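The shape of such a holder can be sketched as a record plus a factory, matching the "record-like class" and "newRetryContext extension point" described above (all names and fields here are assumed for illustration):

```java
// Hypothetical sketch: an immutable bundle of the collaborators the driver needs,
// produced by a factory method that implementations can override to swap components.
record RetryContextSketch(Object retriedRequest, Object scheduler, Object counter) {}

interface RetryContextFactory {
    RetryContextSketch newRetryContext();
}
```

A caller supplying a custom scheduler would simply return a context holding its own implementation from the factory.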

Concurrency Policy

Using locks in these components significantly increased the difficulty of verifying correctness and liveness properties.
Instead, I chose thread confinement: all logic except attempt execution runs on a single event loop, the "retry event loop". This approach is convenient for users who want to implement components themselves, as they do not need to worry about internal synchronization.
Performance-wise, I don't expect any noteworthy downsides, as retries are not made at a rate high enough to cause contention. However, when an attempt completes, it first switches to the retry event loop before it is committed, which adds scheduling latency. I don't consider this critical, as I have seen multiple such reschedules elsewhere in the Armeria code, but let me know what you think of this.
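The confinement idea can be demonstrated with a plain single-threaded executor standing in for the retry event loop (a sketch under that assumption; not the PR's actual classes): all mutable state is only ever touched on that one thread, so no locks are needed, and other threads observe results through futures.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of thread confinement: every mutation is funneled through one
// single-threaded executor, so the field needs no synchronization.
final class ConfinedState {
    private final ExecutorService retryLoop = Executors.newSingleThreadExecutor();
    private int committedAttempts; // confined: only accessed on retryLoop

    CompletableFuture<Integer> commitOneAttempt() {
        final CompletableFuture<Integer> result = new CompletableFuture<>();
        retryLoop.execute(() -> result.complete(++committedAttempts));
        return result;
    }

    void shutdown() {
        retryLoop.shutdown();
    }
}
```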

Result

Retrying in Armeria is now componentized, making it easier to understand and extend.

schiemon added 2 commits June 25, 2025 13:46
- bundle parameters of private methods into a `RetryingContext`
- extract attempt execution
- improve naming of variables
- extract attempt execution
- improve naming of variables
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from 577e9d5 to 4aaa678 Compare June 25, 2025 11:58
@schiemon schiemon changed the title Clean up RetryingClient Clean up RetryingClient and RetryingRpcClient Jun 25, 2025
@schiemon schiemon marked this pull request as ready for review June 25, 2025 12:13
@schiemon schiemon mentioned this pull request Jun 25, 2025
6 tasks
@ikhoon ikhoon added the cleanup label Jun 26, 2025

codecov bot commented Jun 26, 2025

Codecov Report

❌ Patch coverage is 74.32905% with 220 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.05%. Comparing base (8150425) to head (edfc33b).
⚠️ Report is 197 commits behind head on main.

Files with missing lines Patch % Lines
...inecorp/armeria/client/retry/HttpRetryAttempt.java 72.32% 31 Missing and 31 partials ⚠️
...linecorp/armeria/client/retry/RpcRetryAttempt.java 56.55% 29 Missing and 24 partials ⚠️
...ecorp/armeria/client/retry/RetriedHttpRequest.java 73.38% 7 Missing and 30 partials ⚠️
...necorp/armeria/client/retry/RetriedRpcRequest.java 70.96% 6 Missing and 21 partials ⚠️
...p/armeria/client/retry/AbstractRetryingClient.java 84.09% 7 Missing and 7 partials ⚠️
...corp/armeria/client/retry/DefaultRetryCounter.java 65.51% 7 Missing and 3 partials ⚠️
...rp/armeria/client/retry/DefaultRetryScheduler.java 90.72% 3 Missing and 6 partials ⚠️
...m/linecorp/armeria/internal/client/ClientUtil.java 80.55% 4 Missing and 3 partials ⚠️
.../armeria/client/retry/AbortedAttemptException.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6292      +/-   ##
============================================
- Coverage     74.46%   74.05%   -0.41%     
- Complexity    22234    23072     +838     
============================================
  Files          1963     2068     +105     
  Lines         82437    86556    +4119     
  Branches      10764    11419     +655     
============================================
+ Hits          61385    64101    +2716     
- Misses        15918    16945    +1027     
- Partials       5134     5510     +376     


Comment on lines 272 to 276
final RetryConfig<HttpResponse> config = retryingContext.config();
final ClientRequestContext ctx = retryingContext.ctx();
final HttpRequestDuplicator reqDuplicator = retryingContext.reqDuplicator();
final HttpRequest req = retryingContext.req();
final HttpResponse res = retryingContext.res();
Contributor

Re-allocating the fields to the local variables looks redundant.
What do you think of moving doExecute0() to RetryingContext and executing the method directly?

new RetryingContext(ctx, mappedRetryConfig(ctx),
                    req, reqDuplicator, res, resFuture).execute();

Contributor Author
@schiemon schiemon Jun 26, 2025

Re-allocating the fields to the local variables looks redundant.

Yes, they are redundant; their sole purpose is to shorten the field accesses on RetryingContext.

What do you think of moving doExecute0() to RetryingContext and executing the method directly?

...together with all the helper methods (e.g. handleAggregatedResponse), right? Let me try it out and see how it looks.

Contributor

...together with all the helper methods (e.g. handleAggregatedResponse), right? Let me try it out and see how it looks.

👍 I had a similar thought, though it wasn’t very specific. I imagined RetryingClient spawns a new RetryContext and delegates it to execute requests and gather results.

Contributor Author
@schiemon schiemon Jun 26, 2025

I think I understand where you're heading. I also believe that encapsulating a retry attempt and strongly hiding all internals is a good idea. I've pushed a design. Instead of moving everything into RetryingContext, I kept the constant/global data in RetryingContext and moved everything attempt-related into a new class, Attempt.
Please let me know what you think @ikhoon. FYI: I didn't strive for green tests, just enough to have a discussion.

Contributor

I like the current approach.

  • Would you move RetryingContext and Attempt to top-level classes?
  • Some methods, such as abort(), which only access the fields of RetryingContext, may be moved into RetryingContext.

Contributor Author

cc @jrhee17. Thanks for the review, I am also curious what you think about the two points above.

Contributor Author
@schiemon schiemon Jul 10, 2025

@jrhee17 @ikhoon bump :)

Contributor

What about deriving an interface RetryingContext and provide HttpRetryingContext and RpcRetryingContext for the respective clients?

I understand the current functionality of RetryingContext seems to be to 1) schedule retry attempts 2) handle the decision of a retry attempt.

interface RetryingContext {
  CompletableFuture<Boolean> init();

  RetryAttempt newRetryAttempt();

  void commit(RetryAttempt attempt);
  void abort(Throwable cause);
}

I think the above basic functionalities can also be used from RetryingRpcClient.

I'm unsure if inheritance or composition is best, but overall I agree that unifying the logic between the two retrying clients seems useful for implementing hedging

Contributor

Another thought I had is to unify state(ctx) with RetryingContext so RetryingContext knows about the deadline and last backoff. For hedging later on, it would also know about all the pending attempts to be able to properly commit and abort when the winning attempt is determined. With that, we no longer have two places of global state.

I agree there is no need for maintaining two states at different locations and I agree State can be a local field of RetryingContext instead of ctx#attr.

Contributor

Sorry for the late review. I agree with refactoring the RetryingClient logic into separate classes, such as HttpRetryingContext, but I will check whether adding a lock is truly necessary. I may add some commits myself or leave comments on that.

schiemon added 3 commits June 26, 2025 18:17
- RetryingClient is broken, just for discussion purposes
…ngContext

- put RetryingContext and RetryAttempt into separate classes
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from f51b78c to b31c646 Compare June 27, 2025 08:37
Contributor
@jrhee17 jrhee17 left a comment

Thanks for the proposal, left some thoughts on the direction of the PR

}

CompletableFuture<Void> execute() {
assert state == State.INITIALIZED;
Contributor

I prefer we remove the state-related assertions because:

  1. The assertions, running asynchronously, will mostly be swallowed, so users won't be able to know why an assertion failed most of the time.
  2. The additional state bookkeeping isn't thread-safe. However, I think certain methods (e.g. RetryContext#abort) can be called from multiple threads, which makes it difficult to keep track. I imagine this will become more of an issue when hedging is introduced.

Contributor Author
@schiemon schiemon Jul 3, 2025

Personally, I find asserts invaluable as they

  1. document the (otherwise implicit!) assumptions we have about a piece of code
  2. let the program crash early at the point of violation without letting the program run in a corrupt state

I am wondering about this matter as in the codebase I can find several occurrences of asserts in async settings (e.g. ) and actually also in combination with state machine pattern:

[Screenshot of asserts in existing Armeria code]

The assertions, running asynchronously, will mostly be swallowed, so users won't be able to know why an assertion failed most of the time.

I'm unsure what you mean by swallowing errors; could you expand on that?

The additional state bookkeeping isn't thread-safe. However, I think certain methods (e.g. RetryContext#abort) can be called from multiple threads, which makes it difficult to keep track. I imagine this will become more of an issue when hedging is introduced.

Indeed, I still need to check synchronization safety, but this is not due to the state bookkeeping, right?
On that point: I am very curious what your/Armeria's stance is on synchronization in the framework, given Armeria's thread model. Because, as an alternative to synchronization primitives, we could use thread confinement: enforcing that only a specific event loop executes a specific piece of code. I imagine for RetryingContext this would mean that solely ctx.eventLoop() is allowed to execute code, whereas for an attempt it is attemptCtx.eventLoop(). Then, every time we need to do a switch, we would reschedule on the appropriate event loop. For example, when the attempt resolves its completeness promise, we would reschedule from the attempt event loop to the main event loop:

        attempt.whenComplete().handle((unused, unexpectedAttemptCause) -> {
            if (unexpectedAttemptCause != null) {
                assert attempt.state() == RetryAttempt.State.ABORTED;
                rctx.abort(unexpectedAttemptCause);
                return null;
            }
            ...
        });

I would confine the handler to the original request event loop like so:

        attempt.whenComplete().handleAsync((unused, unexpectedAttemptCause) -> {
            if (unexpectedAttemptCause != null) {
                assert attempt.state() == RetryAttempt.State.ABORTED;
                rctx.abort(unexpectedAttemptCause);
                return null;
            }
            ...
        }, rctx.ctx().eventLoop());

Contributor

Personally, I find asserts invaluable as they

document the (otherwise implicit!) assumptions we have about a piece of code
let the program crash early at the point of violation without letting the program run in a corrupt state
I am wondering about this matter as in the codebase I can find several occurrences of asserts in async settings (e.g. ) and actually also in combination with state machine pattern:

Sorry, let me rephrase. I'm fine with introducing states in general. I don't think RetryAttempt needs to maintain a separate state as it is relatively simple.

I'm unsure what you mean by swallowing errors; could you expand on that?

I was thinking of the following pattern which seems to be used often in this class.

CompletableFuture<V> cf = new CompletableFuture<>();
someAsyncMethod().handle((val, cause) -> {
    // the returned cf won't reflect the failed assertion
    assert state == State.Expected; // assertion without a try..catch
    cf.complete(val);
    return null;
});
return cf;

Unless the thread defines an UncaughtExceptionHandler, I guessed that users won't be able to receive a signal that an assertion failed.
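One way to keep such failures visible is to run the check inside a try..catch and route anything it throws, including an AssertionError, into the returned future. A sketch of that pattern (the helper and its names are mine, not from the PR):

```java
import java.util.concurrent.CompletableFuture;

// Sketch: propagate callback failures into the returned future instead of losing them.
final class Futures {
    private Futures() {}

    static <V> CompletableFuture<V> guarded(CompletableFuture<V> source, Runnable stateCheck) {
        final CompletableFuture<V> cf = new CompletableFuture<>();
        source.handle((val, cause) -> {
            try {
                if (cause != null) {
                    cf.completeExceptionally(cause);
                } else {
                    stateCheck.run(); // may throw, e.g. an AssertionError
                    cf.complete(val);
                }
            } catch (Throwable t) {
                cf.completeExceptionally(t); // the failed check now surfaces via cf
            }
            return null;
        });
        return cf;
    }
}
```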

Contributor Author

I see what you mean with the assertions. I will keep an eye on that and ensure that assertion errors properly bubble up into the futures.

Contributor Author

On that point: I am very curious what your/Armerias stance is on synchronization in the framework, given Armeria's thread-model. [...]

What are your/Armeria's thoughts on that matter? For example, do I see it correctly, that the call to HttpResponse.abort MUST be done by the eventloop the response is associated with as the response handlers expect that?

Contributor

Indeed, I still need to check synchronization safety, but this is not due to the state bookkeeping, right?

I was simply pointing out that:

  1. The armeria constructs being used (i.e. RequestLog#endRequest, HttpResponse#abort etc..) are already guarded by primitive synchronizations.
    e.g.

    private void abort0(@Nullable Throwable cause) {

  2. On the other hand, RetryingContext#state doesn't seem to be guarded by any synchronizations.

Due to the introduction of a new state instead of using the previously existing constructs (RequestLog, HttpResponse), we may need to introduce a new synchronization method.

we could use thread confinement: enforcing that only a specific event loop executes a specific piece of code

My intuition is that given that this logic is on the request path, it would be nice if we can minimize event loop rescheduling to minimize latency like done in the other constructs mentioned above (since event loops are shared with other requests).

Having said this, I'm not 100% sure which synchronization method is best before actually going through the implementation.

ABORTED
}

private State state;
Contributor

While I like the idea of introducing RetryAttempt, I imagine the functionality of this class to:

  • Represent a single request execution. Hence, the ctx and res are fixed and final.
  • RetryAttempt is responsible for storing the result of an execution. Bookkeeping for the child log (ctx.log) is done in this class as well.

On the other hand, I don't think RetryAttempt necessarily should be responsible for enqueueing a request. What do you think of allowing RetryingContext to have more responsibility on scheduling retry requests?

e.g. RetryingContext can be responsible for enqueueing a new request

class RetryingClient
    private void doExecuteAttempt(RetryingContext rctx) {
...
       rctx.newRetryAttempt().handle((retryAttempt, cause) -> {...})

Where the method signature could look something like the following:

class RetryingContext {
    CompletableFuture<RetryAttempt> newRetryAttempt() {
....
        HttpResponse res = executeAttemptRequest(ctx, number, delegate);
....
    }

And RetryAttempt can contain final fields

class RetryAttempt {
...
    private final ClientRequestContext ctx;
    private final HttpResponse res;
    @Nullable
    private final HttpResponse truncatedRes;
    @Nullable
    private final Throwable cause;

This way, I think the responsibility of each class could become more clear - and we can worry less about race conditions.
Let me know what you think.

Contributor Author
@schiemon schiemon Jul 3, 2025

This might cut the INITIALIZED state from a RetryAttempt, as it would then start as EXECUTED, so I like your idea. Let me implement that.


Comment on lines 208 to 232
// The request or response has been aborted by the client before it receives a response,
// so stop retrying.
if (req.whenComplete().isCompletedExceptionally()) {
    state = State.COMPLETING;
    req.whenComplete().handle((unused, cause) -> {
        abort(cause);
        return null;
    });
    return true;
}

if (res.isComplete()) {
    state = State.COMPLETING;
    res.whenComplete().handle((result, cause) -> {
        final Throwable abortCause;
        if (cause != null) {
            abortCause = cause;
        } else {
            abortCause = AbortedStreamException.get();
        }
        abort(abortCause);
        return null;
    });
    return true;
}
Contributor

Given that these blocks simply add a callback to req.whenComplete() and res.whenComplete(), I wonder if this can be moved to init() instead. The fact that this is called inside isCompleted(), which seems to simply return a state, was a little surprising to me.


…edRequest to AbstractRetryingClient

- also move out setting the deadline, saving deadlineTimeNanos() from RetriedRequest
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from f35c70a to 2f4b698 Compare September 9, 2025 09:23
@schiemon schiemon force-pushed the tidy-up-retrying-client branch 3 times, most recently from cb5f41b to 588761b Compare September 11, 2025 09:41
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from 588761b to df10849 Compare September 11, 2025 10:07
@schiemon schiemon force-pushed the tidy-up-retrying-client branch from 26d9797 to c5b883b Compare September 12, 2025 12:47
Contributor Author

schiemon commented Sep 13, 2025

Currently, the CI fails because of coverage. Once you, @ikhoon and @jrhee17, agree on the current architecture and concurrency policy, I will add tests to the RetryingClients, RetriedRequests, and corresponding attempts, and also update the PR description.

@schiemon schiemon requested a review from jrhee17 September 13, 2025 14:45
retryConfig != null ? retryConfig
: requireNonNull(retryMapping.get(ctx, req),
"retryMapping.get() returned null");
final RetryContext rctx = newRetryContext(unwrap(), ctx, req, config);
Contributor

I feel like this PR is attempting to solve three problems: 1) improving readability 2) unifying http/rpc logic 3) refactoring for hedging.
While I like the proposed end-design, given that the previous RetryingClient was already pretty complex and most users are using this feature, it is difficult (at least for me) to review and ensure that there aren't any regressions due to the large amount of changes.

What do you think of focusing on 3) refactoring for hedging specifically in a single PR? (it could be this PR or a separate PR)

Specifically, I like the idea of having a dedicated RetryContext. The RetryContext (or RetryScheduler/RetryExecutor) would be responsible for scheduling retries and completing the overall retry attempt.

class ExecutionResult {
    private final RetryDecision retryDecision;
    @Nullable
    private final Throwable cause;
    @Nullable
    private final HttpResponse originalRes;
    @Nullable
    private final ClientRequestContext derivedCtx;
...
}

class RetryContext {
    // Pretty much the same as the current `RetryingClient#doExecute0`
    CompletableFuture<ExecutionResult> doExecute()
    // Functions the same as `RetryingClient#handleException`
    void handleException(Throwable cause)
    // Calls https://github.com/line/armeria/blob/1051e9650432a90f0d4ee8e296f6d37166ec4710/core/src/main/java/com/linecorp/armeria/client/retry/RetryingClient.java#L541-L544
    void complete(HttpResponse res)
}

private void doExecute0(RetryContext retryContext) {
    final CompletableFuture<ExecutionResult> executionRes = retryContext.doExecute();
    executionRes.thenAccept(executionResult -> {
        handleRetryDecision(executionResult, retryContext);
    });
}

This would also minimize diffs so that other maintainers can easily review the changeset.

Later on when implementing hedging, I think a single ExecutionResult can be selected, and the rest can be cancelled using ExecutionResult#derivedCtx#cancel (possibly the scheduled future can also be added to the ExecutionResult and could be cancelled as well).

Once the above is implemented and merged, I think RetryContext can then be further abstracted to RetriedRequest, RetryScheduler like done in this PR. Let me know what you think.

Contributor Author
@schiemon schiemon Sep 17, 2025

Hey, thank you for your proposal @jrhee17.
I fear that during the discussions and changes, we shifted toward a slightly broader goal than initially intended. In the beginning, if you remember, I focused on readability, specifically by removing parameter passing in the client. That led me to a solution very similar to what you are suggesting now. Back then, I tried to minimize refactoring overhead. However, as @ikhoon noted, this left the roles (scheduling, execution, controlling, etc.) mixed inside RetryContext, which I now see as the main problem of the retrying classes.
Because of that, splitting the PR into two or more smaller PRs — first introducing RetryContext bundling and then extracting components — would not help much, as most of the complexity comes from the refactoring itself.

I agree this PR might be challenging to review due to its length. However, since the retrying code is now extracted into clear components (or so I hope 😄), it should be possible to review step by step.
To assist reviewing, I updated the PR description to clearly state the goals and included a small component diagram and descriptions in a bottom-up fashion.

For the review, I suggest starting with the control flow — AbstractRetryingClient and how it interacts with the interfaces.
After that, I would be glad to get feedback on how to simplify the API further.
If you agree with the high-level API, you can then look at the implementation in any order (DefaultRetryScheduler, RetryCounter, Retried*Request, Retry*Attempt).

For RetriedHttpRequest and RetryingHttpAttempt, it is best to have the old RetryingClient open for comparison, and for RetriedRpcRequest and RetryingRpcAttempt, compare with RetryingRpcClient.
Except for minor changes, they should contain the same code, just reorganized differently.

Can I do anything else to help with the review, @jrhee17?

Contributor

I see.

I think each design has its pros and cons. I agree the as-is code is difficult to reason about due to the large number of parameters and complex functionality. However, I also think we didn't have to worry about concurrency (because everything was just passed via parameters), so at least the flow was clear.

With the proposed design, the fact that each component maintains a separate state - and the effort needed to synchronize that state across different components - is what makes it difficult for me to review.
I'm curious whether there is any way state management can be isolated to a single component.

Back then, I tried to minimize refactoring overhead.

If a simple refactoring effort is done first, I thought at the very least hedging (which we are all excited about) can gain some traction - paving ways for further refactoring which I am all for as well.

Having said this, this is just my opinion so I'm curious of your and other maintainers' opinions as well.

Contributor Author

@schiemon schiemon Sep 18, 2025

OK, I thought about how to best split this refactoring and came up with a battle plan, which you can find below, keeping this PR as a "north star" for where we want to go. Technically, we can already implement hedging after PR 2 [1], so we may not need to go the full way. Let me know if it makes sense to you @jrhee17

[1]: Because we need to have a central place dealing with racing attempts trying to schedule the next retry task.

@schiemon schiemon changed the title Clean up RetryingClient and RetryingRpcClient Componentize RetryingClient and RetryingRpcClient Sep 17, 2025
@schiemon schiemon requested a review from jrhee17 September 17, 2025 13:17
@schiemon
Contributor Author

schiemon commented Sep 18, 2025

PR 1 – Extract counting state and remove methods from AbstractRetryingClient. Shorten method signatures

  • Make AbstractRetryingClient private.
  • Extract lastBackoff, currentAttemptNoWithLastBackoff, and totalAttemptNo from State into RetryCounter.
  • Extract derivedCtx and response into RetryAttempt<O>.
  • Put the RetryCounter, the remaining parts of State (deadlineNanos), and ctx, retryConfig, rootReqDuplicator, originalReq, originalRes, originalResFuture into RetryContext<I, O>.
  • Move setResponseTimeout(ClientRequestContext ctx) into RetryContext.
  • Move onRetryingComplete into Retrying(Rpc)Client.
  • Change the signature of all methods of Retrying(Rpc)Client and AbstractRetryingClient to accept only RetryContext and RetryAttempt. For example:
private void handleRetryDecision(@Nullable RetryDecision decision, ClientRequestContext ctx,
                                 ClientRequestContext derivedCtx, HttpRequestDuplicator rootReqDuplicator,
                                 HttpRequest originalReq, HttpResponse returnedRes,
                                 CompletableFuture<HttpResponse> future, HttpResponse originalRes) { ... }

becomes

private void handleRetryDecision(RetryContext<HttpRequest, HttpResponse> rctx,
                                 RetryAttempt<HttpResponse> attempt) { ... }

and

getNextDelay(ClientRequestContext ctx, Backoff backoff)

becomes

getNextDelay(RetryContext<HttpRequest, HttpResponse> rctx, Backoff backoff)
  • Remove getTotalAttempts, newDerivedContext, retryRuleWithContent (unused), retryRule (unused), mappedRetryConfig (it is passed into doExecute) and getNextDelay(ClientRequestContext ctx, Backoff backoff).

Result:

  • Counting is now encapsulated in RetryCounter.
  • RetryContext<I, O> and RetryAttempt<O> are immutable classes holding the complete state.
  • All methods are simplified as parameters can be collapsed into RetryContext or RetryAttempt.
  • AbstractRetryingClient is left with execute, doExecute, scheduleNextRetry, and
    getNextDelay(ClientRequestContext ctx, Backoff backoff, long millisAfterFromServer).
  • The retry logic in Retrying(Rpc)Client stays untouched except for changed references to state.
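
The PR 1 shape could be sketched roughly as follows. The class names (RetryCounter, RetryContext, RetryAttempt) follow the plan above, but the fields and accessors here are simplified placeholders rather than Armeria's real ClientRequestContext, HttpRequestDuplicator, and friends:

```java
// Rough sketch of the planned PR 1 value classes; the field set is a
// simplified placeholder, not Armeria's actual types.
final class RetryCounter {
    private int totalAttemptNo;

    // Counts every attempt, successful or not.
    int nextTotalAttemptNo() {
        return ++totalAttemptNo;
    }

    int totalAttemptNo() {
        return totalAttemptNo;
    }
}

// Request-level state; collapses the long parameter lists of the
// current Retrying(Rpc)Client methods into a single object.
final class RetryContext<I, O> {
    private final RetryCounter counter = new RetryCounter();
    private final long deadlineNanos;
    private final I originalReq;

    RetryContext(long deadlineNanos, I originalReq) {
        this.deadlineNanos = deadlineNanos;
        this.originalReq = originalReq;
    }

    RetryCounter counter() { return counter; }
    long deadlineNanos() { return deadlineNanos; }
    I originalReq() { return originalReq; }
}

// Immutable attempt-level state: what one attempt produced.
final class RetryAttempt<O> {
    private final O response;

    RetryAttempt(O response) {
        this.response = response;
    }

    O response() { return response; }
}
```

With value classes of this shape, a method like handleRetryDecision(rctx, attempt) only needs two arguments instead of eight.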

PR 2 – Extract scheduling

  • Move scheduleNextRetry and getNextDelay into RetryScheduler (see this PR).
  • Move RetryScheduler into RetryContext.
  • Change doExecute to doExecute(RetryContext rctx).
  • Move getRetryAfterMillis to ClientUtils.
  • After calls to getRetryAfterMillis add calls to rctx.scheduler().addMinimumBackoffMillis.
  • Change calls to scheduleNextRetry to RetryScheduler.schedule.
  • Move ARMERIA_RETRY_COUNT to ClientUtils.
  • Move mapping and retryConfig from AbstractRetryingClient to Retrying(Rpc)Client.
  • Remove AbstractRetryingClient.

Result:

  • Scheduling is now encapsulated in RetryScheduler.
  • AbstractRetryingClient is removed.
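
For orientation, a minimal sketch of what the extracted RetryScheduler could look like. The method names (schedule, addMinimumBackoffMillis) come from the plan above; the ScheduledExecutorService-backed implementation and the deadline check are illustrative assumptions, not the actual Armeria code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface RetryScheduler {
    // Raises the floor for the next delay, e.g. after getRetryAfterMillis
    // parsed a Retry-After header from the server.
    void addMinimumBackoffMillis(long millis);

    // Schedules the next attempt; returns false if it would miss the deadline.
    boolean schedule(Runnable retryTask, long backoffMillis);
}

final class SimpleRetryScheduler implements RetryScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                final Thread t = new Thread(r);
                t.setDaemon(true); // don't keep the JVM alive
                return t;
            });
    private final long deadlineNanos;
    private long minimumBackoffMillis;

    SimpleRetryScheduler(long responseTimeoutMillis) {
        deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(responseTimeoutMillis);
    }

    @Override
    public synchronized void addMinimumBackoffMillis(long millis) {
        minimumBackoffMillis = Math.max(minimumBackoffMillis, millis);
    }

    @Override
    public synchronized boolean schedule(Runnable retryTask, long backoffMillis) {
        final long delayMillis = Math.max(backoffMillis, minimumBackoffMillis);
        if (System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis) > deadlineNanos) {
            return false; // the next attempt would exceed the response deadline
        }
        executor.schedule(retryTask, delayMillis, TimeUnit.MILLISECONDS);
        return true;
    }
}
```

Centralizing both the backoff floor and the deadline check in one component is what makes the calling code in Retrying(Rpc)Client safe to use without a required method-call order.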

PR 3 – Extract execution

PR 4 – Extract control

"We can predict everything, except the future"

@jrhee17
Contributor

jrhee17 commented Sep 22, 2025

Thanks for understanding.

  • Make AbstractRetryingClient private.

I doubt there are any users actually rolling their own RetryingClient. Having said this, this class was added a long time ago, so the other maintainers may know more about whether this should remain public/without breaking changes or not.

  • Extract lastBackoff, currentAttemptNoWithLastBackoff, and totalAttemptNo from State into RetryCounter.

Sounds good to me.

  • Extract derivedCtx and response into RetryAttempt<O>.

Just to make sure we're on the same page, I understood RetryAttempt to be a value object which holds the result of each retry attempt. (Instead of the current semantics where RetryAttempt is responsible for the entire lifecycle of initializing/scheduling the retry attempt.) Let me know if I'm misunderstanding.

  • Put the RetryCounter, the remaining parts of State (deadlineNanos), and ctx, retryConfig, rootReqDuplicator, originalReq, originalRes, originalResFuture into RetryContext<I, O>.

👍

  • Move setResponseTimeout(ClientRequestContext ctx) into RetryContext.

👍

  • Move onRetryingComplete into Retrying(Rpc)Client.

I didn't really understand the motivation of this, but I assume this is due to different handling of onRetryingComplete for http and rpc.

Also, just to make sure I understand the idea correctly - I'm curious where the actual client call (executeWithFallback) will be made (in RetryingClient or in RetryContext)

Overall, I have no objection to step 1. Also, for quick iteration I don't think adding tests is necessary as long as the mechanical refactoring is done without a different threading model.

@schiemon
Contributor Author

schiemon commented Sep 22, 2025

I doubt there are any users actually rolling their own RetryingClient. Having said this, this class was added a long time ago, so the other maintainers may know more about whether this should remain public/without breaking changes or not.

It would be really nice to decide early that we can make it private so that we can just forget about the effects of refactoring on users that might inherit from this class. Wdyt @ikhoon; is making AbstractRetryingClient private a problem?

Just to make sure we're on the same page, I understood RetryAttempt to be a value object which holds the result of each retry attempt. (Instead of the current semantics where RetryAttempt is responsible for the entire lifecycle of initializing/scheduling the retry attempt.) Let me know if I'm misunderstanding.

At this step, RetryAttempt will just be a value class containing attempt-related fields, correct. Extracting the execution logic is something for PR 3.

I didn't really understand the motivation of this, but I assume this is due to different handling of onRetryingComplete for http and rpc.

It is part of the overall process of removing the inheritance bond between Retrying(Rpc)Client and AbstractRetryingClient and moving towards composition via smaller, testable components. If we want users to build their own retrying client, then exposing clear-cut components is easier to understand than exposing methods that must be called in a specific order. If you look at PR 2, where I plan to extract scheduling, I would also like to remove AbstractRetryingClient as a whole, as there will be nothing left in it (which is reasonable, since AbstractRetryingClient no longer owns any state after that PR).
But you are also correct that, at the latest when refactoring execution, we need to move that method into the executing component, which owns the retry finalization process.

Also, just to make sure I understand the idea correctly - I'm curious where the actual client call (executeWithFallback) will be made (in RetryingClient or in RetryContext)

Similar to RetryAttempt, RetryContext will be a value class holding request-level fields together to prevent parameter explosion inside the Retrying(Rpc)Client. The execution logic (this includes executeWithFallback) should stay inside the Retrying(Rpc)Clients. As mentioned above, refactoring attempt execution should be tackled in a later PR. However, even after PR 4 I expect that RetryContext will still be a value class just holding the components together (RetryScheduler, RetryCounter, etc.) while some other component is responsible for execution, e.g. RetriedRequest as we have now, or RetryExecutor, or something else.

.cc @jrhee17

@jrhee17
Contributor

jrhee17 commented Sep 24, 2025

It is part of the overall process of removing the inheritance bond between Retrying(Rpc)Client and AbstractRetryingClient and move towards composition via smaller, testable components

I see, I'm not particularly pro or against keeping/removing AbstractRetryingClient, so sounds fine to me.
I understood this is an issue for part 2.

The execution logic (this includes executeWithFallback) should stay inside the Retrying(Rpc)Clients. As mentioned above, refactoring attempt execution should be tackled in a later PR. However even after PR 4 I expect that RetryContext will still be a value class just holding together all the components (RetryScheduler, RetryCounter, etc.)

I guess I imagined that, given RetryContext would hold everything related to the original request/response (ctx, retryConfig, rootReqDuplicator, originalReq, originalRes, originalResFuture), to preserve the current shape, execution would be done in the RetryContext as well.

e.g.

final var retryCtx = new RetryContext(ctx, reqDuplicator, req, res, responseFuture, unwrap());
...
// pretty much does the same thing as the current `RetryingClient#doExecute0`
retryCtx.execute();
// and gives an easy segue to hedging
scheduleAtRate(() -> retryCtx.execute());

Having said this, I understood your idea as follows (which I'm also fine with):

private void doExecute0(RetryContext retryCtx) {
  // access parameters like retryCtx.originalCtx, retryCtx.reqDuplicator, etc. as needed
  // while keeping the current structure
}

// later on for hedging
scheduleAtRate(() -> doExecute0(retryCtx))
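
The second reading can be made concrete with a small self-contained sketch. RetryContextSketch, doExecute0, and the scheduleAtFixedRate-based hedging loop below are simplified stand-ins for the eventual Armeria code, just to show that this shape gives hedging a single entry point to re-trigger:

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for the planned RetryContext; only what this sketch needs.
final class RetryContextSketch {
    final AtomicInteger attempts = new AtomicInteger();
    final String originalReq;

    RetryContextSketch(String originalReq) {
        this.originalReq = originalReq;
    }
}

final class HedgingSketch {
    // Plays the role of RetryingClient#doExecute0: reads all state from rctx.
    static void doExecute0(RetryContextSketch rctx) {
        rctx.attempts.incrementAndGet();
        // ... the actual executeWithFallback(rctx.originalReq) call would go here ...
    }

    // Hedging later becomes a fixed-rate loop over the same entry point.
    static void hedge(RetryContextSketch rctx, ScheduledExecutorService executor, long rateMillis) {
        executor.scheduleAtFixedRate(() -> doExecute0(rctx), 0, rateMillis, TimeUnit.MILLISECONDS);
    }
}
```

Because each attempt only needs the shared context object, issuing a hedged attempt is the same call as issuing a retry, which is what makes this shape attractive as a stepping stone.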

@schiemon
Contributor Author

First PR ready for review: #6411

@schiemon schiemon closed this Oct 15, 2025