Conversation


@schiemon schiemon commented Sep 24, 2025

Motivation:

Currently, AbstractRetryingClient manages ctx.attr(STATE), while Retrying(Rpc)Client passes a "backpack state" by forwarding multiple parameters through internal methods. This leads to several issues:

  1. Passing that many parameters ("backpacking") makes Retrying(Rpc)Client harder to read and changes to it error-prone, as some parameters share the same types.
  2. Understanding the retry control flow requires jumping between AbstractRetryingClient and Retrying(Rpc)Client, reducing readability.
  3. The distinction between the two states is unclear, making it harder to quickly grasp the high-level retry logic in Armeria.

This PR removes AbstractRetryingClient.State, splitting it into RetryContext and RetryCounter, which are now owned by Retrying(Rpc)Client. All methods referencing AbstractRetryingClient.State are removed. With that, AbstractRetryingClient is left with only two non-trivial methods: getDelay and scheduleNextRetry. As we plan to extract scheduling from AbstractRetryingClient and then delete it in a later PR, it is already made private here to simplify the overall changes to the public API [1].

Finally, the PR updates Retrying(Rpc)Client to use the new RetryContext and refactors its code for improved clarity.

Modifications:

  • Make AbstractRetryingClient private
  • Introduce RetryCounter and move counting state and logic from AbstractRetryingClient.State into it
  • Introduce RetryContext, an immutable value class holding the non-attempt-specific retry state from AbstractRetryingClient.State, such as the original request or the HttpRequestDuplicator
  • Remove AbstractRetryingClient.State
  • Introduce RetryAttempt, an immutable class holding the attempt context and response
  • Rename variables inside Retrying(Rpc)Client for better readability
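As a loose illustration of the State split, a minimal sketch (all fields and method signatures below are assumptions for illustration, not Armeria's actual code; placeholder types stand in for HttpRequest and HttpRequestDuplicator):

```java
// Hypothetical sketch: mutable attempt counting lives in RetryCounter,
// while RetryContext is an immutable holder of per-request retry state.
final class RetryCounter {
    private final int maxTotalAttempts;
    private int totalAttempts;

    RetryCounter(int maxTotalAttempts) {
        this.maxTotalAttempts = maxTotalAttempts;
    }

    boolean canAttempt() {
        return totalAttempts < maxTotalAttempts;
    }

    void recordAttempt() {
        totalAttempts++;
    }

    int totalAttempts() {
        return totalAttempts;
    }
}

final class RetryContext {
    private final Object originalRequest;   // placeholder for the original HttpRequest
    private final Object requestDuplicator; // placeholder for the HttpRequestDuplicator

    RetryContext(Object originalRequest, Object requestDuplicator) {
        this.originalRequest = originalRequest;
        this.requestDuplicator = requestDuplicator;
    }

    Object req() {
        return originalRequest;
    }
}
```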

Result:

  • [Breaking] AbstractRetryingClient is now private.
  • Improves readability of RetryingClient and RetryingRpcClient

[1]: Making AbstractRetryingClient private should be less of a problem anyway, as AbstractRetryingClient was likely never overridden and customized, since the current design makes it hard to do so correctly. This is of course not guaranteed, so we need to consider whether to provide a migration path. At the moment I did not implement one, but please let me know if you think this is necessary.


codecov bot commented Sep 24, 2025

Codecov Report

❌ Patch coverage is 86.05442% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.09%. Comparing base (8150425) to head (d269081).
⚠️ Report is 192 commits behind head on main.

Files with missing lines Patch % Lines
.../linecorp/armeria/client/retry/RetryingClient.java 88.19% 10 Missing and 9 partials ⚠️
...om/linecorp/armeria/client/retry/RetryCounter.java 60.00% 8 Missing and 4 partials ⚠️
...om/linecorp/armeria/client/retry/RetryContext.java 86.84% 2 Missing and 3 partials ⚠️
...necorp/armeria/client/retry/RetryingRpcClient.java 89.79% 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6411      +/-   ##
============================================
- Coverage     74.46%   74.09%   -0.37%     
- Complexity    22234    23027     +793     
============================================
  Files          1963     2064     +101     
  Lines         82437    86166    +3729     
  Branches      10764    11310     +546     
============================================
+ Hits          61385    63847    +2462     
- Misses        15918    16906     +988     
- Partials       5134     5413     +279     

☔ View full report in Codecov by Sentry.



@jrhee17 jrhee17 left a comment


Thanks, this looks a lot easier to review.

While looking at the overall APIs, I was curious how this will be used for hedging.

I was imagining there would be a single method which invokes retry attempts and returns a representation of the asynchronous result so that it could be later used to complete the original response or be cancelled.
(e.g. RetryAttempt retry(RetryContext))

Let me know if I'm missing anything


schiemon commented Sep 30, 2025

FYI: This PR is not exclusively for hedging. It is more of a cleanup so that we can all reason better about RetryingClient, both when discussing changes for hedging and in general.

Let me address your question anyway.

I was imagining there would be a single method which invokes retry attempts and returns a representation of the asynchronous result so that it could later be used to complete the original response or be cancelled.
(e.g. RetryAttempt retry(RetryContext))

Abstracting away a single retry and returning some kind of handle is certainly something we need to do for hedging, yes. This is because we want to easily start the original attempt as well as schedule the hedged one, while keeping references to the attempts in order to abort the one that does not get committed (i.e. the one that loses the race). Compared to the other changes, this is a fairly simple and small change we can make when implementing hedging.
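A minimal sketch of what such a handle could look like (RetryAttemptHandle, response(), and abort() are hypothetical names, and String stands in for the real response type):

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical handle returned by a single retry(RetryContext) entry point;
// all names here are illustrative, not Armeria's API.
final class RetryAttemptHandle {
    private final CompletableFuture<String> response = new CompletableFuture<>();

    // Completes when the attempt's response arrives (or it is aborted).
    CompletableFuture<String> response() {
        return response;
    }

    // Aborts a losing attempt so its resources are released.
    void abort(Throwable cause) {
        response.completeExceptionally(cause);
    }
}
```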

Maybe to give you a general overview, here are the things we need to have for hedging later on:

  1. A way to check whether attempts are in flight or retry tasks are scheduled. This determines whether a completing attempt is the last one (i.e. maxAttempts reached and no others active) and thus whether to abort or commit it.
  2. A mechanism to abort in-flight attempts in case concurrent attempts commit.
  3. A mechanism to cancel scheduling a retry task in case another retry task was scheduled with a shorter delay.
  4. A mechanism to cancel scheduling a retry task when retry is completed.
  5. A mechanism to reschedule a retry task in case another attempt returns a server pushback that is greater than the retry task’s scheduling delay.

For 3., 4., and 5. I would use the RetryScheduler from the previous PR.

Originally, 1. and 2. were handled by the RetriedRequest, with the help of a single event loop. When aiming for the least invasive change set, for 1. we just need some counter in RetryContext and to keep it synchronized with the RetryScheduler.

For 2. I think we can solve it via a CompletableFuture<RetryAttempt> committedAttempt which contains the attempt that is going to be committed. Every new attempt would attach a listener to it and call abortAttempt if it notices it lost the race. This future would also be used to call commitAttempt. The handlers in doExecute for rctx.req() and rctx.res() would complete this future first (exceptionally), and listeners on committedAttempt would then call handleException. Furthermore, if an attempt was prepared and ready for a decision, it would first check committedAttempt to see whether the race is already over.

For 1. and 2. we would have to think carefully about concurrency though.
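To make the committedAttempt idea concrete, here is a loose, single-threaded sketch of the race (RetryAttempt is a stand-in class here; only the CompletableFuture-based mechanism is the point, not Armeria's actual types):

```java
import java.util.concurrent.CompletableFuture;

// Sketch: the first attempt to complete committedAttempt wins;
// every other registered attempt aborts itself when it sees the winner.
final class CommitRace {
    static final class RetryAttempt {
        volatile boolean aborted;

        void abort() {
            aborted = true; // the real code would abort the in-flight request
        }
    }

    private final CompletableFuture<RetryAttempt> committedAttempt = new CompletableFuture<>();

    // Every new attempt attaches a listener; attempts that lose the race abort themselves.
    void register(RetryAttempt attempt) {
        committedAttempt.thenAccept(winner -> {
            if (winner != attempt) {
                attempt.abort();
            }
        });
    }

    // The first attempt to complete the future wins and gets committed.
    boolean tryCommit(RetryAttempt attempt) {
        return committedAttempt.complete(attempt);
    }
}
```

Note that in the real client the concurrency concerns mentioned above still apply; this sketch only shows the listener/complete handshake.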


@schiemon schiemon requested a review from jrhee17 October 1, 2025 07:37

schiemon commented Oct 6, 2025

Hi @jrhee17, do you have any updates on this?


jrhee17 commented Oct 17, 2025

Sorry about the delay, just got back from a long vacation

I think we can solve it via a CompletableFuture<RetryAttempt> committedAttempt which contains the attempt that is going to be committed.

Sounds good to me 👍 Just wanted to make sure of the plan after this PR is merged.

For 3.,4., and 5. I would use the RetryScheduler from the previous PR.

I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.


@jrhee17 jrhee17 left a comment


Checked that there aren't any obvious regressions (or at least I couldn't find any)

null);
} catch (Throwable cause) {
duplicator.abort(cause);
handleException(ctx, rootReqDuplicator, future, cause, false);
Contributor


Question) Is there reason why this call was removed?

Contributor Author


Seems to be an oversight; I will correct that.

@jrhee17 jrhee17 added this to the 1.34.0 milestone Oct 17, 2025

schiemon commented Oct 18, 2025

I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.

Sure, I will have some time in the upcoming week, but after that I will be less available. So if I can get a green light on this refactoring from the other maintainers as well, I can build the e2e hedging prototype on top of it.

@github-actions github-actions bot added the Stale label Nov 18, 2025
