Conversation


@schiemon schiemon commented Sep 24, 2025

Motivation:

Currently, AbstractRetryingClient manages ctx.attr(STATE), while Retrying(Rpc)Client passes a "backpack state" by forwarding multiple parameters through internal methods. This leads to several issues:

  1. Passing that many parameters ("backpacking") makes Retrying(Rpc)Client harder to read and changes to it error-prone, as some parameters share the same types.
  2. Understanding the retry control flow requires jumping between AbstractRetryingClient and Retrying(Rpc)Client, reducing readability.
  3. The distinction between the two states is unclear, making it harder to quickly grasp the high-level retry logic in Armeria.

This PR removes AbstractRetryingClient.State, splitting it into RetryContext and RetryCounter, which are now owned by Retrying(Rpc)Client. All methods referencing AbstractRetryingClient.State are removed. With that, AbstractRetryingClient is left with only two non-trivial methods: getDelay and scheduleNextRetry. As we plan to extract scheduling from AbstractRetryingClient and then delete it in a later PR, it is already made private here to simplify the overall changes to the public API [1].

Finally, the PR updates Retrying(Rpc)Client to use the new RetryContext and refactors its code for improved clarity.

Modifications:

  • Make AbstractRetryingClient private
  • Introduce RetryCounter and move counting state and logic from AbstractRetryingClient.State into it
  • Introduce RetryContext, an immutable value class holding the non-attempt-specific retry state from AbstractRetryingClient.State, such as the original request or the HttpRequestDuplicator
  • Remove AbstractRetryingClient.State
  • Introduce RetryAttempt, an immutable class holding the attempt context and response
  • Rename variables inside Retrying(Rpc)Client for better readability
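As a loose illustration of the State split, a minimal sketch (all fields and method signatures below are assumptions for illustration, not Armeria's actual code; placeholder types stand in for HttpRequest and HttpRequestDuplicator):

```java
// Hypothetical sketch: mutable attempt counting lives in RetryCounter,
// while RetryContext is an immutable holder of per-request retry state.
final class RetryCounter {
    private final int maxTotalAttempts;
    private int totalAttempts;

    RetryCounter(int maxTotalAttempts) {
        this.maxTotalAttempts = maxTotalAttempts;
    }

    boolean canAttempt() {
        return totalAttempts < maxTotalAttempts;
    }

    void recordAttempt() {
        totalAttempts++;
    }

    int totalAttempts() {
        return totalAttempts;
    }
}

final class RetryContext {
    private final Object originalRequest;   // placeholder for the original HttpRequest
    private final Object requestDuplicator; // placeholder for the HttpRequestDuplicator

    RetryContext(Object originalRequest, Object requestDuplicator) {
        this.originalRequest = originalRequest;
        this.requestDuplicator = requestDuplicator;
    }

    Object req() {
        return originalRequest;
    }
}
```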

Result:

  • [Breaking] AbstractRetryingClient is now private.
  • Improves readability of RetryingClient and RetryingRpcClient

[1]: Making AbstractRetryingClient private should be less of a problem anyway, as AbstractRetryingClient was likely never overridden and customized, since the current design makes it hard to do so correctly. This is of course not guaranteed, so we need to consider whether to provide a migration path. At the moment I did not implement one, but please let me know if you think this is necessary.


codecov bot commented Sep 24, 2025

Codecov Report

❌ Patch coverage is 86.05442% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.09%. Comparing base (8150425) to head (d269081).
⚠️ Report is 192 commits behind head on main.

Files with missing lines Patch % Lines
.../linecorp/armeria/client/retry/RetryingClient.java 88.19% 10 Missing and 9 partials ⚠️
...om/linecorp/armeria/client/retry/RetryCounter.java 60.00% 8 Missing and 4 partials ⚠️
...om/linecorp/armeria/client/retry/RetryContext.java 86.84% 2 Missing and 3 partials ⚠️
...necorp/armeria/client/retry/RetryingRpcClient.java 89.79% 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6411      +/-   ##
============================================
- Coverage     74.46%   74.09%   -0.37%     
- Complexity    22234    23027     +793     
============================================
  Files          1963     2064     +101     
  Lines         82437    86166    +3729     
  Branches      10764    11310     +546     
============================================
+ Hits          61385    63847    +2462     
- Misses        15918    16906     +988     
- Partials       5134     5413     +279     

☔ View full report in Codecov by Sentry.



@jrhee17 jrhee17 left a comment


Thanks, this looks a lot easier to review.

While looking at the overall APIs, I was curious how this will be used for hedging.

I was imagining there would be a single method which invokes retry attempts and returns a representation of the asynchronous result so that it could be later used to complete the original response or be cancelled.
(e.g. RetryAttempt retry(RetryContext))

Let me know if I'm missing anything


schiemon commented Sep 30, 2025

FYI: This PR is not exclusively for hedging. It is more of a cleanup so that we can all reason better about RetryingClient, both when discussing changes for hedging and in general.

Let me address your question anyway.

I was imagining there would be a single method which invokes retry attempts and returns a representation of the asynchronous result so that it could later be used to complete the original response or be cancelled.
(e.g. RetryAttempt retry(RetryContext))

Abstracting away a single retry and returning some kind of handle is certainly something we need to do for hedging, yes. This is because we want to easily start the original attempt as well as schedule the hedged one, while keeping references to the attempts in order to abort the one that does not get committed (i.e. the one that loses the race). Compared to the other changes, this is a fairly simple and small change we can make when implementing hedging.
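A minimal sketch of what such a handle could look like (RetryAttemptHandle, response(), and abort() are hypothetical names, and String stands in for the real response type):

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical handle returned by a single retry(RetryContext) entry point;
// all names here are illustrative, not Armeria's API.
final class RetryAttemptHandle {
    private final CompletableFuture<String> response = new CompletableFuture<>();

    // Completes when the attempt's response arrives (or it is aborted).
    CompletableFuture<String> response() {
        return response;
    }

    // Aborts a losing attempt so its resources are released.
    void abort(Throwable cause) {
        response.completeExceptionally(cause);
    }
}
```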

Maybe to give you a general overview, here are the things we need to have for hedging later on:

  1. A way to check whether attempts are in flight or retry tasks are scheduled. This determines whether a completing attempt is the last one (i.e. maxAttempts reached and no others active) and thus whether to abort or commit it.
  2. A mechanism to abort in-flight attempts in case concurrent attempts commit.
  3. A mechanism to cancel scheduling a retry task in case another retry task was scheduled with a shorter delay.
  4. A mechanism to cancel scheduling a retry task when retry is completed.
  5. A mechanism to reschedule a retry task in case another attempt returns a server pushback that is greater than the retry task’s scheduling delay.

For 3., 4., and 5. I would use the RetryScheduler from the previous PR.

Originally, 1. and 2. were handled by the RetriedRequest, with the help of a single event loop. When aiming for the least invasive change set, for 1. we just need some counter in RetryContext and to keep it synchronized with the RetryScheduler.

For 2. I think we can solve it via a CompletableFuture<RetryAttempt> committedAttempt which contains the attempt that is going to be committed. Every new attempt would attach a listener to it and call abortAttempt if it notices it lost the race. This future would also be used to call commitAttempt. The handlers in doExecute for rctx.req() and rctx.res() would complete this future first (exceptionally), and listeners on committedAttempt would then call handleException. Furthermore, if an attempt was prepared and ready for a decision, it would first check committedAttempt to see whether the race is already over.

For 1. and 2. we would have to think carefully about concurrency though.
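To make the committedAttempt idea concrete, here is a loose, single-threaded sketch of the race (RetryAttempt is a stand-in class here; only the CompletableFuture-based mechanism is the point, not Armeria's actual types):

```java
import java.util.concurrent.CompletableFuture;

// Sketch: the first attempt to complete committedAttempt wins;
// every other registered attempt aborts itself when it sees the winner.
final class CommitRace {
    static final class RetryAttempt {
        volatile boolean aborted;

        void abort() {
            aborted = true; // the real code would abort the in-flight request
        }
    }

    private final CompletableFuture<RetryAttempt> committedAttempt = new CompletableFuture<>();

    // Every new attempt attaches a listener; attempts that lose the race abort themselves.
    void register(RetryAttempt attempt) {
        committedAttempt.thenAccept(winner -> {
            if (winner != attempt) {
                attempt.abort();
            }
        });
    }

    // The first attempt to complete the future wins and gets committed.
    boolean tryCommit(RetryAttempt attempt) {
        return committedAttempt.complete(attempt);
    }
}
```

Note that in the real client the concurrency concerns mentioned above still apply; this sketch only shows the listener/complete handshake.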


@schiemon schiemon requested a review from jrhee17 October 1, 2025 07:37

schiemon commented Oct 6, 2025

Hi @jrhee17, do you have any updates on this?


jrhee17 commented Oct 17, 2025

Sorry about the delay, just got back from a long vacation

I think we can solve it via a CompletableFuture<RetryAttempt> committedAttempt which contains the attempt that is going to be committed.

Sounds good to me 👍 Just wanted to make sure of the plan after this PR is merged.

For 3.,4., and 5. I would use the RetryScheduler from the previous PR.

I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.


@jrhee17 jrhee17 left a comment


Checked that there aren't any obvious regressions (or at least I couldn't find any)

null);
} catch (Throwable cause) {
duplicator.abort(cause);
handleException(ctx, rootReqDuplicator, future, cause, false);
Contributor


Question) Is there reason why this call was removed?

Contributor Author


Seems to be an oversight; I will correct that.

@jrhee17 jrhee17 added this to the 1.34.0 milestone Oct 17, 2025

schiemon commented Oct 18, 2025

I prefer that hedging is implemented first so that we can be sure that refactoring isn't done just for the sake of refactoring (unless there is a good reason to do so like done in this PR). While cleaner code is always appreciated, I think it's difficult to judge whether a refactor really helps since we never know how the code/requirements will evolve unless the benefit is very obvious. It would also help convince other maintainers as well.

Sure, I will have some time in the upcoming week, but after that I will be less available. So if I can get a green light on this refactoring from the other maintainers as well, I can build the e2e hedging prototype on top of it.

@github-actions github-actions bot added the Stale label Nov 18, 2025
