
Conversation

@isSerge (Contributor) commented Jun 13, 2025

Summary

Related to #243

This PR includes the following changes:

  1. Introduction of centralized retryable HTTP client creation via the create_retryable_http_client utility.
  2. HttpTransportClient:
  • Initial health checks are now retryable.
  • try_connect during rotation is now retryable.
  • Currently uses the default retry policy for both the endpoint manager and the test connection.
  3. Removed EndpointManager::set_retry_policy, since retry configuration is now handled at client creation time via HttpRetryConfig.

Also, I'm not sure we need EndpointManager::update_client. I kept it for now, but its use case is less clear. Let me know what you think.

Testing Process

Checklist

  • Add a reference to related issues in the PR description.
  • Add unit tests if applicable.
  • Add integration tests if applicable.
  • Add property-based tests if applicable.
  • Update documentation if applicable.

codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 10 lines in your changes missing coverage. Please review.

Project coverage is 96.3%. Comparing base (6665fb5) to head (9e91239).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...services/blockchain/transports/endpoint_manager.rs 90.5% 8 Missing ⚠️
src/utils/http.rs 93.5% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #273     +/-   ##
=======================================
- Coverage   96.3%   96.3%   -0.1%     
=======================================
  Files         74      75      +1     
  Lines      24656   24642     -14     
=======================================
- Hits       23759   23741     -18     
- Misses       897     901      +4     
Flag Coverage Δ
integration 60.3% <92.8%> (-0.2%) ⬇️
properties 32.0% <0.0%> (+<0.1%) ⬆️
unittests 86.5% <0.0%> (+<0.1%) ⬆️

Flags with carried forward coverage won't be shown.


@shahnami (Member) left a comment

LGTM, thanks @isSerge!
I'll checkout the PR and test it locally some time today or early next week.

@shahnami (Member) commented

Looks like it's working! 🙌

However, whilst testing on ethereum_mainnet using the USDC transfer monitor, I noticed that the first time it hits should_rotate we get No rotation possible or needed after HTTP error status 429 Too Many Requests. Shortly after, it successfully rotates. I wonder if this is a race condition or due to the MutexGuard. Have you seen this before?

(screenshot attached: 2025-06-13 at 12:32:24)

My mainnet network config for testing:

I increased the max_past_blocks to 100 and minimised the confirmation_blocks to 1 to overload the RPC

{
  "network_type": "EVM",
  "slug": "ethereum_mainnet",
  "name": "Ethereum Mainnet",
  "rpc_urls": [
    {
      "type_": "rpc",
      "url": {
        "type": "plain",
        "value": "https://eth.llamarpc.com"
      },
      "weight": 90
    },
    {
      "type_": "rpc",
      "url": {
        "type": "plain",
        "value": "https://eth.drpc.org"
      },
      "weight": 80
    }
  ],
  "chain_id": 1,
  "block_time_ms": 12000,
  "confirmation_blocks": 1,
  "cron_schedule": "0 */1 * * * *",
  "max_past_blocks": 100,
  "store_blocks": false
}

@isSerge (Contributor, Author) commented Jun 15, 2025

@shahnami I am trying to reproduce this, but so far my monitor has been running without issues for over an hour using your configuration.
My suspicion is that one of the two URLs failed during the initial check, so it ended up running with only one active URL and no fallback.
If you are still experiencing this, could you log active_url and fallback_urls during initialization and before should_attempt_rotation?

UPDATE: I hit it now after running for a few hours; investigating.

@isSerge (Contributor, Author) commented Jun 16, 2025

@shahnami it seems the problem was two tasks sharing the initial URL. At some point the RPC would return 429 and Task 1 would attempt rotation. Previously, it would acquire the lock and remove the second URL from the fallbacks; at the same time, Task 2 would also get 429 and try to rotate, but no fallbacks would be available, because the modification of fallback_urls was happening too early. I have refined the rotation logic to fix this: URLs are now updated under a single lock, and only after try_connect and update_client succeed with the new URL.
Now if you run again it will still fail, due to both URLs returning 429, but it should happen later 🫠

@NicoMolinaOZ (Contributor) left a comment

LGTM! Thanks 👍
I have been trying to reproduce the error, but after 40 minutes I could not see any errors.
I believe we should try to increase the test coverage a bit; we are going down from 96.3% to 95.7% :)

@isSerge (Contributor, Author) commented Jun 17, 2025

@NicoMolinaOZ coverage decreased due to additional tracing::debug lines

@shahnami (Member) left a comment

Code looks good, but while testing I'm still seeing some issues with URL rotation. For example, url1 hits throttling errors, so we rotate to url2. Since requests are made in parallel, we still see many throttling errors from requests that were made prior to the rotation (expected). However, at some point I get "Failed to connect to new URL url1": we try to rotate from the current active url2 back to url1 and fall back to url1, even though the error originated from that URL. It seems like a race condition, though I might be missing something.

2025-06-18T09:08:03.114855Z  WARN process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Request to https://eth.llamarpc.com failed with status 429 Too Many Requests: error code: 1015 network: "ethereum_mainnet"
2025-06-18T09:08:03.114997Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: send_raw_request: HTTP status 429 Too Many Requests on 'https://eth.llamarpc.com' triggers URL rotation attempt network: "ethereum_mainnet"
2025-06-18T09:08:03.115124Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Trying to rotate URL: Current Active: 'https://eth.llamarpc.com', Fallbacks: ["https://eth.drpc.org"] network: "ethereum_mainnet"
2025-06-18T09:08:03.115238Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Attempting try_connect to new_url during rotation: 'https://eth.drpc.org' network: "ethereum_mainnet"
2025-06-18T09:08:03.115406Z DEBUG process_new_blocks:get_blocks: reqwest::connect: starting new connection: https://eth.drpc.org/ network: "ethereum_mainnet"
...
2025-06-18T09:08:05.835129Z ERROR process_new_blocks:get_blocks: openzeppelin_monitor::utils::logging::error: Error occurred, Failed to connect to new URL 'https://eth.llamarpc.com', trace_id: 6bec1588-db5b-49d0-a80f-4daa80043f3c, timestamp: 2025-06-18T09:08:05.835054+00:00, error.chain: Failed to connect to https://eth.llamarpc.com/: 429 network: "ethereum_mainnet"
2025-06-18T09:08:05.835839Z ERROR process_new_blocks:get_blocks: openzeppelin_monitor::utils::logging::error: Error occurred, HTTP error: status 429 Too Many Requests for URL https://eth.llamarpc.com, trace_id: fdd8e800-c902-45c8-8cc5-d50162aa745d, timestamp: 2025-06-18T09:08:05.835820+00:00, error.chain: URL rotation failed: Failed to connect to new URL 'https://eth.llamarpc.com' network: "ethereum_mainnet"
2025-06-18T09:08:05.836480Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Trying to rotate URL: Current Active: 'https://eth.drpc.org', Fallbacks: ["https://eth.llamarpc.com"] network: "ethereum_mainnet"
2025-06-18T09:08:05.836993Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Attempting try_connect to new_url during rotation: 'https://eth.llamarpc.com' network: "ethereum_mainnet"

Full logs here

@NicoMolinaOZ (Contributor) commented

Adding new information: I was able to see the same error during testing:

2025-06-18T20:27:07.496686Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: reuse idle connection for ("https", eth.drpc.org) network: ethereum_mainnet
2025-06-18T20:27:07.587579Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: reuse idle connection for ("https", eth.llamarpc.com) network: ethereum_mainnet
2025-06-18T20:27:07.763796Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: pooling idle connection for ("https", eth.llamarpc.com) network: ethereum_mainnet
2025-06-18T20:27:07.763926Z ERROR filter_block:get_transaction_receipt: openzeppelin_monitor::utils::logging::error: Error occurred, Failed to connect to new URL 'https://eth.llamarpc.com', trace_id: d5d54ce8-ce4e-4199-926d-7ad9feb04a01, timestamp: 2025-06-18T20:27:07.763906+00:00, error.chain: Failed to connect to https://eth.llamarpc.com/: 429 network: ethereum_mainnet
2025-06-18T20:27:07.763973Z ERROR filter_block:get_transaction_receipt: openzeppelin_monitor::utils::logging::error: Error occurred, HTTP error: status 429 Too Many Requests for URL https://eth.llamarpc.com, trace_id: 27496899-8714-4dd8-b8ed-2ea5815631d9, timestamp: 2025-06-18T20:27:07.763965+00:00, error.chain: URL rotation failed: Failed to connect to new URL 'https://eth.llamarpc.com' network: ethereum_mainnet
2025-06-18T20:27:07.764427Z DEBUG filter_block:get_transaction_receipt: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Trying to rotate URL: Current Active: 'https://eth.drpc.org', Fallbacks: ["https://eth.llamarpc.com"] network: ethereum_mainnet
2025-06-18T20:27:07.764450Z DEBUG filter_block:get_transaction_receipt: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Attempting try_connect to new_url during rotation: 'https://eth.llamarpc.com' network: ethereum_mainnet
2025-06-18T20:27:07.764586Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: reuse idle connection for ("https", eth.llamarpc.com) network: ethereum_mainnet
2025-06-18T20:27:07.780681Z DEBUG hyper_util::client::legacy::pool: pooling idle connection for ("https", eth.drpc.org)

I have a monitor with the following filter expression, so we are fetching transaction receipts (due to gas_used):

"transactions": [
      {
        "status": "Success",
        "expression": "gas_used > 100"
      }
    ]

@isSerge (Contributor, Author) commented Jun 19, 2025

Here is my understanding:

  1. We start with initial state: active_url = llamarpc, fallback_urls = [drpc.org]
  2. Task A attempts requests to active_url until it hits 429, then triggers rotation to drpc.org, which becomes active_url.
  3. Task B runs concurrently and reads active_url: if it reads before Task A completes rotation, it gets llamarpc. It attempts a request to llamarpc, gets 429, and triggers rotation. By the time Task B acquires the lock, the shared state has already been updated and active_url is drpc.org, so it rotates to the only available fallback, llamarpc, which still returns 429, and the rotation fails.

Currently we don't have a mechanism to mark an endpoint as problematic globally, so it is reused according to the current rotation logic.

What we can do to mitigate this:

  • Add a cooldown to avoid putting a failed URL straight back into the fallbacks
  • Add a circuit breaker per endpoint
  • Introduce a max_attempts limit to stop thrashing and let it fail
  • Use more endpoints to decrease the probability of 429 (plus add load distribution)

@shahnami (Member) left a comment

I agree, circuit breakers are probably the best solution here. We can tackle this in a new PR. Thanks for your hard work Serge!

@shahnami shahnami merged commit 5f6edaf into OpenZeppelin:main Jun 20, 2025
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 20, 2025