
Conversation

@isSerge (Contributor) commented Jun 13, 2025

Summary

Related to #243

This PR includes the following changes:

  1. Introduction of centralized retryable HTTP client creation via the create_retryable_http_client utility.
  2. HttpTransportClient:
  • Initial health checks are now retryable.
  • try_connect during rotation is now retryable.
  • Currently uses the default retry policy for both the endpoint manager and the test connection.
  3. Removed EndpointManager::set_retry_policy, since retry configuration is now handled at client creation time via HttpRetryConfig.

Also, I'm not sure we need EndpointManager::update_client. I kept it for now, but its use case is less clear. Let me know what you think.

Testing Process

Checklist

  • Add a reference to related issues in the PR description.
  • Add unit tests if applicable.
  • Add integration tests if applicable.
  • Add property-based tests if applicable.
  • Update documentation if applicable.

codecov bot commented Jun 13, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 10 lines in your changes missing coverage. Please review.

Project coverage is 96.3%. Comparing base (6665fb5) to head (9e91239).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...services/blockchain/transports/endpoint_manager.rs 90.5% 8 Missing ⚠️
src/utils/http.rs 93.5% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main    #273     +/-   ##
=======================================
- Coverage   96.3%   96.3%   -0.1%     
=======================================
  Files         74      75      +1     
  Lines      24656   24642     -14     
=======================================
- Hits       23759   23741     -18     
- Misses       897     901      +4     
Flag Coverage Δ
integration 60.3% <92.8%> (-0.2%) ⬇️
properties 32.0% <0.0%> (+<0.1%) ⬆️
unittests 86.5% <0.0%> (+<0.1%) ⬆️

Flags with carried forward coverage won't be shown.


@shahnami (Member) left a comment

LGTM, thanks @isSerge!
I'll checkout the PR and test it locally some time today or early next week.

@shahnami (Member) commented

Looks like it's working! 🙌

However, whilst testing on ethereum_mainnet using the USDC transfer monitor, I noticed that the first time it hits should_rotate we get No rotation possible or needed after HTTP error status 429 Too Many Requests. Shortly after, it successfully rotates. I wonder if this is a race condition or due to the MutexGuard. Have you seen this before?

(screenshot attached: 2025-06-13 at 12:32:24)

My mainnet network config for testing:

I increased the max_past_blocks to 100 and minimised the confirmation_blocks to 1 to overload the RPC

{
  "network_type": "EVM",
  "slug": "ethereum_mainnet",
  "name": "Ethereum Mainnet",
  "rpc_urls": [
    {
      "type_": "rpc",
      "url": {
        "type": "plain",
        "value": "https://eth.llamarpc.com"
      },
      "weight": 90
    },
    {
      "type_": "rpc",
      "url": {
        "type": "plain",
        "value": "https://eth.drpc.org"
      },
      "weight": 80
    }
  ],
  "chain_id": 1,
  "block_time_ms": 12000,
  "confirmation_blocks": 1,
  "cron_schedule": "0 */1 * * * *",
  "max_past_blocks": 100,
  "store_blocks": false
}

@isSerge (Contributor, Author) commented Jun 15, 2025

@shahnami I am trying to reproduce this, but so far my monitor has been running without issues for over an hour using your configuration.
My suspicion is that one of the two URLs failed during the initial check, so it ended up running with only one active URL and no fallback.
If you are still experiencing this, could you log active_url and fallback_urls during initialization and before should_attempt_rotation?

UPDATE: I hit it now after running for a few hours; investigating.

@isSerge (Contributor, Author) commented Jun 16, 2025

@shahnami it seems the problem was two tasks sharing the initial URL. At some point the RPC would return 429 and Task 1 would attempt rotation. Previously, it would acquire the lock and remove the second URL from the fallbacks; at the same time, Task 2 would also get 429 and try to rotate, but no fallbacks would be available, because the modification of fallback_urls was happening too early. I have refined the rotation logic to fix this: URLs are now updated under a single lock, and only after try_connect and update_client succeed with the new URL.
Now if you run again it will still fail, due to both URLs returning 429, but it should happen later 🫠

@NicoMolinaOZ (Contributor) left a comment

LGTM! Thanks 👍
I have been trying to reproduce the error, but after 40 minutes I could not see any errors.
I believe we should try to increase the test coverage a bit; we are going down from 96.3% to 95.7% :)

@isSerge (Contributor, Author) commented Jun 17, 2025

@NicoMolinaOZ coverage decreased due to additional tracing::debug lines

@shahnami (Member) left a comment

Code looks good, but while testing I'm still seeing some issues with URL rotation. For example, url1 hits throttling errors, so we rotate to url2. Since requests are made in parallel, we still see many throttling errors from requests that were made prior to the rotation (expected). However, at some point I get "Failed to connect to new URL url1": we try to rotate from the current active url2 back to url1 and fall back to url1, even though the error originated from that URL. It seems like a race condition, though I might be missing something.

2025-06-18T09:08:03.114855Z  WARN process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Request to https://eth.llamarpc.com failed with status 429 Too Many Requests: error code: 1015 network: "ethereum_mainnet"
2025-06-18T09:08:03.114997Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: send_raw_request: HTTP status 429 Too Many Requests on 'https://eth.llamarpc.com' triggers URL rotation attempt network: "ethereum_mainnet"
2025-06-18T09:08:03.115124Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Trying to rotate URL: Current Active: 'https://eth.llamarpc.com', Fallbacks: ["https://eth.drpc.org"] network: "ethereum_mainnet"
2025-06-18T09:08:03.115238Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Attempting try_connect to new_url during rotation: 'https://eth.drpc.org' network: "ethereum_mainnet"
2025-06-18T09:08:03.115406Z DEBUG process_new_blocks:get_blocks: reqwest::connect: starting new connection: https://eth.drpc.org/ network: "ethereum_mainnet"
...
2025-06-18T09:08:05.835129Z ERROR process_new_blocks:get_blocks: openzeppelin_monitor::utils::logging::error: Error occurred, Failed to connect to new URL 'https://eth.llamarpc.com', trace_id: 6bec1588-db5b-49d0-a80f-4daa80043f3c, timestamp: 2025-06-18T09:08:05.835054+00:00, error.chain: Failed to connect to https://eth.llamarpc.com/: 429 network: "ethereum_mainnet"
2025-06-18T09:08:05.835839Z ERROR process_new_blocks:get_blocks: openzeppelin_monitor::utils::logging::error: Error occurred, HTTP error: status 429 Too Many Requests for URL https://eth.llamarpc.com, trace_id: fdd8e800-c902-45c8-8cc5-d50162aa745d, timestamp: 2025-06-18T09:08:05.835820+00:00, error.chain: URL rotation failed: Failed to connect to new URL 'https://eth.llamarpc.com' network: "ethereum_mainnet"
2025-06-18T09:08:05.836480Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Trying to rotate URL: Current Active: 'https://eth.drpc.org', Fallbacks: ["https://eth.llamarpc.com"] network: "ethereum_mainnet"
2025-06-18T09:08:05.836993Z DEBUG process_new_blocks:get_blocks: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Attempting try_connect to new_url during rotation: 'https://eth.llamarpc.com' network: "ethereum_mainnet"

Full logs here

@NicoMolinaOZ (Contributor) commented

Adding new information: I was able to see the same error during testing:

2025-06-18T20:27:07.496686Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: reuse idle connection for ("https", eth.drpc.org) network: ethereum_mainnet
2025-06-18T20:27:07.587579Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: reuse idle connection for ("https", eth.llamarpc.com) network: ethereum_mainnet
2025-06-18T20:27:07.763796Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: pooling idle connection for ("https", eth.llamarpc.com) network: ethereum_mainnet
2025-06-18T20:27:07.763926Z ERROR filter_block:get_transaction_receipt: openzeppelin_monitor::utils::logging::error: Error occurred, Failed to connect to new URL 'https://eth.llamarpc.com', trace_id: d5d54ce8-ce4e-4199-926d-7ad9feb04a01, timestamp: 2025-06-18T20:27:07.763906+00:00, error.chain: Failed to connect to https://eth.llamarpc.com/: 429 network: ethereum_mainnet
2025-06-18T20:27:07.763973Z ERROR filter_block:get_transaction_receipt: openzeppelin_monitor::utils::logging::error: Error occurred, HTTP error: status 429 Too Many Requests for URL https://eth.llamarpc.com, trace_id: 27496899-8714-4dd8-b8ed-2ea5815631d9, timestamp: 2025-06-18T20:27:07.763965+00:00, error.chain: URL rotation failed: Failed to connect to new URL 'https://eth.llamarpc.com' network: ethereum_mainnet
2025-06-18T20:27:07.764427Z DEBUG filter_block:get_transaction_receipt: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Trying to rotate URL: Current Active: 'https://eth.drpc.org', Fallbacks: ["https://eth.llamarpc.com"] network: ethereum_mainnet
2025-06-18T20:27:07.764450Z DEBUG filter_block:get_transaction_receipt: openzeppelin_monitor::services::blockchain::transports::endpoint_manager: Attempting try_connect to new_url during rotation: 'https://eth.llamarpc.com' network: ethereum_mainnet
2025-06-18T20:27:07.764586Z DEBUG filter_block:get_transaction_receipt: hyper_util::client::legacy::pool: reuse idle connection for ("https", eth.llamarpc.com) network: ethereum_mainnet
2025-06-18T20:27:07.780681Z DEBUG hyper_util::client::legacy::pool: pooling idle connection for ("https", eth.drpc.org)

I have a monitor with the following filter expression, so we are fetching transaction receipts (due to gas_used):

"transactions": [
      {
        "status": "Success",
        "expression": "gas_used > 100"
      }
    ]

@isSerge (Contributor, Author) commented Jun 19, 2025

Here is my understanding:

  1. We start with initial state: active_url = llamarpc, fallback_urls = [drpc.org]
  2. Task A attempts requests to active_url until it hits 429, then triggers rotation to drpc.org, which becomes active_url.
  3. Task B runs concurrently and reads active_url: if it reads before Task A completes rotation, it gets llamarpc. It attempts a request to llamarpc, gets 429, and triggers rotation. By the time Task B acquires the lock, the shared state has already been updated and active_url is drpc.org, so it rotates to the only available fallback, llamarpc, which still returns 429, and the rotation fails.

Currently we don't have a mechanism to mark an endpoint as problematic globally, so it is reused according to the current rotation logic.

What we can do to mitigate this:

  • Add a cooldown to avoid putting a failed URL straight back into the fallbacks
  • Add a circuit breaker per endpoint
  • Introduce a max_attempts limit to stop thrashing and let it fail
  • Use more endpoints to decrease the probability of 429 (plus add load distribution)

@shahnami (Member) left a comment

I agree, circuit breakers are probably the best solution here. We can tackle this in a new PR. Thanks for your hard work Serge!

@shahnami shahnami merged commit 5f6edaf into OpenZeppelin:main Jun 20, 2025
15 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 20, 2025