Contributor

@l00py l00py commented Oct 31, 2025

Description

Add quarantine mechanism for unhealthy endpoints

  • Added a quarantine feature for unhealthy endpoints: retries to a failed endpoint are held back until a configurable period has elapsed (default: 30s).
  • Quarantine settings are configurable via the DNS resolver's quarantine section.
  • The load balancer will avoid sending data to endpoints marked as unhealthy until their quarantine period expires, using healthy endpoints in the hash ring without triggering unnecessary ring updates.
  • This increases resilience by reducing the risk of exporters being stuck in degraded states with repeated failed attempts.
  • This feature currently applies only to the DNS resolver.

Link to tracking issue

Fixes #43644

Test Coverage for Quarantine Feature

Core Quarantine Functionality Tests

File: loadbalancer_test.go

New Tests

TestQuarantineEndpoints

Validates the quarantine mechanism by marking an endpoint as unhealthy and verifying it automatically becomes healthy again after the quarantine duration expires (1ms in test). Confirms that only the quarantined endpoint is marked unhealthy while others remain healthy.

TestIsQuarantineEnabled

Verifies that the quarantine feature is correctly enabled when configured with Quarantine: QuarantineSettings{Enabled: true} in the DNS resolver settings.

TestIsQuarantineEnabledFalse

Confirms that quarantine is properly disabled when explicitly set to false in the configuration.

TestIsQuarantineWithDNSQuarantineOmitted

Ensures that quarantine is disabled by default when the quarantine configuration section is omitted entirely, maintaining backward compatibility.

DNS Resolver Configuration Tests

File: resolver_dns_test.go

New Tests

TestQuarantineEnabled

Validates that quarantine settings are properly initialized with the default duration (30 seconds) when enabled.

TestQuarantineConfigOmitted

Confirms default behavior when quarantine configuration is not provided (disabled with 30s default duration).

TestQuarantineDurationOmitted

Verifies that the default quarantine duration (30 seconds) is used when duration is not specified.

TestQuarantineDurationZero

Ensures that invalid zero duration values are replaced with the default 30-second duration.

TestQuarantineDurationNegative

Confirms that invalid negative duration values are replaced with the default 30-second duration.

Trace Exporter Retry Tests

File: trace_exporter_test.go

New Tests

TestConsumeTraces_DNSResolverRetriesOnUnreachableEndpoint

Tests the retry mechanism for traces when the primary endpoint fails. Verifies that:

  • The exporter successfully fails over to a healthy endpoint
  • Traces are consumed by the alternative endpoint without data loss
  • The quarantine feature is properly invoked during retry

TestConsumeTraces_DNSResolverRetriesExhausted

Validates error handling when all endpoints are unreachable. Confirms that:

  • All available endpoints in the ring are tried
  • An appropriate error is returned: "all endpoints were tried and failed: map[endpoint-1:true endpoint-2:true]"
  • No silent failures occur

Log Exporter Retry Tests

File: log_exporter_test.go

New Tests

TestConsumeLogs_DNSResolverRetriesOnUnreachableEndpoint

Mirrors the trace retry test for logs, ensuring the same failover behavior works correctly for log data.

TestConsumeLogs_DNSResolverRetriesExhausted

Validates exhaustive retry behavior for logs when all endpoints fail, confirming proper error reporting.

Metrics Exporter Retry Tests

File: metrics_exporter_test.go

New Tests

TestConsumeMetrics_DNSResolverRetriesOnUnreachableEndpoint

Tests retry logic for metrics when one endpoint is unreachable, verifying successful failover to healthy endpoints.

TestConsumeMetrics_DNSResolverRetriesExhausted

Confirms proper error handling when all metric endpoints are unreachable.

Test Patterns and Coverage

All the retry tests follow a consistent pattern:

  1. Configure the load balancer with DNS resolver and quarantine enabled
  2. Set up a hash ring with multiple test endpoints
  3. Simulate endpoint failures using mock exporters
  4. Verify successful failover to healthy endpoints
  5. Confirm proper error messages when all retries are exhausted

The tests ensure comprehensive coverage across all three signal types (traces, logs, and metrics) and validate both the "happy path" (successful retry) and "unhappy path" (all endpoints failed) scenarios.

Summary

Total New Tests: 15

  • Core Quarantine: 4 tests
  • DNS Configuration: 5 tests
  • Retry Logic (Traces): 2 tests
  • Retry Logic (Logs): 2 tests
  • Retry Logic (Metrics): 2 tests

All tests validate the quarantine feature's ability to temporarily exclude unhealthy endpoints, automatically recover them after the quarantine period, and handle edge cases with proper default values.

Local Test Environment Validation

Test Configuration Example

exporters:
  loadbalancing:
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 60s
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 10000
      block_on_overflow: false
      sizer: items
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-receivers.test   # 3x DNS A records
        port: '4317'
        interval: 5s
        timeout: 1s
        quarantine:
          enabled: true
          duration: 30s

Tested Scenarios

  • Scenario 1: Single Endpoint Failure with Automatic Recovery
    • DNS entries unchanged: 3x DNS A records
    • Disabled one backend
    • Results:
      • Backend marked unhealthy
      • Data rerouted to healthy backends (2x)
      • Data distributed "equally" on healthy backends
  • Scenario 2: Multiple Endpoint Failures
    • DNS entries unchanged: 3x DNS A records
    • Disabled one backend at a time
    • Results:
      • Related backends marked unhealthy
      • Data rerouted to healthy backends until all backends unhealthy
  • Scenario 3: Cascading Failures and Recovery
    • DNS entries unchanged: 3x DNS A records
    • Disabled one backend at a time
    • Gradual restoration of backends
    • Results:
      • Related backends marked unhealthy
      • Data rerouted to healthy backends until all backends unhealthy
      • Unhealthy backends are retried after the configured duration (30s) and marked healthy again if the retry succeeds
      • Data distributed "equally" on healthy backends
  • Scenario 4: Quarantine with Different Signal Types
  • Scenario 5: Queue and Retries effects
    • Enabled retry and queueing on the top-level loadbalancing exporter
    • DNS entries unchanged: 3x DNS A records
    • Disabled all backends

Documentation

@l00py l00py force-pushed the feature/43644-loadbalancing-dns-quarantine branch 4 times, most recently from 29c491c to 06882ad Compare November 3, 2025 19:57
Contributor Author

l00py commented Nov 3, 2025

@rlankfo -- here is the draft implementation.
Also, I'm not sure how reverting a previous PR (e.g. #43719) is handled here, and whether the revert should go in its own separate PR.

Please advise if anything needs changing. Refactoring suggestions and feedback are welcome :)

Appreciate your help!

…r unhealthy endpoints

Refs open-telemetry#43644
@l00py l00py force-pushed the feature/43644-loadbalancing-dns-quarantine branch from 06882ad to 8ff57b5 Compare November 6, 2025 16:47
Member

@rlankfo rlankfo left a comment

Thanks for this PR! Overall, I think the direction looks good here. I left a couple of comments. I'm not 100% certain about the port issue and it's not a problem until we expand this functionality, but I think a simple counter on the ring would alleviate my concerns here.

Let me know what you think.

currentPos := getPosition(identifier)

// Try until we've used all available endpoints
for len(tried) < len(lb.exporters) {
Member

I believe this condition will work fine for the DNS resolver, but if we ever extend this quarantine logic to other resolvers, there is a possibility that this loop will not terminate. The k8s resolver allows setting more than one port per endpoint. IIRC, lb.exporters includes these endpoints with ports, whereas the ring contains only the bare endpoint.

Could you update this to use the count of bare endpoints from the ring instead of the length of lb.exporters? We might need to add a count to the hashRing. We can probably set this counter to len(endpoints) in newHashRing.

Contributor Author

Pushed commit 8d3cdbc, which adds the list of endpoints to hashRing. That may be more useful than a simple count. Open to changing it :)

Add list of endpoints to hashRing
Contributor Author

l00py commented Nov 7, 2025

Effects of Quarantine and BackOffConfig enabled

Added tests in 7f6a9a6 to cover how quarantine behaves when BackOffConfig is enabled.

  1. BackOffConfig enabled at parent exporter:
    • Quarantine: each endpoint should be tried once with quarantine enabled
    • BackOffConfig: subsequent retries of the endpoints triggered by the parent exporter
  2. BackOffConfig disabled at parent and sub
    • Quarantine: each endpoint should only be tried once with quarantine enabled
    • BackOffConfig: no retries
  3. BackOffConfig enabled at parent with partial recovery of the endpoints
    • Quarantine: on the first cycle, each endpoint should be tried once
    • BackOffConfig: on the second cycle, only the recovered endpoint should be tried
  4. BackOffConfig enabled at sub only
    • Quarantine: on the first cycle, each endpoint should be tried once
    • BackOffConfig: the sub-exporter should keep retrying until the configured MaxElapsedTime is reached

Example for log exporter

❯ go test -v -run "TestConsumeLogs_DNSResolverQuarantine" .
=== RUN   TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled
    log_exporter_test.go:791: Total attempts: 10, Endpoint-1: 5, Endpoint-2: 5, Elapsed: 153.477875ms
    log_exporter_test.go:804: Backoff retry occurred: multiple retry cycles observed
    log_exporter_test.go:812: Time between first and last attempt: 153.325041ms
    log_exporter_test.go:821: Total elapsed time: 153.477875ms
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled (0.26s)


=== RUN   TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffDisabled
    log_exporter_test.go:908: Total attempts: 2, Endpoint-1: 1, Endpoint-2: 1, Elapsed: 39.083µs
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffDisabled (0.00s)


=== RUN   TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled_PartialRecovery
    log_exporter_test.go:964: endpoint-1 temporarily unreachable
    log_exporter_test.go:961: endpoint-1 recovered
    log_exporter_test.go:1007: Endpoint-1 attempts: 2, Endpoint-2 attempts: 1
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled_PartialRecovery (0.16s)


=== RUN   TestConsumeLogs_DNSResolverQuarantineWithSubExporterBackoffEnabled
    log_exporter_test.go:1137: Total attempts: 10, Endpoint-1: 5, Endpoint-2: 5, Elapsed: 305.961541ms
    log_exporter_test.go:1150: Backoff retry occurred: multiple retry cycles observed
    log_exporter_test.go:1158: Time between first and last attempt: 305.872875ms
    log_exporter_test.go:1167: Total elapsed time: 305.961541ms
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithSubExporterBackoffEnabled (0.31s)
PASS
ok      github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter        1.440s

@l00py l00py force-pushed the feature/43644-loadbalancing-dns-quarantine branch 2 times, most recently from 7f6a9a6 to 1756e9c Compare November 7, 2025 22:43
@l00py l00py force-pushed the feature/43644-loadbalancing-dns-quarantine branch from 1756e9c to db12aca Compare November 7, 2025 23:18
Development

Successfully merging this pull request may close these issues.

[exporter/loadbalancing] Health Check / Heartbeat / Failover Mechanism