-
Notifications
You must be signed in to change notification settings - Fork 3.1k
[43644] feat(exporter/loadbalancing/dnsresolver): add quarantine mechanism for unhealthy endpoints #43961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[43644] feat(exporter/loadbalancing/dnsresolver): add quarantine mechanism for unhealthy endpoints #43961
Conversation
29c491c to
06882ad
Compare
…r unhealthy endpoints - Added a quarantine feature for unhealthy endpoints, delaying retries to those endpoints after a configurable period (default: 30s). - Quarantine settings are configurable via the DNS resolver's `quarantine` section. - The load balancer will avoid sending data to endpoints marked as unhealthy until their quarantine period expires, using healthy endpoints in the hash ring without triggering unnecessary ring updates. - This increases resilience by reducing the risk of exporters being stuck in degraded states with repeated failed attempts. - This feature currently applies only to the DNS resolver. Refs open-telemetry#43644
06882ad to
8ff57b5
Compare
rlankfo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR! Overall, I think the direction looks good here. I left a couple of comments. I'm not 100% certain about the port issue and it's not a problem until we expand this functionality, but I think a simple counter on the ring would alleviate my concerns here.
Let me know what you think.
| currentPos := getPosition(identifier) | ||
|
|
||
| // Try until we've used all available endpoints | ||
| for len(tried) < len(lb.exporters) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this condition will work fine for DNS resolver but if we ever extend this quarantine logic to other resolvers there is a possibility this loop will not terminate. The k8s resolver allows setting more than one port per endpoint. IIRC, lb.exporters includes these endpoints with ports, where the ring just contains the bare endpoint.
Could you update this to use count of bare endpoints from the ring instead of length of lb.exporters? We might need to add a count to the hashRing. We can probably set this counter to len(endpoints) in newHashRing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed commit 8d3cdbc which adds endpoints to hashRing. It may be useful rather than having a simple count. Open for change :)
Add list of endpoints to hashRing
Effects of Quarantine and BackOffConfig enabledAdded tests in 7f6a9a6 to test against the effect of
Example for log exporter❯ go test -v -run "TestConsumeLogs_DNSResolverQuarantine" .
=== RUN TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled
log_exporter_test.go:791: Total attempts: 10, Endpoint-1: 5, Endpoint-2: 5, Elapsed: 153.477875ms
log_exporter_test.go:804: Backoff retry occurred: multiple retry cycles observed
log_exporter_test.go:812: Time between first and last attempt: 153.325041ms
log_exporter_test.go:821: Total elapsed time: 153.477875ms
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled (0.26s)
=== RUN TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffDisabled
log_exporter_test.go:908: Total attempts: 2, Endpoint-1: 1, Endpoint-2: 1, Elapsed: 39.083µs
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffDisabled (0.00s)
=== RUN TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled_PartialRecovery
log_exporter_test.go:964: endpoint-1 temporarily unreachable
log_exporter_test.go:961: endpoint-1 recovered
log_exporter_test.go:1007: Endpoint-1 attempts: 2, Endpoint-2 attempts: 1
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithParentExporterBackoffEnabled_PartialRecovery (0.16s)
=== RUN TestConsumeLogs_DNSResolverQuarantineWithSubExporterBackoffEnabled
log_exporter_test.go:1137: Total attempts: 10, Endpoint-1: 5, Endpoint-2: 5, Elapsed: 305.961541ms
log_exporter_test.go:1150: Backoff retry occurred: multiple retry cycles observed
log_exporter_test.go:1158: Time between first and last attempt: 305.872875ms
log_exporter_test.go:1167: Total elapsed time: 305.961541ms
--- PASS: TestConsumeLogs_DNSResolverQuarantineWithSubExporterBackoffEnabled (0.31s)
PASS
ok github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter 1.440s |
7f6a9a6 to
1756e9c
Compare
1756e9c to
db12aca
Compare
Description
Add quarantine mechanism for unhealthy endpoints
quarantinesection.Link to tracking issue
Fixes #43644
Test Coverage for Quarantine Feature
Core Quarantine Functionality Tests
File:
loadbalancer_test.goNew Tests
TestQuarantineEndpointsValidates the quarantine mechanism by marking an endpoint as unhealthy and verifying it automatically becomes healthy again after the quarantine duration expires (1ms in test). Confirms that only the quarantined endpoint is marked unhealthy while others remain healthy.
TestIsQuarantineEnabledVerifies that the quarantine feature is correctly enabled when configured with
Quarantine: QuarantineSettings{Enabled: true}in the DNS resolver settings.TestIsQuarantineEnabledFalseConfirms that quarantine is properly disabled when explicitly set to
falsein the configuration.TestIsQuarantineWithDNSQuarantineOmittedEnsures that quarantine is disabled by default when the quarantine configuration section is omitted entirely, maintaining backward compatibility.
DNS Resolver Configuration Tests
File:
resolver_dns_test.goNew Tests
TestQuarantineEnabledValidates that quarantine settings are properly initialized with the default duration (30 seconds) when enabled.
TestQuarantineConfigOmittedConfirms default behavior when quarantine configuration is not provided (disabled with 30s default duration).
TestQuarantineDurationOmittedVerifies that the default quarantine duration (30 seconds) is used when duration is not specified.
TestQuarantineDurationZeroEnsures that invalid zero duration values are replaced with the default 30-second duration.
TestQuarantineDurationNegativeConfirms that invalid negative duration values are replaced with the default 30-second duration.
Trace Exporter Retry Tests
File:
trace_exporter_test.goNew Tests
TestConsumeTraces_DNSResolverRetriesOnUnreachableEndpointTests the retry mechanism for traces when the primary endpoint fails. Verifies that:
TestConsumeTraces_DNSResolverRetriesExhaustedValidates error handling when all endpoints are unreachable. Confirms that:
"all endpoints were tried and failed: map[endpoint-1:true endpoint-2:true]"Log Exporter Retry Tests
File:
log_exporter_test.goNew Tests
TestConsumeLogs_DNSResolverRetriesOnUnreachableEndpointMirrors the trace retry test for logs, ensuring the same failover behavior works correctly for log data.
TestConsumeLogs_DNSResolverRetriesExhaustedValidates exhaustive retry behavior for logs when all endpoints fail, confirming proper error reporting.
Metrics Exporter Retry Tests
File:
metrics_exporter_test.goNew Tests
TestConsumeMetrics_DNSResolverRetriesOnUnreachableEndpointTests retry logic for metrics when one endpoint is unreachable, verifying successful failover to healthy endpoints.
TestConsumeMetrics_DNSResolverRetriesExhaustedConfirms proper error handling when all metric endpoints are unreachable.
Test Patterns and Coverage
All the retry tests follow a consistent pattern:
The tests ensure comprehensive coverage across all three signal types (traces, logs, and metrics) and validate both the "happy path" (successful retry) and "unhappy path" (all endpoints failed) scenarios.
Summary
Total New Tests: 14
All tests validate the quarantine feature's ability to temporarily exclude unhealthy endpoints, automatically recover them after the quarantine period, and handle edge cases with proper default values.
Local Test Environment Validation
Test Configuration Example
Tested Scenarios
loadbalancerDocumentation