Add ARC support #14144

raghu999 · 2025-11-08T02:21:36Z

Description

This PR introduces an Adaptive Request Concurrency (ARC) controller to the exporterhelper.

When enabled via the new sending_queue.arc.enabled flag, this controller dynamically manages the number of concurrent export requests, effectively overriding the static num_consumers setting. It adjusts the concurrency limit based on observed RTT (Round-Trip Time) and backpressure signals (e.g., HTTP 429/503, gRPC ResourceExhausted/Unavailable).
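As a rough illustration of the backpressure signals listed above, the check could look like the sketch below. The function names and the locally declared constants are assumptions made for this example and are not taken from the PR's internal/experr package. The values 8 and 14 are the standard numeric gRPC codes for ResourceExhausted and Unavailable, redeclared here only to avoid an external import.

```go
package main

import "net/http"

// Standard numeric gRPC status codes, redeclared locally so the sketch
// depends only on the standard library.
const (
	grpcResourceExhausted uint32 = 8
	grpcUnavailable       uint32 = 14
)

// isHTTPBackpressure reports whether an HTTP status suggests the backend
// is overloaded (429 Too Many Requests or 503 Service Unavailable).
func isHTTPBackpressure(status int) bool {
	return status == http.StatusTooManyRequests ||
		status == http.StatusServiceUnavailable
}

// isGRPCBackpressure reports the same for a numeric gRPC status code.
func isGRPCBackpressure(code uint32) bool {
	return code == grpcResourceExhausted || code == grpcUnavailable
}
```

A controller would treat a true result from either function as a signal to reduce its limit rather than as an ordinary export failure.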

The controller follows an AIMD (Additive Increase, Multiplicative Decrease) pattern to find the optimal concurrency limit, maximizing throughput during healthy operation and automatically backing off upon detecting export failures or RTT spikes.
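A minimal sketch of the AIMD idea, assuming a single boolean "pressure" signal per control window. The type, field names, and parameter values are illustrative and do not reflect the controller in internal/arc, which also accounts for RTT.

```go
package main

// aimd holds a concurrency limit that grows additively and shrinks
// multiplicatively. All names and values here are illustrative assumptions.
type aimd struct {
	limit, min, max int
	step            int     // permits added after a healthy window
	factor          float64 // multiplier applied on pressure, 0 < factor < 1
}

// adjust applies one control step and returns the new limit:
// back off when pressure was observed, otherwise probe upward.
func (a *aimd) adjust(pressure bool) int {
	if pressure {
		a.limit = int(float64(a.limit) * a.factor)
	} else {
		a.limit += a.step
	}
	// Keep the limit inside the configured bounds.
	if a.limit < a.min {
		a.limit = a.min
	}
	if a.limit > a.max {
		a.limit = a.max
	}
	return a.limit
}
```

With `step: 1` and `factor: 0.5`, a limit of 12 drops to 6 after one pressure signal and then needs six healthy windows to climb back. That asymmetry is what makes the scheme recover cautiously.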

This feature is disabled by default and introduces no behavior change unless explicitly enabled. It also adds a new set of otelcol_exporter_arc_* metrics (detailed in the documentation) for observing its behavior.

Link to tracking issue

Fixes #14080

Testing

Added comprehensive unit tests for the core ARC logic in internal/arc/controller_test.go, covering additive increase, multiplicative decrease (TestAdjustIncreaseAndDecrease), and the cold-start backoff heuristic (TestEarlyBackoffOnColdStart).
Added specific unit tests for the new shrinkSem (a custom shrinkable semaphore) to validate its concurrency, prioritization, and shutdown safety.
Added a critical test (TestController_Shutdown_UnblocksWaiters) to ensure that any goroutines blocked on Acquire are correctly unblocked with a shutdown error, preventing collector hangs.
Added a new integration test in internal/queue_sender_test.go (TestQueueSender_ArcAcquireWaitMetric) that validates the end-to-end flow. It confirms that when the limit is reached, new requests block on Acquire and the exporter_arc_acquire_wait_ms metric records the wait time.
Added unit tests for the new internal/experr/back_pressure.go utility to verify its detection logic.
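To make the semaphore and shutdown behaviour described in these tests concrete, here is a simplified sketch of a resizable semaphore built on sync.Cond. It is not the PR's shrinkSem: the type and method names are assumptions, and it leaves out context cancellation and waiter prioritization.

```go
package main

import (
	"errors"
	"sync"
)

// errClosed is returned to callers once the semaphore has been shut down.
var errClosed = errors.New("semaphore closed")

// shrinkableSem is a counting semaphore whose capacity can change at runtime.
type shrinkableSem struct {
	mu     sync.Mutex
	cond   *sync.Cond
	limit  int
	inUse  int
	closed bool
}

func newShrinkableSem(limit int) *shrinkableSem {
	s := &shrinkableSem{limit: limit}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Acquire blocks until a permit is free or the semaphore is closed.
func (s *shrinkableSem) Acquire() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for !s.closed && s.inUse >= s.limit {
		s.cond.Wait()
	}
	if s.closed {
		return errClosed
	}
	s.inUse++
	return nil
}

// Release returns a permit and wakes any waiters.
func (s *shrinkableSem) Release() {
	s.mu.Lock()
	s.inUse--
	s.mu.Unlock()
	s.cond.Broadcast()
}

// SetLimit changes the capacity. Shrinking does not revoke held permits;
// new acquirers simply wait until inUse falls below the new limit.
func (s *shrinkableSem) SetLimit(n int) {
	s.mu.Lock()
	s.limit = n
	s.mu.Unlock()
	s.cond.Broadcast()
}

// Close marks the semaphore as shut down and unblocks every waiter.
func (s *shrinkableSem) Close() {
	s.mu.Lock()
	s.closed = true
	s.mu.Unlock()
	s.cond.Broadcast()
}
```

The property that matters for shutdown is that Close broadcasts after setting the flag, so a goroutine parked in Acquire returns errClosed instead of waiting forever.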

Documentation

Updated exporterhelper/README.md to include the new sending_queue.arc block with all its configuration options.
Updated exporterhelper/metadata.yaml to define all new otelcol_exporter_arc_* metrics, which are in turn reflected in the generated documentation.md.

linux-foundation-easycla · 2025-11-08T02:21:43Z

The committers listed above are authorized under a signed CLA.

✅ login: raghu999 / name: Raghu999 (d9c349b)

raghu999 · 2025-11-08T03:32:33Z

# HELP otelcol_exporter_arc_acquire_wait_ms_milliseconds Time a worker waited to acquire an ARC permit. [Alpha]
# TYPE otelcol_exporter_arc_acquire_wait_ms_milliseconds histogram
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="0"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="25"} 114
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="50"} 115
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="75"} 115
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="100"} 115
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="250"} 117
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="500"} 125
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="750"} 133
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="1000"} 137
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="2500"} 156
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5000"} 210
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="7500"} 221
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10000"} 221
otelcol_exporter_arc_acquire_wait_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="+Inf"} 221
otelcol_exporter_arc_acquire_wait_ms_milliseconds_sum{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 301819
otelcol_exporter_arc_acquire_wait_ms_milliseconds_count{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 221
# HELP otelcol_exporter_arc_limit Current ARC dynamic concurrency limit. [Alpha]
# TYPE otelcol_exporter_arc_limit gauge
otelcol_exporter_arc_limit{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 10
# HELP otelcol_exporter_arc_limit_changes_total Number of times ARC changed its concurrency limit. [Alpha]
# TYPE otelcol_exporter_arc_limit_changes_total counter
otelcol_exporter_arc_limit_changes_total{data_type="traces",direction="up",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 9
# HELP otelcol_exporter_arc_permits_in_use Number of permits currently acquired. [Alpha]
# TYPE otelcol_exporter_arc_permits_in_use gauge
otelcol_exporter_arc_permits_in_use{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 10
# HELP otelcol_exporter_arc_rtt_ms_milliseconds Request round-trip-time measured by ARC (from permit acquire to release). [Alpha]
# TYPE otelcol_exporter_arc_rtt_ms_milliseconds histogram
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="0"} 0
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="25"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="50"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="75"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="100"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="250"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="500"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="750"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="1000"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="2500"} 11
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="5000"} 109
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="7500"} 205
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="10000"} 206
otelcol_exporter_arc_rtt_ms_milliseconds_bucket{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version="",le="+Inf"} 211
otelcol_exporter_arc_rtt_ms_milliseconds_sum{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 1.045341e+06
otelcol_exporter_arc_rtt_ms_milliseconds_count{data_type="traces",exporter="otlp/e2e_test",otel_scope_name="go.opentelemetry.io/collector/exporter/exporterhelper",otel_scope_schema_url="",otel_scope_version=""} 211

go tool -modfile /Users/rchall201/work/observability/unified-ingest/platform/opentelemetry-collector/internal/tools/go.mod gotestsum --packages="./..." -- -timeout 240s  -race
✓  internal/hosttest (cached)
∅  internal/oteltest
✓  internal/experr (cached)
✓  internal/metadatatest (cached)
✓  internal/metadata (cached)
∅  internal/requesttest
✓  internal/queue (cached)
✓  internal/request (cached)
✓  internal/sender (cached)
✓  internal/sendertest (cached)
✓  internal/queuebatch (cached)
∅  internal/storagetest
✓  internal/sizer (cached)
✓  . (1.415s)
✓  internal/arc (1.918s)
✓  internal (2.517s)

DONE 412 tests in 2.570s

Added the test results and generated metrics


exporter/exporterhelper/internal/experr/back_pressure.go

jmacd · 2025-11-10T17:38:01Z

exporter/exporterhelper/internal/arc/controller.go

+	}
+}
+
+// Acquire obtains a permit unless ARC is disabled or the context is cancelled.


My main feedback is this: this is a large piece of code that looks good, but it is very specific and opinionated. I would prefer to see it become an extension implementation. This PR is great; I just think it belongs in //extensions.

See #13902, which explains how extension APIs are added. It would begin with //extensions/extensioncontroller APIs, for example, and then the bulk of the code would land in //extensions/arccontrollerextension (which could be in contrib).

Could we invent an extension point to let you insert your ARC controller into the send pipeline for all exporters? Then, exporterhelper would only need to load and insert the extensions into the pipeline.

raghu999 · 2025-11-16

Hi @jmacd, thanks for the detailed feedback on moving this to an extension.

My original thinking was that putting ARC in exporterhelper would make it easier to enable by default, once ARC is battle-tested, for any exporter that uses the sending queue.

This isn't just a theoretical feature for us; it's based on our own painful experience of overwhelming and taking down downstream systems. We operate at a large scale, using both Vector and OTel to ingest hundreds of TBs of telemetry data, and we've seen the critical need for this kind of adaptive backpressure.

This experience is what led us to look for proven solutions. The Netflix tech blog (https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581) inspired Vector's implementation (https://vector.dev/blog/adaptive-request-concurrency/). We've seen the wins from this model firsthand in our own Vector deployments, and it's a key reason they enable it by default. My goal is to bring that same proven stability to OTel.

That said, I'm completely fine with moving this to an extension if the team prefers that. I did talk with @atoulme at KubeCon about how the sending_queue manages its goroutine pool, and I believe this implementation fits in correctly.

Since it's a big architectural choice, I will wait for feedback from the other exporterhelper code owners. @bogdandrutu and @dmitryax, I'd strongly encourage you to read the two blog posts above if you have a moment. They provide the full context for why this feature is so important and why I'm proposing it.

I'll hold off on the refactoring until you've all had a chance to weigh in. Please let me know which direction you'd like me to take. I'm happy to go whichever way the project prefers.

raghu999 · 2025-11-18

For reference, here are the gradient-based ARC implementations from Netflix and Envoy.

Netflix: https://github.com/Netflix/concurrency-limits/blob/main/concurrency-limits-core/src/main/java/com/netflix/concurrency/limits/limit/Gradient2Limit.java

Envoy: https://www.envoyproxy.io/docs/envoy/v1.15.0/configuration/http/http_filters/adaptive_concurrency_filter

raghu999 requested review from a team, bogdandrutu, dmathieu, dmitryax and mx-psi as code owners November 8, 2025 02:21

Add ARC support

8d7fba5

raghu999 force-pushed the arc-feature branch from 96a5fd3 to 8d7fba5 Compare November 8, 2025 03:03

raghu999 added 3 commits November 7, 2025 22:32

Merge branch 'main' into arc-feature

6ef1cc8

update version to 0.139.0 and fix lint errors

232d073

Merge branch 'arc-feature' of github.com-raghu999:raghu999/openteleme…

f79d24e

…try-collector into arc-feature

raghu999 force-pushed the arc-feature branch from 3d76293 to f79d24e Compare November 8, 2025 07:34

update exporter helper with new implementation

88486c8

raghu999 force-pushed the arc-feature branch from ebe4438 to 88486c8 Compare November 9, 2025 09:35

Merge branch 'main' into arc-feature

2dda0b2

jmacd requested changes Nov 10, 2025


fix: remove back pressure and add retryable error

b1c96fb

raghu999 force-pushed the arc-feature branch from 7a98d38 to b1c96fb Compare November 16, 2025 05:46

raghu999 added 3 commits November 16, 2025 11:24

Merge branch 'main' into arc-feature

4434abe

Merge branch 'main' into arc-feature

354213c

Merge branch 'main' into arc-feature

d9c349b
