Batch processor concurrency and error transmission #11308

jmacd · 2024-09-30T20:17:30Z

Description
The OTel-Arrow project forked the batch processor component and extended it with support for concurrency and error transmission. We have been satisfied with this approach and wish to contribute the whole set of features back to the core repository and use feature-flags to transition away from the legacy behaviors.

The functional changes include:

Tracing support: new root spans are linked with incoming requests; incoming requests are linked without outgoing exports
Concurrent export behavior
Synchronous error transmission

The existing component differs from our desired behavior in the second two of these points--it is limited to 1 concurrent export and does not transmit errors back. In this issue, we propose the use of feature-flags to transition from both of these legacy behaviors, which run counter to our performance and reliability goals.

The proposed configuration changes:

early_return (current default: true; eventual default: false)
max_concurrency (current default: 1; eventual default: unlimited)

The changes here were developed downstream, see https://github.com/open-telemetry/otel-arrow/blob/main/collector/processor/concurrentbatchprocessor

Link to tracking issue
Here are a number of issues that are connected with this one, both issues pointing to unmet needs in the exporterhelper batcher and missing features in the legacy batcher. Accepting these changes will allow significantly improved batching support in the interim period until the new batching support is complete.

Part of #10825 -- until this features is complete, users who depend on metadata_keys in the batch processor will not be able to upgrade
Part of #10368 -- I see this as the root cause, we haven't been able to introduce concurrency to the exporterhelper w/o also introducing a queue, which interferes with error transmission
Fixes #8245 -- my original report about the problem solved here -- we add concurrency with batching and error transmission and do not depend on a queue (persistent or in-memory)
Part of #9591 -- users must use one of the available memory limiting mechanisms in conjunction with the batch processor
Part of #8122 -- until this is finished, users depend on the original batch processor
Part of #7460 -- another statement of #8245; the batch processor does not propagate errors today, and this fixes the batch processor's contribution to the problem.

#### Description Split the internal `batcher.export()` interface into three methods. This is a refactoring that was applied in https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor and is being back-ported as part of #11308. The reason this refactoring is needed is that the parent context of the export() request will be manipulated in common code (vs signal-specific code) for tracing support. #### Link to tracking issue Part of #11308 #### Testing Existing tests cover this. --------- Co-authored-by: Bogdan Drutu <[email protected]>

…d names (#11314) #### Description There are two implementations of the `batcher` interface with several inconsistencies. Notably, the single-shard case would start a goroutine before `Start()` is called, which can be confusing when the goroutine leak checker notices it. This makes the single-shard batcher wait until Start() to create its single shard. A confusing field name `batcher` in this case becomes `single`, and a new field is added for use in start named `processor` to create the new shard. For the multi-shard batcher, `start()` does nothing, but this structure confusingly embeds the batchProcessor. Rename the field `processor` for consistency with the single-shard case. #### Link to tracking issue Part of #11308.

#### Description Split the internal `batcher.export()` interface into three methods. This is a refactoring that was applied in https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor and is being back-ported as part of open-telemetry#11308. The reason this refactoring is needed is that the parent context of the export() request will be manipulated in common code (vs signal-specific code) for tracing support. #### Link to tracking issue Part of open-telemetry#11308 #### Testing Existing tests cover this. --------- Co-authored-by: Bogdan Drutu <[email protected]>

…d names (open-telemetry#11314) #### Description There are two implementations of the `batcher` interface with several inconsistencies. Notably, the single-shard case would start a goroutine before `Start()` is called, which can be confusing when the goroutine leak checker notices it. This makes the single-shard batcher wait until Start() to create its single shard. A confusing field name `batcher` in this case becomes `single`, and a new field is added for use in start named `processor` to create the new shard. For the multi-shard batcher, `start()` does nothing, but this structure confusingly embeds the batchProcessor. Rename the field `processor` for consistency with the single-shard case. #### Link to tracking issue Part of open-telemetry#11308.

bogdandrutu · 2024-10-09T04:07:26Z

As one of the maintainer who tries to review this changes, I am sure everyone understands why the multi-consumer helps and is a useful capability to support, but I cannot understand why the Synchronous error transmission is important to support and who would benefit.

This issue comes to the community as we did "foo", it works, you should use it and support it. I would like to better understand what we try to achieve, and how this will help our customers.

Thanks for the detailed issue.

bogdandrutu · 2024-10-09T04:25:52Z

To clarify a bit more, I had a conversation about this with @jmacd and what I understood was that we want to "block" the incoming request to signal LBs that this instance is busy and potentially trigger auto-scale or LB marking this as busy to avoid overwhelming.

Then, @jmacd send a overly complicated implementation to achieve just this (so, unless I better understand other reasons I think the proposed solution is too complicated): The proposed solution also tries to count the number of "failed to emit data". This number cannot be correctly calculated as proposed, because data may be filtered, or request may be queued and drop later, etc. This can happen even without filter processor, only with a OTLP exporter, because the destination may return that 2 out of 1000 (which is formed out of 4 incoming requests) spans were not accepted you cannot know which of the incoming request is responsible for the errors, etc.

Currently, it is not in the scope of the collector to correctly calculate the correct number of "failed requests" because of components like filter processor, or fanout components (on one exporter everything is fine, on another is not), etc. Unless we design a solution that works in all these case (which should be a different effort and should be done across all the components), it is an unnecessary complication to try to calculate the number of "failed requests".

jmacd mentioned this issue Sep 30, 2024

Refactor batch processor export method #11309

Merged

jmacd mentioned this issue Sep 30, 2024

(batchprocessor): Multi/single-shard consistent start method and field names #11314

Merged

This was referenced Oct 1, 2024

(batchprocessor): Add tracing support #11324

Closed

New component: Concurrent Batch Processor open-telemetry/opentelemetry-collector-contrib#33422

Closed

This was referenced Oct 16, 2024

Allow Simple/Batch processor to concurrently invoke Exporter.Export() open-telemetry/opentelemetry-specification#4231

Open

(otelarrowexporter) README: on batching w/ otel-arrow open-telemetry/opentelemetry-collector-contrib#35225

Open

jmacd mentioned this issue Nov 1, 2024

Collector Reliability Review #11499

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batch processor concurrency and error transmission #11308

Batch processor concurrency and error transmission #11308

jmacd commented Sep 30, 2024

bogdandrutu commented Oct 9, 2024

bogdandrutu commented Oct 9, 2024

Batch processor concurrency and error transmission #11308

Batch processor concurrency and error transmission #11308

Comments

jmacd commented Sep 30, 2024

bogdandrutu commented Oct 9, 2024

bogdandrutu commented Oct 9, 2024