
[exporter/elasticsearch] Use exporterhelper/batchsender for reliability #32632

Closed

Conversation

Contributor

@carsonip carsonip commented Apr 23, 2024

Description:

Instead of buffering documents in the bulk indexer, buffer events using the exporterhelper batch sender, so that events are not lost when the persistent queue is used.

Changes:

  • Move buffering from bulk indexer to batch sender to improve reliability.
  • With this change, there should be no event loss when used with persistent queue in the event of a collector crash.
  • Introduce batcher.* to configure the batch sender which is now enabled by default.
  • Option flush.bytes is deprecated. Use the new batcher.min_size_items option to control the minimum number of items (log records, spans) to trigger a flush. batcher.min_size_items will be set to the value of flush.bytes / 1000 if flush.bytes is non-zero.
  • Option flush.interval is deprecated. Use the new batcher.flush_timeout option to control max age of buffer. batcher.flush_timeout will be set to the value of flush.interval if flush.interval is non-zero.
  • Queue sender sending_queue.enabled defaults to true.
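Taken together, the options named above (`batcher.*`, `sending_queue.enabled`, and the deprecated `flush.*` mappings) could be written as a configuration along these lines; this is an illustrative sketch, and the endpoint and values are placeholders, not part of this PR:

```yaml
exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"   # placeholder endpoint
    # batcher.* configures the exporterhelper batch sender,
    # which is enabled by default after this change.
    batcher:
      enabled: true
      min_size_items: 5000   # replaces deprecated flush.bytes (approx. flush.bytes / 1000)
      flush_timeout: 30s     # replaces deprecated flush.interval
    # the queue sender now defaults to enabled
    sending_queue:
      enabled: true
```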

Blocked by:

Link to tracking Issue: Fixes #32377

Testing:

Updated benchmarks

Documentation:

Contributor

github-actions bot commented May 8, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels May 8, 2024
@carsonip carsonip force-pushed the exporterhelper-batchsender branch from f2bbfbe to 97d5c19 on May 9, 2024, 15:41
Comment on lines 176 to 177
- `bytes` (DEPRECATED, use `batcher.min_size_items` instead): Write buffer flush size limit.
- `interval` (DEPRECATED, use `batcher.flush_timeout` instead): Maximum time of the oldest item spent inside the buffer, aka "max age of buffer". A flush will happen regardless of the size of content in buffer.
Member

@andrzej-stencel andrzej-stencel Jul 10, 2024


Given that the batcher options are experimental, I don't think we should be deprecating these options at this point. Users would be left with two options: one deprecated and one experimental, which could cause confusion. Users should have at least one non-deprecated, non-experimental batching configuration to use.

What we could do is mention here the advantages of using the batcher options instead of flush.

Contributor Author


It controls the same behavior of batching, and flush settings are converted to batcher settings anyway. It is marked deprecated to make it obvious to users that using batcher settings is the recommended way.

Member


I understand the benefits of using the batch sender (configured in the batcher option) for reliability, and I applaud this addition. I still believe we cannot force users into this new feature, for the following reasons:

  • The batch sender feature is experimental - this means that the configuration options may change.
  • The batch sender feature is not well tested in production. The author of this PR is the best person to confirm this, given the issues they've raised (and fixed! 👏) so far #9952, #10166, #10306.
  • The batch sender feature currently lacks the option to specify batch size in bytes.

This pull request, in its current state, not only makes the experimental batch sender the default, but also leaves users with it as the only option, effectively removing the functionality that existed before. If I understand correctly, disabling the batch sender by setting batcher::enabled to false and specifying flush::bytes and flush::interval disables batching altogether, whereas I would expect this configuration to work as in previous versions. Please correct me if I'm wrong @carsonip.

I propose the following changes to this pull request:

  1. Set batcher::enabled to false by default. In this case, use the current flush settings. This means there is no change in behavior for users upgrading to new version.
  2. Do not deprecate the flush options.
  3. Describe in the README the benefits of enabling the new batcher options for reliability, but also warn that the feature is experimental and usage in production scenarios is not encouraged.
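Sketching the proposal above as a config (option names are taken from this thread; the endpoint and values are illustrative placeholders), an upgrading user would keep the previous behavior by default and opt into the batch sender explicitly:

```yaml
exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"   # placeholder endpoint
    # Proposed default: experimental batch sender off,
    # so the legacy flush settings continue to apply.
    batcher:
      enabled: false
    flush:
      bytes: 5000000
      interval: 30s
```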

Contributor Author


If I understand correctly, disabling the batch sender by setting batcher::enabled to false and specifying flush::bytes and flush::interval disables batching altogether, whereas I would expect this configuration to work as in previous versions. Please correct me if I'm wrong @carsonip.

That is incorrect. There is batching baked into the bulk indexer implementation in the current main (see code). The batcher is added and enabled by default to replace the existing batching within the bulk indexer without a change in behavior.

Set batcher::enabled to false by default

As explained above, setting batcher::enabled to false is actually what would be breaking, rather than the opposite.

Do not deprecate the flush options.

I believe this is a separate topic. Marking it deprecated or not doesn't affect the actual behavior. Therefore, I would like to know whether your concern is that bytes-based flushing will only be approximated after this PR, or that the newly recommended config is experimental.

enabling the new batcher options for reliability, but also warn that the feature is experimental and usage in production scenarios is not encouraged.

Although the batcher is new, I don't see why an upstream helper should be considered less preferable than an existing batching feature within the exporter that possibly has lower test coverage.

@@ -13,3 +13,4 @@ status:
tests:
config:
endpoints: [http://localhost:9200]
expect_consumer_error: true
Member


What is the reason for this change?

Contributor Author


The test doesn't specify a valid Elasticsearch backend. The exporter will return an error because sending will fail.

- `flush`: Event bulk indexer buffer flush settings
- `bytes` (default=5000000): Write buffer flush size limit.
- `interval` (default=30s): Write buffer flush time limit.
- `bytes` (DEPRECATED, use `batcher.min_size_items` instead): Write buffer flush size limit. When specified, it is translated to `batcher.min_size_items` using an estimate of average item size of 1000 bytes.

@cmacknz cmacknz Jul 11, 2024


FYI we are in the process of adding the ability to specify the queue and flush sizes in bytes to Beats. Across arbitrary data sources, using only item sizes is terrible for performance tuning consistency. There are use cases where the individual item size varies greatly, think of someone ingesting data from blob storage where the source data is from disparate systems. You can have 10 kb and 1 MB "items" from the same source.

There are several data sources we need to support where item size is not consistent. I highly suspect making the entire pipeline able to be specified in terms of bytes is something that will need to be revisited.

Why did we deprecate bytes in favour of items here?

Contributor Author


The motivation behind this move is that if a queue (persistent or not) is configured and batching (buffering) happens after the queue, which is the current implementation in main, there will be data loss on a collector crash. A persistent queue provides no additional reliability in that case. With this PR, we take advantage of the new upstream batch sender exporter helper, which only removes an item from the queue once the data has been sent to ES successfully.

However, although byte-based batching is on the roadmap, only item-based batching is supported at the moment, see: open-telemetry/opentelemetry-collector#8122 . I understand that item size varies and completely agree that byte-based batching is critical to performance. In any case, using the batch sender should be the way forward, despite the current lack of byte-based batching support.
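As a sketch of the reliability setup described above (the `file_storage` storage extension and its directory path are assumptions for illustration, not part of this PR): with the batch sender, items are only removed from the persistent queue once Elasticsearch has acknowledged them, so a crash does not lose buffered documents:

```yaml
extensions:
  file_storage:                         # assumed storage extension backing the persistent queue
    directory: /var/lib/otelcol/queue   # illustrative path

exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"   # placeholder endpoint
    sending_queue:
      enabled: true
      storage: file_storage             # makes the queue persistent
    batcher:                            # batching now happens at the sender, after the queue
      enabled: true
      min_size_items: 5000
      flush_timeout: 30s

service:
  extensions: [file_storage]
```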


Thanks! Makes sense to me, definitely choosing reliability over byte based configuration is the right move now.


mx-psi pushed a commit that referenced this pull request Jul 22, 2024
…batch sender (#34127)

**Description:**

Refactor the Elasticsearch bulk indexer code to create an abstraction
around the existing buffering, asynchronous bulk indexer.

This is preparation for supporting two implementations of bulk indexing:
the existing asynchronous one, and a new synchronous one that works well
with exporterhelper's batch sender -- see
#32632.

**Link to tracking Issue:**


#32377

**Testing:**

N/A, this is a non-functional change.

**Documentation:**

N/A, pure refactoring.
mx-psi pushed a commit that referenced this pull request Jul 30, 2024
**Description:**

Add opt-in support for the experimental batch sender
(open-telemetry/opentelemetry-collector#8122).
When opting into this functionality the exporter's `Consume*` methods
will make synchronous bulk requests to Elasticsearch, without additional
batching/buffering in the exporter.

By default the exporter continues to use its own buffering, which
supports thresholds for time, number of documents, and size of encoded
documents in bytes. The batch sender does not currently support a
bytes-based threshold, and is experimental, hence why we are not yet
making it the default for the Elasticsearch exporter.

This PR is based on
#32632,
but made to be non-breaking.

**Link to tracking Issue:**

#32377

**Testing:**

Added unit and integration tests.

Manually tested with Elasticsearch, using the following config to
highlight that client metadata can now flow through all the way:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        include_metadata: true

exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"
    auth:
      authenticator: headers_setter
    batcher:
      enabled: false

extensions:
  headers_setter:
    headers:
      - action: insert
        key: Authorization
        from_context: authorization

service:
  extensions: [headers_setter]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [elasticsearch]
```

I have Elasticsearch running locally, with an "admin" user with the
password "changeme". Sending OTLP/HTTP to the collector with
`telemetrygen traces --otlp-http --otlp-insecure http://localhost:4318
--otlp-header "Authorization=\"Basic YWRtaW46Y2hhbmdlbWU=\""`, I observe
the following:
- Without the `batcher` config, the exporter fails to index data into
Elasticsearch due to an auth error. That's because the exporter is
buffering and dropping the context with client metadata, so there's no
Authorization header attached to the requests going out.
- With `batcher: {enabled: true}`, same behaviour as above. Unlike the
[`batch`
processor](https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor),
the batch sender does not maintain client metadata.
- With `batcher: {enabled: false}`, the exporter successfully indexes
data into Elasticsearch.

**Documentation:**

Updated the README.

---------

Co-authored-by: Carson Ip <[email protected]>

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 31, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Aug 15, 2024
@carsonip
Contributor Author

Superseded by #34238

Successfully merging this pull request may close these issues.

[exporter/elasticsearch] Use batch sender exporter helper for reliability