[exporter/elasticsearch] Use exporterhelper/batchsender for reliability #32632
Conversation
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Force-pushed from f2bbfbe to 97d5c19.
- `bytes` (DEPRECATED, use `batcher.min_size_items` instead): Write buffer flush size limit.
- `interval` (DEPRECATED, use `batcher.flush_timeout` instead): Maximum time of the oldest item spent inside the buffer, aka "max age of buffer". A flush will happen regardless of the size of content in buffer.
Given that the `batcher` options are experimental, I don't think we should be deprecating these options at this point. Users would be left with two options: one deprecated and one experimental, which could cause confusion. Users should have at least one non-deprecated, non-experimental batching configuration to use.

What we could do is mention here the advantages of using the `batcher` options instead of `flush`.
It controls the same batching behavior, and `flush` settings are converted to `batcher` settings anyway. It is marked deprecated to make it obvious to users that using the `batcher` settings is the recommended way.
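As a sketch of that conversion (the values here are illustrative; per this PR, `flush.bytes` maps to `batcher.min_size_items` using a 1000-byte average item estimate, and `flush.interval` maps to `batcher.flush_timeout`):

```yaml
# Deprecated flush-based settings (illustrative values):
exporters:
  elasticsearch:
    flush:
      bytes: 5000000
      interval: 30s

# Roughly equivalent batcher-based settings after translation:
exporters:
  elasticsearch:
    batcher:
      enabled: true
      min_size_items: 5000   # 5000000 bytes / 1000-byte average item estimate
      flush_timeout: 30s
```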
I understand the benefits of using the batch sender (configured in the `batcher` option) for reliability, and I applaud this addition. I still believe we cannot force users into this new feature, for the following reasons:
- The batch sender feature is experimental - this means that the configuration options may change.
- The batch sender feature is not well tested in production. The author of this PR is the best person to confirm this, given the issues they've raised (and fixed! 👏) so far: #9952, #10166, #10306.
- The batch sender feature currently lacks the option to specify batch size in bytes.

This pull request in its current state not only makes the experimental batch sender the default, but also leaves users with it as the only option, effectively removing functionality that existed before. If I understand correctly, disabling the batch sender by setting `batcher::enabled` to `false` and specifying `flush::bytes` and `flush::interval` disables batching altogether, whereas I would expect this configuration to work as in previous versions. Please correct me if I'm wrong @carsonip.
I propose the following changes to this pull request:
1. Set `batcher::enabled` to `false` by default. In this case, use the current `flush` settings. This means there is no change in behavior for users upgrading to the new version.
2. Do not deprecate the `flush` options.
3. Describe in the README the benefits of enabling the new `batcher` options for reliability, but also warn that the feature is experimental and usage in production scenarios is not encouraged.
> If I understand correctly, disabling the batch sender by setting `batcher::enabled` to `false` and specifying `flush::bytes` and `flush::interval` disables batching altogether, whereas I would expect this configuration to work as in previous versions. Please correct me if I'm wrong @carsonip.
That is incorrect. There is batching baked into the bulk indexer implementation on the current main (see code). The batcher is added and enabled by default to replace the existing batching within the bulk indexer without a change in behavior.
> Set `batcher::enabled` to `false` by default

As explained above, setting `batcher::enabled` to `false` is actually the breaking change, rather than the opposite.
> Do not deprecate the `flush` options.
I believe this is a separate topic. Marking it deprecated or not doesn't affect the actual behavior. Therefore, I would like to know whether you are concerned about the fact that byte-based flushing will only be approximated after this PR, or the fact that the new recommended config is experimental.
> enabling the new `batcher` options for reliability, but also warn that the feature is experimental and usage in production scenarios is not encouraged.
Although batcher is new, I don't see why an upstream helper is necessarily less preferable to an existing batching feature within the exporter with possibly lower test coverage.
```diff
@@ -13,3 +13,4 @@ status:
 tests:
   config:
     endpoints: [http://localhost:9200]
+  expect_consumer_error: true
```
What is the reason for this change?
The test doesn't specify a valid es backend. The exporter will return an error as sending will fail.
- `flush`: Event bulk indexer buffer flush settings
- `bytes` (default=5000000): Write buffer flush size limit.
- `interval` (default=30s): Write buffer flush time limit.
- `bytes` (DEPRECATED, use `batcher.min_size_items` instead): Write buffer flush size limit. When specified, it is translated to `batcher.min_size_items` using an estimate of average item size of 1000 bytes.
FYI we are in the process of adding the ability to specify the queue and flush sizes in bytes to Beats. Across arbitrary data sources, using only item sizes is terrible for performance tuning consistency. There are use cases where the individual item size varies greatly, think of someone ingesting data from blob storage where the source data is from disparate systems. You can have 10 kb and 1 MB "items" from the same source.
There are several data sources we need to support where item size is not consistent. I highly suspect making the entire pipeline able to be specified in terms of bytes is something that will need to be revisited.
Why did we deprecate bytes in favour of items here?
The motivation behind this move is that if a queue (persistent or not) is configured and batching (buffering) happens after the queue, which is the current implementation in main, there will be data loss on a collector crash. The persistent queue would provide no additional reliability in that case. With this PR, we take advantage of the new upstream batch sender exporter helper, which only removes an item from the queue once the data has been sent to ES successfully.

However, although byte-based batching is on the roadmap, only item-based batching is supported at the moment; see open-telemetry/opentelemetry-collector#8122. I understand that item size varies, and I completely agree that byte-based batching is critical to performance. In any case, using the batch sender should be the way forward, despite the current lack of byte-based batching support.
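To illustrate the reliability argument, a hedged sketch of a config in which the persistent queue and the batch sender work together (the `file_storage` extension and all values are assumptions for illustration, not taken from this PR):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/storage

exporters:
  elasticsearch:
    endpoints: [http://localhost:9200]
    sending_queue:
      enabled: true
      storage: file_storage   # items survive a crash; removed only after a successful send
    batcher:
      enabled: true
      min_size_items: 5000    # item-based only; byte-based batching is not yet supported
      flush_timeout: 30s

service:
  extensions: [file_storage]
```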
Thanks! Makes sense to me, definitely choosing reliability over byte based configuration is the right move now.
Co-authored-by: Andrzej Stencel <[email protected]>
…batch sender (#34127)

**Description:** Refactor the Elasticsearch bulk indexer code to create an abstraction around the existing buffering, asynchronous bulk indexer. This is preparation for supporting two implementations of bulk indexing: the existing asynchronous one, and a new synchronous one that works well with exporterhelper's batch sender -- see #32632.

**Link to tracking Issue:** #32377

**Testing:** N/A, this is a non-functional change.

**Documentation:** N/A, pure refactoring.
**Description:** Add opt-in support for the experimental batch sender (open-telemetry/opentelemetry-collector#8122). When opting into this functionality the exporter's `Consume*` methods will make synchronous bulk requests to Elasticsearch, without additional batching/buffering in the exporter.

By default the exporter continues to use its own buffering, which supports thresholds for time, number of documents, and size of encoded documents in bytes. The batch sender does not currently support a bytes-based threshold, and is experimental, hence why we are not yet making it the default for the Elasticsearch exporter.

This PR is based on #32632, but made to be non-breaking.

**Link to tracking Issue:** #32377

**Testing:** Added unit and integration tests. Manually tested with Elasticsearch, using the following config to highlight that client metadata can now flow through all the way:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        include_metadata: true
exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"
    auth:
      authenticator: headers_setter
    batcher:
      enabled: false
extensions:
  headers_setter:
    headers:
      - action: insert
        key: Authorization
        from_context: authorization
service:
  extensions: [headers_setter]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [elasticsearch]
```

I have Elasticsearch running locally, with an "admin" user with the password "changeme". Sending OTLP/HTTP to the collector with `telemetrygen traces --otlp-http --otlp-insecure http://localhost:4318 --otlp-header "Authorization=\"Basic YWRtaW46Y2hhbmdlbWU=\""`, I observe the following:

- Without the `batcher` config, the exporter fails to index data into Elasticsearch due to an auth error. That's because the exporter is buffering and dropping the context with client metadata, so there's no Authorization header attached to the requests going out.
- With `batcher: {enabled: true}`, same behaviour as above. Unlike the [`batch` processor](https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor), the batch sender does not maintain client metadata.
- With `batcher: {enabled: false}`, the exporter successfully indexes data into Elasticsearch.

**Documentation:** Updated the README.

---------

Co-authored-by: Carson Ip <[email protected]>
This PR was marked stale due to lack of activity. It will be closed in 14 days.

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Superseded by #34238
Description:
Instead of buffering documents in the bulk indexer, buffer events using exporterhelper's batch sender so that events will not be lost when the persistent queue is used.

Changes:
- `batcher.*` configures the batch sender, which is now enabled by default.
- `flush.bytes` is deprecated. Use the new `batcher.min_size_items` option to control the minimum number of items (log records, spans) that triggers a flush. `batcher.min_size_items` will be set to the value of `flush.bytes` / 1000 if `flush.bytes` is non-zero.
- `flush.interval` is deprecated. Use the new `batcher.flush_timeout` option to control the max age of the buffer. `batcher.flush_timeout` will be set to the value of `flush.interval` if `flush.interval` is non-zero.
- `sending_queue.enabled` defaults to `true`.

Blocked by:
Link to tracking Issue: Fixes #32377
Testing:
Updated benchmarks
Documentation: