
[exporter/elasticsearch] Use exporterhelper/batchsender for reliability #32632

Closed

Conversation

Contributor

@carsonip carsonip commented Apr 23, 2024

Description:

Instead of buffering documents in the bulk indexer, buffer events using the exporterhelper batch sender, so that events are not lost when the persistent queue is used.

Changes:

  • Move buffering from bulk indexer to batch sender to improve reliability.
  • With this change, there should be no event loss when used with persistent queue in the event of a collector crash.
  • Introduce batcher.* to configure the batch sender which is now enabled by default.
  • Option flush.bytes is deprecated. Use the new batcher.min_size_items option to control the minimum number of items (log records, spans) to trigger a flush. batcher.min_size_items will be set to the value of flush.bytes / 1000 if flush.bytes is non-zero.
  • Option flush.interval is deprecated. Use the new batcher.flush_timeout option to control max age of buffer. batcher.flush_timeout will be set to the value of flush.interval if flush.interval is non-zero.
  • Queue sender sending_queue.enabled defaults to true.
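Taken together, the options named above (`batcher.*`, `sending_queue.enabled`, and the deprecated `flush.*` mappings) could be written as a configuration along these lines; this is an illustrative sketch, and the endpoint and values are placeholders, not part of this PR:

```yaml
exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"   # placeholder endpoint
    # batcher.* configures the exporterhelper batch sender,
    # which is enabled by default after this change.
    batcher:
      enabled: true
      min_size_items: 5000   # replaces deprecated flush.bytes (approx. flush.bytes / 1000)
      flush_timeout: 30s     # replaces deprecated flush.interval
    # the queue sender now defaults to enabled
    sending_queue:
      enabled: true
```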

Blocked by:

Link to tracking Issue: Fixes #32377

Testing:

Updated benchmarks

Documentation:

Contributor

github-actions bot commented May 8, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels May 8, 2024
@carsonip carsonip force-pushed the exporterhelper-batchsender branch from f2bbfbe to 97d5c19 on May 9, 2024, 15:41
Comment on lines 176 to 177
- `bytes` (DEPRECATED, use `batcher.min_size_items` instead): Write buffer flush size limit.
- `interval` (DEPRECATED, use `batcher.flush_timeout` instead): Maximum time of the oldest item spent inside the buffer, aka "max age of buffer". A flush will happen regardless of the size of content in buffer.
Member

@andrzej-stencel andrzej-stencel Jul 10, 2024


Given that the batcher options are experimental, I don't think we should be deprecating these options at this point. Users would be left with two options: one deprecated and one experimental, which could cause confusion. Users should have at least one non-deprecated, non-experimental batching configuration to use.

What we could do is mention here the advantages of using the batcher options instead of flush.

Contributor Author


It controls the same behavior of batching, and flush settings are converted to batcher settings anyway. It is marked deprecated to make it obvious to users that using batcher settings is the recommended way.

Member


I understand the benefits of using the batch sender (configured in the batcher option) for reliability, and I applaud this addition. I still believe we cannot force users into this new feature, for the following reasons:

  • The batch sender feature is experimental - this means that the configuration options may change.
  • The batch sender feature is not well tested in production. The author of this PR is the best person to confirm this, given the issues they've raised (and fixed! 👏) so far #9952, #10166, #10306.
  • The batch sender feature currently lacks the option to specify batch size in bytes.

This pull request, in its current state, not only makes the experimental batch sender the default, but also leaves users with it as the only option, effectively removing the functionality that existed before. If I understand correctly, disabling the batch sender by setting batcher::enabled to false and specifying flush::bytes and flush::interval disables batching altogether, whereas I would expect this configuration to work as in previous versions. Please correct me if I'm wrong @carsonip.

I propose the following changes to this pull request:

  1. Set batcher::enabled to false by default. In this case, use the current flush settings. This means there is no change in behavior for users upgrading to new version.
  2. Do not deprecate the flush options.
  3. Describe in the README the benefits of enabling the new batcher options for reliability, but also warn that the feature is experimental and usage in production scenarios is not encouraged.
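Sketching the proposal above as a config (option names are taken from this thread; the endpoint and values are illustrative placeholders), an upgrading user would keep the previous behavior by default and opt into the batch sender explicitly:

```yaml
exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"   # placeholder endpoint
    # Proposed default: experimental batch sender off,
    # so the legacy flush settings continue to apply.
    batcher:
      enabled: false
    flush:
      bytes: 5000000
      interval: 30s
```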

Contributor Author


If I understand correctly, disabling the batch sender by setting batcher::enabled to false and specifying flush::bytes and flush::interval disables batching altogether, whereas I would expect this configuration to work as in previous versions. Please correct me if I'm wrong @carsonip.

That is incorrect. There is batching baked into the bulk indexer implementation in the current main (see code). The batcher is added and enabled by default to replace the existing batching within the bulk indexer without a change in behavior.

Set batcher::enabled to false by default

As explained above, setting batcher::enabled to false is actually what would be breaking, rather than the opposite.

Do not deprecate the flush options.

I believe this is a separate topic. Marking it deprecated or not doesn't affect the actual behavior. Therefore, I would like to know whether your concern is that bytes-based flushing will only be approximated after this PR, or that the newly recommended config is experimental.

enabling the new batcher options for reliability, but also warn that the feature is experimental and usage in production scenarios is not encouraged.

Although the batcher is new, I don't see why an upstream helper should be considered less preferable than an existing batching feature within the exporter that possibly has lower test coverage.

@@ -13,3 +13,4 @@ status:
tests:
config:
endpoints: [http://localhost:9200]
expect_consumer_error: true
Member


What is the reason for this change?

Contributor Author


The test doesn't specify a valid Elasticsearch backend. The exporter will return an error because sending will fail.

- `flush`: Event bulk indexer buffer flush settings
- `bytes` (default=5000000): Write buffer flush size limit.
- `interval` (default=30s): Write buffer flush time limit.
- `bytes` (DEPRECATED, use `batcher.min_size_items` instead): Write buffer flush size limit. When specified, it is translated to `batcher.min_size_items` using an estimate of average item size of 1000 bytes.

@cmacknz cmacknz Jul 11, 2024


FYI we are in the process of adding the ability to specify the queue and flush sizes in bytes to Beats. Across arbitrary data sources, using only item sizes is terrible for performance tuning consistency. There are use cases where the individual item size varies greatly, think of someone ingesting data from blob storage where the source data is from disparate systems. You can have 10 kb and 1 MB "items" from the same source.

There are several data sources we need to support where item size is not consistent. I highly suspect making the entire pipeline able to be specified in terms of bytes is something that will need to be revisited.

Why did we deprecate bytes in favour of items here?

Contributor Author


The motivation behind this move is that if a queue (persistent or not) is configured and batching (buffering) happens after the queue, which is the current implementation in main, there will be data loss on a collector crash. A persistent queue provides no additional reliability in that case. With this PR, we take advantage of the new upstream batch sender exporter helper, which only removes an item from the queue once the data has been sent to ES successfully.

However, although byte-based batching is on the roadmap, only item-based batching is supported at the moment, see: open-telemetry/opentelemetry-collector#8122 . I understand that item size varies and completely agree that byte-based batching is critical to performance. In any case, using the batch sender should be the way forward, despite the current lack of byte-based batching support.
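As a sketch of the reliability setup described above (the `file_storage` storage extension and its directory path are assumptions for illustration, not part of this PR): with the batch sender, items are only removed from the persistent queue once Elasticsearch has acknowledged them, so a crash does not lose buffered documents:

```yaml
extensions:
  file_storage:                         # assumed storage extension backing the persistent queue
    directory: /var/lib/otelcol/queue   # illustrative path

exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"   # placeholder endpoint
    sending_queue:
      enabled: true
      storage: file_storage             # makes the queue persistent
    batcher:                            # batching now happens at the sender, after the queue
      enabled: true
      min_size_items: 5000
      flush_timeout: 30s

service:
  extensions: [file_storage]
```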


Thanks! Makes sense to me, definitely choosing reliability over byte based configuration is the right move now.


mx-psi pushed a commit that referenced this pull request Jul 22, 2024
…batch sender (#34127)

**Description:**

Refactor the Elasticsearch bulk indexer code to create an abstraction
around the existing buffering, asynchronous bulk indexer.

This is preparation for supporting two implementations of bulk indexing:
the existing asynchronous one, and a new synchronous one that works well
with exporterhelper's batch sender -- see
#32632.

**Link to tracking Issue:**


#32377

**Testing:**

N/A, this is a non-functional change.

**Documentation:**

N/A, pure refactoring.
mx-psi pushed a commit that referenced this pull request Jul 30, 2024
**Description:**

Add opt-in support for the experimental batch sender
(open-telemetry/opentelemetry-collector#8122).
When opting into this functionality the exporter's `Consume*` methods
will make synchronous bulk requests to Elasticsearch, without additional
batching/buffering in the exporter.

By default the exporter continues to use its own buffering, which
supports thresholds for time, number of documents, and size of encoded
documents in bytes. The batch sender does not currently support a
bytes-based threshold, and is experimental, hence why we are not yet
making it the default for the Elasticsearch exporter.

This PR is based on
#32632,
but made to be non-breaking.

**Link to tracking Issue:**

#32377

**Testing:**

Added unit and integration tests.

Manually tested with Elasticsearch, using the following config to
highlight that client metadata can now flow through all the way:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        include_metadata: true

exporters:
  elasticsearch:
    endpoint: "http://localhost:9200"
    auth:
      authenticator: headers_setter
    batcher:
      enabled: false

extensions:
  headers_setter:
    headers:
      - action: insert
        key: Authorization
        from_context: authorization

service:
  extensions: [headers_setter]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [elasticsearch]
```

I have Elasticsearch running locally, with an "admin" user with the
password "changeme". Sending OTLP/HTTP to the collector with
`telemetrygen traces --otlp-http --otlp-insecure http://localhost:4318
--otlp-header "Authorization=\"Basic YWRtaW46Y2hhbmdlbWU=\""`, I observe
the following:
- Without the `batcher` config, the exporter fails to index data into
Elasticsearch due to an auth error. That's because the exporter is
buffering and dropping the context with client metadata, so there's no
Authorization header attached to the requests going out.
- With `batcher: {enabled: true}`, same behaviour as above. Unlike the
[`batch`
processor](https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor),
the batch sender does not maintain client metadata.
- With `batcher: {enabled: false}`, the exporter successfully indexes
data into Elasticsearch.

**Documentation:**

Updated the README.

---------

Co-authored-by: Carson Ip <[email protected]>

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jul 31, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Aug 15, 2024
@carsonip
Contributor Author

Superseded by #34238

Successfully merging this pull request may close these issues.

[exporter/elasticsearch] Use batch sender exporter helper for reliability