opensearch/elasticsearch output plugin not to retry failed chunks based on API response status #10197

cheewai-bash opened this issue Apr 10, 2025 · 0 comments

Is your feature request related to a problem? Please describe.

The fluent-bit opensearch output plugin retries sending any chunk for which it has not received HTTP response status code 200, up to Retry_Limit times (or indefinitely), regardless of the error condition.

This retry strategy is appropriate when the error is due to network connectivity problems or a transient server failure (out of free space, upgrade in progress, circuit breaker tripped, etc.), but it causes system instability when the failure is due to malformed logs, e.g. a server response with status 400 and error type "mapper_parsing_exception".

When an application logs heavily and its logs are malformed, retrying the failed chunks rapidly wastes CPU and memory in both fluent-bit and the OpenSearch server. It can take days to nail down the root cause when many applications and different teams are involved.

The elasticsearch output plugin has exactly the same issue.

This was previously raised in #4905; unfortunately, that issue was auto-marked as unplanned and closed on June 8, 2023. 🙏

Describe the solution you'd like

It would be great if the output plugin provided additional configuration parameters to fine-tune when to retry:

  • Retry only when the failure is due to network errors
  • Never retry a failed chunk when the error type is in a user-specified list
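One possible shape for such parameters, sketched as a fluent-bit output section. Note that `No_Retry_Errors` and `Retry_On` are hypothetical names invented here to illustrate the proposal; they are not existing fluent-bit options:

```ini
# HYPOTHETICAL configuration sketch -- the parameters below do not exist;
# the names are illustrative only.
[OUTPUT]
    Name            opensearch
    Match           *
    Retry_Limit     5
    # Proposed: never retry a chunk whose bulk response contains
    # any of these error types
    No_Retry_Errors mapper_parsing_exception,illegal_argument_exception
    # Proposed: only retry on network-level failures
    Retry_On        network_error
```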

The complete list of OpenSearch error types is:

  • mapper_parsing_exception: Occurs when there are issues parsing the mapping definitions or field types
  • action_request_validation_exception: Happens when bulk request validation fails
  • illegal_argument_exception: General errors for invalid parameters
  • index_not_found_exception: When the target index doesn't exist
  • resource_already_exists_exception: When trying to create an index that already exists
  • document_missing_exception: Referenced document doesn't exist
  • cluster_block_exception: Cluster-level blocks preventing operations
  • index_closed_exception: Target index is closed
  • not_serializable_exception: Failure to serialize an object
  • remote_transport_exception: Remote node transport failure
  • elasticsearch_parse_exception: General parsing errors (note: still called this in OpenSearch)
  • settings_exception: Invalid settings
  • index_shard_closed_exception: When target shard is closed
  • script_exception: Errors in script execution
  • invalid_index_name_exception: Invalid index name format
  • invalid_type_name_exception: Invalid type name
  • snapshots_in_progress_exception: Cannot perform operation while snapshots are in progress
  • invalid_index_template_exception: Invalid index template
  • illegal_state_exception: State-related errors
  • routing_missing_exception: Routing value required but missing
  • circuit_breaking_exception: Circuit breaker tripped
  • version_conflict_engine_exception: Document version conflicts
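To make the proposal concrete, here is a minimal sketch (not fluent-bit code) of how failed items in a bulk response could be classified as retryable or non-retryable based on a user-supplied list of error types from the list above. The particular sets chosen below are assumptions for illustration, not an authoritative classification:

```python
# Sketch: classify OpenSearch bulk-response errors into retryable vs.
# non-retryable. The membership of these sets is an illustrative
# assumption, not an official classification.

# Errors caused by the data itself -- retrying cannot succeed.
NON_RETRYABLE = {
    "mapper_parsing_exception",
    "action_request_validation_exception",
    "illegal_argument_exception",
    "invalid_index_name_exception",
    "invalid_type_name_exception",
    "elasticsearch_parse_exception",
}

def should_retry(error_type: str) -> bool:
    """Never retry known-permanent errors; retry everything else
    (transient failures and unknown error types) by default."""
    return error_type not in NON_RETRYABLE

# Example: walk the per-item errors of a bulk API response.
bulk_response = {
    "errors": True,
    "items": [
        {"index": {"status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field"}}},
        {"index": {"status": 429,
                   "error": {"type": "circuit_breaking_exception",
                             "reason": "data too large"}}},
    ],
}

for item in bulk_response["items"]:
    err = item["index"].get("error")
    if err:
        action = "retry" if should_retry(err["type"]) else "drop"
        print(f'{err["type"]} -> {action}')
```

The key design choice is to fail open: an unknown error type is retried, so a user-specified drop list can only suppress retries, never accidentally discard data on new error conditions.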

Describe alternatives you've considered

I am currently using Retry_Limit in combination with an exponential backoff strategy, but it is far from ideal.
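For reference, the workaround looks roughly like the following, using the existing Retry_Limit output option together with the scheduler's backoff settings (the specific values here are illustrative):

```ini
[SERVICE]
    # Existing scheduler options: base and cap of the exponential
    # backoff between retries, in seconds.
    scheduler.base  5
    scheduler.cap   300

[OUTPUT]
    Name        opensearch
    Match       *
    Retry_Limit 8
```

This bounds the damage from a bad chunk but still burns up to Retry_Limit attempts on data that can never be ingested, which is why a per-error-type policy would be preferable.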
