opensearch/elasticsearch output plugin not to retry failed chunks based on API response status #10197

cheewai-bash opened this issue Apr 10, 2025 · 0 comments

Is your feature request related to a problem? Please describe.

The fluent-bit opensearch output plugin retries sending any chunk for which it has not received HTTP response status code 200, up to Retry_Limit times (or indefinitely), regardless of the error condition.

This retry strategy is appropriate when the error is due to network connectivity problems or a transient server failure (out of free space, upgrade in progress, circuit breaker tripped, etc.), but it causes system instability when the failure is due to malformed logs, e.g. a server response with status 400 and error type "mapper_parsing_exception".

When an application logs heavily and its logs are malformed, retrying the failed chunks rapidly wastes CPU and memory in both fluent-bit and the OpenSearch server. It can take days to nail down the root cause when many applications and different teams are involved.

The elasticsearch output plugin has exactly the same issue.

This was previously raised in #4905; unfortunately, that issue was auto-marked as unplanned and closed on June 8, 2023. 🙏

Describe the solution you'd like

It would be great if the output plugin provided additional configuration parameters to fine-tune when to retry:

  • Retry only when the failure is due to network errors
  • Never retry a failed chunk when the error type is in a user-specified list
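One possible shape for such parameters, sketched as a fluent-bit output section. Note that `No_Retry_Errors` and `Retry_On` are hypothetical names invented here to illustrate the proposal; they are not existing fluent-bit options:

```ini
# HYPOTHETICAL configuration sketch -- the parameters below do not exist;
# the names are illustrative only.
[OUTPUT]
    Name            opensearch
    Match           *
    Retry_Limit     5
    # Proposed: never retry a chunk whose bulk response contains
    # any of these error types
    No_Retry_Errors mapper_parsing_exception,illegal_argument_exception
    # Proposed: only retry on network-level failures
    Retry_On        network_error
```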

The complete list of OpenSearch error types is:

  • mapper_parsing_exception: Occurs when there are issues parsing the mapping definitions or field types
  • action_request_validation_exception: Happens when bulk request validation fails
  • illegal_argument_exception: General errors for invalid parameters
  • index_not_found_exception: When the target index doesn't exist
  • resource_already_exists_exception: When trying to create an index that already exists
  • document_missing_exception: Referenced document doesn't exist
  • cluster_block_exception: Cluster-level blocks preventing operations
  • index_closed_exception: Target index is closed
  • not_serializable_exception: Failure to serialize an object
  • remote_transport_exception: Remote node transport failure
  • elasticsearch_parse_exception: General parsing errors (note: still called this in OpenSearch)
  • settings_exception: Invalid settings
  • index_shard_closed_exception: When target shard is closed
  • script_exception: Errors in script execution
  • invalid_index_name_exception: Invalid index name format
  • invalid_type_name_exception: Invalid type name
  • snapshots_in_progress_exception: Cannot perform operation while snapshots are in progress
  • invalid_index_template_exception: Invalid index template
  • illegal_state_exception: State-related errors
  • routing_missing_exception: Routing value required but missing
  • circuit_breaking_exception: Circuit breaker tripped
  • version_conflict_engine_exception: Document version conflicts
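To make the proposal concrete, here is a minimal sketch (not fluent-bit code) of how failed items in a bulk response could be classified as retryable or non-retryable based on a user-supplied list of error types from the list above. The particular sets chosen below are assumptions for illustration, not an authoritative classification:

```python
# Sketch: classify OpenSearch bulk-response errors into retryable vs.
# non-retryable. The membership of these sets is an illustrative
# assumption, not an official classification.

# Errors caused by the data itself -- retrying cannot succeed.
NON_RETRYABLE = {
    "mapper_parsing_exception",
    "action_request_validation_exception",
    "illegal_argument_exception",
    "invalid_index_name_exception",
    "invalid_type_name_exception",
    "elasticsearch_parse_exception",
}

def should_retry(error_type: str) -> bool:
    """Never retry known-permanent errors; retry everything else
    (transient failures and unknown error types) by default."""
    return error_type not in NON_RETRYABLE

# Example: walk the per-item errors of a bulk API response.
bulk_response = {
    "errors": True,
    "items": [
        {"index": {"status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field"}}},
        {"index": {"status": 429,
                   "error": {"type": "circuit_breaking_exception",
                             "reason": "data too large"}}},
    ],
}

for item in bulk_response["items"]:
    err = item["index"].get("error")
    if err:
        action = "retry" if should_retry(err["type"]) else "drop"
        print(f'{err["type"]} -> {action}')
```

The key design choice is to fail open: an unknown error type is retried, so a user-specified drop list can only suppress retries, never accidentally discard data on new error conditions.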

Describe alternatives you've considered

I am currently using Retry_Limit in combination with an exponential backoff strategy, but it is far from ideal.
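For reference, the workaround looks roughly like the following, using the existing Retry_Limit output option together with the scheduler's backoff settings (the specific values here are illustrative):

```ini
[SERVICE]
    # Existing scheduler options: base and cap of the exponential
    # backoff between retries, in seconds.
    scheduler.base  5
    scheduler.cap   300

[OUTPUT]
    Name        opensearch
    Match       *
    Retry_Limit 8
```

This bounds the damage from a bad chunk but still burns up to Retry_Limit attempts on data that can never be ingested, which is why a per-error-type policy would be preferable.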
