Is your feature request related to a problem? Please describe.
The fluent-bit opensearch output plugin retries sending any chunk for which it has not received a 200 response, up to Retry_Limit attempts (or indefinitely), regardless of the error condition.
This retry strategy is perfect when the error is due to network connectivity or a transient server failure (out of free space, upgrade in progress, circuit breaker tripped, etc.), but it causes system instability when the failure is due to malformed logs, e.g. a 400 response with error type "mapper_parsing_exception".
When an application logs malformed records heavily, retrying failed chunks rapidly wastes CPU and memory in both fluent-bit and the OpenSearch server. It can take days to nail down the root cause when many applications and different teams are involved.
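To make the distinction concrete, here is a minimal sketch (not fluent-bit's actual code) of the retry decision this request is asking for, assuming the plugin has already parsed the bulk-response JSON. The `NON_RETRYABLE` set is an illustrative assumption:

```python
# Error types that indicate a permanently malformed chunk; retrying cannot
# succeed. This set is an illustrative assumption, not fluent-bit behavior.
NON_RETRYABLE = {"mapper_parsing_exception", "action_request_validation_exception"}

def should_retry(bulk_response: dict) -> bool:
    """Decide whether a failed bulk request is worth retrying.

    Retry only if every per-item error is transient (e.g. 429/503);
    give up as soon as a non-retryable error type appears.
    """
    if not bulk_response.get("errors"):
        return False  # everything was indexed; nothing to retry
    for item in bulk_response.get("items", []):
        result = item.get("index") or item.get("create") or {}
        error = result.get("error")
        if error and error.get("type") in NON_RETRYABLE:
            return False  # malformed data: retrying only wastes CPU/memory
    return True

resp = {
    "errors": True,
    "items": [
        {"index": {"status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [ts]"}}}
    ],
}
print(should_retry(resp))  # False: drop the chunk instead of retrying
```

With such a check, a 400 from a bad mapping drops the chunk immediately, while a 429/503 from an overloaded cluster still goes through the normal retry path.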
The elasticsearch output plugin has exactly the same issue.
This was mentioned in #4905, but unfortunately that issue was auto-marked as unplanned and closed on June 8, 2023. 🙏
Describe the solution you'd like
It would be great if the output plugin provided additional configuration parameters to fine-tune when to retry:
Retry only when the failure is due to a network error
Never retry a failed chunk when the error type is in a user-specified list
The relevant OpenSearch error types include:
mapper_parsing_exception: Occurs when there are issues parsing the mapping definitions or field types
action_request_validation_exception: Happens when bulk request validation fails
illegal_argument_exception: General errors for invalid parameters
index_not_found_exception: When the target index doesn't exist
resource_already_exists_exception: When trying to create an index that already exists
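As a sketch of what the requested parameters could look like (the names `Retry_On_Network_Error_Only` and `No_Retry_Error_Types` are invented here for illustration; they do not exist in fluent-bit today):

```ini
[OUTPUT]
    Name                        opensearch
    Match                       *
    Host                        opensearch.example.com
    Port                        9200
    Retry_Limit                 5
    # Hypothetical: retry only on connection/transport failures
    Retry_On_Network_Error_Only On
    # Hypothetical: drop chunks whose bulk response contains these error types
    No_Retry_Error_Types        mapper_parsing_exception,action_request_validation_exception
```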
Describe alternatives you've considered
I am currently using Retry_Limit in combination with an exponential backoff strategy, but it is far from ideal.
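For reference, that workaround looks roughly like this (`scheduler.base` and `scheduler.cap` control fluent-bit's exponential backoff window; the values shown are illustrative):

```ini
[SERVICE]
    # Retry wait grows exponentially from base up to cap (seconds)
    scheduler.base  5
    scheduler.cap   300

[OUTPUT]
    Name         opensearch
    Match        *
    Retry_Limit  10
```

The problem remains: a chunk that fails with mapper_parsing_exception still burns through all 10 attempts before being dropped.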