Bug Report
Describe the bug
When sending messages over the fluent-bit kafka output, we see error messages in fluent-bit after a while. These errors appear semi-randomly, not aligned with the 30-second cadence at which we flush logs to Kafka.
Also, even though we see this error, we do not lose any messages; everything successfully reaches our Kafka servers.
To Reproduce
- Example log
[2025/03/24 16:44:54] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:44:55] [error] [output:kafka:kafka.2] fluent-bit#producer-3: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:45:02] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:45:15] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:45:15] [error] [output:kafka:kafka.2] fluent-bit#producer-3: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
- Kafka Output Example
[OUTPUT]
Name kafka
Match tachyon.logs.crust.*
Timestamp_key @timestamp
Timestamp_format iso8601
Brokers (address1), (address2), (address3), (address4), (address5)
Topics log-core
# rdkafka.ssl.certificate.location /etc/ssl
# rdkafka.ssl.key.location /certs/some.key
# rdkafka.ssl.ca.location /certs/some-bundle.crt
rdkafka.security.protocol ssl
rdkafka.request.required.acks 1
rdkafka.log.connection.close false
storage.total_limit_size 5M
# Timeout settings
rdkafka.request.timeout.ms 10000
rdkafka.message.timeout.ms 70000
rdkafka.connections.max.idle.ms 20000
In the main config:
[SERVICE]
Flush 30
Daemon off
tls on
tls.verify on
tls.ca_path /etc/ssl/
Expected behavior
We expect not to see these errors when all 5 of our brokers are running successfully.
Your Environment
- Version used: 3.0.7 and 3.2.10
- Environment name and version (e.g. Kubernetes? What version?): Kubernetes, GitVersion:"v1.18.2-rc2+k3s1"
- Server type and version:
- Operating System and version: Ubuntu 20.04.6 LTS
- Filters and plugins: nest, modify, record_modifier, lua
Additional context
We inspected the open connections using net-tools and nsenter -t. We found that many connections stayed open while fluent-bit remained active against our 5 brokers: three inputs using this output, times 5 brokers, meant around 15 open connections at a time.
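As a rough illustration of how we arrived at that count: in practice we ran `ss`/netstat inside fluent-bit's network namespace via nsenter, but the counting step amounts to the sketch below (the addresses are placeholders; 9096 is the broker port from this report).

```shell
# Hypothetical sample of established-connection output, one line per
# open connection from fluent-bit to a broker (placeholder addresses).
sample='ESTAB 0 0 10.0.0.5:51234 10.1.0.1:9096
ESTAB 0 0 10.0.0.5:51236 10.1.0.2:9096
ESTAB 0 0 10.0.0.5:51238 10.1.0.3:9096'
# Count lines matching the broker port, i.e. open broker connections.
printf '%s\n' "$sample" | grep -c ':9096'
# prints 3
```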
To reduce this we set rdkafka.connections.max.idle.ms 20000, which brings the connections back down to 3-4 (for our 3 inputs). However, with that setting we see the "brokers are down" error. Increasing it to 70000 gets rid of the error, but raises our connection count to 4-6.
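As a sketch of the trade-off (settings taken from the [OUTPUT] section above; only the idle timeout differs between the two behaviors we observed):

```
[OUTPUT]
    Name     kafka
    Match    tachyon.logs.crust.*
    Brokers  (address1), (address2), (address3), (address4), (address5)
    Topics   log-core
    rdkafka.security.protocol       ssl
    # 20000 ms keeps the connection count low (3-4) but triggers the
    # "5/5 brokers are down" error; 70000 ms avoids the error at the
    # cost of 4-6 concurrent connections.
    rdkafka.connections.max.idle.ms 70000
```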