
Connecting and sending to Kafka, but seeing 'brokers are down' error. #10170

@Gwave6123

Description


Bug Report

Describe the bug
When sending messages through the Fluent Bit kafka output, we see error messages in Fluent Bit after a while. These messages appear semi-randomly, not on the 30-second cadence at which we flush logs to Kafka.

Also, even though we see this error, we do not lose any messages; everything successfully reaches our Kafka servers.

To Reproduce

  • Example log
[2025/03/24 16:44:54] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:44:55] [error] [output:kafka:kafka.2] fluent-bit#producer-3: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:45:02] [error] [output:kafka:kafka.0] fluent-bit#producer-1: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:45:15] [error] [output:kafka:kafka.1] fluent-bit#producer-2: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
[2025/03/24 16:45:15] [error] [output:kafka:kafka.2] fluent-bit#producer-3: [thrd:ssl://(address):9096/bootstrap]: 5/5 brokers are down
  • Kafka Output Example
[OUTPUT]
        Name        kafka
        Match       tachyon.logs.crust.*
        Timestamp_key @timestamp
        Timestamp_format iso8601
        Brokers     (address1), (address2), (address3), (address4), (address5)
        Topics      log-core

        # rdkafka.ssl.certificate.location /etc/ssl
        # rdkafka.ssl.key.location /certs/some.key
        # rdkafka.ssl.ca.location /certs/some-bundle.crt
        rdkafka.security.protocol ssl
        rdkafka.request.required.acks 1
        rdkafka.log.connection.close false
        storage.total_limit_size 5M

        # Timeout settings
        rdkafka.request.timeout.ms 10000
        rdkafka.message.timeout.ms 70000
        rdkafka.connections.max.idle.ms  20000
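
With the 30-second Flush interval, an idle timeout of 20000 ms allows every broker connection to be reaped between flushes, so the producer can momentarily have zero live connections. For reference, a sketch of the tuned setting (70000 ms here is illustrative; see Additional context below for the measured effect):

```text
        # Keep idle connections alive longer than the 30 s Flush interval
        # so at least one broker connection survives between sends.
        rdkafka.connections.max.idle.ms 70000
```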

On the main config:

[SERVICE]
    Flush           30
    Daemon          off
    tls             on
    tls.verify      on
    tls.ca_path     /etc/ssl/

Expected behavior
We expect not to see these errors when all 5 of our brokers are running and healthy.

Your Environment

  • Version used: 3.0.7 and 3.2.10
  • Environment name and version (e.g. Kubernetes? What version?): Kubernetes, GitVersion:"v1.18.2-rc2+k3s1"
  • Server type and version:
  • Operating System and version: Ubuntu 20.04.6 LTS
  • Filters and plugins: nest, modify, record_modifier, lua

Additional context
We inspected the open connections using net-tools and nsenter -t. While Fluent Bit remained active, many connections stayed open to our 5 brokers: with three inputs using this output and 5 brokers, we saw around 15 open connections at a time.

To reduce this we set rdkafka.connections.max.idle.ms 20000, which brings the connections back down to 3-4 (for our 3 inputs). However, with that setting we see the "brokers are down" error. Increasing it to 70000 gets rid of the error, but raises our connection count to 4-6.
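
A minimal sketch of the connection check described above, assuming Fluent Bit runs as a process named fluent-bit on the node and the brokers listen on port 9096 (as in the logs); this uses iproute2's ss rather than net-tools' netstat, and requires root:

```shell
# Find the fluent-bit process and count established TCP connections to
# the broker port (9096, per the logs above) inside its network namespace.
FB_PID=$(pidof fluent-bit)
nsenter -t "$FB_PID" -n ss -tn state established '( dport = :9096 )' | tail -n +2 | wc -l
```

Running this periodically while Fluent Bit flushes makes the idle-reaping effect of connections.max.idle.ms visible as the count rises and falls.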
