
Make startup more robust and prevent auto topic creation when using CruiseControlMetricsReporterSampler #2211

Open · wants to merge 2 commits into base: main

Conversation


@k0b3rIT commented Oct 23, 2024

Summary

  1. Why: During the first startup of the Kafka brokers and Cruise Control, it can happen that one of the brokers is lagging behind, or the cluster is not yet ready (not enough brokers are available) to create the metrics topic (cruise.control.metrics.topic.auto.create=true) with the desired guarantees (replica count). If the startup script/automation does not wait long enough before issuing the CC start command, CC startup can fail.
    There is now retry logic in the CruiseControlMetricsReporterSampler initialization to handle this edge case.
    Two things can happen, depending on Kafka's auto.create.topics.enable config:

auto.create.topics.enable=false

CC startup fails with the following error after an unsuccessful refreshPartitionAssignment(), as the topic does not exist:

java.lang.IllegalStateException: Cruise Control cannot find partitions for the metrics reporter that topic matches __CruiseControlMetrics in the target cluster.
	at com.linkedin.kafka.cruisecontrol.monitor.sampling.CruiseControlMetricsReporterSampler.configure(CruiseControlMetricsReporterSampler.java:195) ~[classes/:?]
	at com.linkedin.kafka.cruisecontrol.config.KafkaCruiseControlConfigUtils.getConfiguredInstance(KafkaCruiseControlConfigUtils.java:49) ~[classes/:?]
	at com.linkedin.kafka.cruisecontrol.config.KafkaCruiseControlConfig.getConfiguredInstance(KafkaCruiseControlConfig.java:98) ~[classes/:?]

auto.create.topics.enable=true

During refreshPartitionAssignment(), the Kafka consumer sends a topic metadata request to the brokers. This request triggers Kafka's auto topic creation mechanism. Depending on which broker actually creates the topic, the creation can be synchronous or asynchronous, so refreshPartitionAssignment() can either fail or succeed. In either case, the topic creation ignores the desired topic properties defined in the CruiseControlMetricsReporter config (cruise.control.metrics.topic.num.partitions, cruise.control.metrics.topic.replication.factor) and silently uses the default topic configuration.
If refreshPartitionAssignment() succeeds, we do not even notice anything.
If refreshPartitionAssignment() fails, CC won't start. However, topic auto creation has been triggered, and the brokers will create the topic. If we check the logs, we will see the error above, but we will have a hard time figuring out why it happened, since the topic exists in Kafka (with the default topic config). If we start CC again, it starts without any issue and uses the topic created by the auto creation mechanism.
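
For illustration only (not part of this PR): a quick way to see which configuration the auto-created topic actually ended up with is to describe it with the standard Kafka Admin client. The bootstrap address below is an assumption.

    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class MetricsTopicCheck {
      public static void main(String[] args) throws Exception {
        String topic = "__CruiseControlMetrics";
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (Admin admin = Admin.create(props)) {
          TopicDescription desc = admin.describeTopics(Set.of(topic)).values().get(topic).get();
          // Broker-side auto-creation uses the broker defaults (num.partitions,
          // default.replication.factor), not the cruise.control.metrics.topic.* settings.
          System.out.printf("%s: partitions=%d, replicationFactor=%d%n", topic,
              desc.partitions().size(), desc.partitions().get(0).replicas().size());
        }
      }
    }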

  2. What:
  • Changed the Kafka consumer creation to disable topic auto creation during the metadata request (see the sketch below).
  • Introduced a mechanism to explicitly wait for the metrics topic to appear during the CruiseControlMetricsReporterSampler configuration.
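
As a rough sketch of the first change (a minimal consumer setup with assumed serializers, not the PR's exact code): the consumer-side allow.auto.create.topics flag (KIP-361, Kafka 2.3+) is what keeps the metadata request from triggering broker-side auto-creation.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class MetricConsumerFactory {
      static Consumer<byte[], byte[]> createMetricConsumer(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        // The key line: metadata requests from this consumer no longer
        // auto-create missing topics on the broker side.
        props.setProperty(ConsumerConfig.ALLOW_AUTO_CREATE_TOPICS_CONFIG, "false");
        props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        return new KafkaConsumer<>(props);
      }
    }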

Expected Behavior

The metrics topic is created with the defined configuration.
CC tolerates lagging or slow broker startup.

Actual Behavior

The metrics topic is created with the default topic config.
CC fails instantly if the topic does not exist during startup.

Steps to Reproduce

  1. Set the following broker configs
          metric.reporters: com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter
          cruise.control.metrics.topic: __CruiseControlMetrics
          cruise.control.metrics.topic.auto.create: true
          cruise.control.metrics.topic.num.partitions: 1
          cruise.control.metrics.topic.replication.factor: 3
          cruise.control.metrics.topic.auto.create.retries: 12
          cruise.control.metrics.topic.auto.create.timeout.ms: 15000
          auto.create.topics.enable: true
  2. Start a 2-node Kafka cluster (this simulates the lagging 3rd broker)
  3. Start CC

Known Workarounds

None.

Categorization

  • documentation
  • bugfix
  • new feature
  • refactor
  • security/CVE
  • other

@@ -36,6 +37,10 @@ public class CruiseControlMetricsReporterSampler extends AbstractMetricSampler {
// Configurations
public static final String METRIC_REPORTER_SAMPLER_BOOTSTRAP_SERVERS = "metric.reporter.sampler.bootstrap.servers";
public static final String METRIC_REPORTER_TOPIC = "metric.reporter.topic";
public static final String METRIC_REPORTER_TOPIC_ASSERT_ATTEMPTS = "metric.reporter.topic.assert.attempts";
Contributor:

nit: we should call the config metric.reporter.sampler.topic.assert.attempts as it is a config of the sampler and not the reporter.

Also, please update the proper section of the Configurations.md file with your configuration. I think you should also mention that this uses exponential backoff, and that overly large values will cause infrequent, long backoffs between retries after a while.

Author:

Thx, I have fixed it!

@@ -166,9 +184,16 @@ public void configure(Map<String, ?> configs) {
.CRUISE_CONTROL_METRICS_REPORTER_LINGER_MS_CONFIG));
_metricConsumer = createMetricConsumer(configs, CONSUMER_CLIENT_ID_PREFIX);
_currentPartitionAssignment = Collections.emptySet();

LOG.info("Waiting for metrics reporter topic [{}] to be available in the target cluster.", _metricReporterTopic);
Contributor:

nit: calling the cluster "target" is a bit misleading as it is rather a source cluster (source of the metrics), but I think we should just say "...to be available in the Kafka cluster".

Author:

Actually, I did not change that; this was the original form (with "target"), as you can see:

[screenshot of the original log line]

I reused that in the new exception message.
I am happy to change it if you think it is better to use the mentioned form.

LOG.info("Waiting for metrics reporter topic [{}] to be available in the target cluster.", _metricReporterTopic);
if (!CruiseControlMetricsUtils.retry(()->!this.isMetricsTopicExists(), metricTopicAssertAttempts)) {
throw new IllegalStateException("Cruise Control cannot find the metrics reporter topic that matches [" + _metricReporterTopic
+ "] in the target cluster.");
Contributor:

Same thing about "target" as above.

- throw new IllegalStateException("Cruise Control cannot find partitions for the metrics reporter that topic matches "
-     + _metricReporterTopic + " in the target cluster.");
+ throw new IllegalStateException("Cruise Control cannot find partitions for the metrics reporter that topic matches ["
+     + _metricReporterTopic + "] in the target cluster.");
Contributor:

Another "target".

@@ -166,9 +184,16 @@ public void configure(Map<String, ?> configs) {
.CRUISE_CONTROL_METRICS_REPORTER_LINGER_MS_CONFIG));
_metricConsumer = createMetricConsumer(configs, CONSUMER_CLIENT_ID_PREFIX);
_currentPartitionAssignment = Collections.emptySet();

LOG.info("Waiting for metrics reporter topic [{}] to be available in the target cluster.", _metricReporterTopic);
if (!CruiseControlMetricsUtils.retry(()->!this.isMetricsTopicExists(), metricTopicAssertAttempts)) {
Contributor:

As I see, it uses exponential backoff periods. At the 7th retry the wait is already a bit more than 10 minutes, which I think may be unreasonably long, and for the 8th retry it waits 21 minutes, which may be too long between retries. I'm not sure this is the right approach, at least not with the current parameters.
I think we should either use 1 as the base, so we retry every 5 seconds n times, or something closer to 1, like 1.25. With that parameter we can try 18 times until we get to 22 minutes total; rounding up, 20 tries gives a 35-minute total retry time. I think it's more reasonable to try 20 times in 35 minutes than 6 times, as it allows quicker startup.
Another approach is to use constant retry intervals with a given timeout. I think that is a more user-friendly approach, as it's easier to calculate with than with exponents, especially since listTopics() is just a metadata call that is fine to make every 10 seconds or so with a single consumer. Overall I'm OK with the exponential approach if you or others agree on it, but I favor the interval+timeout one for usability reasons. Hopefully, if we set a good default, users won't have to change it too often, and then it may matter less.
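
To make the trade-off concrete, a small illustrative calculation (assuming a 5-second base delay; this is not the PR's actual retry implementation, and the exact figures depend on how the backoff is parameterized):

    // Cumulative wait time for a retry schedule: the delay starts at baseSeconds
    // and is multiplied by `multiplier` after each attempt.
    public class RetryBackoffMath {
      static long totalWaitSeconds(int attempts, double multiplier, long baseSeconds) {
        long total = 0;
        double delay = baseSeconds;
        for (int i = 0; i < attempts; i++) {
          total += (long) delay;   // wait before the next attempt
          delay *= multiplier;     // multiplier == 1.0 gives a constant interval
        }
        return total;
      }

      public static void main(String[] args) {
        // Base 2: ~21 min over 8 attempts, with a ~10-minute gap before the last one.
        System.out.println(totalWaitSeconds(8, 2.0, 5) / 60 + " min");
        // Base 1.25: ~28 min over 20 attempts, no single gap longer than ~6 minutes.
        System.out.println(totalWaitSeconds(20, 1.25, 5) / 60 + " min");
        // Constant interval + timeout: trivially predictable (60 x 10 s = 10 min).
        System.out.println(totalWaitSeconds(60, 1.0, 10) / 60 + " min");
      }
    }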

@viktorsomogyi (Contributor):

@mhratson would you please review this?
