
Conversation

@il-kyun
Contributor

@il-kyun il-kyun commented Aug 18, 2025

Summary

  1. Why: Per-entity AdminClient calls (incrementalAlterConfigs/describeConfigs) scale poorly on large clusters and are fragile when topics are deleted mid-run.
  2. What: Replaced the legacy per-entity configuration path with a bulk-based implementation as the new default behavior.
    The new design applies and verifies broker and topic throttles in batches (see the sketch below).
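
For illustration, a minimal sketch of the bulk approach (class, method, and timeout values are illustrative assumptions, not the PR's actual code), setting leader/follower throttle rates on a set of brokers with a single incrementalAlterConfigs call:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public final class BulkThrottleSketch {
  // Hypothetical helper: set leader/follower throttle rates on all participating brokers in one request.
  static void setBrokerThrottleRates(Admin adminClient, Set<Integer> brokerIds, long rateBytesPerSec)
      throws Exception {
    Map<ConfigResource, Collection<AlterConfigOp>> opsByResource = new HashMap<>();
    for (int brokerId : brokerIds) {
      ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, String.valueOf(brokerId));
      List<AlterConfigOp> ops = new ArrayList<>();
      ops.add(new AlterConfigOp(
          new ConfigEntry("leader.replication.throttled.rate", String.valueOf(rateBytesPerSec)),
          AlterConfigOp.OpType.SET));
      ops.add(new AlterConfigOp(
          new ConfigEntry("follower.replication.throttled.rate", String.valueOf(rateBytesPerSec)),
          AlterConfigOp.OpType.SET));
      opsByResource.put(broker, ops);
    }
    // One AdminClient round-trip covers every broker instead of one incrementalAlterConfigs call per broker.
    adminClient.incrementalAlterConfigs(opsByResource).all().get(30, TimeUnit.SECONDS);
  }
}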

Expected Behavior

  • Same throttling semantics as before with significantly fewer AdminClient round-trips and lower latency on large operations.
  • Bulk verification reduces describe calls by grouping resources.
  • Non-existent topics are detected and skipped without failing the whole operation (see the sketch after this list).
  • Safe no-op on empty inputs.
  • Continues to respect wildcard * and static broker configs (skips removal of static values).
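
As an illustration of the bulk-verification and skip-missing-topics behavior, a hedged sketch (class and method names are hypothetical, not the PR's code): it groups topic describes into one describeConfigs request and drops topics whose futures fail with UnknownTopicOrPartitionException:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.KafkaFuture;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;

public final class BulkDescribeSketch {
  // Hypothetical helper: describe many topic configs in one call, dropping topics deleted mid-run.
  static Map<String, Config> describeExistingTopicConfigs(Admin adminClient, List<String> topics)
      throws Exception {
    List<ConfigResource> resources = topics.stream()
        .map(t -> new ConfigResource(ConfigResource.Type.TOPIC, t))
        .collect(Collectors.toList());
    Map<ConfigResource, KafkaFuture<Config>> futures = adminClient.describeConfigs(resources).values();
    Map<String, Config> configsByTopic = new HashMap<>();
    for (Map.Entry<ConfigResource, KafkaFuture<Config>> entry : futures.entrySet()) {
      try {
        configsByTopic.put(entry.getKey().name(), entry.getValue().get(30, TimeUnit.SECONDS));
      } catch (ExecutionException e) {
        if (e.getCause() instanceof UnknownTopicOrPartitionException) {
          // Topic was deleted while the operation was in flight; skip it instead of failing the run.
          continue;
        }
        throw e;
      }
    }
    return configsByTopic;
  }
}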

Actual Behavior (Before Change)

  • The previous per-entity path issued separate AdminClient calls for each broker and topic, causing slowdowns and timeouts in large clusters.
  • Config verification was performed per resource, further increasing the call count.

Steps to Reproduce

  1. In a cluster with 100+ brokers and many topics/partitions, generate large inter-broker replica movements.
  2. Compare old per-entity implementation (pre-change) vs. new default bulk implementation.
  3. Record total execution time, AdminClient call counts, and any timeouts/failures.

Additional evidence

  1. Environment: Kafka version, cluster size (brokers/topics/partitions), Cruise Control version.
  2. Logs: Presence of “Removing leader/follower throttle rate …”, and bulk verification messages.
  3. Metrics: Before/after comparisons of call counts, execution time.

Categorization

  • documentation
  • bugfix
  • new feature
  • refactor
  • security/CVE
  • other

This PR resolves #1972

@kyguy
Contributor

kyguy commented Sep 16, 2025

Hi @il-kyun, let me know once you get the CI tests passing, I'll be happy to add a review if you would like!

Contributor

@kyguy kyguy left a comment

This is a really useful enhancement! Why not have this request batching implementation simply replace the existing non-batching implementation instead of having it be configurable? Is there any specific reason users would not want to have the AdminClient operations batched like this?

/**
 * <code>bulk.replication.throttle.bulk.ops.enabled</code>
 */
public static final String BULK_REPLICATION_THROTTLE_BULK_OPS_ENABLED_CONFIG = "bulk.replication.throttle.bulk.ops.enabled";
Contributor

In which scenarios would users not want to bulk alter/describe configs operations? Are there any drawbacks of enabling this by default?

Contributor Author

@mimaison Thanks for bringing that up!
We actually discussed this earlier here: at the time, we kept _useBulkOps configurable mainly out of caution, since we hadn’t yet fully validated that replacing the per-entity logic wouldn’t introduce unexpected side effects.
The latest revision removes the _useBulkOps flag and makes the bulk path the default behavior to simplify the codebase and reduce configuration complexity.

import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.KafkaFuture;
import org.apache.kafka.server.config.QuotaConfigs;
Contributor

This is not part of the Kafka public API, we should avoid using it.
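
One way to avoid depending on the internal class is to declare the throttle config names locally; a hedged fragment (whether the PR ended up doing exactly this is not shown in this thread), using the standard Kafka config key strings:

// Declared locally instead of importing org.apache.kafka.server.config.QuotaConfigs (internal, non-public API).
// Broker dynamic configs for throttle rates:
private static final String LEADER_REPLICATION_THROTTLED_RATE_CONFIG = "leader.replication.throttled.rate";
private static final String FOLLOWER_REPLICATION_THROTTLED_RATE_CONFIG = "follower.replication.throttled.rate";
// Topic configs for throttled replica lists:
private static final String LEADER_REPLICATION_THROTTLED_REPLICAS_CONFIG = "leader.replication.throttled.replicas";
private static final String FOLLOWER_REPLICATION_THROTTLED_REPLICAS_CONFIG = "follower.replication.throttled.replicas";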

Contributor

There's something wrong with this file; it does not compile.

Contributor Author

I rebased onto main and resolved the conflicts — it’s all good now.

@il-kyun il-kyun force-pushed the feature/repl-throttle-helper-optimize branch from 9311e2d to d1b3c4c on October 14, 2025 17:12
@il-kyun il-kyun requested review from kyguy and mimaison on October 24, 2025 14:08
Contributor

@kyguy kyguy left a comment

Thanks for the updates @il-kyun! I just had a quick pass, mostly minor comments so far, I'll have a closer look later!

I was wondering if it would make sense to batch the bulk operations to reduce memory footprint and improve concurrency control when working with large Kafka clusters. I left a comment concerning that below, let me know what you think!

Comment on lines 155 to 158
participatingBrokers.addAll(
proposal.oldReplicas().stream().map(ReplicaPlacementInfo::brokerId).collect(Collectors.toSet()));
participatingBrokers.addAll(
proposal.newReplicas().stream().map(ReplicaPlacementInfo::brokerId).collect(Collectors.toSet()));
Contributor

Are these changes necessary? It looks like this is just changing the formatting

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I reverted it. 44251ab

Comment on lines +438 to +441
List<ConfigResource> resources = brokerIds.stream()
    .map(id -> new ConfigResource(ConfigResource.Type.BROKER, String.valueOf(id)))
    .collect(Collectors.toList());
return _adminClient.describeConfigs(resources).all().get(CLIENT_REQUEST_TIMEOUT_MS, TimeUnit.MILLISECONDS);
Contributor

For the cases where we are dealing with large Kafka clusters, would it make sense to batch the requests here? From what I understand it would help avoid memory and concurrency issues with the admin client. Alternatively, maybe it would make more sense to batch the set of brokerIds in the calling methods to reduce the memory impact of storing the broker configs as well. What do you think?

Contributor Author

Yes — splitting the broker set before calling describeConfigs makes sense to reduce payload size, heap pressure, and timeout risk in the AdminClient.

Recomputing with ~100 KB per broker:
• 50 brokers ≈ ~5 MB per call
• 40 brokers ≈ ~4 MB per call
• 25 brokers ≈ ~2.5 MB per call

Given this, I think 25 is a reasonable default batch size — it keeps each request lightweight while maintaining good throughput. The value should remain configurable so it can be tuned based on cluster size and network performance.
Would 25 per batch as a default (configurable) work for you?
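
A hedged sketch of that batching idea (the batch size constant, helper name, and timeout are illustrative assumptions, not the final PR code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public final class BatchedDescribeSketch {
  // Illustrative default; a real implementation would make this configurable.
  private static final int DESCRIBE_CONFIGS_BATCH_SIZE = 25;

  // Hypothetical helper: describe broker configs in batches of at most 25 brokers per AdminClient call.
  static Map<ConfigResource, Config> describeBrokerConfigsInBatches(Admin adminClient, Set<Integer> brokerIds)
      throws Exception {
    List<Integer> ids = new ArrayList<>(brokerIds);
    Map<ConfigResource, Config> result = new HashMap<>();
    for (int start = 0; start < ids.size(); start += DESCRIBE_CONFIGS_BATCH_SIZE) {
      List<ConfigResource> batch = ids.subList(start, Math.min(start + DESCRIBE_CONFIGS_BATCH_SIZE, ids.size()))
          .stream()
          .map(id -> new ConfigResource(ConfigResource.Type.BROKER, String.valueOf(id)))
          .collect(Collectors.toList());
      // Each batch keeps the request/response payload small and bounds heap usage per call.
      result.putAll(adminClient.describeConfigs(batch).all().get(30, TimeUnit.SECONDS));
    }
    return result;
  }
}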

Comment on lines 43 to 45
private static final List<String> REPLICATION_THROTTLED_RATE_CONFIGS = Arrays.asList(
    LEADER_REPLICATION_THROTTLED_RATE_CONFIG,
    FOLLOWER_REPLICATION_THROTTLED_RATE_CONFIG);
Contributor

Is this necessary? It appears we only use the value in one place.

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I just changed it to use List.of() 96b6471

Comment on lines 48 to 50
private static final List<String> REPLICATION_THROTTLED_REPLICAS_CONFIGS = Arrays.asList(
    LEADER_REPLICATION_THROTTLED_REPLICAS_CONFIG,
    FOLLOWER_REPLICATION_THROTTLED_REPLICAS_CONFIG);
Contributor

Is this necessary? It appears we only use the value in one place.

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I just changed it to use List.of() 96b6471

}
}

void waitForConfigs(Map<ConfigResource, Collection<AlterConfigOp>> opsByResource) {
Contributor

Is there any reason we are keeping the original waitForConfigs() method? Can we completely replace it with this implementation?

Contributor Author

@il-kyun il-kyun Oct 31, 2025

You're right - we can remove the single-resource version. 0fc26ab
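
For context, a minimal sketch of a bulk waitForConfigs over the opsByResource map (class name, timeouts, and the exact verification rules are illustrative assumptions, not the PR's implementation): it issues one describeConfigs call per retry for all altered resources instead of one per resource:

import java.util.Collection;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public final class BulkWaitForConfigsSketch {
  // Hypothetical bulk verifier: each retry covers every altered resource with a single describeConfigs call.
  static void waitForConfigs(Admin adminClient, Map<ConfigResource, Collection<AlterConfigOp>> opsByResource)
      throws Exception {
    long deadlineMs = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(1);
    while (true) {
      Map<ConfigResource, Config> current =
          adminClient.describeConfigs(opsByResource.keySet()).all().get(30, TimeUnit.SECONDS);
      boolean allApplied = opsByResource.entrySet().stream().allMatch(entry ->
          entry.getValue().stream().allMatch(op -> {
            ConfigEntry observed = current.get(entry.getKey()).get(op.configEntry().name());
            if (op.opType() == AlterConfigOp.OpType.DELETE) {
              // Deleted once the dynamic override is gone; the value may fall back to a static/default source.
              return observed == null
                  || (observed.source() != ConfigEntry.ConfigSource.DYNAMIC_BROKER_CONFIG
                      && observed.source() != ConfigEntry.ConfigSource.DYNAMIC_TOPIC_CONFIG);
            }
            return observed != null && op.configEntry().value().equals(observed.value());
          }));
      if (allApplied) {
        return;
      }
      if (System.currentTimeMillis() > deadlineMs) {
        throw new IllegalStateException("Timed out waiting for bulk config changes to be applied");
      }
      Thread.sleep(1000);
    }
  }
}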

applyIncrementalAlterConfigsForBrokers(bulkOps);
}

private void applyIncrementalAlterConfigsForBrokers(Map<ConfigResource, Collection<AlterConfigOp>> bulkOps)
Contributor

This method is meant to replace changeBrokerConfigs, right? If so, would it make sense to keep the method name as changeBrokerConfigs?

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I've renamed it back to changeBrokerConfigs() to maintain consistency with the previous implementation and make the purpose clearer. 44251ab

LOG.debug("Removing leader throttle rate: {} on broker {}", currLeaderThrottle, brokerId);
ops.add(new AlterConfigOp(new ConfigEntry(LEADER_REPLICATION_THROTTLED_RATE_CONFIG, null), AlterConfigOp.OpType.DELETE));
}
private void applyIncrementalAlterConfigsForTopics(Map<ConfigResource, Collection<AlterConfigOp>> bulkOps)
Contributor

This method is meant to replace changeTopicConfigs, right? If so, would it make sense to keep the method name as changeTopicConfigs?

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I've renamed it back to changeTopicConfigs() for consistency and clarity. 44251ab

@il-kyun
Contributor Author

il-kyun commented Nov 1, 2025

@CCisGG
Thanks for checking it so quickly.

I think I may have misunderstood earlier — this PR seems to be a more appropriate solution.
Please take a look when you have some time.
#2304 (comment)

@CCisGG
Contributor

CCisGG commented Nov 6, 2025

> @CCisGG
> Thanks for checking it so quickly.
>
> I think I may have misunderstood earlier — this PR seems to be a more appropriate solution.
> Please take a look when you have some time.
> #2304 (comment)

Just to confirm, are you saying #2304 is not a good path to go, and instead we use this PR 2305 to address the issue?

@il-kyun
Contributor Author

il-kyun commented Nov 16, 2025

> Just to confirm, are you saying #2304 is not a good path to go, and instead we use this PR 2305 to address the issue?

@CCisGG
Not exactly. #2304 optimizes when we apply throttling by setting the replication throttle once before the inter-broker phase and clearing it once after, which avoids per-batch set/clear churn. #2305 optimizes a different layer—how we apply it—by batching the AdminClient operations in ReplicationThrottleHelper (apply/verify) instead of issuing one call per entity. So I don’t see them as substitutes; they’re complementary options that can be used together: #2304 reduces throttle churn, and #2305 reduces AdminClient round-trips. Also, for #2304, I’d keep it configuration-driven as in the initial proposal.
I’d appreciate your thoughts on this approach—does this align with your view?



Development

Successfully merging this pull request may close these issues.

Redundant set/unset throttling during the rebalance
