
Conversation

@il-kyun
Contributor

@il-kyun il-kyun commented Aug 18, 2025

Summary

  1. Why: Per-entity AdminClient calls (incrementalAlterConfigs/describeConfigs) scale poorly on large clusters and are fragile when topics are deleted mid-run.
  2. What: Replaced the legacy per-entity configuration path with a bulk-based implementation as the new default behavior.
    The new design applies and verifies broker and topic throttles in batches (see the sketch below).
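
For illustration, a minimal sketch of the bulk approach (class, method, and timeout values are illustrative assumptions, not the PR's actual code), setting leader/follower throttle rates on a set of brokers with a single incrementalAlterConfigs call:

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public final class BulkThrottleSketch {
  // Hypothetical helper: set leader/follower throttle rates on all participating brokers in one request.
  static void setBrokerThrottleRates(Admin adminClient, Set<Integer> brokerIds, long rateBytesPerSec)
      throws Exception {
    Map<ConfigResource, Collection<AlterConfigOp>> opsByResource = new HashMap<>();
    for (int brokerId : brokerIds) {
      ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, String.valueOf(brokerId));
      List<AlterConfigOp> ops = new ArrayList<>();
      ops.add(new AlterConfigOp(
          new ConfigEntry("leader.replication.throttled.rate", String.valueOf(rateBytesPerSec)),
          AlterConfigOp.OpType.SET));
      ops.add(new AlterConfigOp(
          new ConfigEntry("follower.replication.throttled.rate", String.valueOf(rateBytesPerSec)),
          AlterConfigOp.OpType.SET));
      opsByResource.put(broker, ops);
    }
    // One AdminClient round-trip covers every broker instead of one incrementalAlterConfigs call per broker.
    adminClient.incrementalAlterConfigs(opsByResource).all().get(30, TimeUnit.SECONDS);
  }
}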

Expected Behavior

  • Same throttling semantics as before with significantly fewer AdminClient round-trips and lower latency on large operations.
  • Bulk verification reduces describe calls by grouping resources.
  • Non-existent topics are detected and skipped without failing the whole operation (see the sketch after this list).
  • Safe no-op on empty inputs.
  • Continues to respect wildcard * and static broker configs (skips removal of static values).
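
As an illustration of the bulk-verification and skip-missing-topics behavior, a hedged sketch (class and method names are hypothetical, not the PR's code): it groups topic describes into one describeConfigs request and drops topics whose futures fail with UnknownTopicOrPartitionException:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.KafkaFuture;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;

public final class BulkDescribeSketch {
  // Hypothetical helper: describe many topic configs in one call, dropping topics deleted mid-run.
  static Map<String, Config> describeExistingTopicConfigs(Admin adminClient, List<String> topics)
      throws Exception {
    List<ConfigResource> resources = topics.stream()
        .map(t -> new ConfigResource(ConfigResource.Type.TOPIC, t))
        .collect(Collectors.toList());
    Map<ConfigResource, KafkaFuture<Config>> futures = adminClient.describeConfigs(resources).values();
    Map<String, Config> configsByTopic = new HashMap<>();
    for (Map.Entry<ConfigResource, KafkaFuture<Config>> entry : futures.entrySet()) {
      try {
        configsByTopic.put(entry.getKey().name(), entry.getValue().get(30, TimeUnit.SECONDS));
      } catch (ExecutionException e) {
        if (e.getCause() instanceof UnknownTopicOrPartitionException) {
          // Topic was deleted while the operation was in flight; skip it instead of failing the run.
          continue;
        }
        throw e;
      }
    }
    return configsByTopic;
  }
}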

Actual Behavior (Before Change)

  • The previous per-entity path issued separate AdminClient calls for each broker and topic, causing slowdowns and timeouts in large clusters.
  • Config verification was performed per resource, further increasing the call count.

Steps to Reproduce

  1. In a cluster with 100+ brokers and many topics/partitions, generate large inter-broker replica movements.
  2. Compare old per-entity implementation (pre-change) vs. new default bulk implementation.
  3. Record total execution time, AdminClient call counts, and any timeouts/failures.

Additional evidence

  1. Environment: Kafka version, cluster size (brokers/topics/partitions), Cruise Control version.
  2. Logs: Presence of “Removing leader/follower throttle rate …”, and bulk verification messages.
  3. Metrics: Before/after comparisons of call counts, execution time.

Categorization

  • documentation
  • bugfix
  • new feature
  • refactor
  • security/CVE
  • other

This PR resolves #1972

@kyguy
Contributor

kyguy commented Sep 16, 2025

Hi @il-kyun, let me know once you get the CI tests passing, I'll be happy to add a review if you would like!

Contributor

@kyguy kyguy left a comment

This is a really useful enhancement! Why not have this request batching implementation simply replace the existing non-batching implementation instead of having it be configurable? Is there any specific reason users would not want to have the AdminClient operations batched like this?

/**
 * <code>bulk.replication.throttle.bulk.ops.enabled</code>
 */
public static final String BULK_REPLICATION_THROTTLE_BULK_OPS_ENABLED_CONFIG = "bulk.replication.throttle.bulk.ops.enabled";
Contributor

In which scenarios would users not want to bulk alter/describe configs operations? Are there any drawbacks of enabling this by default?

Contributor Author

@mimaison Thanks for bringing that up!
We actually discussed this earlier here: at the time, we kept _useBulkOps configurable mainly out of caution, since we hadn’t yet fully validated that replacing the per-entity logic wouldn’t introduce unexpected side effects.
The latest revision removes the _useBulkOps flag and makes the bulk path the default behavior to simplify the codebase and reduce configuration complexity.

import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.KafkaFuture;
import org.apache.kafka.server.config.QuotaConfigs;
Contributor

This is not part of the Kafka public API, we should avoid using it.
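
One way to avoid depending on the internal class is to declare the throttle config names locally; a hedged fragment (whether the PR ended up doing exactly this is not shown in this thread), using the standard Kafka config key strings:

// Declared locally instead of importing org.apache.kafka.server.config.QuotaConfigs (internal, non-public API).
// Broker dynamic configs for throttle rates:
private static final String LEADER_REPLICATION_THROTTLED_RATE_CONFIG = "leader.replication.throttled.rate";
private static final String FOLLOWER_REPLICATION_THROTTLED_RATE_CONFIG = "follower.replication.throttled.rate";
// Topic configs for throttled replica lists:
private static final String LEADER_REPLICATION_THROTTLED_REPLICAS_CONFIG = "leader.replication.throttled.replicas";
private static final String FOLLOWER_REPLICATION_THROTTLED_REPLICAS_CONFIG = "follower.replication.throttled.replicas";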

Contributor

There's something wrong with this file; it does not compile.

Contributor Author

I rebased onto main and resolved the conflicts — it’s all good now.

@il-kyun il-kyun force-pushed the feature/repl-throttle-helper-optimize branch from 9311e2d to d1b3c4c on October 14, 2025 17:12
@il-kyun il-kyun requested review from kyguy and mimaison on October 24, 2025 14:08
Contributor

@kyguy kyguy left a comment

Thanks for the updates @il-kyun! I just had a quick pass, mostly minor comments so far, I'll have a closer look later!

I was wondering if it would make sense to batch the bulk operations to reduce memory footprint and improve concurrency control when working with large Kafka clusters. I left a comment concerning that below, let me know what you think!

Comment on lines 155 to 158
participatingBrokers.addAll(
proposal.oldReplicas().stream().map(ReplicaPlacementInfo::brokerId).collect(Collectors.toSet()));
participatingBrokers.addAll(
proposal.newReplicas().stream().map(ReplicaPlacementInfo::brokerId).collect(Collectors.toSet()));
Contributor

Are these changes necessary? It looks like this is just changing the formatting

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I reverted it. 44251ab

Comment on lines +438 to +441
List<ConfigResource> resources = brokerIds.stream()
    .map(id -> new ConfigResource(ConfigResource.Type.BROKER, String.valueOf(id)))
    .collect(Collectors.toList());
return _adminClient.describeConfigs(resources).all().get(CLIENT_REQUEST_TIMEOUT_MS, TimeUnit.MILLISECONDS);
Contributor

For the cases where we are dealing with large Kafka clusters, would it make sense to batch the requests here? From what I understand it would help avoid memory and concurrency issues with the admin client. Alternatively, maybe it would make more sense to batch the set of brokerIds in the calling methods to reduce the memory impact of storing the broker configs as well. What do you think?

Contributor Author

Yes — splitting the broker set before calling describeConfigs makes sense to reduce payload size, heap pressure, and timeout risk in the AdminClient.

Recomputing with ~100 KB per broker:
• 50 brokers ≈ ~5 MB per call
• 40 brokers ≈ ~4 MB per call
• 25 brokers ≈ ~2.5 MB per call

Given this, I think 25 is a reasonable default batch size — it keeps each request lightweight while maintaining good throughput. The value should remain configurable so it can be tuned based on cluster size and network performance.
Would 25 per batch as a default (configurable) work for you?
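
A hedged sketch of that batching idea (the batch size constant, helper name, and timeout are illustrative assumptions, not the final PR code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public final class BatchedDescribeSketch {
  // Illustrative default; a real implementation would make this configurable.
  private static final int DESCRIBE_CONFIGS_BATCH_SIZE = 25;

  // Hypothetical helper: describe broker configs in batches of at most 25 brokers per AdminClient call.
  static Map<ConfigResource, Config> describeBrokerConfigsInBatches(Admin adminClient, Set<Integer> brokerIds)
      throws Exception {
    List<Integer> ids = new ArrayList<>(brokerIds);
    Map<ConfigResource, Config> result = new HashMap<>();
    for (int start = 0; start < ids.size(); start += DESCRIBE_CONFIGS_BATCH_SIZE) {
      List<ConfigResource> batch = ids.subList(start, Math.min(start + DESCRIBE_CONFIGS_BATCH_SIZE, ids.size()))
          .stream()
          .map(id -> new ConfigResource(ConfigResource.Type.BROKER, String.valueOf(id)))
          .collect(Collectors.toList());
      // Each batch keeps the request/response payload small and bounds heap usage per call.
      result.putAll(adminClient.describeConfigs(batch).all().get(30, TimeUnit.SECONDS));
    }
    return result;
  }
}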

Comment on lines 43 to 45
private static final List<String> REPLICATION_THROTTLED_RATE_CONFIGS = Arrays.asList(
    LEADER_REPLICATION_THROTTLED_RATE_CONFIG,
    FOLLOWER_REPLICATION_THROTTLED_RATE_CONFIG);
Contributor

Is this necessary? It appears we only use the value in one place.

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I just changed it to use List.of() 96b6471

Comment on lines 48 to 50
private static final List<String> REPLICATION_THROTTLED_REPLICAS_CONFIGS = Arrays.asList(
    LEADER_REPLICATION_THROTTLED_REPLICAS_CONFIG,
    FOLLOWER_REPLICATION_THROTTLED_REPLICAS_CONFIG);
Contributor

Is this necessary? It appears we only use the value in one place.

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I just changed it to use List.of() 96b6471

}
}

void waitForConfigs(Map<ConfigResource, Collection<AlterConfigOp>> opsByResource) {
Contributor

Is there any reason we are keeping the original waitForConfigs() method? Can we completely replace it with this implementation?

Contributor Author

@il-kyun il-kyun Oct 31, 2025

You're right - we can remove the single-resource version. 0fc26ab
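
For context, a minimal sketch of a bulk waitForConfigs over the opsByResource map (class name, timeouts, and the exact verification rules are illustrative assumptions, not the PR's implementation): it issues one describeConfigs call per retry for all altered resources instead of one per resource:

import java.util.Collection;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public final class BulkWaitForConfigsSketch {
  // Hypothetical bulk verifier: each retry covers every altered resource with a single describeConfigs call.
  static void waitForConfigs(Admin adminClient, Map<ConfigResource, Collection<AlterConfigOp>> opsByResource)
      throws Exception {
    long deadlineMs = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(1);
    while (true) {
      Map<ConfigResource, Config> current =
          adminClient.describeConfigs(opsByResource.keySet()).all().get(30, TimeUnit.SECONDS);
      boolean allApplied = opsByResource.entrySet().stream().allMatch(entry ->
          entry.getValue().stream().allMatch(op -> {
            ConfigEntry observed = current.get(entry.getKey()).get(op.configEntry().name());
            if (op.opType() == AlterConfigOp.OpType.DELETE) {
              // Deleted once the dynamic override is gone; the value may fall back to a static/default source.
              return observed == null
                  || (observed.source() != ConfigEntry.ConfigSource.DYNAMIC_BROKER_CONFIG
                      && observed.source() != ConfigEntry.ConfigSource.DYNAMIC_TOPIC_CONFIG);
            }
            return observed != null && op.configEntry().value().equals(observed.value());
          }));
      if (allApplied) {
        return;
      }
      if (System.currentTimeMillis() > deadlineMs) {
        throw new IllegalStateException("Timed out waiting for bulk config changes to be applied");
      }
      Thread.sleep(1000);
    }
  }
}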

applyIncrementalAlterConfigsForBrokers(bulkOps);
}

private void applyIncrementalAlterConfigsForBrokers(Map<ConfigResource, Collection<AlterConfigOp>> bulkOps)
Contributor

This method is meant to replace changeBrokerConfigs, right? If so, would it make sense to keep the method name as changeBrokerConfigs?

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I've renamed it back to changeBrokerConfigs() to maintain consistency with the previous implementation and make the purpose clearer. 44251ab

LOG.debug("Removing leader throttle rate: {} on broker {}", currLeaderThrottle, brokerId);
ops.add(new AlterConfigOp(new ConfigEntry(LEADER_REPLICATION_THROTTLED_RATE_CONFIG, null), AlterConfigOp.OpType.DELETE));
}
private void applyIncrementalAlterConfigsForTopics(Map<ConfigResource, Collection<AlterConfigOp>> bulkOps)
Contributor

This method is meant to replace changeTopicConfigs, right? If so, would it make sense to keep the method name as changeTopicConfigs?

Contributor Author

@il-kyun il-kyun Oct 31, 2025

I've renamed it back to changeTopicConfigs() for consistency and clarity. 44251ab

@il-kyun
Contributor Author

il-kyun commented Nov 1, 2025

@CCisGG
Thanks for checking it so quickly.

I think I may have misunderstood earlier — this PR seems to be a more appropriate solution.
Please take a look when you have some time.
#2304 (comment)

@CCisGG
Contributor

CCisGG commented Nov 6, 2025

> @CCisGG
> Thanks for checking it so quickly.
>
> I think I may have misunderstood earlier — this PR seems to be a more appropriate solution.
> Please take a look when you have some time.
> #2304 (comment)

Just to confirm, are you saying #2304 is not a good path to go, and instead we use this PR 2305 to address the issue?

@il-kyun
Contributor Author

il-kyun commented Nov 16, 2025

> Just to confirm, are you saying #2304 is not a good path to go, and instead we use this PR 2305 to address the issue?

@CCisGG
Not exactly. #2304 optimizes when we apply throttling by setting the replication throttle once before the inter-broker phase and clearing it once after, which avoids per-batch set/clear churn. #2305 optimizes a different layer—how we apply it—by batching the AdminClient operations in ReplicationThrottleHelper (apply/verify) instead of issuing one call per entity. So I don’t see them as substitutes; they’re complementary options that can be used together: #2304 reduces throttle churn, and #2305 reduces AdminClient round-trips. Also, for #2304, I’d keep it configuration-driven as in the initial proposal.
I’d appreciate your thoughts on this approach—does this align with your view?



Development

Successfully merging this pull request may close these issues.

Redundant set/unset throttling during the rebalance
