Conversation

@dgldn dgldn commented Jul 17, 2025

Summary

This feature is based entirely on work by @ecojan, who worked on the original PR. It was previously reviewed by @alexander-falca, @amuraru, @zornhsu, and @CCisGG.

  1. Why:
    This PR resolves #1264 by adding goal violation detection and self-healing for intra-broker goals. It does not, however, account for race conditions with cluster-level goals; intra-broker self-healing only triggers after those goals have finished.
    The scope of this PR is limited to letting Cruise Control run intra-broker goal violation detection and execute self-healing. Future work that makes intra-broker goals and regular goals work together might make this obsolete.

  2. What:
    Implements intra-broker goal violation detection and self-healing for JBOD (multiple logDirs) Kafka clusters.

New Configurations:

anomaly.detection.intra.broker.goals= List of intra-broker goals that the anomaly detector checks for violations
self.healing.intra.broker.goals= List of intra-broker goals used for self-healing
self.healing.intra.broker.goal.violation.enabled= Whether self-healing is enabled for intra-broker goals. This configuration defaults to the value of self.healing.enabled

The configs anomaly.detection.intra.broker.goals and self.healing.intra.broker.goals both default to com.linkedin.kafka.cruisecontrol.analyzer.goals.IntraBrokerDiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.IntraBrokerDiskUsageDistributionGoal since these are the only cruise control intra-broker goals available at the moment.

The self.healing.intra.broker.goal.violation.enabled config can be disabled by explicitly overriding it to false.
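
The defaulting rule described above can be sketched as follows. This is a minimal illustration that uses plain java.util.Properties rather than Cruise Control's own config classes; the class and method names are hypothetical, and treating self.healing.enabled as false when it is absent is an assumption of the sketch.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of Cruise Control.
final class IntraBrokerSelfHealingConfigSketch {
    static final String SELF_HEALING_ENABLED = "self.healing.enabled";
    static final String INTRA_BROKER_ENABLED = "self.healing.intra.broker.goal.violation.enabled";

    // Returns the intra-broker flag if it is set explicitly, else falls back to the global flag.
    static boolean intraBrokerSelfHealingEnabled(Properties props) {
        // Assumption: the global flag is treated as false when it is not configured.
        boolean global = Boolean.parseBoolean(props.getProperty(SELF_HEALING_ENABLED, "false"));
        String specific = props.getProperty(INTRA_BROKER_ENABLED);
        return specific == null ? global : Boolean.parseBoolean(specific.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(SELF_HEALING_ENABLED, "true");
        // No explicit override: the value is inherited from self.healing.enabled.
        System.out.println(intraBrokerSelfHealingEnabled(props)); // true

        // Explicit override disables intra-broker self-healing only.
        props.setProperty(INTRA_BROKER_ENABLED, "false");
        System.out.println(intraBrokerSelfHealingEnabled(props)); // false
    }
}
```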

New Anomaly Type: INTRA_BROKER_GOAL_VIOLATION

A new anomaly type that needs to be treated differently from regular GOAL_VIOLATION anomalies.
When configuring the rebalance parameters for this violation, we skip the hard goals check and ignore the proposalsCache.

We skip the hard goals check because intra-broker goals may not be hard goals (at present).
We ignore the proposalsCache because the cached proposals are computed from all the default goals. Adding the intra-broker goals to that model would not work properly, since disk-granularity goals currently do not work with broker-granularity goals.
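
As a rough sketch of the two choices above, the following maps an anomaly type to the flags used when building the fix. All names here are hypothetical and do not mirror the PR's actual classes; showing both flags as off for the regular path is an assumption made only for contrast.

```java
// Hypothetical names throughout; this is not Cruise Control's API.
enum AnomalyKindSketch { GOAL_VIOLATION, INTRA_BROKER_GOAL_VIOLATION }

final class FixParametersSketch {
    final boolean skipHardGoalCheck;
    final boolean ignoreProposalCache;

    private FixParametersSketch(boolean skipHardGoalCheck, boolean ignoreProposalCache) {
        this.skipHardGoalCheck = skipHardGoalCheck;
        this.ignoreProposalCache = ignoreProposalCache;
    }

    static FixParametersSketch forAnomaly(AnomalyKindSketch kind) {
        // Intra-broker goals may not be hard goals, and cached proposals are
        // computed from the default goals, so both are bypassed for this type.
        boolean intraBroker = kind == AnomalyKindSketch.INTRA_BROKER_GOAL_VIOLATION;
        return new FixParametersSketch(intraBroker, intraBroker);
    }

    public static void main(String[] args) {
        FixParametersSketch p = forAnomaly(AnomalyKindSketch.INTRA_BROKER_GOAL_VIOLATION);
        System.out.println(p.skipHardGoalCheck + " " + p.ignoreProposalCache); // true true
    }
}
```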

New Anomaly detectors: IntraBrokerGoalViolationDetector

This anomaly detector only checks the goals listed in anomaly.detection.intra.broker.goals and, unlike the GoalViolationDetector, it uses a cluster model that also includes replica placement on disks.
Its main function is detecting INTRA_BROKER_GOAL_VIOLATION anomalies.
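
To illustrate why the detector needs a disk-level model, here is a small sketch of a per-disk capacity check. The classes, the log directory paths, and the 0.8 threshold are illustrative assumptions, not the PR's implementation. In the example the broker as a whole is at 50% utilization, so a broker-granularity view would see nothing wrong, while one of its two disks is at 90%.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical disk-level view of a broker; not a Cruise Control class.
final class DiskLoadSketch {
    final int brokerId;
    final String logDir;
    final double usedMb;
    final double capacityMb;

    DiskLoadSketch(int brokerId, String logDir, double usedMb, double capacityMb) {
        this.brokerId = brokerId;
        this.logDir = logDir;
        this.usedMb = usedMb;
        this.capacityMb = capacityMb;
    }

    double utilization() {
        return usedMb / capacityMb;
    }
}

final class IntraBrokerDetectorSketch {
    // Returns "brokerId:logDir" for every disk whose utilization exceeds the threshold.
    static List<String> violatingDisks(List<DiskLoadSketch> disks, double threshold) {
        List<String> violations = new ArrayList<>();
        for (DiskLoadSketch disk : disks) {
            if (disk.utilization() > threshold) {
                violations.add(disk.brokerId + ":" + disk.logDir);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<DiskLoadSketch> disks = Arrays.asList(
            new DiskLoadSketch(0, "/data/d1", 900.0, 1000.0),  // 90% full
            new DiskLoadSketch(0, "/data/d2", 100.0, 1000.0)); // 10% full
        System.out.println(violatingDisks(disks, 0.8)); // [0:/data/d1]
    }
}
```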

What is missing from this feature

The Cruise Control UI has no representation of INTRA_BROKER_GOAL_VIOLATION when an anomaly is detected (this structure lives in the cruise-control-ui project)

Expected Behavior

For JBOD Kafka clusters, intra-broker goal violation detection and self-healing should be executed by Cruise Control without any user input, similar to inter-broker goal violations.

Known Workarounds

The workaround is the existing feature: manually calling the rebalance API with rebalance_disk=true, or using the Cruise Control UI.
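
A minimal sketch of the manual workaround, assuming Cruise Control is reachable at http://localhost:9090 (an assumption; adjust to your deployment). The helper name is hypothetical. The request is built with dryrun=true and is not sent in this example.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper for illustration only.
final class DiskRebalanceRequestSketch {
    // Builds a POST to the rebalance endpoint asking for an intra-broker (disk) rebalance.
    static HttpRequest buildDiskRebalanceRequest(String baseUrl, boolean dryRun) {
        URI uri = URI.create(baseUrl + "/kafkacruisecontrol/rebalance?rebalance_disk=true&dryrun=" + dryRun);
        return HttpRequest.newBuilder(uri)
            .POST(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        // Assumption: host and port below are placeholders for a local instance.
        HttpRequest request = buildDiskRebalanceRequest("http://localhost:9090", true);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request with java.net.http.HttpClient and switching dryrun to false would trigger the actual rebalance; that step is left out here on purpose.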

Categorization

  • documentation
  • bugfix
  • new feature
  • refactor
  • security/CVE
  • other

This PR resolves #1264.

Logs of cruise control execution against a 3 broker kafka cluster: kafkacruisecontrol.log

ecojan commented Jul 22, 2025

Thank you for picking this up @dgldn!

ecojan commented Aug 4, 2025

@CCisGG could you have a look, when you have some time?

Contributor

CCisGG commented Aug 4, 2025

Hi @ecojan , I may not have bandwidth to review this PR any time soon. Would you be able to find another reviewer? I can help to merge the PR if we get a reasonable review and a sign-off.

ecojan commented Aug 5, 2025

@CCisGG could you help us with a list of reviewers you worked with before? I tried looking through past PRs, but you seem the main reviewer :(

Contributor

@kyguy kyguy left a comment

Hi @dgldn, just had an initial pass and left a couple comments. It appears the maintainers are a bit swamped so the bulk of the code review is probably going to have to come from volunteers in the community. To save time on reviews, it would be great to double check that this PR addresses all the feedback left on the original! [1]

[1] #1721

}

@Override
public void run() {
Contributor

Does this implementation of run() address the feedback posted in the original PR where this code came from? [1]

[1] https://github.com/linkedin/cruise-control/pull/1721/files#r945326261

Author

Honestly, I didn't refactor this one. I'm not sure I will have enough bandwidth to refactor it.

@dgldn dgldn force-pushed the intra-broker-self-healing-disk-rebalance branch from 1e5b21f to 929bb86 Compare October 21, 2025 09:43
egyedt and others added 2 commits October 21, 2025 14:16
* Remove ZooKeeper from Cruise Control production

Authored-by: Tamas Barnabas Egyed <[email protected]>
Change-Id: Ic9a7e255193cfbbc6c537e5138c86725363a74b9

* Use KRaft in Cruise Control tests

Co-authored-by: Patrik Marton <[email protected]>
Co-authored-by: Tamas Barnabas Egyed <[email protected]>

---------

Co-authored-by: Patrik Marton <[email protected]>
# Conflicts:
#	cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java
#	docs/wiki/User Guide/Configurations.md

# Conflicts:
#	README.md
#	build.gradle
#	cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/CruiseControlMetricsReporterTest.java
#	cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/utils/CCEmbeddedKRaftController.java
#	cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/utils/CCKafkaIntegrationTestHarness.java
#	cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/utils/CCKafkaRaftServer.java
#	cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java
#	cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/KafkaCruiseControlUnitTestUtils.java
#	cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/analyzer/GoalOptimizerTest.java
#	cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/config/BrokerSetFileResolverTest.java
#	cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/config/CaseInsensitiveGoalConfigTest.java
#	cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/executor/ConcurrencyAdjusterTest.java
#	gradle.properties

Development

Successfully merging this pull request may close these issues.

Ability to use intra.broker.goals in anomaly detection / self healing

6 participants