-
Notifications
You must be signed in to change notification settings - Fork 643
Implementation of intra-broker goal violation detection and self-healing #2288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Implementation of intra-broker goal violation detection and self-healing #2288
Conversation
|
Thank you for picking this up @dgldn! |
|
@CCisGG could you have a look, when you have some time? |
|
Hi @ecojan , I may not have bandwidth to review this PR any time soon. Would you be able to find another reviewer? I can help to merge the PR if we get a reasonable review and a sign-off. |
|
@CCisGG could you help us with a list of reviewers you worked with before? I tried looking through past PRs, but you seem the main reviewer :( |
kyguy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dgldn, just had an initial pass and left a couple comments. It appears the maintainers are a bit swamped so the bulk of the code review is probably going to have to come from volunteers in the community. To save time on reviews, it would be great to double check that this PR addresses all the feedback left on the original! [1]
[1] #1721
cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/analyzer/goals/AbstractGoal.java
Outdated
Show resolved
Hide resolved
cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/analyzer/goals/AbstractGoal.java
Outdated
Show resolved
Hide resolved
.../src/main/java/com/linkedin/kafka/cruisecontrol/config/BrokerCapacityConfigFileResolver.java
Show resolved
Hide resolved
| } | ||
|
|
||
| @Override | ||
| public void run() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this implementation of run() address the feedback posted in the original PR where this code came from? [1]
[1] https://github.com/linkedin/cruise-control/pull/1721/files#r945326261
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I didn't refactor this one. Not sure if will have enough bandwidth to refactor this.
1e5b21f to
929bb86
Compare
* Remove ZooKeeper from Cruise Control production Authored-by: Tamas Barnabas Egyed <[email protected]> Change-Id: Ic9a7e255193cfbbc6c537e5138c86725363a74b9 * Use KRaft in Cruise Control tests Co-authored-by: Patrik Marton <[email protected]> Co-authored-by: Tamas Barnabas Egyed <[email protected]> --------- Co-authored-by: Patrik Marton <[email protected]> # Conflicts: # cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java # docs/wiki/User Guide/Configurations.md # Conflicts: # README.md # build.gradle # cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/CruiseControlMetricsReporterTest.java # cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/utils/CCEmbeddedKRaftController.java # cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/utils/CCKafkaIntegrationTestHarness.java # cruise-control-metrics-reporter/src/test/java/com/linkedin/kafka/cruisecontrol/metricsreporter/utils/CCKafkaRaftServer.java # cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java # cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/KafkaCruiseControlUnitTestUtils.java # cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/analyzer/GoalOptimizerTest.java # cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/config/BrokerSetFileResolverTest.java # cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/config/CaseInsensitiveGoalConfigTest.java # cruise-control/src/test/java/com/linkedin/kafka/cruisecontrol/executor/ConcurrencyAdjusterTest.java # gradle.properties
Summary
This feature is completely based on work from @ecojan who worked on the PR. And, reviewed previoulsy by @alexander-falca @amuraru @zornhsu @CCisGG
Why:
This PR resolves #1264 by the goal violation detection and self-healing for intra-broker goals. This however does not account for race conditions between Cluster level Goals and will only trigger after said goals are finished.
The scope of this PR is to allow cruise control run IntraBroker goal violation detection and execute self-healing for now (as future work might make this obsolete as intra-broker goals and regular goals might be made to work together).
What:
Implementation of intra-broker goals violation detection and executing self-healing in case of JBOD(multiple logDirs) based kafka clusters.
New Configurations:
The configs
anomaly.detection.intra.broker.goalsandself.healing.intra.broker.goalsboth default tocom.linkedin.kafka.cruisecontrol.analyzer.goals.IntraBrokerDiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.IntraBrokerDiskUsageDistributionGoalsince these are the only cruise control intra-broker goals available at the moment.The
self.healing.intra.broker.goal.violation.enabledcan be disbaled explicilty overriding the config tofalse.New Anomaly Type: INTRA_BROKER_GOAL_VIOLATION
A new type of Anomaly that needs to be treated differently from the regular GOAL_VIOLATIONS.
When configuring for the rebalance parameters for this violation, we are skipping hard goals check and ignoring proposalsCache.
We are skipping hard goals check because intra-broker goals may not be hard goals (at the current given time).
We are ignoring proposalsCache because the proposals model takes into account all given default goals. If we add the intraBrokersGoals this will not work properly as currently, disk-granularity goals do not work with broker-granularity goals.
New Anomaly detectors: IntraBrokerGoalViolationDetector
This anomaly detector is only looking for the anomaly.detection.intra.broker.goal and unlike the GoalViolationDetector, it's using a cluster model that is generating replica placement on disks as well.
This anomaly detector mainly functions for INTRA_BROKER_GOAL_VIOLATIONS.
What is missing from this feature
Missing representation on INTRA_BROKER_GOAL_VIOLATIONS in CruiseControl UI when an anomaly is detected (as this is a structure in cruise-control-ui project)
Expected Behavior
In case of JBOD kafka clusters, IntraBroker goal violation detection and self-healing should be excuted by cruise control similar to Inter-Broker goal violations without any user inputs.
Known Workarounds
Workaround is the existing feature of manually executing
rebalanceAPI withrebalance_disk=trueor using the cruise control UICategorization
This PR resolves #1264.
Logs of cruise control execution against a 3 broker kafka cluster: kafkacruisecontrol.log