Proposal for adding Self healing feature in the operator #145


Open · wants to merge 6 commits into main

Conversation

ShubhamRwt
Contributor

This proposal shows how we plan to implement the self-healing feature in the operator.

Member

@scholzj scholzj left a comment

Thanks for the proposal.

I think it would be useful to use regular human language here and not the Cruise Control language, as that would bring a better perspective on what exactly is being proposed here and what is not being proposed here (and what the self-healing actually does and does not do).

I do not understand the role of the StrimziNotifier. While I doubt that anyone will find it appealing to have Kubernetes events used for the notification instead of Slack or Prometheus, I cannot of course rule out that this will find its users. But it seems to be only marginally related to Strimzi. I'm sure we can host this under the Strimzi GitHub organization if desired, but it seems more like an unrelated subproject that is separate from the self-healing feature itself.

The actual self-healing ... you focus a lot on the disk and broker failures. But it seems to me like the self-healing does not offer good solutions for this. The idea to solve this by rescheduling the partition replicas to other brokers seems to be well suited for Kafka on bare-metal with local storage, where a dead server often means that you have to get a brand new server to replace it. It does not seem to be well suited to self-healing Kubernetes infrastructure and to cloud-based environments, where Pods can be just rescheduled and a new disk can be provisioned in a matter of seconds.

So it seems to me like the self-healing itself as done by Cruise Control could be useful for dealing with imbalanced brokers. But for situations such as broker or disk failures or insufficient capacity, we should be more ambitious and aim to solve them properly in the Strimzi Cluster Operator. That is where an actual StrimziNotifier might play a future role: not issuing Kubernetes Events to tell the user about a broker or disk failure, but providing a mechanism to tell Strimzi about it so that Strimzi can react to it - e.g. by scaling based on a capacity recommendation from CC, or, as suggested, deleting the broken PVC and getting a new one provisioned to replace it. While I understand that this might be out of scope for this proposal, I think it is important to leave space for it. For example by focusing here on the auto-rebalancing features and leaving the other detected issues unused for future development.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
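For reference, this is roughly what that manual fix looks like today; a minimal sketch assuming a Kafka cluster named `my-cluster` (the broker id is only illustrative):

```
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: remove-failed-broker
  labels:
    strimzi.io/cluster: my-cluster
spec:
  # move partition replicas off broker 3 so it can be removed
  mode: remove-brokers
  brokers: [3]
```
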
Member

This is not some bare-metal cluster but Kubernetes. So I'm struggling to find the issues where you would need to do what you suggest. Instead, on Kubernetes, you would normally just fix the issue on the existing broker.

Member

I can see your point. At the Kubernetes level, a broker failure is maybe more related to a lack of resources, which you should fix at the cluster level and not by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).


During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issue on your own.
Member

I think you should clarify what issues you are talking about here. From the past discussion, the terminology used here seems highly confusing ... and I suspect that words such as issues, anomalies, and problems are ambiguous at best.

Member

For me they are all synonyms more or less. Maybe just use one and stick with it?


## Motivation

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Member

Nothing in this proposal suggests that Cruise Control can solve disk or broker failures.

Member

AFAICS, the proposal introduces the concept of broker/disk failure detectors and the related fix classes, which highlights that CC can solve them. Or what are you looking for?


With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>).
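As an illustration of the existing mechanism, the notifier is selected through the `anomaly.notifier.class` option in `spec.cruiseControl.config` of the `Kafka` resource; a minimal sketch using the stock Cruise Control `SelfHealingNotifier` as the example class:

```
spec:
  cruiseControl:
    config:
      # class the anomaly detector calls to notify about (and optionally heal) anomalies
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
```
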
Member

The link seems broken? (Double ( and )?)

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
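For illustration, the detection interval mentioned above is tuned through the same Cruise Control config block; the 5-minute value below is only an example:

```
spec:
  cruiseControl:
    config:
      # run the periodic anomaly detectors every 5 minutes (illustrative value)
      anomaly.detection.interval.ms: 300000
```
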
Member

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

Contributor

I'm not sure if we need this level of details. Maybe just link to the doc of these detectors? If not, maybe reword the sentence to something like this:

Suggested change
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` uses metrics while other detector such as `GoalViolationDetector` and `TopicAnomalyDetector` use a different mechanism to detect anomalies.

Maybe we should mention briefly how GoalViolationDetector and TopicAnomalyDetector detect anomalies like these others too.

Member

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?


The `StrimziNotifier` class is going to reside in a separate module in the operator repository called the `strimzi-cruise-control-notifier` so that a separate jar for the notifier can be built for inclusion in the Cruise Control container image.
We will also need to include the fabric8 client dependencies in the notifier JAR that will be used for generating the Kubernetes events as well as for reading the annotation on the Kafka resource from the notifier.
Member

Reading what annotation?

Member

Yeah, I think you are anticipating what's going to be in the next paragraph but this sentence doesn't make any sense here.

Comment on lines 185 to 190
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
self-healing is not paused by default so missing annotation would mean that anomalies detected will keep getting healed automatically.
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them.
The anomalies that were already getting fixed would continue with the fix.
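A minimal sketch of how the proposed annotation could look on the `Kafka` resource; the string value "true" is an assumption, since the proposal does not spell out the expected value:

```
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    # proposed annotation; the value format is assumed here
    strimzi.io/self-healing-paused: "true"
```
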
Member

Why is pausing the self-healing useful for the users? Assuming it is useful, why not just disable the self-healing?

Member

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod for example).


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member

What exactly does it mean when the broker fails? How does CC recognize it? How does it distinguish failure for example from some temporary Kubernetes scheduling issue?

Member

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away but is first delayed in order to double check "later" whether the broker is still offline or not. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and no fix will be attempted for it.

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
Member

How would it finish the transfer from the failed broker when the target broker is stopped? You mean that it will continue with it after the target broker is rolled?

Contributor Author

Yes, if the target broker is rolling at that moment, then self-healing would end saying `self-healing finished successful with error`. Then the periodic anomaly check will again detect the failed broker and try to fix it. If the target broker has been rolled by that time then self-healing will start again; otherwise, if the rolled broker is not up yet, we will get two anomalies: one for the failed broker and one for the target broker.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
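For illustration only, the resulting `KafkaRebalance` status could look roughly like the sketch below; the `reason` and `message` values are assumptions and not captured operator output:

```
status:
  conditions:
    - type: NotReady
      status: "True"
      # illustrative values, not actual operator output
      reason: CruiseControlRestException
      message: There is already an ongoing execution in Cruise Control
```
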
Member

Does this need to differentiate between asking for a proposal and approving a proposal? For example when asking for a proposal, I would expect the operator to wait and give me a proposal after the self-healing ends.

Member

From the following snippet I can see dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you and you were able to approve it and then got the error?

Member

@ShubhamRwt any answer here?

Contributor Author

Yes, in the middle of self-healing I was able to generate the proposal and when I approved it I got the error

Contributor Author

I pushed a section around this in the proposal


#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
Contributor

Suggested change
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
The anomaly detector manager helps in detecting the anomalies as well as handling them.

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
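For illustration, a sketch of how the `self.healing.goals` option mentioned in this bullet would sit next to the other Cruise Control options; the single goal listed is just an example:

```
spec:
  cruiseControl:
    config:
      self.healing.enabled: true
      # goals checked by the GoalViolationDetector and used for the fix (example value)
      self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
```
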
Contributor

So you mean the goals in KafkaRebalance and the goals in self.healing.goals are independent of each other?

Contributor Author

if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

Member

if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

If by KR you mean the KafkaRebalance custom resource, I don't think what you are saying is true. There is no relationship between a specific KafkaRebalance and self-healing configuration.
Maybe you are referring to the general CC configuration? So the default hard goals in CC have to be a superset of self.healing.goals (and not the other way around)?

The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Contributor

Suggested change
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Anomaly detection also has various [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can affect the performance of the Cruise Control server and the latency of the anomaly detection.

Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like Slack Notifier, Alerta Notifier etc. for notifying users regarding the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to make notifiers more efficient as per the use case.
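As an example of the SelfHealingNotifier tuning mentioned above, the broker-failure thresholds can be set alongside the rest of the Cruise Control configuration; the values below are illustrative, not recommended defaults:

```
spec:
  cruiseControl:
    config:
      # how long a broker must be offline before an alert is raised (illustrative)
      broker.failure.alert.threshold.ms: 900000
      # how long before self-healing starts moving replicas off the broker (illustrative)
      broker.failure.self.healing.threshold.ms: 1800000
```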

#### Self Healing
If self-healing is enabled in Cruise Control and anomalies are detected, then a fix for these will be attempted automatically.
Contributor

A fix will be attempted automatically only if anomaly.notifier.class is configured to a non-default notifier, right? So enabling self-healing alone is not enough, as with the default notifier (NoopNotifier), anomalies are detected but ignored?

Contributor Author

Yes, you will need to override the notifier too

The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible status:
* DETECTED - The anomaly is just detected and no action is yet taken on it
* IGNORED - Self healing is either disabled or the anomaly is unfixable
* FIX_STARTED - The anomaly has started getting fixed
Contributor

@tinaselenge tinaselenge Jan 14, 2025

Suggested change
* FIX_STARTED - The anomaly has started getting fixed
* FIX_STARTED - The anomaly fix has started

* FIX_STARTED - The anomaly has started getting fixed
* FIX_FAILED_TO_START - The proposal generation for the anomaly fix failed
* CHECK_WITH_DELAY - The anomaly fix is delayed
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
Contributor

Suggested change
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
* LOAD_MONITOR_NOT_READY - Monitor for workload of a Kafka cluster is in loading or bootstrap state

```
self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
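A sketch of what enabling only selected anomaly types could look like, using the two options named above:

```
spec:
  cruiseControl:
    config:
      # heal only broker failures and goal violations
      self.healing.broker.failure.enabled: true
      self.healing.goal.violation.enabled: true
```
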
Contributor

Are there any anomalies enabled by default? If the user does not specify anomalies to detect, will self-healing do nothing?

Contributor Author

@ShubhamRwt ShubhamRwt Jan 15, 2025

If the user doesn't specify any specific anomalies (i.e. you use just self.healing.enabled: true) then self-healing would detect all the anomalies. In case the user knows that some anomaly will keep appearing because they have an issue in their cluster, they can skip the check for that particular anomaly.

Member

If the user doesn't specify any specific anomalies (i.e. you use just self.healing.enabled: true) then self-healing would detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster, so having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.

The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Contributor

Does the delay action apply a configured delay before taking the action provided by the notifier?

Member

Yes. Even if the delay is mostly used only for broker failures and not for other anomalies AFAIK.

Member

@ppatierno ppatierno left a comment

I left some comments. Can you go through the proposal and split paragraphs with one sentence per line? Thanks!

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
Member

I can see your point. At the Kubernetes level, a broker failure is maybe more related to a lack of resources, which you should fix at the cluster level and not by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).


During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issue on your own.
Member

For me they are all synonyms more or less. Maybe just use one and stick with it?


## Motivation

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Member

AFAICS, the proposal introduces the concept of broker/disk failure detectors and the related fix classes, which highlights that CC can solve them. Or what are you looking for?


With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>).
Member

This is the first time you mention the term "notifier" but it shows up out of context. You should first explain what it is and why it's useful in general, not related to the self-healing but more to the alerting.

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Member

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?

### Pausing the self-healing feature

The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
Member

Suggested change
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
For this, we will be having an annotation called `strimzi.io/self-healing-paused` on the `Kafka` custom resource.

Comment on lines 185 to 190
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
self-healing is not paused by default so missing annotation would mean that anomalies detected will keep getting healed automatically.
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them.
The anomalies that were already getting fixed would continue with the fix.
Member

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod for example).

```
uid: 71928731-df71-4b9f-ac77-68961ac25a2f
```

We will have checks on methods like `onGoalViolation`, `onBrokerFailure`, `onTopicAnomaly` etc. in the `StrimziNotifier` to check if the `pause` annotation is applied on the Kafka custom resource or not.
Member

There is no "pause" annotation. Use the exact name.


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away but is first delayed in order to double check "later" whether the broker is still offline or not. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and no fix will be attempted for it.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

From the following snippet I can see dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you and you were able to approve it and then got the error?


The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Member

Suggested change
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
Member

Suggested change
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) from those used for manual rebalancing.

The nuances between the goal configurations of Cruise Control can be quite confusing; to help keep them straight in my head I categorize them in the following way:

  • The server-side Cruise Control goal configs in spec.cruiseControl.config of the Kafka resource
  • The client-side Cruise Control goal config in spec of the KafkaRebalance resource

To avoid confusion, it may be worth adding some text that explains where and how the self.healing.goals compares/relates to the other server-side Cruise Control goal configs - goals, default.goals, hard.goals, anomaly.detection.goals. It wouldn't have to be right here, but somewhere in the proposal where the self.healing configuration is raised. When we add the self-healing feature we will need to add some documentation of how self.healing.goals relates to the other goal configurations anyway. We could use the same language used to describe the different goal configs in the Strimzi documentation and provide a link to those docs as well.

I think the language you and Paolo use in this thread referring to other goal configs like "superset" or subset is helpful. Since the goal configs are so complicated, I don't think it is a problem if we reuse text from the Strimzi Cruise Control goal documentation to emphasize the distinction of the self.healing.goals config.
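To make the comparison concrete, a sketch of how the server-side goal configs sit next to each other in `spec.cruiseControl.config`; the goal lists are shortened to a single class each and are purely illustrative:

```
spec:
  cruiseControl:
    config:
      # goals used when generating regular optimization proposals
      default.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal
      # goals that a proposal must never violate
      hard.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal
      # goals whose violation is reported as an anomaly
      anomaly.detection.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
      # goals used when self-healing generates its fix
      self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
```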

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
Member

Suggested change
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.


The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
Member

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.


The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
Member

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.


### Enabling self-healing in the Kafka custom resource

When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
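For illustration, Strimzi feature gates are toggled on the Cluster Operator through the `STRIMZI_FEATURE_GATES` environment variable; the gate name below is taken from the suggestion that follows and is not final:

```
# Cluster Operator Deployment (fragment)
env:
  - name: STRIMZI_FEATURE_GATES
    value: "+UseSelfHealing"
```
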
Member

Suggested change
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
When the `UseSelfHealing` feature gate is enabled, users will be able to activate self-healing by configuring the [`self.healing` Cruise Control configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) in the `spec.cruiseControl.config` of their `Kafka` resource.
When the `UseSelfHealing` feature gate is disabled, users will not be able to activate self healing and all `self.healing` Cruise Control configurations will be ignored.

```
self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
Member

If the user doesn't specify any specific anomalies (i.e. you use just self.healing.enabled: true) then self-healing would detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster, so having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.


Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Member

Suggested change
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.

`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
The `SelfHealingNotifier` class can then be further used to implement your own notifier
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
Member

Suggested change
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
The other custom notifiers offered by Cruise control are also based upon the `SelfHealingNotifier`.


### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Member

Suggested change
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.

Signed-off-by: ShubhamRwt <[email protected]>

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
If some anomaly is getting healed , Cruise Control will then try to fix/mitigate the anomaly by moving the partition replicas across the brokers in the Kafka cluster.
Contributor

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.

The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section) to those used for manual rebalancing(goals configured in the KafkaRebalance custom resource).
Contributor

I think this part confuses me slightly; maybe it can be reworded better.
Is the below what you mean?

  • Goal Violation - This happens if certain optimization goals such as DiskUsageDistributionGoal are violated. The optimization goals can be configured independently from KafkaRebalance custom resource by using spec.cruiseControl.config.self.healing.goals in Kafka custom resource.

When you say certain optimization goals, are you saying only some would trigger a Goal Violation anomaly?


Yes, AFAIK it would only be those goals specifically listed in the self-healing goals configuration.

Contributor Author

I tried to make the sentence less confusing, I hope it's better now.

```
statusUpdateDate=2024-12-09T09:49:47Z,
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
```
where the `anomalyID` represents the task id assigned to the anomaly, `failedDisksByTimeMs` represents the disk which failed, `detectionDate` and `statusUpdateDate` represent the date and time when the anomaly was detected and the status of the anomaly was updated.
Contributor

I don't see the mentioned anomalyType field snippet. Is this a full snippet of an anomaly?

I guess not that important, since this is just an example, but should we explain the value of failedDisksByTimeMs? {2={/var/lib/kafka/data-1/kafka-log2=1733737786603}} - what these values mean and how they represent the failed disk? If you think this is not important, please feel free to ignore this part.

Contributor Author

I have added some explanation around it

* COMPLETENESS_NOT_READY - Completeness not met for some goals. The completeness describes the confidence level of the data in the metric sample aggregator. If this is low then some Goal classes may refuse to produce proposals.

We can also poll the `state` endpoint, with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
We can again poll the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control), with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
Contributor

Will we describe how users might poll this endpoint?

Contributor Author

By this you mean mentioning the complete curl command?

Contributor Author

I have added the curl command to poll the endpoint

Signed-off-by: ShubhamRwt <[email protected]>
Member

@ppatierno ppatierno left a comment

@ShubhamRwt I had another pass but I think the next one should be with a proposal around metrics as another option instead of using the Kubernetes events notifier.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource in the [`rebalance`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
Member

There is no rebalance mode. Did you mean full? Anyway I would not specify it, because it's possible you could even decide to scale up the cluster when doing the rebalancing so add-brokers would be needed.

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` and `GoalViolationDetector` uses metrics to detect the anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the KafkaRebalance custom resource.
Member

Suggested change
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the KafkaRebalance custom resource.
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the `KafkaRebalance` custom resource.

The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Member

Yes. Even if the delay is mostly used only for broker failures and not for other anomalies AFAIK.

* CHECK - Delay the anomaly fix
* IGNORE - Ignore the anomaly fix

The default `NoopNotifier` always sets the notifier action to `IGNORE`, which means that the detected anomaly will be silently ignored.
Member

I would also add that it doesn't send any notification to the user in any way as well.

```
type: Normal
```

The users can build a jar for the notifier for inclusion in the Kafka image to run on as part of Cruise Control.
Member

The KubernetesEventsNotifier should be included out of the box within the Kafka Strimzi image in order to be configured by the operator. It's not clear if you are referring to a custom notifier that a user would write.


### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.
Member

Not sure what the purpose of this single-sentence paragraph is.

If the deleted or rolled broker comes back before the periodic anomaly check interval, then self-healing would start again only for the earlier failed broker with a new anomalyId since it was not fixed properly.
In case the deleted broker doesn't come back healthy within the periodic anomaly check interval then another new anomaly for the deleted broker will be created and the earlier detected failed broker would also persist but will be updated with a new anomalyId.

#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is approved
Member

Suggested change
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is approved
#### What happens when self-healing is running/fixing an anomaly and KafkaRebalance is approved


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

@ShubhamRwt any answer here?


## Affected/not affected projects

A new repository named `kubernetes-event-notifier` will be added under the Strimzi organisation.
Member

There is also the cluster operator to be affected because we need to allow enabling self-healing, or?

observedGeneration: 1
```

#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal
Contributor Author

@ppatierno this section talks about what happens if we request rebalance in middle of self healing

Member

@ppatierno ppatierno left a comment

I had another pass. Left some comments and made some cosmetic changes (it seems you don't like spaces and new lines :-P).


The above flow diagram depicts the self-healing process.
The anomaly detector manager detects the anomaly (using the detector classes) and forwards it to the notifier.
The notifier then alerts the user about the detected anomaly through the logs and also returns what action needs to be taken on the anomaly, i.e. whether to fix it, ignore it or delay it.
Member

Actually it depends on the notifier, the anomaly is not just logged. You can have Slack, MS Teams or anything else, with a dedicated notifier, to get a notification about the anomalies.

Each detected anomaly is implemented by a specific class which leverages a corresponding other class to run a fix.
For example, the `GoalViolations` class uses the `RebalanceRunnable` class, the `DiskFailure` class uses the `RemoveDisksRunnable` class and so on.
An optimization proposal will then be generated by these `Runnable` classes and that proposal will be applied on the cluster to fix the anomaly.
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
Member

Suggested change
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware (insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the Cruise Control logs will be updated with `self-healing is not possible due to unfixable goals` warning.

An optimization proposal will then be generated by these `Runnable` classes and that proposal will be applied on the cluster to fix the anomaly.
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
When self-healing starts, you can check the anomaly status by querying the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control) with `substate` parameter set as `anomaly_detector`. This would provide the anomaly details like anomalyId, the type of anomaly and the status of the anomaly etc.
For example, the user can query the endpoint using the `curl` command:
Member

Suggested change
For example, the user can query the endpoint using the `curl` command:
For example, the user can query the endpoint using the `curl` command:

```shell
curl -vv -X GET "http://localhost:9090/kafkacruisecontrol/state?substates=anomaly_detector"
```
the detected anomaly would look something like this:
Member

Suggested change
the detected anomaly would look something like this:
the detected anomaly would look something like this:

failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
```
where,
Member

Suggested change
where,
where:

### Self-healing configuration in the Strimzi operator

Users can activate self-healing by setting the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
If the user don't set the notifier property then `NoopNotifer` will be used by default which means all the action on the anomalies would be set to `IGNORE`.
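For illustration, a minimal sketch of what such a configuration could look like in the `Kafka` custom resource (assuming the `self.healing.*` and `anomaly.notifier.class` properties are allowed through `spec.cruiseControl.config`; the goal and notifier class names are only examples):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ... kafka and other sections omitted ...
  cruiseControl:
    config:
      # enable the Cruise Control self-healing mechanism
      self.healing.enabled: true
      # notifier used to alert about detected anomalies; without it the NoopNotifier
      # is used and every action on a detected anomaly is IGNORE
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
      # example goal whose violation should trigger self-healing
      self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
```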
Member

I would so make clear that the self.healing.enabled is not enough by itself then, because you also need to set a proper notifier (instead of the default NoopNotifier).


Suggested change
If the user don't set the notifier property then `NoopNotifer` will be used by default which means all the action on the anomalies would be set to `IGNORE`.
However, if the user does not set the notifier class property then the `NoopNotifer` will be used by default.
This means all the action on the detected anomalies will be set to `IGNORE` and self-healing is effectively disabled even if the `self.healing.enabled` property is set to `true`.


So given the two sentences above, what are you going to do about it? If a user sets self healing to true and not the notifier, what happens? Should there be a warning? Should the Kafka CR show an error condition, should it be set to NotReady (as that is clearly a mis-configuration)?


As I have said before, in my opinion we should prevent this situation happening by just setting up the notifier for the user (if they don't specify one themselves).

If the user doesn't specify a notifier we should think about what makes sense in that situation. Would an auto-approve notifier make sense?

Contributor Author

But wouldn't that mean that we will then have to create another new notifier? Isn't just specifying it enough, or would a warning in the Kafka CR maybe be enough?

Member

I guess this discussion is clashing with the other one where we are thinking of setting the Kube events notifier as default, or?

Contributor Author

Yes, for now I have updated the proposal saying that if no notifier is specified then kube events notifier will be selected as default but i wonder if we should do that? Shouldn't we just create a warning/error in Kafka CR saying that specifying the notifier is must or maybe setting the SelfHealingNotifer as Default so that self healing still works and also mentioning in the Kafka CR?

Member

The SelfHealingNotifier has an implementation of the alert method which just logs in the CC pod, which is of course not so Kube friendly or at least it means a user should take a look at the log. I think in our Kube env, it would be better having something more "cloud-native friendly".

Self-healing corresponding to broker failure or disk failure is disabled by default since we have better ways to fix these anomalies instead of moving the partition replicas across the brokers.
For example, broker failures are generally due to a lack of resources, so we can fix the issue at the cluster level; in the same way, failed disks can easily be replaced with new ones, so self-healing in these cases can be more resource consuming.

Here is an example of how the configured `Kafka` custom resource could look:
Member

Suggested change
Here is an example of how the configured `Kafka` custom resource could look:
Here is an example of how the configured `Kafka` custom resource could look:


Suggested change
Here is an example of how the configured `Kafka` custom resource could look:
Here is an example of what the configured `Kafka` custom resource could look:

These metrics can also provide a way for the users to debug self-healing related issues.
For example, the `number_of_self_healing_started` metric provides users with a way to check when self-healing was started for an anomaly, and the number of self-healing actions that were performed in the cluster.
This information can be useful to know at what point in time the self-healing occurred and when the shape/replica assignment of the cluster was changed due to self-healing.
The user can also use a metric like `number_of_self_healing_failed` to know if the self-healing failed to start, and then they can check the reasons for the failure in the log. One example could be the load monitor not being ready at that point of time.
Member

AFAIU all the above metrics are historical ones but they don't give any information about what's going on now.
I made this comment on the CC related PR as well, so I was wondering if we need to expose metrics about self-healing actually running.

Contributor Author

I don't see any existing metrics like that yet in CC. There are metrics which can tell that there is inter/intra broker replica movement but I am not sure how we can expose a metric to say when self healing is running. After talking with Tom, there is a way we can know if self healing is still running: we can take the sum(self healing started) minus sum(self healing ended) and that can tell if the self healing is still in progress

Member

But then it's something we should explain in the proposals and maybe providing a Grafana dashboard with such information?
Do we have, at least, the metric about when the last fix started?

Contributor Author

@ShubhamRwt ShubhamRwt May 5, 2025

If you are asking about the existing metrics then you can still use the number_of_self_healing_started metric and use the promql query to get the timestamp for the last change in the gauge -> last_over_time(timestamp(changes(kafka_cruisecontrol_anomalydetector_num_of_self_healing_started{}[<time-range>]) > 0

Contributor Author

If we want it based on the anomaly, we can do the same with the new metrics num_of_self_healing_started_for_<anomaly-type>_count

Member

Ok so I think we should list this info in the proposal.

Contributor Author

Sure

Contributor Author

@ShubhamRwt ShubhamRwt May 5, 2025

Currently we just have a single metric called number_of_self_healing_started to know that self healing was started for fixing an anomaly (this metric doesn't classify anomalies by type) and there is no metric regarding when self healing ended. The PR opened on CC upstream introduces new metrics -> num-of-self-healing-started-for-<anomalyType> and num-of-self-healing-ended-for-<anomalyType> which are both classified based on the anomaly types. But at the same time it doesn't add any metric like number_of_self_healing_ended which could provide us an overall count of how many self healing runs ended.

#### What happens when self-healing is running/fixing an anomaly and KafkaRebalance is approved

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
For example:
Member

Suggested change
For example:
For example:


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal

If self-healing is ongoing and a `KafkaRebalance` resource is applied in the middle of it then Cruise Control will still generate an optimization proposal. The proposal should be based on the cluster shape at that particular timestamp.
Member

How does this sentence cope with the previous one where you say the KafkaRebalance ends in a NotReady state. I am confused. Will it be not ready or will it have a proposal?


Yeah, I am also a bit confused. IMO this section should be merged with the one above. You need to cover all the possible situations when self healing is running and describe what happens in each case:

  • Brand new Kafka Rebalance created
  • Refresh an existing rebalance
  • Approve a rebalance


You also need to cover what happens if a manual rebalance is ongoing and self healing is enabled. Does the manual rebalance get cancelled, does the anomaly get ignored or delayed?

@tomncooper tomncooper left a comment

I had another pass. My main comments are:

  • You need to make sure the structure of doc is logical. Start with a description of the problem, the background including an overview of how CC anomaly detection and self healing works, what monitoring and metrics are available currently and then move on to state clearly what you are proposing to change. At the moment it is hard to tell what is just background context and what are concrete proposed changes.
  • IMO we need to make it simple to enable/disable self healing. User should set the self.healing.enabled to true and if they don't set the notifier we should set it for them.
  • The kube events notifier should then be the default (we know the user has K8s for sure). But that leads on to how it will decide to approve/ignore/delay fixes, you don't discuss that at all?
  • You need to cover all the ways that manual rebalances and self-healing can interact.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource.


Suggested change
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might instruct Cruise Control to move the partition replicas across the brokers in order to fix the violated goal by using the `KafkaRebalance` custom resource.


## Motivation

With the help of the self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.


Suggested change
With the help of the self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.
With the help of Cruise Control's self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.

Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster.
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc).
However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource.


Suggested change
However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource.
However, self-healing is currently disabled and disallowed in Strimzi.
Any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.


We might want to elaborate on why we disallowed it originally and why we no longer think that it is warranted any more.

![self-healing flow diagram](./images/090-self-healing-flow.png)

The above flow diagram depicts the self-healing process.
The Anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.


Suggested change
The Anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.

#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies as well as handling them.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.


Suggested change
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies.

Cruise Control provides the `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and overrides those methods by implementing the `AnomalyNotifier` interface.
Some of those methods are: `onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.
The `KubernetesEventsNotifier` is based of the `SelfHealingNotifier` class.


Suggested change
The `KubernetesEventsNotifier` is based of the `SelfHealingNotifier` class.
The `KubernetesEventsNotifier` will be based of the `SelfHealingNotifier` class.

For effective alerts and monitoring, it makes use of Kubernetes events to signal to users what operations Cruise Control is performing automatically.
The `KubernetesEventsNotifier` will override the `alert()` method of the `SelfHealingNotifier` and will trigger and publish events whenever an anomaly is detected.
These events would publish information regarding the anomaly like the anomalyID, the type of anomaly which was detected etc.
The events would help the user to understand the anomalies that are present, when the anomalies appeared in their cluster and also would help them know what is going on in their clusters.


So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?

Could be enabled by default if a user sets self healing enabled to true but doesn't specify a notifier class?

Member

So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?

I think the answer to this is still in my comment here #145 (comment) to explain how the base class for notifier works unless we want to write something from scratch (but I would argue about this).

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.
If, during the process of moving partitions replicas among the brokers in the Kafka cluster, some broker that was being used gets deleted or rolled, then Cruise Control would finish self-healing process and log that `self-healing finished successful with error`.
If the deleted or rolled broker comes back before the periodic anomaly check interval, then self-healing would start again only for the earlier failed broker with a new anomalyId since it was not fixed properly.
In case the deleted broker doesn't come back healthy within the periodic anomaly check interval then another new anomaly for the deleted broker will be created and the earlier detected failed broker would also persist but will be updated with a new anomalyId.


But didn't you say above that broker failures are disabled?



@ppatierno
Member

IMO we need to make it simple to enable/disable self healing. User should set the self.healing.enabled to true and if they don't set the notifier we should set it for them.

I agree with this.

The kube events notifier should then be the default (we know the user has K8s for sure). But that leads on to how it will decide to approve/ignore/delay fixes, you don't discuss that at all?

@tomncooper All the notifiers provided by Cruise Control are based on the SelfHealingNotifier [1] and if you go through it you can notice that any anomaly with self healing enabled (for each specific anomaly) is fixed (if it's fixable), otherwise it's ignored.
Of course, you can write the entire notifier logic from scratch (which was going to be the case when we started with the idea of giving the user the opportunity to drive the decision) but I think that for our use case it's fine going the CC provided way right now. I agree we should provide more details in the proposal.

[1] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/notifier/SelfHealingNotifier.java

@ShubhamRwt ShubhamRwt force-pushed the selfHealingFeature branch from cdb9225 to 73b7d83 on May 6, 2025 05:26
@ppatierno ppatierno requested review from Frawless and see-quick May 6, 2025 12:24
@ppatierno ppatierno requested review from im-konge and katheris May 6, 2025 12:24
@ppatierno
Member

We would like to move forward with this. I encourage @strimzi/maintainers who had already a pass to review again and for the others providing comments if they have any. Also @tinaselenge @kyguy I saw you left comments, could you have another pass please? Thanks everyone!

@tomncooper

@ppatierno @ShubhamRwt I think that wires have been crossed somewhere. It was my impression, stated multiple times, that if the user specifies self.healing.enabled: true but DOES NOT specify the notifier class, that the NOOP notifier is still used and all fixes are ignored.

Is that true? I have asked to confirm this several times, as it seems like it shouldn't be true? Does CC actually switch to the SelfHealingNotifier (auto-approve if fixable) in that case? Which would make much more sense and render several of my comments moot.

@ppatierno
Member

@ppatierno @ShubhamRwt I think that wires have been crossed somewhere. It was my impression, stated multiple times, that if the user specifies self.healing.enabled: true but DOES NOT specify the notifier class, that the NOOP notifier is still used and all fixes are ignored.

It's true.

Let me try to explain this based on the code (then @ShubhamRwt can confirm based on his testing).
Let's start from swapping the order between the notifier and the self.healing.enabled flag :-) ... I mean, the main part is setting the notifier class, then it's going to use the self healing flag.

Here [1] you can see where the anomaly detector manager is going to load the notifier class and it's defaulting to NOOP notifier if the field is not set .
The anomaly detector manager doesn't care of the self healing flag at all. It just loads the configured notifier or NOOP notifier (by default).
Dealing with the self healing flag is notifier business when it's called by the anomaly detector here [2].
The anomaly detector manager gets the info about the self healing flag just when registering the sensors for metrics but .... from where? Through the notifier [3] :-)

So if you look at the NOOP notifier or any other notifier implementation, the logic about reading the self healing flag to ignore or fix is there (in the notifier). Also the NOOP notifier is configured by setting all the self healing flag(s) to false [4]. Other notifiers take into account the real flag set by the user instead.

I hope this clarifies that the logic is inverted somehow: we don't go from self healing flag to notifier, but from notifier to self healing flag. The anomaly detector manager cares about the notifier only, which is in charge of using the self healing flag(s).

Hope this helps and clarifies everything!

[1] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java#L107

[2] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java#L436

[3] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java#L190

[4] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/notifier/NoopNotifier.java#L31

@ShubhamRwt
Contributor Author

@ppatierno Thanks for explaining the code part and yes it's true that the notifier will be kept as NoopNotifier in case we don't set the anomaly.detection.class property. When I tested this case multiple times, the anomalies detected were always ignored. @tomncooper I can maybe raise an issue in upstream and ask the maintainers if we can set the SelfHealingNotifier as default in case we enable self-healing but DO NOT set the notifier. I can also retest the scenario again plus ask the upstream maintainers if our deduction is correct

@ppatierno
Member

I can maybe raise a issue in upstream and ask the maintainers if we can set the SelfHealingNotifier as default in case we enable self-healing but DO NOT set the notifier

Not sure it makes any sense to set the SelfHealingNotifier as default because, even if it has some logic to fix or ignore, it doesn't send any alert when it's happening. It just logs warnings in CC.
Also as I explained the CC logic is kind of "inverted" in the anomaly detector manager: notifier has the primary role to be configured for the manager, then it uses the self healing flags. Anyway, we can have a separate conversation about if/how to improve CC from this pov. For now focusing on the proposal, we know what happens.

@tomncooper

Right, ok, I think I understand now. In which case my comments stand. If a user sets self.healing.enabled: true in the CC config and doesn't explicitly set the notifier class. Strimzi should, IMHO, set the notifier to the SelfHealingNotifier automatically.

Or, perhaps more usefully, sets it to the K8s events notifier (which extends the SelfHealingNotifier). Of course if the user set the notifier we don't touch it.

@ppatierno
Member

Strimzi should, IMHO, set the notifier to the SelfHealingNotifier automatically. Or, perhaps more usefully, sets it to the K8s events notifier (which extends the SelfHealingNotifier). Of course if the user set the notifier we don't touch it.

The difference would be how the user gets notified about anomalies. Using the base SelfHealingNotifier ... no notifications, just WARN logs in the Cruise Control log. With a Kube Events notifier, they will get ... of course Kube events :-)
Not sure whether WARN messages are enough. I would be for the Kube Events.

Member

@kyguy kyguy left a comment

Still working my way through, I'll add more ideas later.

# Self Healing using Cruise Control

This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator.
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically.
Member

Suggested change
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically.
When enabled, Cruise Control's self-healing feature automatically resolves issues detected by its Anomaly Detector Manager.
This includes problems such as broker failures, goal violations, and disk failures, reducing the need for manual intervention.

@@ -0,0 +1,300 @@
# Self Healing using Cruise Control

This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator.
Member

Suggested change
This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator.
This proposal is about adding support for Cruise Control's self-healing capabilities in the Strimzi operator.


## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Member

Suggested change
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Even under normal operation, it's common for Kafka clusters to encounter problems such as partition key skew leading to a uneven partition distribution, or hardware issues like as disk failures, which can degrade overall cluster health and performance.

The detection of anomalies is the responsibility of the anomaly detector mechanism.
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster.
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc.).
Member

Wouldn't the information concerning the notifier feature be more relevant in the proposal section instead of the motivation section here where we describe the motivation behind why we want to add self-healing support? Unless we are adding support for Cruise Control notifiers as part of this proposal as well?

Two questions here:

(a) Can we use the Cruise Control notifiers feature on its own in Strimzi today without adding self-healing support?

(b) I understand that we need Cruise Control's notifier feature for supporting the self-healing feature, but wouldn't the sentences about the notifier feature be more relevant in the proposal's implementations section instead of here?

Contributor

I agree with @kyguy . I think some sentences from this section can be in other more relevant section, rather than here in Motivation section.
Reading this sentence, it implies that currently, we can set these notifiers without this proposed solution to support self-healing. Basically what Kyle asked in the question.

Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc.).
However, self-healing is currently disabled and disallowed in Strimzi.
The `self-healing` properties were disabled since there was missing investigation around how self-healing would act if pods roll in the middle of it or a rebalance starts during it, and how the operator would know if self-healing is running or not.
Member

I would move a variation of this sentence up to the Current Situation section.

However, self-healing is currently disabled and disallowed in Strimzi.
The `self-healing` properties were disabled since there was missing investigation around how self-healing would act if pods roll in the middle of it or a rebalance starts during it, and how the operator would know if self-healing is running or not.
Any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.
It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
Member

Why is it useful to have these anomalies fixed automatically? It might be worth describing some of the anomalies the self-healing could fix automatically and why they are such a hassle to fix manually.

Member

@scholzj scholzj left a comment

I still don't understand the StrimziNotifier. It does not seem to be really useful. Why do you think anyone wants to use Kubernetes Events to be notified about something like this? But probably even more important - it seems to have absolutely no role in this proposal and I would suggest to lift it out into a separate proposal if you really think this makes sense for any user.

Apart from that, my take on this proposal is pretty similar to what it was in January. It seems like something that would work for auto-rebalancing. But:

  • It provides an approach completely inconsistent with the auto-rebalancing-on-scaling feature
  • It does not seem to have any considerations towards the future steps and integration.
    For example:
    • How to use the Cruise Control goal violations to feed auto-scaling
    • How to use the Cruise Control goal violations to tell the operator to fix a disk or broken node
  • It does not mention any way how to allow the auto-rebalancing only in certain time windows which is IMHO very important for a feature like this. Or does CC have some configuration option for this?

I think the rejected alternative 2 might be able to provide clear paths for all three points. But adopting this proposal might close the doors to it (without some complicated changes and blocking the options we allow etc.).


During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might instruct Cruise Control to move the partition replicas across the brokers in order to fix the violated goal by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
Member

Isn't it a single KafkaRebalance regardless of the cluster size? You can for example have a CronJob to create the KafkaRebalance every night or every weekend and execute it, and while the rebalance might take a different time, the process itself is the same regardless of whether you have 5 or 50 nodes.

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
Detector classes have different mechanisms to detect their corresponding anomalies.
For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API.
Member

What is Kafka Metadata API? The Kafka Admin API is defined by the docs - https://kafka.apache.org/documentation/#adminapi - but there is nothing like Metadata API in the Kafka docs.

Furthermore, `MetricAnomalyDetector` and `GoalViolationDetector` use metrics to detect their anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the `KafkaRebalance` custom resource.
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
Member

Is this what you meant? Or what is some partitions are too large in disk?

Suggested change
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions consume too much disk space).


## Proposal

### Introduction
Member

TBH, none of the Introduction section is really a proposal of anything. You seem to be just describing how Cruise control works. It is useful, but I would not include it in the Proposal but add it before it as a separate section. That would make it easier to find what exactly is this proposing.

If self-healing is enabled, then an action is returned by the notifier which decides whether the anomaly should be fixed or not.
If the notifier has returned `FIX` as the action then the classes which are responsible for resolving the anomaly would be called.
Each detectable anomaly is handled by a specific detector class which then uses another remediation class to run a fix.
For example, the `GoalViolations` class uses the `RebalanceRunnable` class, the `DiskFailure` class uses the `RemoveDisksRunnable` class and so on.
Member

Can we have custom runnables as a reactions to different failures? E.g., is it technically possible to get Cruise Control to use StrimziFixDiskRunnable to address DiskFailure by deleting the PV/PVC and having it recreated by Kubernetes instead of doing some rebalance?

Contributor Author

I am not sure if we can do that. I just know we can create custom goals in CC but I don't know if we can write custom runnables. I can ask some CC folks if it's possible or not


In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specify a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
If the user forgets to set the notifier property then the `KubernetesEventNotifier` (a custom notifier under the Strimzi org) will be used by default and the user will also get a warning in the Kafka CR regarding the missing `anomaly.notifier.class` property.
We will support most anomaly fixes, however there are some fixes which we don't want self-healing to apply automatically in a Kubernetes context.
Member

Can you be specific and list what goals will be and what will not be supported instead of saying most?

Also, given what you say below and assuming it does not change -> please make it clear here that they will be just disabled by default while the users can enable them.

self.healing.disk.failure.enabled: false # disables self healing for disk failure
```

The user can still enable self-healing for disk failures and broker failures if they want, by setting `self.healing.broker.failure.enabled` (for broker failures) and `self.healing.disk.failure.enabled` (for disk failures) to `true`.
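As an illustrative sketch only (same `spec.cruiseControl.config` section as above; these flags are described in this proposal as disabled by default):

```yaml
cruiseControl:
  config:
    self.healing.enabled: true
    # opt back in to the Cruise Control fixes that move partition replicas
    # away from a failed broker or disk
    self.healing.broker.failure.enabled: true
    self.healing.disk.failure.enabled: true
```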
Member

Why? They make zero sense in the Kubernetes context.

I would also expect that if we let the users enable them now, it will be impossible or at least much harder to provide a proper solution for them later as it will need to deal with users who have it enabled.

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.
If, during the process of moving partitions replicas among the brokers in the Kafka cluster, some broker that was being used gets deleted or rolled, then Cruise Control would finish self-healing process and log that `self-healing finished successful with error`.
Member

What does finish mean in this context? It sounds like it will move a partition replica to a black hole left after a removed broker which would be pretty nasty. I assume it will stop the process with an error instead?

Contributor Author

So the process is stopped but I don't think there are any reverts, so it's more like an incomplete movement. The upstream CC logs still state finished successful with error. Maybe we should get this error message fixed upstream


* If self-healing is ongoing and a `KafkaRebalance` resource is created/applied in the middle of it then Cruise Control will still generate an optimization proposal.
The proposal should be based on the cluster shape at that particular timestamp.
* If self-healing is ongoing and the generated proposal is `approved` then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

What if a proposal that is generated before or during self-healing is approved after the self-healing is finished?

Contributor Author

That generated proposal will be based on old info but yeah I can try this case to know more


## Affected/not affected projects

This change will affect the Strimzi cluster operator and a new repository named `kubernetes-event-notifier` will be added under the Strimzi organisation.
Member

It should probably include the cruise-control somewhere in the name? Please also move the information about the new repository to the section about the notifier (unless you remove it as I suggest of course). You should also clarify there how it will be distributed (likely through Maven Central?).

Contributor Author

Okay, I will add it

With the help of Cruise Control's self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.
The detection of anomalies is the responsibility of the anomaly detector mechanism.
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster.
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Contributor

Suggested change
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Notifiers can be useful as they provide users visibility for the actions that are being performed by Cruise Control self-healing e.g. anomaly being detected, anomaly fix being started etc.


![self-healing flow diagram](./images/090-self-healing-flow.png)

The above flow diagram depicts the self-healing process.
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.
Contributor

Suggested change
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue whose comparator is based upon the priority value and the detection time.
The smaller the priority value and the earlier the detection time, the higher the priority the anomaly type has.
Contributor

Who decides the priority value? How does it correlate with detected time?

Contributor Author

@ShubhamRwt ShubhamRwt May 22, 2025

The priority is set in the KafkaAnomalyType class in Cruise Control.

  @JsonResponseField
  BROKER_FAILURE(0),
  @JsonResponseField
  MAINTENANCE_EVENT(1),
  @JsonResponseField
  DISK_FAILURE(2),
  @JsonResponseField
  METRIC_ANOMALY(3),
  @JsonResponseField
  GOAL_VIOLATION(4),
  @JsonResponseField
  TOPIC_ANOMALY(5);

where the anomaly with the lowest number has the highest priority. Regarding the detection time, the anomaly which is detected first would be resolved first


Whenever anomalies are detected, Cruise Control provides the ability to notify the user regarding the detected anomalies using optional notifier classes.
The notification sent by these classes increases the visibility of the operations that are taken by Cruise Control.
The notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property.
Contributor

is anomaly.notifier.class specified under spec.CruiseControl.config?

Contributor Author

Yes

The anomaly status changes based on how the healing is progressing.
The anomaly can transition to these possible states:
* DETECTED - The anomaly has just been detected and no action has been taken on it yet
* IGNORED - Self healing is either disabled or the anomaly is unfixable
Contributor

Also, the state will be IGNORED if self-healing is enabled but the notifier is set to the default NoopNotifier, right?

Contributor Author

yes, let me add it

These metrics can also provide a way for users to debug self-healing related issues.
For example, the `number_of_self_healing_started` metric gives users a way to check when self-healing was started for an anomaly, and how many self-healing actions were performed in the cluster.
This information can be useful to understand at what point in time self-healing occurred and the shape/replica assignment of the cluster was changed as a result.
The user can also use a metric such as `number_of_self_healing_failed` to find out if self-healing failed to start, and then check the logs for the reason behind the failure. One example could be that the load monitor was not ready at that point in time.
Contributor
Suggested change
The user can also use the metric like `number_of_self_healing_failed` to know if the self-healing failed to start to understand, and then they can check the reasons in the log for the failure. One example could be load monitor was ready at that point of time.
The user can also use the metric such as `number_of_self_healing_failed` to find out if the self-healing failed to start, so that they can check the reasons in the log for the failure. One example could be if load monitor was ready at that point of time.


### Proposal for enabling self-healing in Strimzi

In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specify a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
Contributor
Suggested change
In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specify a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.

If the user forgets to set the notifier property, then the `KubernetesEventNotifier` (a custom notifier under the Strimzi org) will be used by default, and the user will also get a warning in the Kafka CR about the missing `anomaly.notifier.class` property.
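
To make the role of this custom notifier more concrete, below is a minimal, hypothetical sketch of what a `KubernetesEventNotifier` could look like. It is not the proposed implementation: it assumes Cruise Control's `SelfHealingNotifier` base class and its `alert(...)` callback (exact package names and signatures depend on the Cruise Control version), and `emitKubernetesEvent` is a placeholder helper, not an existing API.

```java
// Assumed imports; exact package names depend on the Cruise Control version.
import com.linkedin.cruisecontrol.detector.Anomaly;
import com.linkedin.cruisecontrol.detector.AnomalyType;
import com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier;

/**
 * Sketch only: extends Cruise Control's SelfHealingNotifier so that every anomaly
 * alert also produces a Kubernetes Event, while the self-healing decision itself
 * is still delegated to the base class.
 */
public class KubernetesEventNotifier extends SelfHealingNotifier {

    @Override
    public void alert(Anomaly anomaly, boolean autoFixTriggered, long selfHealingStartTime, AnomalyType anomalyType) {
        // Keep the base behaviour (logging and self-healing bookkeeping).
        super.alert(anomaly, autoFixTriggered, selfHealingStartTime, anomalyType);
        // Placeholder for the actual Kubernetes interaction, e.g. creating an Event
        // in the namespace of the Kafka cluster via a Kubernetes client.
        emitKubernetesEvent(anomalyType.toString(), anomaly.toString(), autoFixTriggered);
    }

    // Hypothetical helper, shown only to indicate where the Event would be emitted.
    private void emitKubernetesEvent(String anomalyType, String anomalyDescription, boolean autoFixTriggered) {
        // ... build and create a Kubernetes Event resource here ...
    }
}
```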
Contributor
So to clarify, currently if self-healing is enabled with CC, it uses the NoopNotifier by default, which ignores anomalies. With this proposal, Strimzi will always set the notifier to KubernetesEventNotifier, which will issue the warning if the user has not specified a notifier class?

@ShubhamRwt
Contributor Author
ShubhamRwt commented May 29, 2025
Hi @scholzj, I tried to answer your questions below. I hope it makes things clearer in terms of why a notifier is necessary, as well as other future improvements to the feature.

Question:
I still don't understand the StrimziNotifier. It does not seem to be really useful. Why do you think anyone wants to use Kubernetes Events to be notified about something like this? But probably even more important - it seems to have absolutely no role in this proposal and I would suggest to lift it out into a separate proposal if you really think this makes sense for any user.

Having our own notifier helps us in various ways:

  1. If the user doesn’t use any Slack, MS Teams or Alerta accounts, then this will provide them a good cloud-native way to get events regarding the detected anomalies.

  2. For future integrations, like writing our own runnables for fixing the anomalies, we will need a notifier, since the notifier decides what action to take (what runnables to trigger) on anomalies. Having our own implementation makes it easier to add that functionality in the future.

The notifier plays an important role in the proposal, as it is important to notify the users about the detected anomalies, and also to handle cases where a user enables self-healing but forgets to set a notifier. If we don’t have any notifier set, then basically all the detected anomalies would be ignored, so having our own default notifier implementation gives users a better experience: they just need to enable self-healing and don’t have to worry about other properties. From a future perspective as well, to have a more operator-driven approach instead of CC doing everything, we can use our notifier to execute our own custom runnable classes.

Question:
It provides an approach completely inconsistent with the auto-rebalancing-on-scaling feature
It does not seem to have any considerations towards the future steps and integration.
For example:
How to use the Cruise Control goal violations to feed auto-scaling
How to use the Cruise Control goal violations to tell the operator to fix a disk or broken node

All the interactions that we currently do with Cruise Control are manually triggered, and the fixes are always driven by the operator. Taking the case of auto-rebalancing on scaling, it is something driven by the operator, since the operator knows that scaling (up/down) is happening within the cluster and therefore triggers the rebalance. With self-healing the approach is different: it is driven by Cruise Control and therefore differs from our previous integrations.

If we want to feed auto-scaling or other operator actions from CC, we can use our own notifier to annotate the Kafka CR (or use a ConfigMap or similar). We can even use our own runnable classes if we have the notifier. Using the stated approaches we can involve the operator in the process and drive the solution accordingly. The KubeEventNotifier for now only emits events regarding the detected anomaly, but we can later improve it by adding more features, e.g. driving the fix for certain anomalies based upon our own runnable classes.

Question:
It does not mention any way to allow the auto-rebalancing only in certain time windows which is IMHO very important for a feature like this. Or does CC have some configuration option for this?

There is no configuration in CC today which can help us with this. But if we use our own notifier, then we can make use of an annotation (on the Kafka CR) to pause self-healing (by no-oping all anomalies) for the duration of any maintenance window. This was actually part of an earlier version of this proposal, but was removed because we decided that users can disable self-healing if they know that some big manual rolls/rebalancing is going to happen in the cluster. They can re-enable the feature once the roll/rebalance is complete.

@scholzj
Copy link
Member

scholzj commented May 29, 2025

Having our own notifier helps us in various ways:

  1. If the user doesn’t use any Slack, MS Teams or Alerta accounts, then this will provide them a good cloud-native way to get events regarding the detected anomalies.

Just because someone does not use Slack or MS Teams doesn't mean that using Kubernetes Events is a good solution. So this is not a good reason to have it. There is also nothing cloud-native about it. You are just using Kubernetes Events for something they were not really designed for, and which is not a good alerting mechanism.

  2. For future integrations, like writing our own runnables for fixing the anomalies, we will need a notifier, since the notifier decides what action to take (what runnables to trigger) on anomalies. Having our own implementation makes it easier to add that functionality in the future.

Well, as I said in my comments, this is a reason not to have this right now. You do not know what the future integration will look like, so you have no idea how easy or hard it will be to avoid breaking compatibility. So it is one more reason to remove the notifier.

The notifier plays an important role in the proposal, as it is important to notify the users about the detected anomalies, and also to handle cases where a user enables self-healing but forgets to set a notifier. If we don’t have any notifier set, then basically all the detected anomalies would be ignored, so having our own default notifier implementation gives users a better experience: they just need to enable self-healing and don’t have to worry about other properties. From a future perspective as well, to have a more operator-driven approach instead of CC doing everything, we can use our notifier to execute our own custom runnable classes.

Sorry, but no. If the user forgets to set the Slack notifier, will they somehow miraculously remember to check Kubernetes events? Also, there is no relationship to the operator, so there is nothing operator-driven about this. So I do not understand this argument and why having some Kubernetes Events is important for the self-healing.

All the interactions that we currently do with Cruise Control are manually triggered, and the fixes are always driven by the operator. Taking the case of auto-rebalancing on scaling, it is something driven by the operator, since the operator knows that scaling (up/down) is happening within the cluster and therefore triggers the rebalance. With self-healing the approach is different: it is driven by Cruise Control and therefore differs from our previous integrations.

Auto-rebalancing is not manually triggered. It is driven fully by the operator. So it is not true that all interactions are manually triggered. The self-healing as proposed here is just a fancy name for auto-rebalancing triggered by some metric inside Cruise Control. So I would expect it to be easy to have it driven by an alternative implementation, driven through the operator similarly to how the auto-rebalancing at scaling is. In the end, just a few lines above you talk about driving things through the operator using the Strimzi notifier in the future as well. So it clearly can be passed through some notifier to Strimzi to have Strimzi trigger the auto-rebalance.

If we want to feed auto-scaling or other operator actions from CC, we can use our own notifier to annotate the Kafka CR (or use a ConfigMap or similar). We can even use our own runnable classes if we have the notifier. Using the stated approaches we can involve the operator in the process and drive the solution accordingly. The KubeEventNotifier for now only emits events regarding the detected anomaly, but we can later improve it by adding more features, e.g. driving the fix for certain anomalies based upon our own runnable classes.

Or you can first understand how to do the integration before dealing with some events with dubious value to make sure it does not cause problems in the future.

There is no configuration in CC today which can help us with this. But if we use our own notifier, then we can make use of an annotation (on the Kafka CR) to pause self-healing (by no-oping all anomalies) for the duration of any maintenance window. This was actually part of an earlier version of this proposal, but was removed because we decided that users can disable self-healing if they know that some big manual rolls/rebalancing is going to happen in the cluster. They can re-enable the feature once the roll/rebalance is complete.

We already have maintenance time windows in Strimzi. That is what needs to be plugged in. Not some pause self-healing annotation (although there might be other use-cases for such an annotation).
