Proposal for adding Self healing feature in the operator #145


Open · wants to merge 6 commits into main

Conversation

ShubhamRwt
Contributor

This proposal shows how we plan to implement the self-healing feature in the operator.

Member

@scholzj scholzj left a comment

Thanks for the proposal.

I think it would be useful to use regular human language here and not the Cruise Control language, as that would bring a better perspective on what exactly is being proposed here and what is not being proposed here (and what the self-healing actually does and does not do).

I do not understand the role of the StrimziNotifier. While I doubt that anyone will find it appealing to have Kubernetes events used for the notification instead of Slack or Prometheus, I cannot of course rule out that this will find its users. But it seems to be only marginally related to Strimzi. I'm sure we can host this under the Strimzi GitHub organization if desired, but it seems more like an unrelated subproject that is separate from the self-healing feature itself.

The actual self-healing ... you focus a lot on the disk and broker failures. But it seems to me like the self-healing does not offer good solutions for this. The idea to solve this by rescheduling the partition replicas to other brokers seems to be well suited for Kafka on bare-metal with local storage, where a dead server often means that you have to get a brand new server to replace it. It does not seem to be well suited to self-healing Kubernetes infrastructure and to cloud-based environments, where Pods can be just rescheduled and a new disk can be provisioned in a matter of seconds.

So it seems to me like the self-healing itself as done by Cruise Control could be useful for dealing with imbalanced brokers. But for situations such as broker or disk failures or insufficient capacity, we should be more ambitious and aim to solve them properly in the Strimzi Cluster Operator. That is where an actual StrimziNotifier might play a future role: not issuing Kubernetes Events to tell the user about a broker or disk failure, but providing a mechanism to tell Strimzi about it so that Strimzi can react to it - e.g. by scaling based on a capacity recommendation from CC, or, as suggested, deleting the broken PVC and getting a new one provisioned to replace it. While I understand that this might be out of scope for this proposal, I think it is important to leave space for it. For example by focusing here on the auto-rebalancing features and leaving the other detected issues unused for future development.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
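For reference, this is roughly what that manual fix looks like today; a minimal sketch assuming a Kafka cluster named `my-cluster` (the broker id is only illustrative):

```
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: remove-failed-broker
  labels:
    strimzi.io/cluster: my-cluster
spec:
  # move partition replicas off broker 3 so it can be removed
  mode: remove-brokers
  brokers: [3]
```
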
Member

This is not some bare-metal cluster but Kubernetes. So I'm struggling to find the issues where you would need to do what you suggest. Instead, on Kubernetes, you would normally just fix the issue on the existing broker.

Member

I can see your point. At the Kubernetes level, a broker failure is maybe more related to a lack of resources, which you should fix at the cluster level and not by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).


During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issue on your own.
Member

I think you should clarify what issues you are talking about here. From the past discussion, the terminology used here seems highly confusing ... and I suspect that words such as issues, anomalies, and problems are ambiguous at best.

Member

For me they are all synonyms more or less. Maybe just use one and stick with it?


## Motivation

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Member

Nothing in this proposal suggests that Cruise Control can solve disk or broker failures.

Member

AFAICS, the proposal introduces the concept of broker/disk failure detectors and the related fix classes, which highlights that CC can solve them. Or what are you looking for?


With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>).
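As an illustration of the existing mechanism, the notifier is selected through the `anomaly.notifier.class` option in `spec.cruiseControl.config` of the `Kafka` resource; a minimal sketch using the stock Cruise Control `SelfHealingNotifier` as the example class:

```
spec:
  cruiseControl:
    config:
      # class the anomaly detector calls to notify about (and optionally heal) anomalies
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
```
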
Member

The link seems broken? (Double ( and )?)

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
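For illustration, the detection interval mentioned above is tuned through the same Cruise Control config block; the 5-minute value below is only an example:

```
spec:
  cruiseControl:
    config:
      # run the periodic anomaly detectors every 5 minutes (illustrative value)
      anomaly.detection.interval.ms: 300000
```
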
Member

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

Contributor

I'm not sure if we need this level of details. Maybe just link to the doc of these detectors? If not, maybe reword the sentence to something like this:

Suggested change
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` uses metrics while other detector such as `GoalViolationDetector` and `TopicAnomalyDetector` use a different mechanism to detect anomalies.

Maybe we should mention briefly how GoalViolationDetector and TopicAnomalyDetector detect anomalies like these others too.

Member

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?


The `StrimziNotifier` class is going to reside in a separate module in the operator repository called the `strimzi-cruise-control-notifier` so that a separate jar for the notifier can be built for inclusion in the Cruise Control container image.
We will also need to include the fabric8 client dependencies in the notifier JAR that will be used for generating the Kubernetes events as well as for reading the annotation on the Kafka resource from the notifier.
Member

Reading what annotation?

Member

Yeah, I think you are anticipating what's going to be in the next paragraph but this sentence doesn't make any sense here.

Comment on lines 185 to 190
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
self-healing is not paused by default so missing annotation would mean that anomalies detected will keep getting healed automatically.
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them.
The anomalies that were already getting fixed would continue with the fix.
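A minimal sketch of how the proposed annotation could look on the `Kafka` resource; the string value "true" is an assumption, since the proposal does not spell out the expected value:

```
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    # proposed annotation; the value format is assumed here
    strimzi.io/self-healing-paused: "true"
```
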
Member

Why is pausing the self-healing useful for the users? Assuming it is useful, why not just disable the self-healing?

Member

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod for example).


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member

What exactly does it mean when the broker fails? How does CC recognize it? How does it distinguish failure for example from some temporary Kubernetes scheduling issue?

Member

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away but is first delayed in order to double check "later" whether the broker is still offline or not. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and no fix will be attempted for it.

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
Member

How would it finish the transfer from the failed broker when the target broker is stopped? You mean that it will continue with it after the target broker is rolled?

Contributor Author

Yes, if the target broker is rolling at that moment, then self-healing would end saying `self-healing finished successful with error`. Then the periodic anomaly check will again detect the failed broker and try to fix it. If the target broker has been rolled by that time then self-healing will start again; otherwise, if the rolled broker is not up yet, we will get two anomalies: one for the failed broker and one for the target broker.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
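For illustration only, the resulting `KafkaRebalance` status could look roughly like the sketch below; the `reason` and `message` values are assumptions and not captured operator output:

```
status:
  conditions:
    - type: NotReady
      status: "True"
      # illustrative values, not actual operator output
      reason: CruiseControlRestException
      message: There is already an ongoing execution in Cruise Control
```
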
Member

Does this need to differentiate between asking for a proposal and approving a proposal? For example when asking for a proposal, I would expect the operator to wait and give me a proposal after the self-healing ends.

Member

From the following snippet I can see dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you and you were able to approve it and then got the error?

Member

@ShubhamRwt any answer here?

Contributor Author

Yes, in the middle of self-healing I was able to generate the proposal and when I approved it I got the error

Contributor Author

I pushed a section around this in the proposal


#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
Contributor

Suggested change
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
The anomaly detector manager helps in detecting the anomalies as well as handling them.

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
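For illustration, a sketch of how the `self.healing.goals` option mentioned in this bullet would sit next to the other Cruise Control options; the single goal listed is just an example:

```
spec:
  cruiseControl:
    config:
      self.healing.enabled: true
      # goals checked by the GoalViolationDetector and used for the fix (example value)
      self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
```
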
Contributor

So you mean the goals in KafkaRebalance and the goals in self.healing.goals are independent of each other?

Contributor Author

if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

Member

if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

If by KR you mean the KafkaRebalance custom resource, I don't think what you are saying is true. There is no relationship between a specific KafkaRebalance and self-healing configuration.
Maybe you are referring to the general CC configuration? So the default hard goals in CC have to be a superset of self.healing.goals (and not the other way around)?

The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Contributor

Suggested change
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Anomaly detection also has various [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can affect the performance of the Cruise Control server and the latency of the anomaly detection.

Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like Slack Notifier, Alerta Notifier etc. for notifying users regarding the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to make notifiers more efficient as per the use case.
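As an example of the SelfHealingNotifier tuning mentioned above, the broker-failure thresholds can be set alongside the rest of the Cruise Control configuration; the values below are illustrative, not recommended defaults:

```
spec:
  cruiseControl:
    config:
      # how long a broker must be offline before an alert is raised (illustrative)
      broker.failure.alert.threshold.ms: 900000
      # how long before self-healing starts moving replicas off the broker (illustrative)
      broker.failure.self.healing.threshold.ms: 1800000
```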

#### Self Healing
If self-healing is enabled in Cruise Control and anomalies are detected, then a fix for these will be attempted automatically.
Contributor

A fix will be attempted automatically only if anomaly.notifier.class is configured to a non-default notifier, right? So enabling self-healing alone is not enough, as with the default notifier (NoopNotifier), anomalies are detected but ignored?

Contributor Author

Yes, you will need to override the notifier too

The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible status:
* DETECTED - The anomaly is just detected and no action is yet taken on it
* IGNORED - Self healing is either disabled or the anomaly is unfixable
* FIX_STARTED - The anomaly has started getting fixed
Contributor

@tinaselenge tinaselenge Jan 14, 2025

Suggested change
* FIX_STARTED - The anomaly has started getting fixed
* FIX_STARTED - The anomaly fix has started

* FIX_STARTED - The anomaly has started getting fixed
* FIX_FAILED_TO_START - The proposal generation for the anomaly fix failed
* CHECK_WITH_DELAY - The anomaly fix is delayed
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
Contributor

Suggested change
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
* LOAD_MONITOR_NOT_READY - Monitor for workload of a Kafka cluster is in loading or bootstrap state

```
self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
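A sketch of what enabling only selected anomaly types could look like, using the two options named above:

```
spec:
  cruiseControl:
    config:
      # heal only broker failures and goal violations
      self.healing.broker.failure.enabled: true
      self.healing.goal.violation.enabled: true
```
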
Contributor

Are there any anomalies enabled by default? If the user does not specify anomalies to detect, will self-healing do nothing?

Contributor Author

@ShubhamRwt ShubhamRwt Jan 15, 2025

If the user doesn't specify any specific anomalies (i.e. you use just self.healing.enabled: true) then self-healing would detect all the anomalies. In case the user knows that some anomaly will keep appearing because they have an issue in their cluster, they can skip the check for that particular anomaly.

Member

If the user doesn't specify any specific anomalies (i.e. you use just self.healing.enabled: true) then self-healing would detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster, so having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.

The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Contributor

Does the delay action apply a configured delay before taking the action provided by the notifier?

Member

Yes. Even if the delay is mostly used only for broker failures and not for other anomalies AFAIK.

Member

@ppatierno ppatierno left a comment

I left some comments. Can you go through the proposal and split paragraphs with one sentence per line? Thanks!

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
Member

I can see your point. At the Kubernetes level, a broker failure is maybe more related to a lack of resources, which you should fix at the cluster level and not by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).


During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issue on your own.
Member

For me they are all synonyms more or less. Maybe just use one and stick with it?


## Motivation

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Member

AFAICS, the proposal introduces the concept of broker/disk failure detectors and the related fix classes, which highlights that CC can solve them. Or what are you looking for?


With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>).
Member

This is the first time you mention the term "notifier" but it shows up out of context. You should first explain what it is and why it's useful in general, not related to the self-healing but more to the alerting.

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Member

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?

### Pausing the self-healing feature

The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
Member

Suggested change
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
For this, we will be having an annotation called `strimzi.io/self-healing-paused` on the `Kafka` custom resource.

Comment on lines 185 to 190
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
self-healing is not paused by default so missing annotation would mean that anomalies detected will keep getting healed automatically.
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them.
The anomalies that were already getting fixed would continue with the fix.
Member

Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod for example).

```
uid: 71928731-df71-4b9f-ac77-68961ac25a2f
```

We will have checks on methods like `onGoalViolation`, `onBrokerFailure`, `onTopicAnomaly` etc. in the `StrimziNotifier` to check if the `pause` annotation is applied on the Kafka custom resource or not.
Member

There is no "pause" annotation. Use the exact name.


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Member

The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away but is first delayed in order to double check "later" whether the broker is still offline or not. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and no fix will be attempted for it.


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

From the following snippet I can see dryrun=false, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you and you were able to approve it and then got the error?


The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Member

Suggested change
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
Member

Suggested change
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) from those used for manual rebalancing.

The nuances between the goal configurations of Cruise Control can be quite confusing; to help keep them straight in my head I categorize them in the following way:

  • The server-side Cruise Control goal configs in spec.cruiseControl.config of the Kafka resource
  • The client-side Cruise Control goal config in spec of the KafkaRebalance resource

To avoid confusion, it may be worth adding some text that explains where and how the self.healing.goals compares/relates to the other server-side Cruise Control goal configs - goals, default.goals, hard.goals, anomaly.detection.goals. It wouldn't have to be right here, but somewhere in the proposal where the self.healing configuration is raised. When we add the self-healing feature we will need to add some documentation of how self.healing.goals relates to the other goal configurations anyway. We could use the same language used to describe the different goal configs in the Strimzi documentation and provide a link to those docs as well.

I think the language you and Paolo use in this thread referring to other goal configs like "superset" or subset is helpful. Since the goal configs are so complicated, I don't think it is a problem if we reuse text from the Strimzi Cruise Control goal documentation to emphasize the distinction of the self.healing.goals config.
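To make the comparison concrete, a sketch of how the server-side goal configs sit next to each other in `spec.cruiseControl.config`; the goal lists are shortened to a single class each and are purely illustrative:

```
spec:
  cruiseControl:
    config:
      # goals used when generating regular optimization proposals
      default.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal
      # goals that a proposal must never violate
      hard.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal
      # goals whose violation is reported as an anomaly
      anomaly.detection.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
      # goals used when self-healing generates its fix
      self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
```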

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
Member

Suggested change
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.


The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
Member

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.


The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
Member

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.


### Enabling self-healing in the Kafka custom resource

When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
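For illustration, Strimzi feature gates are toggled on the Cluster Operator through the `STRIMZI_FEATURE_GATES` environment variable; the gate name below is taken from the suggestion that follows and is not final:

```
# Cluster Operator Deployment (fragment)
env:
  - name: STRIMZI_FEATURE_GATES
    value: "+UseSelfHealing"
```
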
Member

Suggested change
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
When the `UseSelfHealing` feature gate is enabled, users will be able to activate self-healing by configuring the [`self.healing` Cruise Control configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) in the `spec.cruiseControl.config` of their `Kafka` resource.
When the `UseSelfHealing` feature gate is disabled, users will not be able to activate self healing and all `self.healing` Cruise Control configurations will be ignored.

```
self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
Member

If the user doesn't specify any specific anomalies (i.e. you use just self.healing.enabled: true) then self-healing would detect all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster, so having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.


Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Member

Suggested change
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.

`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
The `SelfHealingNotifier` class can then be further used to implement your own notifier
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
Member

Suggested change
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
The other custom notifiers offered by Cruise control are also based upon the `SelfHealingNotifier`.


### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Member

Suggested change
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.

Signed-off-by: ShubhamRwt <[email protected]>

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
If some anomaly is getting healed , Cruise Control will then try to fix/mitigate the anomaly by moving the partition replicas across the brokers in the Kafka cluster.
Contributor

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.

The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section) to those used for manual rebalancing(goals configured in the KafkaRebalance custom resource).
Contributor

I think this part confuses me slightly; maybe it can be reworded better.
Is the below what you mean?

  • Goal Violation - This happens if certain optimization goals such as DiskUsageDistributionGoal are violated. The optimization goals can be configured independently from KafkaRebalance custom resource by using spec.cruiseControl.config.self.healing.goals in Kafka custom resource.

When you say certain optimization goals, are you saying only some would trigger a Goal Violation anomaly?


Yes, AFAIK it would only be those goals specifically listed in the self-healing goals configuration.

Contributor Author

I tried to make the sentence less confusing, I hope it's better now.

```
statusUpdateDate=2024-12-09T09:49:47Z,
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
```
where the `anomalyID` represents the task id assigned to the anomaly, `failedDisksByTimeMs` represents the disk which failed, `detectionDate` and `statusUpdateDate` represent the date and time when the anomaly was detected and the status of the anomaly was updated.
Contributor

I don't see the mentioned anomalyType field snippet. Is this a full snippet of an anomaly?

I guess not that important, since this is just an example, but should we explain the value of failedDisksByTimeMs? {2={/var/lib/kafka/data-1/kafka-log2=1733737786603}} - what these values mean and how they represent the failed disk? If you think this is not important, please feel free to ignore this part.

Contributor Author

I have added some explanation around it

* COMPLETENESS_NOT_READY - Completeness not met for some goals. The completeness describes the confidence level of the data in the metric sample aggregator. If this is low then some Goal classes may refuse to produce proposals.

We can also poll the `state` endpoint, with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
We can again poll the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control), with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
Contributor

Will we describe how users might poll this endpoint?

Contributor Author

By this you mean mentioning the complete curl command?

Contributor Author

I have added the curl command to poll the endpoint

Signed-off-by: ShubhamRwt <[email protected]>
Member

@ppatierno ppatierno left a comment

@ShubhamRwt I had another pass but I think the next one should be with a proposal around metrics as another option instead of using the Kubernetes events notifier.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource in the [`rebalance`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
Member

There is no rebalance mode. Did you mean full? Anyway I would not specify it, because it's possible you could even decide to scale up the cluster when doing the rebalancing so add-brokers would be needed.

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` and `GoalViolationDetector` uses metrics to detect the anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the KafkaRebalance custom resource.
Member

Suggested change
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the KafkaRebalance custom resource.
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the `KafkaRebalance` custom resource.

The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Member

Yes. Even if the delay is mostly used only for broker failures and not for other anomalies AFAIK.

* CHECK - Delay the anomaly fix
* IGNORE - Ignore the anomaly fix

The default `NoopNotifier` always sets the notifier action to `IGNORE`, which means that the detected anomaly will be silently ignored.
Member

I would also add that it doesn't send any notification to the user in any way as well.

```
type: Normal
```

The users can build a jar for the notifier for inclusion in the Kafka image to run on as part of Cruise Control.
Member

The KubernetesEventsNotifier should be included out of the box within the Kafka Strimzi image in order to be configured by the operator. It's not clear if you are referring to a custom notifier that a user would write.


### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.
Member

Not sure what the purpose of this single-sentence paragraph is.

If the deleted or rolled broker comes back before the periodic anomaly check interval, then self-healing would start again only for the earlier failed broker with a new anomalyId since it was not fixed properly.
In case the deleted broker doesn't come back healthy within the periodic anomaly check interval then another new anomaly for the deleted broker will be created and the earlier detected failed broker would also persist but will be updated with a new anomalyId.

#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is approved
Member

Suggested change
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is approved
#### What happens when self-healing is running/fixing an anomaly and KafkaRebalance is approved


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

@ShubhamRwt any answer here?


## Affected/not affected projects

A new repository named `kubernetes-event-notifier` will be added under the Strimzi organisation.
Member

There is also the cluster operator to be affected because we need to allow enabling self-healing, or?

observedGeneration: 1
```

#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal
Contributor Author

@ppatierno this section talks about what happens if we request rebalance in middle of self healing

Member

@ppatierno ppatierno left a comment

I had another pass. Left some comments and made some cosmetic changes (it seems you don't like spaces and new lines :-P).


The above flow diagram depicts the self-healing process.
The anomaly detector manager detects the anomaly (using the detector classes) and forwards it to the notifier.
The notifier then alerts the user about the detected anomaly through the logs and also returns what action needs to be taken on the anomaly, i.e. whether to fix it, ignore it or delay it.
Member

Actually it depends on the notifier, the anomaly is not just logged. You can have Slack, MS Teams or anything else, with a dedicated notifier, to get a notification about the anomalies.

Each detected anomaly is implemented by a specific class which leverages a corresponding other class to run a fix.
For example, the `GoalViolations` class uses the `RebalanceRunnable` class, the `DiskFailure` class uses the `RemoveDisksRunnable` class and so on.
An optimization proposal will then be generated by these `Runnable` classes and that proposal will be applied on the cluster to fix the anomaly.
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
Member

Suggested change
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware (insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the Cruise Control logs will be updated with `self-healing is not possible due to unfixable goals` warning.

An optimization proposal will then be generated by these `Runnable` classes and that proposal will be applied on the cluster to fix the anomaly.
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
When self-healing starts, you can check the anomaly status by querying the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control) with `substate` parameter set as `anomaly_detector`. This would provide the anomaly details like anomalyId, the type of anomaly and the status of the anomaly etc.
For example, the user can query the endpoint using the `curl` command:
Member

Suggested change
For example, the user can query the endpoint using the `curl` command:
For example, the user can query the endpoint using the `curl` command:

```shell
curl -vv -X GET "http://localhost:9090/kafkacruisecontrol/state?substates=anomaly_detector"
```
the detected anomaly would look something like this:
Member

Suggested change
the detected anomaly would look something like this:
the detected anomaly would look something like this:

failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
```
where,
Member

Suggested change
where,
where:

### Self-healing configuration in the Strimzi operator

Users can activate self-healing by setting the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
If the user don't set the notifier property then `NoopNotifer` will be used by default which means all the action on the anomalies would be set to `IGNORE`.
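For illustration, a minimal sketch of what such a configuration could look like in the `Kafka` custom resource (assuming the `self.healing.*` and `anomaly.notifier.class` properties are allowed through `spec.cruiseControl.config`; the goal and notifier class names are only examples):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ... kafka and other sections omitted ...
  cruiseControl:
    config:
      # enable the Cruise Control self-healing mechanism
      self.healing.enabled: true
      # notifier used to alert about detected anomalies; without it the NoopNotifier
      # is used and every action on a detected anomaly is IGNORE
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
      # example goal whose violation should trigger self-healing
      self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
```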
Member

I would so make clear that the self.healing.enabled is not enough by itself then, because you also need to set a proper notifier (instead of the default NoopNotifier).


Suggested change
If the user don't set the notifier property then `NoopNotifer` will be used by default which means all the action on the anomalies would be set to `IGNORE`.
However, if the user does not set the notifier class property then the `NoopNotifer` will be used by default.
This means all the action on the detected anomalies will be set to `IGNORE` and self-healing is effectively disabled even if the `self.healing.enabled` property is set to `true`.


So given the two sentences above, what are you going to do about it? If a user sets self healing to true and not the notifier, what happens? Should there be a warning? Should the Kafka CR show an error condition, should it be set to NotReady (as that is clearly a mis-configuration)?


As I have said before, in my opinion we should prevent this situation happening by just setting up the notifier for the user (if they don't specify one themselves).

If the user doesn't specify a notifier we should think about what makes sense in that situation. Would an auto-approve notifier make sense?

Contributor Author

But wouldn't that mean that we will then have to create another new notifier? Isn't just specifying it enough, or would a warning in the Kafka CR maybe be enough?

Member

I guess this discussion is clashing with the other one where we are thinking of setting the Kube events notifier as default, or?

Contributor Author

Yes, for now I have updated the proposal saying that if no notifier is specified then kube events notifier will be selected as default but i wonder if we should do that? Shouldn't we just create a warning/error in Kafka CR saying that specifying the notifier is must or maybe setting the SelfHealingNotifer as Default so that self healing still works and also mentioning in the Kafka CR?

Member

The SelfHealingNotifier has an implementation of the alert method which just logs in the CC pod, which is of course not so Kube friendly or at least it means a user should take a look at the log. I think in our Kube env, it would be better having something more "cloud-native friendly".

Self-healing corresponding to broker failure or disk failure is disabled by default since we have better ways to fix these anomalies instead of moving the partition replicas across the brokers.
For example, broker failures are generally due to a lack of resources, so we can fix the issue at the cluster level; in the same way, failed disks can easily be replaced with new ones, so self-healing in these cases can be more resource consuming.

Here is an example of how the configured `Kafka` custom resource could look:
Member

Suggested change
Here is an example of how the configured `Kafka` custom resource could look:
Here is an example of how the configured `Kafka` custom resource could look:


Suggested change
Here is an example of how the configured `Kafka` custom resource could look:
Here is an example of what the configured `Kafka` custom resource could look:

These metrics can also provide a way for the users to debug self-healing related issues.
For example, the `number_of_self_healing_started` metric provides users with a way to check when self-healing was started for an anomaly, and the number of self-healing actions that were performed in the cluster.
This information can be useful to know at what point in time the self-healing occurred and when the shape/replica assignment of the cluster was changed due to self-healing.
The user can also use a metric like `number_of_self_healing_failed` to know if the self-healing failed to start, and then they can check the reasons for the failure in the log. One example could be the load monitor not being ready at that point of time.
Member

AFAIU all the above metrics are historical ones but they don't give any information about what's going on now.
I made this comment on the CC related PR as well, so I was wondering if we need to expose metrics about self-healing actually running.

Contributor Author

I don't see any existing metrics like that yet in CC. There are metrics which can tell that there is inter/intra broker replica movement but I am not sure how we can expose a metric to say when self healing is running. After talking with Tom, there is a way we can know if self healing is still running: we can take the sum(self healing started) minus sum(self healing ended) and that can tell if the self healing is still in progress

Member

But then it's something we should explain in the proposals and maybe providing a Grafana dashboard with such information?
Do we have, at least, the metric about when the last fix started?

Contributor Author

@ShubhamRwt ShubhamRwt May 5, 2025

If you are asking about the existing metrics then you can still use the number_of_self_healing_started metric and use the promql query to get the timestamp for the last change in the gauge -> last_over_time(timestamp(changes(kafka_cruisecontrol_anomalydetector_num_of_self_healing_started{}[<time-range>]) > 0

Contributor Author

If we want it based on the anomaly, we can do the same with the new metrics num_of_self_healing_started_for_<anomaly-type>_count

Member

Ok so I think we should list this info in the proposal.

Contributor Author

Sure

Contributor Author

@ShubhamRwt ShubhamRwt May 5, 2025

Currently we just have a single metric called number_of_self_healing_started to know that self healing was started for fixing an anomaly (this metric doesn't classify anomalies by type) and there is no metric regarding when self healing ended. The PR opened on CC upstream introduces new metrics -> num-of-self-healing-started-for-<anomalyType> and num-of-self-healing-ended-for-<anomalyType> which are both classified based on the anomaly types. But at the same time it doesn't add any metric like number_of_self_healing_ended which could provide us an overall count of how many self healing runs ended.

#### What happens when self-healing is running/fixing an anomaly and KafkaRebalance is approved

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
For example:
Member

Suggested change
For example:
For example:


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal

If self-healing is ongoing and a `KafkaRebalance` resource is applied in the middle of it then Cruise Control will still generate an optimization proposal. The proposal should be based on the cluster shape at that particular timestamp.
Member

How does this sentence cope with the previous one where you say the KafkaRebalance ends in a NotReady state. I am confused. Will it be not ready or will it have a proposal?


Yeah, I am also a bit confused. IMO this section should be merged with the one above. You need to cover all the possible situations when self healing is running and describe what happens in each case:

  • Brand new Kafka Rebalance created
  • Refresh an existing rebalance
  • Approve a rebalance


You also need to cover what happens if a manual rebalance is ongoing and self healing is enabled. Does the manual rebalance get cancelled, does the anomaly get ignored or delayed?

@tomncooper tomncooper left a comment

I had another pass. My main comments are:

  • You need to make sure the structure of doc is logical. Start with a description of the problem, the background including an overview of how CC anomaly detection and self healing works, what monitoring and metrics are available currently and then move on to state clearly what you are proposing to change. At the moment it is hard to tell what is just background context and what are concrete proposed changes.
  • IMO we need to make it simple to enable/disable self healing. User should set the self.healing.enabled to true and if they don't set the notifier we should set it for them.
  • The kube events notifier should then be the default (we know the user has K8s for sure). But that leads on to how it will decide to approve/ignore/delay fixes, you don't discuss that at all?
  • You need to cover all the ways that manual rebalances and self-healing can interact.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource.


Suggested change
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might instruct Cruise Control to move the partition replicas across the brokers in order to fix the violated goal by using the `KafkaRebalance` custom resource.


## Motivation

With the help of the self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.


Suggested change
With the help of the self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.
With the help of Cruise Control's self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.

Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster.
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc).
However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource.


Suggested change
However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource.
However, self-healing is currently disabled and disallowed in Strimzi.
Any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.


We might want to elaborate on why we disallowed it originally and why we no longer think that it is warranted any more.

![self-healing flow diagram](./images/090-self-healing-flow.png)

The above flow diagram depicts the self-healing process.
The Anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.


Suggested change
The Anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.

#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies as well as handling them.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.


Suggested change
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies.

Cruise Control provides the `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and overrides those methods by implementing the `AnomalyNotifier` interface.
Some of those methods are: `onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.
The `KubernetesEventsNotifier` is based of the `SelfHealingNotifier` class.


Suggested change
The `KubernetesEventsNotifier` is based of the `SelfHealingNotifier` class.
The `KubernetesEventsNotifier` will be based of the `SelfHealingNotifier` class.

For effective alerts and monitoring, it makes use of Kubernetes events to signal to users what operations Cruise Control is performing automatically.
The `KubernetesEventsNotifier` will override the `alert()` method of the `SelfHealingNotifier` and will trigger and publish events whenever an anomaly is detected.
These events would publish information regarding the anomaly like the anomalyID, the type of anomaly which was detected etc.
The events would help the user to understand the anomalies that are present, when the anomalies appeared in their cluster and also would help them know what is going on in their clusters.


So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?

Could be enabled by default if a user sets self healing enabled to true but doesn't specify a notifier class?

Member

So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?

I think the answer to this is still in my comment here #145 (comment) to explain how the base class for notifier works unless we want to write something from scratch (but I would argue about this).

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.
If, during the process of moving partitions replicas among the brokers in the Kafka cluster, some broker that was being used gets deleted or rolled, then Cruise Control would finish self-healing process and log that `self-healing finished successful with error`.
If the deleted or rolled broker comes back before the periodic anomaly check interval, then self-healing would start again only for the earlier failed broker with a new anomalyId since it was not fixed properly.
In case the deleted broker doesn't come back healthy within the periodic anomaly check interval then another new anomaly for the deleted broker will be created and the earlier detected failed broker would also persist but will be updated with a new anomalyId.


But didn't you say above that broker failures are disabled?



@ppatierno
Member

IMO we need to make it simple to enable/disable self healing. User should set the self.healing.enabled to true and if they don't set the notifier we should set it for them.

I agree with this.

The kube events notifier should then be the default (we know the user has K8s for sure). But that leads on to how it will decide to approve/ignore/delay fixes, you don't discuss that at all?

@tomncooper All the notifiers provided by Cruise Control are based on the SelfHealingNotifier [1] and if you go through it you can notice that any anomaly with self healing enabled (for each specific anomaly) is fixed (if it's fixable), otherwise it's ignored.
Of course, you can write the entire notifier logic from scratch (which was going to be the case when we started with the idea of giving the user the opportunity to drive the decision) but I think that for our use case it's fine going the CC provided way right now. I agree we should provide more details in the proposal.

[1] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/notifier/SelfHealingNotifier.java

@ShubhamRwt ShubhamRwt force-pushed the selfHealingFeature branch from cdb9225 to 73b7d83 on May 6, 2025 05:26
@ppatierno ppatierno requested review from Frawless and see-quick May 6, 2025 12:24
@ppatierno ppatierno requested review from im-konge and katheris May 6, 2025 12:24
@ppatierno
Member

We would like to move forward with this. I encourage @strimzi/maintainers who had already a pass to review again and for the others providing comments if they have any. Also @tinaselenge @kyguy I saw you left comments, could you have another pass please? Thanks everyone!

@tomncooper

@ppatierno @ShubhamRwt I think that wires have been crossed somewhere. It was my impression, stated multiple times, that if the user specifies self.healing.enabled: true but DOES NOT specify the notifier class, that the NOOP notifier is still used and all fixes are ignored.

Is that true? I have asked to confirm this several times, as it seems like it shouldn't be true? Does CC actually switch to the SelfHealingNotifier (auto-approve if fixable) in that case? Which would make much more sense and render several of my comments moot.

@ppatierno
Member

@ppatierno @ShubhamRwt I think that wires have been crossed somewhere. It was my impression, stated multiple times, that if the user specifies self.healing.enabled: true but DOES NOT specify the notifier class, that the NOOP notifier is still used and all fixes are ignored.

It's true.

Let me try to explain this based on the code (then @ShubhamRwt can confirm based on his testing).
Let's start from swapping the order between the notifier and the self.healing.enabled flag :-) ... I mean, the main part is setting the notifier class, then it's going to use the self healing flag.

Here [1] you can see where the anomaly detector manager is going to load the notifier class and it's defaulting to NOOP notifier if the field is not set .
The anomaly detector manager doesn't care of the self healing flag at all. It just loads the configured notifier or NOOP notifier (by default).
Dealing with the self healing flag is notifier business when it's called by the anomaly detector here [2].
The anomaly detector manager gets the info about the self healing flag just when registering the sensors for metrics but .... from where? Through the notifier [3] :-)

So if you look at the NOOP notifier or any other notifier implementation, the logic about reading the self healing flag to ignore or fix is there (in the notifier). Also the NOOP notifier is configured by setting all the self healing flag(s) to false [4]. Other notifiers take into account the real flag set by the user instead.

I hope this clarifies that the logic is inverted somehow: we don't go from self healing flag to notifier, but from notifier to self healing flag. The anomaly detector manager cares about the notifier only, which is in charge of using the self healing flag(s).

Hope this helps and clarifies everything!

[1] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java#L107

[2] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java#L436

[3] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/AnomalyDetectorManager.java#L190

[4] https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/notifier/NoopNotifier.java#L31

@ShubhamRwt
Contributor Author

@ppatierno Thanks for explaining the code part and yes it's true that the notifier will be kept as NoopNotifier in case we don't set the anomaly.detection.class property. When I tested this case multiple times, the anomalies detected were always ignored. @tomncooper I can maybe raise an issue in upstream and ask the maintainers if we can set the SelfHealingNotifier as default in case we enable self-healing but DO NOT set the notifier. I can also retest the scenario again plus ask the upstream maintainers if our deduction is correct

@ppatierno
Member

I can maybe raise a issue in upstream and ask the maintainers if we can set the SelfHealingNotifier as default in case we enable self-healing but DO NOT set the notifier

Not sure it makes any sense to set the SelfHealingNotifier as default because, even if it has some logic to fix or ignore, it doesn't send any alert when it's happening. It just logs warnings in CC.
Also as I explained the CC logic is kind of "inverted" in the anomaly detector manager: notifier has the primary role to be configured for the manager, then it uses the self healing flags. Anyway, we can have a separate conversation about if/how to improve CC from this pov. For now focusing on the proposal, we know what happens.

@tomncooper

Right, ok, I think I understand now. In which case my comments stand. If a user sets self.healing.enabled: true in the CC config and doesn't explicitly set the notifier class. Strimzi should, IMHO, set the notifier to the SelfHealingNotifier automatically.

Or, perhaps more usefully, sets it to the K8s events notifier (which extends the SelfHealingNotifier). Of course if the user set the notifier we don't touch it.

@ppatierno
Member

Strimzi should, IMHO, set the notifier to the SelfHealingNotifier automatically. Or, perhaps more usefully, sets it to the K8s events notifier (which extends the SelfHealingNotifier). Of course if the user set the notifier we don't touch it.

The difference would be how the user gets notified about anomalies. Using the base SelfHealingNotifier ... no notifications, just WARN logs in the Cruise Control log. With a Kube Events notifier, they will get ... of course Kube events :-)
Not sure whether WARN messages are enough. I would be for the Kube Events.

Member

@kyguy kyguy left a comment

Still working my way through, I'll add more ideas later.

# Self Healing using Cruise Control

This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator.
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically.
Member

Suggested change
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically.
When enabled, Cruise Control's self-healing feature automatically resolves issues detected by its Anomaly Detector Manager.
This includes problems such as broker failures, goal violations, and disk failures, reducing the need for manual intervention.

@@ -0,0 +1,300 @@
# Self Healing using Cruise Control

This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator.
Member

Suggested change
This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator.
This proposal is about adding support for Cruise Control's self-healing capabilities in the Strimzi operator.


## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Member

Suggested change
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Even under normal operation, it's common for Kafka clusters to encounter problems such as partition key skew leading to a uneven partition distribution, or hardware issues like as disk failures, which can degrade overall cluster health and performance.

The detection of anomalies is the responsibility of the anomaly detector mechanism.
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster.
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc.).
Member

Wouldn't the information concerning the notifier feature be more relevant in the proposal section instead of the motivation section here where we describe the motivation behind why we want to add self-healing support? Unless we are adding support for Cruise Control notifiers as part of this proposal as well?

Two questions here:

(a) Can we use the Cruise Control notifiers feature on its own in Strimzi today without adding self-healing support?

(b) I understand that we need Cruise Control's notifier feature for supporting the self-healing feature, but wouldn't the sentences about the notifier feature be more relevant in the proposal's implementations section instead of here?

Contributor

I agree with @kyguy . I think some sentences from this section can be in other more relevant section, rather than here in Motivation section.
Reading this sentence, it implies that currently, we can set these notifiers without this proposed solution to support self-healing. Basically what Kyle asked in the question.

Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc.).
However, self-healing is currently disabled and disallowed in Strimzi.
The `self-healing` properties were disabled since there was missing investigation around how self-healing would act if pods roll in the middle of it or a rebalance starts during it, and how the operator would know if self-healing is running or not.
Member

I would move a variation of this sentence up to the Current Situation section.

However, self-healing is currently disabled and disallowed in Strimzi.
The `self-healing` properties were disabled since there was missing investigation around how self-healing would act if pods roll in the middle of it or a rebalance starts during it, and how the operator would know if self-healing is running or not.
Any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.
It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
Member

Why is it useful to have these anomalies fixed automatically? It might be worth describing some of the anomalies the self-healing could fix automatically and why they are such a hassle to fix manually.

Member

@scholzj scholzj left a comment

I still don't understand the StrimziNotifier. It does not seem to be really useful. Why do you think anyone wants to use Kubernetes Events to be notified about something like this? But probably even more important - it seems to have absolutely no role in this proposal and I would suggest to lift it out into a separate proposal if you really think this makes sense for any user.

Apart from that, my take on this proposal is pretty similar to what it was in January. It seems like something that would work for auto-rebalancing. But:

  • It provides an approach completely inconsistent with the auto-rebalancing-on-scaling feature
  • It does not seem to have any considerations towards the future steps and integration.
    For example:
    • How to use the Cruise Control goal violations to feed auto-scaling
    • How to use the Cruise Control goal violations to tell the operator to fix a disk or broken node
  • It does not mention any way how to allow the auto-rebalancing only in certain time windows which is IMHO very important for a feature like this. Or does CC have some configuration option for this?

I think the rejected alternative 2 might be able to provide clear paths for all three points. But adopting this proposal might close the doors to it (without some complicated changes and blocking the options we allow etc.).


During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might instruct Cruise Control to move the partition replicas across the brokers in order to fix the violated goal by using the `KafkaRebalance` custom resource.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
Member

Isn't it a single KafkaRebalance regardless of the cluster size? You can for example have a CronJob to create the KafkaRebalance every night or every weekend and execute it, and while the rebalance might take a different time, the process itself is the same regardless of whether you have 5 or 50 nodes.

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
Detector classes have different mechanisms to detect their corresponding anomalies.
For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API.
Member

What is Kafka Metadata API? The Kafka Admin API is defined by the docs - https://kafka.apache.org/documentation/#adminapi - but there is nothing like Metadata API in the Kafka docs.

Furthermore, `MetricAnomalyDetector` and `GoalViolationDetector` use metrics to detect their anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the `KafkaRebalance` custom resource.
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
Member

Is this what you meant? Or what is some partitions are too large in disk?

Suggested change
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions consume too much disk space).


## Proposal

### Introduction
Member

TBH, none of the Introduction section is really a proposal of anything. You seem to be just describing how Cruise control works. It is useful, but I would not include it in the Proposal but add it before it as a separate section. That would make it easier to find what exactly is this proposing.

If self-healing is enabled, then an action is returned by the notifier which decides whether the anomaly should be fixed or not.
If the notifier has returned `FIX` as the action then the classes which are responsible for resolving the anomaly would be called.
Each detectable anomaly is handled by a specific detector class which then uses another remediation class to run a fix.
For example, the `GoalViolations` class uses the `RebalanceRunnable` class, the `DiskFailure` class uses the `RemoveDisksRunnable` class and so on.
Member

Can we have custom runnables as a reactions to different failures? E.g., is it technically possible to get Cruise Control to use StrimziFixDiskRunnable to address DiskFailure by deleting the PV/PVC and having it recreated by Kubernetes instead of doing some rebalance?

Contributor Author

I am not sure if we can do that. I just know we can create custom goals in CC but I don't know if we can write custom runnables. I can ask some CC folks if it's possible or not


In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specify a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
If the user forgets to set the notifier property then the `KubernetesEventNotifier` (a custom notifier under the Strimzi org) will be used by default and the user will also get a warning in the Kafka CR regarding the missing `anomaly.notifier.class` property.
We will support most anomaly fixes, however there are some fixes which we don't want self-healing to apply automatically in a Kubernetes context.
Member

Can you be specific and list what goals will be and what will not be supported instead of saying most?

Also, given what you say below and assuming it does not change -> please make it clear here that they will be just disabled by default while the users can enable them.

self.healing.disk.failure.enabled: false # disables self healing for disk failure
```

The user can still enable self-healing for disk failures and broker failures if they want, by setting `self.healing.broker.failure.enabled` (for broker failures) and `self.healing.disk.failure.enabled` (for disk failures) to `true`.
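As an illustrative sketch only (same `spec.cruiseControl.config` section as above; these flags are described in this proposal as disabled by default):

```yaml
cruiseControl:
  config:
    self.healing.enabled: true
    # opt back in to the Cruise Control fixes that move partition replicas
    # away from a failed broker or disk
    self.healing.broker.failure.enabled: true
    self.healing.disk.failure.enabled: true
```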
Member

Why? They make zero sense in the Kubernetes context.

I would also expect that if we let the users enable them now, it will be impossible or at least much harder to provide a proper solution for them later as it will need to deal with users who have it enabled.

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.
If, during the process of moving partitions replicas among the brokers in the Kafka cluster, some broker that was being used gets deleted or rolled, then Cruise Control would finish self-healing process and log that `self-healing finished successful with error`.
Member

What does finish mean in this context? It sounds like it will move a partition replica to a black hole left after a removed broker which would be pretty nasty. I assume it will stop the process with an error instead?

Contributor Author

So the process is stopped but I don't think there are any reverts, so it's more like an incomplete movement. The upstream CC logs still state finished successful with error. Maybe we should get this error message fixed upstream


* If self-healing is ongoing and a `KafkaRebalance` resource is created/applied in the middle of it then Cruise Control will still generate an optimization proposal.
The proposal should be based on the cluster shape at that particular timestamp.
* If self-healing is ongoing and the generated proposal is `approved` then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Member

What if a proposal that is generated before or during self-healing is approved after the self-healing is finished?

Contributor Author

That generated proposal will be based on old info but yeah I can try this case to know more


## Affected/not affected projects

This change will affect the Strimzi cluster operator and a new repository named `kubernetes-event-notifier` will be added under the Strimzi organisation.
Member

It should probably include the cruise-control somewhere in the name? Please also move the information about the new repository to the section about the notifier (unless you remove it as I suggest of course). You should also clarify there how it will be distributed (likely through Maven Central?).

Contributor Author

Okay, I will add it

With the help of Cruise Control's self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically.
The detection of anomalies is the responsibility of the anomaly detector mechanism.
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster.
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Contributor

Suggested change
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc.
Notifiers can be useful as they provide users visibility for the actions that are being performed by Cruise Control self-healing e.g. anomaly being detected, anomaly fix being started etc.


![self-healing flow diagram](./images/090-self-healing-flow.png)

The above flow diagram depicts the self-healing process.
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.
Contributor

Suggested change
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier.
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.

* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue whose comparator is based upon the priority value and the detection time.
The smaller the priority value and the earlier the detection time, the higher the priority the anomaly type has.
Contributor

Who decides the priority value? How does it correlate with detected time?

Contributor Author

@ShubhamRwt ShubhamRwt May 22, 2025

The priority is set in the KafkaAnomalyType class in Cruise Control.

  @JsonResponseField
  BROKER_FAILURE(0),
  @JsonResponseField
  MAINTENANCE_EVENT(1),
  @JsonResponseField
  DISK_FAILURE(2),
  @JsonResponseField
  METRIC_ANOMALY(3),
  @JsonResponseField
  GOAL_VIOLATION(4),
  @JsonResponseField
  TOPIC_ANOMALY(5);

where the anomaly with the lowest number has the highest priority. Regarding the detection time, the anomaly which is detected first would be resolved first


Whenever anomalies are detected, Cruise Control provides the ability to notify the user regarding the detected anomalies using optional notifier classes.
The notification sent by these classes increases the visibility of the operations that are taken by Cruise Control.
The notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property.
Contributor

is anomaly.notifier.class specified under spec.CruiseControl.config?

Contributor Author

Yes

The anomaly status changes based on how the healing is progressing.
The anomaly can transition to these possible states:
* DETECTED - The anomaly has just been detected and no action has been taken on it yet
* IGNORED - Self healing is either disabled or the anomaly is unfixable
Contributor

Also, the state will be IGNORED if self-healing is enabled but the notifier is set to the default NoopNotifier, right?

Contributor Author

yes, let me add it

These metrics can also provide a way for users to debug self-healing related issues.
For example, the `number_of_self_healing_started` metric gives users a way to check when self-healing was started for an anomaly, and how many self-healing actions were performed in the cluster.
This information can be useful to understand at what point in time self-healing occurred and the shape/replica assignment of the cluster was changed as a result.
The user can also use a metric such as `number_of_self_healing_failed` to find out if self-healing failed to start, and then check the logs for the reason behind the failure. One example could be that the load monitor was not ready at that point in time.
Contributor
Suggested change
The user can also use the metric like `number_of_self_healing_failed` to know if the self-healing failed to start to understand, and then they can check the reasons in the log for the failure. One example could be load monitor was ready at that point of time.
The user can also use the metric such as `number_of_self_healing_failed` to find out if the self-healing failed to start, so that they can check the reasons in the log for the failure. One example could be if load monitor was ready at that point of time.


### Proposal for enabling self-healing in Strimzi

In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specify a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
Contributor
Suggested change
In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specify a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.

If the user forgets to set the notifier property, then the `KubernetesEventNotifier` (a custom notifier under the Strimzi org) will be used by default, and the user will also get a warning in the Kafka CR about the missing `anomaly.notifier.class` property.
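
To make the role of this custom notifier more concrete, below is a minimal, hypothetical sketch of what a `KubernetesEventNotifier` could look like. It is not the proposed implementation: it assumes Cruise Control's `SelfHealingNotifier` base class and its `alert(...)` callback (exact package names and signatures depend on the Cruise Control version), and `emitKubernetesEvent` is a placeholder helper, not an existing API.

```java
// Assumed imports; exact package names depend on the Cruise Control version.
import com.linkedin.cruisecontrol.detector.Anomaly;
import com.linkedin.cruisecontrol.detector.AnomalyType;
import com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier;

/**
 * Sketch only: extends Cruise Control's SelfHealingNotifier so that every anomaly
 * alert also produces a Kubernetes Event, while the self-healing decision itself
 * is still delegated to the base class.
 */
public class KubernetesEventNotifier extends SelfHealingNotifier {

    @Override
    public void alert(Anomaly anomaly, boolean autoFixTriggered, long selfHealingStartTime, AnomalyType anomalyType) {
        // Keep the base behaviour (logging and self-healing bookkeeping).
        super.alert(anomaly, autoFixTriggered, selfHealingStartTime, anomalyType);
        // Placeholder for the actual Kubernetes interaction, e.g. creating an Event
        // in the namespace of the Kafka cluster via a Kubernetes client.
        emitKubernetesEvent(anomalyType.toString(), anomaly.toString(), autoFixTriggered);
    }

    // Hypothetical helper, shown only to indicate where the Event would be emitted.
    private void emitKubernetesEvent(String anomalyType, String anomalyDescription, boolean autoFixTriggered) {
        // ... build and create a Kubernetes Event resource here ...
    }
}
```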
Contributor
So to clarify, currently if self-healing is enabled with CC, it uses the NoopNotifier by default, which ignores anomalies. With this proposal, Strimzi will always set the notifier to KubernetesEventNotifier, which will issue the warning if the user has not specified a notifier class?

@ShubhamRwt
Contributor Author
ShubhamRwt commented May 29, 2025
Hi @scholzj, I tried to answer your questions below. I hope it makes things clearer in terms of why a notifier is necessary, as well as other future improvements to the feature.

Question:
I still don't understand the StrimziNotifier. It does not seem to be really useful. Why do you think anyone wants to use Kubernetes Events to be notified about something like this? But probably even more important - it seems to have absolutely no role in this proposal and I would suggest to lift it out into a separate proposal if you really think this makes sense for any user.

Having our own notifier helps us in various ways:

  1. If the user doesn’t use any Slack, MS Teams or Alerta accounts, then this will provide them a good cloud-native way to get events regarding the detected anomalies.

  2. For future integrations, like writing our own runnables for fixing the anomalies, we will need a notifier, since the notifier decides what action to take (what runnables to trigger) on anomalies. Having our own implementation makes it easier to add that functionality in the future.

The notifier plays an important role in the proposal, as it is important to notify the users about the detected anomalies, and also to handle cases where a user enables self-healing but forgets to set a notifier. If we don’t have any notifier set, then basically all the detected anomalies would be ignored, so having our own default notifier implementation gives users a better experience: they just need to enable self-healing and don’t have to worry about other properties. From a future perspective as well, to have a more operator-driven approach instead of CC doing everything, we can use our notifier to execute our own custom runnable classes.

Question:
It provides an approach completely inconsistent with the auto-rebalancing-on-scaling feature
It does not seem to have any considerations towards the future steps and integration.
For example:
How to use the Cruise Control goal violations to feed auto-scaling
How to use the Cruise Control goal violations to tell the operator to fix a disk or broken node

All the interactions that we currently do with Cruise Control are manually triggered, and the fixes are always driven by the operator. Taking the case of auto-rebalancing on scaling, it is something driven by the operator, since the operator knows that scaling (up/down) is happening within the cluster and therefore triggers the rebalance. With self-healing the approach is different: it is driven by Cruise Control and therefore differs from our previous integrations.

If we want to feed auto-scaling or other operator actions from CC, we can use our own notifier to annotate the Kafka CR (or use a ConfigMap or similar). We can even use our own runnable classes if we have the notifier. Using the stated approaches we can involve the operator in the process and drive the solution accordingly. The KubeEventNotifier for now only emits events regarding the detected anomaly, but we can later improve it by adding more features, e.g. driving the fix for certain anomalies based upon our own runnable classes.

Question:
It does not mention any way to allow the auto-rebalancing only in certain time windows which is IMHO very important for a feature like this. Or does CC have some configuration option for this?

There is no configuration in CC today which can help us with this. But if we use our own notifier, then we can make use of an annotation (on the Kafka CR) to pause self-healing (by no-oping all anomalies) for the duration of any maintenance window. This was actually part of an earlier version of this proposal, but was removed because we decided that users can disable self-healing if they know that some big manual rolls/rebalancing is going to happen in the cluster. They can re-enable the feature once the roll/rebalance is complete.

@scholzj
Copy link
Member

scholzj commented May 29, 2025

Having our own notifier helps us in various ways:

  1. If the user doesn’t use any Slack, MS Teams or Alerta accounts, then this will provide them a good cloud-native way to get events regarding the detected anomalies.

Just because someone does not use Slack or MS Teams doesn't mean that using Kubernetes Events is a good solution. So this is not a good reason to have it. There is also nothing cloud-native about it. You are just using Kubernetes Events for something they were not really designed for, and which is not a good alerting mechanism.

  2. For future integrations, like writing our own runnables for fixing the anomalies, we will need a notifier, since the notifier decides what action to take (what runnables to trigger) on anomalies. Having our own implementation makes it easier to add that functionality in the future.

Well, as I said in my comments, this is a reason not to have this right now. You do not know what the future integration will look like, so you have no idea how easy or hard it will be to avoid breaking compatibility. So it is one more reason to remove the notifier.

The notifier plays an important role in the proposal, as it is important to notify the users about the detected anomalies, and also to handle cases where a user enables self-healing but forgets to set a notifier. If we don’t have any notifier set, then basically all the detected anomalies would be ignored, so having our own default notifier implementation gives users a better experience: they just need to enable self-healing and don’t have to worry about other properties. From a future perspective as well, to have a more operator-driven approach instead of CC doing everything, we can use our notifier to execute our own custom runnable classes.

Sorry, but no. If the user forgets to set the Slack notifier, will they somehow miraculously remember to check Kubernetes events? Also, there is no relationship to the operator, so there is nothing operator-driven about this. So I do not understand this argument and why having some Kubernetes Events is important for the self-healing.

All the interactions that we currently do with Cruise Control are manually triggered, and the fixes are always driven by the operator. Taking the case of auto-rebalancing on scaling, it is something driven by the operator, since the operator knows that scaling (up/down) is happening within the cluster and therefore triggers the rebalance. With self-healing the approach is different: it is driven by Cruise Control and therefore differs from our previous integrations.

Auto-rebalancing is not manually triggered. It is driven fully by the operator. So it is not true that all interactions are manually triggered. The self-healing as proposed here is just a fancy name for auto-rebalancing triggered by some metric inside Cruise Control. So I would expect it to be easy to have it driven by an alternative implementation, driven through the operator similarly to how the auto-rebalancing at scaling is. In the end, just a few lines above you talk about driving things through the operator using the Strimzi notifier in the future as well. So it clearly can be passed through some notifier to Strimzi to have Strimzi trigger the auto-rebalance.

If we want to feed auto-scaling or other operator actions from CC, we can use our own notifier to annotate the Kafka CR (or use a ConfigMap or similar). We can even use our own runnable classes if we have the notifier. Using the stated approaches we can involve the operator in the process and drive the solution accordingly. The KubeEventNotifier for now only emits events regarding the detected anomaly, but we can later improve it by adding more features, e.g. driving the fix for certain anomalies based upon our own runnable classes.

Or you can first understand how to do the integration before dealing with some events with dubious value to make sure it does not cause problems in the future.

There is no configuration in CC today which can help us with this. But if we use our own notifier, then we can make use of an annotation (on the Kafka CR) to pause self-healing (by no-oping all anomalies) for the duration of any maintenance window. This was actually part of an earlier version of this proposal, but was removed because we decided that users can disable self-healing if they know that some big manual rolls/rebalancing is going to happen in the cluster. They can re-enable the feature once the roll/rebalance is complete.

We already have maintenance time windows in Strimzi. That is what needs to be plugged in. Not some pause self-healing annotation (although there might be other use-cases for such an annotation).
