Proposal for adding Self healing feature in the operator #145
Signed-off-by: ShubhamRwt <[email protected]>
Thanks for the proposal.
I think it would be useful to use a regular human language here and not the Cruise Control language as that would bring better perspective on what exactly is being proposed here and what is not being proposed here (and what actually the self-healing does and does not do).
I do not understand the role of the `StrimziNotifier`. While I doubt that anyone will find it appealing to have Kubernetes events used for the notification instead of Slack or Prometheus, I cannot of course rule out that this will find its users. But it seems to be only marginally related to Strimzi. I'm sure we can host this under the Strimzi GitHub organization if desired, but it seems more like an unrelated subproject that is separate from the self-healing feature itself.
The actual self-healing ... you focus a lot on the disk and broker failures. But it seems to me like the self-healing does not offer good solutions for this. The idea to solve this by rescheduling the partition replicas to other brokers seems to be well suited for Kafka on bare metal with local storage, where a dead server often means that you have to get a brand new server to replace it. It does not seem to be well suited to self-healing Kubernetes infrastructure and cloud-based environments, where Pods can be just rescheduled and a new disk can be provisioned in a matter of seconds.
So it seems to me like the self-healing itself as done by Cruise Control could be useful for dealing with imbalanced brokers. But for situations such as broker or disk failures or insufficient capacity, we should be more ambitious and aim to solve them properly in the Strimzi Cluster Operator. That is where an actual `StrimziNotifier` might play a future role: not to issue Kubernetes Events to tell the user about a broken disk, but to provide a mechanism to tell Strimzi about it so that Strimzi can react to it - e.g. by scaling for capacity based on a recommendation from CC or, as suggested, deleting the broken PVC and getting a new one provisioned to replace it. While I understand that this might be out of scope for this proposal, I think it is important to leave space for it. For example, by focusing here on the auto-rebalancing features and leaving the other detected issues unused for future development.
## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
This is not some bare-metal cluster but Kubernetes. So I'm struggling to find the issues where you would need to do what you suggest. Instead, on Kubernetes, you would normally just fix the issue on the existing broker.
I can see your point. At the Kubernetes level, maybe a broker failure is more related to a lack of resources, which you should fix at the cluster level rather than by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issues on your own.
I think you should clarify what `issues` you are talking about here. From the past discussion, the terminology used here seems highly confusing ... and I suspect that words such as `issues`, `anomalies`, and `problems` are ambiguous at best.
For me they are all synonyms more or less. Maybe just use one and stick with it?

## Motivation

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Nothing in this proposal suggests that Cruise Control can solve disk or broker failures.
AFAICS, the proposal introduces the concept of broker/disk failure detectors and related fix classes, which highlight that CC can solve them. Or what are you looking for?

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>).
The link seems broken? (Double `(` and `)`?)
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Not sure what exactly it is trying to say ... but it does not seem to make sense to me.
I'm not sure if we need this level of details. Maybe just link to the doc of these detectors? If not, maybe reword the sentence to something like this:
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` uses metrics while other detector such as `GoalViolationDetector` and `TopicAnomalyDetector` use a different mechanism to detect anomalies.
Maybe we should mention briefly how `GoalViolationDetector` and `TopicAnomalyDetector` detect anomalies like these others too.
Not sure what exactly it is trying to say ... but it does not seem to make sense to me.
What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?
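To make the periodic detection concrete, here is a minimal sketch, assuming the standard Cruise Control option named above is passed through to Cruise Control via `spec.cruiseControl.config` as the proposal describes:

```yaml
# Minimal sketch: tuning how often the anomaly detector classes run.
spec:
  cruiseControl:
    config:
      anomaly.detection.interval.ms: 300000   # run the detectors every 5 minutes
```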

The `StrimziNotifier` class is going to reside in a separate module in the operator repository called `strimzi-cruise-control-notifier` so that a separate jar for the notifier can be built for inclusion in the Cruise Control container image.
We will also need to include the fabric8 client dependencies in the notifier JAR that will be used for generating the Kubernetes events as well as for reading the annotation on the Kafka resource from the notifier.
Reading what annotation?
Yeah, I think you are anticipating what's going to be in the next paragraph but this sentence doesn't make any sense here.
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issues with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
Self-healing is not paused by default, so a missing annotation would mean that detected anomalies will keep getting healed automatically.
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them.
The anomalies that were already getting fixed would continue with the fix.
Why is pausing the self-healing useful for the users? Assuming it is useful, why not just disable the self-healing?
Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod, for example).
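As an illustration, a minimal sketch of how the annotation proposed above might look on the `Kafka` custom resource; the cluster name and the `"true"` value are assumptions, since the proposal does not fix the annotation's value format:

```yaml
# Hypothetical example: pausing self-healing via the proposed annotation.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster                             # placeholder cluster name
  annotations:
    strimzi.io/self-healing-paused: "true"     # absent annotation means self-healing stays active
spec:
  # ... rest of the Kafka spec unchanged
```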

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
What exactly does it mean when the broker `fails`? How does CC recognize it? How does it distinguish failure, for example, from some temporary Kubernetes scheduling issue?
The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away but is first delayed in order to double-check "later" whether the broker is still offline or not. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and will not have a fix attempted.
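A minimal sketch of the kind of tuning being described, assuming the standard Cruise Control `SelfHealingNotifier` threshold options are exposed through `spec.cruiseControl.config` (the values shown are only illustrative):

```yaml
# Sketch: delaying the broker-failure fix so transient outages are not "healed".
spec:
  cruiseControl:
    config:
      broker.failure.alert.threshold.ms: 900000          # alert only if the broker is still down after 15 minutes
      broker.failure.self.healing.threshold.ms: 1800000  # start moving replicas only after 30 minutes
```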

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
How would it finish the transfer from the failed broker when the target broker is stopped? You mean that it will continue with it after the target broker is rolled?
Yes, if the target broker is rolling at that moment, then self-healing would end saying `self-healing finished successful with error`. Then the periodic anomaly check will again detect the failed broker and try to fix it. If the target broker has been rolled by that time, then self-healing will start again; otherwise, if the rolled broker is not back up, then we will get two anomalies to fix: the failed broker and the target broker.

#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Does this need to differentiate between asking for a proposal and approving a proposal? For example, when asking for a proposal, I would expect the operator to wait and give me a proposal after the self-healing ends.
From the following snippet I can see the `dryrun=false`, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you and you were able to approve it and then got the error?
@ShubhamRwt any answer here?
Yes, in the middle of self-healing I was able to generate the proposal and when I approved it I got the error
I pushed a section around this in the proposal

#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
The anomaly detector manager helps in detecting the anomalies as well as handling them.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
So you mean the goals in `KafkaRebalance` and the `self.healing.goals` are independent of each other?
If you have specified hard goals in the KR, then the `self.healing.goals` should be a superset of the `hard goals`.
if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.
If by KR you mean the `KafkaRebalance` custom resource, I don't think what you are saying is true. There is no relationship between a specific `KafkaRebalance` and the self-healing configuration.
Maybe you are referring to the general CC configuration? So the default hard goals in CC have to be a superset of `self.healing.goals` (and not the other way around)?
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Anomaly detection also has various [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can affect the performance of the Cruise Control server and the latency of the anomaly detection.
Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like Slack Notifier, Alerta Notifier etc. for notifying users regarding the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to make notifiers more efficient as per the use case.

#### Self Healing
If self-healing is enabled in Cruise Control and anomalies are detected, then a fix for these will be attempted automatically.
A fix will be attempted automatically only if `anomaly.notifier.class` is configured to a non-default notifier, right? So enabling self-healing alone is not enough, as with the default notifier (`NoopNotifier`), anomalies are detected but ignored?
Yes, you will need to override the `notifier` too.
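For illustration, a minimal sketch of what that override could look like; `self.healing.enabled` and `anomaly.notifier.class` are standard Cruise Control options, while the fully qualified class name below is only a placeholder for the proposed notifier:

```yaml
# Sketch: enabling self-healing and pointing Cruise Control at a non-default notifier.
spec:
  cruiseControl:
    config:
      self.healing.enabled: true
      # Placeholder class name for the proposed StrimziNotifier.
      anomaly.notifier.class: io.strimzi.kafka.cruisecontrol.notifier.StrimziNotifier
```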
The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible status:
* DETECTED - The anomaly is just detected and no action is yet taken on it
* IGNORED - Self healing is either disabled or the anomaly is unfixable
* FIX_STARTED - The anomaly has started getting fixed
* FIX_STARTED - The anomaly has started getting fixed
* FIX_STARTED - The anomaly fix has started
* FIX_STARTED - The anomaly has started getting fixed
* FIX_FAILED_TO_START - The proposal generation for the anomaly fix failed
* CHECK_WITH_DELAY - The anomaly fix is delayed
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
* LOAD_MONITOR_NOT_READY - Monitor for workload of a Kafka cluster is in loading or bootstrap state
```
self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
Are there any anomalies enabled by default? If the user does not specify anomalies to detect, will self-healing do nothing?
If the user doesn't specify any specific anomalies (i.e. you used just `self.healing.enabled: true`) then self-healing would be detecting all the anomalies. In case the user knows that some anomaly will keep appearing because they have known issues in their cluster, they can skip the check for that particular anomaly.
If the user doesn't specify any specific anomalies(i.e you used just self.healing.enabled: true) then self healing would be detecting all the anomalies.
I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster; having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.
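A minimal sketch of the kind of opt-in configuration being discussed, assuming (as in Cruise Control) that the per-anomaly `self.healing.*.enabled` options override the global `self.healing.enabled` default:

```yaml
# Sketch: enabling self-healing only for selected anomaly types.
spec:
  cruiseControl:
    config:
      self.healing.enabled: false                  # keep the blanket switch off
      self.healing.goal.violation.enabled: true    # heal goal violations automatically
      self.healing.broker.failure.enabled: false   # leave broker failures to the operator/user
```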
The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Does the delay action apply a configured delay before taking the action provided by the notifier?
Yes. Even if the delay is mostly used only for broker failures and not for other anomalies, AFAIK.
I left some comments. Can you go through the proposal and split paragraphs with one sentence per line? Thanks!
## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
I can see your point. At the Kubernetes level, maybe a broker failure is more related to a lack of resources, which you should fix at the cluster level rather than by removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issues on your own.
For me they are all synonyms more or less. Maybe just use one and stick with it?

## Motivation

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
AFAICS, the proposal introduces the concept of broker/disk failure detectors and related fix classes, which highlight that CC can solve them. Or what are you looking for?

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>).
This is the first time you mention the term "notifier" but it shows up out of context. You should first explain what it is and why it's useful in general, not related to the self-healing but more to the alerting.
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Not sure what exactly it is trying to say ... but it does not seem to make sense to me.
What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using the admin client API, or a deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge make it clearer?
### Pausing the self-healing feature

The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issues with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
For this, we will be having an annotation called `strimzi.io/self-healing-paused` on the `Kafka` custom resource.
The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issues with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
We can set this annotation on the Kafka custom resource to pause the self-healing feature.
Self-healing is not paused by default, so a missing annotation would mean that detected anomalies will keep getting healed automatically.
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them.
The anomalies that were already getting fixed would continue with the fix.
Disabling self-healing would mean restarting CC. Pausing means that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod, for example).
```
uid: 71928731-df71-4b9f-ac77-68961ac25a2f
```

We will have checks on methods like `onGoalViolation`, `onBrokerFailure`, `onTopicAnomaly` etc. in the `StrimziNotifier` to check if the `pause` annotation is applied on the Kafka custom resource or not.
There is no `pause` annotation. Use the exact name.

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
The broker anomaly detector uses a deep metadata request to the Kafka cluster to know the brokers' status. The broker failure in particular is not fixed straight away but is first delayed in order to double-check "later" whether the broker is still offline or not. All the timings between double checks are configurable. So any temporary issue will not be reported as an anomaly and will not have a fix attempted.

#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
From the following snippet I can see the `dryrun=false`, so this seems to be a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return a proposal to you and you were able to approve it and then got the error?

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) from those used for manual rebalancing.
The nuances between the goal configurations of Cruise Control can be quite confusing; to help keep them straight in my head I categorize them in the following way:
- The server-side Cruise Control goal configs in `spec.cruiseControl.config` of the `Kafka` resource
- The client-side Cruise Control goal config in `spec` of the `KafkaRebalance` resource
To avoid confusion, it may be worth adding some text that explains where and how `self.healing.goals` compares/relates to the other server-side Cruise Control goal configs - `goals`, `default.goals`, `hard.goals`, `anomaly.detection.goals`. It wouldn't have to be right here, but somewhere in the proposal where the `self.healing` configuration is raised. When we add the self-healing feature we will need to add some documentation of how `self.healing.goals` relates to the other goal configurations anyway. We could use the same language used to describe the different goal configs in the Strimzi documentation and provide a link to those docs as well.
I think the language you and Paolo use in this thread referring to other goal configs like "superset" or "subset" is helpful. Since the goal configs are so complicated, I don't think it is a problem if we reuse text from the Strimzi Cruise Control goal documentation to emphasize the distinction of the `self.healing.goals` config.
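To make the relationship between these goal configs concrete, here is a minimal sketch; the goal lists are illustrative only, and whether `self.healing.goals` must relate to `hard.goals` as a superset is exactly the open question in this thread:

```yaml
# Sketch: the server-side Cruise Control goal configs discussed above, shown side by side.
spec:
  cruiseControl:
    config:
      # Goals used by default when generating proposals.
      default.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
      # Goals that every generated proposal must satisfy.
      hard.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal
      # Goals whose violation raises a goal-violation anomaly.
      anomaly.detection.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
      # Goals used when self-healing generates its fix (its relationship to hard.goals is what is being debated above).
      self.healing.goals: com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal
```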
* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue and the anomaly with the highest priority is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.

The detected anomalies are inserted into a priority queue and the anomaly with the highest priority is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

The detected anomalies are inserted into a priority queue and the anomaly with the highest priority is picked first to resolve.
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

### Enabling self-healing in the Kafka custom resource

When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
When the `UseSelfHealing` feature gate is enabled, users will be able to activate self-healing by configuring the [`self.healing` Cruise Control configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) in the `spec.cruiseControl.config` of their `Kafka` resource.
When the `UseSelfHealing` feature gate is disabled, users will not be able to activate self healing and all `self.healing` Cruise Control configurations will be ignored.
```
self.healing.enabled: true
```

Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
If the user doesn't specify any specific anomalies(i.e you used just self.healing.enabled: true) then self healing would be detecting all the anomalies.
I wonder if it would be worth having all anomaly detection disabled by default and forcing users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster; having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.

Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
The `SelfHealingNotifier` class can then be further used to implement your own notifier
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
The other custom notifiers offered by Cruise control are also based upon the `SelfHealingNotifier`.

### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.
Signed-off-by: ShubhamRwt <[email protected]>

#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
If some anomaly is getting healed, Cruise Control will then try to fix/mitigate the anomaly by moving the partition replicas across the brokers in the Kafka cluster.
When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section) to those used for manual rebalancing (goals configured in the KafkaRebalance custom resource).
I think this part confuses me slightly; maybe it can be reworded better.
Is the below what you mean?
- Goal Violation - This happens if certain optimization goals such as `DiskUsageDistributionGoal` are violated. The optimization goals can be configured independently from the `KafkaRebalance` custom resource by using `spec.cruiseControl.config.self.healing.goals` in the `Kafka` custom resource.
When you say certain optimization goals, are you saying only some would trigger the Goal Violation anomaly?
Yes, AFAIK it would only be those goals specifically listed in the self healing goals configuration.
I tried to make the sentence less confusing, I hope it's better now.
```
statusUpdateDate=2024-12-09T09:49:47Z,
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
```
where the `anomalyID` represents the task id assigned to the anomaly, `failedDisksByTimeMs` represents the disk which failed, and `detectionDate` and `statusUpdateDate` represent the date and time when the anomaly was detected and when the status of the anomaly was updated.
I don't see the mentioned `anomalyType` field in the snippet. Is this a full snippet of an anomaly?
I guess it's not that important, since this is just an example, but should we explain the value of `failedDisksByTimeMs`? `{2={/var/lib/kafka/data-1/kafka-log2=1733737786603}}` - what do these values mean, and how do they represent the failed disk? If you think this is not important, please feel free to ignore this part.
I have added some explanation around it
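For reference, a sketch of how the `failedDisksByTimeMs` value shown above could be read, assuming it maps a broker ID to its failed log directories and the detection timestamp in epoch milliseconds:

```yaml
# Hypothetical, expanded reading of: failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}}
failedDisksByTimeMs:
  2:                                                   # broker ID that reported the failed disk
    /var/lib/kafka/data-1/kafka-log2: 1733737786603    # failed log directory -> detection time (epoch ms)
```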
* COMPLETENESS_NOT_READY - Completeness not met for some goals. The completeness describes the confidence level of the data in the metric sample aggregator. If this is low then some Goal classes may refuse to produce proposals.

We can also poll the `state` endpoint, with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
We can again poll the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control), with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
Will we provide details on how users might poll this endpoint?
By this you mean mentioning the complete curl command?
I have added the curl command to poll the endpoint
Signed-off-by: ShubhamRwt <[email protected]>
@ShubhamRwt I had another pass but I think the next one should be with a proposal around metrics as another option instead of using the Kubernetes events notifier.
## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource in the [`rebalance`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
There is no `rebalance` mode. Did you mean `full`? Anyway, I would not specify it, because it's possible you could even decide to scale up the cluster when doing the rebalancing, so `add-brokers` would be needed.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` and `GoalViolationDetector` uses metrics to detect the anomalies.
The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the KafkaRebalance custom resource.
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the KafkaRebalance custom resource.
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the `KafkaRebalance` custom resource.
The notifier class returns the `action` that is going to be taken on the notified anomaly.
These actions can be classified into three types:
* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Yes. Even if the delay is mostly used only for broker failures and not for other anomalies, AFAIK.
* CHECK - Delay the anomaly fix | ||
* IGNORE - Ignore the anomaly fix | ||
|
||
The default `NoopNotifer` always sets the notifier action as `IGNORE` means that the detected anomaly will be silently ignored. |
I would also add that it doesn't send any notification to the user in any way.
type: Normal | ||
``` | ||
|
||
The users can build a jar for the notifier for inclusion in the Kafka image to run on as part of Cruise Control. |
The KubernetesEventsNotifier
should be included out of the box within the Kafka Strimzi image in order to be configured by the operator. It's not clear if you are referring to a custom notifier that a user would write.
|
||
### Strimzi Operations while self-healing is running/fixing an anomaly | ||
|
||
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster. |
Not sure what's the purpose of this single sentence paragraph.
If the deleted or rolled broker comes back before the periodic anomaly check interval, then self-healing would start again only for the earlier failed broker with a new anomalyId since it was not fixed properly. | ||
In case the deleted broker doesn't come back healthy within the periodic anomaly check interval then another new anomaly for the deleted broker will be created and the earlier detected failed broker would also persist but will be updated with a new anomalyId. | ||
|
||
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is approved |
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is approved | |
#### What happens when self-healing is running/fixing an anomaly and KafkaRebalance is approved |
|
||
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied | ||
|
||
If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing. |
@ShubhamRwt any answer here?
|
||
## Affected/not affected projects | ||
|
||
A new repository named `kubernetes-event-notifier` will be added under the Strimzi organisation. |
There is also the cluster operator to be affected because we need to allow enabling self-healing, or?
observedGeneration: 1 | ||
``` | ||
|
||
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal |
@ppatierno this section talks about what happens if we request rebalance in middle of self healing
Signed-off-by: ShubhamRwt <[email protected]>
I had another pass. Left some comments and made some cosmetic changes (it seems you don't like spaces and new lines :-P).
|
||
The above flow diagram depicts the self-healing process. | ||
The Anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier. | ||
The notifier then alerts the user about detected anomaly through the logs and also returns what action needs to be taken on the anomaly i.e. whether to fix it, ignore it or delay it. |
Actually it depends on the notifier, the anomaly is not just logged. You can have Slack, MS Teams or anything else, with a dedicated notifier, to get a notification about the anomalies.
Each detected anomaly is implemented by a specific class which leverages a corresponding other class to run a fix. | ||
For example, the `GoalViolations` class uses the `RebalanceRunnable` class, the `DiskFailure` class use the `RemoveDisksRunnable` class and so on. | ||
An optimization proposal will then be generated by these `Runnable` classes and that proposal will be applied on the cluster to fix the anomaly. | ||
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning. |
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning. | |
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware (insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the Cruise Control logs will be updated with `self-healing is not possible due to unfixable goals` warning. |
An optimization proposal will then be generated by these `Runnable` classes and that proposal will be applied on the cluster to fix the anomaly. | ||
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning. | ||
When self-healing starts, you can check the anomaly status by querying the [`state` endpoint](https://github.com/linkedin/cruise-control/wiki/REST-APIs#query-the-state-of-cruise-control) with `substate` parameter set as `anomaly_detector`. This would provide the anomaly details like anomalyId, the type of anomaly and the status of the anomaly etc. | ||
For example, the user can query the endpoint using the `curl` command: |
For example, the user can query the endpoint using the `curl` command: | |
For example, the user can query the endpoint using the `curl` command: | |
```shell | ||
curl -vv -X GET "http://localhost:9090/kafkacruisecontrol/state?substates=anomaly_detector" | ||
``` | ||
the detected anomaly would look something like this: |
the detected anomaly would look something like this: | |
the detected anomaly would look something like this: | |
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}}, | ||
status=FIX_STARTED}]} | ||
``` | ||
where, |
where, | |
where: |
### Self-healing configuration in the Strimzi operator | ||
|
||
Users can activate self-healing by setting the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource. | ||
If the user don't set the notifier property then `NoopNotifer` will be used by default which means all the action on the anomalies would be set to `IGNORE`. |
I would make it clear that `self.healing.enabled` is not enough by itself then, because you also need to set a proper notifier (instead of the default `NoopNotifier`).
If the user don't set the notifier property then `NoopNotifer` will be used by default which means all the action on the anomalies would be set to `IGNORE`. | |
However, if the user does not set the notifier class property then the `NoopNotifer` will be used by default. | |
This means all the action on the detected anomalies will be set to `IGNORE` and self-healing is effectively disabled even if the `self.healing.enabled` property is set to `true`. |
So given the two sentences above, what are you going to do about it? If a user sets self healing to true and not the notifier, what happens? Should there be a warning? Should the Kafka CR show an error condition, should it be set to NotReady
(as that is clearly a mis-configuration)?
As I have said before, in my opinion we should prevent this situation happening by just setting up the notifier for the user (if they don't specify one themselves).
If the user doesn't specify a notifier we should think about what makes sense in that situation. Would an auto-approve notifier make sense?
But wouldn't that mean that we will then have to create another new notifier? Isn't just specifying it enough, or maybe a warning in the Kafka CR is enough?
I guess this discussion is clashing with the other one where we are thinking of setting the Kube events notifier as default, or?
Yes, for now I have updated the proposal saying that if no notifier is specified then the Kube events notifier will be selected as default, but I wonder if we should do that? Shouldn't we just create a warning/error in the Kafka CR saying that specifying the notifier is a must, or maybe set the `SelfHealingNotifier` as default so that self-healing still works, and also mention it in the Kafka CR?
The `SelfHealingNotifier` has an implementation of the `alert` method which just logs in the CC pod, which is of course not so Kube friendly, or at least it means a user should take a look at the log. I think in our Kube env, it would be better having something more "cloud-native friendly".
Self-healing corresponding to broker failure or disk failure is disabled by default since we have better ways to fix these anomalies instead of moving the partition replicas across the brokers. | ||
For example, brokers failures are generally due to lack of resources, so we can fix the issue at cluster level and in the same way failed disks can be easily replaced with a new one, so self-healing in these cases can be more resource consuming. | ||
|
||
Here is an example of how the configured `Kafka` custom resource could look: |
Here is an example of how the configured `Kafka` custom resource could look: | |
Here is an example of how the configured `Kafka` custom resource could look: | |
Here is an example of how the configured `Kafka` custom resource could look: | |
Here is an example of what the configured `Kafka` custom resource could look: | |
These metrics can also provide a way to the users to debug self-healing related issues. | ||
For example, the `number_of_self_healing_started` metric can provides users, the way to check when self-healing was started for an anomaly, and the number of self-healing actions that were performed in the cluster. | ||
This information can be useful to at what point of time the self-healing occured and the shape/replica assignment of the cluster was changed due to self-healing. | ||
The user can also use the metric like `number_of_self_healing_failed` to know if the self-healing failed to start to understand, and then they can check the reasons in the log for the failure. One example could be load monitor was ready at that point of time. |
AFAIU all the above metrics are historical ones but they don't give any information about what's going on now.
I made this comment on the CC related PR as well, so I was wondering if we need to expose metrics about self-healing actually running.
I don't see any existing metrics like that yet in CC. There are metrics which can tell that there is inter/intra broker replica movement, but I am not sure how we can expose a metric to say when self-healing is running. After talking with Tom, there is a way we can know if self-healing is still running: we can take the sum(self healing started) minus sum(self healing ended), which can tell if self-healing is still in progress.
But then it's something we should explain in the proposal, and maybe provide a Grafana dashboard with such information?
Do we have, at least, a metric about when the last fix started?
If you are asking about the existing metrics then you can still use the number_of_self_healing_started
metric and use the promql query to get the timestamp for the last change in the gauge -> last_over_time(timestamp(changes(kafka_cruisecontrol_anomalydetector_num_of_self_healing_started{}[<time-range>]) > 0
If we want it based on the anomaly, we can do the same with the new metrics num_of_self_healing_started_for_<anomaly-type>_count
Ok so I think we should list this info in the proposal.
Sure
Currently we just have a single metric called `number_of_self_healing_started` to know that self-healing was started for fixing an anomaly (this metric doesn't classify anomalies by type), and there is no metric regarding when self-healing ended. The PR opened on CC upstream introduces new metrics, `num-of-self-healing-started-for-<anomalyType>` and `num-of-self-healing-ended-for-<anomalyType>`, which are both classified based on the anomaly types. But at the same time it doesn't add any metric like `number_of_self_healing_ended` which could provide an overall count of how many self-healing runs ended.
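To make the started-minus-ended idea from this thread concrete, here is a minimal PrometheusRule sketch. The `_started` metric name follows the PromQL snippet quoted above; the `_ended` and `_failed` names are assumptions and would have to match whatever the upstream Cruise Control change finally exposes.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cruise-control-self-healing
  labels:
    role: alert-rules
spec:
  groups:
    - name: cruise-control-self-healing.rules
      rules:
        # Self-healing is considered "in progress" while more fixes have
        # started than have ended (the "ended" metric name is an assumption).
        - alert: CruiseControlSelfHealingInProgress
          expr: >
            sum(kafka_cruisecontrol_anomalydetector_num_of_self_healing_started)
            - sum(kafka_cruisecontrol_anomalydetector_num_of_self_healing_ended) > 0
          for: 5m
          labels:
            severity: info
          annotations:
            summary: "Cruise Control self-healing appears to be in progress"
        # Flag recent failures to start a fix (assumed metric name, mirroring
        # the number_of_self_healing_failed metric discussed above).
        - alert: CruiseControlSelfHealingFailed
          expr: increase(kafka_cruisecontrol_anomalydetector_num_of_self_healing_failed[15m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "A Cruise Control self-healing action failed to start; check the Cruise Control logs"
```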
#### What happens when self-healing is running/fixing an anomaly and KafkaRebalance is approved | ||
|
||
If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing. | ||
For example: |
For example: | |
For example: | |
|
||
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal | ||
|
||
If self-healing is ongoing and a `KafkaRebalance` resource is applied in the middle of it then Cruise Control will still generate an optimization proposal. The proposal should be based on the cluster shape at that particular timestamp. |
How does this sentence cope with the previous one where you say the KafkaRebalance
ends in a NotReady
state. I am confused. Will it be not ready or will it have a proposal?
Yeah, I am also a bit confused. IMO this section should be merged with the one above. You need to cover all the possible situations when self healing is running and describe what happens in each case:
- Brand new Kafka Rebalance created
- Refresh an existing rebalance
- Approve a rebalance
You also need to cover what happens if a manual rebalance is ongoing and self healing is enabled. Does the manual rebalance get cancelled, does the anomaly get ignored or delayed?
I had another pass. My main comments are:
- You need to make sure the structure of doc is logical. Start with a description of the problem, the background including an overview of how CC anomaly detection and self healing works, what monitoring and metrics are available currently and then move on to state clearly what you are proposing to change. At the moment it is hard to tell what is just background context and what are concrete proposed changes.
- IMO we need to make it simple to enable/disable self-healing. Users should set `self.healing.enabled` to true and if they don't set the notifier we should set it for them.
- The kube events notifier should then be the default (we know the user has K8s for sure). But that leads on to how it will decide to approve/ignore/delay fixes, you don't discuss that at all?
- You need to cover all the ways that manual rebalances and self-healing can interact.
## Current situation | ||
|
||
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy. | ||
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource. |
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might move the partition replicas across the brokers to fix the violated goal by using the `KafkaRebalance` custom resource. | |
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might instruct Cruise Control to move the partition replicas across the brokers in order to fix the violated goal by using the `KafkaRebalance` custom resource. |
|
||
## Motivation | ||
|
||
With the help of the self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically. |
With the help of the self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically. | |
With the help of Cruise Control's self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically. |
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster. | ||
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc. | ||
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc). | ||
However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource. |
However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource. | |
However, self-healing is currently disabled and disallowed in Strimzi. | |
Any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource. |
We might want to elaborate on why we disallowed it originally and why we no longer think that it is warranted any more.
 | ||
|
||
The above flow diagram depicts the self-healing process. | ||
The Anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier. |
The Anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier. | |
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier. |
#### Anomaly Detector Manager | ||
|
||
The anomaly detector manager helps in detecting the anomalies as well as handling them. | ||
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies. |
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies. | |
It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies. |
Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected. | ||
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface. | ||
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc. | ||
The `KubernetesEventsNotifier` is based of the `SelfHealingNotifier` class. |
The `KubernetesEventsNotifier` is based of the `SelfHealingNotifier` class. | |
The `KubernetesEventsNotifier` will be based of the `SelfHealingNotifier` class. |
For effective alerts and monitoring, it makes use of Kubernetes events to signal to users what operations Cruise Control is performing automatically. | ||
The `KubernetesEventsNotifier` will override the `alert()` method of the `SelfHealingNotifier` and will trigger and publish events whenever an anomaly is detected. | ||
These events would publish information regarding the anomaly like the anomalyID, the type of anomaly which was detected etc. | ||
The events would help the user to understand the anomalies that are present, when the anomalies appeared in their cluster and also would help them know what is going on in their clusters. |
So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?
Could be enabled by default if a user sets self healing enabled to true but doesn't specify a notifier class?
So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?
I think the answer to this is still in my comment here #145 (comment) to explain how the base class for notifier works unless we want to write something from scratch (but I would argue about this).
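For illustration only, a condensed sketch of the kind of core `v1` Event such a notifier might publish; every name and value below is hypothetical, and the authoritative shape is the event example already included in the proposal.

```yaml
apiVersion: v1
kind: Event
metadata:
  name: my-cluster-cruise-control.goal-violation.a1b2c3   # hypothetical event name
  namespace: kafka
type: Normal
reason: GoalViolation
# The message carries the anomaly details (anomalyId and type are made up here)
message: "Self-healing FIX_STARTED for anomaly 7e263e88 (GOAL_VIOLATION)"
involvedObject:
  kind: Pod
  name: my-cluster-cruise-control-0   # hypothetical Cruise Control pod
  namespace: kafka
source:
  component: cruise-control
```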
When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster. | ||
If, during the process of moving partitions replicas among the brokers in the Kafka cluster, some broker that was being used gets deleted or rolled, then Cruise Control would finish self-healing process and log that `self-healing finished successful with error`. | ||
If the deleted or rolled broker comes back before the periodic anomaly check interval, then self-healing would start again only for the earlier failed broker with a new anomalyId since it was not fixed properly. | ||
In case the deleted broker doesn't come back healthy within the periodic anomaly check interval then another new anomaly for the deleted broker will be created and the earlier detected failed broker would also persist but will be updated with a new anomalyId. |
But didn't you say above that broker failures are disabled?
|
||
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal | ||
|
||
If self-healing is ongoing and a `KafkaRebalance` resource is applied in the middle of it then Cruise Control will still generate an optimization proposal. The proposal should be based on the cluster shape at that particular timestamp. |
Yeah, I am also a bit confused. IMO this section should be merged with the one above. You need to cover all the possible situations when self healing is running and describe what happens in each case:
- Brand new Kafka Rebalance created
- Refresh an existing rebalance
- Approve a rebalance
|
||
#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied to ask for a proposal | ||
|
||
If self-healing is ongoing and a `KafkaRebalance` resource is applied in the middle of it then Cruise Control will still generate an optimization proposal. The proposal should be based on the cluster shape at that particular timestamp. |
You also need to cover what happens if a manual rebalance is ongoing and self healing is enabled. Does the manual rebalance get cancelled, does the anomaly get ignored or delayed?
I agree with this.
@tomncooper All the notifiers provided by Cruise Control are based on the |
Signed-off-by: ShubhamRwt <[email protected]>
cdb9225
to
73b7d83
Compare
We would like to move forward with this. I encourage @strimzi/maintainers who already had a pass to review again, and the others to provide comments if they have any. Also @tinaselenge @kyguy I saw you left comments, could you have another pass please? Thanks everyone!
@ppatierno @ShubhamRwt I think that wires have been crossed somewhere. It was my impression, stated multiple times, that if the user specifies Is that true? I have asked to confirm this several times, as it seems like it shouldn't be true? Does CC actually switch to the |
It's true. Let me try to explain this based on the code (then @ShubhamRwt can confirm based on his testing). Here [1] you can see where the anomaly detector manager is going to load the notifier class, and it's defaulting to the NOOP notifier if the field is not set. So if you look at the NOOP notifier or any other notifier implementation, the logic about reading the self-healing flag to ignore or fix is there (in the notifier). Also, the NOOP notifier is configured by setting all the self-healing flag(s) to false [4]. Other notifiers take into account the real flag set by the user instead. I hope this clarifies that the logic is inverted somehow: we don't go from self-healing flag to notifier, but from notifier to self-healing flag. The anomaly detector manager cares about the notifier only, which is in charge of using the self-healing flag(s). Hope this helps and clarifies everything!
@ppatierno Thanks for explaining the code part and yes its true that the notifier will be kept as |
Not sure it makes any sense to set the SelfHealingNotifier as default because, even if it has some logic to fix or ignore, it doesn't send any alert when it's happening. It just logs warnings in CC. |
Right, ok, I think I understand now. In which case my comments stand. If a user sets Or, perhaps more usefully, sets it to the K8s events notifier (which extends the |
The difference would be what the user get notified about anomalies. With using the base |
Still working my way through, I'll add more ideas later.
# Self Healing using Cruise Control | ||
|
||
This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator. | ||
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically. |
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically. | |
When enabled, Cruise Control's self-healing feature automatically resolves issues detected by its Anomaly Detector Manager. | |
This includes problems such as broker failures, goal violations, and disk failures, reducing the need for manual intervention. |
@@ -0,0 +1,300 @@ | |||
# Self Healing using Cruise Control | |||
|
|||
This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator. |
This proposal is about enabling the self-healing properties of Cruise Control in the Strimzi operator. | |
This proposal is about adding support for Cruise Control's self-healing capabilities in the Strimzi operator. |
|
||
## Current situation | ||
|
||
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy. |
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy. | |
Even under normal operation, it's common for Kafka clusters to encounter problems such as partition key skew leading to a uneven partition distribution, or hardware issues like as disk failures, which can degrade overall cluster health and performance. |
The detection of anomalies is the responsibility of the anomaly detector mechanism. | ||
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster. | ||
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc. | ||
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc.). |
Wouldn't the information concerning the notifier feature be more relevant in the proposal section instead of the motivation section here where we describe the motivation behind why we want to add self-healing support? Unless we are adding support for Cruise Control notifiers as part of this proposal as well?
Two questions here:
(a) Can we use the Cruise Control notifiers feature on its own in Strimzi today without adding self-healing support?
(b) I understand that we need Cruise Control's notifier feature for supporting the self-healing feature, but wouldn't the sentences about the notifier feature be more relevant in the proposal's implementations section instead of here?
I agree with @kyguy . I think some sentences from this section can be in other more relevant section, rather than here in Motivation section.
Reading this sentence, it implies that currently, we can set these notifiers without this proposed solution to support self-healing. Basically what Kyle asked in the question.
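As background for the discussion above, setting one of the stock Cruise Control notifiers today only needs a single entry under `spec.cruiseControl.config`. A minimal sketch, assuming the usual Cruise Control notifier package name (worth double-checking against the CC version in use):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ... kafka, listeners, storage etc. omitted ...
  cruiseControl:
    config:
      # Alerting only: anomalies are reported, self-healing stays disabled.
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
```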
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc. | ||
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc.). | ||
However, self-healing is currently disabled and disallowed in Strimzi. | ||
The `self-healing` properties were disabled since there was investigation missing around how self-healing would act if pods roll in middle of it/rebalance starts in or how the operator would know if self-healing is running or not |
I would move a variation of this sentence up to the Current Situation
section.
However, self-healing is currently disabled and disallowed in Strimzi. | ||
The `self-healing` properties were disabled since there was investigation missing around how self-healing would act if pods roll in middle of it/rebalance starts in or how the operator would know if self-healing is running or not | ||
Any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource. | ||
It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected. |
Why is it useful to have these anomalies fixed automatically? It might be worth describing some of the anomalies the self-healing could fix automatically and why they are such a hassle to fix manually.
I still don't understand the StrimziNotifier
. It does not seem to be really useful. Why do you think anyone wants to use Kubernetes Events to be notified about something like this? But probably even more important - it seems to have absolutely no role in this proposal and I would suggest to lift it out into a separate proposal if you really think this makes sense for any user.
Apart from that, my take on this proposal is pretty similar to what it was in January. It seems like something that would work for auto-rebalancing. But:
- It provides an approach completely inconsistent with the auto-rebalancing-on-scaling feature
- It does not seem to have any considerations towards the future steps and integration. For example:
  - How to use the Cruise Control goal violations to feed auto-scaling
  - How to use the Cruise Control goal violations to tell the operator to fix a disk or broken node
- It does not mention any way to allow the auto-rebalancing only in certain time windows, which is IMHO very important for a feature like this. Or does CC have some configuration option for this?
I think the rejected alternative 2 might be able to provide clear paths for all three points. But adopting this proposal might close the doors to it (without some complicated changes and blocking the options we allow etc.).
|
||
During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy. | ||
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some goal violation and the cluster is imbalanced then we might instruct Cruise Control to move the partition replicas across the brokers in order to fix the violated goal by using the `KafkaRebalance` custom resource. | ||
With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own. |
Isn't it a single `KafkaRebalance` regardless of the cluster size? You can for example have a CronJob to create the KafkaRebalance every night or every weekend and execute it, and while the rebalance might take a different time, the process itself is the same regardless of whether you have 5 or 50 nodes.
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not. | ||
The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration. | ||
Detector classes have different mechanisms to detect their corresponding anomalies. | ||
For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API. |
What is Kafka Metadata API
? The Kafka Admin API
is defined by the docs - https://kafka.apache.org/documentation/#adminapi - but there is nothing like Metadata API in the Kafka docs.
Furthermore, `MetricAnomalyDetector` and `GoalViolationDetector` use metrics to detect their anomalies. | ||
The detected anomalies can be of various types: | ||
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` configuration under the `spec.cruiseControl.config` section. These goals are independent of the manual rebalancing goals configured in the `KafkaRebalance` custom resource. | ||
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk). |
Is this what you meant? Or what is some partitions are too large in disk
?
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk). | |
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions consume too much disk space). |
|
||
## Proposal | ||
|
||
### Introduction |
TBH, none of the Introduction
section is really a proposal of anything. You seem to be just describing how Cruise control works. It is useful, but I would not include it in the Proposal
but add it before it as a separate section. That would make it easier to find what exactly is this proposing.
If self-healing is enabled, then an action is returned by the notifier to would decide whether the anomaly should be fixed or not. | ||
If the notifier has returned `FIX` as the action then the classes which are responsible for resolving the anomaly would be called. | ||
Each detectable anomaly is handled by a specific detector class which then uses another remediation class to run a fix. | ||
For example, the `GoalViolations` class uses the `RebalanceRunnable` class, the `DiskFailure` class use the `RemoveDisksRunnable` class and so on. |
Can we have custom runnables as a reactions to different failures? E.g., is it technically possible to get Cruise Control to use StrimziFixDiskRunnable
to address DiskFailure
by deleting the PV/PVC and having it recreated by Kubernetes instead of doing some rebalance?
I am not sure if we can do that. I know we can create custom goals in CC but I don't know if we can write custom runnables. I can ask some CC folks if it's possible or not.
|
||
In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource. | ||
If the user forgets to set the notifier property then the `KubernetesEventNotifier`(custom notifier under Strimzi org) will be used by default and the user will also get a warning in the Kafka CR regarding missing `anomaly.notifier.class` property. | ||
We will support most anomaly fixes however there are some fixes which we don't want self-healing to fix automatically in a Kubernetes context. |
Can you be specific and list what goals will be and what will not be supported instead of saying most
?
Also, given what you say below and assuming it does not change -> please make it clear here that they will be just disabled by default while the users can enable them.
self.healing.disk.failure.enabled: false # disables self healing for disk failure | ||
``` | ||
|
||
The user can still enable self-healing for disk failures and broker failure if they want by setting the `self.healing.broker.failure.enabled` for broker failure and `self.healing.disk.failure.enabled` for disk failure to `true`. |
Why? They make zero sense in the Kubernetes context.
I would also expect that if we let the users enable them now, it will be impossible or at least much harder to provide a proper solution for them later as it will need to deal with users who have it enabled.
#### What happens when self-healing is running/fixing an anomaly and brokers are rolled | ||
|
||
When an anomaly is detected, Cruise Control may try to fix/mitigate it by moving the partition replicas across the brokers in the Kafka cluster. | ||
If, during the process of moving partitions replicas among the brokers in the Kafka cluster, some broker that was being used gets deleted or rolled, then Cruise Control would finish self-healing process and log that `self-healing finished successful with error`. |
What does finish
mean in this context? It sounds like it will move a partition replica to a black hole left after a removed broker which would be pretty nasty. I assume it will stop
the process with an error instead?
So the process is stopped, but I don't think there are any reverts, so it's more like an incomplete movement. The upstream CC logs still state "finished successful with error". Maybe we should get this error message fixed upstream.
|
||
* If self-healing is ongoing and a `KafkaRebalance` resource is created/applied in the middle of it then Cruise Control will still generate an optimization proposal. | ||
The proposal should be based on the cluster shape at that particular timestamp. | ||
* If self-healing is ongoing and the generated proposal is `approved` then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing. |
What if a proposal that is generated before or during self-healing is approved after the self-healing is finished?
That generated proposal will be based on old info, but yeah I can try this case to know more.
|
||
## Affected/not affected projects | ||
|
||
This change will affect the Strimzi cluster operator and a new repository named `kubernetes-event-notifier` will be added under the Strimzi organisation. |
It should probably include the cruise-control
somewhere in the name? Please also move the information about the new repository to the section about the notifier (unless you remove it as I suggest of course). You should also clarify there how it will be distributed (likely through Maven Central?).
Okay, I will add it
With the help of Cruise Control's self-healing feature we can resolve the anomalies causing unhealthy/imbalanced cluster automatically. | ||
The detection of anomalies is the responsibility of the anomaly detector mechanism. | ||
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster. | ||
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc. |
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc. | |
Notifiers can be useful as they provide users visibility for the actions that are being performed by Cruise Control self-healing e.g. anomaly being detected, anomaly fix being started etc. |
The detection of anomalies is the responsibility of the anomaly detector mechanism. | ||
Notifiers in Cruise Control are used to send notification to user about the self-healing operations being performed in the cluster. | ||
Notifiers can be useful as they provide visibility to the user regarding the actions that are being performed by Cruise Control regarding self-healing for e.g. anomaly being detected, anomaly fix being started etc. | ||
Currently, users are [able to set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (SelfHealingNotifier, AlertaSelfHealingNotifier, SlackSelfHealingNotifier etc.). |
I agree with @kyguy . I think some sentences from this section can be in other more relevant section, rather than here in Motivation section.
Reading this sentence, it implies that currently, we can set these notifiers without this proposed solution to support self-healing. Basically what Kyle asked in the question.
 | ||
|
||
The above flow diagram depicts the self-healing process. | ||
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier. |
The anomaly detector manger detects the anomaly (using the detector classes) and forwards it to the notifier. | |
The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier. |
* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics). | ||
|
||
The detected anomalies are inserted into a priority queue where comparator is based upon the priority value and the detection time. | ||
The smaller the priority value and detected time is, the higher priority the anomaly type has. |
Who decides the priority value? How does it correlate with detected time?
The priority is set in the KafkaAnomalyType
class in Cruise Control.
@JsonResponseField
BROKER_FAILURE(0),
@JsonResponseField
MAINTENANCE_EVENT(1),
@JsonResponseField
DISK_FAILURE(2),
@JsonResponseField
METRIC_ANOMALY(3),
@JsonResponseField
GOAL_VIOLATION(4),
@JsonResponseField
TOPIC_ANOMALY(5);
where the anomaly with the lowest number has the highest priority. Regarding the detection time, the anomaly which is detected first would be resolved first.
|
||
Whenever anomalies are detected, Cruise Control provides the ability to notify the user regarding the detected anomalies using optional notifier classes. | ||
The notification sent by these classes increases the visibility of the operations that are taken by Cruise Control. | ||
The notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property. |
is anomaly.notifier.class
specified under spec.CruiseControl.config
?
Yes
The anomaly status changes based on how the healing is progressing. | ||
The anomaly can transition to these possible states: | ||
* DETECTED - The anomaly is just detected and no action is yet taken on it | ||
* IGNORED - Self healing is either disabled or the anomaly is unfixable |
Also the state will be IGNORED if self-healing is enabled but set to the default notifier, `NoopNotifier`, right?
yes, let me add it
These metrics can also provide a way for users to debug self-healing related issues.
For example, the `number_of_self_healing_started` metric gives users a way to check when self-healing was started for an anomaly, and the number of self-healing actions that were performed in the cluster.
This information can be useful to understand at what point in time self-healing occurred and how the shape/replica assignment of the cluster was changed by it.
The user can also use the metric like `number_of_self_healing_failed` to know if the self-healing failed to start to understand, and then they can check the reasons in the log for the failure. One example could be load monitor was ready at that point of time.
The user can also use the metric like `number_of_self_healing_failed` to know if the self-healing failed to start to understand, and then they can check the reasons in the log for the failure. One example could be load monitor was ready at that point of time.
The user can also use a metric such as `number_of_self_healing_failed` to find out if the self-healing failed to start, so that they can check the reasons for the failure in the log. One example could be if the load monitor was not ready at that point in time.
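As an illustration of this kind of monitoring, a sketch of a Prometheus alerting rule on that metric; it assumes the metric is exported to Prometheus under the name used in the proposal, which is not guaranteed:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cruise-control-self-healing-alerts
spec:
  groups:
    - name: self-healing
      rules:
        - alert: SelfHealingFailed
          # Fires if any self-healing action failed to start in the last 15 minutes.
          expr: increase(number_of_self_healing_failed[15m]) > 0
          labels:
            severity: warning
          annotations:
            summary: "Cruise Control self-healing failed; check the Cruise Control logs for the reason."
```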
### Proposal for enabling self-healing in Strimzi

In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specifying a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
In order to activate self-healing, users will need to set the `self.healing.enabled` property to `true` and specify a notifier that the self-healing mechanism is going to use in the `spec.cruiseControl.config` section of their `Kafka` resource.
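For illustration, a minimal sketch of the configuration described in the sentence above; the notifier value is just an example, and the defaulting behaviour is the proposal's, not an existing Strimzi feature:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ... kafka and other sections omitted ...
  cruiseControl:
    config:
      self.healing.enabled: true
      # Example value only; the proposed default when this is omitted is discussed below.
      anomaly.notifier.class: com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
```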
If the user forgets to set the notifier property, then the `KubernetesEventNotifier` (a custom notifier under the Strimzi org) will be used by default, and the user will also get a warning in the Kafka CR about the missing `anomaly.notifier.class` property.
So to clarify: currently, if self-healing is enabled in Cruise Control, it uses the `NoopNotifier` by default, which ignores anomalies. With this proposal, will Strimzi always set the notifier to the `KubernetesEventNotifier`, and issue the warning if the user has not specified a notifier class?
Hi @scholzj , I tried to answer your questions below. I hope it makes things clearer in terms of why a notifier is necessary, as well as other future improvements to the feature.
Having our own notifier helps us in various ways:
The notifier plays an important role in the proposal: it is important to notify users about the detected anomalies, and it also covers the case where a user enables self-healing but forgets to set a notifier. If no notifier is set, all detected anomalies would simply be ignored, so having our own default notifier implementation gives users a better experience: they just need to enable self-healing and do not have to worry about other properties. Looking ahead, to have a more operator-driven approach instead of Cruise Control doing everything, we can also use our notifier to execute our own custom runnable classes.
All the interactions that we currently do with Cruise Control are manually triggered, and the fixes are always driven by the operator. Taking the case of auto-rebalancing on scaling, it is driven by the operator because the operator knows that scaling (up/down) is happening in the cluster and therefore triggers the rebalance. With self-healing the approach is different: it is driven by Cruise Control and is therefore unlike our previous integrations. If we want to feed auto-scaling or other operator actions from Cruise Control, we can use our own notifier to annotate the Kafka CR (or use a ConfigMap or similar). We can even plug in our own runnable classes if we have the notifier. Using these approaches we can involve the operator in the process and drive the solution accordingly.
There is no configuration in Cruise Control today which can help us with this. But if we use our own notifier, we can make use of an annotation (on the Kafka CR) to pause self-healing (by no-oping all anomalies) for the duration of any maintenance window. This was actually part of an earlier version of this proposal but was removed because we decided that users can disable self-healing if they know that some big manual rolls/rebalancing is going to happen in the cluster. They can re-enable the feature once the roll/rebalance is complete.
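Purely as an illustration of that (since-dropped) idea, a sketch using a hypothetical annotation name; neither the annotation nor the behaviour exists today:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    # Hypothetical annotation: the notifier would no-op all anomalies while it is set.
    strimzi.io/pause-self-healing: "true"
spec:
  cruiseControl:
    config:
      self.healing.enabled: true
```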
Just because someone does not use Slack or MS Teams doesn't mean that using Kubernetes Events is a good solution. So this is not a good reason to have it. There is also nothing cloud-native about it. You are just using Kubernetes Events for something they were not really designed for, and they are not a good alerting mechanism.
Well, as I said in my comments, this is a reason not to have this right now. You do not know what the future integration will look like, so you have no idea how easy or hard it will be to avoid breaking compatibility. So it is one more reason to remove the notifier.
Sorry, but no. If the user forgets to set the Slack notifier, will they somehow miraculously remember to check Kubernetes events? Also, there is no relationship to the operator, so there is nothing operator-driven about this. So I do not understand this argument and why having some Kubernetes Events is important for the self-healing.
Auto-rebalancing is not manually triggered. It is driven fully by the operator. So it is not true that all interactions are manually triggered. The self-healing as proposed here is just a fancy name for auto-rebalancing triggered by some metric inside Cruise Control. So I would expect it to be easy to have it driven by an alternative implementation that goes through the operator, similarly to how the auto-rebalancing at scaling does. After all, just a few lines above you talk about driving things through the operator using the Strimzi notifier in the future as well. So it clearly can be passed through some notifier to Strimzi to have Strimzi trigger the auto-rebalance.
Or you can first work out how to do the integration before dealing with events of dubious value, to make sure it does not cause problems in the future.
We already have maintenance time windows in Strimzi. That is what needs to be plugged in, not some pause-self-healing annotation (although there might be other use cases for such an annotation).
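For reference, a sketch of the existing Strimzi maintenance time windows configuration being referred to here (the cron expression is only illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # Existing Strimzi field: a list of cron expressions defining when maintenance
  # operations (for example certificate-renewal rolling updates) are allowed to run.
  maintenanceTimeWindows:
    - "* * 0-1 ? * SUN"
```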
This proposal shows how we plan to implement the self-healing feature in the operator.