Skip to content

Proposal for adding Self healing feature in the operator #145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
284 changes: 284 additions & 0 deletions 092-self-healing-using-cruise-control.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,284 @@
# Self Healing using Cruise Control

This proposal is about integrating the self-healing feature of Cruise Control into the Strimzi operator.
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The self-healing feature of Cruise Control allows us to heal the anomalies detected in the cluster, automatically.
When enabled, Cruise Control's self-healing feature automatically resolves issues detected by its Anomaly Detector Manager.
This includes problems such as broker failures, goal violations, and disk failures, reducing the need for manual intervention.

Anomalies are detected by Cruise Control using the anomaly detector manager.

## Current situation

During normal operations it's common for Kafka clusters to become unbalanced (through partition key skew, uneven partition placement etc.) or for problems to occur with disks or other underlying hardware which makes the cluster unhealthy.
Currently, if we encounter any such scenario we need to fix these issues manually i.e. if there is some broker failure then we might move the partition replicas from that corrupted broker to some other healthy broker by using the `KafkaRebalance` custom resource in the [`remove-broker`](https://strimzi.io/docs/operators/latest/full/deploying.html#proc-generating-optimization-proposals-str) mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not some bare-metal cluster but Kubernetes. So I'm struggling to find the issues where you would need to do what you suggest. Instead, on Kubernetes, you would normally just fix the issue on the existing broker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see your point. At Kuberntes level, maybe a broker failure is more related to lack of resources which you should fix at cluster level and not removing the broker. A better example could be a disk failure on a broker or some goal violation (which is the usual reason why we use manual rebalancing).

With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the issue on your own.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should clarify what issues are you talking about here. From the past discussion, the terminology used here seems highly confusing ... and I suspect that the words such as issues, anomalies, and problems are ambiguous at best`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me they are all synonyms more or less. Maybe just use one and stick with it?


## Motivation

With the help of the self-healing feature we can resolve the issues like disk failure, broker failure and other issues automatically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing in this proposal suggests that Cruise Control can solve disk or broker failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS, the proposal introduces the concept of broker/disk failure detectors and related fix class which highlight that CC can solve them. Or what are you looking for?

Cruise Control treats these issues as "anomalies", the detection of which is the responsibility of the anomaly detector mechanism.
Currently, users are [able to set]((https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection)) the notifier to one of those included with Cruise Control (<list of included notifiers>).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link seems broken? (Double ( and )?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first time you mention the term "notifier" but it shows up out of the context. You should first explain what it is and why it's useful in general, not related to the self-healing but more to the alerting.

However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
However, any anomaly that is detected would then need to be fixed manually by using the `KafkaRebalance` custom resource.
However, self-healing is currently disabled and disallowed in Strimzi.
Any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to elaborate on why we disallowed it originally and why we no longer think that it is warranted any more.

It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it useful to have these anomalies fixed automatically? It might be worth describing some of the anomalies the self-healing could fix automatically and why they are such a hassle to fix manually.


## Proposal

### Introduction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, none of the Introduction section is really a proposal of anything. You seem to be just describing how Cruise control works. It is useful, but I would not include it in the Proposal but add it before it as a separate section. That would make it easier to find what exactly is this proposing.


![self-healing flow diagram](./images/090-self-healing-flow.png)

#### Anomaly Detector Manager

The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The anomaly detector manager helps in detecting the anomalies and also handling the detected anomalies.
The anomaly detector manager helps in detecting the anomalies as well as handling them.

It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It acts as a coordinator between the detector classes as well as the classes which will be handling and resolving the anomalies.
It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies.

Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (This periodic check can be easily configured through the `anomaly.detection.interval.ms` configuration).
Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster have their corresponding anomalies or not (The frequency of this check can be easily configured through the `anomaly.detection.interval.ms` configuration).

Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need this level of details. Maybe just link to the doc of these detectors? If not, maybe reword the sentence to something like this:

Suggested change
Every detector class works in a different way to detect their corresponding anomalies. A `KafkaBrokerFailureDetector` will require a deep usage of requesting metadata to the Kafka cluster (i.e. broker failure) while if we talk about `DiskFailureDetector` it will be more inclined towards using the admin client API. In the same way the `MetricAnomalyDetector` will be making use of metrics and other detector like `GoalViolationDetector` and `TopicAnomalyDetector` will have their own way to detect anomalies.
Detector classes have different mechanisms to detect their corresponding anomalies. For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` utilises Kafka Admin API. Furthermore, `MetricAnomalyDetector` uses metrics while other detector such as `GoalViolationDetector` and `TopicAnomalyDetector` use a different mechanism to detect anomalies.

Maybe we should mention briefly how GoalViolationDetector and TopicAnomalyDetector detect anomalies like these others too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what exactly it is trying to say ... but it does not seem to make sense to me.

What the text is trying to say is that each detector has its own way to detect the anomaly (i.e. using admin client API, or deep metadata API request, or reading metrics and so on). Maybe the changes proposed by @tinaselenge makes it clearer?

The detected anomalies can be of various types:
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean the goals in KafkaRebalance and goals self.healing.goals are independent of each other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have specified hard goals in the KR, then the self.healing.goals should be a superset of the hard goals.

If by KR you mean the KafkaRebalance custom resource, I don't think what you are saying is true. There is no relationship between a specific KafkaRebalance and self-healing configuration.
Maybe are you referring to the general CC configuration? So the default hard goals in CC has to be a superset of self.healing.goals (and not the other way around)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) to those used for manual rebalancing..
* Goal Violation - This happens if certain optimization goals are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured independently (through the `self.healing.goals` config) from those used for manual rebalancing.

The nuances between the goal configurations of Cruise Control can be quite confusing, to help keep them straight in my head I categorize them in the following way:

  • The server-side Cruise Control goal configs in spec.cruiseControl.config of the Kafka resource
  • The client-side Cruise Control goal config in spec of the KafkaRebalance resource

To avoid confusion, it may be worth adding some text that explains where and how the self.healing.goals compares/relates to the other server-side Cruise Control goal configs - goals, default.goals, hard.goals, anomaly.detection.goals. It wouldn't have to be right here, but somewhere in the proposal where the self.healing configuration is raised. When we add the self-healing feature we will need to add some documentation of how self.healing.goals related to the other goal configurations anyways. We could use the same language used to describe the different goal configs in the Strimzi documentation and provide a link to those docs as well.

I think the language you and Paolo use in this thread referring to other goal configs like "superset" or subset is helpful. Since the goal configs are so complicated, I don't think it is a problem if we reuse text from the Strimzi Cruise Control goal documentation to emphasize the distinction of the self.healing.goals config.

* Topic Anomaly - Can happen when one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Topic Anomaly - Can happen when one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
* Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).

* Broker Failure - It happens when a non-empty broker crashes or leaves a cluster.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Broker Failure - It happens when a non-empty broker crashes or leaves a cluster.
* Broker Failure - This happens when a non-empty broker crashes or leaves a cluster.

* Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
* Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).

The detected anomalies are inserted into a priority queue and the anomaly with the highest property is picked first to resolve.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the priority decided?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each specify anomaly has a priority defined here [1].
The anomaly detector manager is putting anomalies into a PriorityBlockingQueue instance and the comparator for it is based on priority (lower comes first) and detection time (detected first comes first).

https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/notifier/KafkaAnomalyType.java

The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The anomaly detector manager calls the notifier to get an action regarding, whether the anomaly should be fixed, delayed or ignored.
The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.

If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the action is fix, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
If the action is fixed, then the anomaly detector manager calls the classes that are required to resolve the anomaly.


Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Anomaly detection also has several other [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can effect the performance of the Cruise Control server and the latency of the anomaly detection.
Anomaly detection also has various [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#anomalydetector-configurations), such as the detection interval and the anomaly notifier class, which can affect the performance of the Cruise Control server and the latency of the anomaly detection.


#### Notifiers in Cruise Control

Whenever anomalies are detected, Cruise Control provides the ability to notify the user regarding the detected anomalies using optional notifier classes.
The notification sent by these classes increases the visibility of the operation that are taken by Cruise Control.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The notification sent by these classes increases the visibility of the operation that are taken by Cruise Control.
The notification sent by these classes increases the visibility of the operations that are taken by Cruise Control.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The notification sent by these classes increases the visibility of the operation that are taken by Cruise Control.
The notification sent by these classes increases the visibility of the operations that are taken by Cruise Control.

However, the notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
However, the notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property.
The notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property.

The notifier class returns the `action` that is going to be taken on the notified anomaly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The notifier class returns the `action` that is going to be taken on the notified anomaly.
The notifier class returns the `action` that is going to be taken on the flagged anomaly.

These actions can be classified into three types:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
These actions can be classified into three types:
These actions have three types:

* FIX - Start the anomaly fix
* CHECK - Delay the anomaly fix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the delay action, apply a configured delay before taking the action provided by notifier?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Even if the delay is mostly used only for broker failures and not for another anomalies AFAIK.

* IGNORE - Ignore the anomaly fix

The default `NoopNotifer` always sets the notifier action as `IGNORE` means that the detected anomaly will be silently ignored.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add that it doesn't send any notification to the user in any way as well.


Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like Slack Notifier, Alerta Notifier etc. for notifying users regarding the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to make notifiers more efficient as per the use case.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 17, you suggest that Prometheus metrics and Alerts can be used for anomaly notification. So should this be somehow listed here as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see Prometheus mentioned earlier. Also AFAICS Cruise Control provides some notifiers like Slack, Alerta, MS Teams but there is not Prometheus one. It would be on you developing your own Prometheus notifier.


#### Self Healing
If self-healing is enabled in Cruise Control and anomalies are detected, then a fix for these will be attempted automatically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fix will be attempted automatically only if anomaly.notifier.class is configured to non default notifier right? So enabling self-healing alone is not enough, as with the default notifier (NoopNotifer), anomalies are detected but ignored?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you will need to override the notifier too

The action returned by the notifier would decide whether the anomaly should be fixed or not.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @tinaselenge pointed out. These two sentences contradict each other. You day a fix will be attempted automatically but then say only if the notifier decides. So you should rewrite these to make it clearer what is happening.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's not clear and should be clarified but the first sentence is part of the second one :-)
It's partially explained what I said here #145 (comment)
Anyway, we need to make this more consistent and clear.

If the notifier has returned `FIX` as the action then the classes which are responsible for resolving the anomaly would be called.
Each detected anomaly is implemented by a specific class which leverages a corresponding other class to run a fix.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Each detected anomaly is implemented by a specific class which leverages a corresponding other class to run a fix.
Each detectable anomaly is handled by a specific detector class which then uses another remediation class to run a fix.

For example, the `BrokerFailures` class uses the `RemoveBrokersRunnable` class, the `DiskFailure` class use the `RemoveDisksRunnable` class and so on.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly they "fix" issues with broken disk or broker? The fix for broken disk would be for example removing the old disk and provisioning a new one.

Running a rebalance is as when you try to fix your car with duct tape - if you are lucky, it might allow you to drive few more miles to the nearest service. That might be useful, but it does not really fix anything.

Calling things the right names is important because it will show what this proposal achieves or doesn't achieve and which environments it fits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is word mitigation better than fix in this case? Self-healing take actions to reduce/minimise the impact on cluster availability rather than resolving the root cause such as failed disk, failed broker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, Cruise Control is all about rebalancing nothing more. It can't remove a disk and replace with a new one. So for a disk failure it will move partitions off that disk to the others, same for a broker failure, same for a goal violation ... moving partitions around. We can call it mitigation if it's not a proper "fix" but this is the naming within Cruise Control.

An optimization proposal will then be generated by these `Runnable` classes and that proposal will be applied on the cluster to fix the anomaly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this details about internal classes fixing the anomaly? Line from 68 to 70 are maybe not needed, wdyt @tomncooper ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is useful to spell out the flow to help readers understand what self-healing is actually doing.

In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware(insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the cruise control logs will be updated with `self-healing is not possible due to unfixable goals` warning.
In case the anomaly detected is unfixable for e.g. violated hard goals that cannot be fixed typically due to lack of physical hardware (insufficient number of racks to satisfy rack awareness, insufficient number of brokers to satisfy Replica Capacity Goal, or insufficient number of resources to satisfy resource capacity goals), the anomaly wouldn't be fixed and the Cruise Control logs will be updated with `self-healing is not possible due to unfixable goals` warning.

When self-healing starts, you can check the anomaly status by polling the `state` endpoint with the `substate` configured as `anomaly_detector`. This would provide the anomaly details like `anomalyId`, `anomalyType` and the status of the anomaly etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be more to specify the exact API endpoint you are going to call (with also a reference to the doc) instead of describing it as "the state endpoint with the substate configured as anomaly_detector"

For example, the detected anomaly would look like this:
```shell
#...
recentDiskFailures:[{anomalyId=4b1ed191-50d7-4aa1-806b-b8c5c0d679fe,
detectionDate=2024-12-09T09:49:46Z,
statusUpdateDate=2024-12-09T09:49:47Z,
failedDisksByTimeMs={2={/var/lib/kafka/data-1/kafka-log2=1733737786603}},
status=FIX_STARTED}]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would describe more what the above snippet represents.

```

The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible status:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The anomaly status would change based on how the healing progresses. The anomaly can transition to these possible status:
The anomaly status changes based on how the healing is progressing.
The anomaly can transition to these possible states:

* DETECTED - The anomaly is just detected and no action is yet taken on it
* IGNORED - Self healing is either disabled or the anomaly is unfixable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also state will be IGNORED, if self-healing is enabled but set to the default notifier, NoopNotifer right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let me add it

* FIX_STARTED - The anomaly has started getting fixed
Copy link
Contributor

@tinaselenge tinaselenge Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* FIX_STARTED - The anomaly has started getting fixed
* FIX_STARTED - The anomaly fix has started

* FIX_FAILED_TO_START - The proposal generation for the anomaly fix failed
* CHECK_WITH_DELAY - The anomaly fix is delayed
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* LOAD_MONITOR_NOT_READY - Monitor which is monitoring the workload of a Kafka cluster is in loading or bootstrap state
* LOAD_MONITOR_NOT_READY - Monitor for workload of a Kafka cluster is in loading or bootstrap state

* COMPLETENESS_NOT_READY - Completeness not met for some goals. The completeness describes the confidence level of the data in the metric sample aggregator. If this is low then some Goal classes may refuse to produce proposals.

We can also poll the `state` endpoint, with `substate` parameter as `executor`, to get details on the tasks that are currently being performed by Cruise Control.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto as above.

For example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For example:
For example:

```shell
ExecutorState: {state: INTER_BROKER_REPLICA_MOVEMENT_TASK_IN_PROGRESS,
pending(20)/in-progress(0)/aborting(0)/finished(0)/total(20) inter-broker partition movements,
completed(0)/total(0) bytes in MBs: 0.00%, max/min/avg concurrent inter-broker partition movements per-broker: 5/5/5.00,
triggeredSelfHealingTaskId: 4b1ed191-50d7-4aa1-806b-b8c5c0d679fe, triggeredTaskReason:
Self healing for DISK_FAILURE: {
Disk /var/lib/kafka/data-1/kafka-log2 on broker 2 failed at 2024-12-09T09:49:46Z
}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would describe more what the above snippet represents.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShubhamRwt I can't see this comment being addressed, or?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add a line about it but not went into complete explanation. I will do that

```

### Enabling self-healing configuration in the Strimzi operator

Every configuration related to self-healing is currently disabled in the operator, and we will need to enable users altering some of those properties.
The setting of these properties will be placed behind a feature gate.

### Feature Gate

The new feature gate will be called `UseSelfHealing`.
It will be introduced in an alpha state and will be disabled by default.

We are using a feature gate for this feature due to various reasons:
* We want to allow users to try this feature and test it.
* We would like to know more about the positive/negatives of this feature by testing it in real deployments.
* Getting feedback from the community would allow us to understand how we can improve this even more.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly will the feature gate be used to achieve this?


### Enabling self-healing in the Kafka custom resource

When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When the feature gate is enabled, the operator will set the `self.healing.enabled` Cruise Control configuration to `true` and allow the users to set other appropriate `self-healing` related [configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations).
When the `UseSelfHealing` feature gate is enabled, users will be able to activate self-healing by configuring the [`self.healing` Cruise Control configurations](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) in the `spec.cruiseControl.config` of their `Kafka` resource.
When the `UseSelfHealing` feature gate is disabled, users will not be able to activate self healing and all `self.healing` Cruise Control configurations will be ignored.


Here is an example of how the configured Kafka custom resource could look:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Here is an example of how the configured Kafka custom resource could look:
Here is an example of how the configured `Kafka` custom resource could look:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-cluster
spec:
kafka:
# ...
cruiseControl:
config:
# configuration to enable self-healing for all anomalies
self.healing.enabled: true
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two things seem to contradict each other ...

  • Does the UseSelfHealing FG enable the self-healing as sugested in the first paragraph?
  • Or does it allow the user to enable self-healing as suggested by the example YAML?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say the first one. FG enables the self healing. I will remove this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would mean you would force the self-healing one day on all users even if they don't want it. That does not seem to be the right way, or?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can go with other way too where user enables self healing. We can change this as per the discussion here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find FG and this configuration a bit confusing. If FG is not enabled, would this configuration be allowed by user? What would happen, if user sets this configuration without enabling FG?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My take on this is that the user should enable self healing (and all the corresponding properties/configuration) which would be anyway ignored if the FG is not enabled. This way we are putting the "enablement and configuration of self-healing" behind a FG.


Where the user wants to enable self-healing for some specific anomalies, they can use configurations like `self.healing.broker.failure.enabled` for broker failure, `self.healing.goal.violation.enabled` for goal violation etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any anomalies enabled by default? If user does not specify anomalies to detect, self-healing will do nothing?

Copy link
Contributor Author

@ShubhamRwt ShubhamRwt Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user doesn't specify any specific anomalies(i.e you used just self.healing.enabled: true) then self healing would be detecting all the anomalies. In case the user knows that somehow any anomaly will keep appearing since they have some issues in their cluster, then in that they can use skip the check for that particular anomaly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user doesn't specify any specific anomalies(i.e you used just self.healing.enabled: true) then self healing would be detecting all the anomalies.

I wonder if it would be worth having all anomaly detection disabled by default and force users to explicitly enable which specific anomalies they want to detect and have fixed automatically. I imagine the alerting of some anomalies might be pretty noisy and the automatic fixes may be quite taxing on the cluster, having all anomalies enabled at once by default might surprise and overload users not familiar with the feature.


### Strimzi Notifier

Cruise Control provides `AnomalyNotifier` interface which has multiple abstract methods on what to do if certain anomalies are detected.
`SelfHealingNotifier` is the base class that contains the logic of self-healing and override the methods in by implementing the `AnomalyNotifier` interface.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDisk Failure`, `alert()` etc.
Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()` etc.

The `SelfHealingNotifier` class can then be further used to implement your own notifier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `SelfHealingNotifier` class can then be further used to implement your own notifier
The `SelfHealingNotifier` class can then be further used to implement your own notifier.

The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The other custom notifier offered by Cruise control are also based upon the `SelfHealingNotifier`.
The other custom notifiers offered by Cruise control are also based upon the `SelfHealingNotifier`.


We will be using our own notifier implementation called `StrimziNotifer`(set via the `anomaly.notifier.class` config).
Strimzi will configure Cruise Control to always use the `StrimziNotifier` to notify regarding the anomalies and take action on them.

For effective alerts and monitoring, we will be making use of Kubernetes events to signal to users what operations Cruise Control is performing automatically.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Why do you think Kubernetes Events are a good way to communicate the anomalies?
  • Why should it be forced on the user?
  • Why should it be called StrimziNotifier? It seems to be more Strimzi KubernetesEventsNotifier as it has nothing really specific to Strimzi?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC (during an offline discussion) we went towards the Kube events as a better solution to make users aware (at least) that an anomaly was detected instead of using, for example, a specific custom resource (i.e. KafkaAnomaly) or some other Kube built-in resource. How would you make the user aware of detected anomalies? Just logs? AFAIK you should be able to gather Kube events in a Grafana dashboard for monitoring as well.

On the naming side, maybe yes, a more generic KubernetesEventsNotifier is better.

The `StrimziNotifier` will override the `alert()` method of the `SelfHealingNotifier` and will trigger and publish events whenever an anomaly is detected.
These events would publish information regarding the anomaly like the anomalyID, the type of anomaly which was detected etc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is something like the anomalyID useful to the user? We do not support the CC UI, so is there some simple way for the users to find more details about this anomaly based on the anomalyID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can get more information using the state endpoint with executor substate. The executor will show the progress for the anomaly getting resolved(It will also contains the id of anomaly being resolved). they can also use the state endpoint with anomaly_detector substate to know more about what type of anomaly is detected , when it was detected etc

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you expect them to curl it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats the only way for now to check what Cruise Control is doing during self healing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds quite complicated. Do you expect the user would need to do this? Or would they be in 99% of cases be happy with what is in the event itself? The example event does not seem to have muhc detail about the problem that was detected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can get more information about the anomaly from the notifier.
We cannot get info about the fix because, as per how CC works today, there is no more interaction with the notifier from the anomaly detector after the anomaly fix starts. The notifier has just one goal: to notify and decide for the anomaly detector to fix, delay, ignore the anomaly. Nothing more. It's not called anymore in any way along the process.
Still per our offline discussion, the refused KafkaAnomaly custom resource was there to provide a way for the operator to get the anomaly and querying the CC endpoints in order to update such resource with the anomaly fix status in progress.

The events would help the user to understand the anomalies that are present, when the anomalies appeared in their cluster and also would help them know what is going on in their clusters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?

Could be enabled by default if a user sets self healing enabled to true but doesn't specify a notifier class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my main question here is: Will this KubernetesEventNotifier be auto-approving all notification?

I think the answer to this is still in my comment here #145 (comment) to explain how the base class for notifier works unless we want to write something from scratch (but I would argue about this).


The event published would look like this:
```yaml
- action: <action>
apiVersion: v1
eventTime: "2024-12-19T08:06:21.314140Z"
firstTimestamp: null
involvedObject:
kind: Pod
name: <cruise-control-pod-name>
namespace: myproject
kind: Event
lastTimestamp: null
message: Cruise Control is fixing the detected anomaly
metadata:
creationTimestamp: "2024-12-19T08:06:21Z"
generateName: cruise-control-event
name: my-cluster-34c41626-0add-47a3-9b79-d6ee29d5d81f # my-cluster-<Anomaly ID>
namespace: myproject
resourceVersion: "101070"
uid: 80892c7d-e3a1-44ad-84dc-2d88e15a7ff0
reason: Anomaly was detected in the cluster GOAL_VIOLATION # Detected Anomaly type
reportingComponent: cruise-control
reportingInstance: cruise-control
source: {}
type: Normal
```

The `StrimziNotifier` class is going to reside in a separate module in the operator repository called the `strimzi-cruise-control-notifier` so that a seperate jar for the notifier can be built for inclusion in the Cruise Control container image.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Why should it be in the operators repo? Seems like it has nothing to do with the operator? So Why not a separate repository?
  • There is no Cruise Control container image, Cruise Control is in the Kafka images.

We will also need to include the fabric8 client dependencies in the notifier JAR that will be used for generating the Kubernetes events as well as for reading the annotation on the Kafka resource from the notifier.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading what annotation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you are anticipating what's going to be in the next paragraph but this sentence doesn't make any sense here.


### Pausing the self-healing feature

The user should also have the option to pause the self-healing feature, in case they know or plan to do some rolling restart of the cluster or some rebalance they are going to perform which can cause issue with the self-healing itself.
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For this, we will be having an annotation called `strimzi.io/self-healing-paused`.
For this, we will be having an annotation called `strimzi.io/self-healing-paused` on the `Kafka` custom resource.

We can set this annotation on the Kafka custom resource to pause the self-healing feature.
self-healing is not paused by default so missing annotation would mean that anomalies detected will keep getting healed automatically.
When self-healing is paused, it would mean that all the anomalies that are detected after the pause annotation is applied would be ignored and no fix would be done for them.
The anomalies that were already getting fixed would continue with the fix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is pausing the self-healing useful for the users? Assuming it is useful, why not just disable the self-healing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabling self-healing would mean restarting CC. Pausing mean that the notifier just returns "ignore" to the anomaly detector (but you could still see anomalies logged in the CC pod for example).


#### How the pausing works?

To pause self-healing, you will need to annotate the Kafka custom resource with the following annotation:

```sh
kubectl annotate kafka my-cluster `strimzi.io/self-healing-paused: "true"`
```

The Kafka custom resource will look like this:
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
annotations:
strimzi.io/self-healing-paused: "true"
creationTimestamp: "2024-12-17T09:01:10Z"
generation: 1
name: my-cluster
namespace: myproject
resourceVersion: "109971"
uid: 71928731-df71-4b9f-ac77-68961ac25a2f
```

We will have checks on methods like `onGoalViolation`, `onBrokerFailure`, `onTopicAnomaly` etc. in the `StrimziNotifier` to check if the `pause` annotation is applied on the Kafka custom resource or not.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no pause annotation. Use the exact name.

If the annotation is applied then this check will set the `action` to be taken on the anomaly as `IGNORE` and the anomalies wouldn't be fixed.

### Strimzi Operations while self-healing is running/fixing an anomaly

Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Self-healing is a very robust and resilient feature due to its ability to tackle issues which can appear while the healing of cluster is happening.
Self-healing is a very robust and resilient feature due to its ability to tackle issues while healing a cluster.


#### What happens when self-healing is running/fixing an anomaly and brokers are rolled

If a broker fails, self-healing will trigger and Cruise Control will try to move the partition replicas from the failed broker to other healthy brokers in the Kafka cluster.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly does it mean when the broker fails? How does CC recognize it? How does it distinguish failure for example from some temporary Kubernetes scheduling issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The broker anomaly detector uses deep metadata request to the Kafka cluster to know the brokers status. The broker failure in particular is not fixed straight away but it's first delayed in order to double check "later" if the broker is still offline or not. All the timing between double checks are configurable. So any temporary issue will not be reported as anomaly and tried to be fixed.

If, during the process of moving partitions from the failed broker, another broker in the Kafka cluster gets deleted or rolled, then Cruise Control would finish the transfer from the failed broker but log that `self-healing finished successful with error`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would it finish the transfer from the failed broker when the target broker is stopped? You mean that it will continue with it after the target broker is rolled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, If the target broker is rolling at that moment, then self healing would end saying self-healing finished successful with error. Then periodic anomaly check will again detect the failed broker and try to fix it. If the target broker is rolled by that time then self healing will start again else if the rolled broker is not up then we will will get two anomalies to fix the failed broker and the target broker

If the deleted or rolled broker comes back before the periodic anomaly check interval, then self-healing would start again only for the earlier failed broker with a new anomalyId since it was not fixed properly.
In case the deleted broker doesn't come back healthy within the periodic anomaly check interval then another new anomaly for the deleted broker will be created and the earlier detected failed broker would also persist but will be updated with a new anomalyId.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But didn't you say above that broker failures are disabled?


#### What happens when self-healing is running/fixing an anomaly and Kafka Rebalance is applied

If self-healing is ongoing and a `KafkaRebalance` resource is posted in the middle of it then the `KafkaRebalance` resource will move to `NotReady` state, stating that there is already a rebalance ongoing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to differenciate between asking for proposal and approving a proposal? For example when asking for proposal, I would expect the operator to wait and give me a proposal after the self-healing ends.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the following snippet I can see the dryrun=false so this seems a condition happening when you approve a proposal. @ShubhamRwt does it mean that during self-healing, CC was able to return to you a proposal and you were able to approve it and then getting the error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShubhamRwt any answer here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the middle of self-healing I was able to generate the proposal and when I approved it I got the error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a section around this in the proposal

For example
```yaml
status:
conditions:
- lastTransitionTime: "2025-01-08T08:24:01.314273452Z"
message: 'Error for request: my-cluster-cruise-control.myproject.svc:9090/kafkacruisecontrol/rebalance?json=true&dryrun=false&verbose=true&skip_hard_goal_check=false&rebalance_disk=false.
Server returned: Error processing POST request ''/rebalance'' due to: ''java.lang.IllegalStateException:
Cannot start a new execution while there is an ongoing execution. Please use
stop_ongoing_execution=true to stop ongoing execution and start a new one.''.'
reason: CruiseControlRestException
status: "True"
type: NotReady
- lastTransitionTime: "2025-01-08T08:24:01.349653295Z"
message: 'Wrong annotation value: approve. The rebalance resource supports the
following annotations in NotReady state: [refresh]'
reason: InvalidAnnotation
status: "True"
type: Warning
observedGeneration: 1
```

## Affected/not affected projects

A new module name `strimzi-notifier` will be added to the operator repository.
The `strimzi-notifier` module should build as a jar and then be included in the cruise control image.

## Rejected Alternatives

### Alternative 1

This alternative is about using a Kubernetes custom resource to create a two-way interaction between the operator and Cruise Control.
The idea was to create a Kubernetes custom resource named `KafkaAnomaly` everytime an anomaly was detected.
The user will get updates regarding the anomaly fix through the generated `KafkaAnomaly` resource which would be updated by the operator by requesting the `state` endpoint with `anomaly_detector` and `executor` substates.

Pros:
* More hold on the self-healing process since everything is driven using the Kafka custom resource.

Cons:
* Very tight coupling with the operator.
* Would be hard to manage multiple `KafkaAnomaly` custom resources (for e.g. Deletion when anomaly is fixed etc.)

### Alternative 2

This alternative is similar to alternative 1 where we will use a Kubernetes custom resource to create a two-way interaction between the operator and Cruise Control.
The idea was to create a Kubernetes custom resource named `KafkaAnomaly` everytime an anomaly was detected and both the operator and the Notifier would watch the resource for updates.
But with this approach the operator will be responsible to make decision regarding the anomaly should be fixed or not

Pros:
* Allows the operator to ignore the anomalies is some task is already running in the cluster like rolling, rebalance etc.

Cons:
* Very tight coupling with the operator.
* Delaying the anomaly detector progress.
Binary file added images/090-self-healing-flow.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.