Introducing new KafkaRoller #103
Conversation
fvaleri
left a comment
Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?
@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.
Nice proposal. Thanks for it 👍 .
STs POV:
I think we would also need to design multiple tests to cover all the states that KafkaRoller v2 introduces. We have a few tests, but that's certainly not 100% coverage. So we should maybe have a meeting to talk about this...
Side note about performance:
What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? For sure I can imagine that we would see a significant improvement in rolling updates of multiple nodes when we use the batching mechanism...
@tinaselenge thanks for the example, it really helps.
I left some comments, let me know if something is not clear or you want to discuss further.
06x-new-kafka-roller.md
Outdated
- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update to the cluster. It checks the availability impact for the foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and the foo-0 partition now has ISR [1, 4].
- KafkaRoller restarts broker 1 and the foo-0 partition now has ISR [4], which is below the configured minimum in-sync replicas of 2, resulting in producers with acks=all no longer being able to produce to this partition.
In addition to rebalance, we have the same race condition with the replication factor change (the new integration between CC and TO); maybe you can mention this.
The roller should be able to call CC's user_tasks endpoint and check if there is any pending task. In that case, the roller has two options: wait for all tasks to complete, or continue as today with the potential issue you describe here. You can't really stop the tasks, because the current batch will still be completed and the operators will try to submit a new task in the next reconciliation loop.
I think that we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to a "force" policy, then the roller would behave like today. Wdyt?
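As a rough illustration of the user_tasks check suggested above, here is a minimal sketch of querying Cruise Control's REST API before continuing a roll. The service address, port, and the string-based status check are assumptions for illustration only, not part of the proposal:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CruiseControlTaskCheck {

    // Hypothetical address; in Strimzi the Cruise Control REST API location and
    // security settings would come from the operator's own configuration.
    private static final String USER_TASKS_URL =
            "http://my-cluster-cruise-control:9090/kafkacruisecontrol/user_tasks?json=true";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(USER_TASKS_URL)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Naive check for pending tasks: a real implementation would parse the JSON
        // response and inspect each task's status rather than matching strings.
        String body = response.body();
        boolean pendingTasks = body.contains("\"Status\":\"Active\"") || body.contains("\"Status\":\"InExecution\"");

        if (pendingTasks) {
            System.out.println("Cruise Control has pending tasks; wait (or log a warning) before restarting brokers");
        } else {
            System.out.println("No pending Cruise Control tasks; safe to continue the roll");
        }
    }
}
```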
Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.
This should have a dedicated proposal IMO, but let's start by logging an issue.
Would calling the `listPartitionReassignments` Admin API be enough to know this?
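For reference, such a check via the Kafka Admin client could look roughly like the sketch below (the bootstrap address is a placeholder). Note that it only reports reassignments the controller has already accepted, so on its own it would not close the window where Cruise Control is about to submit one:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignmentCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        try (Admin admin = Admin.create(props)) {
            // Lists all partitions that currently have an ongoing reassignment.
            Map<TopicPartition, PartitionReassignment> ongoing =
                    admin.listPartitionReassignments().reassignments().get();

            if (ongoing.isEmpty()) {
                System.out.println("No ongoing reassignments; the availability check can rely on the current ISR");
            } else {
                ongoing.forEach((tp, r) -> System.out.printf(
                        "Ongoing reassignment for %s: adding=%s removing=%s%n",
                        tp, r.addingReplicas(), r.removingReplicas()));
            }
        }
    }
}
```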
katheris
left a comment
Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.
katheris
left a comment
Two small nits but otherwise looks good to me
fvaleri
left a comment
Hi @tinaselenge. Thanks for the updates. I think this is definitely the right direction, but I left some more comments for you to consider. It may be that I'm missing some detail, so feel free to correct me.
06x-new-kafka-roller.md
Outdated
- Otherwise, restart each node, transition its state to `RESTARTED` and increment its `numRestartAttempts`.
- After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationalTimeoutMs` is reached.
- If the timeout is reached, throw `TimeoutException` if a node's `numRetries` is greater than or equal to `maxRetries`; otherwise increment their `numRetries` and repeat from step 2.
- After all the nodes are in `SERVING` state, trigger preferred leader elections via the Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationalTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
What about this?
A Kafka background thread ensures that the leader role is shifted to the preferred replica once it's in sync and a configured imbalance threshold is reached. This is enabled by default (see auto.leader.rebalance.enable). I think this may be enough.
Thanks everyone who reviewed the proposal!
scholzj
left a comment
As I stated in the past - I think this might be useful in general, but I do not think right now is the right time for this and I do not think we have the resources to support this change. So I'm not sure I would approve this proposal myself. However, I left some comments if you want to continue with this effort.
06x-new-kafka-roller.md
Outdated
The existing KafkaRoller suffers from the following shortcomings:
- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
Why does that impact tail latency?
I reworded this a little because tail latency perhaps was not the right description. It's more about the impact that it has on clients.
06x-new-kafka-roller.md
Outdated
- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
- It is hard to reason about when things go wrong. The code is complex to understand and it's not easy to determine why a pod was restarted from logs that tend to be noisy.
- There is a potential race condition between a Cruise Control rebalance and KafkaRoller that could cause partitions to go under the minimum in-sync replicas. This issue is described in more detail in the `Future Improvements` section.
Maybe you could provide some more details? I'm not aware of any such issue being raised by anyone.
This was raised in the Strimzi Slack channel a while ago, should I link it here? I have added more details about the potential scenario later in the proposal.
In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.
I can raise an issue for this.
06x-new-kafka-roller.md
Outdated
- KafkaRoller takes a long time to reconcile mixed nodes if they are all in `Pending` state. This is because a mixed node does not become ready until the quorum is formed, and KafkaRoller waits for a pod to become ready before it attempts to restart other nodes. In order for the quorum to form, at least the majority of controller nodes need to be running at the same time. This is not easy to solve in the current KafkaRoller without introducing some major changes, because it processes each node individually and there is no mechanism to restart multiple nodes in parallel. More information can be found [here](https://github.com/strimzi/strimzi-kafka-operator/issues/9426).

- The quorum health check relies on the `controller.quorum.fetch.timeout.ms` configuration, which is determined by the desired configuration values. However, during certificate reconciliation or manual rolling updates, KafkaRoller doesn't have access to these desired configuration values since they shouldn't prompt any configuration changes. As a result, the quorum health check defaults to using the hard-coded default value of `controller.quorum.fetch.timeout.ms` instead of the correct configuration value during manual rolling updates or when rolling nodes for certificate renewal.
I would argue this is a Kafka issue in the first place -> it should provide an in-sync / not in-sync flag. Counting the delays is wrong regardless of whether you use the desired or current value.
True. However, the workaround we currently have can be improved so that it does not use a hard-coded config value when doing the quorum health check.
06x-new-kafka-roller.md
Outdated
- build a map of nodes and their replicating partitions by sending a describeTopics request to the Admin API
- batch the nodes that do not have any partitions in common and therefore can be restarted together
- remove nodes that have an impact on availability from the batches (more on this later)
- return the largest batch
Has any analysis been done to show that this has an effect? Either on existing big clusters or on big test clusters? The only situation where I personally would expect this to have some effect is when you follow the racks as the batch boundaries. Should we try to follow the rack boundaries directly here?
I have tested this on a 9-broker cluster. I think we should definitely test this on a larger cluster.
I put details on rack boundaries in the rejected alternatives section as something that was already considered, but of course we all need to agree first on whether to reject it or not.
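As a side note on the batching steps quoted above, here is a minimal sketch of building the node-to-partitions map with the Admin client and greedily grouping brokers that share no partitions. The grouping shown is purely illustrative and is not the proposal's actual algorithm; the bootstrap address is a placeholder and the availability check is omitted:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class BatchingSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        try (Admin admin = Admin.create(props)) {
            Set<String> topicNames = admin.listTopics().names().get();
            Map<String, TopicDescription> topics = admin.describeTopics(topicNames).allTopicNames().get();

            // Map each broker id to the set of partitions it hosts a replica of.
            Map<Integer, Set<TopicPartition>> partitionsByBroker = new HashMap<>();
            topics.forEach((name, description) -> description.partitions().forEach(p ->
                    p.replicas().forEach(node -> partitionsByBroker
                            .computeIfAbsent(node.id(), id -> new HashSet<>())
                            .add(new TopicPartition(name, p.partition())))));

            // Greedily group brokers whose replica sets are disjoint; such brokers
            // could in principle be restarted together.
            List<List<Integer>> batches = new ArrayList<>();
            for (Map.Entry<Integer, Set<TopicPartition>> e : partitionsByBroker.entrySet()) {
                boolean placed = false;
                for (List<Integer> batch : batches) {
                    boolean disjoint = batch.stream().allMatch(b ->
                            Collections.disjoint(partitionsByBroker.get(b), e.getValue()));
                    if (disjoint) {
                        batch.add(e.getKey());
                        placed = true;
                        break;
                    }
                }
                if (!placed) {
                    batches.add(new ArrayList<>(List.of(e.getKey())));
                }
            }
            System.out.println("Candidate batches of brokers sharing no partitions: " + batches);
        }
    }
}
```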
06x-new-kafka-roller.md
Outdated
- Otherwise, restart each node, transition its state to `RESTARTED` and increment its `numRestartAttempts`.
- After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationTimeoutMs` is reached.
- If the timeout is reached, throw `TimeoutException` if a node's `numRetries` is greater than or equal to `maxRetries`; otherwise increment their `numRetries` and repeat from step 2.
- After all the nodes are in `SERVING` state, trigger preferred leader elections via the Admin client. Wait for their states to become `LEADING_ALL_PREFERRED` until the configured `postOperationTimeoutMs` is reached. If the timeout is reached, log a `WARN` message.
This is possibly very dangerous. Triggering this immediately after the restart can lead to the leaders moving to a newly started node that does not yet have established networking to some outside networks, e.g. through load balancers. That can happen due to the next node being restarted already today, so I do not think trying to align the preferred leaders is a problem per se. But we might want to inject an optional/configurable timeout between the restart and the leader realignment.
Should this be a separate configurable timeout or use the existing operationalTimeout? For example, we could wait for nodes to lead all the preferred partitions until the operation timeout is reached. If timed out, only then do we trigger leader realignment. Once requested to realign, we have another operational timeout to wait for it to complete.
I think operationalTimeout would be too long for most cases. So it would need a separate one.
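For context on the step being discussed, triggering a preferred leader election through the Admin client maps to the `electLeaders` call. Below is a minimal sketch; the topic partition and bootstrap address are placeholders, and the delay between restart and realignment discussed above is not shown:

```java
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class PreferredLeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        try (Admin admin = Admin.create(props)) {
            // Elect the preferred leader for a specific partition (placeholder name);
            // passing null instead of a set requests the election for all partitions.
            Set<TopicPartition> partitions = Set.of(new TopicPartition("foo", 0));
            admin.electLeaders(ElectionType.PREFERRED, partitions).partitions().get();
            System.out.println("Preferred leader election triggered for " + partitions);
        }
    }
}
```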
06x-new-kafka-roller.md
Outdated
#### Quorum health check

The quorum health logic is similar to the current KafkaRoller except for a couple of differences. The current KafkaRoller uses the `controller.quorum.fetch.timeout.ms` config value from the desired configurations passed from the reconciler, or uses the hard-coded default value if the reconciler passes null for the desired configurations. The new roller will use the configuration value of the active controller. This means that the quorum health check is done from the active controller's point of view.
I think effort should be made in Kafka to expose a clear flag for the quorum being in sync or not. Anything else is just an imperfect workaround. If not accepted by Kafka, maybe this should be done by the KafkaAgent to have it done locally and not remotely.
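To make the discussion more concrete, here is a rough sketch of a quorum health check based on the controller quorum description exposed by the Admin API. The majority rule and the hard-coded fetch timeout are simplifications for illustration; per the proposal, the timeout would be read from the active controller's configuration:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class QuorumHealthSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        // Illustrative value only: the real check would use controller.quorum.fetch.timeout.ms
        // as configured on the active controller.
        long fetchTimeoutMs = 2000L;

        try (Admin admin = Admin.create(props)) {
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();

            long leaderLastCaughtUp = quorum.voters().stream()
                    .filter(v -> v.replicaId() == quorum.leaderId())
                    .mapToLong(v -> v.lastCaughtUpTimestamp().orElse(0L))
                    .findFirst().orElse(0L);

            // Count voters whose last caught-up timestamp is within the fetch timeout
            // of the leader's, i.e. followers considered caught up.
            long caughtUpVoters = quorum.voters().stream()
                    .filter(v -> leaderLastCaughtUp - v.lastCaughtUpTimestamp().orElse(0L) < fetchTimeoutMs)
                    .count();

            boolean healthy = caughtUpVoters > quorum.voters().size() / 2;
            System.out.println("Caught-up voters: " + caughtUpVoters + "/" + quorum.voters().size()
                    + " -> quorum " + (healthy ? "healthy" : "at risk"));
        }
    }
}
```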
06x-new-kafka-roller.md
Outdated
| NOT_RUNNING | Node is not running (Kafka process is not running). This is determined via the Kubernetes API, more details for it below. | `NOT_READY`, `NOT_RUNNING` | `RESTARTED` `SERVING` |
| NOT_READY | Node is running but not ready to serve requests, which is determined by the Kubernetes readiness probe (broker state < 2 OR == 127 OR controller is not listening on port). | `RESTARTED` `SERVING` |
Where do the states such as init container, PodInitialization etc. fit in between these two states? In that state, the Pod is scheduled and some parts of it are running. But not the Kafka process. So it seems to fall between these two states right now.
I believe this would fall into the NOT_READY category. NOT_RUNNING has specific criteria, which are very similar to the criteria for stuck pods in the current roller.
I think that makes sense. But please add it to the state explanation as right now it suggests it would not belong there.
There is already a NOT_RUNNING section right below the table explaining this.
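To illustrate the classification being discussed, here is a small sketch of how a roller could map pod status from the Kubernetes API (via the Fabric8 client) onto these states. The criteria shown are simplified assumptions; the proposal's actual NOT_RUNNING rules are stricter and not reproduced here:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PodStateSketch {

    enum NodeState { NOT_RUNNING, NOT_READY, OTHER }

    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Placeholder namespace and pod name.
            Pod pod = client.pods().inNamespace("kafka").withName("my-cluster-kafka-0").get();
            System.out.println("my-cluster-kafka-0 classified as " + classify(pod));
        }
    }

    static NodeState classify(Pod pod) {
        boolean ready = pod != null && pod.getStatus() != null && pod.getStatus().getConditions().stream()
                .anyMatch(c -> "Ready".equals(c.getType()) && "True".equals(c.getStatus()));
        if (ready) {
            return NodeState.OTHER;
        }
        // Per the discussion above, pods that are still initialising (init containers,
        // PodInitializing, etc.) fall into NOT_READY. NOT_RUNNING is reserved for pods
        // matching stricter "stuck pod" criteria, which are not reproduced in this sketch.
        return looksStuck(pod) ? NodeState.NOT_RUNNING : NodeState.NOT_READY;
    }

    // Placeholder for the proposal's specific NOT_RUNNING criteria (e.g. pod missing,
    // unschedulable, or crash-looping); the exact rules are defined in the proposal.
    static boolean looksStuck(Pod pod) {
        return pod == null || pod.getStatus() == null;
    }
}
```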
06x-new-kafka-roller.md
Outdated
## Rejected

- Why not use rack information when batching brokers that can be restarted at the same time?
  When all replicas of all partitions have been assigned in a rack-aware way, then brokers in the same rack trivially share no partitions, and so racks provide a safe partitioning. However, nothing in a broker, controller or Cruise Control is able to enforce the rack-aware property, therefore assuming this property is unsafe. Even if CC is being used and rack-aware replicas is a hard goal, we can't be certain that other tooling hasn't reassigned some replicas since the last rebalance, or that no topics have been created in a rack-unaware way.
I am not sure the above is considered a rejected alternative. I mean, this section is for solutions for the same goals which were rejected, while it seems to be used just to highlight that an "idea" within the current proposal was rejected.
Of course, we would need to agree on whether we are rejecting this idea. Perhaps I should rename this to "other ideas considered"?
Hi @strimzi/maintainers. I have made some updates to the proposal recently, to hopefully make it easier to review. The proposal should now focus on the design only, and all the implementation details are in the linked POC PR. You can build and run the POC locally to see how it works. Before I invest more time and effort into this, I would like to get a clear decision on the next steps for this proposal, since there have been some mixed opinions.
I also have a question to ask the community users and other contributors:
My stance on this hasn't changed. I believe this is a useful feature, both in terms of the benefits to users around batching the rolling updates for large clusters, and in terms of improving the KafkaRoller code to make it easier to understand. I don't personally have the time to work on the implementation of this, but am willing to put myself forward as a maintainer who will prioritise reviewing the changes and understanding the code for future maintenance once it's in Strimzi. I'm also interested to hear what other maintainers and users think about the usefulness of this feature, so let's also add it to the Strimzi community call agenda later today (17th April).
In the three weeks since the community call where we discussed this, there haven't been any additional reviews or comments here (apart from Kate's, which came before the community call).
The two points above would be the big benefits of using a new KafkaRoller, but without a real need I have concerns about whether the testing effort and getting it working would outweigh the advantages we can gain. That said, from the community call it seems that we have got:
Thanks a lot for the proposal, I went through it and it LGTM (I left some minor comments). I would like to go through the PoC and possibly try it to see how it works. In terms of testing, we should have a look at writing more STs now, as IIRC there are just a few for the old roller and it would be good to have them improved. Also, we discussed with @see-quick that it would maybe be beneficial to look at performance tests in the future (but it's not something urgent right now).
If the current code is too complex and the new KafkaRoller would make the code clearer + other code changes or new features would be much simpler to implement, it's good to have it there FMPOV.
Also, you mentioned various things that will be implemented -> from my understanding you expect and want to have them all in place with the initial implementation, right? Or do you want to do a 1:1 KafkaRoller in comparison with the old one and then add the new features?
06x-new-kafka-roller.md
Outdated
| - | - | UNKNOWN |
| Pod is not Running | - | NOT_RUNNING |
| Pod is Running but lacking Ready status | Broker state != 2 | NOT_READY |
| Pod is Running but lacking Ready stats | Broker state == 2 | RECOVERING |
| Pod is Running but lacking Ready stats | Broker state == 2 | RECOVERING |
| Pod is Running but lacking Ready status | Broker state == 2 | RECOVERING |
11x-new-kafka-roller.md
Outdated
The new KafkaRoller introduced by this proposal will be used only for KRaft-based clusters.
This proposal should have no impact on any existing Kafka clusters deployed with ZooKeeper.
Because Strimzi doesn't support ZooKeeper anymore, are these two lines needed? Just asking out of curiosity, as I know that the proposal was written at a time when we supported both.
11x-new-kafka-roller.md
Outdated
| Phase | Strimzi versions | Default state |
|:------|:-----------------|:--------------------|
| Alpha | 0.46, ? | Disabled by default |
We should change it to 0.47/0.48 or something. It's just a reminder.
11x-new-kafka-roller.md
Outdated
`NOT_READY` nodes will be restarted if they have a restart reason and have not been restarted yet. If a node is still not ready after already being restarted, we don't want to restart any other nodes, to avoid taking down more nodes.

`READY` nodes will be restarted if they have a restart reason. If they don't have a restart reason but need to be reconfigured, they will be reconfigured. If no reconfiguration is needed, then no action will be taken on these nodes.
If we want to stop a roll, would the process be to apply strimzi.io/pause-reconciliation="true"? Would that put each node in READY or something else?
I wanted to drop in and say thank you for taking the time to put this proposal together. Slow deployments and rolls are the biggest headache we have here at Reddit. Our clusters are many hundreds of brokers, meaning each update can be a multi-day effort. Everyone dreads it. We have clusters we are eager to grow even more, but linear-speed deployments put us in a tough place. Just last month I told a product team they had to hold back their ramp because we couldn't manage a 300-node cluster due to this. The current plan for the next year is to start creating clusters with "-2", "-3" appended to at least allow parallelism across clusters. This is not ideal; our users would rather write to one big cluster instead of a lot of small ones. And of course it is a headache for us. I've seen replication-aware deployments for Kafka in the past, and they worked quite well. Deployment time for 300+ broker clusters only took hours. Having this in Strimzi would be a huge operational win for us and anyone running a big deployment. We would be happy to help contribute to this however we can. While we don't have much Java experience, we can certainly help test, review, and promote once it is released. Thanks again for putting this together.
Hi @nickgarvey, thanks for taking the time to review the PR and for offering to help with testing and reviews. Implementing this would take a lot of work, so it's really useful to hear from community members that this would be useful and worth investing the time in.
Made some improvements on the structure
Tidy up
Add possible transitions
Added flow diagram for state transitions
- Improve the names for categories and states
- Remove restarted/reconfigured states
- Add a configuration for delay between restarts
- Add a configuration for delay between restart and trigger of preferred leader election
- Restart NOT_RUNNING nodes in parallel for quicker recovery
- Improve the overall algorithm section, to make it clearer and concise
Updated the text on the diagram
Removed the implementation details such as the algorithm. This will be included in the draft PR for the POC instead.
Thanks to everyone who reviewed the PR. I'm closing this PR for now as the proposal has changed quite a bit since the last time most people looked at it, and it has many comments that may or may not be outdated. I plan to go over the proposal again and open a new PR for this, so that people can have a look with fresh eyes. @nickgarvey thanks for the comment offering to help with reviewing and testing it. It would help us build much more confidence in the new roller and make progress. I will mention Reddit in the new update as one of the vendors willing to help with testing, if that's ok.
POC implementation