Introducing new KafkaRoller #103
Conversation
Made some improvements on the structure Signed-off-by: Gantigmaa Selenge <[email protected]>
Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?
Tidy up Signed-off-by: Gantigmaa Selenge <[email protected]>
@fvaleri Thank you for the feedback. I have added an example of a rolling update. Please let me know what you think.
Nice proposal. Thanks for it 👍 .
STs POV:
I think we would also need to design multiple tests to cover all the states which KafkaRoller v2 introduces. We have a few tests, but that's certainly not 100% coverage, so we should maybe have a meeting to talk about this...
Side note about performance:
What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? I can certainly imagine that we would see a significant improvement on RollingUpdates of multiple nodes when we use the batching mechanism...
Co-authored-by: Maros Orsak <[email protected]> Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge thanks for the example, it really helps.
I left some comments, let me know if something is not clear or you want to discuss further.
06x-new-kafka-roller.md
Outdated
- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update of the cluster. It checks the availability impact for the foo-0 partition before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed and the foo-0 partition now has ISR [1, 4].
- KafkaRoller restarts broker 1 and the foo-0 partition now has ISR [4], which is below the configured minimum in-sync replicas of 2, so producers with acks=all can no longer produce to this partition.
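To make the race concrete, below is a minimal sketch of the kind of ISR-based availability check described in this scenario, using the Kafka Admin API. This is not the actual KafkaRoller code; the class and helper names are made up for illustration. The race exists because the ISR can change (for example, when the `removingReplicas` request is processed) between this check and the actual restart.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Map;
import java.util.Set;

public class AvailabilityCheckSketch {

    /**
     * Returns true if restarting the given broker would still leave every listed partition
     * with at least min.insync.replicas in-sync replicas. Hypothetical helper, for illustration only.
     */
    static boolean canRestartBroker(Admin admin, int brokerId, Set<String> topicNames) throws Exception {
        Map<String, TopicDescription> topics = admin.describeTopics(topicNames).allTopicNames().get();
        for (TopicDescription topic : topics.values()) {
            int minIsr = minInsyncReplicas(admin, topic.name());
            for (TopicPartitionInfo partition : topic.partitions()) {
                boolean brokerInIsr = partition.isr().stream().anyMatch(node -> node.id() == brokerId);
                // If the broker is in the ISR and removing it would drop the ISR below
                // min.insync.replicas, restarting it now would block acks=all producers.
                if (brokerInIsr && partition.isr().size() - 1 < minIsr) {
                    return false;
                }
            }
        }
        // The ISR may still change after this point, e.g. when a pending reassignment is applied,
        // which is exactly the race described above.
        return true;
    }

    static int minInsyncReplicas(Admin admin, String topicName) throws Exception {
        ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
        Config config = admin.describeConfigs(Set.of(resource)).all().get().get(resource);
        return Integer.parseInt(config.get("min.insync.replicas").value());
    }
}
```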
In addition to rebalance, we have the same race condition with replication factor change (the new integration between CC and TO), maybe you can mention this.
The roller should be able to call CC's user_tasks endpoint and check if there is any pending task. In that case, the roller has two options: wait for all tasks to complete, or continue as today with the potential issue you describe here. You can't really stop the tasks, because the current batch will still be completed and the operators will try to submit a new task in the next reconciliation loop.
I think that we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to the "force" policy, then the roller would behave like today. Wdyt?
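For illustration, here is a rough sketch of what the default "wait for pending Cruise Control tasks" policy could look like. The `user_tasks` endpoint is a real Cruise Control REST endpoint, but the base URL, the string-based status check, and the polling strategy are simplifying assumptions; a real implementation would parse the JSON response and map task states explicitly.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CruiseControlTaskCheckSketch {

    /**
     * Returns true if Cruise Control reports any user task that is still active.
     * Naive sketch: matches status strings in the raw JSON instead of parsing it.
     */
    static boolean hasPendingTasks(HttpClient client, String ccBaseUrl) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(ccBaseUrl + "/kafkacruisecontrol/user_tasks?json=true"))
                .GET()
                .build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        return body.contains("\"Active\"") || body.contains("\"InExecution\"");
    }

    /** Wait-or-force policy: block until no tasks are pending, unless "force" is configured. */
    static void waitForCruiseControl(HttpClient client, String ccBaseUrl, boolean force) throws Exception {
        while (!force && hasPendingTasks(client, ccBaseUrl)) {
            System.out.println("WARN: waiting for Cruise Control tasks to complete before rolling");
            Thread.sleep(10_000);
        }
    }
}
```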
Should this perhaps be included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion, I think.
This should have a dedicated proposal IMO, but let's start by logging an issue.
Would calling the ListReassigningPartitions API be enough to know this?
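For reference, checking this through the Admin API would look roughly like the sketch below (`listPartitionReassignments()` is the actual Admin method; the wrapper class is hypothetical). One caveat, as a hedge: it would only show reassignments already submitted to Kafka, not Cruise Control tasks that have not yet been turned into reassignment requests.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;

public class ReassignmentCheckSketch {

    /** Returns true if any partition reassignment is currently in progress in the cluster. */
    static boolean reassignmentInProgress(Admin admin) throws Exception {
        Map<TopicPartition, PartitionReassignment> reassignments =
                admin.listPartitionReassignments().reassignments().get();
        return !reassignments.isEmpty();
    }
}
```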
Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.
Add possible transitions Signed-off-by: Gantigmaa Selenge <[email protected]>
Thanks for answering all my questions. Good job.
Thank you @fvaleri, I really appreciate you reviewing the proposal thoroughly.
06x-new-kafka-roller.md
Outdated
- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
- It is hard to reason about when things go wrong. The code is complex to understand, and it is not easy to determine from the logs, which tend to be noisy, why a pod was restarted.
- There is a potential race condition between a Cruise Control rebalance and KafkaRoller that could cause partitions to drop below their minimum in-sync replicas. This issue is described in more detail in the `Future Improvements` section.
In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.
Updated the text on the diagram Signed-off-by: Gantigmaa Selenge <[email protected]>
Hi @tombentley @scholzj @ppatierno, do you have any further comments on this proposal?
Converting this PR to Draft, as I'm currently rewriting the proposal to make it more focused on the proposed solutions, rather than implementation details. However, I'm not changing the core of the proposal. I'm hoping that the upcoming update will make it easier to review.
Any updates on this?
Removed the implementation details such as the algorithm. This will be included in the draft PR for the POC instead. Signed-off-by: Gantigmaa Selenge <[email protected]>
Hi @strimzi/maintainers. I have made some updates to the proposal recently, hopefully making it easier to review. The proposal should now focus on the design only, and all the implementation details are in the linked POC PR. You can build and run the POC locally to see how it works. Before I invest more time and effort into this, I would like to get a clear decision on the next steps for this proposal, since there have been some mixed opinions.
I also have a question to ask the community users and other contributors:
My stance on this hasn't changed. I believe this is a useful feature, both in terms of the benefits to users around batching the rolling updates for large clusters, and in terms of improving the KafkaRoller code to make it easier to understand. I don't personally have the time to work on the implementation of this, but I am willing to put myself forward as a maintainer who will prioritise reviewing the changes and understanding the code for future maintenance once it's in Strimzi. I'm also interested to hear what other maintainers and users think about the usefulness of this feature, so let's also add it to the Strimzi community call agenda later today (17th April).
After three weeks from the community call where we discussed this, there weren't any additional reviews or comments here (apart from Kate's, which came before the community call).
The two above would be the big benefits we get from a new KafkaRoller, but without a real need I have concerns about whether the testing effort and getting it to work outweigh the advantages we can gain. That said, from the community call it seems that we have got:
Thanks a lot for the proposal, I went through it and it LGTM (I left some minor comments). I would like to go through the PoC and possibly try it to see how it works. In terms of testing, we should have a look at writing more STs now, as IIRC there are just a few for the old roller and it would be good to have them improved. Also, we discussed with @see-quick that it would maybe be beneficial to look at performance tests in the future (but it's not something urgent right now).
If the current code is too complex and the new KafkaRoller would make the code clearer, plus other code changes or new features would be much simpler to implement, it's good to have it there FMPOV.
Also, you mentioned various things that will be implemented -> from my understanding you expect and want to have them all in place with the initial implementation, right? Or do you want to do a 1:1 KafkaRoller compared with the old one and then add the new features?
| - | - | UNKNOWN |
| Pod is not Running | - | NOT_RUNNING |
| Pod is Running but lacking Ready status | Broker state != 2 | NOT_READY |
| Pod is Running but lacking Ready stats | Broker state == 2 | RECOVERING |
Suggested change:
- | Pod is Running but lacking Ready stats | Broker state == 2 | RECOVERING |
+ | Pod is Running but lacking Ready status | Broker state == 2 | RECOVERING |
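For illustration, the excerpted rows could be expressed as a mapping like the sketch below. The enum and method are hypothetical, not the proposal's actual types; broker state `2` corresponds to Kafka's log-recovery (`RECOVERY`) broker state.

```java
public class NodeStateMapperSketch {

    enum NodeState { UNKNOWN, NOT_RUNNING, NOT_READY, RECOVERING }

    // Kafka's BrokerState.RECOVERY value: the broker is replaying logs after starting up.
    private static final int BROKER_STATE_RECOVERY = 2;

    /**
     * Classification for the rows in the table excerpt above. Assumes the pod, if Running,
     * lacks the Ready condition; Ready pods are covered by rows not shown in this excerpt.
     * brokerState is null when the broker's state could not be queried.
     */
    static NodeState classify(Boolean podRunning, Integer brokerState) {
        if (podRunning == null) {
            return NodeState.UNKNOWN;        // nothing observed about the pod at all
        }
        if (!podRunning) {
            return NodeState.NOT_RUNNING;    // pod exists but is not in the Running phase
        }
        // Pod is Running but not Ready: broker state 2 means the node is still performing
        // log recovery (RECOVERING); anything else is plain NOT_READY.
        return brokerState != null && brokerState == BROKER_STATE_RECOVERY
                ? NodeState.RECOVERING
                : NodeState.NOT_READY;
    }
}
```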
The new KafkaRoller introduced by this proposal will be used only for KRaft-based clusters.
This proposal should have no impact on any existing Kafka clusters deployed with ZooKeeper.
Because Strimzi doesn't support ZooKeeper anymore, are these two lines needed? Just asking out of curiosity, as I know that the proposal was written at a time when we supported both.
| Phase | Strimzi versions | Default state |
|:------|:-----------------|:--------------------|
| Alpha | 0.46, ? | Disabled by default | |
We should change it to 0.47/0.48 or something. It's just a reminder.
POC implementation