
Introducing new KafkaRoller #103


Open

wants to merge 16 commits into base: main

Conversation

@tinaselenge tinaselenge commented Jan 2, 2024

Signed-off-by: Gantigmaa Selenge <[email protected]>
Made some improvements on the structure

Signed-off-by: Gantigmaa Selenge <[email protected]>
@fvaleri (Contributor) left a comment


Just a first pass, as I need more time to digest this. I think it would be useful to illustrate the new behavior with a couple of examples of the form: with this roller configuration and cluster state, these are the node groups and their restart order. Wdyt?

@tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from c060c24 to 433316f on March 15, 2024
Tidy up

Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge tinaselenge marked this pull request as ready for review March 15, 2024 12:29
Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge (Contributor Author)

@fvaleri Thank you for the feedback. I have added an example of rolling update. Please let me know what you think.

@see-quick (Member) left a comment


Nice proposal. Thanks for it 👍 .


STs POV:

I think we would also need to design multiple tests to cover all the states that KafkaRoller v2 introduces. We have a few tests, but that's certainly not 100% coverage, so maybe we should have a meeting to talk about this...

Side note about performance:

What would be appropriate performance metrics for us to consider when designing performance tests? Are there any critical ones? I can certainly imagine that we would see a significant improvement in rolling updates of multiple nodes when using the batching mechanism...

Co-authored-by: Maros Orsak <[email protected]>
Signed-off-by: Gantigmaa Selenge <[email protected]>
@fvaleri (Contributor) left a comment


@tinaselenge thanks for the example, it really helps.

I left some comments, let me know if something is not clear or you want to discuss further.

- Cruise Control sends a `removingReplicas` request to un-assign the partition from broker 2.
- KafkaRoller is performing a rolling update of the cluster. It checks the availability impact for partition foo-0 before rolling broker 1. Since partition foo-0 has ISR [1, 2, 4], KafkaRoller decides that it is safe to restart broker 1. It is unaware of the `removingReplicas` request that is about to be processed.
- The reassignment request is processed, and partition foo-0 now has ISR [1, 4].
- KafkaRoller restarts broker 1, and partition foo-0 now has ISR [4], which is below the configured minimum in-sync replicas of 2. As a result, producers with acks=all can no longer produce to this partition.
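The steps above can be sketched as a small simulation. This is illustrative only (the helper name `is_safe_to_restart` and the inline min-ISR constant are assumptions, not the actual KafkaRoller API): the roller's safety check reads the ISR before the reassignment is applied, so its decision is based on stale state.

```python
# Minimal sketch of the race: the availability check passes against a stale
# ISR, then the reassignment shrinks the ISR before the restart happens.

MIN_INSYNC_REPLICAS = 2  # assumed min.insync.replicas for partition foo-0

def is_safe_to_restart(broker_id, isr):
    """Restarting is 'safe' if the remaining ISR stays at or above min ISR."""
    remaining = [b for b in isr if b != broker_id]
    return len(remaining) >= MIN_INSYNC_REPLICAS

# Step 1: the roller checks partition foo-0 with ISR [1, 2, 4].
isr = [1, 2, 4]
assert is_safe_to_restart(1, isr)       # [2, 4] would remain -> looks safe

# Step 2: the removingReplicas request is processed; broker 2 leaves the ISR.
isr = [1, 4]

# Step 3: the roller restarts broker 1 based on its earlier, now stale, check.
isr = [b for b in isr if b != 1]        # ISR is now [4]
assert len(isr) < MIN_INSYNC_REPLICAS   # below min ISR: acks=all producers block
```

The check itself is correct in isolation; the problem is purely the time window between the check and the restart.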
@fvaleri (Contributor) commented Apr 25, 2024


In addition to rebalance, we have the same race condition with the replication factor change (the new integration between CC and TO); maybe you can mention this.

The roller should be able to call CC's user_tasks endpoint and check if there is any pending task. In that case, the roller has two options: wait for all tasks to complete, or continue as today with the potential issue you describe here. You can't really stop the tasks, because the current batch will still be completed, and the operators will try to submit a new task in the next reconciliation loop.

I think that we should let the user decide which policy to apply through a configuration. By default the roller would wait for all CC tasks to complete, logging a warning. If the user sets or switches to the "force" policy, then the roller would behave like today. Wdyt?
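The suggested wait-vs-force policy could look roughly like this. Everything here is hypothetical (the function name, the `policy` option, and the injected `cc_pending_tasks` callable are not an existing Strimzi or Cruise Control API; only the `user_tasks` REST endpoint is named in the comment above):

```python
# Sketch of the proposed policy: before rolling, either wait for pending
# Cruise Control tasks to drain (default), or force through as today.
import logging
import time

logger = logging.getLogger("roller")

def wait_for_cruise_control(cc_pending_tasks, policy="wait",
                            poll_seconds=10, timeout_seconds=600):
    """Return True when it is OK to proceed with the rolling update.

    cc_pending_tasks: callable returning the list of pending CC tasks,
    e.g. backed by GET /kafkacruisecontrol/user_tasks (hypothetical wiring).
    """
    if policy == "force":
        # Behave like today's roller: proceed despite pending tasks.
        return True
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        tasks = cc_pending_tasks()
        if not tasks:
            return True
        logger.warning("Waiting for %d Cruise Control task(s) to complete",
                       len(tasks))
        time.sleep(poll_seconds)
    return False  # caller can fail this reconciliation and retry later
```

Returning `False` on timeout (rather than blocking forever) fits the operator's reconciliation-loop model: the next loop simply re-evaluates.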

Contributor Author


Should this be perhaps included/discussed in a separate proposal or issue? The idea was to mention that there is a race condition we could fix with the new roller in the future, which is not easy to fix with the old roller. How we fix it and other similar problems should be a separate discussion I think.

Contributor


This should have a dedicated proposal IMO, but let's start by logging an issue.


Would calling the ListReassigningPartitions API be enough to know this?

@katheris (Contributor) left a comment


Overall this looks good to me, but I had a few questions and wording suggestions. I definitely think this will be useful since I've experienced first hand how tricky it is to debug the existing code.

@tinaselenge force-pushed the kafka-roller-2 branch 3 times, most recently from 931adbd to 1060fee on April 30, 2024
Add possible transitions

Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge force-pushed the kafka-roller-2 branch 2 times, most recently from 4c035fa to bf71ae6 on July 19, 2024
Signed-off-by: Gantigmaa Selenge <[email protected]>
@fvaleri (Contributor) left a comment


Thanks for answering all my questions. Good job.

@tinaselenge (Contributor Author)

> Thanks for answering all my questions. Good job.

Thank you @fvaleri , I really appreciate you reviewing the proposal thoroughly.

- Although it is safe and straightforward to restart one broker at a time, this process is slow in large clusters ([related issue](https://github.com/strimzi/strimzi-kafka-operator/issues/8547)).
- It does not account for partition preferred leadership. As a result, there may be more leadership changes than necessary during a rolling restart, consequently impacting tail latency.
- It is hard to reason about when things go wrong. The code is complex to understand, and it is not easy to determine from the logs, which tend to be noisy, why a pod was restarted.
- There is a potential race condition between Cruise Control rebalances and KafkaRoller that could cause partitions to fall below their minimum in-sync replicas. This issue is described in more detail in the `Future Improvements` section.
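The speed-up mentioned in the first bullet comes from restarting brokers in batches rather than one at a time. One way to picture the safety constraint (this is an illustrative greedy sketch with assumed names, not the proposal's actual algorithm): two brokers may share a batch only if they host no partition replicas in common, so each partition loses at most one replica at a time.

```python
# Greedy sketch of partitioning brokers into restart batches such that
# no two brokers in the same batch hold replicas of the same partition.

def batch_brokers(replicas_by_broker):
    """replicas_by_broker: {broker_id: set of (topic, partition) tuples}."""
    batches = []
    for broker, replicas in sorted(replicas_by_broker.items()):
        for batch in batches:
            # The broker fits a batch if it shares no partition with any member.
            if all(replicas.isdisjoint(replicas_by_broker[b]) for b in batch):
                batch.append(broker)
                break
        else:
            batches.append([broker])
    return batches

# Brokers 0 and 2 share no partitions, so they can be restarted in parallel.
assignment = {
    0: {("foo", 0), ("foo", 1)},
    1: {("foo", 0), ("bar", 0)},
    2: {("bar", 0), ("bar", 1)},
}
print(batch_brokers(assignment))  # -> [[0, 2], [1]]
```

With three brokers this saves one restart round; on large clusters with rack-aware or replica-disjoint groups the saving compounds, which is the motivation behind the linked issue.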
Member


In general Slack is not really ideal for keeping details of problems in the long term. Better to create an issue, which can be discovered more easily by anyone who faces a similar problem.

Updated the text on the diagram

Signed-off-by: Gantigmaa Selenge <[email protected]>
Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge (Contributor Author)

Hi @tombentley @scholzj @ppatierno, do you have any further comments on this proposal?

@tinaselenge tinaselenge marked this pull request as draft December 19, 2024 13:18
@tinaselenge (Contributor Author)

Converting this PR to Draft, as I'm currently rewriting the proposal to make it more focused on the proposed solutions, rather than implementation details. However, I'm not changing the core of the proposal. I'm hoping that the upcoming update will make it easier to review.

@rotem-human

Any updates on this?

Removed the implementation details such as the algorithm. This will be included in the draft PR for the POC instead.

Signed-off-by: Gantigmaa Selenge <[email protected]>
@tinaselenge (Contributor Author)

Hi @strimzi/maintainers. I have made some updates to the proposal recently to hopefully make it easier to review. The proposal now focuses on the design only, and all the implementation details are in the linked POC PR. You can build and run the POC locally to see how it works.

Before I invest more time and effort into this, I would like to get a clear decision on the next steps for this proposal since there have been some mixed opinions.

  • Do you see this proposal being accepted and implemented now or in the future? Or should we close this proposal for now?
  • Are there any maintainers who would be willing to support this proposal through the implementation if it does get accepted?

I also have a question for community users and other contributors:
If we were to go ahead with this proposal, are there community users and contributors who are willing to help with the implementation, or able to test this feature in their environments while it's still behind a feature gate? I think testing it in their environments and providing feedback would be the most important contribution for this feature. This may also weigh into the maintainers' decision. So please let us know if you are interested in, or need, this feature for your use case.

@katheris (Contributor)

  • Do you see this proposal being accepted and implemented now or in the future? Or should we close this proposal for now?
  • Is there a maintainer/s who would be willing to support this proposal through the implementation if it does get accepted?

My stance on this hasn't changed. I believe this is a useful feature, both in terms of the benefits to users around batching the rolling updates for large clusters, and in terms of improving the KafkaRoller code to make it easier to understand. I don't personally have the time to work on the implementation of this, but am willing to put myself forward as a maintainer who will prioritise reviewing the changes and understanding the code for future maintenance once it's in Strimzi.

I'm also interested to hear what other maintainers and users think about the usefulness of this feature, so let's also add it to the Strimzi community call agenda for later today (17th April).

@tinaselenge tinaselenge marked this pull request as ready for review April 17, 2025 10:14
@ppatierno (Member) commented May 6, 2025

In the three weeks since the community call where we discussed this, there have been no additional reviews or comments here (apart from Kate's, which came before the community call).
I had a quick look at the proposal, mostly at the first part describing the known issues affecting the current KafkaRoller.
My current feeling is:

  • Despite some known issues, I don't see many community users complaining about them. Are they really issues, or are we overthinking here?
  • The additional features it can bring (rolling brokers in batches, slowing down the rolling update) are not being strongly pushed for by the community either. So even in this case, how important or requested are they?

The two points above would be the big benefits of a new KafkaRoller, but without a real need, I have concerns about whether the testing effort and the work to get it production-ready outweigh the advantages.
From my PoV, we should have community users showing interest in it. For example, the referenced issue #8547 was opened by @yyang48, and it would be interesting to know if there is still interest and willingness to help with it.

That said, from the community call it seems that we have got:

  • @tinaselenge to work on the implementation
  • @katheris as the core maintainer willing to help and review the changes
  • @im-konge helping with testing it

@im-konge (Member) left a comment


Thanks a lot for the proposal, I went through it and it LGTM (I left some minor comments). I would like to go through the PoC and possibly try it to see how it works. In terms of testing, we should have a look at writing more STs now, as IIRC there are just a few for the old roller and it would be good to have them improved. Also, we discussed with @see-quick that it would maybe be beneficial to look at performance tests in the future (but it's not something urgent right now).

If the current code is too complex and the new KafkaRoller would make the code clearer + other code changes or new features would be much simpler to implement, it's good to have it there FMPOV.

Also, you mentioned various things that will be implemented -> from my understanding you expect and want to have them all in place with the initial implementation, right? Or do you want to do a 1:1 KafkaRoller in comparison with the old one and then add the new features?

| - | - | UNKNOWN
| Pod is not Running | - | NOT_RUNNING
| Pod is Running but lacking Ready status | Broker state != 2 | NOT_READY
| Pod is Running but lacking Ready stats | Broker state == 2 | RECOVERING
Member


Suggested change
| Pod is Running but lacking Ready stats | Broker state == 2 | RECOVERING
| Pod is Running but lacking Ready status | Broker state == 2 | RECOVERING
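The classification table quoted above could be sketched as a simple decision function. The enum, constant, and function names below are assumed for illustration (not the proposal's actual API); the only externally grounded detail is that Kafka broker state 2 means the broker is performing log recovery.

```python
# Sketch of the node-state classification from the table above.
from enum import Enum

class NodeState(Enum):
    UNKNOWN = 1      # no pod information available
    NOT_RUNNING = 2  # pod exists but is not Running
    NOT_READY = 3    # pod Running, not Ready, broker not in recovery
    RECOVERING = 4   # pod Running, not Ready, broker state == 2
    READY = 5        # happy path; not shown in the excerpt above (assumed)

BROKER_STATE_RECOVERY = 2  # Kafka broker state for log recovery

def classify(pod_running, pod_ready, broker_state=None):
    """pod_running=None models the '- | -' row: no pod info at all."""
    if pod_running is None:
        return NodeState.UNKNOWN
    if not pod_running:
        return NodeState.NOT_RUNNING
    if not pod_ready:
        return (NodeState.RECOVERING
                if broker_state == BROKER_STATE_RECOVERY
                else NodeState.NOT_READY)
    return NodeState.READY
```

Keeping the classification in one pure function like this is part of what makes the new roller easier to test than the old one, per the review comments above.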

Comment on lines +188 to +189
The new KafkaRoller introduced by this proposal will be used only for KRaft-based clusters.
This proposal should have no impact on any existing Kafka clusters deployed with ZooKeeper.
Member


Because Strimzi doesn't support ZooKeeper anymore, are these two lines needed? Just asking out of curiosity, as I know that the proposal was written at a time when we still supported both.


| Phase | Strimzi versions | Default state |
|:------|:-----------------------|:-------------------------------------------------------|
| Alpha | 0.46, ? | Disabled by default |
Member


We should change it to 0.47/0.48 or something. It's just a reminder.
