
#2168 A brand-new WatchKube provider for Kubernetes service discovery #2174


Merged
merged 15 commits into from
Jun 16, 2025

Conversation

kick2nick
Contributor

Closes #2168

Proposed Changes

  • A WatchKube service discovery provider that uses Kubernetes watch API requests for efficient detection of endpoint changes.

It's a proof of concept that needs improvements (see comments) and tests.
Tested locally; it works as expected, but only with older versions of KubeClient. Newer versions can be used once tintoy/dotnet-kube-client#163 is fixed.

P.S. Sorry for the mess with #2173. I had no intention of forcing a downgrade of KubeClient; I will resume after the fix.

@kick2nick kick2nick changed the title Added 'WatchKube' discovery provider #2168 WatchKube discovery provider Oct 14, 2024
@kick2nick kick2nick marked this pull request as draft October 14, 2024 18:17
@raman-m raman-m added proposal Proposal for a new functionality in Ocelot Service Discovery Ocelot feature: Service Discovery Kubernetes Service discovery by Kubernetes labels Oct 15, 2024
@raman-m
Member

raman-m commented Oct 16, 2024

I've rebased the branch. Please don't push❗
But if you want to commit more, you have to delete the branch, fetch everything, and check out the branch locally again 👉

  • git switch develop
  • git branch -D 2168-watchkube
  • git fetch --all
  • git pull
  • git checkout 2168-watchkube

And you will be able to contribute more 😉

@kick2nick
Contributor Author

Hi @raman-m, from my point of view, everything required is done to move to code review. Or am I missing something?

@raman-m
Member

raman-m commented Oct 24, 2024

The issue found at tintoy/dotnet-kube-client#163 was resolved by tintoy/dotnet-kube-client#164 and included in the release v2.5.12. However, this release has not been marked as the latest on the repository's releases page, where v2.5.10 is still listed as the most recent version. Could you please explain why v2.5.10 is considered the latest release?

@raman-m
Member

raman-m commented Oct 24, 2024

from my point of view all required stuff is done for moving to code review. Or am I missing something?

If you're confident and have completed the main development phase, can't you press the "Ready for Review" button?

@kick2nick
Contributor Author

Could you please explain why v2.5.10 is considered the latest release?

I have no idea; let's wait for the author's answer to your question.

@kick2nick kick2nick marked this pull request as ready for review October 24, 2024 08:42
@raman-m
Member

raman-m commented Oct 24, 2024

Hello, Nikolay!

While it's true that the PR seems partially ready for review following the closure of tintoy/dotnet-kube-client#163 and its subsequent release of v2.5.12, I am concerned that reviewing this PR may be premature due to the following dependencies:

  1. This PR relies on #2180 (Enable validation of DI scopes in Kube and PollKube discovery providers, addressing #977), which needs to be released beforehand as part of the next 23.3.6 patch.
  2. My plan is to include it in the November'24 milestone, which will prioritize the .NET 9 upgrade and migration to the net9.0 framework, with a lower priority assigned to introducing new features.

@raman-m raman-m added the Spring'25 Spring 2025 release label Oct 24, 2024
@raman-m raman-m added this to the November'24 milestone Oct 24, 2024
Member

@raman-m raman-m left a comment


Testing the new WatchKube type in ShouldReturnServicesFromK8s is necessary, but it is not sufficient for completing acceptance testing.

While it is well covered by unit tests, there are no new acceptance tests for the proposed feature. Please review the scenarios of the current Kube and PollKube providers for inspiration. If not, I will share my ideas later, in 1 to 2 weeks, after the release of the v23.3.6 patch.

@kick2nick
Contributor Author

This PR relies on #2180, which needs to be released beforehand as a part of the next 23.3.6 patch.

Not quite "relies"; rather, conflicts will appear on rebase. I'll resolve them after #2180 is merged.

My plan is to include it in the November'24 milestone, which will prioritize the .NET 9 upgrade and migration to the net9.0 framework, with a lower priority assigned to introducing new features.

No problem, I'll be in touch when you get back to this.

@kick2nick
Contributor Author

@raman-m commented on June 1

Acceptance
Converted the tests to theories and created a new test that covers the case of a service endpoint changing during concurrent client requests.

Unit tests
All points are fixed.

Collaborator

@tintoy tintoy left a comment


This looks good to me; minor question about the .Catch() operator, but not a blocker.

@raman-m
Member

raman-m commented Jun 5, 2025

@tintoy approved these changes on June 4, 2025

Thank you, Adam, for reviewing the code!


@ggnaegi Gui, any suggestions?
I know you use the Consul provider rather than the Kubernetes one, but since this is a new type of provider, I believe you should at least take a look at this important PR.

@raman-m
Member

raman-m commented Jun 5, 2025

Development Complete ✅

  • Unit tests ✔️
  • Acceptance tests ✔️
  • Documentation ☑️

@kick2nick commented on June 4

Thank you, Nikolay, for professional development and quick feedback!

This PR is ready to merge, but before approving it, I will conduct a final review over the next few days, until Monday. We will write a more advanced section in the K8s docs, as the current version seems too brief to present the feature effectively. Also, I believe the "WatchKube provider" section could be longer, including real-world code snippets or even a sample.

Could you attach some configurations, instructions, or code blocks from your production environment, please? My idea is as follows: at a minimum, we should compare the new provider type with the current ones, Kube and PollKube. Any advantages and disadvantages (aka pros and cons) could be outlined, perhaps in a comparison table.

Finally, I am curious about the production environment where this new provider will be utilized. As a team, we are interested in beta testing after deployment to production, and we will be ready to apply hotfixes if any bugs are found.
Could you describe your production technical stack, please? Does it include monitoring and logging to observe the new provider's behavior?

Regarding delivery... How urgent is this feature for you and your team? Would you like to keep it in the Summer'25 milestone, scheduled for release in September? I believe we need a quicker timeline, and I am ready to include it in the current Spring'25 milestone, which contains 6 (six) features and is set for release in June, in a couple of weeks.

P.S. If super-urgent beta testing is needed, I am ready to make an unofficial release and upload a new beta-version package to NuGet.

@raman-m raman-m self-requested a review June 5, 2025 12:05
@kick2nick
Contributor Author

@raman-m commented #2174 (comment)

Could you attach some configurations, instructions, or code blocks from your production environment, please?

I'm not sure what you mean; it's configured in the standard way, with a ServiceName property per Route and the discovery provider in GlobalConfiguration:
"ServiceDiscoveryProvider": { "Namespace": "dev", "Type": "WatchKube" }
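Expanded into a fuller ocelot.json, that configuration might look like this; the route paths and service name are placeholders, not taken from the environment described above:

```json
{
  "Routes": [
    {
      "UpstreamPathTemplate": "/orders/{everything}",
      "DownstreamPathTemplate": "/{everything}",
      "DownstreamScheme": "http",
      "ServiceName": "orders-api"
    }
  ],
  "GlobalConfiguration": {
    "ServiceDiscoveryProvider": {
      "Namespace": "dev",
      "Type": "WatchKube"
    }
  }
}
```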

Any advantages and disadvantages (aka pros and cons) could be outlined, perhaps in a comparison table.

Sounds good, we could start from this:

| | Kube | PollKube | WatchKube |
| --- | --- | --- | --- |
| Extra latency | One hop per route | - | - |
| Speed of response to endpoint changes | High | Low* | High |
| Pressure on the Kube API | High | Low* | Low |

\* depends on the poll interval

Could you describe your production technical stack, please? Does it include monitoring and logging to observe the new provider's behavior?

Yep, nothing special: we're hosting Ocelot in k8s (24.0, .NET 9, Ubuntu) with several services behind it. We use rolling-update deployments, so services (and their corresponding endpoints) are replaced one by one. Yes, we have observability.

Regarding delivery... How urgent is this feature for you and your team?

This isn't urgent at all; currently we're using the WatchKube mechanism as a custom provider. We will just switch to the new provider when it's released.

Would you like to keep it in the Summer'25 milestone, scheduled for release in September?

It's up to you. I'll need up to a week to conduct tests with the new provider once the package is ready.

@raman-m
Member

raman-m commented Jun 9, 2025

Thank you for the preliminary comparison table!
I need to concentrate on the current release of the Spring'25 milestone, which must be finalized this month, so a significant amount of work remains. However, I will endeavor to include this PR in the current release rather than postponing to the Summer'25 release.

@raman-m raman-m added Spring'25 Spring 2025 release and removed Summer'25 Summer 2025 release labels Jun 9, 2025
@raman-m raman-m modified the milestones: Summer'25, Spring'25 Jun 9, 2025
Member

@raman-m raman-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me, but let's talk a bit more and then merge this PR.

I'm going to push one commit with my review, which will constitute my approval.

Comment on lines 12 to 13
internal const int FailedSubscriptionRetrySeconds = 5;
internal const int FirstResultsFetchingTimeoutSeconds = 3;
Member


Why 5? Why 3? Why internal?
How will the user be able to configure this?

Contributor Author

@kick2nick kick2nick Jun 13, 2025


Made them internal for reference from the tests (via the InternalsVisibleTo attribute). The numbers are made up =) I didn't want to complicate the configuration. We can move them to configuration, but I think FailedSubscriptionRetrySeconds should stay a const, because KubeClient already uses retries; our retry was added for truly unexpected errors.
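For reference, exposing internals to a test project is done with the InternalsVisibleTo attribute; the assembly name below is illustrative, not necessarily the one used in this PR:

```csharp
// e.g. in AssemblyInfo.cs (or via an <InternalsVisibleTo> item in the .csproj)
using System.Runtime.CompilerServices;

[assembly: InternalsVisibleTo("Ocelot.UnitTests")] // hypothetical test-assembly name
```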

Member

@raman-m raman-m Jun 15, 2025


I have no objection to the retry logic. However, my concern lies with the unconfigurable numbers. I believe the configuration design can be deferred for now and introduced at a later stage, in a separate pull request.
An acceptable compromise would be to enable configuration directly in C# code.

Next step 👇

Therefore, I propose converting the internal constants into public static properties, allowing developers to set custom values for advanced scenarios directly in C# code, given that configuration properties in ocelot.json (the global ServiceDiscoveryProvider section) are not implemented yet.

The Q ❔

What I consider inefficient and worth discussing further is the Seconds dimension. Why not Milliseconds? As you may know, some deployed environments operate at the millisecond level — 10, 100, or 500 ms — clearly less than one second. For now, I propose decreasing the default values to one second as the minimum for definitions, but I believe that for highly rapid systems operating in milliseconds, one second won't significantly impact response time.
Ultimately, I would suggest that waiting 5 seconds between retries is excessive 😉
Do you have any use cases from your project, Kubernetes with the Ocelot environment in production, based on data from monitoring or traced logs?

2nd Q

The 2nd question: do the two properties depend on each other? It seems the second property, fetching the first results, is merely a delay in obtaining the initial result from the Observable layer, correct? This delay does not appear to be related to the retry logic, is that right?
If they are interdependent, we have to think carefully about their values in seconds, avoiding any disruptions.

Member

@raman-m raman-m Jun 15, 2025


So, my proposal is this commit → 061f018
However, I am uncertain about the following:

  • Returning an int type that represents the value of the TimeSpan.FromSeconds(double) argument: that method takes a double parameter
  • The default values of 5 and 3 seconds for the backing fields should be reduced, for instance, they could be set to 1
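As a rough sketch (not the actual code from commit 061f018), the proposed conversion might look like the following; the class name and the clamping of the setters are assumptions for illustration:

```csharp
using System;

// Hypothetical shape of the constants converted to public static properties,
// tunable from C# at startup since ocelot.json configuration is not implemented yet.
public static class WatchKubeDefaults
{
    private static int _failedSubscriptionRetrySeconds = 1;
    private static int _firstResultsFetchingTimeoutSeconds = 1;

    public static int FailedSubscriptionRetrySeconds
    {
        get => _failedSubscriptionRetrySeconds;
        set => _failedSubscriptionRetrySeconds = Math.Max(1, value); // floor of 1s, per the proposal
    }

    public static int FirstResultsFetchingTimeoutSeconds
    {
        get => _firstResultsFetchingTimeoutSeconds;
        set => _firstResultsFetchingTimeoutSeconds = Math.Max(1, value);
    }
}
```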

Contributor Author


Therefore, I propose converting the internal constants into public static properties, allowing developers to set custom values for advanced user scenarios directly in C# code

Reasonable.

What I consider inefficient and worth discussing further is the Seconds dimension. Why not Milliseconds?

10 ms for retries could create high pressure on k8s =) 10 ms = 100 rps, and that's 100 rps × downstream services count × Ocelot instances.
The first-results timeout is the time a client can wait before the first results are fetched; it can vary from a couple of milliseconds up to the defined timeout, so it's a good candidate for configuration. I will try to find some numbers from our prod env.

Do you have any use cases from your project, Kubernetes with the Ocelot environment in production, based on data from monitoring or traced logs

No. As I said, KubeClient already retries the subscription in case of network failures. You can think of the retry logic introduced in this PR as "try -> catch -> retry all the nested stuff inside KubeClient, whatever it was under the hood". This "whatever" could be a bug or anything else in KubeClient's logic.
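That "try -> catch -> retry" idea can be sketched with System.Reactive roughly as follows; `createWatch` is a stand-in for whatever KubeClient call opens the watch, not the provider's actual method:

```csharp
using System;
using System.Reactive.Linq;

static class WatchRetry
{
    // Resubscribe after a fixed delay whenever the inner watch observable fails
    // for any reason (a KubeClient bug, an unexpected error, etc.).
    public static IObservable<T> WithRetry<T>(Func<IObservable<T>> createWatch, TimeSpan retryDelay) =>
        Observable.Defer(createWatch)
            .Catch((Exception _) =>
                WithRetry(createWatch, retryDelay).DelaySubscription(retryDelay));
}
```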

This delay does not appear to be related to the retry logic, is that right?

Right, there is no dependency between the properties. The first-results fetch is an important part, since the discovery provider is instantiated during the first call.

So, my proposal is this commit.

Agree with your proposal.

Member


FYI, I have assigned default values (both 1 second) in commit 28a3afe 💡

Member


10 ms for retries could create high pressure on k8s =) 10 ms = 100 rps, and that's 100 rps × downstream services count × Ocelot instances.
The first-results timeout is the time a client can wait before the first results are fetched; it can vary from a couple of milliseconds up to the defined timeout, so it's a good candidate for configuration. I will try to find some numbers from our prod env.

It appears this analysis is taking some time, and I am unable to wait any longer: the PR is ready to merge.
If there are further proposals for improvement, we will incorporate them into subsequent PRs, OK?

@raman-m raman-m changed the title #2168 WatchKube discovery provider #2168 A brand-new WatchKube provider for Kubernetes service discovery Jun 16, 2025
Member

@raman-m raman-m left a comment


Ready for delivery ✅

  • Code reviews ✔️✔️
  • Unit testing ✔️
  • Acceptance testing ✔️
  • Updated docs ✔️

@raman-m raman-m merged commit 8143d56 into ThreeMammals:develop Jun 16, 2025
4 checks passed
@raman-m
Member

raman-m commented Jun 16, 2025

@kick2nick Congrats! 🥳 It will be released in version 24.1.
FYI, prior to the release, you may review the updated docs here 👉 WatchKube provider | develop

Successfully merging this pull request may close these issues.

Watch request fails with InvalidOperationException