Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using coredns daemonset instead of nodelocal dns #594

Closed
dudicoco opened this issue Jul 25, 2023 · 28 comments
Closed

Using coredns daemonset instead of nodelocal dns #594

dudicoco opened this issue Jul 25, 2023 · 28 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dudicoco
Copy link

The current recommended DNS architecture solution within a cluster includes NodeLocal DNS + CoreDNS deployment + DNS autoscaler.

To me it would seem preferable to use a much simpler solution - run CoreDNS as a daemonset.

Is there a downside to such a solution? Why is the recommended solution include a more complex architecture?

See also coredns/helm#86 (comment)

@johnbelamaric
Copy link
Member

There are a number of reasons.

When running as cluster DNS, CoreDNS is configured with the Kubernetes plugin. This puts a watch on all EndpointSlices and Services (and other things, depending on your config). This means a persistent connection to the API server for each instance of CoreDNS, and the API server sending watch events down that channel for any changes to those resources. For clusters with thousands of nodes, that would put a substantial burden on the API server.

NodeLocalDNS, on the other hand, is only a cache and a stub resolver. It does not put a watch on the API server. This makes it much less of a burden on the API server, and also makes it a much smaller process since it does not need to use memory to hold those API resources.

NodeLocalDNS also solves a second problem. Early versions of Kubernetes would sometimes have failures due to the conntrack table filling up. This was found to be because UDP entries need to age out of the conntrack table, so a burst of DNS traffic could fill that table up (I seem to recall some kernel bugs may have also been involved, but this is several years ago). NodeLocalDNS turns off connection tracking for UDP traffic to the node local DNS IP address, and it upgrades requests made to cluster DNS from UDP to TCP. TCP is not subject to this issue since entries can be removed when the connection is closed.

Finally, even if we did use a DaemonSet, it wouldn't work the way you would hope. There is no guarantee that requests from a client would talk to the local CoreDNS instance. In fact, at the time NodeLocalDNS was created, it would be rare, because the local node would have no higher weight in the kube-proxy based load balancing. So if you had 1000 instances of CoreDNS, only 1/1000 would go to your local CoreDNS instance. I am not sure if that has changed, there has been some work on more topology-aware services. But I am not sure how far it has progressed - you would have to check with SIG Network.

@dudicoco
Copy link
Author

Thanks for the info @johnbelamaric.

  1. Regarding the API server connections, I have addressed that in CoreDNS currently only supports Deployment mode, but would you consider supporting Daemonset mode as well coredns/helm#86 (comment) - other daemonsets also perform API calls - kube-proxy, CNI plugins, log collectors etc.

  2. Regarding the conntrack issue - can't we turn off connection traffic for a coredns daemonset?

  3. Regarding directing requests from the client to the local coredns instance - this is now possible with internal traffic policy, but in any case this problem would be present with nodelocaldns as well, which could negate its benefits.
    One possible issue that could occur when using coredns as a daemonset with internal traffic policy is that until the coredns pod is ready no DNS requests could be made by other pods on that node.

@johnbelamaric
Copy link
Member

  1. Correct. But in general they limit the scope of what they are querying for to the things local to that node. For example, kubelet, kube-proxy, etc. do not monitor all pods and endpoints across the cluster, but instead only those assigned to their node.
  2. Possibly. It's not controlled in that way, it's done through iptables rules IIRC. So you would need to do some magic but it's theoretically possible.
  3. No, Node Local DNS changes the way pods running on the node do their DNS so that it goes to the local cache. It is not subject to this issue (it is not accessed via kube-proxy based load balancing rules).

@dudicoco
Copy link
Author

dudicoco commented Jul 28, 2023

  1. What is the measured impact of a coredns daemonset querying pods on the API server? I know that Zalando is using a coredns daemonset and i'm pretty sure they're running at scale.
  2. I believe that using coredns as a daemonset might result in not experiencing the conntrack issue without working around it since each pod will get a much lower amount of requests than when using a coredns deployment.
  3. It is also possible to have clients bypass the kube-proxy service and send requests to the local coredns pod by using the downward API:
- name: HOST_IP
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.hostIP

@dpasiukevich
Copy link
Member

  1. What is the measured impact of a coredns daemonset querying pods on the API server? I know that Zalando is using a coredns daemonset and i'm pretty sure they're running at scale.
  2. I believe that using coredns as a daemonset might result in not experiencing the conntrack issue without working around it since each pod will get a much lower amount of requests than when using a coredns deployment.
  3. It is also possible to have clients bypass the kube-proxy service and send requests to the local coredns pod by using the downward API:
- name: HOST_IP
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.hostIP
  1. As @johnbelamaric mentioned, it would be the linear dependency. In the CoreDNS daemonset each CoreDNS instance would initialize watchers for EndpointSlices, Services and ConfigMaps.
    Overall effect would be defined: how frequently these objects change in your cluster, multiplied by N (num of nodes).

  2. Nodelocaldns uses TCP to talk to clusterDNS pods thus it's not affected by the conntrack issue that much vs UDP.

Also keep in mind that with CoreDNS daemonset there will be no guarantee that client pod would talk to local CoreDNS pod on the same node.
/etc/resolv.conf points to the kube-dns Service so the traffic would go to the any pod in the cluster.
Plus, as the default DNS protocol is UDP, as client will communicate with any CoreDNS pod in the cluster, the conntrack exhaustion issue will reappear in such setup.

Whereas with nodelocaldns (with the iptables rules) the client is guranteed to talk to the local NLD pod on the same node.

  1. This should work, but I personally see this as a non-elegant solution as you'd have to define and keep this override for all pods in your cluster.
    Plus there may be unexplored consistency problems that HOST_IP points to the right IP all the time (e.g. some redeploys and status changes may cause brief unexpected outages).

@dudicoco
Copy link
Author

dudicoco commented Aug 1, 2023

@dpasiukevich

  1. It's still not clear to me if the possible strain on the API server was tested, did the relevant group in the kubernetes project perform tests on a coredns daemonset and found it to produce a load on the API server at scale? Or is it just speculated?
  2. Why can't the same solution be applied to coredns? We could have an ip tables rule to direct DNS traffic to the local coredns pod (this also negates the need for a downward API based solution).

@dpasiukevich
Copy link
Member

  1. That's just the estimation. I don't expect there were any scalability benchmarks to see the API server performance and requested resources depending on the size of daemonset and frequency/size of Service/EndpointSlice object changes.
  2. It definitely can be done. And it's definitely a good optimisation in certain cases at the cost of more DIY.

@dudicoco
Copy link
Author

dudicoco commented Aug 1, 2023

  1. That's just the estimation. I don't expect there were any scalability benchmarks to see the API server performance and requested resources depending on the size of daemonset and frequency/size of Service/EndpointSlice object changes.
  2. It definitely can be done. And it's definitely a good optimisation in certain cases at the cost of more DIY.

Why would it require more DIY? Couldn't it be implemented into coredns directly?

Another idea: have a nodelocaldns container and a coredns sidecar container in the same pod and direct traffic from nodelocaldns to coredns via localhost, this would simplify the architecture while preserving the benefits of nodelocaldns without requiring new features or code changes.
A possible issue would be if the nodelocaldns container starts before the coredns container, in that case the DNS resolution would fail. I assume this can be solved by having nodelocal wait for coredns to be available.

@dudicoco
Copy link
Author

@johnbelamaric @dpasiukevich any updates?

@chrisohaver
Copy link
Contributor

It's still not clear to me if the possible strain on the API server was tested, did the relevant group in the kubernetes project perform tests on a coredns daemonset and found it to produce a load on the API server at scale? Or is it just speculated?

Yes. It doesn't scale.

@johnbelamaric
Copy link
Member

By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.

By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use and extra 500MB on every node to cache the entire cluster worth of services and headless end points, that is 5,000 GB of RAM. It's expensive. Much better to just have the DNS node local cache with only a small DNS cache needed for the workloads on that node that takes say < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.

@dudicoco
Copy link
Author

By the way, node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.

What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.
This is what zelando are doing but with dnsmasq instead of nodelocaldns, according to them it performs better.

By the way, it's not just the API server that is the issue. It's a simple matter of cost efficiency. Imagine a 10,000 node cluster. If you want to use and extra 500MB on every node to cache the entire cluster worth of services and headless end points, that is 5,000 GB of RAM. It's expensive. Much better to just have the DNS node local cache with only a small DNS cache needed for the workloads on that node that takes say < 50MB per node. @prameshj did a very detailed set of analyses before implementing this.

Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50mb of memory.
We can assume that if it would run as a daemonset it would consume even less memory since there would be much less load on each pod.

@chrisohaver
Copy link
Contributor

What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.

That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes enabled nodelocaldns on each node by itself. Of course with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.

Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50mb of memory.

The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large scale cluster, and thus dos not have a large number of services and endpoints.

We can assume that if it would run as a daemonset it would consume even less memory since there would be much less load on each pod.

That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.

@dudicoco
Copy link
Author

What I wrote was that the coredns container should be co-located on the same pod as nodelocaldns in order to avoid the extra infrastructure complexity.

That adds more infrastructure and complexity. For the sake of argument, it would be simpler and result in less overhead to compile the kubernetes plugin into nodelocaldns, and just run a kubernetes enabled nodelocaldns on each node by itself. Of course with the kubernetes plugin in use, each instance of nodelocaldns would then require more memory (as much as CoreDNS uses). So it is still significantly more resource-expensive than the current solution.

I don't think it is more complex than nodelocaldns daemonset + coredns deployment + dns autoscaler, however using just nodelocaldns with the kubernetes plugin would be preferable, i'm not sure how it would deal with non-cached responses in that case though?

Looking at the metrics from our cluster over the course of the last week, coredns did not consume more than 50mb of memory.

The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster. 50MB would suggest your cluster is not a large scale cluster, and thus dos not have a large number of services and endpoints.

What is considered a large cluster? There is no info on number of services/endpoints within https://kubernetes.io/docs/setup/best-practices/cluster-large/.

We are running ~500 services and ~500 endpoints.

We can assume that if it would run as a daemonset it would consume even less memory since there would be much less load on each pod.

That would not be the case. The minimum amount of memory coredns uses is linearly related to the number of services and endpoints in the cluster - not related to the query load.

Thanks for the clarification.

@chrisohaver
Copy link
Contributor

chrisohaver commented Sep 19, 2023

What is considered a large cluster? ... We are running ~500 services and ~500 endpoints.

Per the link 150000 Pods per cluster. Each pod can have multiple services and endpoints.

@vaskozl
Copy link

vaskozl commented Jan 10, 2024

I expect endpoint churn (per unit time) to be a more useful number than absolute number of endpoints.

There's nothing stoping one from using a DaemonSet with maxSurge=1 and maxUnavailable=0 together with internalTrafficPolicy: Local with the vanilla coredns image.

The suggested way to autoscale coredns is proportional to cluster size, exactly the same as scaling with a daemonset, except with a configurable coresPerReplica rather than coresPerReplica being equal to the number of cores per machine.

The suggested config in the doc is "coresPerReplica":256,"nodesPerReplica":16 which is also "linear" any which way you look at it, the advantage is that you can chose K and run a fraction of coreDNS pods when you have small nodes. At worst the DaemonSet method results in 16x the load on the API.

As such I see no good argument to not run CoreDNS as a Daemonset.

On the contrary I can think of quite a few advantages to the Daemonset approach:

  • no iptables/ipvs kube-proxy or other CNI caveats due to the NOTRACK rules
  • simple scaling on vanilla clusters with no cluster-proportional-autoscaler deployed
  • quicker DNS resolution without needing second node-local DNS config with hardcoded service IPs
  • easier to debug and reason about when encountering issues

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2024
@zengyuxing007
Copy link

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2024
@zengyuxing007
Copy link

any updates?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 25, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 24, 2024
@LeoQuote
Copy link

LeoQuote commented Sep 13, 2024

Hello everyone , I just drafted a pull request to show how to use coredns and cilium to implement nodelocal dns, I've tested it and it worked, without duplicated __PILLAR__ variables, like @vaskozl mentioned. please see the pr linked above to get more info.

Also I noticed that @johnbelamaric said

node local DNS is just a custom build of coredns with minimal plugins and with a little glue to update the iptables. So, effectively, nodelocaldns is what you are saying. It just doesn't run the k8s plugin.

In that case, if I'm using cilium and bpf to rewrite requests, I can use coredns instead, or there's any hidden pits that I'm not aware of?

@johnbelamaric
Copy link
Member

If you are getting your requests directed locally by cilium/bpf instead of the iptables rules that NodeLocalDNS installs, then yeah, running coredns should be OK. The other thing it does is turn off connection tracking for those requests, so that you don't run into the conntrack overflow issues we have seen in the past. Does your solution handle that? There were some older kernel bugs that this also helped avoid, IIRC - not sure the status of those.

As discussed above, I still would not use the standard K8s DNS Corefile though - I would create a custom one that just enables cache and maybe stub domains for this. Definitely not the K8s plugin, especially if you have a large cluster. I don't recall if NodeLocalDNS has some special stub domain support or not, where it reads stub domain definitions from the api server. That tickles a memory but it's been a long time since I looked.

Also, the NodeLocalDNS build of CoreDNS is stripped down to have as small a memory footprint as possible. If you use the standard CoreDNS, it will take more memory than the node local one.

Of course, you could build your own, minimal CoreDNS instance for, too.

@LeoQuote
Copy link

LeoQuote commented Sep 14, 2024

Thanks for your reply, here's the advantage of node-cache/NodeLocalDNS I summarized:

  1. embedded iptables modification to make it work without any external modification.
  2. turn off connection tracking to fix conntrack issues which affect only kernel older than 4.19
  3. disable k8s plugin to minimize impact to API server
  4. stripped down to have minimal memory footprint

I think I can test or handle those issues by:

  1. cilium can handle network
  2. I can upgrade kernel to avoid this issue
  3. yes, definitely disabled
  4. I'll test for the actual memory usage

I'll do more work to see which solution to adopt.

@johnbelamaric
Copy link
Member

turn off connection tracking to fix conntrack issues which affect only kernel older than kubernetes/kubernetes#56903 (comment)

No, I think conntrack can fill up with any kernel version. The issue is that since UDP is connectionless, conntrack entries are expunged by timeout rather than a connection closing. AIUI that issue is unrelated to the kernel bugs which caused problems a few years ago.

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 26, 2024
@gecube
Copy link

gecube commented Nov 25, 2024

Hi! I'd like to relieve the issue. As I mentioned in coredns/helm#86 - I have ONE proper use case for DaemonSet. When you have a bare metal (or VM) self managed cluster and you don't want to use HPA or anything similar and want to deploy CoreDNS on control plane nodes... so daemon set shines in such a case. Yes, I understand it is probably not more than 5% of all users of k8s... but why should we ignore them? Anyway it won't create a breaking changes and it would be relatively easy to maintain the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

No branches or pull requests

10 participants