
Contour leader doesn't update endpoints in xDS cache after upstream pods recreation #6743

Open
philimonoff opened this issue Oct 28, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor.

Comments

@philimonoff

What steps did you take and what happened:

  1. There are about 8000 HTTPProxy objects with the same ingress class.
  2. There are two Contour pods (leader and replica) and four Envoy pods (a DaemonSet).
  3. We recreate the pods of an application that are upstreams of the corresponding Envoy cluster.
  4. After these pods are recreated, the Contour replica reports the IP addresses of the new pods as endpoints for this Envoy cluster in EDS (via contour cli).
  5. The Contour leader still reports the IP addresses of the old (deleted) pods as endpoints for this Envoy cluster in EDS (via contour cli).
  6. Envoy pods connected to the Contour leader return 503 errors for requests to the corresponding hosts.
  7. Envoy pods connected to the Contour replica serve requests correctly.
  8. Recreating the Contour pods fixes the problem for a while.
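Steps 4 and 5 amount to diffing the endpoint addresses each Contour pod reports for the cluster. A minimal sketch of that check (the helper name and the address lists are hypothetical; in practice the lists would be extracted from the EDS output of contour cli against each pod):

```python
def stale_endpoints(leader_addrs, replica_addrs):
    """Return addresses the leader still serves that the replica no longer does.

    These are the stale (deleted) pod IPs described in steps 4-5.
    """
    return sorted(set(leader_addrs) - set(replica_addrs))


if __name__ == "__main__":
    # Illustrative values: old pod IPs held by the leader,
    # new pod IPs reported by the replica.
    leader = ["10.0.1.5:8080", "10.0.1.6:8080"]
    replica = ["10.0.2.9:8080", "10.0.2.10:8080"]
    print(stale_endpoints(leader, replica))
    # → ['10.0.1.5:8080', '10.0.1.6:8080']
```

A non-empty result against a live cluster would reproduce the mismatch reported here.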

What did you expect to happen:

The leader pod updates its state after the application's pods are recreated.

Anything else you would like to add:

Environment:

  • Contour version: 1.29.2
  • Kubernetes version: (use kubectl version): 1.25.16, 1.27.16
  • Kubernetes installer & version: kops 1.26.5, kubeadm 1.27.16
  • Cloud provider or hardware configuration: AWS, Openstack
  • OS (e.g. from /etc/os-release): Ubuntu 24.04 LTS
@philimonoff philimonoff added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Oct 28, 2024

@tsaarni
Member

tsaarni commented Oct 28, 2024

Hi @philimonoff, I haven’t tried to reproduce this yet, but I wanted to ask: does the issue depend on having a large number of HTTPProxies, or have you observed it with fewer (or even a single) HTTPProxy as well?

@philimonoff
Author

@tsaarni thank you for your quick response. We don't see this on small installations. I can't say exactly how many proxies trigger it, but it happens occasionally on an installation with 5000 proxies. With 8000 or more, it's a consistent pattern.

@tsaarni
Member

tsaarni commented Oct 28, 2024

@philimonoff Could this be due to rate limiting? The API server client library has limits on requests, which can cause significant delays when a large number of resources change simultaneously. You could try adjusting these parameters on the contour serve command in the Contour deployment to see if it helps: --kubernetes-client-qps=<qps> and --kubernetes-client-burst=<burst>. Use large values, such as 100 or more, to observe any difference. For details, check out this article.
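For reference, those two flags would go on the serve command in the contour container's args. A sketch (the image tag and surrounding Deployment fields are illustrative; only the two --kubernetes-client-* flags come from the suggestion above):

```yaml
# Fragment of the Contour Deployment's pod spec (illustrative)
containers:
  - name: contour
    image: ghcr.io/projectcontour/contour:v1.29.2
    args:
      - serve
      - --incluster
      - --kubernetes-client-qps=100
      - --kubernetes-client-burst=150
```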

@philimonoff
Author

@tsaarni I tried 100 QPS and 150 burst, and it didn't help. Worse, all hosts started returning 503, so I removed the flags. I don't know what happened, because it's a production environment and I can't leave it broken.

@tsaarni
Member

tsaarni commented Oct 30, 2024

@philimonoff Unfortunately, at this point I don’t have any other ideas about what could be causing the issue. I assume you've already checked the leader’s logs for errors? The Contour pod might be under heavy resource constraints (such as CPU), but if that were the case, I’d expect contour cli responses to be affected as well, which didn't seem to be the case.

@philimonoff
Author

@tsaarni before opening this issue, I had already tried reading Contour's debug logs (they are emitted at a very high rate), recording pprof sessions and traces (nothing suspicious), and watching all the metrics Contour exposes. My next idea is to add my own log lines at each step of an EndpointSlice's path from the API server to the xDS cache. Right now I can't even guess at the cause.
