[occm] Add HTTP health checks #3020
Conversation
I never faced such an issue before. BTW, with the HTTP healthcheck you'll get the same problem, since load balancers usually have some latency in their healthchecks, and this latency is much higher than the time it takes the k8s API to understand that a pod is going to shut down; see kubernetes/kubernetes#133165 for more details. See also #2148, and try to use the node selector for node maintenance (#2601). See also the #2869 docs update, which should be available since k8s 1.34.
/hold
I encountered this issue when adding new nodes. New nodes may be added to the LoadBalancer before the network is fully set up. The service node ports open before Kubernetes networking can forward requests to the service pods, so the TCP health check succeeds but the application itself cannot be reached.
The LoadBalancer should not add a new backend until at least one successful HTTP health check has been performed. This should help with the scenario I described above.
I can't think of a solution for auto-labelling nodes for Cluster-API or KOPS-based deployments without manual action or a custom controller. Aside from the issues we encountered with our load balancers when maintaining the clusters, I believe that support for HTTP health checks would be beneficial in general.
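To make the distinction concrete, here is a minimal Go sketch, not taken from this PR: a plain TCP connect can succeed as soon as the node port is open, while an HTTP probe only succeeds once a request actually reaches a ready backend. The address, node port, and `/healthz` path are placeholders.

```go
// Sketch only: contrasts a TCP connect check with an HTTP check against the
// same node port. During the window described above, the TCP check can pass
// while the HTTP check still fails because no backend pod is reachable yet.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// tcpCheck succeeds as soon as something accepts connections on the port.
func tcpCheck(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

// httpCheck only succeeds if the request reaches a backend that answers with
// a 2xx/3xx status, i.e. the application itself is reachable.
func httpCheck(url string) error {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode >= 400 {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// 192.0.2.10:31234 is a placeholder node address and node port.
	fmt.Println("tcp :", tcpCheck("192.0.2.10:31234"))
	fmt.Println("http:", httpCheck("http://192.0.2.10:31234/healthz"))
}
```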
Unfortunately this is not true for OpenStack Octavia. A new pool member receives traffic immediately, independently of the healthcheck results. This is true for the OpenStack Amphora and F5 Octavia backends/drivers. Other backends/drivers should follow the same approach; if not, please provide evidence.
I don't think so, considering the facts above and the overall high latency of healthchecks from the LB side. Believe me, I'm trying to find a solution for this.
You're absolutely right. This doesn't seem logical to me, and other load balancer solutions that I know of always check new backends before sending traffic to them. Sorry for the misinformation, and thank you for clearing things up. I found a theoretical solution to this problem (creating a new member with …). I would be happy to give it a try if you're interested in an implementation.
I'm aware of this, but in this case we need to have two reconciliation cycles: one to create the member while keeping it out of rotation, and another to enable it afterwards. And we must have a delay between these cycles, which you cannot really determine. Besides, the current k8s upstream cloud-controller logic doesn't have two-cycle logic.
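As a purely hypothetical illustration of those two cycles, here is a short Go sketch; the `MemberClient` interface and its methods are invented for the sketch and are not OCCM, Octavia, or Gophercloud API.

```go
// Hypothetical illustration of the two reconciliation cycles described above.
// MemberClient and its methods are invented; they do not correspond to any
// real OCCM or Octavia client code.
package sketch

import (
	"context"
	"time"
)

type MemberClient interface {
	// CreateDisabledMember adds the pool member but keeps it out of rotation.
	CreateDisabledMember(ctx context.Context, poolID, address string, port int) (memberID string, err error)
	// EnableMember puts the member into rotation.
	EnableMember(ctx context.Context, poolID, memberID string) error
}

// Cycle 1: register the member, but keep it out of rotation.
func reconcileCreate(ctx context.Context, c MemberClient, poolID, addr string, port int) (string, error) {
	return c.CreateDisabledMember(ctx, poolID, addr, port)
}

// Cycle 2: enable the member only after some delay that is supposed to cover
// the LB's first health checks. That delay is a guess which cannot really be
// determined, and the upstream cloud-controller-manager reconciles a Service
// in a single pass, so there is no natural place for this second cycle.
func reconcileEnable(ctx context.Context, c MemberClient, poolID, memberID string, guessedDelay time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(guessedDelay):
	}
	return c.EnableMember(ctx, poolID, memberID)
}
```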
Just for reference, my research on this topic: haproxy adds a pool member and immediately considers it alive: https://github.com/haproxy/haproxy/blob/697f7d1142a26352940532d0481cc17f9225d0d1/src/server.c#L2836. To change this behavior, the haproxy config must contain `init-state fully-down` on the server line, for example via a change like this to the Octavia haproxy template:

```diff
diff --git a/octavia/common/jinja/haproxy/combined_listeners/templates/macros.j2 b/octavia/common/jinja/haproxy/combined_listeners/templates/macros.j2
index 5a81c37..45c6b91 100644
--- a/octavia/common/jinja/haproxy/combined_listeners/templates/macros.j2
+++ b/octavia/common/jinja/haproxy/combined_listeners/templates/macros.j2
@@ macro member_macro(constants, lib_consts, pool, member) %}
 {% if pool.health_monitor and pool.health_monitor.enabled %}
 {% if member.monitor_address %}
@@
 {% endif %}
 {% if pool.alpn_protocols is defined %}
 {% set alpn_opt = " alpn %s"|format(pool.alpn_protocols) %}
 {% else %}
 {% set alpn_opt = "" %}
 {% endif %}
+ {% if member.monitor_init_state_down %}
+ {% set init_state_opt = " init-state fully-down" %}
+ {% else %}
+ {% set init_state_opt = "" %}
+ {% endif %}
- {{ "server %s %s:%d weight %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s"|e|format(
- member.id, member.address, member.protocol_port, member.weight,
- hm_opt, persistence_opt, proxy_protocol_opt, member_backup_opt,
- member_enabled_opt, def_opt_prefix, def_crt_opt, ca_opt, crl_opt,
- def_verify_opt, def_sni_opt, ciphers_opt, tls_versions_opt,
- alpn_opt)|trim() }}
+ {{ "server %s %s:%d weight %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s"|e|format(
+ member.id, member.address, member.protocol_port, member.weight,
+ hm_opt, persistence_opt, proxy_protocol_opt, member_backup_opt,
+ member_enabled_opt, def_opt_prefix, def_crt_opt, ca_opt, crl_opt,
+ def_verify_opt, def_sni_opt, ciphers_opt, tls_versions_opt,
+ alpn_opt, init_state_opt)|trim() }}
 {% endmacro %}
```

Once the upstream Octavia codebase gets this support, backend vendors should also add it, and we can implement support for this option in OCCM.
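For illustration only (not part of this PR or the patch above), this is roughly what the rendered haproxy backend could look like for two members, assuming a pool with an HTTP health monitor; the backend name, member names, addresses, ports, and `/healthz` path are made up.

```
backend pool-example
    # assumed HTTP health monitor on the pool; the path is a placeholder
    option httpchk GET /healthz
    # today: the member starts UP and can receive traffic before its first check
    server member-a 192.0.2.10:31234 weight 1 check
    # with the proposed flag: init_state_opt appends "init-state fully-down",
    # so the member only enters the rotation once its health checks pass
    server member-b 192.0.2.11:31234 weight 1 check init-state fully-down
```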
Thank you for the detailed information, @kayrus! I just wanted to ask whether you had created a PR/bug in the Octavia project, but then I saw your reply. I will keep an eye on the Octavia bug. Please tell me if there's anything I can help with.
No, I hope that someone from the Octavia team can handle this. You can subscribe to this bug to increase the bug severity.
What this PR does / why we need it:
I realised that we had many failed requests to load balancers on our K8s platforms running on our OpenStack platform when we did maintenance that involved replacing nodes in a rolling fashion. I opened an issue in the Kubernetes project and was told that the correct solution would be to have HTTP health checks on the Services, to make sure that not only the port is reachable but also the service behind it.
As the OCCM only supports HTTP health checks for load balancers with an externalTrafficPolicy of "Local" (on the HealthCheckNodePort), I implemented HTTP monitors using additional annotations.
Release note: