K8SPXC-1734 Add configurable HAProxy health check parameters #2207

timstoop · 2025-10-05T21:58:32Z

CHANGE DESCRIPTION

Problem:

When a PXC node fails (e.g., during a rolling restart), HAProxy needs 20 seconds or more to detect the failure with the default settings (check inter 10000 ms × fall 2 = 20 s). Worse, existing client connections to the failed backend are NOT terminated, so they hang until the TCP timeout expires (potentially minutes).

The only workaround is to provide a complete HAProxy configuration via haproxy.configuration, which duplicates operator logic, breaks on upgrades, and is difficult to maintain.

Cause:

HAProxy backend health check parameters (interval, fall, rise, on-marked-down shutdown-sessions) are hardcoded in the operator and not exposed through the CR API.

Solution:

Add a new healthCheck field to the HAProxy spec allowing granular control:

haproxy:
  healthCheck:
    interval: 3000          # ms between checks (default: 10000)
    fall: 2                 # consecutive failures (default: 2)
    rise: 1                 # consecutive successes (default: 1)
    shutdownOnMarkDown: true # terminate connections (default: false)

This enables fast failover (6s vs 20s) and active connection cleanup without overriding the entire configuration.

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?
Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Are all needed new/changed options added to the Helm Chart?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support oldest and newest supported PXC version?
Does the change support oldest and newest supported Kubernetes version?

Additional Notes:

Fully backwards compatible (optional field with safe defaults)
Comprehensive unit test coverage
CRDs regenerated with validation rules

Fixes #2206

it-percona-cla · 2025-10-05T21:58:37Z

All committers have signed the CLA.

timstoop · 2025-10-06T08:04:31Z

I'm btw happy to add this change to the Helm chart and the docs as well, but wanted to wait with those PRs until I knew for sure this would be accepted.

hors · 2025-10-06T09:18:48Z

@timstoop haproxy does not have on-marked-down shutdown-sessions at all now :( I am going to add it via https://github.com/percona/percona-xtradb-cluster-operator/pull/2205/files . Do you want to be able to enable/disable it? From my point of view, we can just add it by default, but maybe you have a different use case.

timstoop · 2025-10-06T10:53:01Z

I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen.
Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

hors · 2025-10-06T19:54:08Z

> I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen. Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

yes, please

hors · 2025-10-06T19:55:08Z

build/haproxy_add_pxc_nodes.sh

 	firs_node_replica=''
 	main_node=''

 	SERVER_OPTIONS=${HA_SERVER_OPTIONS:-'resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1'}


What if I want to set custom HA_SERVER_OPTIONS via a secret? Will it work with your changes?

timstoop (reply): Right, that would break. Fixed in my latest commit!

timstoop · 2025-10-06T22:09:14Z

Removed the on-marked-down shutdown-sessions option, as requested.

Also fixed the handling of HA_SERVER_OPTIONS via a secret.

gkech · 2025-11-04T12:47:21Z

Hello @timstoop, we are planning to include this PR in the next PXC release. Could you take a look at the conflicts and the test failures so that we can proceed with the review and the testing? Thanks a lot!

timstoop · 2025-11-04T15:57:02Z

Rebased on main.

Add support for configuring HAProxy health check parameters through the PerconaXtraDBCluster CR spec, allowing operators to tune health check behavior for their specific environments.

Changes:
- Added healthCheck field to HAProxySpec with interval, fall, rise, and shutdownOnMarkDown options
- Modified haproxy_add_pxc_nodes.sh to use HA_SERVER_OPTIONS and HA_SHUTDOWN_ON_MARK_DOWN environment variables
- Updated StatefulSet generation to pass health check config as env vars
- Added comprehensive test coverage for health check configurations

timstoop · 2025-11-04T16:41:03Z

Rebased and cleaned up the branch. Now contains a single commit with only the necessary changes:

Added HAProxy health check configuration fields to the CR spec
Generated CRDs updated by controller-gen v0.19.0 to reflect new fields
Implementation in haproxy.go and haproxy_add_pxc_nodes.sh
Test coverage for the new functionality

All generated files (CRDs, manifests) created with correct tool versions.

Add HA_SERVER_OPTIONS environment variable to HAProxy statefulset comparison files to match the new default configuration generated by the operator. This environment variable is now added to all HAProxy deployments with default values: resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1
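To make the default value concrete, here is a rough sketch of how a health check spec could be rendered into the HA_SERVER_OPTIONS string. The thread does not show the PR's actual implementation, so the HealthCheckSpec type, the pointer-based "unset" convention, and the ServerOptions function are all assumptions; only the option string format and the defaults (inter 10000, rise 1, fall 2) come from the conversation.

```go
package main

import "fmt"

// HealthCheckSpec is a hypothetical stand-in for the CR's healthCheck field.
// A nil pointer means "not set, fall back to the default".
type HealthCheckSpec struct {
	Interval *int // ms between checks
	Fall     *int // consecutive failures before marking down
	Rise     *int // consecutive successes before marking up
}

// Defaults taken from the fallback string in haproxy_add_pxc_nodes.sh.
const (
	defaultInterval = 10000
	defaultRise     = 1
	defaultFall     = 2
)

// orDefault returns *v when set, otherwise def.
func orDefault(v *int, def int) int {
	if v == nil {
		return def
	}
	return *v
}

// ServerOptions renders a value for HA_SERVER_OPTIONS (assumed format).
func ServerOptions(hc *HealthCheckSpec) string {
	if hc == nil {
		hc = &HealthCheckSpec{}
	}
	return fmt.Sprintf("resolvers kubernetes check inter %d rise %d fall %d weight 1",
		orDefault(hc.Interval, defaultInterval),
		orDefault(hc.Rise, defaultRise),
		orDefault(hc.Fall, defaultFall))
}

func main() {
	fmt.Println(ServerOptions(nil)) // no healthCheck set: defaults only
	interval := 3000
	fmt.Println(ServerOptions(&HealthCheckSpec{Interval: &interval}))
}
```

With no healthCheck set, the sketch reproduces the default string quoted above, which is why the comparison files gain the same env var for every HAProxy deployment.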

timstoop · 2025-11-05T08:19:57Z

Updated e2e test comparison files to include the new HA_SERVER_OPTIONS environment variable in a separate commit.

I updated these manually by adding the env var to each HAProxy statefulset comparison file. Is there tooling or a script to regenerate these comparison files that I should be using instead?

timstoop · 2025-11-05T08:21:31Z

Note: There are 2 other test failures (custom-users-8-0, demand-backup-encrypted-with-tls-8-0) that appear unrelated to this PR. The custom-users test shows a password rotation issue, which doesn't involve HAProxy configuration.

Move HA_SERVER_OPTIONS env var to be added after REPLICAS_SVC_ONLY_READERS to maintain consistent ordering in the generated StatefulSet configuration. This ensures the environment variables appear in the expected order:
1. PXC_SERVICE
2. IS_PROXY_PROTOCOL (if present)
3. REPLICAS_SVC_ONLY_READERS
4. HA_SERVER_OPTIONS

egegunes · 2025-11-06T11:14:01Z

pkg/pxc/app/statefulset/haproxy.go

+		// Add health check configuration env vars after REPLICAS_SVC_ONLY_READERS
+		if cr.Spec.HAProxy != nil {
+			container.Env = append(container.Env, buildHAProxyHealthCheckEnvVars(cr.Spec.HAProxy.HealthCheck)...)
+		}


we need to add this env var only if crVersion is 1.19.0 and above

upgrade-haproxy fails because of this

Only add HA_SERVER_OPTIONS environment variable for crVersion 1.19.0 and above to maintain backward compatibility during upgrades. Fixes upgrade-haproxy test failures.
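The version gate can be pictured as follows. This is a simplified, self-contained sketch: atLeast and healthCheckEnv are hypothetical helpers, and the numeric major.minor.patch comparison is an assumption made for illustration rather than the operator's own version handling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether version a is >= version b, comparing
// major.minor.patch numerically. Missing or non-numeric parts count as 0.
func atLeast(a, b string) bool {
	pa, pb := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < 3; i++ {
		var x, y int
		if i < len(pa) {
			x, _ = strconv.Atoi(pa[i])
		}
		if i < len(pb) {
			y, _ = strconv.Atoi(pb[i])
		}
		if x != y {
			return x > y
		}
	}
	return true
}

// healthCheckEnv returns the extra env var names for a given crVersion.
// Clusters below 1.19.0 get nothing, so their StatefulSet spec is unchanged.
func healthCheckEnv(crVersion string) []string {
	if !atLeast(crVersion, "1.19.0") {
		return nil
	}
	return []string{"HA_SERVER_OPTIONS"}
}

func main() {
	fmt.Println(healthCheckEnv("1.18.0"))
	fmt.Println(healthCheckEnv("1.19.0"))
}
```

Gating on crVersion keeps the pod template identical for clusters that have not yet been upgraded, which is presumably what the upgrade-haproxy test compares against.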

JNKPercona · 2025-11-06T14:41:18Z

Test Name	Result	Time
affinity-8-0	passed	00:06:15
auto-tuning-8-0	passed	00:19:18
cross-site-8-0	passed	00:35:27
custom-users-8-0	passed	00:13:33
demand-backup-cloud-8-0	passed	00:59:16
demand-backup-encrypted-with-tls-8-0	failure	00:17:57
demand-backup-8-0	passed	00:41:59
demand-backup-flow-control-8-0	passed	00:10:55
demand-backup-parallel-8-0	passed	00:09:40
demand-backup-without-passwords-8-0	passed	00:26:20
haproxy-5-7	passed	00:14:48
haproxy-8-0	passed	00:14:44
init-deploy-5-7	passed	00:16:37
init-deploy-8-0	passed	00:16:36
limits-8-0	passed	00:12:07
monitoring-2-0-8-0	passed	00:22:38
monitoring-pmm3-8-0	passed	00:17:55
one-pod-5-7	passed	00:14:55
one-pod-8-0	passed	00:13:54
pitr-8-0	passed	00:43:28
pitr-gap-errors-8-0	passed	00:56:05
proxy-protocol-8-0	passed	00:09:36
proxysql-sidecar-res-limits-8-0	passed	00:08:48
pvc-resize-5-7	passed	00:17:26
pvc-resize-8-0	passed	00:16:28
recreate-8-0	passed	00:17:53
restore-to-encrypted-cluster-8-0	passed	00:26:31
scaling-proxysql-8-0	passed	00:08:56
scaling-8-0	passed	00:11:12
scheduled-backup-5-7	passed	01:06:26
scheduled-backup-8-0	passed	01:06:06
security-context-8-0	passed	00:25:25
smart-update1-8-0	passed	00:33:48
smart-update2-8-0	passed	00:38:43
storage-8-0	passed	00:10:38
tls-issue-cert-manager-ref-8-0	passed	00:08:54
tls-issue-cert-manager-8-0	passed	00:11:21
tls-issue-self-8-0	passed	00:13:15
upgrade-consistency-8-0	passed	00:11:16
upgrade-haproxy-5-7	passed	00:24:10
upgrade-haproxy-8-0	passed	00:24:31
upgrade-proxysql-5-7	passed	00:16:15
upgrade-proxysql-8-0	passed	00:15:12
users-5-7	passed	00:25:01
users-8-0	passed	00:26:36
validation-hook-8-0	passed	00:01:44

Summary	Value
Tests Run	46/46
Job Duration	02:54:48
Total Test Time	17:01:02

commit: 89209ce
image: perconalab/percona-xtradb-cluster-operator:PR-2207-89209ce1

timstoop requested review from egegunes, gkech, hors, nmarukovich and pooknull as code owners October 5, 2025 21:58

pull-request-size bot added the size/L 100-499 lines label Oct 5, 2025

timstoop changed the title ~~Add configurable HAProxy health check parameters~~ K8SPXC-2206 Add configurable HAProxy health check parameters Oct 5, 2025

timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from bc23e2f to cb1bef1 Compare October 6, 2025 06:50

egegunes added the community label Oct 6, 2025

timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from cb1bef1 to c4cc754 Compare October 6, 2025 09:08

timstoop requested review from eleo007, jvpasinatto and valmiranogueira as code owners October 6, 2025 15:28

hors reviewed Oct 6, 2025


pull-request-size bot added size/XXL 1000+ lines and removed size/L 100-499 lines labels Oct 6, 2025

gkech changed the title ~~K8SPXC-2206 Add configurable HAProxy health check parameters~~ K8SPXC-1734 Add configurable HAProxy health check parameters Oct 31, 2025

timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from f8e092d to 6107a65 Compare November 4, 2025 15:55

timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from 6107a65 to c797815 Compare November 4, 2025 16:21

pull-request-size bot added size/L 100-499 lines and removed size/XXL 1000+ lines labels Nov 4, 2025

timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from c797815 to 7952d6c Compare November 4, 2025 16:25

timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from 7952d6c to 2ecd007 Compare November 4, 2025 16:39

egegunes requested changes Nov 6, 2025


Add version check for HAProxy health check env vars

89209ce


egegunes approved these changes Nov 6, 2025

