@timstoop timstoop commented Oct 5, 2025

K8SPXC-2206

CHANGE DESCRIPTION

Problem:

When a PXC node fails (e.g., during a rolling restart), HAProxy takes 20+ seconds to detect the failure with the default settings (check inter 10000 with fall 2: 2 × 10 s = 20 s). Worse, existing client connections to the failed backend are NOT terminated, so they hang until the TCP timeout expires (potentially minutes).

The only workaround is to provide a complete HAProxy configuration via haproxy.configuration, which duplicates operator logic, breaks on upgrades, and is difficult to maintain.

Cause:

HAProxy backend health check parameters (interval, fall, rise, on-marked-down shutdown-sessions) are hardcoded in the operator and not exposed through the CR API.

Solution:

Add a new healthCheck field to the HAProxy spec allowing granular control:

haproxy:
  healthCheck:
    interval: 3000          # ms between checks (default: 10000)
    fall: 2                 # consecutive failures (default: 2)
    rise: 1                 # consecutive successes (default: 1)
    shutdownOnMarkDown: true # terminate connections (default: false)

This enables fast failover (6s vs 20s) and active connection cleanup without overriding the entire configuration.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

Additional Notes:

  • Fully backwards compatible (optional field with safe defaults)
  • Comprehensive unit test coverage
  • CRDs regenerated with validation rules

Fixes #2206

it-percona-cla commented Oct 5, 2025

CLA assistant check
All committers have signed the CLA.

@pull-request-size pull-request-size bot added the size/L 100-499 lines label Oct 5, 2025
@timstoop timstoop changed the title Add configurable HAProxy health check parameters K8SPXC-2206 Add configurable HAProxy health check parameters Oct 5, 2025
@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from bc23e2f to cb1bef1 Compare October 6, 2025 06:50
Author

timstoop commented Oct 6, 2025

By the way, I'm happy to add this change to the Helm chart and the docs as well, but I wanted to hold off on those PRs until I knew for sure this would be accepted.

@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from cb1bef1 to c4cc754 Compare October 6, 2025 09:08
Collaborator

hors commented Oct 6, 2025

@timstoop haproxy does not have on-marked-down shutdown-sessions at all now :( I am going to add it via https://github.com/percona/percona-xtradb-cluster-operator/pull/2205/files . Do you want to be able to enable/disable it? From my point of view, we can just add it by default, but maybe you have a different use case.

Author

timstoop commented Oct 6, 2025

I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen.
Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

Collaborator

hors commented Oct 6, 2025

> I totally agree that it makes sense to have it as default, it's what we were planning on doing in our Helm default code anyway. I only made it togglable in case there were good reasons that I didn't see and I really wanted to have this change implemented :-) I totally missed the 2205 PR, no idea how that could happen. Do you want me to remove it from this patch and only use this PR to allow setting of the health checks instead?

yes, please

firs_node_replica=''
main_node=''

SERVER_OPTIONS=${HA_SERVER_OPTIONS:-'resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1'}
Collaborator

What if I want to set custom HA_SERVER_OPTIONS via a secret? Will it work with your changes?

Author

Right, that would break. Fixed in my latest commit!

@pull-request-size pull-request-size bot added size/XXL 1000+ lines and removed size/L 100-499 lines labels Oct 6, 2025
Author

timstoop commented Oct 6, 2025

Removed the on-marked-down shutdown-sessions option, as requested.

Also fixed the handling of HA_SERVER_OPTIONS via a secret.

@gkech gkech changed the title K8SPXC-2206 Add configurable HAProxy health check parameters K8SPXC-1734 Add configurable HAProxy health check parameters Oct 31, 2025
Contributor

gkech commented Nov 4, 2025

Hello @timstoop, we are planning to include this PR in the next PXC release. Could you have a look at the conflicts and the test failures so that we can proceed with the review and the testing? Thanks a lot!

@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from f8e092d to 6107a65 Compare November 4, 2025 15:55
Author

timstoop commented Nov 4, 2025

Rebased on main.

@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from 6107a65 to c797815 Compare November 4, 2025 16:21
@pull-request-size pull-request-size bot added size/L 100-499 lines and removed size/XXL 1000+ lines labels Nov 4, 2025
@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from c797815 to 7952d6c Compare November 4, 2025 16:25
Add support for configuring HAProxy health check parameters through the
PerconaXtraDBCluster CR spec, allowing operators to tune health check
behavior for their specific environments.

Changes:
- Added healthCheck field to HAProxySpec with interval, fall, rise, and
  shutdownOnMarkDown options
- Modified haproxy_add_pxc_nodes.sh to use HA_SERVER_OPTIONS and
  HA_SHUTDOWN_ON_MARK_DOWN environment variables
- Updated StatefulSet generation to pass health check config as env vars
- Added comprehensive test coverage for health check configurations
@timstoop timstoop force-pushed the K8SPXC-2206-haproxy-health-check-config branch from 7952d6c to 2ecd007 Compare November 4, 2025 16:39
Author

timstoop commented Nov 4, 2025

Rebased and cleaned up the branch. Now contains a single commit with only the necessary changes:

  • Added HAProxy health check configuration fields to the CR spec
  • Generated CRDs updated by controller-gen v0.19.0 to reflect new fields
  • Implementation in haproxy.go and haproxy_add_pxc_nodes.sh
  • Test coverage for the new functionality

All generated files (CRDs, manifests) created with correct tool versions.

Add HA_SERVER_OPTIONS environment variable to HAProxy statefulset
comparison files to match the new default configuration generated by
the operator.

This environment variable is now added to all HAProxy deployments with
default values: resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1
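
One way the default string above could be assembled from optional health check fields is sketched below. This is an illustration under stated assumptions, not the PR's implementation: the HealthCheckSpec type, the orDefault helper, and the ServerOptions function are hypothetical names, and only the default values (inter 10000, rise 1, fall 2) and the option string layout come from this thread.

```go
package main

import "fmt"

// Defaults taken from the server options string quoted in this thread.
const (
	defaultInterval = 10000
	defaultRise     = 1
	defaultFall     = 2
)

// HealthCheckSpec is a hypothetical stand-in for the CR field.
// A nil pointer means the user did not set that value.
type HealthCheckSpec struct {
	Interval *int
	Rise     *int
	Fall     *int
}

// orDefault returns *v when it is set and def otherwise.
func orDefault(v *int, def int) int {
	if v == nil {
		return def
	}
	return *v
}

// ServerOptions renders the HAProxy server options, falling back to the
// defaults for every field that is unset (or when hc itself is nil).
func ServerOptions(hc *HealthCheckSpec) string {
	if hc == nil {
		hc = &HealthCheckSpec{}
	}
	return fmt.Sprintf("resolvers kubernetes check inter %d rise %d fall %d weight 1",
		orDefault(hc.Interval, defaultInterval),
		orDefault(hc.Rise, defaultRise),
		orDefault(hc.Fall, defaultFall))
}

func main() {
	interval := 3000
	fmt.Println(ServerOptions(nil))
	fmt.Println(ServerOptions(&HealthCheckSpec{Interval: &interval}))
}
```

With no health check configured, the sketch reproduces the default string exactly, which is the property the updated comparison files rely on.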
Author

timstoop commented Nov 5, 2025

Updated e2e test comparison files to include the new HA_SERVER_OPTIONS environment variable in a separate commit.

I updated these manually by adding the env var to each HAProxy statefulset comparison file. Is there tooling or a script to regenerate these comparison files that I should be using instead?

Author

timstoop commented Nov 5, 2025

Note: There are 2 other test failures (custom-users-8-0, demand-backup-encrypted-with-tls-8-0) that appear unrelated to this PR. The custom-users test shows a password rotation issue, which doesn't involve HAProxy configuration.

Move HA_SERVER_OPTIONS env var to be added after REPLICAS_SVC_ONLY_READERS
to maintain consistent ordering in the generated StatefulSet configuration.

This ensures the environment variables appear in the expected order:
1. PXC_SERVICE
2. IS_PROXY_PROTOCOL (if present)
3. REPLICAS_SVC_ONLY_READERS
4. HA_SERVER_OPTIONS
Comment on lines 296 to 299
// Add health check configuration env vars after REPLICAS_SVC_ONLY_READERS
if cr.Spec.HAProxy != nil {
	container.Env = append(container.Env, buildHAProxyHealthCheckEnvVars(cr.Spec.HAProxy.HealthCheck)...)
}
Contributor

we need to add this env var only if crVersion is 1.19.0 and above

Contributor

upgrade-haproxy fails because of this

Only add HA_SERVER_OPTIONS environment variable for crVersion 1.19.0 and
above to maintain backward compatibility during upgrades.

Fixes upgrade-haproxy test failures.
@JNKPercona
Collaborator

| Test Name | Result | Time |
|---|---|---|
| affinity-8-0 | passed | 00:06:15 |
| auto-tuning-8-0 | passed | 00:19:18 |
| cross-site-8-0 | passed | 00:35:27 |
| custom-users-8-0 | passed | 00:13:33 |
| demand-backup-cloud-8-0 | passed | 00:59:16 |
| demand-backup-encrypted-with-tls-8-0 | failure | 00:17:57 |
| demand-backup-8-0 | passed | 00:41:59 |
| demand-backup-flow-control-8-0 | passed | 00:10:55 |
| demand-backup-parallel-8-0 | passed | 00:09:40 |
| demand-backup-without-passwords-8-0 | passed | 00:26:20 |
| haproxy-5-7 | passed | 00:14:48 |
| haproxy-8-0 | passed | 00:14:44 |
| init-deploy-5-7 | passed | 00:16:37 |
| init-deploy-8-0 | passed | 00:16:36 |
| limits-8-0 | passed | 00:12:07 |
| monitoring-2-0-8-0 | passed | 00:22:38 |
| monitoring-pmm3-8-0 | passed | 00:17:55 |
| one-pod-5-7 | passed | 00:14:55 |
| one-pod-8-0 | passed | 00:13:54 |
| pitr-8-0 | passed | 00:43:28 |
| pitr-gap-errors-8-0 | passed | 00:56:05 |
| proxy-protocol-8-0 | passed | 00:09:36 |
| proxysql-sidecar-res-limits-8-0 | passed | 00:08:48 |
| pvc-resize-5-7 | passed | 00:17:26 |
| pvc-resize-8-0 | passed | 00:16:28 |
| recreate-8-0 | passed | 00:17:53 |
| restore-to-encrypted-cluster-8-0 | passed | 00:26:31 |
| scaling-proxysql-8-0 | passed | 00:08:56 |
| scaling-8-0 | passed | 00:11:12 |
| scheduled-backup-5-7 | passed | 01:06:26 |
| scheduled-backup-8-0 | passed | 01:06:06 |
| security-context-8-0 | passed | 00:25:25 |
| smart-update1-8-0 | passed | 00:33:48 |
| smart-update2-8-0 | passed | 00:38:43 |
| storage-8-0 | passed | 00:10:38 |
| tls-issue-cert-manager-ref-8-0 | passed | 00:08:54 |
| tls-issue-cert-manager-8-0 | passed | 00:11:21 |
| tls-issue-self-8-0 | passed | 00:13:15 |
| upgrade-consistency-8-0 | passed | 00:11:16 |
| upgrade-haproxy-5-7 | passed | 00:24:10 |
| upgrade-haproxy-8-0 | passed | 00:24:31 |
| upgrade-proxysql-5-7 | passed | 00:16:15 |
| upgrade-proxysql-8-0 | passed | 00:15:12 |
| users-5-7 | passed | 00:25:01 |
| users-8-0 | passed | 00:26:36 |
| validation-hook-8-0 | passed | 00:01:44 |

| Summary | Value |
|---|---|
| Tests Run | 46/46 |
| Job Duration | 02:54:48 |
| Total Test Time | 17:01:02 |

commit: 89209ce
image: perconalab/percona-xtradb-cluster-operator:PR-2207-89209ce1

Labels

community size/L 100-499 lines

Development

Successfully merging this pull request may close these issues.

Add configurable HAProxy backend health check parameters without requiring full configuration override
