Conversation

@NotAndD commented Oct 23, 2025

K8SPXC-1732

Change ordering of operations in backup scripts to begin listening for SST earlier

Problem:

K8SPXC-1732

When Istio in ambient mode is deployed in the cluster and the Percona pods are inside its service mesh, backups fail: the SST transfer is sometimes lost at random, leaving the backup pod stuck running forever, or the backup always fails because the first socat receives the snapshot files instead of the SST file (and the second socat then hangs forever).

I've opened an issue here on GitHub, as I had trouble creating a Jira ticket (not sure why).

Cause:

Honestly, a bit unclear. Since connections are wrapped by ztunnel, I suspect something breaks an assumption made by the donor, and the SST transfer is somehow marked as successful even when it fails. This was not happening with an older version of the operator (1.14.0), so I think the problem is on the garbd / donor side.

Solution:

I've refactored the backup scripts so that they work in a different way:

  • socat starts listening before garbd, so the pod is ready to receive the SST as soon as the donor starts sending it.
  • garbd is also shut down without errors once the SST transfer is done, and the snapshot itself is then awaited by the main script separately.
  • obtaining this ordering required a few small changes here and there (see the rough sketch after this list).
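
Roughly, the new flow is sketched below. This is only a minimal illustration of the ordering, not the actual operator script: the variable names (SST_PORT, BACKUP_DIR, PXC_SERVICE, CLUSTER_NAME, DONOR_NAME) and the exact socat / garbd / xbstream options are assumptions made for the example.

```bash
#!/bin/bash
set -o errexit

# Illustrative sketch only; names and flags are placeholders,
# not the operator's real backup script.

SST_PORT=4444
BACKUP_DIR=/backup

# 1. Start listening for the SST stream *before* requesting it, so the
#    receiver is ready the moment the donor starts sending.
socat -u "tcp-listen:${SST_PORT},reuseaddr" stdio | xbstream -x -C "${BACKUP_DIR}" &
receiver_pid=$!

# 2. Only then start garbd, which joins the cluster and triggers the SST
#    from the donor; its exit is no longer treated as "transfer finished".
garbd --address "gcomm://${PXC_SERVICE}" \
      --group "${CLUSTER_NAME}" \
      --donor "${DONOR_NAME}" \
      --sst "xtrabackup-v2:$(hostname -i):${SST_PORT}" || true

# 3. The main script waits for the snapshot reception itself, separately
#    from garbd, before declaring the backup done.
wait "${receiver_pid}"
```

The key point is only the ordering: with the listener already up, the donor's stream can never arrive before anything is ready to accept it.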

I've tested my solution in some of our test clusters and it looks like it is working (for S3 storage at least). But I am looking for input, as I am unsure whether this is the right approach. I think this could be an improvement to the scripts, but I may have missed something obvious in changing them.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size bot added the size/L (100-499 lines) label Oct 23, 2025
@CLAassistant commented Oct 23, 2025

CLA assistant check
All committers have signed the CLA.

@JNKPercona (Collaborator)

Test Name Result Time
affinity-8-0 passed 00:05:58
auto-tuning-8-0 passed 00:18:22
cross-site-8-0 passed 00:34:18
custom-users-8-0 passed 00:12:37
demand-backup-cloud-8-0 passed 00:57:45
demand-backup-encrypted-with-tls-8-0 failure 00:17:07
demand-backup-8-0 passed 00:42:03
demand-backup-flow-control-8-0 passed 00:10:50
demand-backup-parallel-8-0 passed 00:09:11
demand-backup-without-passwords-8-0 passed 00:15:38
haproxy-5-7 passed 00:14:14
haproxy-8-0 passed 00:13:57
init-deploy-5-7 passed 00:15:49
init-deploy-8-0 passed 00:16:47
limits-8-0 passed 00:11:52
monitoring-2-0-8-0 passed 00:22:26
monitoring-pmm3-8-0 passed 00:18:10
one-pod-5-7 passed 00:13:44
one-pod-8-0 passed 00:13:34
pitr-8-0 passed 00:43:37
pitr-gap-errors-8-0 passed 00:55:41
proxy-protocol-8-0 passed 00:09:41
proxysql-sidecar-res-limits-8-0 passed 00:08:20
pvc-resize-5-7 passed 00:16:33
pvc-resize-8-0 passed 00:15:45
recreate-8-0 passed 00:17:13
restore-to-encrypted-cluster-8-0 failure 00:17:05
scaling-proxysql-8-0 passed 00:08:17
scaling-8-0 passed 00:10:45
scheduled-backup-5-7 passed 01:06:43
scheduled-backup-8-0 passed 01:03:44
security-context-8-0 passed 00:25:24
smart-update1-8-0 passed 00:34:00
smart-update2-8-0 passed 00:38:32
storage-8-0 passed 00:10:48
tls-issue-cert-manager-ref-8-0 passed 00:09:31
tls-issue-cert-manager-8-0 passed 00:11:46
tls-issue-self-8-0 passed 00:13:21
upgrade-consistency-8-0 passed 00:11:19
upgrade-haproxy-5-7 passed 00:24:05
upgrade-haproxy-8-0 passed 00:24:13
upgrade-proxysql-5-7 passed 00:15:01
upgrade-proxysql-8-0 passed 00:18:09
users-5-7 passed 00:25:50
users-8-0 failure 00:17:37
validation-hook-8-0 passed 00:01:51
We ran 46 out of 46 tests in 16:19:36

commit: 842b7a7
image: perconalab/percona-xtradb-cluster-operator:PR-2221-842b7a7e

@egegunes (Contributor)

@NotAndD thank you for your contribution. We'll start working on the next PXC operator release soon; we'll review and test your changes as part of that release.

@gkech gkech changed the title feat: Change ordering of operations in backup scripts to begin listening for SST earlier K8SPXC-1732 Change ordering of operations in backup scripts to begin listening for SST earlier Oct 31, 2025
@NotAndD (Author) commented Oct 31, 2025

@egegunes thanks a lot! Please let me know if there's anything wrong, or even if you prefer a different solution.

As I wrote at the beginning, I am mostly looking for guidance here: I wasn't able to fully understand the issue, so my changes to the scripts were made without full knowledge of the root cause and are based more on practical tests in our test environments.

@gkech (Contributor) commented Nov 4, 2025

Hello @NotAndD, given that the changes to the scripts are substantial, we believe we should not only review this PR but also test it a bit more thoroughly.
