Conversation

@NotAndD commented Oct 23, 2025

K8SPXC-1732

Change ordering of operations in backup scripts to begin listening for SST earlier

Problem:

K8SPXC-1732

When Istio in ambient mode is deployed in the cluster and the Percona pods are inside its service mesh, backups fail: the SST transfer is sometimes lost at random, leaving the backup pod stuck running forever, or the backup always fails because the first socat receives the snapshot files instead of the SST file (and the second socat then hangs forever).

I've opened an issue here on GitHub, as I had trouble creating a Jira ticket (not sure why).

Cause:

Honestly, a bit unclear. Since connections are wrapped by ztunnel, I suspect something breaks an assumption made by the donor, and the SST transfer is somehow marked as successful even when it fails. This was not happening with an older version of the operator (1.14.0), so I think the problem is on the garbd / donor side.

Solution:

I've refactored the backup scripts so that they work in a different way:

  • socat starts listening before garbd, so the pod is ready to receive the SST as soon as the donor starts sending it.
  • garbd is also shut down without errors once the SST transfer is done, and the snapshot itself is then awaited by the main script separately.
  • obtaining this ordering required a few small changes here and there (see the rough sketch after this list).
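
Roughly, the new flow is sketched below. This is only a minimal illustration of the ordering, not the actual operator script: the variable names (SST_PORT, BACKUP_DIR, PXC_SERVICE, CLUSTER_NAME, DONOR_NAME) and the exact socat / garbd / xbstream options are assumptions made for the example.

```bash
#!/bin/bash
set -o errexit

# Illustrative sketch only; names and flags are placeholders,
# not the operator's real backup script.

SST_PORT=4444
BACKUP_DIR=/backup

# 1. Start listening for the SST stream *before* requesting it, so the
#    receiver is ready the moment the donor starts sending.
socat -u "tcp-listen:${SST_PORT},reuseaddr" stdio | xbstream -x -C "${BACKUP_DIR}" &
receiver_pid=$!

# 2. Only then start garbd, which joins the cluster and triggers the SST
#    from the donor; its exit is no longer treated as "transfer finished".
garbd --address "gcomm://${PXC_SERVICE}" \
      --group "${CLUSTER_NAME}" \
      --donor "${DONOR_NAME}" \
      --sst "xtrabackup-v2:$(hostname -i):${SST_PORT}" || true

# 3. The main script waits for the snapshot reception itself, separately
#    from garbd, before declaring the backup done.
wait "${receiver_pid}"
```

The key point is only the ordering: with the listener already up, the donor's stream can never arrive before anything is ready to accept it.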

I've tested my solution in some of our test clusters and it looks like it is working (for S3 storage at least). But I am looking for input, as I am unsure whether this is the right approach. I think this could be an improvement to the scripts, but I may have missed something obvious in changing them.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PXC version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size bot added the size/L (100-499 lines) label Oct 23, 2025
@CLAassistant commented Oct 23, 2025

CLA assistant check
All committers have signed the CLA.

@JNKPercona (Collaborator)

Test Name Result Time
affinity-8-0 passed 00:05:58
auto-tuning-8-0 passed 00:18:22
cross-site-8-0 passed 00:34:18
custom-users-8-0 passed 00:12:37
demand-backup-cloud-8-0 passed 00:57:45
demand-backup-encrypted-with-tls-8-0 failure 00:17:07
demand-backup-8-0 passed 00:42:03
demand-backup-flow-control-8-0 passed 00:10:50
demand-backup-parallel-8-0 passed 00:09:11
demand-backup-without-passwords-8-0 passed 00:15:38
haproxy-5-7 passed 00:14:14
haproxy-8-0 passed 00:13:57
init-deploy-5-7 passed 00:15:49
init-deploy-8-0 passed 00:16:47
limits-8-0 passed 00:11:52
monitoring-2-0-8-0 passed 00:22:26
monitoring-pmm3-8-0 passed 00:18:10
one-pod-5-7 passed 00:13:44
one-pod-8-0 passed 00:13:34
pitr-8-0 passed 00:43:37
pitr-gap-errors-8-0 passed 00:55:41
proxy-protocol-8-0 passed 00:09:41
proxysql-sidecar-res-limits-8-0 passed 00:08:20
pvc-resize-5-7 passed 00:16:33
pvc-resize-8-0 passed 00:15:45
recreate-8-0 passed 00:17:13
restore-to-encrypted-cluster-8-0 failure 00:17:05
scaling-proxysql-8-0 passed 00:08:17
scaling-8-0 passed 00:10:45
scheduled-backup-5-7 passed 01:06:43
scheduled-backup-8-0 passed 01:03:44
security-context-8-0 passed 00:25:24
smart-update1-8-0 passed 00:34:00
smart-update2-8-0 passed 00:38:32
storage-8-0 passed 00:10:48
tls-issue-cert-manager-ref-8-0 passed 00:09:31
tls-issue-cert-manager-8-0 passed 00:11:46
tls-issue-self-8-0 passed 00:13:21
upgrade-consistency-8-0 passed 00:11:19
upgrade-haproxy-5-7 passed 00:24:05
upgrade-haproxy-8-0 passed 00:24:13
upgrade-proxysql-5-7 passed 00:15:01
upgrade-proxysql-8-0 passed 00:18:09
users-5-7 passed 00:25:50
users-8-0 failure 00:17:37
validation-hook-8-0 passed 00:01:51
We ran 46 out of 46 tests in 16:19:36

commit: 842b7a7
image: perconalab/percona-xtradb-cluster-operator:PR-2221-842b7a7e

@egegunes (Contributor)

@NotAndD thank you for your contribution. We'll start working on the next PXC operator release soon; we'll review and test your changes as part of that release.

@gkech gkech changed the title feat: Change ordering of operations in backup scripts to begin listening for SST earlier K8SPXC-1732 Change ordering of operations in backup scripts to begin listening for SST earlier Oct 31, 2025
@NotAndD (Author) commented Oct 31, 2025

@egegunes thanks a lot! Please let me know if there's anything wrong, or even if you prefer a different solution.

As I wrote at the beginning, I am mostly looking for guidance here: I wasn't able to fully understand the issue, so my changes to the scripts were made without full knowledge of the root cause and are based more on practical tests in our test environments.

@gkech (Contributor) commented Nov 4, 2025

Hello @NotAndD, given that the changes to the scripts are substantial, we believe we should not only review this PR but also test it a bit more thoroughly.
