K8SPXC-1732 Change ordering of operations in backup scripts to begin listening for SST earlier #2221
base: main
… garbd and the second socat are unrelated.
…ete for s3 and azure upload.
commit: 842b7a7
@NotAndD thank you for your contribution. We'll start working on the next PXC operator release soon, and we'll review and test your changes as part of that release.
@egegunes thanks a lot! Please let me know if there's anything wrong, or if you'd prefer a different solution. As I wrote at the beginning, I am mostly looking for guidance here, since I wasn't able to fully understand the issue. My changes to the scripts were made without full knowledge of the root cause and are based mainly on practical tests in our test environments.
Hello @NotAndD, given that the changes to the scripts are substantial, we believe we should not only review this PR but also test it more thoroughly.
Change ordering of operations in backup scripts to begin listening for SST earlier
Problem:
K8SPXC-1732
When Istio is deployed in ambient mode in the cluster and the Percona pods are inside its service mesh, backups fail in one of two ways. Sometimes the SST transfer is randomly lost, leaving the backup pod stuck running forever. Other times the backup always fails because the first socat receives the snapshot files instead of the SST file, and the second socat then hangs forever.
I've opened an issue here on GitHub because I had trouble creating a Jira ticket (not sure why).
Cause:
Honestly, a bit unclear. Since connections are wrapped by ztunnel, I suspect this breaks an assumption made by the donor, so that the SST transfer is marked as okay even when it fails. This was not happening with an older version of Percona (1.14.0), so I think the cause is on the garbd / donor side.
Solution:
I've refactored the backup scripts so that they work in a different way:
I've tested my solution in some of our test clusters and it looks like it is working (for s3 storage at least). I am looking for input, though, as I am unsure whether this is the right approach. I think this could be an improvement to the scripts, but I may have missed something obvious in changing them.
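
To illustrate the ordering idea in the title (start listening before anything can trigger the sender), here is a minimal, generic sketch. It is not the code from this PR and does not use socat or garbd: plain Python sockets stand in for both, and the names `run_backup`, `fake_donor`, and `receive_all` are hypothetical, introduced only for this example.

```python
import socket
import threading


def receive_all(server: socket.socket, out: list) -> None:
    """Accept one connection and collect everything the peer sends."""
    conn, _ = server.accept()
    chunks = []
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:  # peer closed the connection
                break
            chunks.append(data)
    out.append(b"".join(chunks))


def fake_donor(port: int, payload: bytes) -> None:
    """Hypothetical stand-in for the donor: connect and push the payload."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(payload)


def run_backup(payload: bytes) -> bytes:
    # Step 1: bind and listen BEFORE the transfer is requested, so the
    # sender can never connect to a port that is not yet open.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    received: list = []
    receiver = threading.Thread(target=receive_all, args=(server, received))
    receiver.start()

    # Step 2: only now trigger the sender (an assumption standing in for
    # the step that requests the SST from the donor).
    fake_donor(port, payload)

    receiver.join(timeout=5)
    server.close()
    return received[0]


if __name__ == "__main__":
    print(run_backup(b"sst-payload"))
```

The point of the sketch is only the order of the two steps: the listening socket exists before the sender is triggered, so a fast sender cannot race ahead of the receiver.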