Description
Report
When starting a backup of a MySQL cluster that runs inside an Istio service mesh in ambient mode, the backup sometimes fails (in different ways). This does not happen with an older version of the operator (1.14.0).
More about the problem
When the backup is started, the backup Pod is scheduled as expected and joins the cluster without issues. Most of the time, however, the recv script then fails for the following reason:
- Normally, the recv script opens port 4444 twice: once to receive sst_info (with xbstream), and once to receive everything else (with xbcloud).
- On the first opening, the snapshot of the database is received instead of the sst_info, as if the sender had already sent the sst_info while the port was not yet open on the receiver side.
This causes 2 possible issues:
- The backup fails and the Pod is stopped by `SST request failed` and SIGTERM
- The transfer of the snapshot is so fast that the SIGTERM does not arrive in time, and the Pod remains stuck running forever at the second socat opening stream
I was able to confirm that the Pod receives part of the database snapshot in the first stream by manually starting a backup Pod and running the backup script inside it for debugging. When the backup does not work, I find partial compressed database files inside /tmp/.
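To make the expected ordering concrete, here is a minimal local model of the two-phase receive described above. It is only a sketch: it is not the operator's recv script and uses neither socat, xbstream, nor xbcloud. The names `receive_once` and `run_two_phase_transfer`, the loopback address, and the `ready` events are all hypothetical. The events stand in for whatever tells the sender that the receiver is listening, which is exactly the synchronization that appears to be lost in the failing case.

```python
import socket
import threading


def _free_port() -> int:
    """Ask the OS for a currently unused TCP port on loopback."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def receive_once(port: int, ready: threading.Event) -> bytes:
    """Listen on `port`, accept one connection, and read it until EOF."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        ready.set()  # hypothetical signal: the port is open now
        conn, _ = srv.accept()
        with conn:
            chunks = []
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)


def run_two_phase_transfer(payloads):
    """Open the same port once per payload and return what each opening got."""
    port = _free_port()
    ready = [threading.Event() for _ in payloads]
    received = []

    def receiver():
        # The port is closed and reopened between streams.
        for event in ready:
            received.append(receive_once(port, event))

    thread = threading.Thread(target=receiver)
    thread.start()
    for event, payload in zip(ready, payloads):
        # The sender waits until the receiver is listening before it sends.
        event.wait(timeout=5)
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(payload)
    thread.join(timeout=5)
    return received


if __name__ == "__main__":
    print(run_two_phase_transfer([b"sst_info", b"snapshot"]))
```

In this model the first opening always receives the first payload because the sender waits for the listener. In the failing backups, the first opening instead receives what should have arrived on the second one.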
Logs for non-working case:
backup_script_not_working.log
pxc_during_backup_not_working.log
Logs for working case:
backup_script_working.log
pxc_during_backup_working.log
Steps to reproduce
- Install Istio with profile ambient in the cluster
- Label the Percona namespace with the ambient label `istio.io/dataplane-mode: ambient`
- Restart all Percona Pods
- Start a backup
Versions
- Kubernetes v1.32.5
- Operator 1.18.0
- Database 8.0.42-33.1
Unrelated to Percona itself:
- Istio 1.26.3
Anything else?
As mentioned above, with the same versions of Istio and Kubernetes but with the operator at 1.14.0, backups work without issues.
On the other hand, without Istio, backups also work fine with 1.18.0. So I was wondering whether this is related to a change in Percona / Galera that affects the timing, or the checks used to determine whether the sst_info has been streamed correctly.
I tried to debug but I got stuck at a certain point. If someone can point me to checks that I can do, I can investigate some more.