Backup procedure can fail when Istio is in the cluster, probably related to timings on sending sst_info from the donor

### Report

When starting a backup of a MySQL cluster that runs inside an Istio service mesh in ambient mode, the backup sometimes fails (in different ways). This does not happen with an older version of the operator (1.14.0).

### More about the problem

When the backup is started, the backup Pod is scheduled as expected and joins the cluster without issues. Most of the time, however, the recv script then fails for the following reason:

- Normally, the recv script opens port 4444 twice: once to receive sst_info (with xbstream), and once to receive everything else (with xbcloud).
- On the first opening, the database snapshot is received instead of the sst_info, as if the sender had already sent the sst_info before the port was open on the receiver side.

This causes two possible failures:

- The backup fails and the Pod is stopped with `SST request failed` and a SIGTERM.
- The snapshot transfer is so fast that it finishes before the SIGTERM arrives, and the Pod stays stuck running forever at the second socat stream opening.
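To make the suspected ordering problem concrete, here is a minimal sketch of it in Python. It is only a model of the symptom, not the actual recv script: the `Stream` class, the `receive` function, and the timestamps are all hypothetical, and it assumes that anything the donor sends before the receiver's listener is open is simply lost.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str       # "sst_info" or "snapshot"
    sent_at: float  # time the donor starts sending (arbitrary units)


def receive(streams, listener_open_at):
    """Map each receiver slot to the stream that actually landed in it.

    Assumption: a stream sent before the listener is open is lost,
    and the receiver fills its slots strictly in arrival order.
    """
    expected_slots = ["sst_info", "snapshot"]
    arrived = [
        s.name
        for s in sorted(streams, key=lambda s: s.sent_at)
        if s.sent_at >= listener_open_at
    ]
    # zip stops at the shorter sequence, so a missing stream leaves
    # the last slot unfilled (the receiver would keep waiting on it).
    return dict(zip(expected_slots, arrived))


donor = [Stream("sst_info", 0.1), Stream("snapshot", 0.5)]

# Listener open before the donor sends anything: both slots match.
print(receive(donor, listener_open_at=0.0))

# Listener opens after sst_info was sent: the snapshot lands in the
# sst_info slot and the second slot never receives anything.
print(receive(donor, listener_open_at=0.2))
```

In the second call the snapshot ends up in the slot meant for sst_info and the second slot stays empty, which would match both the failed SST request and the Pod hanging on the second socat opening.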

I confirmed that the Pod was receiving part of a database snapshot in the first stream by manually starting a backup Pod and running the backup script inside it to debug. When the backup does not work, I find partial compressed database files inside `/tmp/`.


Logs for non-working case:

[backup_script_not_working.log](https://github.com/user-attachments/files/23019670/backup_script_not_working.log)
[pxc_during_backup_not_working.log](https://github.com/user-attachments/files/23019671/pxc_during_backup_not_working.log)

Logs for working case:

[backup_script_working.log](https://github.com/user-attachments/files/23019672/backup_script_working.log)
[pxc_during_backup_working.log](https://github.com/user-attachments/files/23019673/pxc_during_backup_working.log)

### Steps to reproduce

1. Install Istio with the ambient profile in the cluster
2. Label the Percona namespace with the ambient label `istio.io/dataplane-mode: ambient`
3. Restart all Percona Pods
4. Start a backup
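As a rough sketch, the steps above could be scripted as follows. The namespace name `pxc` and the manifest file `backup.yaml` are placeholders, not values from my setup, and deleting every Pod in the namespace is just one assumed way of restarting them. The script only prints the commands; it does not run them.

```python
import shlex


def repro_commands(namespace, backup_manifest):
    """Build the command lines for the four reproduction steps."""
    return [
        # 1. Install Istio with the ambient profile
        ["istioctl", "install", "--set", "profile=ambient", "--skip-confirmation"],
        # 2. Put the namespace into the ambient data plane
        ["kubectl", "label", "namespace", namespace,
         "istio.io/dataplane-mode=ambient", "--overwrite"],
        # 3. Restart the Pods (assumption: deleting them lets them be recreated)
        ["kubectl", "delete", "pods", "--all", "-n", namespace],
        # 4. Start a backup from a backup manifest (hypothetical file name)
        ["kubectl", "apply", "-n", namespace, "-f", backup_manifest],
    ]


if __name__ == "__main__":
    for cmd in repro_commands("pxc", "backup.yaml"):
        print(shlex.join(cmd))
```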


### Versions

1. Kubernetes v1.32.5
2. Operator 1.18.0
3. Database 8.0.42-33.1

Unrelated to Percona itself:

1. Istio 1.26.3


### Anything else?

As mentioned above, with the same versions of Istio and Kubernetes but with operator 1.14.0, backups work without issues.

On the other hand, without Istio, backups also work fine with 1.18.0, so maybe this is related to a change in Percona / Galera that affects timings, or the checks that verify whether the sst_info has been streamed correctly?

I tried to debug further but got stuck. If someone can point me to checks I can run, I can investigate some more.

Backup procedure can fail when Istio is in the cluster, probably related to timings on sending sst_info from the donor #2217
