Backup procedure can fail when Istio is in the cluster, probably related to timings on sending sst_info from the donor #2217

@NotAndD

Report

When starting a backup of a MySQL cluster that runs inside an Istio service mesh in ambient mode, the backup sometimes fails (in different ways). This does not happen with an older version of the operator (1.14.0).

More about the problem

When the backup is started, the backup Pod is scheduled as expected and joins the cluster as a member without issues. But then, most of the time, the recv script fails for the following reason:

  • Normally, the recv script opens port 4444 twice: once to receive sst_info (with xbstream), and once to receive everything else (with xbcloud).
  • On the first opening, the database snapshot is received instead of sst_info, as if the sender had already sent sst_info while the port was not yet open on the receiver side.
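
To make the suspected ordering problem concrete, here is a minimal model of the two listening phases. It is only a sketch of the behaviour described above, not the operator's code: `receive_two_phase` and the phase names are hypothetical, and a "stream" is reduced to a label.

```python
from typing import Iterable, Optional


def receive_two_phase(arrivals: Iterable[str]) -> dict[str, Optional[str]]:
    """Assign the streams that actually reach the receiver to its two
    listening phases, strictly in arrival order (hypothetical model)."""
    it = iter(arrivals)
    return {
        "phase1_expected_sst_info": next(it, None),
        "phase2_expected_snapshot": next(it, None),
    }


# Healthy run: both streams reach the receiver in order.
healthy = receive_two_phase(["sst_info", "snapshot"])

# Suspected failing run: sst_info was sent before port 4444 was open,
# so the first stream the receiver sees is already the snapshot.
broken = receive_two_phase(["snapshot"])

print(healthy)
print(broken)
```

In the second case the first phase holds the snapshot and the second phase never receives anything, which would match a Pod stuck at the second socat opening.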

This leads to one of two outcomes:

  • The backup fails and the Pod is stopped, with "SST request failed" and a SIGTERM.
  • The snapshot transfer is so fast that the SIGTERM arrives too late, and the Pod remains stuck running forever at the second socat stream opening.

I confirmed that the Pod was receiving part of a database snapshot in the first stream by manually starting a backup Pod and running the backup script inside it to debug. When the backup does not work, I find partial compressed database files inside /tmp/.
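
The manual check of /tmp/ could be scripted roughly as follows. This is a sketch under assumptions: that the first stream is extracted into a single directory, and that a correct first stream leaves a file named `sst_info` there. The function name `diagnose_first_stream` and the sample file name are hypothetical.

```python
import tempfile
from pathlib import Path


def diagnose_first_stream(extract_dir: str, info_name: str = "sst_info") -> str:
    """Classify what the first stream left behind in extract_dir.

    Returns "ok" if the info file is present, "mismatch" if other files
    arrived without it, and "empty" if nothing was extracted.
    """
    root = Path(extract_dir)
    files = [p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()]
    if info_name in files:
        return "ok"
    return "mismatch" if files else "empty"


# Demo on a scratch directory standing in for /tmp/ inside the backup Pod.
with tempfile.TemporaryDirectory() as scratch:
    empty_result = diagnose_first_stream(scratch)
    # Hypothetical partial snapshot file, no sst_info.
    (Path(scratch) / "ibdata1.zst").write_bytes(b"partial")
    mismatch_result = diagnose_first_stream(scratch)
    (Path(scratch) / "sst_info").write_text("placeholder")
    ok_result = diagnose_first_stream(scratch)

print(empty_result, mismatch_result, ok_result)
```

A "mismatch" result right after the first socat run would correspond to the non-working case described here.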

Logs for non-working case:

backup_script_not_working.log
pxc_during_backup_not_working.log

Logs for working case:

backup_script_working.log
pxc_during_backup_working.log

Steps to reproduce

  1. Install Istio with the ambient profile in the cluster
  2. Label the Percona namespace with istio.io/dataplane-mode: ambient
  3. Restart all Percona Pods
  4. Start a backup
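
For reference, the steps above roughly correspond to the commands printed by this dry-run helper. It is a sketch, not a verified reproduction script: the namespace `pxc` and the manifest `backup.yaml` (a backup custom resource for the cluster) are placeholders, and deleting all Pods is just one way to restart them (it also restarts the operator if it runs in the same namespace).

```python
import shlex


def repro_commands(namespace: str, backup_manifest: str) -> list[list[str]]:
    """Build the command lines for the four reproduction steps (not executed)."""
    return [
        # 1. Install Istio with the ambient profile.
        ["istioctl", "install", "--set", "profile=ambient", "--skip-confirmation"],
        # 2. Enrol the namespace in ambient mode.
        ["kubectl", "label", "namespace", namespace, "istio.io/dataplane-mode=ambient"],
        # 3. Restart the Pods so they are captured by the mesh.
        ["kubectl", "delete", "pods", "--all", "-n", namespace],
        # 4. Start a backup by applying a backup manifest (placeholder file).
        ["kubectl", "apply", "-n", namespace, "-f", backup_manifest],
    ]


commands = repro_commands("pxc", "backup.yaml")
for cmd in commands:
    print(shlex.join(cmd))
```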

Versions

  1. Kubernetes v1.32.5
  2. Operator 1.18.0
  3. Database 8.0.42-33.1

Unrelated to Percona itself:

  1. Istio 1.26.3

Anything else?

As mentioned, with the same versions of Istio and Kubernetes but with the operator at 1.14.0, backups work without issues.

Conversely, without Istio, backups also work fine with 1.18.0, so I suspect this is related to a change in Percona / Galera that affects timings, or the checks that determine whether sst_info has been streamed correctly.

I tried to debug but got stuck at a certain point. If someone can point me to checks I can run, I can investigate further.
