[18.0-fr3] Fetch up-to-date gcomm members list during a failover #349

openshift-cherrypick-robot

This is an automated cherry-pick of #348

/assign lmiccini

dciabrin added 3 commits July 25, 2025 06:55
One chainsaw test abruptly cuts one galera node away from the
galera cluster and verifies that the active endpoint moves to one of the
remaining two galera instances.

In doing so, we currently kill -9 the target mysqld server. By design,
it can take up to 15s by default for the remaining galera nodes to
acknowledge that the node went away and react to it. This is a problem for
the test: if the pod comes back online before the 15s elapse, the galera
cluster won't move the endpoint and the test will fail.

To prevent flaky results in this test, use the STOP signal instead
of the KILL signal. This doesn't kill the pod, and by default galera
will mark the node as not responding after 3s and switch the endpoint.

This achieves the same result, which is to make sure that an unexpected
disconnection still triggers an endpoint switch.
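
As an illustration of the difference between the two signals, here is a minimal sketch that is not taken from the PR. It assumes a POSIX system and uses a placeholder child process in place of mysqld; the helper names `spawn_placeholder`, `freeze` and `resume` are hypothetical. It shows that a process that receives STOP stays alive (so nothing restarts it) while it no longer makes progress.

```python
import signal
import subprocess
import sys


def spawn_placeholder() -> subprocess.Popen:
    """Start a long-sleeping child that stands in for the mysqld server."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def freeze(proc: subprocess.Popen) -> None:
    """Suspend the process with SIGSTOP: it stays alive but stops running."""
    proc.send_signal(signal.SIGSTOP)


def resume(proc: subprocess.Popen) -> None:
    """Let a suspended process run again with SIGCONT."""
    proc.send_signal(signal.SIGCONT)


if __name__ == "__main__":
    proc = spawn_placeholder()
    freeze(proc)
    # poll() returns None as long as the process has not exited.
    print("alive after SIGSTOP:", proc.poll() is None)
    resume(proc)
    proc.kill()  # SIGKILL, for comparison: the process is gone for good
    proc.wait()
    print("return code after SIGKILL:", proc.returncode)
```

With KILL the process exits and can be restarted right away, which is the race described above; with STOP it simply goes silent until it is resumed.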
The operator script that implements service endpoint failover
contains internal logic to probe the up-to-date state of the
gcomm cluster. This is done when the script starts, or when
a command fails and is retried.

The list of members was incorrectly extracted from a mysql
table, which is not guaranteed to be up-to-date when e.g.
a node disappears from the cluster due to a network partition.

Instead, we must rely on the mysql status, which always exposes
the up-to-date gcomm state, in particular the members that
are still connected to the primary partition.
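
The idea can be sketched as follows. This is an illustration only, not the operator's script: the PR text does not name the status variables it reads, so the sketch assumes the Galera status variables `wsrep_cluster_status` and `wsrep_incoming_addresses`, and `primary_members` is a hypothetical helper. It takes the `(Variable_name, Value)` rows that a `SHOW STATUS LIKE 'wsrep_%'` query would return and extracts the members seen from the primary partition.

```python
from typing import Iterable, List, Tuple


def primary_members(status_rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the member addresses reported by the status rows.

    An empty list is returned when the node is not part of the
    primary partition, since its view of the cluster cannot be trusted.
    """
    status = {name.lower(): value for name, value in status_rows}

    # Assumption: only a node reporting "Primary" has a usable view.
    if status.get("wsrep_cluster_status") != "Primary":
        return []

    # Assumption: members are exposed as a comma-separated address list.
    raw = status.get("wsrep_incoming_addresses", "")
    return [addr.strip() for addr in raw.split(",") if addr.strip()]


if __name__ == "__main__":
    rows = [
        ("wsrep_cluster_status", "Primary"),
        ("wsrep_incoming_addresses", "galera-0:3306,galera-2:3306"),
    ]
    print(primary_members(rows))
```

Keeping the parsing separate from the query makes it easy to check the logic without a running cluster.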

Jira: OSPRH-18408
@lmiccini

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jul 25, 2025
@stuggi stuggi left a comment

/lgtm

openshift-ci bot commented Jul 25, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openshift-cherrypick-robot, stuggi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit e3a64bd into openstack-k8s-operators:18.0-fr3 Jul 25, 2025
7 checks passed