Bug 2292435: mon: Remove extra mon from quorum before taking down pod #720

Merged
travisn merged 1 commit into red-hat-storage:release-4.17 from backport-mon-race
Sep 9, 2024


@travisn travisn commented Sep 5, 2024

When removing a mon from quorum, there is a race condition that can cause mon quorum to be lost, at least temporarily. Previously, the mon pod was deleted first, and only then was the mon removed from quorum. If any other mon went down between the deletion of the bad mon's pod and the removal of that mon from quorum, there might not be enough mons left in quorum to complete the removal, and the operator would be stuck.

For example, there could temporarily be 4 mons due to the timing of upgrading K8s nodes, where mons may be taken down for some number of minutes. Say a new mon is started while the down mon also comes back up. The operator now sees that it can remove the 4th mon from quorum, so it starts to remove it. Now say another mon goes down on another node that is being updated or otherwise drained. Since the 4th mon pod was deleted and another mon is down, only two mons remain in quorum, but 3 mons are required for quorum when there are 4 mons. Therefore, the quorum is stuck until the third mon comes back up.
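The arithmetic in this example can be checked with a small sketch. It is illustrative only and is not code from this PR; the function names `quorum_size` and `has_quorum` are made up for the example, and it assumes only that quorum needs a strict majority of the mons listed in the monmap.

```python
def quorum_size(monmap_size: int) -> int:
    """Smallest strict majority of the mons listed in the monmap."""
    return monmap_size // 2 + 1


def has_quorum(monmap_size: int, mons_up: int) -> bool:
    """True if the running mons form a strict majority of the monmap."""
    return mons_up >= quorum_size(monmap_size)


# Scenario from the description: 4 mons in the monmap, the 4th mon's pod
# has been deleted and one more mon is down, so only 2 are running.
print(quorum_size(4))    # 3
print(has_quorum(4, 2))  # False: the removal cannot complete
print(has_quorum(3, 2))  # True: same 2 mons, but only 3 in the monmap
```

The last line shows why the order matters: the same two running mons are enough once the extra mon is no longer counted in the monmap.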

The solution is to first remove the extra mon from quorum before taking down the mon pod.
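To make the ordering change concrete, here is a toy model of the two orderings. It is a hedged sketch, not the operator's actual code: `ToyMonCluster`, `remove_extra_mon`, and the mon names are hypothetical, and the model assumes only that updating the monmap requires a strict majority of the mons it lists to be running.

```python
class ToyMonCluster:
    """Hypothetical model: a monmap plus the set of mons currently running."""

    def __init__(self, mons):
        self.monmap = set(mons)
        self.running = set(mons)

    def has_quorum(self) -> bool:
        up = len(self.monmap & self.running)
        return up >= len(self.monmap) // 2 + 1

    def stop(self, mon: str) -> None:
        """Model a mon pod being deleted or its node being drained."""
        self.running.discard(mon)

    def remove_from_monmap(self, mon: str) -> None:
        """Model removing a mon from quorum; this itself needs quorum."""
        if not self.has_quorum():
            raise RuntimeError("no quorum: cannot update the monmap")
        self.monmap.discard(mon)


def remove_extra_mon(cluster, extra, drained, quorum_first):
    """Remove `extra` while `drained` goes down between the two steps."""
    if quorum_first:
        cluster.remove_from_monmap(extra)  # ordering described as the fix
        cluster.stop(drained)              # unrelated mon goes down
        cluster.stop(extra)
    else:
        cluster.stop(extra)                # previous ordering: pod first
        cluster.stop(drained)              # unrelated mon goes down
        cluster.remove_from_monmap(extra)  # raises: only 2 of 4 mons up


old = ToyMonCluster(["a", "b", "c", "d"])
try:
    remove_extra_mon(old, extra="d", drained="c", quorum_first=False)
except RuntimeError as err:
    print("pod first:", err)

new = ToyMonCluster(["a", "b", "c", "d"])
remove_extra_mon(new, extra="d", drained="c", quorum_first=True)
print("quorum first:", sorted(new.monmap), new.has_quorum())
```

With the pod deleted first, the removal fails because only 2 of the 4 mons in the monmap are running. With the mon removed from quorum first, the monmap shrinks to 3 mons while all 4 are still up, so losing another mon afterwards still leaves 2 of 3.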

Signed-off-by: Travis Nielsen [email protected]
(cherry picked from commit 8987d26)
(cherry picked from commit 09c9065)

Issue resolved by this Pull Request:
Resolves https://bugzilla.redhat.com/show_bug.cgi?id=2292435

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@openshift-ci openshift-ci bot added labels Sep 5, 2024:

  • bugzilla/severity-unspecified: Referenced Bugzilla bug's severity is unspecified for the PR.
  • bugzilla/invalid-bug: Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.

openshift-ci bot commented Sep 5, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: travisn

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Sep 5, 2024

@travisn: This pull request references Bugzilla bug 2292435, which is invalid:

  • expected the bug to target the "ODF 4.17.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2292435: mon: Remove extra mon from quorum before taking down pod

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@travisn travisn merged commit 6c78c88 into red-hat-storage:release-4.17 Sep 9, 2024
50 of 51 checks passed

openshift-ci bot commented Sep 9, 2024

@travisn: All pull requests linked via external trackers have merged:

Bugzilla bug 2292435 has been moved to the MODIFIED state.


@travisn travisn deleted the backport-mon-race branch October 4, 2024 20:09