Bug 2315624: Fix mds liveness probe and mon failover #746

parth-gr · 2024-10-08T05:28:06Z

A timeout of the MDS liveness probe should not fail the probe. If a
network partition or another issue makes the Ceph mon cluster unstable,
`ceph ...` commands can hang and cause a timeout. In that case the MDS
should not be restarted, to avoid cascading cluster disruption.
If a mon failover is in progress, skip the removal of an extra mon
deployment, since in that code path the mon list contains only the
mon that was just started. In that case the extra-mon removal
erroneously removed a random mon, immediately followed by the mon
failover completing and removing the expected failed mon,
potentially causing mon quorum loss.
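The failover guard can be sketched in Go as follows. This is an illustration under stated assumptions, not the operator's code: `pickExtraMon` and `extraMonToRemove` are hypothetical names, and modeling the "extra" mon as the first deployed mon missing from the expected list is an assumption made only to reproduce the failure described above.

```go
package main

import "fmt"

// pickExtraMon returns the first deployed mon that is not in the expected
// list, or "" if every deployed mon is expected (hypothetical model).
func pickExtraMon(deployed, expected []string) string {
	known := make(map[string]bool, len(expected))
	for _, m := range expected {
		known[m] = true
	}
	for _, m := range deployed {
		if !known[m] {
			return m
		}
	}
	return ""
}

// extraMonToRemove skips extra-mon removal entirely while a failover is in
// progress, because the expected list then holds only the newly started mon.
func extraMonToRemove(deployed, expected []string, failoverInProgress bool) string {
	if failoverInProgress {
		return ""
	}
	return pickExtraMon(deployed, expected)
}

func main() {
	deployed := []string{"a", "b", "c", "d"} // "c" has failed, "d" was just started
	duringFailover := []string{"d"}          // the list seen in the failover code path

	// Without the guard, a healthy mon is selected for removal.
	fmt.Printf("unguarded: %q\n", pickExtraMon(deployed, duringFailover))
	// With the guard, nothing is removed until the failover completes.
	fmt.Printf("guarded: %q\n", extraMonToRemove(deployed, duringFailover, true))
}
```

In this model, removing a healthy mon and then the failed mon would leave two of the original three mons gone in quick succession, which is the quorum risk the fix avoids.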

Checklist:

Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
Reviewed the developer guide on Submitting a Pull Request
Pending release notes updated with breaking and/or notable changes for the next minor release.
Documentation has been updated, if necessary.
Unit tests have been added, if necessary.
Integration tests have been added, if necessary.

When the MDS liveness probe times out, it should not fail the probe. If the cluster has a network partition or other issue that causes the Ceph mon cluster to become unstable, `ceph ...` commands can hang and cause a timeout. In this case, the MDS should not be restarted so as to not cause cascading cluster disruption. Signed-off-by: Blaine Gardner <[email protected]> (cherry picked from commit ad1bae9)

If the mon failover is in progress, ensure the removal of an extra mon deployment is skipped since that code path only has one mon in the list for the mon that was just newly started. The extra mon was erroneously removing a random mon in that case, followed immediately by the mon failover completing and removing the expected failed mon, and potentially causing mon quorum loss. Signed-off-by: Travis Nielsen <[email protected]> (cherry picked from commit e2cadab)

openshift-ci · 2024-10-08T05:28:17Z

@parth-gr: This pull request references Bugzilla bug 2315624, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (ODF 4.17.0) matches configured target release for branch (ODF 4.17.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @nehaberry

In response to this:

Bug 2315624: Fix mds liveness probe and mon failover

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci · 2024-10-08T05:28:20Z

@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: nehaberry.

Note that only red-hat-storage members and repo collaborators can review this PR, and authors cannot review their own PRs.



openshift-ci · 2024-10-08T05:57:20Z

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: parth-gr, sp98

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2024-10-08T06:24:37Z

@parth-gr: All pull requests linked via external trackers have merged:

red-hat-storage/rook#746

Bugzilla bug 2315624 has been moved to the MODIFIED state.


BlaineEXE and others added 2 commits October 8, 2024 10:52

openshift-ci bot added labels Oct 8, 2024: bugzilla/severity-urgent (referenced Bugzilla bug's severity is urgent for the branch this PR is targeting), bugzilla/valid-bug (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting)

agarwal-mudit added labels Oct 8, 2024: approved (indicates a PR has been approved by an approver from all required OWNERS files), lgtm (indicates that a PR is ready to be merged)

sp98 approved these changes Oct 8, 2024

openshift-ci bot assigned sp98 Oct 8, 2024

sp98 merged commit a948c8d into red-hat-storage:release-4.17 Oct 8, 2024
50 of 51 checks passed
