Bug 2315624: Fix mds liveness probe and mon failover #746
When the MDS liveness probe times out, the probe should not be treated as failed. If the cluster has a network partition or another issue that makes the Ceph mon cluster unstable, `ceph ...` commands can hang and cause a timeout. In this case, the MDS should not be restarted, to avoid cascading cluster disruption.
Signed-off-by: Blaine Gardner <[email protected]>
(cherry picked from commit ad1bae9)
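The probe behavior described in this commit can be sketched as follows. This is a minimal illustration in Python, not the code changed by this PR; the function name `mds_liveness`, the constants, and the `cmd`/`timeout_s` parameters are hypothetical. It assumes the probe runs some `ceph ...` command and maps the outcome to an exit code, with a timeout counted as a pass rather than a failure.

```python
import subprocess
import sys

PROBE_OK = 0    # the kubelet leaves the container running
PROBE_FAIL = 1  # the kubelet restarts the container


def mds_liveness(cmd, timeout_s):
    """Run a probe command and map its outcome to a probe exit code.

    A timeout is treated as success: a hung command more likely points to
    an unstable mon cluster than to a dead MDS, and restarting the MDS
    would only add disruption.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The command hung; do not fail the probe.
        return PROBE_OK
    # The command finished in time, so its exit status is meaningful.
    return PROBE_OK if result.returncode == 0 else PROBE_FAIL
```

Only a command that completes within the timeout and reports an error fails the probe; a hang is deliberately not distinguished from success.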
If a mon failover is in progress, skip the removal of extra mon deployments, since that code path only has one mon in its list: the mon that was just started. In that case the extra-mon removal was erroneously removing a random mon, followed immediately by the mon failover completing and removing the expected failed mon, potentially causing mon quorum loss.
Signed-off-by: Travis Nielsen <[email protected]>
(cherry picked from commit e2cadab)
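The guard described in this commit can be sketched as follows. This is a minimal illustration in Python, not the code changed by this PR; the function `extra_mons_to_remove` and its parameters are hypothetical names. It assumes the cleanup step compares the deployed mons against a list of expected mons, and that during failover the expected list holds only the newly started mon.

```python
def extra_mons_to_remove(expected_mons, deployed_mons, failover_in_progress):
    """Return the deployed mons that are safe to remove as extras.

    During a failover the expected list is incomplete (it holds only the
    newly started mon), so every healthy mon would look like an extra.
    In that case nothing is removed; the failover itself removes the
    failed mon when it completes.
    """
    if failover_in_progress:
        return []
    expected = set(expected_mons)
    return [mon for mon in deployed_mons if mon not in expected]


# Failover in progress: the expected list holds only the new mon "d".
print(extra_mons_to_remove(["d"], ["a", "b", "c", "d"], True))

# Normal reconcile: "d" is a genuine leftover.
print(extra_mons_to_remove(["a", "b", "c"], ["a", "b", "c", "d"], False))
```

Without the guard, the first call would report "a", "b", and "c" as extras, which corresponds to the quorum-loss scenario the commit message describes.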
@parth-gr: This pull request references Bugzilla bug 2315624, which is valid. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@openshift-ci[bot]: GitHub didn't allow me to request PR reviews from the following users: nehaberry. Note that only red-hat-storage members and repo collaborators can review this PR, and authors cannot review their own PRs.
[APPROVALNOTIFIER] This PR is APPROVED. Approval requirements bypassed by manually added approval. This pull-request has been approved by: parth-gr, sp98. The full list of commands accepted by this bot can be found here. The pull request process is described here.
@parth-gr: All pull requests linked via external trackers have merged. Bugzilla bug 2315624 has been moved to the MODIFIED state.