The MCD is stuck and unable to recover from file degradations #1443
Comments
openshift-bot: Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale

openshift-bot: Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten

openshift-bot: Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close

openshift-bot: Closing this issue.

yuqi-zhang: /remove-lifecycle rotten
yuqi-zhang: /reopen

openshift-bot: Reopened this issue.

The stale → rotten → close cycle then repeated, the issue was reopened again with /reopen, and it was eventually marked /lifecycle frozen to keep it open.
BUG REPORT INFORMATION
Description
Let's say I have a file at /home/core/test, and I then apply a new MachineConfig that writes a file to /home/core/test/test. Since /home/core/test is a regular file, the MCO correctly catches that it is unable to create a directory there, and degrades.

If I then delete the MachineConfig that introduced this change, the MCC properly detects that the worker pool should go back to targeting the previous rendered MachineConfig. However, the MCD running on the node does not detect this change. It continuously fail-loops on

Marking Degraded due to: failed to create directory "/home/core/test": mkdir /home/core/test: not a directory

marking the node as SchedulingDisabled and failing to make any progress. In fact, since the annotation on the node never gets updated, even deleting the MCD pod doesn't fix the error: the new pod attempts the same update and fails on the same error.

To recover, we have to update the node's desiredConfig annotation by hand to point back at the previous rendered config and then manually oc adm uncordon the node; a rough sketch of that workaround is below. This is obviously not desired behaviour, as we should be able to recover automatically when the MachineConfig is deleted. I've not tested every degrade-recovery scenario, but I remember we were able to recover from some cases before. I will test whether other types of degrades exhibit the same behaviour.
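The manual recovery we have been using looks roughly like the following. This is only a sketch: <node> and the rendered-worker config name are placeholders, and it assumes the machineconfiguration.openshift.io/desiredConfig node annotation and the pool's spec.configuration.name field, which is how the MCD and MCC track the target config.

```sh
# Rough sketch of the manual workaround; <node> and the rendered config name are placeholders.

# What the worker pool targets again after the bad MachineConfig was deleted:
oc get machineconfigpool worker -o jsonpath='{.spec.configuration.name}'

# What the stuck node is still pointing at:
oc get node <node> -o yaml | grep machineconfiguration.openshift.io/desiredConfig

# Hand-edit the node's desiredConfig annotation back to the previous rendered config:
oc patch node <node> --type merge \
  -p '{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":"rendered-worker-<previous>"}}}'

# Finally make the node schedulable again:
oc adm uncordon <node>
```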
Steps to reproduce the issue:

1. On a worker node, create a regular file at /home/core/test.
2. Apply a MachineConfig that writes a file to /home/core/test/test (a hypothetical example is sketched after this list); the MCD degrades as expected.
3. Delete that MachineConfig.
4. Observe that the MCD keeps fail-looping on the same error and the node stays SchedulingDisabled instead of rolling back.
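For concreteness, a MachineConfig along these lines is enough to hit the degrade. This is a hypothetical reproducer rather than the exact one I used: the name and file contents are arbitrary, and the Ignition version shown is the 2.2.0 one that 4.4-era MachineConfigs carry.

```sh
# Hypothetical reproducer: write a file under a path that already exists as a regular file.
cat <<'EOF' | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-file
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - path: /home/core/test/test   # /home/core/test already exists as a file on the node
        filesystem: root
        mode: 420                    # 0644
        contents:
          source: data:,stuck
EOF

# Deleting it again is what should (but currently does not) let the node recover:
oc delete machineconfig 99-worker-test-file
```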
Additional environment details (platform, options, etc.):
Reproduced so far on 4.4 on both Azure and AWS.