The MCD is stuck and unable to recover from file degradations #1443
Comments
openshift-bot: Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale

openshift-bot: Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten

openshift-bot: Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close

openshift-bot: Closing this issue.

yuqi-zhang: /remove-lifecycle rotten
yuqi-zhang: /reopen

openshift-bot: Reopened this issue.

The stale → rotten → close cycle then repeated, the issue was reopened again with /reopen, and it was eventually marked /lifecycle frozen to keep it open.
BUG REPORT INFORMATION
Description
Let's say I have a file at /home/core/test, and I then apply a new MachineConfig that writes a file to /home/core/test/test. Since /home/core/test is a regular file, the MCO correctly catches that it is unable to create a directory there, and degrades.

If I then delete the MachineConfig that introduced this change, the MCC properly detects that the worker pool should go back to targeting the previous rendered MachineConfig. However, the MCD running on the node does not detect this change. It continuously fail-loops on

Marking Degraded due to: failed to create directory "/home/core/test": mkdir /home/core/test: not a directory

marking the node as SchedulingDisabled and failing to make any progress. In fact, since the annotation on the node never gets updated, even deleting the MCD pod doesn't fix the error: the new pod attempts the same update and fails on the same error.

To recover, we have to update the node's desiredConfig annotation by hand to point back at the previous rendered config and then manually oc adm uncordon the node; a rough sketch of that workaround is below. This is obviously not desired behaviour, as we should be able to recover automatically when the MachineConfig is deleted. I've not tested every degrade-recovery scenario, but I remember we were able to recover from some cases before. I will test whether other types of degrades exhibit the same behaviour.
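The manual recovery we have been using looks roughly like the following. This is only a sketch: <node> and the rendered-worker config name are placeholders, and it assumes the machineconfiguration.openshift.io/desiredConfig node annotation and the pool's spec.configuration.name field, which is how the MCD and MCC track the target config.

```sh
# Rough sketch of the manual workaround; <node> and the rendered config name are placeholders.

# What the worker pool targets again after the bad MachineConfig was deleted:
oc get machineconfigpool worker -o jsonpath='{.spec.configuration.name}'

# What the stuck node is still pointing at:
oc get node <node> -o yaml | grep machineconfiguration.openshift.io/desiredConfig

# Hand-edit the node's desiredConfig annotation back to the previous rendered config:
oc patch node <node> --type merge \
  -p '{"metadata":{"annotations":{"machineconfiguration.openshift.io/desiredConfig":"rendered-worker-<previous>"}}}'

# Finally make the node schedulable again:
oc adm uncordon <node>
```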
Steps to reproduce the issue:

1. On a worker node, create a regular file at /home/core/test.
2. Apply a MachineConfig that writes a file to /home/core/test/test (a hypothetical example is sketched after this list); the MCD degrades as expected.
3. Delete that MachineConfig.
4. Observe that the MCD keeps fail-looping on the same error and the node stays SchedulingDisabled instead of rolling back.
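For concreteness, a MachineConfig along these lines is enough to hit the degrade. This is a hypothetical reproducer rather than the exact one I used: the name and file contents are arbitrary, and the Ignition version shown is the 2.2.0 one that 4.4-era MachineConfigs carry.

```sh
# Hypothetical reproducer: write a file under a path that already exists as a regular file.
cat <<'EOF' | oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-file
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - path: /home/core/test/test   # /home/core/test already exists as a file on the node
        filesystem: root
        mode: 420                    # 0644
        contents:
          source: data:,stuck
EOF

# Deleting it again is what should (but currently does not) let the node recover:
oc delete machineconfig 99-worker-test-file
```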
Additional environment details (platform, options, etc.):
Reproduced so far on 4.4 on both Azure and AWS.