allow setting label to nodes about to be upgraded/restarted #3204

ibotty · 2022-06-22T09:18:47Z

Description

Because there is no agreed-upon way to signal to operators that a node is being drained, operators handle it in different ways.
Rook detects node drain by observing pods on the node. This works fine but feels a bit fragile.
The problem is that some operators (e.g. the Zalando PostgreSQL Operator) "detect" drains by watching a node's labels. Whenever a label is no longer set (e.g. "node-ready=true"), the operator will (try to) fail over to another DB pod on another node.
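
To make the label-based detection concrete, here is a minimal sketch of the check such an operator might perform. This is not the Zalando operator's actual code; the function name is hypothetical and the `node-ready=true` label is just the example from the paragraph above.

```python
from typing import Mapping


def should_fail_over(node_labels: Mapping[str, str],
                     key: str = "node-ready",
                     value: str = "true") -> bool:
    """Return True when the readiness label is missing or has a different value.

    Hypothetical helper: the label key and value are the example from this
    issue, not a documented contract of any operator.
    """
    return node_labels.get(key) != value


# A node that still carries the label keeps its DB pod.
print(should_fail_over({"node-ready": "true"}))
# A node whose label was removed triggers a failover attempt.
print(should_fail_over({}))
```

The point is that the operator only reacts to label changes, so something has to change a label before the drain starts.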

This is a feature request to update the node's labels when a reboot is about to happen.

Steps to reproduce the issue:

1. update some machineconfig,
2. observe machine-config-daemon trying to drain a node,
3. failing to drain the node because there is a pdb on a pod on that node,

meanwhile

4. some operator not knowing that the machine is about to be rebooted and not updating the pdb (directly or indirectly.)
5. the node not getting drained.
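
Why the drain gets stuck in step 3 can be sketched with the PDB arithmetic: an eviction is only allowed while the number of healthy pods stays at or above the budget's minimum. The class and function names below are hypothetical, and this is a simplified model that only covers `minAvailable`, not the real eviction API.

```python
from dataclasses import dataclass


@dataclass
class Pdb:
    """Simplified PodDisruptionBudget: only models minAvailable."""
    min_available: int
    current_healthy: int

    def disruptions_allowed(self) -> int:
        # Healthy pods beyond the required minimum may be disrupted.
        return max(0, self.current_healthy - self.min_available)


def try_evict(pdb: Pdb) -> bool:
    """Evict one pod if the budget allows it; return whether it succeeded."""
    if pdb.disruptions_allowed() < 1:
        return False  # the drain keeps retrying and never finishes
    pdb.current_healthy -= 1
    return True


# Single DB pod with minAvailable=1: the eviction is refused.
print(try_evict(Pdb(min_available=1, current_healthy=1)))
# Two healthy pods with minAvailable=1: one eviction is allowed.
print(try_evict(Pdb(min_available=1, current_healthy=2)))
```

In the first case nothing changes until some operator moves the workload or adjusts the pdb, which is exactly what step 4 says does not happen.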

Describe the results you expected:

1. update some machineconfig,
2. machine-config-daemon updating label machineconfiguration.openshift.io/pending-restart=false to =true,
3. a. an operator removes active workload from the node, removing/updating pdbs that affect the node,
   b. machine-config-daemon drains the node,
4. node reboots successfully,
5. machine-config-daemon sets label machineconfiguration.openshift.io/pending-restart=false.
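
The expected sequence can be sketched as a small in-memory simulation. Everything here is an assumption for illustration: the `pending-restart` label is the one proposed in this issue and does not exist in the MCO, the classes and functions are hypothetical, and a real operator would react asynchronously to a watch event rather than being called inline.

```python
from typing import Callable, List

# Label proposed in this issue; not an existing MCO label.
PENDING_RESTART = "machineconfiguration.openshift.io/pending-restart"


class Node:
    def __init__(self, name: str, pods: List[str]) -> None:
        self.name = name
        self.pods = list(pods)
        self.labels = {PENDING_RESTART: "false"}


def operator_reconcile(node: Node) -> List[str]:
    """Hypothetical operator: move workload away once the label flips."""
    if node.labels.get(PENDING_RESTART) != "true":
        return []
    moved = list(node.pods)
    node.pods.clear()
    return moved


def update_node(node: Node, reconcile: Callable[[Node], List[str]]) -> List[str]:
    """Walk through the expected steps and return them as an event log."""
    events = []
    node.labels[PENDING_RESTART] = "true"          # step 2: announce the restart
    events.append("label=true")
    moved = reconcile(node)                        # step 3a: operator reacts
    events.append(f"operator moved {len(moved)} pod(s)")
    if node.pods:                                  # step 3b: drain needs an empty node
        raise RuntimeError("drain blocked: workload still on node")
    events.append("drained")
    events.append("rebooted")                      # step 4
    node.labels[PENDING_RESTART] = "false"         # step 5: clear the announcement
    events.append("label=false")
    return events


node = Node("worker-0", pods=["db-0"])
for event in update_node(node, operator_reconcile):
    print(event)
```

The important property is the ordering: the label changes before the drain starts, so an operator that watches labels gets a chance to act first.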


cgwalters · 2022-06-22T17:21:54Z

Hi, thanks for filing this!

This issue relates to a topic of reboot handling that's ongoing, for which most information/discussion is (AFAIK) sadly trapped in internal-to-RH proprietary systems because staying open requires relentless commitment and we aren't consistent about that.

> machine-config-daemon updating label machineconfiguration.openshift.io/pending-restart=false to =true

I think we should avoid having OpenShift/MCO-specific labels here; we want to interoperate with the rest of the Kubernetes ecosystem.

> Rook detects node drain by observing pods on the node. This works fine but feels a bit fragile.

Note that https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown will make this more reliable and we (OCP) plan to roll that out.

openshift-bot · 2022-09-21T01:00:23Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2022-10-21T08:30:36Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot · 2022-11-21T00:00:25Z

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci · 2022-11-21T00:00:44Z

@openshift-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting /reopen.
> Mark the issue as fresh by commenting /remove-lifecycle rotten.
> Exclude this issue from closing again by commenting /lifecycle frozen.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ibotty · 2022-11-21T07:02:58Z

Still relevant.

And reading https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown another time, I don't see how that will help the use case described above. How can Rook know that the node is about to shut down? The only taint (or annotation) that is described is for **non-**graceful shutdown, which the machine-config-daemon will explicitly not do.

@cgwalters: Do I misunderstand the mechanism?

/remove-lifecycle rotten
/lifecycle frozen

ibotty · 2022-11-21T07:04:01Z

/reopen

openshift-ci · 2022-11-21T07:04:11Z

@ibotty: Reopened this issue.

In response to this:

> /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 21, 2022

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 21, 2022

openshift-ci bot closed this as completed Nov 21, 2022

openshift-ci bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Nov 21, 2022

openshift-ci bot reopened this Nov 21, 2022
