Only DU/DD accepted by the restarted node should be cancelled #8534
Comments
Also @Lyndon-Li, can you confirm: if a new node-agent pod comes up (not a restart) because a new node is created during a backup, will it still cancel? Asking this because we had cancellation of DataUploads and no restart was observed, only new node-agent creation.
We saw the same. We use spot instances. In the logs I only see new nodes and no agent restarts.
I think so; the checks happen when the node-agent starts.
@Lyndon-Li do you have any idea when it will be fixed?
What instigated us seeing this issue: we create backups with

```shell
timeStamp="$(date --utc +%d%m%y%H%M)"
velero backup create "cluster-manager-${timeStamp}" \
  --namespace backup \
  --include-cluster-resources=true \
  --snapshot-move-data \
  --snapshot-volumes=true \
  --wait
```

All objects registered by … Troubleshooting what exactly is going on can be quite hard. However, we noticed the issue I'm commenting on right now, and that made us troubleshoot the node-agent more closely. We noticed: …

Based on the above we paused our node auto-scaling feature and tried creating a Velero backup again. This time => SUCCESS. No restart of the node-agent pods.

Questions: …

We have a workaround. But it would be great to be able to avoid the need for pausing our node auto-scaling. Thank you very much @Lyndon-Li and you other fantastic Velero contributors. Have a great day.
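For anyone trying to confirm whether the cancellations line up with brand-new node-agent pods (from scaled-in nodes) rather than restarted ones, a minimal check is sketched below. It assumes the default `velero` namespace, the node-agent DaemonSet's `name=node-agent` label, and the standard DataUpload status fields; adjust these if your install differs.

```shell
# List node-agent pods with age, restart count and the node they run on.
# A RESTARTS count of 0 plus a young AGE points to a newly created pod on a
# new node rather than a restarted one.
kubectl -n velero get pods -l name=node-agent \
  -o wide --sort-by=.metadata.creationTimestamp

# List DataUploads with their phase and start time so the cancellation
# timing can be correlated with the pod list above.
kubectl -n velero get datauploads \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,STARTED:.status.startTimestamp
```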
"Why would the entire node-agent DaemonSet be restarted because one or more new workers gets scaled in" This shouldn't happen. I wonder if something else is causing those instances to restart. I haven't used autoscaling in my own dev environment, but the last time we had a bug related to node agent and node auto-scaling, the only problems that were observed related to the new node -- other nodes were unaffected. But it's possible something has changed in the codebase since then to affect this behavior. |
When a new worker node is added, below are the currently expected behaviors:

…

@larssb If your case is not like this, please share more info.
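A quick way to observe these steps on a live cluster is to watch the node-agent pods and the DataUpload phases while a node joins. This is only a sketch and assumes the default `velero` namespace and `name=node-agent` label.

```shell
# Watch node-agent pods being scheduled as nodes are added or removed.
kubectl -n velero get pods -l name=node-agent -o wide --watch

# In another terminal, watch DataUpload phase transitions
# (e.g. Accepted -> InProgress, or Accepted -> Canceled).
kubectl -n velero get datauploads --watch
```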
@Lyndon-Li the steps that you describe seem to be exactly the ones we see. Can you elaborate further on why all …?

There may very well be a very logical reason. But without some more info (the reason I'm asking for more details), I'm having a hard time understanding the current set of actions. Thank you very much.
I am not saying the behavior is reasonable, I am just trying to describe the current behavior. This is why we need this issue opened for future enhancements.
No no, I wasn't reading it like that either. Just trying to get some more info/details on the matter. My idea around any new node-agent joining a work resource pool - is that viable? It would be cool, as then any additional … And even in the case that a … Thank you very much.
The current behavior is actually not related to worker node join but to worker node restart: we want to take care of the DU/DD that are affected by the restart.
Okay. Sure, it makes sense to act in the case of a node-agent restarting. Even in the restart case, though, I think it would be great if the jobs were not actually cancelled but instead added back to a …
At present, when a node-agent pod restarts, all DU/DD in Accepted phase are cancelled.

We may be able to enhance this, since as of #8498 we record the accepted node in the DUCR/DDCR. Details:

- If the node-agent pod restarts because of the node-agent itself, only the DU/DD that were accepted by the restarted node-agent pod need to be cancelled.
- If the node-agent pod restarts because the node itself restarts, some other DU/DD in Accepted phase may have created a backupPod/restorePod on the restarted node; once the node restarts, we need to investigate what happens to those pods: …
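To make the proposed scoping concrete, a rough way to see which Accepted DU would be affected is to read the accepted-node information recorded per DataUpload and compare it with the node that restarted. This is only a sketch: it assumes the default `velero` namespace and that #8498 surfaces the accepted node under `status.acceptedByNode` (the actual field name may differ).

```shell
# Hypothetical sketch: list Accepted DataUploads together with the node
# recorded as having accepted them (field name assumed, per #8498).
kubectl -n velero get datauploads \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,ACCEPTED_BY:.status.acceptedByNode \
  | awk 'NR==1 || $2=="Accepted"'

# Only the rows whose ACCEPTED_BY matches the restarted node would need to
# be cancelled, instead of cancelling every DU/DD in Accepted phase.
```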