
Only DU/DD accepted by the restarted node should be cancelled #8534

Open
Lyndon-Li opened this issue Dec 19, 2024 · 12 comments

Comments


Lyndon-Li commented Dec 19, 2024

At present, when a node-agent pod restarts, all DU/DD in the Accepted phase are cancelled.
We may be able to enhance this, since the accepting node has been recorded in the DUCR/DDCR as of #8498. Details:

  1. If the node-agent pod restarts because of node-agent itself, only the DU/DD that were accepted by the restarted node-agent pod need to be cancelled

  2. If the node-agent pod restarts because the node restarted, some other DU/DD in the Accepted phase may have created backupPods/restorePods on the restarted node; once the node restarts, we need to investigate what happens to those pods:

     - If they fail and never recover, we need to cancel those DU/DD
     - If they fail and then recover, we should ignore them and not cancel the DU/DD
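The per-node filtering in item 1 could look like the following sketch. This is a minimal illustration, not Velero's actual code: the type, field names (e.g. AcceptedByNode, standing in for the node recorded per #8498), and function are hypothetical.

```go
package main

import "fmt"

// DataUpload models only the fields relevant here; the real DUCR has many more.
// AcceptedByNode is a hypothetical stand-in for the node name recorded per #8498.
type DataUpload struct {
	Name           string
	Phase          string
	AcceptedByNode string
}

// dusToCancel returns only the Accepted DU/DD that were accepted by the
// restarted node, instead of cancelling every Accepted one.
func dusToCancel(dus []DataUpload, restartedNode string) []DataUpload {
	var out []DataUpload
	for _, du := range dus {
		if du.Phase == "Accepted" && du.AcceptedByNode == restartedNode {
			out = append(out, du)
		}
	}
	return out
}

func main() {
	dus := []DataUpload{
		{Name: "du-1", Phase: "Accepted", AcceptedByNode: "node-a"},
		{Name: "du-2", Phase: "Accepted", AcceptedByNode: "node-b"},
		{Name: "du-3", Phase: "InProgress", AcceptedByNode: "node-a"},
	}
	// Only du-1 matches: Accepted phase AND accepted by the restarted node.
	for _, du := range dusToCancel(dus, "node-a") {
		fmt.Println("cancel:", du.Name)
	}
}
```

Item 2 (node restart rather than node-agent restart) would still need the extra check on backupPod/restorePod recovery before deciding to cancel.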
    

dharanui commented Dec 23, 2024

Also @Lyndon-Li, can you confirm: if a new node-agent pod comes up (not a restart) due to a new node being created during a backup, will it still cancel? Asking because we saw datauploads cancelled with no restart observed, only a new node-agent creation.


SCLogo commented Dec 23, 2024

We saw the same. We use spot instances. In the logs I see only new nodes and no agent restarts.

Lyndon-Li commented:

can you confirm if any new node-agent pod comes up(not restart) due to a new node creation during the time of backup, will it still cancel?

I think so; the check happens when node-agent starts.


SCLogo commented Dec 31, 2024

@Lyndon-Li do you have any idea when it will be fixed?


larssb commented Jan 8, 2025

We started seeing PartiallyFailed backups when executing this velero backup create command line:

timeStamp="$(date --utc +%d%m%y%H%M)"
velero backup create "cluster-manager-${timeStamp}" \
                        --namespace backup \
                        --include-cluster-resources=true \
                        --snapshot-move-data \
                        --snapshot-volumes=true \
                        --wait

All objects registered by Velero to be backed up finish; e.g. we see 10000 out of 10000 backed up at the end of the Velero backup job.
However, a number of the volumes are registered as failed when describing the backup.

Troubleshooting exactly what is going on can be quite hard. However, we noticed the issue I'm commenting on right now, and that made us troubleshoot the node-agent more closely.

We noticed:

  • all node-agents in the node-agent DaemonSet were restarted
  • we created the same backup several times and the backup would PartiallyFail each time, with a different number of failed volume backups
  • then we noticed that our node auto-scaling was triggered because of the Pending Velero CSI DataMover snapshot job pods. This made us think, based on the input in this issue right here, that a newly introduced worker getting a Velero node-agent instance scheduled on it might be the cause

Based on the above we paused our node auto-scaling feature and tried creating a Velero backup again. This time => SUCCESS. No restart of the node-agent DaemonSet.


Questions

  • Why would the entire node-agent DaemonSet be restarted because one or more new workers get scaled in and Velero node-agent instances are assigned to them? That surprises me a great deal.
    • What's the need to restart all node-agent instances on workers that are already running and working?

We have a workaround. But, it would be great to be able to avoid the need for pausing our node auto-scaling.

Thank you very much, @Lyndon-Li and you other fantastic Velero contributors. Have a great day.


sseago commented Jan 8, 2025

"Why would the entire node-agent DaemonSet be restarted because one or more new workers gets scaled in"

This shouldn't happen. I wonder if something else is causing those instances to restart. I haven't used autoscaling in my own dev environment, but the last time we had a bug related to node agent and node auto-scaling, the only problems that were observed related to the new node -- other nodes were unaffected. But it's possible something has changed in the codebase since then to affect this behavior.

Lyndon-Li commented:

When a new worker node is added, these are the currently expected behaviors:

  1. A new node-agent pod is started on the new node
  2. All DataUploads in the Accepted phase will be cancelled, so the backup will be PartiallyFailed
  3. node-agent pods on other nodes should not be affected, unless the nodes are removed and then added back
  4. DataUploads in other phases should not be affected
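Concretely, step 2 amounts to an unconditional cancel of everything still in Accepted when node-agent starts. A minimal sketch of that behavior (hypothetical type and function names, not Velero's actual implementation):

```go
package main

import "fmt"

// DataUpload models only the field relevant to this discussion.
type DataUpload struct {
	Name  string
	Phase string
}

// cancelAllAccepted reflects the current startup behavior described in step 2:
// every DataUpload still in the Accepted phase is cancelled, regardless of
// which node accepted it.
func cancelAllAccepted(dus []DataUpload) []string {
	var cancelled []string
	for _, du := range dus {
		if du.Phase == "Accepted" {
			cancelled = append(cancelled, du.Name)
		}
	}
	return cancelled
}

func main() {
	dus := []DataUpload{
		{Name: "du-1", Phase: "Accepted"},
		{Name: "du-2", Phase: "InProgress"},
		{Name: "du-3", Phase: "Accepted"},
	}
	// du-2 is untouched (step 4); du-1 and du-3 are cancelled (step 2).
	fmt.Println(cancelAllAccepted(dus))
}
```

Because this check runs on every node-agent start, a pod spawned on a brand-new node triggers it just like a restarted pod does, which is why autoscaling can cancel in-flight DataUploads.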

@larssb If your case is not like this, please share more info.


larssb commented Jan 9, 2025

@Lyndon-Li the steps that you describe seem to be exactly the ones we see. Can you elaborate on why all DataUploads in Accepted get cancelled just because a new node-agent joins the party? Why isn't it:

  • "just" joining the party to make itself available for work, or alternatively
  • being ignored by any currently running backup job, so that Accepted-phase DataUploads are NOT cancelled?

There may very well be a logical reason. But without some more info - the reason I'm asking for details - I'm having a hard time understanding the current set of actions.

Thank you very much.

Lyndon-Li commented:

I am not saying the behavior is reasonable; I am just describing the current behavior. This is why we opened this issue for future enhancements.


larssb commented Jan 9, 2025

No no, I wasn't reading it that way either. Just trying to get some more details on the matter. My idea of having any new node-agent join a work resource pool - is that viable? It would be cool, as any additional node-agent arriving later to the party, e.g. because a new worker was scaled in, would provide more resources for the current backup job to finish.

And even in the case of a node-agent instance that has already acknowledged picking up some number of DU or DD jobs and then fails/restarts: why cancel such jobs instead of letting them, now without an owner, be sent back to the pool of jobs for other live and healthy node-agents to pick up?

Thank you very much.

Lyndon-Li commented:

The current behavior is actually not related to worker node join, but to worker node restart --- we want to take care of the DU/DD that are affected by the restart.
We do this check on the edge of node-agent start, so it eventually also affects the worker node join case.


larssb commented Jan 9, 2025

Okay, it sure makes sense to act when a node-agent restarts. But even in the restart case I think it would be great if the jobs were not actually cancelled and instead added back to a node-agent job pool, so that they can be picked up by other healthy node-agent instances.
