Add retry for netlink when it receives an ErrDumpInterrupted #3339

Open. dcoppa wants to merge 5 commits into master.

dcoppa commented Jul 2, 2025

What type of PR is this?

improvement

Which issue does this PR fix?:

Errors like:

"Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b413c572e7bac7a27767123be3da3e031a3ac5b3695efc995c15057996b2c930": plugin type="aws-cni" name="aws-cni" failed (add): add command: failed to setup network: SetupPodNetwork: failed to setup veth pair: failed to setup veth network: setup NS network: failed while waiting for v6 addresses to be stable: could not list addresses: results may be incomplete or inconsistent"

What does this PR do / Why do we need it?:

This PR adds a retry mechanism when netlink fails with error ErrDumpInterrupted.
Adapted from https://github.com/containernetworking/plugins/blob/main/pkg/netlinksafe/netlink.go

Testing done on this change:

Unit tests passed:

PASS
coverage: 100.0% of statements
ok  	github.com/aws/amazon-vpc-cni-k8s/pkg/vpc	0.003s	coverage: 100.0% of statements

Also tested on my sandbox EKS cluster.

Will this PR introduce any new dependencies?:

No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:

No

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

Implement a retry mechanism for netlink that, on an IPv6 EKS cluster, should fix (or mitigate) the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e868238d122df610228688f51fee6a187544a45ce43a9d613307b6b08246e65a": plugin type="aws-cni" name="aws-cni" failed (add): add command: failed to setup network: SetupPodNetwork: failed to setup veth pair: failed to setup veth network: setup NS network: failed while waiting for v6 addresses to be stable: could not list addresses: results may be incomplete or inconsistent

No additional actions from users required.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@dcoppa dcoppa requested a review from a team as a code owner July 2, 2025 12:19

yash97 commented Jul 7, 2025

Thanks for contributing this change. Can you make sure that the netlink wrapper is always used in the codebase? For this, we need to change every place where vishvananda/netlink is used directly so that it goes through this wrapper. Also, please add a linter rule to prevent anyone from using that library directly in the future.
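
To illustrate the kind of check such a linter rule would perform, here is a small sketch that uses only the Go standard library to detect a direct import of `github.com/vishvananda/netlink` in a source file. The function name `importsRestricted` and the sample file are hypothetical; in practice an existing linter's import-restriction rule would probably be a better fit than a custom tool.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// restrictedImport is the package that should only be imported by the wrapper.
const restrictedImport = "github.com/vishvananda/netlink"

// importsRestricted reports whether the Go source in src imports
// restrictedImport directly. Only the import section is parsed.
func importsRestricted(filename, src string) (bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range file.Imports {
		// Import paths are stored as quoted string literals.
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return false, err
		}
		if path == restrictedImport {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Hypothetical file that bypasses the wrapper.
	src := `package driver

import "github.com/vishvananda/netlink"
`
	found, err := importsRestricted("driver.go", src)
	fmt.Println(found, err)
}
```

A real check would walk the repository and skip the wrapper package itself, since that is the one place where the direct import is expected.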

links, err = netlink.LinkList()
return err
})
return links, discardErrDumpInterrupted(err)

yash97 commented Jul 11, 2025

What is the rationale behind discarding this error type?
If we already retried and the error still occurred, shouldn't the user be informed about it? I understand that we’re logging the error, but the user won’t see that. Also, when this error occurs, are the links stale or simply empty?
