[Bug] AL2023 AMI breaks toleration support

An earlier PR added support for scheduling the device plugin DaemonSet pods on tainted GPU nodes: https://github.com/eksctl-io/eksctl/pull/5345/files

It looks like this logic only runs for nodegroups whose AMI family is AL2: https://github.com/eksctl-io/eksctl/blob/208dee7e282012aa8199496412c03a1701f0229c/pkg/addons/device_plugin.go#L225-L244

As a result, this feature breaks when a cluster is upgraded to AL2023.

I think the fix is simply to extend the allow list of AMI families to include AL2023. In the meantime, anyone upgrading to AL2023 should probably turn off the device plugin installation and deploy the DaemonSet manually.

	for _, ng := range n.spec.NodeGroups {
		if api.HasInstanceType(ng, instance.IsNvidiaInstanceType) &&
			ng.GetAMIFamily() == api.NodeImageFamilyAmazonLinux2 {
			for _, taint := range ng.Taints {
				if _, ok := taints[taint.Key]; !ok {
					taints[taint.Key] = taint
				}
			}
		}
	}
	for _, ng := range n.spec.ManagedNodeGroups {
		if api.HasInstanceTypeManaged(ng, instance.IsNvidiaInstanceType) &&
			ng.GetAMIFamily() == api.NodeImageFamilyAmazonLinux2 {
			for _, taint := range ng.Taints {
				if _, ok := taints[taint.Key]; !ok {
					taints[taint.Key] = taint
				}
			}
		}
	}
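
For illustration, here is a rough, self-contained sketch of what extending the allow list might look like. It does not use the real eksctl packages: the `nodeGroup` and `taint` types, the `GPU` field, the `supportsGPUTolerations` helper, and the AMI family string values are all simplified stand-ins I made up for this example, so the actual patch would need to use the real `api` constants and nodegroup types.

```go
package main

import "fmt"

// Stand-in values for eksctl's AMI family constants (assumed, not imported).
const (
	familyAL2    = "AmazonLinux2"
	familyAL2023 = "AmazonLinux2023"
)

// taint and nodeGroup are simplified, hypothetical stand-ins for the eksctl types.
type taint struct {
	Key, Value, Effect string
}

type nodeGroup struct {
	AMIFamily string
	GPU       bool // stands in for the NVIDIA instance type check
	Taints    []taint
}

// supportsGPUTolerations is a hypothetical helper that replaces the
// single equality check against AL2 with a small allow list.
func supportsGPUTolerations(family string) bool {
	switch family {
	case familyAL2, familyAL2023:
		return true
	}
	return false
}

// collectTaints keeps the first taint seen per key across eligible GPU nodegroups,
// mirroring the loop body in the excerpt above.
func collectTaints(ngs []nodeGroup) map[string]taint {
	taints := make(map[string]taint)
	for _, ng := range ngs {
		if !ng.GPU || !supportsGPUTolerations(ng.AMIFamily) {
			continue
		}
		for _, t := range ng.Taints {
			if _, ok := taints[t.Key]; !ok {
				taints[t.Key] = t
			}
		}
	}
	return taints
}

func main() {
	ngs := []nodeGroup{
		{AMIFamily: familyAL2023, GPU: true, Taints: []taint{{Key: "nvidia.com/gpu", Value: "true", Effect: "NoSchedule"}}},
		{AMIFamily: "Bottlerocket", GPU: true, Taints: []taint{{Key: "other", Effect: "NoSchedule"}}},
	}
	fmt.Println(len(collectTaints(ngs)))
}
```

Pulling the family check into one helper would also let both the `NodeGroups` and `ManagedNodeGroups` loops share the same allow list, so adding another family later is a one-line change.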

[Bug] AL2023 AMI breaks toleration support #8550
