[Bug] AL2023 AMI breaks toleration support #8550

@agoose77

Description

An earlier PR added support for letting the device plugin daemonset pods schedule on tainted GPU nodes: https://github.com/eksctl-io/eksctl/pull/5345/files

It looks like this logic only runs against AL2 images:

	for _, ng := range n.spec.NodeGroups {
		if api.HasInstanceType(ng, instance.IsNvidiaInstanceType) &&
			ng.GetAMIFamily() == api.NodeImageFamilyAmazonLinux2 {
			for _, taint := range ng.Taints {
				if _, ok := taints[taint.Key]; !ok {
					taints[taint.Key] = taint
				}
			}
		}
	}
	for _, ng := range n.spec.ManagedNodeGroups {
		if api.HasInstanceTypeManaged(ng, instance.IsNvidiaInstanceType) &&
			ng.GetAMIFamily() == api.NodeImageFamilyAmazonLinux2 {
			for _, taint := range ng.Taints {
				if _, ok := taints[taint.Key]; !ok {
					taints[taint.Key] = taint
				}
			}
		}
	}

As a result, once a cluster's GPU nodegroups move to AL2023, their taints are no longer collected by this logic and the feature breaks.

I think the fix is simply to extend the allow list of AMI families to include AL2023. In the meantime, people upgrading to AL2023 should probably turn off the device plugin installation and deploy the daemonset manually.
