Description
A previous PR allowed daemonset pods to schedule on GPU nodes with taints: https://github.com/eksctl-io/eksctl/pull/5345/files
It looks like this logic only runs for nodegroups using the AL2 AMI family:
eksctl/pkg/addons/device_plugin.go
Lines 225 to 244 in 208dee7
```go
for _, ng := range n.spec.NodeGroups {
	if api.HasInstanceType(ng, instance.IsNvidiaInstanceType) &&
		ng.GetAMIFamily() == api.NodeImageFamilyAmazonLinux2 {
		for _, taint := range ng.Taints {
			if _, ok := taints[taint.Key]; !ok {
				taints[taint.Key] = taint
			}
		}
	}
}
for _, ng := range n.spec.ManagedNodeGroups {
	if api.HasInstanceTypeManaged(ng, instance.IsNvidiaInstanceType) &&
		ng.GetAMIFamily() == api.NodeImageFamilyAmazonLinux2 {
		for _, taint := range ng.Taints {
			if _, ok := taints[taint.Key]; !ok {
				taints[taint.Key] = taint
			}
		}
	}
}
```
As such, this feature breaks when a cluster is upgraded to AL2023: the AMI family check no longer matches, so the taints on GPU nodegroups are skipped.
I think the fix is simply to extend the allow list of AMI families to include AL2023. In the meantime, people upgrading to AL2023 should probably turn off the device plugin installation and deploy the daemonset manually.
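To illustrate the proposed fix, here is a minimal, self-contained sketch of the allow-list approach. It is not the eksctl code. `Taint`, `NodeGroup`, `gpuTaintAMIFamilies`, `hasNvidiaInstanceType`, and `collectGPUTaints` are hypothetical stand-ins, the AMI family strings are assumed values, and the instance-type check is a crude prefix match rather than the real `instance.IsNvidiaInstanceType`.

```go
package main

import "strings"

// Taint is a simplified stand-in for the taint type in eksctl's config API.
type Taint struct {
	Key, Value, Effect string
}

// NodeGroup is a hypothetical, pared-down nodegroup with only the
// fields this sketch needs.
type NodeGroup struct {
	AMIFamily     string
	InstanceTypes []string
	Taints        []Taint
}

// gpuTaintAMIFamilies is the proposed allow list: AMI families whose GPU
// nodegroup taints should be collected. The string values are assumptions.
var gpuTaintAMIFamilies = map[string]bool{
	"AmazonLinux2":    true,
	"AmazonLinux2023": true, // the addition suggested in this issue
}

// hasNvidiaInstanceType is a crude stand-in for the real instance-type
// check; it only recognizes a few well-known NVIDIA GPU families.
func hasNvidiaInstanceType(ng NodeGroup) bool {
	for _, t := range ng.InstanceTypes {
		for _, prefix := range []string{"p3.", "p4d.", "g4dn.", "g5."} {
			if strings.HasPrefix(t, prefix) {
				return true
			}
		}
	}
	return false
}

// collectGPUTaints gathers the taints of every GPU nodegroup whose AMI
// family is on the allow list, keeping the first taint seen for each key.
func collectGPUTaints(nodeGroups []NodeGroup) map[string]Taint {
	taints := make(map[string]Taint)
	for _, ng := range nodeGroups {
		if !hasNvidiaInstanceType(ng) || !gpuTaintAMIFamilies[ng.AMIFamily] {
			continue
		}
		for _, taint := range ng.Taints {
			if _, ok := taints[taint.Key]; !ok {
				taints[taint.Key] = taint
			}
		}
	}
	return taints
}
```

With a map-based allow list, supporting another AMI family later would be a one-line change, and the same check could be shared between the self-managed and managed nodegroup loops.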