-
Notifications
You must be signed in to change notification settings - Fork 444
Add validation for unsupported DRA features #7226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: harche The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @harche. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
/ok-to-test /cc @alaypatel07 |
Yes I'll take a look. |
c524414
to
4318946
Compare
pkg/dra/claims.go
Outdated
var dcName string | ||
var q int64 | ||
if req.Exactly != nil { | ||
// Check for CEL selectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please convert the entire if condition into a switch case handling all the unsupported features first before considering the valid supported case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harche why does this have to be nested ifs and switch cases? Can we work out a logic with one single switch case?
For example:
switch {
case req.FirstAvailable != nil:
case req.Exactly != nil && len(req.Exactly.Selectors) > 0:
// error
case req.Exactly != nil && req.Exactly.AdminAccess != nil && *req.Exactly.AdminAccess:
// error
case req.Exactly != nil && req.Exactly.AllocationMode == DeviceAllocationModeAll:
// error
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alaypatel07 I didn't want to repeat req.Exactly != nil
, would something like following do?
switch {
case req.FirstAvailable != nil:
return nil, fmt.Errorf("%w: FirstAvailable device selection is not supported", ErrUnsupportedDRAFeature)
case req.Exactly == nil:
continue
case len(req.Exactly.Selectors) > 0:
return nil, fmt.Errorf("%w: CEL selectors are not supported", ErrUnsupportedDRAFeature)
case req.Exactly.AdminAccess != nil && *req.Exactly.AdminAccess:
return nil, fmt.Errorf("%w: AdminAccess is not supported", ErrUnsupportedDRAFeature)
case req.Exactly.AllocationMode == resourcev1.DeviceAllocationModeAll:
return nil, fmt.Errorf("%w: AllocationMode 'All' is not supported", ErrUnsupportedDRAFeature)
case req.Exactly.AllocationMode == resourcev1.DeviceAllocationModeExactCount:
dcName = req.Exactly.DeviceClassName
q = req.Exactly.Count
case req.Exactly.AllocationMode == "":
dcName = req.Exactly.DeviceClassName
default:
return nil, fmt.Errorf("%w: unsupported allocation mode: %s", ErrUnsupportedDRAFeature, req.Exactly.AllocationMode)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, let me know if that works, but I am open to changing it to repeated req.Exactly != nil
.
pkg/dra/claims.go
Outdated
ErrDeviceClassNotMapped = errors.New("DeviceClass is not mapped in DRA configuration") | ||
ErrResourceClaimInUse = errors.New("ResourceClaim is in use") | ||
ErrClaimSpecNotFound = errors.New("failed to get claim spec") | ||
ErrDeviceClassNotMapped = errors.New("DeviceClass is not mapped in DRA configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we use a common error "UnsupportedDRAfeature" and then wrap a human readable error message to be returned from the function later.
I think we only need to have the following concrete error types:
- ErrResourceClaimInUse and other
- ErrClaimSpecNotFound
- ErrUnsupportedDRAfeature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed, thanks.
}, | ||
want: map[kueuev1beta1.PodSetReference]corev1.ResourceList{"main": {"res-1": resource.MustParse("2")}}, | ||
}, | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also update the integration tests for these errors at https://github.com/kubernetes-sigs/kueue/tree/main/test/integration/singlecluster/controller/dra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added integration tests.
4318946
to
006e7b3
Compare
/test pull-kueue-test-e2e-multikueue-main |
@alaypatel07 I have updated the PR. Can you take another look? Thanks. |
wl := utiltesting.MakeWorkload("test-wl-all-mode", ns.Name). | ||
Queue("test-lq"). | ||
Obj() | ||
wl.Spec.PodSets[0].Template.Spec.ResourceClaims = []corev1.PodResourceClaim{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use this opportunity to start using, https://github.com/kubernetes-sigs/kueue/blob/main/pkg/util/testing/wrappers.go#L623 and https://github.com/kubernetes-sigs/kueue/blob/main/pkg/util/testing/wrappers.go#L637?
I think this change is needed across all the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
|
||
ginkgo.It("Should reject workload with AllocationMode 'All'", func() { | ||
ginkgo.By("Creating a ResourceClaimTemplate with AllocationMode All") | ||
rct := &resourcev1.ResourceClaimTemplate{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can create a ResourceClaimWrapper to produce these templates/claims in utiltesting:
kueue/pkg/util/testing/wrappers.go
Line 422 in 5a6b10a
type PodSetWrapper struct{ kueue.PodSet } |
it will be helpful in unit testing as well as integration tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a great idea, but should we tackle this outside this PR? I can create a separate issue to look into it. I am afraid it might escalate the scope of this PR (which is just to harden the validation around unsupported DRA fields/features)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. IMHO, since this PR is adding all the permutations of creating resource slices with different features, this is the right moment to add the wrapper. I dont really think it expands the scope. Thanks for tackling that.
e6127d8
to
03ca256
Compare
pkg/dra/claims.go
Outdated
q = req.Exactly.Count | ||
} | ||
q = req.Exactly.Count | ||
case req.Exactly.AllocationMode == "": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is not needed, Kueue should assume that the defaults are set appropriately during resource admission: https://github.com/kubernetes/kubernetes/blob/f306bb14b4a288cd04c6ddad65a5a91bc1f665f8/pkg/apis/resource/v1/defaults.go#L32-L37.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed, thanks.
|
||
// ResourceClaimTemplateWrapper wraps a resourcev1.ResourceClaimTemplate | ||
type ResourceClaimTemplateWrapper struct { | ||
resourcev1.ResourceClaimTemplate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Instead of having separate structs for RCT and RC, there should be a common struct for ResourceClaimSpec and the With* functions use that struct to generate the spec. I think will remove a lot of duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Between ResourceClaimTemplateWrapper
and ResourceClaimWrapper
, there's now a shared ResourceClaimSpecBuilder that both wrappers use, let me know if that aligns with what you had in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @alaypatel07
1f307fd
to
33f5c56
Compare
Please add a release note. |
Thanks, done. |
@harche I'm falling behind on reviews due to the upcoming KEP freeze. I'll take a look first thing tomorrow morning. |
No worries @alaypatel07 , take your time. This can wait until the KEP freeze deadline passes. Thanks for your valuable feedback. |
resourceClaims: []*resourcev1.ResourceClaim{ | ||
utiltesting.MakeResourceClaim("rc1", "ns"). | ||
DeviceRequest("device-request", "gpu.example.com", 1). | ||
Obj(), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually could you please send a preaparatory non-functional PR just for the cleanup?
So that we don't mix cleanups with features or bugfixes.
It will make the diff here smaller for the ease of reviewing / focusing on the user-changing experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mimowo, I have brought this up earlier, #7226 (comment)
@alaypatel07 are you okay with splitting the refactoring of tests utils from this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alaypatel07 and @mimowo, I have raised #7300 to add wrapper for existing tests cases only.
We will deal with this PR once that one is merged.
/hold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged the other PR already, please rebase, let me unhold in the meanwhile, it will not merge anyway while not rebased.
/unhold
This commit introduces test helper wrappers for DRA ResourceClaim and ResourceClaimTemplate objects to improve test readability and reduce boilerplate. Related to PR kubernetes-sigs#7226 review comment: kubernetes-sigs#7226 (comment)
33f5c56
to
5f56b50
Compare
pkg/dra/claims_test.go
Outdated
if dc == "test-deviceclass-1" { | ||
return "res-1", true | ||
} | ||
return "", false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very verbose and repeatable, can we replace it somehow with some default lookups maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a common defaultLookup
function
pkg/dra/claims_test.go
Outdated
} | ||
return "", false | ||
}, | ||
wantErr: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of asserting true/false it is better to assert on specific validation error, please check https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/jobset/jobset_webhook_test.go#L49
I know it was like this before, but I would like to use it as an occasion to cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated to assert exact error.
5f56b50
to
8d24bce
Compare
/hold just like unit tests, the integration tests should also check for exact error. |
Signed-off-by: Harshal Patil <[email protected]>
8d24bce
to
8de8f8c
Compare
updated. /unhold |
ErrResourceClaimInUse = errors.New("ResourceClaim is in use") | ||
ErrClaimSpecNotFound = errors.New("failed to get claim spec") | ||
ErrUnsupportedDRAFeature = errors.New("unsupported DRA feature") | ||
|
||
// Specific unsupported DRA feature errors | ||
ErrUnsupportedDRADeviceConstraints = errors.New("device constraints (MatchAttribute) are not supported") | ||
ErrUnsupportedDRADeviceConfig = errors.New("device config is not supported") | ||
ErrUnsupportedDRAFirstAvailable = errors.New("FirstAvailable device selection is not supported") | ||
ErrUnsupportedDRACELSelectors = errors.New("CEL selectors are not supported") | ||
ErrUnsupportedDRAAdminAccess = errors.New("AdminAccess is not supported") | ||
ErrUnsupportedDRAAllocationModeAll = errors.New("AllocationMode 'All' is not supported") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these need to be exposed? If not then unexport (lowercase)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are used in integration tests, besides that, I am not sure having errors exported causes any harm, future callers might be able to verify if the errors belongs to any of the DRA related errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can always export them later if needed. If they are unexported it is immediately obvious which are used or not outside the package for example during review in GH. Also, sometimes exported constants become unused but linters don't complain, so making them unused is easy to miss during review/ code changing.
if tc.wantErr != nil { | ||
if err == nil { | ||
t.Fatalf("expected error but got none") | ||
} | ||
if !errors.Is(err, tc.wantErr) { | ||
t.Fatalf("unexpected error: got=%v, want=%v", err, tc.wantErr) | ||
} | ||
return | ||
} | ||
if err != nil { | ||
t.Fatalf("unexpected error: %v", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to assert errors as we do here, using cmp.Diff:
kueue/pkg/controller/jobs/jobset/jobset_webhook_test.go
Lines 273 to 291 in 1fa1463
wantErr: field.ErrorList{ | |
field.Invalid(field.NewPath("spec.replicatedJobs[0].template.spec.template.metadata.annotations"), field.OmitValueType{}, "must specify 'kueue.x-k8s.io/podset-required-topology' or 'kueue.x-k8s.io/podset-preferred-topology' topology consistent with 'spec.replicatedJobs[1].template.spec.template.metadata.annotations' in group 'groupname'"), | |
field.Invalid(field.NewPath("spec.replicatedJobs[1].template.spec.template.metadata.annotations"), field.OmitValueType{}, "must specify 'kueue.x-k8s.io/podset-required-topology' or 'kueue.x-k8s.io/podset-preferred-topology' topology consistent with 'spec.replicatedJobs[0].template.spec.template.metadata.annotations' in group 'groupname'"), | |
}.ToAggregate(), | |
topologyAwareScheduling: true, | |
}, | |
} | |
for _, tc := range testcases { | |
t.Run(tc.name, func(t *testing.T) { | |
features.SetFeatureGateDuringTest(t, features.TopologyAwareScheduling, tc.topologyAwareScheduling) | |
jsw := &JobSetWebhook{} | |
ctx, _ := utiltesting.ContextWithLog(t) | |
_, gotErr := jsw.ValidateCreate(ctx, tc.job) | |
if diff := cmp.Diff(tc.wantErr, gotErr); diff != "" { | |
t.Errorf("validateCreate() mismatch (-want +got):\n%s", diff) | |
} |
Would it work here? I'm not sure maybe it works better if the validation errors are using field.ErrorList type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current tests use errors.Is
, this is exactly what we want. IMO, achieving this in any other way would not be more optimal than simply using errors.Is
.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR adds validation to reject DRA (Dynamic Resource Allocation) v1 API features that are enabled
by default in Kubernetes 1.34 but explicitly excluded from Kueue's DRA support scope per KEP-2941.
Without validation, Kueue silently ignores unsupported DRA features, leading to:
runtime failures
errors instead of clear feature-not-supported messages
unsupported features
Solution is to add early validation in pkg/dra/claims.go to explicitly reject all 6 enabled-by-default DRA features
that Kueue doesn't support:
GA Features:
Beta Features (enabled by default in K8s 1.34):
Each feature now returns a clear, actionable error message like "FirstAvailable device selection is
not supported" instead of misleading DeviceClass mapping errors.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?