fix: Enforce single ML policy constraint with CEL validation for Torch, MPI, and JAX #3225
Pull request overview
This PR tightens TrainingRuntime ML policy validation to prevent configuring multiple incompatible runtime policies (Torch/MPI/JAX) at the same time, and aligns PlainML fallback behavior with that constraint.
Changes:
- Updated CRD CEL validation to enforce “at most one of torch/mpi/jax is set”.
- Updated PlainML's `EnforceMLPolicy` to no-op when a JAX policy is configured (matching existing Torch/MPI behavior).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/runtime/framework/plugins/plainml/plainml.go | Extends PlainML's fallback guard to treat JAX as an explicitly selected runtime policy (so PlainML won't apply). |
| pkg/apis/trainer/v1alpha1/trainingruntime_types.go | Replaces pairwise Torch/MPI exclusion with a single CEL rule that limits the number of configured ML policies to 1 across Torch/MPI/JAX. |
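The PlainML fallback guard described in the table can be sketched roughly as follows. This is a minimal illustration, not the code from `plainml.go`: the types `MLPolicy`, `TorchPolicy`, `MPIPolicy`, `JAXPolicy` and the helper `plainMLApplies` are hypothetical, simplified stand-ins, and treating a nil policy as "nothing to enforce" is an assumption.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the TrainingRuntime ML policy types.
// The real API types carry more fields; only presence matters for this sketch.
type TorchPolicy struct{}
type MPIPolicy struct{}
type JAXPolicy struct{}

type MLPolicy struct {
	Torch *TorchPolicy
	MPI   *MPIPolicy
	JAX   *JAXPolicy
}

// plainMLApplies reports whether the PlainML fallback should act on the policy.
// It returns false as soon as any explicit runtime policy (Torch, MPI, or JAX)
// is set, so PlainML becomes a no-op in those cases.
func plainMLApplies(p *MLPolicy) bool {
	if p == nil {
		// Assumption: with no ML policy at all there is nothing to enforce.
		return false
	}
	return p.Torch == nil && p.MPI == nil && p.JAX == nil
}

func main() {
	fmt.Println(plainMLApplies(&MLPolicy{}))                  // true
	fmt.Println(plainMLApplies(&MLPolicy{JAX: &JAXPolicy{}})) // false
}
```

Before this change, the equivalent guard would have looked only at Torch and MPI, so a JAX-only policy would also have been picked up by PlainML.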
aa6bd0c to c22b56c
andreyvelich
left a comment
Thank you @Krishna-kg732!
/lgtm
/approve
ae0df7a to 3ff5704
astefanutti
left a comment
Thanks @Krishna-kg732!
/lgtm
/retest
f61cb55 to 19c1a86
@Krishna-kg732 Please rebase your PR.
19c1a86 to af7a30e
One more rebase is needed @Krishna-kg732.
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
af7a30e to 6c0fda0
andreyvelich
left a comment
Thank you for this @Krishna-kg732!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
…h, MPI, and JAX (kubeflow#3225)
* fix: enforce single ML policy constraint with CEL validation
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
* added plainML fallback test case
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
* added autogenerated files
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
* added integration tests
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
* bumped the version in charts to fix ci
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
* added autogenerated file
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
* chore: bump version
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
---------
Signed-off-by: krishna-kg732 <krishnagupta.kg2k6@gmail.com>
What this PR solves
This PR fixes a validation issue where multiple ML runtime policies (Torch, MPI, JAX) could be configured simultaneously in a TrainingRuntime, leading to conflicting runtime configurations.
The previous validation rule was `!(has(self.torch) && has(self.mpi))`, which only prevented Torch and MPI from being set together and did not account for combinations involving JAX. This allowed invalid configurations where users could set multiple incompatible runtime policies.
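To make the gap concrete, the following Go sketch mirrors the previous CEL rule and the replacement rule given under Solution as plain boolean functions, then enumerates every combination of set policies. The function names `oldRule`, `newRule`, and `missedByOldRule` are hypothetical and exist only for this illustration; the real validation runs as CEL in the API server, not as Go code.

```go
package main

import "fmt"

// oldRule mirrors the CEL expression !(has(self.torch) && has(self.mpi)).
// The jax argument is deliberately ignored, as it was in the old rule.
func oldRule(torch, mpi, jax bool) bool {
	return !(torch && mpi)
}

// newRule mirrors
// [has(self.torch), has(self.mpi), has(self.jax)].filter(x, x).size() <= 1.
func newRule(torch, mpi, jax bool) bool {
	set := 0
	for _, present := range []bool{torch, mpi, jax} {
		if present {
			set++
		}
	}
	return set <= 1
}

// missedByOldRule lists the combinations the old rule accepted
// but the new rule rejects.
func missedByOldRule() []string {
	var missed []string
	bools := []bool{false, true}
	for _, torch := range bools {
		for _, mpi := range bools {
			for _, jax := range bools {
				if oldRule(torch, mpi, jax) && !newRule(torch, mpi, jax) {
					missed = append(missed,
						fmt.Sprintf("torch=%t mpi=%t jax=%t", torch, mpi, jax))
				}
			}
		}
	}
	return missed
}

func main() {
	for _, combo := range missedByOldRule() {
		fmt.Println(combo)
	}
	// Prints:
	// torch=false mpi=true jax=true
	// torch=true mpi=false jax=true
}
```

The two printed combinations (MPI with JAX, and Torch with JAX) are exactly the cases the pairwise rule let through.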
Solution
- Added comprehensive CEL validation: Updated the validation rule to `[has(self.torch), has(self.mpi), has(self.jax)].filter(x, x).size() <= 1`, which allows at most one of Torch, MPI, and JAX to be set.
- Updated PlainML plugin: Modified the `EnforceMLPolicy` function to check for a JAX policy alongside Torch and MPI, ensuring PlainML only applies when no other runtime policy is active.
Testing
Mentioned in PR #3200.