
feat(api): Replace PodTemplateOverrides with RuntimePatches API#3199

Closed
andreyvelich wants to merge 3 commits intokubeflow:masterfrom
andreyvelich:template-override-api

Conversation

Member

@andreyvelich andreyvelich commented Feb 10, 2026

This breaking change replaces the PodTemplateOverrides API with the RuntimePatches API.

We would like to group patches by manager for clear ownership boundaries.

This PR updates KEP, APIs, and implementation.

Related: #3020

Copilot AI review requested due to automatic review settings February 10, 2026 13:58
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a breaking API change to replace PodTemplateOverrides with manager-scoped TemplateOverrides, aiming to give each override a clear owner among controllers and users.

Changes:

  • Replaces TrainJobSpec.PodTemplateOverrides with TrainJobSpec.TemplateOverrides keyed by manager.
  • Introduces new API types for TemplateOverride, including job-level and pod-level override histories.
  • Updates the v2 proposal/KEP documentation to describe the new API shape and examples.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
pkg/apis/trainer/v1alpha1/trainjob_types.go Updates TrainJob API types to add manager-keyed TemplateOverrides and new override structs.
docs/proposals/2170-kubeflow-trainer-v2/README.md Updates the proposal to document TemplateOverrides, including rationale and YAML examples.

Comment on lines +799 to +804
// JobTemplateOverride represents a custom override that will be applied to the JobTemplateSpec
type JobTemplateOverride struct {
// Time is the timestamp of when the JobTemplateOverride entry was added.
// +required
Time metav1.Time `json:"time,omitempty"`

Copilot AI Feb 10, 2026

In the docs code snippet, Time is marked as +required and uses a non-pointer metav1.Time, which doesn’t match the actual API types in pkg/apis/trainer/v1alpha1/trainjob_types.go where Time is optional (*metav1.Time).

Comment on lines +268 to +276
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
Copilot AI Feb 10, 2026

JobTemplateOverrides/PodTemplateOverrides are declared as +listType=map with +listMapKey=time, but the keyed field Time is optional (*metav1.Time), which makes the map key potentially unset and breaks map-list semantics (unique/stable keys for merge/validation).

Suggested change
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`
// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"

Comment on lines +299 to +305
TargetJobs []TemplateOverrideTargetJob `json:"targetJobs,omitempty"`

// metadata overrides the Job template metadata or JobSet metadata.
// If targetJobs is specified, these values are merged with the specific ReplicatedJob's Job template metadata.
// If targetJobs is empty, these values are merged with the JobSet object metadata.
// +optional
Metadata *metav1.ObjectMeta `json:"metadata,omitempty"`
Member Author

@andreyvelich andreyvelich Feb 10, 2026

I’ve been thinking that we could start using JobTemplateOverride instead of the dedicated Labels and Annotations fields we currently expose in the TrainJob.spec API.

The idea would be:

  • If targetJob is omitted, the override is applied to the JobSet
  • If targetJob is set, the override is applied to the specific Job

One concern is that once we introduce JobTemplateSpecOverride, it could potentially contain fields relevant to both Job and JobSet, which may introduce ambiguity. I’m not entirely sure what the better way to handle that would be, though I also don’t see a clearly better alternative at the moment.

@tenzen-y @kaisoz @mimowo @astefanutti @kannon92 , I’d really appreciate your thoughts on this approach.

Comment on lines +310 to +313
// Time is the timestamp of when the JobTemplateOverride entry was added. If value is omitted,
// controller defaults this value to the current timestamp.
// +optional
Time *metav1.Time `json:"time,omitempty"`
Member Author

Time will be set server-side by the Trainer mutating admission webhook when a TrainJob is created or updated.
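A minimal sketch of that defaulting behavior, assuming a simplified JobTemplateOverride shape (the real webhook operates on the full API types inside the admission request):

```go
package main

import (
	"fmt"
	"time"
)

// JobTemplateOverride mirrors the shape under discussion: Time is optional
// in the request and defaulted server-side.
type JobTemplateOverride struct {
	Time *time.Time
}

// defaultTimes sketches what the mutating webhook could do on create/update:
// fill in any unset Time with the admission timestamp.
func defaultTimes(overrides []JobTemplateOverride, now time.Time) {
	for i := range overrides {
		if overrides[i].Time == nil {
			overrides[i].Time = &now
		}
	}
}

func main() {
	now := time.Date(2026, 2, 17, 10, 0, 0, 0, time.UTC)
	overrides := []JobTemplateOverride{{}} // entry submitted without a time
	defaultTimes(overrides, now)
	fmt.Println(overrides[0].Time.Format(time.RFC3339))
}
```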

Comment on lines +270 to +277
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
PodTemplateOverrides []PodTemplateOverride `json:"pod,omitempty"`
Member Author

Do you prefer pod and job or podTemplateOverrides and jobTemplateOverrides?

@andreyvelich
Member Author

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @akshaychitneni

Comment on lines +315 to +318
// templateOverrides defines template overrides that will be applied to the TrainJob's training runtime template.
// +listType=map
// +listMapKey=manager
TemplateOverrides []TemplateOverride `json:"templateOverrides,omitempty"`
Member

As I mentioned previously, I still think separate override fields would be better. Because the external scheduler and the external job manager could be separated. In that case, scheduling constraints (podTemplate) will be managed by the external scheduler, and job parameters (jobTemplate) will be managed by the external job manager.

If we combine those into templateOverrides as in this proposal, there is no way to decouple those.

podTemplateOverrides:
- manager:
    name: kueue
    time: xyz
  targetJobs:
  - name: trainer
  spec:
    nodeSelector:
      accelerator: nvidia-gpu
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
jobTemplateOverrides: # or runtimeParameterOverrides? in any case, we can revisit that in the future.
- manager:
    name: abc
    time: xyz
  targetJobs:
  - name: trainer
  ...

Member Author

Do you see any limitations with the following API to define your example @tenzen-y ?

templateOverrides:
  - manager: kueue.x-k8s.io/manager
    pod:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        spec:
          nodeSelector:
            accelerator: nvidia-gpu
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"

  - manager: abc.example.com/abc
    job:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        metadata:
          labels:
            custom-label: value

Contributor

One advantage of having separate override fields would be to be backward compatible.

Even if the API is still alpha, podTemplateOverrides are already used quite a lot, so it'd be easier to maintain compatibility.

Also it makes it clearer what the scope of each override type is.

Member Author

@tenzen-y
We briefly discussed this today during the Trainer call.
Recording: https://youtu.be/e9_g28XdpHg?t=830

One challenge with this approach is that it prevents us from using
+listType=map +listMapKey=manager, because the list becomes atomic, as @kaisoz pointed out in previous PRs:

- manager: 
    name: kueue
    time: xyz

If we don't want to place all overrides under TemplateOverride API, I think we have two options:

Option 1

Place overrides under an overrides slice. The fields would be immutable, but new override entries could be appended over time.

Pros: Provides a clear history of appended overrides.
Cons: YAML grows in size

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        spec:
          nodeSelector:
            accelerator: nvidia-gpu

jobTemplateOverrides:
  - manager: abc.example.com/abc
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        metadata:
          labels:
            custom-label: value

Option 2

Place overrides directly under each entry and make the API mutable.
Pros: Simpler structure.
Cons: History is not preserved

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu

jobTemplateOverrides:
  - manager: "abc.example.com/abc"
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    metadata:
      labels:
        custom-label: value

Option 3

Keep what we have in the KEP right now.
example: #3199 (comment)

Any thoughts?

cc @VassilisVassiliadis @kannon92 @mimowo @astefanutti @vsoch
If you can provide any feedback for the API, it would be super helpful!

Contributor

So, to check I understand the difference between Option 2 and what is there now (Option 3): the general templateOverrides list of manager pod|job entries is being replaced with a single podTemplateOverrides and jobTemplateOverrides, either with an overrides list or with the fields directly under each entry. And a list of overrides is valid in all cases, e.g.,

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
  - manager: kueue.x-k8s.io/another-manager
...

A question. What happens if there is conflicting information? E.g., two sets of overrides, and different nodeSelector for the same managers:

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
  - manager: kueue.x-k8s.io/another-manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-another-gpu

I don't think preserving history is a strong priority, and having to consolidate "old" information (versus one source of truth) is adding a challenge that does not need to be there. I like Option 3 best, but I want to better understand why we allow a listing. If there is a duplicate manager would it not validate? And is this interface expected to be most utilized by the user (writing a YAML TrainJob with overrides) or internal controllers (e.g., FluxPolicy) or both? I'd like to see Command/Args/Environment support, and I suspect that would be in the PodTemplateOverrides?

Contributor

That makes sense conceptually, though those remain hypothetical use cases and it might help converging on a design if we focus on the main use cases pragmatically.

Contributor

One idea would be to structure API as follows:

overrides:
  - manager: kueue.x-k8s.io/manager
    podTemplateOverrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: node
        spec:
          nodeSelector:
            accelerator: nvidia-gpu
    jobTemplateOverrides:
      - time: "2026-02-17T10:00:00Z"
        targetJob:
          - name: node
        metadata:
          labels:
            custom-label: value

I like this approach. It makes each manager the owner of its overrides, so it can define as many as needed (similar to the singleton concept @vsoch mentioned) while still allowing extensibility.

To ensure that an actor (controller or user) can only modify its own list, we could restrict the manager field to be set exclusively via a CEL-based admission policy, assigning it to the user making the create/update request. If an actor tries to modify a list they don’t own, a VAP would reject the update.
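A hedged sketch of what such a policy could look like, using a ValidatingAdmissionPolicy with a CEL expression. The resource names and the overrides/manager field paths below are assumptions based on this thread, not the final API:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: trainjob-override-ownership
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["trainer.kubeflow.org"]
        apiVersions: ["v1alpha1"]
        operations: ["UPDATE"]
        resources: ["trainjobs"]
  validations:
    # Each entry must either be owned by the requesting user, or be an
    # unchanged entry carried over from the old object.
    - expression: >-
        !has(object.spec.overrides) || object.spec.overrides.all(o,
          o.manager == request.userInfo.username ||
          (has(oldObject.spec.overrides) && o in oldObject.spec.overrides))
      message: "an actor may only add or modify overrides entries it owns"
```

A real policy would likely match the manager against service account identities rather than raw usernames, but the ownership check reads the same way.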

Member

I found another issue with Option 2: how would it handle cases where a single manager needs to define different overrides for multiple ReplicatedJobs?
For example:

  • nodeSelector.accelerator: nvidia-gpu for Node Job
  • nodeSelector.accelerator: cpu for Initializer Job.

Does anyone have a better approach to handle this?

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: node
        spec:
          nodeSelector:
            accelerator: nvidia-gpu
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: launcher
        spec:
          nodeSelector:
            accelerator: cpu-only

For validation, we can prohibit modifying the targetJob field and prevent adding additional list entries that reference the same targetJob.

I imagine that Trainer sorts all override parameters defined in both podTemplateOverrides and jobTemplateOverrides, then applies each override one by one. After applying each override parameter, it records in status which override field has been applied. When the override parameters change next, Trainer starts overriding from the latest applied one.

Member

Yes, a few things:

  • If a manager wants to apply the same overrides to multiple targetJobs, it helps to reduce YAML size.
  • Potentially, we can apply metadata overrides to the JobSet if targetJobs is empty: https://github.com/andreyvelich/trainer/blob/template-override-api/docs/proposals/2170-kubeflow-trainer-v2/README.md?plain=1#L807
  • @tenzen-y would like to support label selectors in the targetJobs struct in the future. That will allow applying an override to multiple Jobs with the same label.

That makes sense conceptually, though those remain hypothetical use cases and it might help converging on a design if we focus on the main use cases pragmatically.

I proposed a label selector previously because the ReplicatedJob name can't be restricted to a fixed name going forward: we plan to remove the ReplicatedJob and container name limitations in a future version. After that, cluster admins will specify arbitrary names, and TrainJob users will not be able to predict those names.

Member Author

After Trainer applies each override parameter, it records in status which override field has been applied. When the override parameters change next, Trainer starts overriding from the latest applied one.

@tenzen-y How would that solve the use-case where a single manager needs to apply different overrides to different Jobs' Pod templates? For example, nodeSelector: nvidia-gpu to the Trainer ReplicatedJob, and nodeSelector: cpu to the Initializer ReplicatedJob?

@andreyvelich andreyvelich force-pushed the template-override-api branch 2 times, most recently from cf2fa4e to 2c22ee8 Compare March 2, 2026 01:02
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 2, 2026
@andreyvelich andreyvelich force-pushed the template-override-api branch from 2c22ee8 to bf23562 Compare March 2, 2026 01:03
@andreyvelich andreyvelich changed the title from "[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides" to "feat(api): Replace PodTemplateOverrides with RuntimePatches API" Mar 2, 2026
@andreyvelich andreyvelich force-pushed the template-override-api branch from bf23562 to 24ab971 Compare March 2, 2026 01:09
TargetJobs []PodTemplateOverrideTargetJob `json:"targetJobs"`
// RuntimePatch represents a custom patch applied to the TrainJob's training runtime template.
// Patches are keyed by manager to provide clear ownership and avoid conflicts between controllers.
type RuntimePatch struct {
Member Author

@andreyvelich andreyvelich Mar 2, 2026

We discussed the overrides API offline with @astefanutti and @tenzen-y and agreed that managing overrides via a targetJob and overrides list is too complex and hard to maintain, especially if users are required to override fields in the Runtime spec (mpi.sshAuthMountPath), the JobSet spec (failurePolicy), or JobSet/Job metadata.

Instead, we propose introducing a RuntimePatch API that represents a valid Kubernetes patch applied directly to the TrainJob runtime spec before creation of JobSet. This provides a clearer and more declarative way to customize runtime behavior.

Upstream, we plan to support structured patches for runtimes such as ClusterTrainingRuntime, GroveRuntime, SlurmRuntime, FluxRuntime, and other OSS runtimes we integrate.

For custom or in-house CRDs, I propose an opaqueRuntimeSpec field that allows users to provide an arbitrary patch to the runtime spec (API details can be defined later).

This approach simplifies overrides and makes runtime customization more explicit and extensible.
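As a sketch of the patching semantics, the snippet below applies a JSON Merge Patch (RFC 7386) to a runtime spec before it would be turned into a JobSet. This is an assumption about the mechanics: the real controller would likely use apimachinery's strategic-merge helpers and the actual runtime types, and the field values here are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergePatch applies a JSON Merge Patch (RFC 7386) to a document, the same
// semantics as `kubectl patch --type=merge`.
func mergePatch(doc, patch interface{}) interface{} {
	p, ok := patch.(map[string]interface{})
	if !ok {
		return patch // a non-object patch replaces the target wholesale
	}
	d, ok := doc.(map[string]interface{})
	if !ok {
		d = map[string]interface{}{}
	}
	for k, v := range p {
		if v == nil {
			delete(d, k) // null deletes the key, per RFC 7386
			continue
		}
		d[k] = mergePatch(d[k], v)
	}
	return d
}

func main() {
	var runtime, patch interface{}
	json.Unmarshal([]byte(`{"spec":{"mpi":{"sshAuthMountPath":"/root/.ssh"},"numNodes":2}}`), &runtime)
	json.Unmarshal([]byte(`{"spec":{"mpi":{"sshAuthMountPath":"/home/mpiuser/.ssh"}}}`), &patch)
	out, _ := json.Marshal(mergePatch(runtime, patch))
	fmt.Println(string(out)) // untouched fields (numNodes) survive the patch
}
```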

Please let us know what you think, so we can start implementation!
/assign @kaisoz @mimowo @vsoch @VassilisVassiliadis
cc @kannon92 @Ronkahn21


Contributor

sgtm. It’s definitely more flexible and maintainable than the overrides solution. How would these be set from the SDK? I guess via the options field, as they do with PodTemplateOverrides?

Member Author

@andreyvelich andreyvelich Mar 3, 2026

Yes, we should refactor options to add values into runtimePatches API.
cc @kubeflow/kubeflow-sdk-team @Fiona-Waters

Contributor

Yes, we should deprecate the PodTemplateOverrides option. Ideally, such an option should not really be exposed: RuntimePatches should be encapsulated, and only higher-level options should be exposed.

@google-oss-prow

@andreyvelich: GitHub didn't allow me to assign the following users: mimowo, vsoch, VassilisVassiliadis.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


In response to this:

We discussed the overrides API offline with @astefanutti and @tenzen-y and agreed that managing overrides via targetJob and podTemplateOverrides is too complex and hard to maintain, especially if users are required to override fields in the Runtime spec (mpi.sshAuthMountPath) or the JobSet spec (failurePolicy).

Instead, we propose introducing a RuntimePatch API that represents a valid Kubernetes patch applied directly to the TrainJob runtime spec before creation of JobSet. This provides a clearer and more declarative way to customize runtime behavior.

Upstream, we plan to support structured patches for runtimes such as ClusterTrainingRuntime, GroveRuntime, SlurmRuntime, and other OSS runtimes we integrate.

For custom or in-house CRDs, I propose an opaqueRuntimeSpec field that allows users to provide an arbitrary patch to the runtime spec (API details can be defined later).

This approach simplifies overrides and makes runtime customization more explicit and extensible.

Please let us know what you think, so we can start implementation!
/assign @kaisoz @mimowo @vsoch @VassilisVassiliadis
cc @kannon92 @Ronkahn21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines +931 to +936
// ContainerPatch represents parameters that can be patched using PodSpecPatch.
type ContainerPatch struct {
// name for the container. Runtime must have this container.
// +kubebuilder:validation:MinLength=1
// +required
Name string `json:"name,omitempty"`


@andreyvelich could we also add fields here for resources, command, and args ?

This would be ideal for a manager that's patching the resource requirements of a TrainJob. It'd also need to mutate the cmdline args, for example to change the options of torchrun or accelerate launch.

Member Author

These values should be controlled via .spec.trainer API in TrainJob: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainjob_types.go#L226

However, the Job API limits making these fields mutable for a suspended Job.

Shall we discuss, as a follow-up, how to allow external controllers to update container resources?


No problem, we can try tackling resource requirements patching in a future KEP!

// +listType=atomic
PodTemplateOverrides []PodTemplateOverride `json:"podTemplateOverrides,omitempty"`
// runtimePatches defines custom patches applied to the TrainJob's Runtime.
// Patches are keyed by manager to provide clear ownership and avoid conflicts between controllers.
Contributor

Conflicts can still occur if the intersection of two patches is not empty.
Maybe here we can be more specific about precedence? Should it be time or order (if it's maintained)?

Member Author

Would it be better to enforce strict validation initially, which doesn't allow two different managers to modify the same Pod spec field?

Contributor

I think that, to keep this focused, I’d stick to the current behaviour and apply them in order. Then we could discuss a different precedence in a follow-up. WDYT?

Member Author

Sure, we should talk about validation as soon as we refactor the API.

Contributor

@astefanutti astefanutti Mar 3, 2026

Would it be better to enforce strict validation initially, which doesn't allow two different managers to modify the same Pod spec field?

I agree fail-fast is the less ambiguous approach. It might be a bit complex to detect conflicts on leaf fields/structs, and merge-strategy markers should conceptually be taken into account. Kubernetes may provide some helpers we could reuse, though.
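A stdlib-only sketch of such fail-fast conflict detection on leaf fields, assuming patches are expressed as JSON objects (merge-strategy markers for lists are ignored here, so this is a simplification of what the webhook would actually need):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leafPaths collects the leaf field paths a JSON patch document touches.
// Anything that is not a non-empty object (scalars, lists) is a leaf.
func leafPaths(prefix string, v interface{}, out map[string]bool) {
	m, ok := v.(map[string]interface{})
	if !ok || len(m) == 0 {
		out[prefix] = true
		return
	}
	for k, child := range m {
		leafPaths(prefix+"/"+k, child, out)
	}
}

// conflicts reports leaf paths that both patches try to set; a non-empty
// result would cause the validating webhook to reject the TrainJob update.
func conflicts(a, b []byte) ([]string, error) {
	var va, vb interface{}
	if err := json.Unmarshal(a, &va); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return nil, err
	}
	pa, pb := map[string]bool{}, map[string]bool{}
	leafPaths("", va, pa)
	leafPaths("", vb, pb)
	var both []string
	for p := range pa {
		if pb[p] {
			both = append(both, p)
		}
	}
	return both, nil
}

func main() {
	kueue := []byte(`{"spec":{"nodeSelector":{"accelerator":"nvidia-gpu"}}}`)
	other := []byte(`{"spec":{"nodeSelector":{"accelerator":"cpu"}},"metadata":{"labels":{"x":"y"}}}`)
	c, _ := conflicts(kueue, other)
	fmt.Println(c) // both managers set the same nodeSelector key
}
```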

Contributor

@astefanutti astefanutti left a comment

Thanks @andreyvelich!

/lgtm

@VassilisVassiliadis VassilisVassiliadis left a comment

Overall the RuntimePatches API makes sense to me as a solid foundation for facilitating the integration with frameworks like Kueue. I left some minor comments here and there with my thoughts regarding details of the proposed API.

Comment on lines 71 to 74
- Introduce the `TrainingRuntime` and `ClusterTrainingRuntime` APIs that will store blueprints
for model training and LLM fine-tuning using various ML frameworks. These runtimes will be built
on top of `JobSet` APIs with additional functionality for special use-cases.
For example, training using MPI orchestration.


I think that one of the qualities of the v2 API is the separation of concerns between the different personas. My mental model of how v2 works is that the Cluster admins (devops engineers) create "best practices" for using the k8s cluster. Similarly, the MLOps Engineers set the "best practices" for advanced features of ML frameworks.

I would consider these personas as the "Platform Engineers" of a cluster. By "Platform Engineer" I'm referring to a person that assists the end users to better make use of the available compute resources. The combined expertise of "Platform Engineers" is what it takes to produce a high-value TrainingRuntime/ClusterTrainingRuntime blueprint that users can "safely" use to maximize the potential of their apps. I'm also imagining that these 2 "Platform Engineer" personas will configure different fields in these CRs. So ownership is kind of easy to follow.

After reading the goals, it's as if there's a 4th "implied" persona. That of the frameworks/controllers/services (like Kueue) but also power-users that use the API to enhance user experience. This persona is basically overriding what the Admins decided in the TrainingRuntime/ClusterTrainingRuntime templates for example to optimize the TrainJobs in some way which is not possible without up-to-date information or details about the TrainJob at hand (e.g. scheduling hints based on admission checks from Kueue).

If so, would it make sense to add one more goal to specify how these entities/people can customize the Runtime that a Job uses by referencing that there's a mechanism in place called RuntimePatches?

Suggested change
- Introduce the `TrainingRuntime` and `ClusterTrainingRuntime` APIs that will store blueprints
for model training and LLM fine-tuning using various ML frameworks. These runtimes will be built
on top of `JobSet` APIs with additional functionality for special use-cases.
For example, training using MPI orchestration.
- Enable `TrainJob` to customize `TrainingRuntime` templates through the `RuntimePatches` API,
supporting configuration injection by controllers, admission webhooks, and custom clients.

Member Author

Yeah, this is great idea!

The below diagram shows how platform engineers manage `TrainingRuntime` and how data scientists
create `TrainJob`:

![user-roles](./user-roles.drawio.svg)


Show that the Runtime configuration of TrainJobs can be patched by external controllers too?


The webhook will validate that TargetJob and Container name exist in the Runtime Job template.
The webhook validates that the container names in `Containers` and `InitContainers` exist in


I can see validation/patching becoming tricky if we allow multiple managers to patch the same field.

Should we add a validation rule here to forbid this scenario ?

Member Author

Yes, we discussed it here with @astefanutti: #3199 (comment)
We should forbid multiple managers to apply the same override.
I understand that it might be trickier for use-cases where, for example, multiple managers need to apply an ENV var to the same container, but I hope we can discuss that later.

@google-oss-prow

@VassilisVassiliadis: changing LGTM is restricted to collaborators

Details

In response to this:

Overall the RuntimePatches API makes sense to me as a solid foundation for facilitating the integration with frameworks like Kueue. I left some minor comments here and there with my thoughts regarding details of the proposed API.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow

New changes are detected. LGTM label has been removed.

```go
// metadata patches the JobSet object metadata.
// Only labels and annotations are allowed.
// +optional
Metadata *metav1.ObjectMeta `json:"metadata,omitempty"`
```
Member

The comment states "Only labels and annotations are allowed" but the field type is *metav1.ObjectMeta, which includes many other fields (Name, Namespace, Finalizers, OwnerReferences, etc.). Without schema-level enforcement, users could set fields that are either ignored or cause unexpected behavior, right?

Member

can we use a separate restricted type instead of full ObjectMeta.. like a MetadataPatch struct with labels and annotations listed ?

Contributor

@krishdef7 Mar 10, 2026

Using `*metav1.ObjectMeta` for metadata patches while documenting "only labels and annotations are allowed" creates a silent contract violation: users could set Finalizers, OwnerReferences, or Namespace, which would either be ignored or cause unexpected reconcile behavior.
A restricted type enforces this at the schema level:

```go
type MetadataPatch struct {
	// +optional
	Labels map[string]string `json:"labels,omitempty"`
	// +optional
	Annotations map[string]string `json:"annotations,omitempty"`
}
```

This would replace `*metav1.ObjectMeta` in `JobSetTemplatePatch`, `JobTemplatePatch`, and `PodTemplatePatch`, so the constraint becomes structural rather than documentary.
Also noting that #3285 proposes adding terminationGracePeriodSeconds to PodSpecPatch, which seems like a natural fit here given this PR defines that struct.
Happy to implement both the MetadataPatch refactor and the terminationGracePeriodSeconds addition in a follow-up PR once this merges, if that's useful.
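For illustration, a minimal sketch of the merge semantics such a restricted type could enforce. The helper names and the patch-wins precedence rule are assumptions for this sketch, not part of this PR:

```go
package main

import "fmt"

// MetadataPatch mirrors the restricted struct suggested above:
// only labels and annotations can be patched.
type MetadataPatch struct {
	Labels      map[string]string
	Annotations map[string]string
}

// merge copies base and overlays patch entries, with patch
// values winning on key collisions.
func merge(base, patch map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(patch))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range patch {
		out[k] = v
	}
	return out
}

// applyMetadataPatch returns the patched labels and annotations.
// By construction it cannot touch any other metadata field.
func applyMetadataPatch(labels, annotations map[string]string, p MetadataPatch) (map[string]string, map[string]string) {
	return merge(labels, p.Labels), merge(annotations, p.Annotations)
}

func main() {
	labels, _ := applyMetadataPatch(
		map[string]string{"app": "trainjob"},
		nil,
		MetadataPatch{Labels: map[string]string{"team": "ml"}},
	)
	fmt.Println(labels["app"], labels["team"]) // trainjob ml
}
```

With the full `ObjectMeta`, fields like Finalizers would have to be ignored at apply time; here they simply cannot be expressed.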

Contributor

There has been a discussion about that point when the metadata field got introduced in PodTemplateOverrides with #2785.

@andreyvelich @tenzen-y do you remember why we went for using *metav1.ObjectMeta instead of only a subset of it?

Member Author

Found this comment from @tenzen-y: #2785 (comment)
I think using the well-known `ObjectMeta` API allows us to avoid confusion about which fields users can set there.

Contributor

Right, @tenzen-y's point is to avoid possibly introducing fields that are not in metav1.ObjectMeta.

@andreyvelich andreyvelich mentioned this pull request Mar 10, 2026
8 tasks
@andreyvelich
Member Author

Implemented in: #3309
Thanks everyone 🚀
/close

@google-oss-prow google-oss-prow bot closed this Mar 12, 2026
@google-oss-prow

@andreyvelich: Closed this PR.

Details

In response to this:

Implemented in: #3309
Thanks everyone 🚀
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

10 participants