
feat(api): Replace PodTemplateOverrides with RuntimePatches API#3199

Closed
andreyvelich wants to merge 3 commits intokubeflow:masterfrom
andreyvelich:template-override-api

Conversation

Member

@andreyvelich andreyvelich commented Feb 10, 2026

This breaking change replaces the PodTemplateOverrides API with the RuntimePatches API.

We would like to group patches by manager for clear ownership boundaries.

This PR updates KEP, APIs, and implementation.

Related: #3020

Copilot AI review requested due to automatic review settings February 10, 2026 13:58
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a breaking API change to replace PodTemplateOverrides with manager-scoped TemplateOverrides, aiming to give each override a clear owner among controllers and users.

Changes:

  • Replaces TrainJobSpec.PodTemplateOverrides with TrainJobSpec.TemplateOverrides keyed by manager.
  • Introduces new API types for TemplateOverride, including job-level and pod-level override histories.
  • Updates the v2 proposal/KEP documentation to describe the new API shape and examples.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
pkg/apis/trainer/v1alpha1/trainjob_types.go Updates TrainJob API types to add manager-keyed TemplateOverrides and new override structs.
docs/proposals/2170-kubeflow-trainer-v2/README.md Updates the proposal to document TemplateOverrides, including rationale and YAML examples.

Comment on lines +799 to +804
// JobTemplateOverride represents a custom override that will be applied to the JobTemplateSpec
type JobTemplateOverride struct {
// Time is the timestamp of when the JobTemplateOverride entry was added.
// +required
Time metav1.Time `json:"time,omitempty"`

Copilot AI Feb 10, 2026

In the docs code snippet, Time is marked as +required and uses a non-pointer metav1.Time, which doesn’t match the actual API types in pkg/apis/trainer/v1alpha1/trainjob_types.go where Time is optional (*metav1.Time).

Comment on lines +268 to +276
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
Copilot AI Feb 10, 2026

JobTemplateOverrides/PodTemplateOverrides are declared as +listType=map with +listMapKey=time, but the keyed field Time is optional (*metav1.Time), which makes the map key potentially unset and breaks map-list semantics (unique/stable keys for merge/validation).

Suggested change
// +listType=map
// +listMapKey=time
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`
// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"

Comment on lines +299 to +305
TargetJobs []TemplateOverrideTargetJob `json:"targetJobs,omitempty"`

// metadata overrides the Job template metadata or JobSet metadata.
// If targetJobs is specified, these values are merged with the specific ReplicatedJob's Job template metadata.
// If targetJobs is empty, these values are merged with the JobSet object metadata.
// +optional
Metadata *metav1.ObjectMeta `json:"metadata,omitempty"`
Member Author

@andreyvelich andreyvelich Feb 10, 2026

I’ve been thinking that we could start using JobTemplateOverride instead of the dedicated Labels and Annotations fields we currently expose in the TrainJob.spec API.

The idea would be:

  • If targetJob is omitted, the override is applied to the JobSet
  • If targetJob is set, the override is applied to the specific Job

One concern is that once we introduce JobTemplateSpecOverride, it could potentially contain fields relevant to both Job and JobSet, which may introduce ambiguity. I’m not entirely sure what the better way to handle that would be, though I also don’t see a clearly better alternative at the moment.

@tenzen-y @kaisoz @mimowo @astefanutti @kannon92 , I’d really appreciate your thoughts on this approach.

Comment on lines +310 to +313
// Time is the timestamp of when the JobTemplateOverride entry was added. If value is omitted,
// controller defaults this value to the current timestamp.
// +optional
Time *metav1.Time `json:"time,omitempty"`
Member Author

Time will be set server-side by the Trainer mutating admission webhook when a TrainJob is created or updated.
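A minimal sketch of that defaulting behavior, assuming a simplified JobTemplateOverride shape (the real webhook operates on the full API types inside the admission request):

```go
package main

import (
	"fmt"
	"time"
)

// JobTemplateOverride mirrors the shape under discussion: Time is optional
// in the request and defaulted server-side.
type JobTemplateOverride struct {
	Time *time.Time
}

// defaultTimes sketches what the mutating webhook could do on create/update:
// fill in any unset Time with the admission timestamp.
func defaultTimes(overrides []JobTemplateOverride, now time.Time) {
	for i := range overrides {
		if overrides[i].Time == nil {
			overrides[i].Time = &now
		}
	}
}

func main() {
	now := time.Date(2026, 2, 17, 10, 0, 0, 0, time.UTC)
	overrides := []JobTemplateOverride{{}} // entry submitted without a time
	defaultTimes(overrides, now)
	fmt.Println(overrides[0].Time.Format(time.RFC3339))
}
```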

Comment on lines +270 to +277
JobTemplateOverrides []JobTemplateOverride `json:"job,omitempty"`

// podTemplateOverrides defines overrides that applied to PodTemplateSpec
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || oldSelf.all(old, self.exists(new, old.time == new.time && old == new))", message="existing entries cannot be modified or removed in template overrides"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf) || size(self) >= size(oldSelf)", message="pod template override entries cannot be deleted"
// +listType=map
// +listMapKey=time
PodTemplateOverrides []PodTemplateOverride `json:"pod,omitempty"`
Member Author

Do you prefer pod and job or podTemplateOverrides and jobTemplateOverrides?

@andreyvelich
Member Author

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @akshaychitneni

Comment on lines +315 to +318
// templateOverrides defines template overrides that will be applied to the TrainJob's training runtime template.
// +listType=map
// +listMapKey=manager
TemplateOverrides []TemplateOverride `json:"templateOverrides,omitempty"`
Member

As I mentioned previously, I still think separate override fields would be better. Because the external scheduler and the external job manager could be separated. In that case, scheduling constraints (podTemplate) will be managed by the external scheduler, and job parameters (jobTemplate) will be managed by the external job manager.

If we combine those into templateOverrides as in this proposal, there is no way to decouple those.

podTemplateOverrides:
- manager:
    name: kueue
    time: xyz
  targetJobs:
  - name: trainer
  spec:
    nodeSelector:
      accelerator: nvidia-gpu
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
jobTemplateOverrides: # or runtimeParameterOverrides? in any case, we can revisit that in the future.
- manager:
    name: abc
    time: xyz
  targetJobs:
  - name: trainer
  ...

Member Author

Do you see any limitations with the following API to define your example @tenzen-y ?

templateOverrides:
  - manager: kueue.x-k8s.io/manager
    pod:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        spec:
          nodeSelector:
            accelerator: nvidia-gpu
          tolerations:
            - key: "nvidia.com/gpu"
              operator: "Exists"
              effect: "NoSchedule"

  - manager: abc.example.com/abc
    job:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        metadata:
          labels:
            custom-label: value

Contributor

One advantage of having separate override fields would be to be backward compatible.

Even if the API is still alpha, podTemplateOverrides are already used quite a lot, so it'd be easier to maintain compatibility.

Also it makes it clearer what the scope of each override type is.

Member Author

@tenzen-y
We briefly discussed this today during the Trainer call.
Recording: https://youtu.be/e9_g28XdpHg?t=830

One challenge with this approach is that it prevents us from using
+listType=map +listMapKey=manager, because the list becomes atomic, as @kaisoz pointed out in previous PRs:

- manager: 
    name: kueue
    time: xyz

If we don't want to place all overrides under TemplateOverride API, I think we have two options:

Option 1

Place overrides under an overrides slice. The fields would be immutable, but new override entries could be appended over time.

Pros: Provides a clear history of appended overrides.
Cons: YAML grows in size

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        spec:
          nodeSelector:
            accelerator: nvidia-gpu

jobTemplateOverrides:
  - manager: abc.example.com/abc
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: trainer
        metadata:
          labels:
            custom-label: value

Option 2

Place overrides directly under each entry and make the API mutable.
Pros: Simpler structure.
Cons: History is not preserved

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu

jobTemplateOverrides:
  - manager: "abc.example.com/abc"
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    metadata:
      labels:
        custom-label: value

Option 3

Keep what we have in the KEP right now.
example: #3199 (comment)

Any thoughts?

cc @VassilisVassiliadis @kannon92 @mimowo @astefanutti @vsoch
If you can provide any feedback for the API, it would be super helpful!

Contributor

So, to check I understand the difference between Option 2 and what is there now (Option 3): the general templateOverrides list of manager pod|job entries is being replaced with a single podTemplateOverrides and jobTemplateOverrides, either with an overrides list or with the fields directly under each entry. And a list of overrides is valid in all cases, e.g.,

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
  - manager: kueue.x-k8s.io/another-manager
...

A question. What happens if there is conflicting information? E.g., two sets of overrides, and different nodeSelector for the same managers:

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-gpu
  - manager: kueue.x-k8s.io/another-manager
    time: "2026-02-17T10:00:00Z"
    targetJobs:
      - name: trainer
    spec:
      nodeSelector:
        accelerator: nvidia-another-gpu

I don't think preserving history is a strong priority, and having to consolidate "old" information (versus one source of truth) is adding a challenge that does not need to be there. I like Option 3 best, but I want to better understand why we allow a listing. If there is a duplicate manager would it not validate? And is this interface expected to be most utilized by the user (writing a YAML TrainJob with overrides) or internal controllers (e.g., FluxPolicy) or both? I'd like to see Command/Args/Environment support, and I suspect that would be in the PodTemplateOverrides?

Contributor

That makes sense conceptually, though those remain hypothetical use cases and it might help converging on a design if we focus on the main use cases pragmatically.

Contributor

One idea would be to structure API as follows:

overrides:
  - manager: kueue.x-k8s.io/manager
    podTemplateOverrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: node
        spec:
          nodeSelector:
            accelerator: nvidia-gpu
    jobTemplateOverrides:
      - time: "2026-02-17T10:00:00Z"
        targetJob:
          - name: node
        metadata:
          labels:
            custom-label: value

I like this approach. It makes each manager the owner of its overrides, so it can define as many as needed (similar to the singleton concept @vsoch mentioned) while still allowing extensibility.

To ensure that an actor (controller or user) can only modify its own list, we could restrict the manager field to be set exclusively via a CEL-based admission policy, assigning it to the user making the create/update request. If an actor tries to modify a list they don’t own, a VAP would reject the update.
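A hedged sketch of what such a policy could look like, using a ValidatingAdmissionPolicy with a CEL expression. The resource names and the overrides/manager field paths below are assumptions based on this thread, not the final API:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: trainjob-override-ownership
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["trainer.kubeflow.org"]
        apiVersions: ["v1alpha1"]
        operations: ["UPDATE"]
        resources: ["trainjobs"]
  validations:
    # Each entry must either be owned by the requesting user, or be an
    # unchanged entry carried over from the old object.
    - expression: >-
        !has(object.spec.overrides) || object.spec.overrides.all(o,
          o.manager == request.userInfo.username ||
          (has(oldObject.spec.overrides) && o in oldObject.spec.overrides))
      message: "an actor may only add or modify overrides entries it owns"
```

A real policy would likely match the manager against service account identities rather than raw usernames, but the ownership check reads the same way.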

Member

I found another issue with Option 2: how would it handle cases where a single manager needs to define different overrides for multiple ReplicatedJobs?
For example:

  • nodeSelector.accelerator: nvidia-gpu for Node Job
  • nodeSelector.accelerator: cpu for Initializer Job.

Does anyone have a better approach to handle this?

podTemplateOverrides:
  - manager: kueue.x-k8s.io/manager
    overrides:
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: node
        spec:
          nodeSelector:
            accelerator: nvidia-gpu
      - time: "2026-02-17T10:00:00Z"
        targetJobs:
          - name: launcher
        spec:
          nodeSelector:
            accelerator: cpu-only

For validation, we can prohibit modifying the targetJob field and prevent adding additional list entries that reference the same targetJob.

I imagine that Trainer sorts all override parameters defined in both podTemplateOverrides and jobTemplateOverrides, then applies each override one by one. After applying each override parameter, it records in status which override field has been applied. When the override parameters change next, Trainer starts overriding from the latest applied one.

Member

Yes, a few things:

  • If a manager wants to apply the same overrides to multiple targetJobs, it helps to reduce YAML size.
  • Potentially, we can apply metadata overrides to the JobSet if targetJobs is empty: https://github.com/andreyvelich/trainer/blob/template-override-api/docs/proposals/2170-kubeflow-trainer-v2/README.md?plain=1#L807
  • @tenzen-y would like to support label selectors in the targetJobs struct in the future. That will allow applying an override to multiple Jobs with the same label.

That makes sense conceptually, though those remain hypothetical use cases and it might help converging on a design if we focus on the main use cases pragmatically.

I proposed a label selector previously because the ReplicatedJob name can't be restricted to a fixed name going forward: we plan to remove the ReplicatedJob and container name limitations in a future version. After that, cluster admins will specify arbitrary names, and TrainJob users will not be able to predict those names.

Member Author

After Trainer applies each override parameter, it records in status which override field has been applied. When the override parameters change next, Trainer starts overriding from the latest applied one.

@tenzen-y How would that solve the use-case where a single manager needs to apply different overrides to different Jobs' Pod templates? For example, nodeSelector: nvidia-gpu to the Trainer ReplicatedJob, and nodeSelector: cpu to the Initializer ReplicatedJob?

@andreyvelich andreyvelich force-pushed the template-override-api branch 2 times, most recently from cf2fa4e to 2c22ee8 Compare March 2, 2026 01:02
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Mar 2, 2026
@andreyvelich andreyvelich force-pushed the template-override-api branch from 2c22ee8 to bf23562 Compare March 2, 2026 01:03
@andreyvelich andreyvelich changed the title from "[WIP] feat(api): Replace PodTemplateOverrides with TemplateOverrides" to "feat(api): Replace PodTemplateOverrides with RuntimePatches API" Mar 2, 2026
@andreyvelich andreyvelich force-pushed the template-override-api branch from bf23562 to 24ab971 Compare March 2, 2026 01:09
TargetJobs []PodTemplateOverrideTargetJob `json:"targetJobs"`
// RuntimePatch represents a custom patch applied to the TrainJob's training runtime template.
// Patches are keyed by manager to provide clear ownership and avoid conflicts between controllers.
type RuntimePatch struct {
Member Author

@andreyvelich andreyvelich Mar 2, 2026

We discussed the overrides API offline with @astefanutti and @tenzen-y and agreed that managing overrides via a targetJob and overrides list is too complex and hard to maintain, especially if users are required to override fields in the Runtime spec (mpi.sshAuthMountPath), the JobSet spec (failurePolicy), or JobSet/Job metadata.

Instead, we propose introducing a RuntimePatch API that represents a valid Kubernetes patch applied directly to the TrainJob runtime spec before creation of JobSet. This provides a clearer and more declarative way to customize runtime behavior.

Upstream, we plan to support structured patches for runtimes such as ClusterTrainingRuntime, GroveRuntime, SlurmRuntime, FluxRuntime, and other OSS runtimes we integrate.

For custom or in-house CRDs, I propose an opaqueRuntimeSpec field that allows users to provide an arbitrary patch to the runtime spec (API details can be defined later).

This approach simplifies overrides and makes runtime customization more explicit and extensible.
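As a sketch of the patching semantics, the snippet below applies a JSON Merge Patch (RFC 7386) to a runtime spec before it would be turned into a JobSet. This is an assumption about the mechanics: the real controller would likely use apimachinery's strategic-merge helpers and the actual runtime types, and the field values here are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergePatch applies a JSON Merge Patch (RFC 7386) to a document, the same
// semantics as `kubectl patch --type=merge`.
func mergePatch(doc, patch interface{}) interface{} {
	p, ok := patch.(map[string]interface{})
	if !ok {
		return patch // a non-object patch replaces the target wholesale
	}
	d, ok := doc.(map[string]interface{})
	if !ok {
		d = map[string]interface{}{}
	}
	for k, v := range p {
		if v == nil {
			delete(d, k) // null deletes the key, per RFC 7386
			continue
		}
		d[k] = mergePatch(d[k], v)
	}
	return d
}

func main() {
	var runtime, patch interface{}
	json.Unmarshal([]byte(`{"spec":{"mpi":{"sshAuthMountPath":"/root/.ssh"},"numNodes":2}}`), &runtime)
	json.Unmarshal([]byte(`{"spec":{"mpi":{"sshAuthMountPath":"/home/mpiuser/.ssh"}}}`), &patch)
	out, _ := json.Marshal(mergePatch(runtime, patch))
	fmt.Println(string(out)) // untouched fields (numNodes) survive the patch
}
```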

Please let us know what you think, so we can start implementation!
/assign @kaisoz @mimowo @vsoch @VassilisVassiliadis
cc @kannon92 @Ronkahn21


Contributor

sgtm. It’s definitely more flexible and maintainable than the overrides solution. How would these be set from the SDK? I guess via the options field, as they do with PodTemplateOverrides?

Member Author

@andreyvelich andreyvelich Mar 3, 2026

Yes, we should refactor options to add values into runtimePatches API.
cc @kubeflow/kubeflow-sdk-team @Fiona-Waters

Contributor

Yes, we should deprecate the PodTemplateOverrides option. Ideally, such an option should not really be exposed: RuntimePatches should be encapsulated, and only higher-level options should be exposed.

@google-oss-prow

@andreyvelich: GitHub didn't allow me to assign the following users: mimowo, vsoch, VassilisVassiliadis.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide


In response to this:

We discussed the overrides API offline with @astefanutti and @tenzen-y and agreed that managing overrides via targetJob and podTemplateOverrides is too complex and hard to maintain, especially if users are required to override fields in the Runtime spec (mpi.sshAuthMountPath) or the JobSet spec (failurePolicy).

Instead, we propose introducing a RuntimePatch API that represents a valid Kubernetes patch applied directly to the TrainJob runtime spec before creation of JobSet. This provides a clearer and more declarative way to customize runtime behavior.

Upstream, we plan to support structured patches for runtimes such as ClusterTrainingRuntime, GroveRuntime, SlurmRuntime, and other OSS runtimes we integrate.

For custom or in-house CRDs, I propose an opaqueRuntimeSpec field that allows users to provide an arbitrary patch to the runtime spec (API details can be defined later).

This approach simplifies overrides and makes runtime customization more explicit and extensible.

Please let us know what you think, so we can start implementation!
/assign @kaisoz @mimowo @vsoch @VassilisVassiliadis
cc @kannon92 @Ronkahn21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines +931 to +936
// ContainerPatch represents parameters that can be patched using PodSpecPatch.
type ContainerPatch struct {
// name for the container. Runtime must have this container.
// +kubebuilder:validation:MinLength=1
// +required
Name string `json:"name,omitempty"`


@andreyvelich could we also add fields here for resources, command, and args ?

This would be ideal for a manager that's patching the resource requirements of a TrainJob. It'd also need to mutate the cmdline args, for example to change the options of torchrun or accelerate launch.

Member Author

These values should be controlled via .spec.trainer API in TrainJob: https://github.com/kubeflow/trainer/blob/master/pkg/apis/trainer/v1alpha1/trainjob_types.go#L226

However, the Job API limits making these fields mutable for a suspended Job.

Shall we discuss, as a follow-up, how to allow external controllers to update container resources?


No problem, we can try tackling resource requirements patching in a future KEP!

// +listType=atomic
PodTemplateOverrides []PodTemplateOverride `json:"podTemplateOverrides,omitempty"`
// runtimePatches defines custom patches applied to the TrainJob's Runtime.
// Patches are keyed by manager to provide clear ownership and avoid conflicts between controllers.
Contributor

Conflicts can still occur if the intersection of two patches is not empty.
Maybe here we can be more specific about precedence? Should it be time or order (if it's maintained)?

Member Author

Would it be better to enforce strict validation initially, which doesn't allow two different managers to modify the same Pod spec field?

Contributor

I think that, to keep this focused, I’d stick to the current behaviour and apply them in order. Then we could discuss a different precedence in a follow-up. WDYT?

Member Author

Sure, we should talk about validation as soon as we refactor the API.

Contributor

@astefanutti astefanutti Mar 3, 2026

Would it be better to enforce strict validation initially, which doesn't allow two different managers to modify the same Pod spec field?

I agree fail-fast is the less ambiguous approach. It might be a bit complex to detect conflicts on leaf fields/structs, and merge-strategy markers should conceptually be taken into account. Kubernetes may provide some helpers we could reuse, though.
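A stdlib-only sketch of such fail-fast conflict detection on leaf fields, assuming patches are expressed as JSON objects (merge-strategy markers for lists are ignored here, so this is a simplification of what the webhook would actually need):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// leafPaths collects the leaf field paths a JSON patch document touches.
// Anything that is not a non-empty object (scalars, lists) is a leaf.
func leafPaths(prefix string, v interface{}, out map[string]bool) {
	m, ok := v.(map[string]interface{})
	if !ok || len(m) == 0 {
		out[prefix] = true
		return
	}
	for k, child := range m {
		leafPaths(prefix+"/"+k, child, out)
	}
}

// conflicts reports leaf paths that both patches try to set; a non-empty
// result would cause the validating webhook to reject the TrainJob update.
func conflicts(a, b []byte) ([]string, error) {
	var va, vb interface{}
	if err := json.Unmarshal(a, &va); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return nil, err
	}
	pa, pb := map[string]bool{}, map[string]bool{}
	leafPaths("", va, pa)
	leafPaths("", vb, pb)
	var both []string
	for p := range pa {
		if pb[p] {
			both = append(both, p)
		}
	}
	return both, nil
}

func main() {
	kueue := []byte(`{"spec":{"nodeSelector":{"accelerator":"nvidia-gpu"}}}`)
	other := []byte(`{"spec":{"nodeSelector":{"accelerator":"cpu"}},"metadata":{"labels":{"x":"y"}}}`)
	c, _ := conflicts(kueue, other)
	fmt.Println(c) // both managers set the same nodeSelector key
}
```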

Contributor

@astefanutti astefanutti left a comment

Thanks @andreyvelich!

/lgtm

@VassilisVassiliadis VassilisVassiliadis left a comment

Overall the RuntimePatches API makes sense to me as a solid foundation for facilitating the integration with frameworks like Kueue. I left some minor comments here and there with my thoughts regarding details of the proposed API.

Comment on lines 71 to 74
- Introduce the `TrainingRuntime` and `ClusterTrainingRuntime` APIs that will store blueprints
for model training and LLM fine-tuning using various ML frameworks. These runtimes will be built
on top of `JobSet` APIs with additional functionality for special use-cases.
For example, training using MPI orchestration.


I think that one of the qualities of the v2 API is the separation of concerns between the different personas. My mental model of how v2 works is that the Cluster admins (devops engineers) create "best practices" for using the k8s cluster. Similarly, the MLOps Engineers set the "best practices" for advanced features of ML frameworks.

I would consider these personas as the "Platform Engineers" of a cluster. By "Platform Engineer" I'm referring to a person that assists the end users to better make use of the available compute resources. The combined expertise of "Platform Engineers" is what it takes to produce a high-value TrainingRuntime/ClusterTrainingRuntime blueprint that users can "safely" use to maximize the potential of their apps. I'm also imagining that these 2 "Platform Engineer" personas will configure different fields in these CRs. So ownership is kind of easy to follow.

After reading the goals, it's as if there's a 4th "implied" persona. That of the frameworks/controllers/services (like Kueue) but also power-users that use the API to enhance user experience. This persona is basically overriding what the Admins decided in the TrainingRuntime/ClusterTrainingRuntime templates for example to optimize the TrainJobs in some way which is not possible without up-to-date information or details about the TrainJob at hand (e.g. scheduling hints based on admission checks from Kueue).

If so, would it make sense to add one more goal to specify how these entities/people can customize the Runtime that a Job uses by referencing that there's a mechanism in place called RuntimePatches?

Suggested change
- Introduce the `TrainingRuntime` and `ClusterTrainingRuntime` APIs that will store blueprints
for model training and LLM fine-tuning using various ML frameworks. These runtimes will be built
on top of `JobSet` APIs with additional functionality for special use-cases.
For example, training using MPI orchestration.
- Enable `TrainJob` to customize `TrainingRuntime` templates through the `RuntimePatches` API,
supporting configuration injection by controllers, admission webhooks, and custom clients.

Member Author

Yeah, this is great idea!

The below diagram shows how platform engineers manage `TrainingRuntime` and how data scientists
create `TrainJob`:

![user-roles](./user-roles.drawio.svg)


Show that the Runtime configuration of TrainJobs can be patched by external controllers too?


The webhook will validate that TargetJob and Container name exist in the Runtime Job template.
The webhook validates that the container names in `Containers` and `InitContainers` exist in


I can see validation/patching becoming tricky if we allow multiple managers to patch the same field.

Should we add a validation rule here to forbid this scenario ?

Member Author

Yes, we discussed it here with @astefanutti: #3199 (comment)
We should forbid multiple managers to apply the same override.
I understand that it might be trickier for use-cases where, for example, multiple managers need to apply an ENV var to the same container, but I hope we can discuss that later.

@google-oss-prow

@VassilisVassiliadis: changing LGTM is restricted to collaborators

Details

In response to this:

Overall the RuntimePatches API makes sense to me as a solid foundation for facilitating the integration with frameworks like Kueue. I left some minor comments here and there with my thoughts regarding details of the proposed API.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow

New changes are detected. LGTM label has been removed.

```go
// metadata patches the JobSet object metadata.
// Only labels and annotations are allowed.
// +optional
Metadata *metav1.ObjectMeta `json:"metadata,omitempty"`
```
Member

The comment states "Only labels and annotations are allowed" but the field type is *metav1.ObjectMeta, which includes many other fields (Name, Namespace, Finalizers, OwnerReferences, etc.). Without schema-level enforcement, users could set fields that are either ignored or cause unexpected behavior, right?

Member

can we use a separate restricted type instead of full ObjectMeta.. like a MetadataPatch struct with labels and annotations listed ?

Contributor

@krishdef7 Mar 10, 2026

Using `*metav1.ObjectMeta` for metadata patches while documenting "only labels and annotations are allowed" creates a silent contract violation: users could set Finalizers, OwnerReferences, or Namespace, which would either be ignored or cause unexpected reconcile behavior.
A restricted type enforces this at the schema level:

```go
type MetadataPatch struct {
	// +optional
	Labels map[string]string `json:"labels,omitempty"`
	// +optional
	Annotations map[string]string `json:"annotations,omitempty"`
}
```

This would replace `*metav1.ObjectMeta` in `JobSetTemplatePatch`, `JobTemplatePatch`, and `PodTemplatePatch`, so the constraint becomes structural rather than documentary.
Also noting that #3285 proposes adding terminationGracePeriodSeconds to PodSpecPatch, which seems like a natural fit here given this PR defines that struct.
Happy to implement both the MetadataPatch refactor and the terminationGracePeriodSeconds addition in a follow-up PR once this merges, if that's useful.
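For illustration, a minimal sketch of the merge semantics such a restricted type could enforce. The helper names and the patch-wins precedence rule are assumptions for this sketch, not part of this PR:

```go
package main

import "fmt"

// MetadataPatch mirrors the restricted struct suggested above:
// only labels and annotations can be patched.
type MetadataPatch struct {
	Labels      map[string]string
	Annotations map[string]string
}

// merge copies base and overlays patch entries, with patch
// values winning on key collisions.
func merge(base, patch map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(patch))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range patch {
		out[k] = v
	}
	return out
}

// applyMetadataPatch returns the patched labels and annotations.
// By construction it cannot touch any other metadata field.
func applyMetadataPatch(labels, annotations map[string]string, p MetadataPatch) (map[string]string, map[string]string) {
	return merge(labels, p.Labels), merge(annotations, p.Annotations)
}

func main() {
	labels, _ := applyMetadataPatch(
		map[string]string{"app": "trainjob"},
		nil,
		MetadataPatch{Labels: map[string]string{"team": "ml"}},
	)
	fmt.Println(labels["app"], labels["team"]) // trainjob ml
}
```

With the full `ObjectMeta`, fields like Finalizers would have to be ignored at apply time; here they simply cannot be expressed.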

Contributor

There has been a discussion about that point when the metadata field got introduced in PodTemplateOverrides with #2785.

@andreyvelich @tenzen-y do you remember why we went for using *metav1.ObjectMeta instead of only a subset of it?

Member Author

Found this comment from @tenzen-y: #2785 (comment)
I think using the well-known `ObjectMeta` API allows us to avoid confusion about which fields users can set there.

Contributor

Right, @tenzen-y's point is to avoid possibly introducing fields that are not in metav1.ObjectMeta.

@andreyvelich andreyvelich mentioned this pull request Mar 10, 2026
8 tasks
@andreyvelich
Member Author

Implemented in: #3309
Thanks everyone 🚀
/close

@google-oss-prow google-oss-prow bot closed this Mar 12, 2026
@google-oss-prow

@andreyvelich: Closed this PR.

Details

In response to this:

Implemented in: #3309
Thanks everyone 🚀
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

10 participants