KEP-2170: Adding validation webhook for v2 trainjob #2307

Open
wants to merge 1 commit into
base: master

Contributor

@akshaychitneni akshaychitneni commented Oct 24, 2024

Adds a validation webhook for the v2 TrainJob.
Relates to #2209

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2209

Checklist:

  • Docs included if any changes are user facing


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@akshaychitneni
Contributor Author

cc @tenzen-y @andreyvelich

@coveralls

coveralls commented Oct 25, 2024

Pull Request Test Coverage Report for Build 11709371378

Details

  • 6 of 6 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 11663764609: 0.0%
Covered Lines: 78
Relevant Lines: 78

💛 - Coveralls

Member

@tenzen-y tenzen-y left a comment

Thank you for taking this on and moving it forward, and sorry for the delay.

pkg/controller.v2/trainjob_controller.go (resolved)
Comment on lines +69 to +71
Namespace: new.Namespace,
Name: new.Spec.RuntimeRef.Name,
Member

Have you ever seen issues when we use the old object's names?

Member

Why do we get the new object here and not the old one?

pkg/runtime.v2/core/trainingruntime.go (outdated, resolved)
pkg/runtime.v2/framework/plugins/jobset/jobset.go (outdated, resolved)
@@ -140,3 +143,115 @@ func (j *JobSet) ReconcilerBuilders() []runtime.ReconcilerBuilder {
},
}
}

func (j *JobSet) Validate(oldObj, newObj *kubeflowv2.TrainJob, runtimeInfo *runtime.Info) (admission.Warnings, field.ErrorList) {
Member

It seems there are some conflicts between @andreyvelich's PR and this one.
@akshaychitneni Could you consult with @andreyvelich about which PR we should merge into the main branch first?

Contributor Author

@akshaychitneni akshaychitneni Nov 6, 2024

I rebased onto @andreyvelich's changes.

@@ -31,7 +31,7 @@ func Setup(mgr ctrl.Manager, runtimes map[string]runtime.Runtime) (string, error
 		return kubeflowv2.TrainingRuntimeKind, err
 	}
 	if err := setupWebhookForTrainJob(mgr, runtimes); err != nil {
-		return "TrainJob", err
+		return kubeflowv2.TrainJobKind, err
Member

Nice catch.

pkg/webhook.v2/trainjob_webhook.go (outdated, resolved)
pkg/webhook.v2/trainjob_webhook.go (outdated, resolved)
Member

Cool! This is the architecture I imagined during the design phase of my KubeflowJobPipeline framework.

failedCtrlName, err := controllerv2.SetupControllers(mgr, runtimes)
gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred(), "controller", failedCtrlName)
gomega.ExpectWithOffset(1, failedCtrlName).To(gomega.BeEmpty())
if startControllers {
Member

Have you seen any issues, such as a nil pointer dereference, when we start the controllers for webhook testing?

Contributor Author

I think I have seen them, but we might not need to start the controllers just to validate create/update requests; we can leave reconciliation to the reconciler tests.

fixing runtime

Signed-off-by: Akshay Chitneni <[email protected]>
Member

@andreyvelich andreyvelich left a comment

Thank you for this effort @akshaychitneni!
I left initial comments.

Comment on lines +29 to +31
// JobExporter is the Job name for the exporter.
JobExporter string = "exporter"

Member

Could we implement the exporter validation later, once we design it as part of #2245?
We should discuss whether to use a sidecar container or another ReplicatedJob for model checkpointing.
cc @saileshd1402 @akshaychitneni @tenzen-y

// ContainerModelInitializer is the container name for the model initializer.
ContainerModelInitializer string = "model-initializer"

// ContainerModelExporter is the container name for the model exporter.
ContainerModelExporter string = "model-exporter"
Member

The same for the container.

@@ -0,0 +1,17 @@
package runtime
Member

If it is a runtime util, shouldn't we move it under /pkg/runtime.v2/util? @akshaychitneni @tenzen-y

return r.framework.RunComponentBuilderPlugins(ctx, jobSetTemplate.DeepCopy(), info, trainJob)
}

func (r *TrainingRuntime) runtimeInfo(
Member

Should this be part of the Runtime interface:

type Runtime interface {

And should we give this API a more explicit name (e.g. getRuntimeInfo() or initializeRuntimeInfo())?

numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode")
if runtimeInfo.MLPolicy.MPI != nil {
if _, err := strconv.Atoi(*newJobObj.Spec.Trainer.NumProcPerNode); err != nil {
allErrs = append(allErrs, field.Invalid(numProcPerNodePath, newJobObj.Spec.Trainer.NumProcPerNode, "should have an int value"))
Member

I think so. Is this value compatible with the Kubernetes API conventions: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md?

numProcPerNodePath := specPath.Child("trainer").Child("numProcPerNode")
if runtimeInfo.RuntimePolicy.MLPolicy.Torch != nil && newObj.Spec.Trainer.NumProcPerNode != nil {
allowedStringValList := []string{"auto", "cpu", "gpu"}
numProcPerNode := *newObj.Spec.Trainer.NumProcPerNode
Member

@akshaychitneni @tenzen-y Couldn't we use CEL for this validation, since we are only validating the values of the .nProcPerNode parameter?
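
For reference, the value check being discussed can be sketched in plain Go as follows. This is a minimal illustration, not the code in this PR: `validateTorchNumProcPerNode` is a hypothetical helper, and accepting a base-10 integer in addition to the three keywords is an assumption based on the MPI check quoted earlier in this review.

```go
package main

import (
	"fmt"
	"strconv"
)

// allowedNumProcPerNode holds the string values from allowedStringValList
// in the snippet under review.
var allowedNumProcPerNode = map[string]bool{"auto": true, "cpu": true, "gpu": true}

// validateTorchNumProcPerNode is a hypothetical helper (not part of the PR).
// It accepts one of the allowed keywords or, by assumption, any value that
// strconv.Atoi can parse as an integer. It does not check the sign.
func validateTorchNumProcPerNode(v string) error {
	if allowedNumProcPerNode[v] {
		return nil
	}
	if _, err := strconv.Atoi(v); err != nil {
		return fmt.Errorf("numProcPerNode %q must be auto, cpu, gpu, or an int value", v)
	}
	return nil
}

func main() {
	// Print the validation result for a few sample values.
	for _, v := range []string{"auto", "4", "tpu"} {
		fmt.Printf("%-5s -> %v\n", v, validateTorchNumProcPerNode(v))
	}
}
```

A rule this small is the kind of check that could plausibly be expressed declaratively instead, which is the trade-off the comment above raises.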

 // TODO: Need to implement validateions for TorchJob.
-func (t *Torch) Validate(oldObj, newObj *kubeflowv2.TrainJob) (admission.Warnings, field.ErrorList) {
-	return nil, nil
+func (t *Torch) Validate(runtimeJobTemplate client.Object, runtimeInfo *runtime.Info, oldObj, newObj *kubeflowv2.TrainJob) (admission.Warnings, field.ErrorList) {
Member

As I mentioned here, we should also validate that the TrainJob doesn't set PET_ envs for the Trainer:

// TODO (andreyvelich): Add validation to check that TrainJob doesn't have "PET_" envs.
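
A check along these lines could look roughly like the sketch below. It is only an illustration under stated assumptions: `EnvVar` is a local stand-in for `corev1.EnvVar` so the example has no dependencies, and `findReservedEnvs` is a hypothetical helper name, not something defined in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// EnvVar is a local stand-in for corev1.EnvVar (assumption: only Name and
// Value are needed for this check).
type EnvVar struct {
	Name  string
	Value string
}

// reservedEnvPrefix is the prefix named in the review comment.
const reservedEnvPrefix = "PET_"

// findReservedEnvs is a hypothetical helper that returns the names of all
// envs whose name starts with the reserved prefix.
func findReservedEnvs(envs []EnvVar) []string {
	var reserved []string
	for _, e := range envs {
		if strings.HasPrefix(e.Name, reservedEnvPrefix) {
			reserved = append(reserved, e.Name)
		}
	}
	return reserved
}

func main() {
	envs := []EnvVar{
		{Name: "LOG_LEVEL", Value: "info"},
		{Name: "PET_NNODES", Value: "2"},
	}
	// Each returned name would become one validation error in a webhook.
	fmt.Println(findReservedEnvs(envs))
}
```

In a webhook, each returned name would typically be reported as a separate field error on the Trainer env list.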

return nil, nil
}

if newObj.Spec.ModelConfig != nil && newObj.Spec.ModelConfig.Input != nil {
Member

I think for now we should check the initContainers in the JobSet, as I mentioned here: https://github.com/kubeflow/training-operator/blob/master/pkg/runtime.v2/framework/plugins/jobset/builder.go#L87-L89
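
One possible shape for that check, as a sketch only: `Container` is a local stand-in for `corev1.Container`, `validateModelInput` is a hypothetical helper, and the `hasModelInput` flag stands for the `ModelConfig` and `Input` nil checks in the line under review. The container name is taken from the `ContainerModelInitializer` constant quoted earlier; how the init containers are extracted from the JobSet template is not shown and is left as an assumption.

```go
package main

import "fmt"

// containerModelInitializer mirrors the ContainerModelInitializer constant
// quoted earlier in this review.
const containerModelInitializer = "model-initializer"

// Container is a local stand-in for corev1.Container; only the name matters.
type Container struct {
	Name string
}

// hasContainer reports whether a container with the given name is present.
func hasContainer(containers []Container, name string) bool {
	for _, c := range containers {
		if c.Name == name {
			return true
		}
	}
	return false
}

// validateModelInput is a hypothetical helper. When the TrainJob sets a model
// input, it requires an init container named "model-initializer".
func validateModelInput(hasModelInput bool, initContainers []Container) error {
	if hasModelInput && !hasContainer(initContainers, containerModelInitializer) {
		return fmt.Errorf("runtime must define an init container named %q when a model input is set",
			containerModelInitializer)
	}
	return nil
}

func main() {
	fmt.Println(validateModelInput(true, nil))
	fmt.Println(validateModelInput(true, []Container{{Name: containerModelInitializer}}))
}
```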

gomega.Expect(k8sClient.DeleteAllOf(ctx, &kubeflowv2.TrainJob{}, client.InNamespace(ns.Name))).To(gomega.Succeed())
})

ginkgo.When("Creating TrainJob", func() {
Member

@andreyvelich andreyvelich Nov 7, 2024

@tenzen-y @akshaychitneni What is the right way to test our validations: with integration tests or unit tests?

4 participants