fix bug about status absence when worker pod spec is invalid #606
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
This looks fine as a fix to unblock the issue. Any thoughts? @alculquicondor @tenzen-y
@@ -961,8 +961,13 @@ func (c *MPIJobController) getOrCreateWorker(mpiJob *kubeflow.MPIJob) ([]*corev1
	// If an error occurs during Get/Create, we'll requeue the item so we
	// can attempt processing again later. This could have been caused by a
	// temporary network failure, or any other transient reason.
	// But if the error indicates an invalid pod spec, retrying would be
	// futile; the status of the job should turn to Failed.
	if err != nil {
		c.recorder.Eventf(mpiJob, corev1.EventTypeWarning, mpiJobFailedReason, "worker pod creation failed: %v", err)
This is only one of the cases where there could be an invalid Pod template.
It might be better to return this error and handle it more generically in syncHandler, so we can cover the launcher pod, the worker pods, and any other validation errors:
if errs := validation.ValidateMPIJob(mpiJob); len(errs) != 0 {
Agree.
I have examined how Pod Spec validation is performed in the Kubernetes project. The relevant code can be found in the "k8s.io/kubernetes/pkg/apis/core/validation" package.
However, it seems that this package is not usable outside of the Kubernetes project.
I didn't mean that you should use the validation code from Kubernetes.
I just meant that there are multiple cases in which we can't retry, and this PR covers only one of them.
Closes #604
When a worker pod fails to be created, the current behavior is to retry later. However, retrying does not resolve the issue if the failure is due to an invalid pod spec. In this PR, I check the failure reason first; if the failure is caused by an invalid pod spec, the job's status is updated to "Failed" without any retries.
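The "update the Job's status to Failed" step can be sketched as follows (stdlib-only; the field, type, and reason names are illustrative stand-ins for the operator's actual JobCondition type, not its real API):

```go
package main

import (
	"fmt"
	"time"
)

// jobCondition mirrors the rough shape of a job status condition;
// the field set here is illustrative, not the operator's actual type.
type jobCondition struct {
	Type           string
	Reason         string
	Message        string
	LastTransition time.Time
}

type jobStatus struct {
	Conditions []jobCondition
}

// markFailed appends a terminal Failed condition so the controller
// stops requeueing once the pod spec is known to be invalid.
func markFailed(s *jobStatus, reason, msg string) {
	s.Conditions = append(s.Conditions, jobCondition{
		Type:           "Failed",
		Reason:         reason,
		Message:        msg,
		LastTransition: time.Now(),
	})
}

func main() {
	var status jobStatus
	markFailed(&status, "MPIJobFailed", "worker pod creation failed: invalid pod spec")
	fmt.Println(status.Conditions[0].Type) // Failed
}
```

Recording the condition (rather than silently dropping the error) is what makes the failure visible to users running kubectl describe, which is the status absence this PR fixes.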