Environment
- How did you deploy Kubeflow Pipelines (KFP)? Using kubeflow/manifests
- KFP version: 1.9.1 (but I think it impacts 1.10 too)
- KFP SDK version: 1.8.22 and 2.8.0
Steps to reproduce
Setting `POD_NAMES=v1` in the workflow controller resolves the "ML Metadata not found" issue in kubeflow/pipelines#11457.
However, this change introduces a bug that prevents users from retrying failed pipelines. The sequence is:
- A user starts a run.
- The pipeline fails due to one or more failing components.
- The user clicks "Retry" in the web UI.
- The pipeline gets stuck in a pending state, and no new pods are scheduled.
This happens because the pods associated with the failed components are not deleted. When the retried workflow tries to recreate them, the workflow controller logs:
`level=info msg="Failed pod ... creation: already exists"`
As a result, the Kubeflow pipeline remains stuck in the pending state.
Expected result
The failed pipeline is retried.
Materials and Reference
The reason for this bug is that:
- The `GenerateRetryExecution` function uses `RetrievePodName` to collect the list of pods to delete.
- When `POD_NAMES=v1` is set, the actual pod names match the Argo Workflows node IDs, not the names computed by `RetrievePodName`.
- This causes the deletion to fail, leaving the old pods behind and blocking pipeline retries (see the sketch after this list).
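To make the mismatch concrete, here is a minimal Go sketch. The helper names and example values are hypothetical and the v2 formula is simplified (Argo's real v2 names also handle length limits); the point is only that a v1 pod name is the node ID, while a name computed under the v2 scheme looks entirely different:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fnvHash is an illustrative stand-in for the hash Argo applies to the
// node name when building v2 pod names.
func fnvHash(s string) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(s))
	return h.Sum32()
}

// podNameV1 returns the v1-style pod name, which is simply the Argo node ID.
func podNameV1(nodeID string) string {
	return nodeID
}

// podNameV2 returns a simplified v2-style pod name built from the workflow
// name, the template name, and a hash of the node name.
func podNameV2(workflowName, templateName, nodeName string) string {
	return fmt.Sprintf("%s-%s-%d", workflowName, templateName, fnvHash(nodeName))
}

func main() {
	// Hypothetical values for a failed step in a run.
	workflowName := "my-pipeline-abc12"
	templateName := "train-model"
	nodeName := "my-pipeline-abc12.train-model"
	nodeID := "my-pipeline-abc12-1234567890"

	// With POD_NAMES=v1 the pod that actually exists is named after the node
	// ID, so a retry that computes the v2-style name never finds it to delete.
	fmt.Println("actual pod name (v1): ", podNameV1(nodeID))
	fmt.Println("computed pod name (v2):", podNameV2(workflowName, templateName, nodeName))
}
```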
Proposed Solution
To fix this, we should use the `workflows.argoproj.io/pod-name-format` annotation in the `RetrievePodName` function.
A similar solution is implemented in the frontend in this PR: #11682
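As a rough sketch of what an annotation-aware lookup could look like on the backend (the function signature is an assumption, `podNameV2` is the illustrative helper from the sketch above, and this is not the existing `RetrievePodName` code):

```go
package main

import (
	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
)

// retrievePodName is a hypothetical, annotation-aware variant of the lookup
// used when collecting pods to delete for a retry.
func retrievePodName(wf *wfv1.Workflow, node wfv1.NodeStatus) string {
	// Argo records the naming scheme it actually used as an annotation on
	// the workflow object.
	if wf.Annotations["workflows.argoproj.io/pod-name-format"] == "v1" {
		// v1: the pod is named after the Argo node ID.
		return node.ID
	}
	// v2 (the default in recent Argo versions): workflow/template/node based.
	return podNameV2(wf.Name, node.TemplateName, node.Name)
}
```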
Impacted by this bug? Give it a 👍.