
[backend] Backwards compatibility error with POD_NAMES v1 for v1 runs #12308

@112358fn

Description

Environment

  • How did you deploy Kubeflow Pipelines (KFP)?
    Using kubeflow/manifests
  • KFP version:
    1.9.1 (but I think it impacts 1.10 too)
  • KFP SDK version:
    1.8.22 and 2.8.0

Steps to reproduce

Setting POD_NAMES=v1 in the workflow controller resolves the "ML Metadata not found" issue in kubeflow/pipelines#11457.

However, this change introduces a bug that prevents users from retrying failed pipelines. The sequence is:

  1. A user starts a run.
  2. The pipeline fails due to one or more failing components.
  3. The user clicks "Retry" in the web UI.
  4. The pipeline gets stuck in a pending state, and no new pods are scheduled.

This happens because the pods associated with the failed components are never deleted, so when the retry tries to recreate them the workflow controller logs:

level=info msg="Failed pod ... creation: already exists"

As a result, the Kubeflow pipeline remains stuck in the pending state.

Expected result

The failed pipeline is retried.

Materials and Reference

The reason for this bug is that:

  • The GenerateRetryExecution function uses RetrievePodName to collect the list of pods to delete.
  • When POD_NAMES=v1 is set, the actual pod names are the Argo Workflows node IDs, while RetrievePodName still derives v2-style names.
  • The delete calls therefore target pod names that do not exist; the failed pods are left behind and block the retry (see the sketch after this list).
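To make the mismatch concrete, here is a minimal, self-contained sketch of the two naming schemes. This is not the actual KFP or Argo Workflows code: the real v2 formula also truncates long prefixes, so treat the helper below as an approximation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// podNameV1 mirrors the v1 behaviour: the pod name equals the Argo node ID.
func podNameV1(nodeID string) string {
	return nodeID
}

// podNameV2 approximates the v2 scheme: <workflow>-<template>-<hash(nodeName)>.
// (Simplified illustration; the real implementation also limits the prefix length.)
func podNameV2(workflowName, templateName, nodeName string) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	return fmt.Sprintf("%s-%s-%d", workflowName, templateName, h.Sum32())
}

func main() {
	nodeID := "my-pipeline-abc123-1234567890"
	fmt.Println("v1 pod name:", podNameV1(nodeID))
	fmt.Println("v2 pod name:", podNameV2("my-pipeline-abc123", "train-model", "my-pipeline-abc123.train-model"))
	// The two names never match, so deleting by the v2 name leaves the v1 pod behind.
}
```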

Proposed Solution

To fix this, we should read the annotation workflows.argoproj.io/pod-name-format in the RetrievePodName function and return the node ID when the recorded format is v1.
A similar solution was implemented in the frontend in this PR: #11682
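A minimal sketch of that idea follows. It is illustrative only: retrievePodName is a stand-in for the backend function, and it assumes Argo records "v1"/"v2" in the annotation, which is what the frontend fix relies on.

```go
package main

import "fmt"

const podNameFormatAnnotation = "workflows.argoproj.io/pod-name-format"

// retrievePodName picks the pod name based on the annotation Argo sets on the Workflow.
// With format "v1" the pod name is the node ID itself, so the delete targets the right pod;
// otherwise fall back to the computed v2-style name.
func retrievePodName(annotations map[string]string, nodeID, v2Name string) string {
	if annotations[podNameFormatAnnotation] == "v1" {
		return nodeID
	}
	return v2Name
}

func main() {
	annotations := map[string]string{podNameFormatAnnotation: "v1"}
	fmt.Println("pod to delete:",
		retrievePodName(annotations, "my-pipeline-abc123-1234567890", "my-pipeline-abc123-train-model-987654321"))
}
```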


Impacted by this bug? Give it a 👍.
