Fix: tfjob gets stuck in running state when succeeded pods are garbage collected #44

ChenYi015 · 2025-08-26T09:53:28Z

Bug Reproduction

Create a TFJob as follows:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: test
  namespace: default
spec:
  successPolicy: AllWorkers
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        metadata:
          labels:
            pod-group.scheduling.sigs.k8s.io/name: test
            pod-group.scheduling.sigs.k8s.io/min-available: "3"
        spec:
          containers:
          - name: tensorflow
            image: ubuntu:22.04
            command:
            - bash
            - -c
            args:
            - |
              t=$((((RANDOM % 3) + 2) * 10))
              echo "sleep $t seconds..."
              sleep $t
```

Delete any worker pod as soon as it completes; the TFJob then stays in the Running state forever. This mimics the situation where worker pods are garbage collected shortly after they finish because the k8s cluster holds more than 12500 completed pods (the default GC threshold).

Proposed Changes

Add a new field .status.replicaStatuses.pending that counts pending pods.
Fix the bug where a tfjob gets stuck in the Running state when succeeded pods are garbage collected:
- Before: the tfjob is marked Succeeded when expected==0, i.e. when the number of succeeded pods equals replicas.
- After: the tfjob is marked Succeeded when the job is in the Running state and no pods are pending, running, or failed.
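The "After" rule can be sketched roughly as below. This is an illustrative approximation only: `ReplicaStatus` and `allWorkersDone` are hypothetical names, and the real change in pkg/controller.v1/tensorflow/status.go may be structured differently.

```go
package main

import "fmt"

// ReplicaStatus is a simplified, hypothetical set of per-replica-type
// counters. Pending corresponds to the new field proposed in this PR.
type ReplicaStatus struct {
	Pending   int
	Active    int
	Succeeded int
	Failed    int
}

// allWorkersDone sketches the post-fix rule: a job that is already Running
// is treated as Succeeded when no pods are pending, running, or failed.
// It does not depend on how many succeeded pods are still visible.
func allWorkersDone(jobRunning bool, s ReplicaStatus) bool {
	return jobRunning && s.Pending == 0 && s.Active == 0 && s.Failed == 0
}

func main() {
	// Two succeeded pods remain; the third was garbage collected.
	afterGC := ReplicaStatus{Succeeded: 2}
	fmt.Println(allWorkersDone(true, afterGC)) // true

	// One worker is still running, so the job is not finished.
	inProgress := ReplicaStatus{Active: 1, Succeeded: 2}
	fmt.Println(allWorkersDone(true, inProgress)) // false
}
```

In this sketch, requiring the job to be Running first presumably keeps a freshly created job, whose counters are all zero, from being reported as Succeeded before any pod exists.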

pkg/controller.v1/tensorflow/status.go

cheyang

/lgtm
/approve

ChenYi015 changed the title ~~Fix: tfjob gets stuck in running state when succeeded pods get garbage collected~~ Fix: tfjob gets stuck in running state when succeeded pods are garbage collected Aug 26, 2025

cheyang requested changes Aug 27, 2025

pkg/controller.v1/tensorflow/status.go (outdated, resolved)

ChenYi015 requested a review from cheyang August 28, 2025 02:33

ChenYi015 added 2 commits September 3, 2025 10:34

Add new field '.status.replicaStatuses.pending'

af05c3e

Signed-off-by: Yi Chen <[email protected]>

Fix: tfjob gets stuck in running state when succeeded pods get garbag…

97973c2

…e collected Signed-off-by: Yi Chen <[email protected]>

ChenYi015 force-pushed the fix/tfjob-success branch from eee5d28 to f5a9c3d Compare September 3, 2025 02:34

Make the if condition statement easier to understand

de4236c

Signed-off-by: Yi Chen <[email protected]>

ChenYi015 force-pushed the fix/tfjob-success branch from f5a9c3d to de4236c Compare September 3, 2025 03:52

cheyang approved these changes Sep 3, 2025

ChenYi015 merged commit f12d01e into AliyunContainerService:ack-tf-operator Sep 3, 2025
2 checks passed

ChenYi015 mentioned this pull request Sep 3, 2025

Fix: tfjob gets stuck in running state when succeeded pods are garbage collected kubeflow/arena#1370

Merged

4 tasks
