@ChenYi015 ChenYi015 commented Aug 26, 2025

Bug Reproduction

Create a TFJob as follows:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: test
  namespace: default
spec:
  successPolicy: AllWorkers
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: Never
      template:
        metadata:
          labels:
            pod-group.scheduling.sigs.k8s.io/name: test
            pod-group.scheduling.sigs.k8s.io/min-available: "3"
        spec:
          containers:
          - name: tensorflow
            image: ubuntu:22.04
            command:
            - bash
            - -c
            args:
            - |
              t=$((((RANDOM % 3) + 2) * 10))
              echo "sleep $t seconds..."
              sleep $t

Delete any worker pod as soon as it has completed; the TFJob then stays in the Running state forever. This mimics what happens when completed worker pods are garbage collected shortly after they finish, which occurs once the Kubernetes cluster holds more than 12500 completed pods (the default GC threshold).

Proposed Changes

  • Add a new field, .status.replicaStatuses.pending, to count pending pods.
  • Fix bug: a TFJob gets stuck in the Running state when succeeded pods are garbage collected.
    • Before: the TFJob is marked Succeeded when expected==0, that is, when the number of succeeded pods equals replicas. A garbage-collected pod is no longer counted as succeeded, so this condition can never become true.
    • After: the TFJob is marked Succeeded when the job is in the Running state and no pods are pending, running, or failed.
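
The difference between the two rules can be illustrated with a minimal Go sketch. This is not the operator's actual code: the `ReplicaStatus` struct, its field names, and both helper functions are hypothetical simplifications, and `expected` is assumed to be computed as `replicas - succeeded`, as the "Before" bullet suggests.

```go
package main

import "fmt"

// ReplicaStatus is a hypothetical, simplified view of the pod counts the
// controller observes for one replica type. Pending stands in for the newly
// added pending counter.
type ReplicaStatus struct {
	Pending   int32
	Running   int32
	Succeeded int32
	Failed    int32
}

// succeededBefore sketches the old rule: the job succeeds only when the number
// of observed succeeded pods equals the requested replicas (expected == 0).
func succeededBefore(replicas int32, s ReplicaStatus) bool {
	expected := replicas - s.Succeeded
	return expected == 0
}

// succeededAfter sketches the new rule: the job succeeds when it is already in
// the Running state and no pods are pending, running, or failed.
func succeededAfter(jobRunning bool, s ReplicaStatus) bool {
	return jobRunning && s.Pending == 0 && s.Running == 0 && s.Failed == 0
}

func main() {
	// Three workers finished successfully, but one completed pod was garbage
	// collected, so the controller only observes two succeeded pods.
	observed := ReplicaStatus{Succeeded: 2}

	fmt.Println("before:", succeededBefore(3, observed))
	fmt.Println("after: ", succeededAfter(true, observed))
}
```

With one of three succeeded pods missing, the old rule never reports success, while the new rule does once nothing is pending, running, or failed. Requiring the job to already be in the Running state presumably prevents a job with no pods created yet from being treated as succeeded.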

@ChenYi015 ChenYi015 changed the title from "Fix: tfjob gets stuck in running state when succeeded pods get garbage collected" to "Fix: tfjob gets stuck in running state when succeeded pods are garbage collected" Aug 26, 2025
@ChenYi015 ChenYi015 requested a review from cheyang August 28, 2025 02:33
@cheyang cheyang left a comment

/lgtm
/approve

@ChenYi015 ChenYi015 merged commit f12d01e into AliyunContainerService:ack-tf-operator Sep 3, 2025
2 checks passed