feat(run-pod): allow image pull retries in run pod collector #1811

Merged: 2 commits, Jul 25, 2025

diamonwiggins
Member

Description, Motivation and Context

This PR adds a new allowImagePullRetries field to the RunPod collector configuration to control how ImagePullBackOff errors are handled.

Problem: Currently, when a RunPod collector encounters a pod in ImagePullBackOff state, it immediately fails regardless of any configured timeout. This prevents users from handling scenarios where image pulls might eventually succeed (e.g., temporary registry issues, authentication delays, or slow networks).

Solution: Add an optional boolean field, allowImagePullRetries. When it is enabled, an ImagePullBackOff condition no longer fails the collector immediately. Instead, the collector keeps waiting until the configured timeout expires.

Behavior:

  • When allowImagePullRetries: false (default): keeps the existing behavior and fails immediately on ImagePullBackOff
  • When allowImagePullRetries: true: waits up to the configured timeout, giving image pull retries a chance to succeed

Usage Example:

- runPod:
    timeout: "5m"
    allowImagePullRetries: true
    podSpec:
      containers:
      - name: my-container
        image: my-private-image

Checklist

  • New and existing tests pass locally with introduced changes.
  • [] Tests for the changes have been added (for bug fixes / features)
  • The commit message(s) are informative and highlight any breaking changes
  • Any documentation required has been added/updated. For changes to https://troubleshoot.sh/ create a PR here

Does this PR introduce a breaking change?

  • Yes
  • No

diamonwiggins added the type::feature (New feature or request) label Jul 24, 2025
diamonwiggins requested a review from a team as a code owner July 24, 2025 23:01
emosbaugh previously approved these changes Jul 24, 2025
diamonwiggins merged commit 2861425 into main Jul 25, 2025
21 checks passed
diamonwiggins deleted the diamonwiggins/run-pod-image-pull-timeout branch July 25, 2025 15:30