feat(run-pod): allow image pull retries in run pod collector #1811
Description, Motivation and Context
This PR adds a new `allowImagePullRetries` field to the RunPod collector configuration to control how `ImagePullBackOff` errors are handled.

Problem: Currently, when a RunPod collector encounters a pod in the `ImagePullBackOff` state, it fails immediately, regardless of any configured timeout. As a result, the collector cannot tolerate situations where the image pull would eventually succeed (e.g., temporary registry issues, authentication delays, or slow networks).
Solution: Added an optional `allowImagePullRetries` boolean field. When it is set, an `ImagePullBackOff` condition no longer fails the collector immediately; instead, the collector keeps waiting until the configured timeout expires.

Behavior:
- `allowImagePullRetries: false` (default): maintains the existing behavior and fails immediately on `ImagePullBackOff`.
- `allowImagePullRetries: true`: waits up to the configured timeout, giving image pull retries a chance to succeed.

Usage Example:
Checklist
Does this PR introduce a breaking change?