Description
Similar to #1067, users should be able to specify which tolerations they would like their job pods to have.
Motivation/Background
This will allow users to run jobs on tainted nodes, for example for testing such as hardware validation. It also makes the Kubernetes cluster more flexible: operators can taint nodes to keep other pods from being scheduled on them while still allowing runs from TorchX.
Detailed Proposal
Add tolerations as a run-opt to the KubernetesScheduler run_opts, KubernetesOpts, and other entry points. Apply the user-specified tolerations to the pod in the role_to_pod method.
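To make the proposal concrete, here is a minimal sketch of how a tolerations run-opt value could be parsed. The string format (`key=value:effect` or `key:effect`, comma-separated, borrowed from `kubectl taint` syntax) and the helper names `parse_toleration` and `parse_tolerations` are assumptions for illustration and are not part of TorchX. The output uses the field names of the Kubernetes toleration schema (`key`, `operator`, `value`, `effect`).

```python
from typing import Dict, List

# Taint effects defined by Kubernetes.
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def parse_toleration(spec: str) -> Dict[str, str]:
    """Parse one hypothetical 'key[=value][:effect]' string into a toleration dict."""
    key_part, sep, effect = spec.partition(":")
    if sep and effect not in VALID_EFFECTS:
        raise ValueError(f"invalid effect {effect!r} in toleration {spec!r}")

    key, eq, value = key_part.partition("=")
    if not key:
        raise ValueError(f"missing key in toleration {spec!r}")

    if eq:
        # A value was given, so the taint must match key and value exactly.
        toleration = {"key": key, "operator": "Equal", "value": value}
    else:
        # No value given: tolerate any taint with this key.
        toleration = {"key": key, "operator": "Exists"}

    if effect:
        # Omitting the effect means the toleration matches all effects.
        toleration["effect"] = effect
    return toleration


def parse_tolerations(opt: str) -> List[Dict[str, str]]:
    """Parse a comma-separated run-opt value into a list of toleration dicts."""
    return [parse_toleration(s.strip()) for s in opt.split(",") if s.strip()]
```

Under these assumptions, `parse_tolerations("dedicated=validation:NoSchedule")` yields one toleration with operator `Equal`, and an empty string yields an empty list so that no tolerations are added unless the user asks for them. The resulting entries could then be converted to toleration objects and set on the pod spec that role_to_pod builds.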
Alternatives
I can't think of any alternatives. As far as I know, there is no built-in support for custom tolerations currently. Tolerations significantly change the pod scheduling behavior, so they should only be added when the user requests them.
Additional context/links
Relevant code linked above.
Documentation: https://docs.pytorch.org/torchx/main/schedulers/kubernetes.html