Add tolerations to KubernetesScheduler run opts #1068

@JackWittmayer

Description

Similar to #1067, users should be able to specify which tolerations they would like their job pods to have.

Motivation/Background

This will allow users to run jobs on tainted nodes for testing purposes such as hardware validation. It also makes the Kubernetes cluster more flexible: operators can taint nodes to keep certain pods from being scheduled on them while still allowing runs from TorchX.

Detailed Proposal

Add tolerations as a run-opt to the KubernetesScheduler run_opts, KubernetesOpts, and other entry points. In the role_to_pod method, attach the user-specified tolerations to the generated pod spec.
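To make the proposal concrete, here is a minimal sketch of how user-specified tolerations might be validated and attached to a pod spec. It uses plain dicts rather than the Kubernetes client models, and the helper names `validate_toleration` and `apply_tolerations` are hypothetical; they are not part of TorchX. The field names (`key`, `operator`, `value`, `effect`) and allowed values follow the Kubernetes toleration schema.

```python
from typing import Any, Dict, List, Optional

# Allowed values per the Kubernetes toleration schema.
VALID_OPERATORS = {"Equal", "Exists"}
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


def validate_toleration(tol: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: check one toleration dict and return a copy."""
    operator = tol.get("operator", "Equal")  # Kubernetes defaults to Equal
    if operator not in VALID_OPERATORS:
        raise ValueError(f"invalid operator: {operator!r}")
    effect = tol.get("effect")  # an omitted effect matches all effects
    if effect is not None and effect not in VALID_EFFECTS:
        raise ValueError(f"invalid effect: {effect!r}")
    if operator == "Exists" and tol.get("value"):
        raise ValueError("value must be empty when operator is 'Exists'")
    return dict(tol)


def apply_tolerations(
    pod_spec: Dict[str, Any],
    tolerations: Optional[List[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Hypothetical helper: return a pod spec with the tolerations appended.

    The spec is returned unchanged when no tolerations are requested, so
    scheduling behavior only changes when the user opts in.
    """
    if not tolerations:
        return pod_spec
    merged = list(pod_spec.get("tolerations", []))
    merged.extend(validate_toleration(t) for t in tolerations)
    # Build a new dict so the caller's spec is not mutated.
    return {**pod_spec, "tolerations": merged}
```

In an actual implementation the equivalent step would presumably happen where role_to_pod builds the pod spec, with the tolerations read from the new run-opt; how that run-opt is encoded (for example as a list of dicts or a string) is left open here.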

Alternatives

I can't think of any alternatives. As far as I know, there is no built-in support for custom tolerations currently. Tolerations significantly change the pod scheduling behavior, so they should only be added when the user requests them.

Additional context/links

Relevant code linked above.
Documentation: https://docs.pytorch.org/torchx/main/schedulers/kubernetes.html
