An AppWrapper generator for PyTorchJobs
This file documents the variables that may be set in a user's `settings.yaml` to customize the Jobs generated by the tool.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| jobName | string | must be provided by the user | Name of the Job. Will be the name of the AppWrapper and the PyTorchJob. |
| namespace | string | nil | Namespace in which to run the Job. If unspecified, the namespace will be inferred using normal Helm/Kubernetes mechanisms when the Job is submitted. |
| queueName | string | "default-queue" | Name of the local queue to which the Job will be submitted. |
| priority | string | "default-priority" | Type of priority for the Job (choose from: "default-priority", "low-priority", or "high-priority"). |
| customLabels | array | nil | Optional array of custom labels to add to all the resources created by the Job (the PyTorchJob, the PodGroup, and the AppWrapper). |
| containerImage | string | must be provided by the user | Image used for creating the Job's containers. It must contain every application the Job may need. |
| imagePullSecrets | array | nil | List of image pull secrets to be used for pulling the container images. |
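
As a minimal sketch of how the metadata keys above might appear in a user's `settings.yaml`, the fragment below sets the two required keys and spells out the documented defaults for the others. All names and the image reference are placeholders, not values taken from this project. `customLabels` and `imagePullSecrets` are left out because the shape of their array elements is not described here; see values.yaml for that.

```yaml
# Hypothetical settings.yaml fragment; every value is a placeholder.
jobName: my-training-job        # required: names both the AppWrapper and the PyTorchJob
namespace: my-namespace         # optional: inferred by Helm/Kubernetes when omitted
queueName: default-queue        # documented default, shown for clarity
priority: default-priority      # or low-priority / high-priority
containerImage: registry.example.com/my-team/my-pytorch-image:latest  # required: placeholder image
```
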

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| numPods | integer | 1 | Total number of pods (i.e. master + worker pods) to be created. |
| numCpusPerPod | integer or string | 1 | Number of CPUs for each pod. May be a positive integer or a ResourceQuantity (e.g. 500m). |
| numGpusPerPod | integer | 0 | Number of GPUs for each pod (requesting all GPUs on a node is currently recommended for distributed training). |
| totalMemoryPerPod | string | "1Gi" | Total memory for each pod, expressed as a ResourceQuantity (e.g. 1Gi, 200M). |
| limitCpusPerPod | integer or string | numCpusPerPod | Limit on the number of CPUs per pod for elastic jobs. May be a positive integer or a ResourceQuantity (e.g. 500m). |
| limitGpusPerPod | integer | numGpusPerPod | Limit on the number of GPUs per pod for elastic jobs. |
| limitMemoryPerPod | string | totalMemoryPerPod | Limit on the total memory per pod for elastic jobs, expressed as a ResourceQuantity (e.g. 1Gi, 200M). |
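
The resource keys above could be combined as in the following sketch. The numbers are illustrative only and assume nodes with 8 GPUs each; they are not recommendations from this document. With these values the Job would request 4 pods x 8 GPUs = 32 GPUs in total.

```yaml
# Hypothetical settings.yaml fragment; the sizes are illustrative assumptions.
numPods: 4                # 1 master + 3 workers
numCpusPerPod: 8          # a ResourceQuantity such as 500m is also accepted
numGpusPerPod: 8          # assumes 8-GPU nodes, i.e. all GPUs on a node
totalMemoryPerPod: 64Gi   # ResourceQuantity
# limitCpusPerPod, limitGpusPerPod and limitMemoryPerPod are omitted here,
# so they default to the three per-pod values above.
```
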

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| environmentVariables | array | nil | List of variables/values to be defined for all the ranks. Values can be literals or references to Kubernetes secrets or configmaps. See values.yaml for examples of the supported syntaxes. NOTE: The following standard PyTorch Distributed environment variables are set automatically and can be referenced in the commands without being set manually: WORLD_SIZE, RANK, MASTER_ADDR, MASTER_PORT. |
| sshGitCloneConfig | object | nil | Private GitHub clone support. See values.yaml for additional instructions. |
| setupCommands | array | no custom commands are executed | List of custom commands to be run at the beginning of the execution. Use setupCommands to clone code, download data, and change directories. |
| mainProgram | string | nil | Name of the PyTorch program to be executed by `torchrun`. Provide your program name here and NOT in "setupCommands", because this Helm template supplies the necessary "torchrun" arguments for parallel execution. WARNING: the program path is relative to the current directory set by change-of-directory commands in "setupCommands". If no value is provided, only setupCommands are executed and torchrun is not invoked. |
| volumes | array | no volumes are mounted | List of "(name, claimName, mountPath)" entries describing volumes, each backed by a persistentVolumeClaim, to be mounted. |
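
The workload keys above might be used together as sketched below. The repository URL, program name, claim name, and paths are hypothetical. The sketch assumes `setupCommands` is a list of shell command strings and that each `volumes` entry uses the three field names given in the table. `environmentVariables` and `sshGitCloneConfig` are omitted because their syntax is documented only in values.yaml.

```yaml
# Hypothetical settings.yaml fragment; URL, file names, and paths are placeholders.
setupCommands:
  - git clone https://github.com/example-org/example-repo.git
  - cd example-repo              # mainProgram is resolved relative to this directory
mainProgram: train.py            # launched via torchrun by the template
volumes:
  - name: training-data
    claimName: training-data-pvc   # an existing persistentVolumeClaim (assumed)
    mountPath: /data
```

Because `mainProgram` is resolved relative to the directory left by `setupCommands`, the `cd` command in this sketch determines where `train.py` is looked up.
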

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| roceGdrResName | string | nvidia.com/roce_gdr | RoCE GDR resource name (can vary by cluster configuration). |
| numRoceGdr | integer | 0 | Number of nvidia.com/roce_gdr resources (0 means disabled; >0 enables GDR over RoCE). Must be 0 unless numPods > 1. |
| topologyFileConfigMap | string | nil | Name of the configmap containing `/var/run/nvidia-topologyd/virtualTopology.xml` for the system, e.g. nvidia-topo-gdr. |
| ncclGdrEnvConfigMap | string | nil | Name of the configmap containing NCCL networking environment variables for the system, e.g. nccl-netwk-env-vars. |
| multiNicNetworkName | string | nil | Name of the multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-NIC network instance should be specified here instead of the TCP multi-NIC network instance. Existing instance names can be listed with `oc get multinicnetwork`. |
| disableSharedMemory | boolean | false | Controls whether a shared memory volume is added to the PyTorchJob. |
| mountNVMe | object | nil | Mount NVMe as a volume. The environment variable MOUNT_PATH_NVME provides the runtime mount path. |
| initContainers | array | nil | List of "(name, image, command[])" entries specifying init containers to be run before the main job. The 'command' field is a list of commands to run in the container; see the Kubernetes entry on initContainers for reference. |
| autopilotHealthChecks | array | no pre-flight checks are enabled | Autopilot health checks. List of labels enabling one or more system health pre-flight checks. |
| hostIgnoreList | array | nil | List of host names on which the Job must not be scheduled (to avoid faulty nodes). |
| schedulerName | string | nil | If non-nil, use the specified Kubernetes scheduler. Setting this to the default-scheduler may result in GPU fragmentation on the cluster. Setting this to any non-nil value should only be done when explicitly directed to do so by a cluster admin! |
| serviceAccountName | string | the default service account for the namespace will be used | Service account to be used for running the Job. |
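
A few of the advanced keys above are sketched below for a multi-pod Job with GDR over RoCE enabled. The configmap names reuse the examples from the table; the network name, host name, and init container are hypothetical. The sketch assumes `hostIgnoreList` is a list of host name strings and that each `initContainers` entry uses the three field names given in the table.

```yaml
# Hypothetical settings.yaml fragment; names marked "placeholder" are not from this project.
numPods: 2                                  # numRoceGdr must stay 0 unless numPods > 1
numRoceGdr: 2                               # >0 enables GDR over RoCE
roceGdrResName: nvidia.com/roce_gdr         # documented default; may vary by cluster
topologyFileConfigMap: nvidia-topo-gdr      # example name from the table
ncclGdrEnvConfigMap: nccl-netwk-env-vars    # example name from the table
multiNicNetworkName: my-roce-network        # placeholder; list real ones with `oc get multinicnetwork`
hostIgnoreList:
  - faulty-node-1.example.com               # placeholder host name
initContainers:
  - name: init-check                        # placeholder init container
    image: busybox
    command: ["sh", "-c", "echo preparing"]
# schedulerName is deliberately left unset; per the table, set it only when
# a cluster admin explicitly directs you to.
```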