conf: update for gpu in job template (#301)

myungjin · web-flow · commit 56ccf644c219 · 2023-01-04T13:35:02.000-08:00
Update the job template to prefer nodes that have a GPU. If no GPU node
is available, other nodes are still considered for scheduling.
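
The "prefer, but do not require" behavior described above can be illustrated with a small model. This is only a sketch of the idea, not the Kubernetes scheduler's implementation: the helper names `matches_gt` and `pick_node` are hypothetical, and the real scheduler combines this preference with many other scoring factors. It assumes the `nvidia.com/gpu.count` label is set by GPU Feature Discovery, as in the docs change in this commit.

```python
from typing import Dict, Optional

# Label key used by the affinity rule in the job template.
GPU_COUNT_LABEL = "nvidia.com/gpu.count"


def matches_gt(labels: Dict[str, str], key: str, threshold: int) -> bool:
    """Model of the `Gt` operator: the label must exist and parse as an
    integer strictly greater than the threshold."""
    raw = labels.get(key)
    if raw is None:
        return False
    try:
        return int(raw) > threshold
    except ValueError:
        return False


def pick_node(nodes: Dict[str, Dict[str, str]], weight: int = 1) -> Optional[str]:
    """Hypothetical helper: score every node and return the best one.

    A matching node earns `weight` points and a non-matching node earns 0,
    but no node is ever filtered out, which is what makes the rule a
    preference rather than a requirement.
    """
    if not nodes:
        return None
    scores = {
        name: (weight if matches_gt(labels, GPU_COUNT_LABEL, 0) else 0)
        for name, labels in nodes.items()
    }
    # On a tie, max() returns the first node in insertion order.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    cluster = {
        "cpu-node": {},
        "gpu-node": {GPU_COUNT_LABEL: "1"},
    }
    print(pick_node(cluster))               # the GPU node is preferred
    print(pick_node({"cpu-node": {}}))      # falls back to the CPU node
```

Because the rule uses `preferredDuringSchedulingIgnoredDuringExecution` with `weight: 1` (the lowest value in the allowed 1 to 100 range), the GPU preference is a weak hint, and other scheduling considerations can still outweigh it.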
diff --git a/docs/03-b-amzn2-gpu.md b/docs/03-b-amzn2-gpu.md
@@ -66,7 +66,7 @@ curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stabl
 sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
 
 # install helm
-HELM_VERSION=v3.10.2-linux-amd64
+HELM_VERSION=v3.10.2
 curl -LO https://get.helm.sh/helm-$HELM_VERSION-linux-amd64.tar.gz
 tar -zxvf helm-$HELM_VERSION-linux-amd64.tar.gz
 sudo mv linux-amd64/helm /usr/local/bin/helm
@@ -125,7 +125,52 @@ An output should look similar to:
 }
 ```
 
-### Step 3: Configuring addons
+### Step 3: Install NVIDIA's GPU feature discovery resources
+See the [gpu-feature-discovery repository](https://github.com/NVIDIA/gpu-feature-discovery) for more details.
+
+Deploy Node Feature Discovery (NFD) as a DaemonSet.
+```bash
+kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/nfd.yaml
+```
+
+Deploy NVIDIA GPU Feature Discovery (GFD) as a DaemonSet.
+```bash
+kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/gpu-feature-discovery-daemonset.yaml
+```
+
+```bash
+kubectl get nodes -o yaml
+```
+The above command prints each node's definition. Once GFD is running, the node labels should include `nvidia.com/` entries similar to the following:
+```console
+apiVersion: v1
+items:
+- apiVersion: v1
+  kind: Node
+  metadata:
+    ...
+    labels:
+      ...
+      nvidia.com/cuda.driver.major: "470"
+      nvidia.com/cuda.driver.minor: "57"
+      nvidia.com/cuda.driver.rev: "02"
+      nvidia.com/cuda.runtime.major: "11"
+      nvidia.com/cuda.runtime.minor: "4"
+      nvidia.com/gfd.timestamp: "1672792567"
+      nvidia.com/gpu.compute.major: "3"
+      nvidia.com/gpu.compute.minor: "7"
+      nvidia.com/gpu.count: "1"
+      nvidia.com/gpu.family: kepler
+      nvidia.com/gpu.machine: HVM-domU
+      nvidia.com/gpu.memory: "11441"
+      nvidia.com/gpu.product: Tesla-K80
+      nvidia.com/gpu.replicas: "1"
+      nvidia.com/mig.capable: "false"
+      ...
+...
+```
+
+### Step 4: Configuring addons
 Next, `ingress` and `ingress-dns` addons need to be installed with the following command:
 ```bash
 sudo minikube addons enable ingress
diff --git a/fiab/helm-chart/control/job/job-agent.yaml.mustache b/fiab/helm-chart/control/job/job-agent.yaml.mustache
@@ -51,4 +51,15 @@ spec:
             - name: AWS_SECRET_ACCESS_KEY
               value: {{ .Values.secretAccessKey }}
       restartPolicy: Never
+
+      affinity:
+        nodeAffinity:
+          preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 1
+            preference:
+              matchExpressions:
+              - key: "nvidia.com/gpu.count"
+                operator: Gt
+                values:
+                - "0"
 <%={{ }}=%>
diff --git a/fiab/helm-chart/deployer/job/job-agent.yaml.mustache b/fiab/helm-chart/deployer/job/job-agent.yaml.mustache
@@ -51,4 +51,15 @@ spec:
             - name: AWS_SECRET_ACCESS_KEY
               value: {{ .Values.secretAccessKey }}
       restartPolicy: Never
+
+      affinity:
+        nodeAffinity:
+          preferredDuringSchedulingIgnoredDuringExecution:
+          - weight: 1
+            preference:
+              matchExpressions:
+              - key: "nvidia.com/gpu.count"
+                operator: Gt
+                values:
+                - "0"
 <%={{ }}=%>