I have been trying for a few days now and I can't get k3s to use the GPU. I am basically just following the documentation of the NVIDIA gpu-operator:

1. I changed these two env vars in `nvidia-gpu-operator.yaml` to point to the bundled containerd:

   ```yaml
   toolkit:
     env:
     - name: CONTAINERD_CONFIG
       value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
     - name: CONTAINERD_SOCKET
       value: /run/k3s/containerd/containerd.sock
   ```

2. Install the gpu-operator:

   ```sh
   helm install --wait nvidia-gpu-operator \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     -f nvidia-gpu-operator.yaml
   ```

3. Wait for everything to finish installing (this is running on a fresh Ubuntu 20.04 cloud image).
4. Some things I checked:
5. Create a pod (deployment in this case for easier iterations):

   ```yaml
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: sample
     labels:
       app: sample
   spec:
     replicas: 1
     strategy:
       type: Recreate
     selector:
       matchLabels:
         app: sample
     template:
       metadata:
         labels:
           app: sample
       spec:
         containers:
         - name: sample
           image: nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04
           command: [nvidia-smi]
           resources:
             limits:
               nvidia.com/gpu: 1
   ```

6. No nvidia-smi and no nvidia-container-runtime. Now I get this error message:
This also happens when instead using ... So basically it's not using the nvidia runtime. I've been at this for quite a while now and I cannot figure out what I missed.
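For reference, a few checks that can help confirm whether the operator actually wired the nvidia runtime into a k3s node. This is only a sketch, assuming the default `gpu-operator` namespace and the k3s paths used above:

```sh
# All gpu-operator pods (toolkit, device-plugin, validators, ...) should be Running or Completed
kubectl get pods -n gpu-operator

# The gpu-operator normally registers a RuntimeClass (typically named "nvidia")
kubectl get runtimeclass

# The toolkit should have added an nvidia runtime entry to k3s's bundled containerd config
grep -A3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# The node should advertise the GPU as an allocatable resource
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```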
Wow and of course seconds after writing this, something I tested works:
works.
Now I just need to figure out why I have to specify this manually and why none of the examples in the NVIDIA docs include it.
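The exact snippet that made it work isn't shown above. On k3s, the piece that usually has to be added by hand is a `runtimeClassName` pointing at the RuntimeClass the gpu-operator creates (named `nvidia` by default), since k3s's bundled containerd does not use the nvidia runtime as its default. A minimal sketch of the pod template, assuming that is the change that was tested:

```yaml
    # in the Deployment's pod template (spec.template.spec) from step 5
    spec:
      runtimeClassName: nvidia   # assumption: the RuntimeClass created by the gpu-operator
      containers:
      - name: sample
        image: nvcr.io/nvidia/cuda:11.6.2-base-ubuntu20.04
        command: [nvidia-smi]
        resources:
          limits:
            nvidia.com/gpu: 1
```

If the per-pod field is undesirable, the toolkit also accepts a `CONTAINERD_SET_AS_DEFAULT` env var (alongside `CONTAINERD_CONFIG` and `CONTAINERD_SOCKET`) that makes the nvidia runtime containerd's default, so plain pod specs pick it up without a `runtimeClassName`.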