Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement of nvidia-device-plugin-daemonset in gpu-cluster.md #138

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

JoeyC-Dev
Copy link
Contributor

@JoeyC-Dev JoeyC-Dev commented Jan 14, 2025

Proposed change: Improve the Nvidia device plugin deamonset for better deployment.

Supporting point:

  1. Currently, if the user is using Spot node, the current version of deamonset will FAIL to deploy due to missing tolerations to relative taint.
  2. Also, the current daemonset will also try to deploy the Pod to AMD GPU / non-GPU node due to missing nodeSelector.
  3. As of now, the latest version of plugin is v0.17.0. Update image version relatively: https://github.com/NVIDIA/k8s-device-plugin/releases
  4. The use of label/taint is following the proposed/supported list:
  5. Weirdly, currently in the document is asking for creating namespace gpu-operator, but there is no actual use in the daemonset. Consider this is a manual set-up, I kept it as in namespace gpu-operator and change the yaml relatively.
    1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.
    ```bash
    kubectl create namespace gpu-operator
    ```

Having checked, the propsed yaml is working, and it will deploy the ds on Nvidia GPU node.
image
image

Successful deployment:

kubectl logs nvidia-device-plugin-daemonset-7fzlq -n gpu-operator
I0114 09:03:01.133253       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
        d475b2cf
        commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
 >
I0114 09:03:01.133332       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0114 09:03:01.133367       1 main.go:245] Starting OS watcher.
I0114 09:03:01.133653       1 main.go:260] Starting Plugins.
I0114 09:03:01.133684       1 main.go:317] Loading configuration.
I0114 09:03:01.134351       1 main.go:342] Updating config with default resource matching patterns.
I0114 09:03:01.134496       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0114 09:03:01.134503       1 main.go:356] Retrieving plugins.
I0114 09:03:01.162986       1 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I0114 09:03:01.164044       1 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0114 09:03:01.166773       1 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Copy link
Contributor

@JoeyC-Dev : Thanks for your contribution! The author(s) have been notified to review your proposed change.

Copy link

Learn Build status updates of commit 07ab281:

✅ Validation status: passed

File Status Preview URL Details
articles/aks/gpu-cluster.md ✅Succeeded

For more details, please refer to the build report.

For any questions, please:

@ttorble
Copy link
Contributor

ttorble commented Jan 14, 2025

@schaffererin

Can you review the proposed changes?

IMPORTANT: When the changes are ready for publication, adding a #sign-off comment is the best way to signal that the PR is ready for the review team to merge.

#label:"aq-pr-triaged"
@MicrosoftDocs/public-repo-pr-review-team

Copy link

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.
Get Help
Docs Support Teams Channel
Resolve Merge Conflict

@v-dirichards
Copy link
Contributor

@schaffererin This PR hasn’t had any updates for a while. If it's ready for review, could you sign off? Or should it be closed?

#label:"aq-pr-triaged","aq-followed-up"

@github-actions github-actions bot removed the inactive label Jan 29, 2025
Copy link

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.
Get Help
Docs Support Teams Channel
Resolve Merge Conflict

@JoeyC-Dev
Copy link
Contributor Author

This PR is critical for users who aren't expert on Kubernetes as they don't know why their driver cannot be installed.
Can anyone kindly approve this PR?

@MicrosoftDocs/public-repo-pr-review-team

@github-actions github-actions bot removed the inactive label Feb 12, 2025
@Court72
Copy link
Contributor

Court72 commented Feb 17, 2025

I sent an email to the content owner today.

Copy link

github-actions bot commented Mar 3, 2025

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.
Get Help
Docs Support Teams Channel
Resolve Merge Conflict

@JoeyC-Dev
Copy link
Contributor Author

Can we have another person reviewing this PR? If the designated is not free for reviewing the PR.

@MicrosoftDocs/public-repo-pr-review-team

@v-dirichards
Copy link
Contributor

I sent a Teams message to the content owner today.

@@ -138,7 +138,7 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
namespace: kube-system
namespace: gpu-operator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change necessary at the same time? Just wary of this tripping people up but I think it should be ok...

Copy link
Contributor Author

@JoeyC-Dev JoeyC-Dev Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In conlcusion: yes.
First thing first, in the original tutorial, it is already asking user to create gpu-operator namespace.
I can tell what that gpu-operator coming from, it is from: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
image

Though k8s-plugin and gpu-operator are different things, I am still thinking it is necessary to let it having its own namespace. For example: istio has own aks-istio-system, approuting has app-routing-system. So why not give the separated one for NVIDIA/k8s-device-plugin?

The second and most important part is: when I install 3rd party plugin in kube-system: so far I remember, it will force inject additional env variables (which is unnecessary). For example:
image

I don't feel satisfied for this.

From the time I start realizing Pods in kube-system are different than others, I stop suggesting installing Pods in kube-system, when it is totally not necessary. So I did not choose this way to edit the document.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@v-dirichards
Copy link
Contributor

@davefellows Would you add a comment to indicate what action to take on this PR? Adding a #sign-off comment is the best way to signal that the PR is ready for the review team to merge. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants