Enhancement of nvidia-device-plugin-daemonset in gpu-cluster.md #138

JoeyC-Dev · 2025-01-14T09:22:46Z

Proposed change: Improve the NVIDIA device plugin daemonset for better deployment.

Supporting point:

Currently, if the user is using a Spot node pool, the current daemonset will FAIL to deploy because it is missing a toleration for the corresponding taint.
Also, the current daemonset will try to deploy Pods to AMD GPU and non-GPU nodes because it is missing a nodeSelector.
As of now, the latest version of the plugin is v0.17.0. Update the image version accordingly: https://github.com/NVIDIA/k8s-device-plugin/releases
The labels and taints used follow the documented/supported list:
- https://learn.microsoft.com/en-us/azure/aks/use-labels#reserved-system-labels
- https://learn.microsoft.com/en-us/azure/aks/spot-node-pool#:~:text=kubernetes.azure.com/scalesetpriority=spot:NoSchedule
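
As a rough sketch of the scheduling fields described above (not the exact diff from this PR): the toleration matches the Spot taint quoted in the second link, and the node selector assumes the `kubernetes.azure.com/accelerator: nvidia` label from the reserved system labels list. The rest of the DaemonSet is omitted.

```yaml
# Fragment of the DaemonSet pod template; a sketch, not the PR's exact YAML.
spec:
  template:
    spec:
      # Assumption: NVIDIA GPU node pools carry this AKS reserved label,
      # so the Pod is not scheduled onto AMD GPU or non-GPU nodes.
      nodeSelector:
        kubernetes.azure.com/accelerator: nvidia
      tolerations:
      # Allow scheduling onto Spot nodes, which AKS taints with
      # kubernetes.azure.com/scalesetpriority=spot:NoSchedule.
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
```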

Oddly, the document currently asks the user to create the namespace gpu-operator, but the daemonset never actually uses it. Considering this is a manual setup, I kept the gpu-operator namespace and changed the YAML accordingly.

azure-aks-docs/articles/aks/gpu-cluster.md

Lines 128 to 132 in 130a484

    
    1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.

       ```bash
       kubectl create namespace gpu-operator
       ```

I have verified that the proposed YAML works and deploys the daemonset on NVIDIA GPU nodes.

Successful deployment:

kubectl logs nvidia-device-plugin-daemonset-7fzlq -n gpu-operator
I0114 09:03:01.133253       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
        d475b2cf
        commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
 >
I0114 09:03:01.133332       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0114 09:03:01.133367       1 main.go:245] Starting OS watcher.
I0114 09:03:01.133653       1 main.go:260] Starting Plugins.
I0114 09:03:01.133684       1 main.go:317] Loading configuration.
I0114 09:03:01.134351       1 main.go:342] Updating config with default resource matching patterns.
I0114 09:03:01.134496       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0114 09:03:01.134503       1 main.go:356] Retrieving plugins.
I0114 09:03:01.162986       1 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I0114 09:03:01.164044       1 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0114 09:03:01.166773       1 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet

prmerger-automator · 2025-01-14T09:23:02Z

@JoeyC-Dev : Thanks for your contribution! The author(s) have been notified to review your proposed change.

learn-build-service-prod · 2025-01-14T09:24:05Z

Learn Build status updates of commit 07ab281:

✅ Validation status: passed

File	Status	Preview URL	Details
articles/aks/gpu-cluster.md	✅Succeeded

For more details, please refer to the build report.

For any questions, please:

Try searching the learn.microsoft.com contributor guides
Post your question in the Learn support channel

ttorble · 2025-01-14T16:06:48Z

@schaffererin

Can you review the proposed changes?

IMPORTANT: When the changes are ready for publication, adding a #sign-off comment is the best way to signal that the PR is ready for the review team to merge.

#label:"aq-pr-triaged"
@MicrosoftDocs/public-repo-pr-review-team

github-actions · 2025-01-28T18:03:46Z

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.
Get Help
Docs Support Teams Channel
Resolve Merge Conflict

v-dirichards · 2025-01-28T21:16:20Z

@schaffererin This PR hasn’t had any updates for a while. If it's ready for review, could you sign off? Or should it be closed?

#label:"aq-pr-triaged","aq-followed-up"

github-actions · 2025-02-12T06:04:05Z

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.

JoeyC-Dev · 2025-02-12T15:34:47Z

This PR is critical for users who aren't Kubernetes experts, as they won't know why their driver cannot be installed.
Can anyone kindly approve this PR?

@MicrosoftDocs/public-repo-pr-review-team

Court72 · 2025-02-17T15:19:53Z

I sent an email to the content owner today.

github-actions · 2025-03-03T18:03:34Z

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.

JoeyC-Dev · 2025-03-03T18:16:37Z

Can we have another person review this PR if the designated reviewer is not available?

@MicrosoftDocs/public-repo-pr-review-team

v-dirichards · 2025-03-03T21:16:18Z

I sent a Teams message to the content owner today.

davefellows · 2025-03-10T07:23:02Z

articles/aks/gpu-cluster.md

    metadata:
      name: nvidia-device-plugin-daemonset
-      namespace: kube-system
+      namespace: gpu-operator


Is this change necessary at the same time? Just wary of this tripping people up but I think it should be ok...

In conclusion: yes.
First things first: the original tutorial already asks the user to create the gpu-operator namespace.
I can tell where that gpu-operator comes from: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html

Though the k8s device plugin and gpu-operator are different things, I still think the plugin should have its own namespace. For example, Istio has its own aks-istio-system and app routing has app-routing-system. So why not give a separate one to NVIDIA/k8s-device-plugin?

The second and most important part: when I install a third-party plugin in kube-system, as far as I remember, additional (and unnecessary) env variables are force-injected. For example:

I am not satisfied with this.

Since I realized that Pods in kube-system are treated differently from others, I have stopped suggesting installing Pods in kube-system when it is not necessary. So I did not choose that approach when editing the document.

Makes sense


v-dirichards · 2025-03-11T21:50:27Z

@davefellows Would you add a comment to indicate what action to take on this PR? Adding a #sign-off comment is the best way to signal that the PR is ready for the review team to merge. Thanks!

sdesai345

Update the device plugin namespace to gpu-resources

sdesai345 · 2025-03-21T16:21:50Z

articles/aks/gpu-cluster.md

    metadata:
      name: nvidia-device-plugin-daemonset
-      namespace: kube-system
+      namespace: gpu-operator


Thanks for the feedback. To align with our Windows GPU guidance, the namespace for the k8s device plugin should be updated to gpu-resources.
@JoeyC-Dev can you please update line 131 to kubectl create namespace gpu-resources?

sdesai345 · 2025-03-21T16:22:09Z

articles/aks/gpu-cluster.md

    metadata:
      name: nvidia-device-plugin-daemonset
-      namespace: kube-system
+      namespace: gpu-operator


Suggested change

-      namespace: gpu-operator
+      namespace: gpu-resources

JoeyC-Dev

    kubectl create namespace gpu-resources

This will be reflected in the final change.

learn-build-service-prod · 2025-03-21T23:47:02Z

Learn Build status updates of commit 8047ccb:

✅ Validation status: passed

File	Status	Preview URL	Details
articles/aks/gpu-cluster.md	✅Succeeded


JoeyC-Dev · 2025-03-22T00:49:46Z

@sdesai345 I have applied the code suggestions and checked that the latest change works without error:

joey [ ~ ]$ kubectl create namespace gpu-resources
namespace/gpu-resources created

# Apply yaml from Azure portal

joey [ ~ ]$ kubectl logs nvidia-device-plugin-daemonset-zb7vm -n gpu-resources | tail -10
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0322 00:45:03.923006       1 main.go:356] Retrieving plugins.
I0322 00:45:03.966464       1 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I0322 00:45:03.967296       1 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0322 00:45:03.969487       1 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet

joey [ ~ ]$ kubectl describe node aks-gpupool-27756511-vmss000000 | grep -A 8 "Allocatable:"
Allocatable:
  cpu:                39500m
  ephemeral-storage:  301397976999
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             329209608Ki
  nvidia.com/gpu:     1
  pods:               30
System Info:

v-dirichards · 2025-04-01T21:41:13Z

@sdesai345 Could you review this proposed update to your article and enter #sign-off in a comment if it's ready to merge?

v-ccolin · 2025-04-07T10:23:12Z

I sent an email to the content owner today.

@MicrosoftDocs/public-repo-pr-review-team

v-ccolin · 2025-04-23T10:01:59Z

I sent an email to the content owner today.

@MicrosoftDocs/public-repo-pr-review-team

github-actions · 2025-05-07T12:05:23Z

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.

learn-build-service-prod · 2025-05-07T15:41:28Z

Learn Build status updates of commit 61a5e57:

✅ Validation status: passed

File	Status	Preview URL	Details
articles/aks/gpu-cluster.md	✅Succeeded


v-dirichards · 2025-06-05T21:21:05Z

I sent a Teams message to the content owner today.

@MicrosoftDocs/public-repo-pr-review-team

v-dirichards · 2025-06-11T20:49:14Z

The article has been deleted, and this change no longer applies. If you'd still like to update the article, please open a new PR. Or if you believe this PR was closed by mistake, email the PublicPRs alias.

#please-close
@MicrosoftDocs/public-repo-pr-review-team

JoeyC-Dev · 2025-06-11T21:25:10Z

Looks like this page has been renamed to: https://learn.microsoft.com/en-us/azure/aks/use-nvidia-gpu

But well, never mind. I have had enough of the contribution process. Let's give up, shall we?

Half a year, and the PR is closed because the page was moved.
What a ridiculous end.

Enhancement of nvidia-device-plugin-daemonset

07ab281

prmerger-automator bot added the do-not-merge label Jan 14, 2025

prmerger-automator bot requested a review from schaffererin January 14, 2025 09:23

prmerger-automator bot assigned schaffererin Jan 14, 2025

prmerger-automator bot added aks-developer/subsvc azure-kubernetes-service/svc Change sent to author labels Jan 14, 2025

prmerger-automator bot added the qualifies-for-auto-merge label Jan 14, 2025

prmerger-automator bot added the aq-pr-triaged label Jan 14, 2025

github-actions bot added the inactive label Jan 28, 2025

github-actions bot removed the inactive label Jan 29, 2025

github-actions bot added the inactive label Feb 12, 2025

github-actions bot removed the inactive label Feb 12, 2025

github-actions bot added the inactive label Mar 3, 2025

github-actions bot removed the inactive label Mar 4, 2025

JoeyC-Dev mentioned this pull request Mar 10, 2025

gpu-cluster: Remove step to create gpu-operator ns #173

Closed

davefellows approved these changes Mar 10, 2025

davefellows reviewed Mar 10, 2025

JoeyC-Dev mentioned this pull request Mar 21, 2025

[BUG] Outdated info about kubenet in document Azure/AKS#4883

Closed

sdesai345 reviewed Mar 21, 2025

JoeyC-Dev commented Mar 21, 2025

Apply suggestions from code review

8047ccb

prmerger-automator bot removed the qualifies-for-auto-merge label Mar 21, 2025

JoeyC-Dev requested a review from sdesai345 April 9, 2025 12:04

github-actions bot added the inactive label May 7, 2025

Merge branch 'main' into nvidia-plugin-ds

61a5e57

github-actions bot removed the inactive label Jun 6, 2025

prmerger-automator bot closed this Jun 11, 2025
