Enhancement of nvidia-device-plugin-daemonset in gpu-cluster.md #138
@JoeyC-Dev: Thanks for your contribution! The author(s) have been notified to review your proposed change.
Learn Build status updates of commit 07ab281: ✅ Validation status: passed
For more details, please refer to the build report. For any questions, please:
Can you review the proposed changes? IMPORTANT: When the changes are ready for publication, adding a #label:"aq-pr-triaged"
This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.
@schaffererin This PR hasn't had any updates for a while. If it's ready for review, could you sign off? Or should it be closed? #label:"aq-pr-triaged","aq-followed-up"
This PR is critical for users who aren't Kubernetes experts, because they won't know why their driver cannot be installed. @MicrosoftDocs/public-repo-pr-review-team
I sent an email to the content owner today.
Can we have another person review this PR if the designated reviewer is not available? @MicrosoftDocs/public-repo-pr-review-team
I sent a Teams message to the content owner today.
@@ -138,7 +138,7 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
 kind: DaemonSet
 metadata:
   name: nvidia-device-plugin-daemonset
-  namespace: kube-system
+  namespace: gpu-operator
Is this change necessary at the same time? Just wary of this tripping people up but I think it should be ok...
In conclusion: yes.
First, the original tutorial already asks the user to create the `gpu-operator` namespace. I can tell where `gpu-operator` comes from: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
Although k8s-device-plugin and gpu-operator are different things, I still think the plugin should have its own namespace. For example, Istio has `aks-istio-system` and app routing has `app-routing-system`. So why not give NVIDIA/k8s-device-plugin a separate one?
The second and most important point: when I install a third-party plugin in `kube-system`, as far as I remember, additional environment variables are force-injected into its Pods, which is unnecessary. For example:
I am not satisfied with this.
Since I realized that Pods in `kube-system` are treated differently from others, I have stopped suggesting installing Pods in `kube-system` when it is not necessary. So I did not take that approach when editing the document.
Makes sense
@davefellows Would you add a comment to indicate what action to take on this PR? Adding a
Proposed change: Improve the NVIDIA device plugin daemonset for better deployment.
Supporting points:
- `tolerations` for the relevant `taint`.
- `nodeSelector`.
- `v0.17.0`. Update the image version accordingly: https://github.com/NVIDIA/k8s-device-plugin/releases
- The tutorial already creates the `gpu-operator` namespace, but the daemonset makes no actual use of it. Since this is a manual set-up, I kept the daemonset in the `gpu-operator` namespace and changed the YAML accordingly. See azure-aks-docs/articles/aks/gpu-cluster.md, lines 128 to 132 in 130a484.
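For readers following the thread, the points above can be sketched as the manifest below. This is a minimal illustration, not the exact YAML from the PR: the toleration (`sku=gpu:NoSchedule`) and the node label (`kubernetes.azure.com/accelerator: nvidia`) are assumptions about how the GPU node pool is tainted and labeled, and the container settings follow the general shape of the upstream NVIDIA static manifest. Adjust them to your cluster.

```yaml
# Sketch only: assumes the gpu-operator namespace already exists.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator          # moved out of kube-system, as proposed
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Assumption: GPU nodes carry this label; schedule the plugin only there.
      nodeSelector:
        kubernetes.azure.com/accelerator: nvidia
      # Assumption: the GPU node pool was created with the taint sku=gpu:NoSchedule.
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: nvidia-device-plugin-ctr
        image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

If the toleration does not match the taint actually set on the node pool, or the label is absent from the GPU nodes, the daemonset will schedule zero Pods, so both values are worth checking against the output of `kubectl describe node` first.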
I have checked that the proposed YAML works: it deploys the daemonset on the NVIDIA GPU nodes.


Successful deployment: