Roadmap for future Gardener GPU support #42
Thank you @dhague (cc @afritzler). I will read it asap.
| Because the NVIDIA driver installer image is specific to each Garden | ||
| Linux version, each GPU node requires a label identifying this version, | ||
| for example **os-version: 1592.4.0**. Gardener does not take care of | ||
| adding such labels, so this becomes a chore for the operations team. |
OK, this sounds like something rather trivial to add quickly (modulo cases of conflict).
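To illustrate how small such an addition could be, here is a minimal Go sketch that derives the `os-version` label value from a node's reported OS image string. The `osVersionLabel` helper and the `"Garden Linux 1592.4.0"` input format are assumptions made for this example. Only the label key and the example version come from the roadmap text.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionPattern matches a dotted numeric version such as "1592.4.0".
var versionPattern = regexp.MustCompile(`\d+(\.\d+)+`)

// osVersionLabel extracts a dotted version from an OS image string and
// returns it as the value for the "os-version" node label.
// The input format is an assumption made for this sketch.
func osVersionLabel(osImage string) (string, bool) {
	v := versionPattern.FindString(osImage)
	return v, v != ""
}

func main() {
	if v, ok := osVersionLabel("Garden Linux 1592.4.0"); ok {
		fmt.Printf("os-version: %s\n", v)
	}
}
```

Handling the conflict case mentioned above (a label that already exists with a different value) is deliberately left out of this sketch.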
| ## Roadmap for the future | ||
| ### Step 1 - Add Garden Linux support to the NVIDIA GPU Operator |
How does that help after you said:
- Either:
  - Fragile and slow if at runtime
  - or: License issue/violation if pre-built
- Requires package manager/installation at runtime
- As well as requiring an approval by its gatekeeper, NVIDIA itself
| This project supports having a S3 bucket, such that kernel modules are | ||
| still downloaded & compiled at runtime, but only once - the resulting | ||
| files are stored in the S3 bucket and the installer checks this bucket | ||
| for pre-built kernel modules. This has the advantages of the |
Ah, so that's the way around the slow+fragile vs. license issue.
But what about the NVIDIA gatekeeper issue? The link above (Kinvolk/Flatcar Linux) only helps with the kernel modules, not the integration into the operator, does it?
As mentioned earlier, we might need to use and maintain forks of the NVIDIA GPU Operator repo and the NVIDIA GPU Driver Container repo. Given the structures of these repos, it would not be too difficult to keep the forks in sync; that said, we are reasonably hopeful that NVIDIA will accept our PRs, as it will help them to sell GPUs to cloud providers using Gardener for their Kubernetes offering.
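The "compile once, then reuse from the bucket" behaviour described in the roadmap excerpt above can be sketched in Go as a get-or-build cache. This is not the installer's actual code: `ModuleStore`, `memStore`, `fetchOrBuild`, the key layout and the driver version `550.0.0` are all hypothetical. An in-memory map stands in for the S3 bucket so the example runs on its own.

```go
package main

import (
	"errors"
	"fmt"
)

// ModuleStore abstracts the bucket holding pre-built kernel modules.
// A real implementation would talk to S3; this interface is hypothetical.
type ModuleStore interface {
	Get(key string) ([]byte, error)
	Put(key string, data []byte) error
}

// errNotFound signals that no pre-built module exists for a key.
var errNotFound = errors.New("module not found")

// memStore is an in-memory stand-in for the bucket, for illustration only.
type memStore map[string][]byte

func (m memStore) Get(key string) ([]byte, error) {
	if data, ok := m[key]; ok {
		return data, nil
	}
	return nil, errNotFound
}

func (m memStore) Put(key string, data []byte) error {
	m[key] = data
	return nil
}

// fetchOrBuild returns cached modules when present; otherwise it runs build
// once and stores the result so later nodes can skip compilation.
func fetchOrBuild(s ModuleStore, osVersion, driverVersion string, build func() ([]byte, error)) ([]byte, error) {
	key := osVersion + "/" + driverVersion // assumed key layout
	data, err := s.Get(key)
	if err == nil {
		return data, nil
	}
	if !errors.Is(err, errNotFound) {
		return nil, err
	}
	if data, err = build(); err != nil {
		return nil, err
	}
	if err := s.Put(key, data); err != nil {
		return nil, err
	}
	return data, nil
}

func main() {
	store := memStore{}
	builds := 0
	build := func() ([]byte, error) {
		builds++
		return []byte("compiled modules"), nil
	}
	// Three nodes with the same OS and driver version trigger one build.
	for i := 0; i < 3; i++ {
		if _, err := fetchOrBuild(store, "1592.4.0", "550.0.0", build); err != nil {
			fmt.Println("error:", err)
			return
		}
	}
	fmt.Println("builds:", builds)
}
```

Keying the cache on both the OS version and the driver version reflects the point that the compiled modules are specific to each Garden Linux version.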
docs/roadmap.md
Outdated
| Garden Linux version) and would enable the NVIDIA Container runtime as an option | ||
| for worker pools. | ||
| ### Step 5 - Consider extending the NVIDIA GPU Operator to support AMD & Intel GPUs |
Huh? Wouldn't the answer be a resounding: never? Are we speaking about a fork (also above) or how do you see this ever happening?
NVIDIA, Intel and AMD are all currently maintaining operators for supporting their GPUs on Kubernetes. This is not a competitive differentiator for any of them. Bringing them together would reduce the overhead for all, and improve the user experience.
Something similar is already happening with Project HAMi, which supports the GPUs of multiple vendors.
With that said, I agree that such a unification is somewhat unlikely.
An alternative "Step 5" might be that the Gardener GPU extension supports operators from multiple vendors, and the extension providerConfig could include CRs of type nvidia.com/v1alpha1/NVIDIADriver, amd.com/v1alpha1/DeviceConfig, deviceplugin.intel.com/v1/GpuDevicePlugin and others.
| worst case we would need to build a specific Garden Linux image to | ||
| support NVIDIA GPUs. |
Ah, so pre-installed images then. Is that driver-version-independent? What kind of compatibility matrix/issues are to be expected?
Regarding your first comment:
The configuration aspect is addressed in another comment above.
Regarding your second comment:
The specific Garden Linux images we are talking about here contain only the NVIDIA Container Runtime files (not the kernel modules for the driver) - these are Go binaries stored in /usr/bin and /usr/lib and are independent of the kernel and driver versions, so there is no compatibility matrix to consider (other than keeping the binaries more-or-less up to date).
| 3. Add support for Garden Linux in the [NVIDIA GPU | ||
| Operator](https://github.com/NVIDIA/gpu-operator) | ||
| Not a great deal needs to be done here - mostly adding a few lines of |
But, isn't NVIDIA the gatekeeper and probably blocking that? I haven't seen support for anything but Ubuntu and Red Hat until now. Can you please share a detail link where more operating systems are supported?
I am somewhat sceptical, but it would be good to see that, @pnpavlov.
> Can you please share a detail link where more operating systems are supported?
In terms of the container images for installing the driver, the NVIDIA repo has top-level folders for Azure Linux, Photon Linux and SLES15 in addition to Ubuntu and the various Red Hat OSes.
For the GPU operator the only place I could find OS-specific code (outside of test data & test code) is the following few lines from driver_volumes.go:
// RepoConfigPathMap indicates standard OS specific paths for repository configuration files
var RepoConfigPathMap = map[string]string{
	"centos": "/etc/yum.repos.d",
	"ubuntu": "/etc/apt/sources.list.d",
	"rhcos":  "/etc/yum.repos.d",
	"rhel":   "/etc/yum.repos.d",
}

// CertConfigPathMap indicates standard OS specific paths for ssl keys/certificates.
// Where Go looks for certs: https://golang.org/src/crypto/x509/root_linux.go
// Where OCP mounts proxy certs on RHCOS nodes:
// https://access.redhat.com/documentation/en-us/openshift_container_platform/4.3/html/authentication/ocp-certificates#proxy-certificates_ocp-certificates
var CertConfigPathMap = map[string]string{
	"centos": "/etc/pki/ca-trust/extracted/pem",
	"ubuntu": "/usr/local/share/ca-certificates",
	"rhcos":  "/etc/pki/ca-trust/extracted/pem",
	"rhel":   "/etc/pki/ca-trust/extracted/pem",
}
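
To make the size of the change concrete, here is a hedged sketch of the two maps quoted above with one extra entry each. The `"gardenlinux"` key and both of its paths are assumptions, not code from the operator or from any proposed PR: Garden Linux is Debian-based, so the sketch simply reuses the Ubuntu values.

```go
package main

import "fmt"

// RepoConfigPathMap as quoted above, plus a hypothetical Garden Linux entry.
var RepoConfigPathMap = map[string]string{
	"centos":      "/etc/yum.repos.d",
	"ubuntu":      "/etc/apt/sources.list.d",
	"rhcos":       "/etc/yum.repos.d",
	"rhel":        "/etc/yum.repos.d",
	"gardenlinux": "/etc/apt/sources.list.d", // assumed: same as Ubuntu
}

// CertConfigPathMap as quoted above, plus a hypothetical Garden Linux entry.
var CertConfigPathMap = map[string]string{
	"centos":      "/etc/pki/ca-trust/extracted/pem",
	"ubuntu":      "/usr/local/share/ca-certificates",
	"rhcos":       "/etc/pki/ca-trust/extracted/pem",
	"rhel":        "/etc/pki/ca-trust/extracted/pem",
	"gardenlinux": "/usr/local/share/ca-certificates", // assumed: same as Ubuntu
}

func main() {
	// Look up the paths the operator would mount for a Garden Linux node.
	osID := "gardenlinux"
	fmt.Println("repo config path:", RepoConfigPathMap[osID])
	fmt.Println("cert config path:", CertConfigPathMap[osID])
}
```

Whether these paths are correct for Garden Linux would need to be verified against an actual image before proposing the change upstream.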
Update: there is also some OS-specific code further down in the same file which includes some SLES-specific code.
Thank you @dhague. Realistically, do you believe you are talking here about an NVIDIA GPU operator fork (as NVIDIA is the gatekeeper and hasn't shown interest in supporting other operating systems in the past)? How much effort do you believe it is to maintain such a fork permanently (in addition to the other work)?
Co-authored-by: Vedran Lerenc <[email protected]>
I don't think it would be too bad - I don't see many areas where there is a potential for merge conflicts, so we'd just need to do a merge every few months. The biggest effort would be making sure that we include support for future versions of Garden Linux as they get released, but we have to do that anyway with our current approach and also if we have PRs accepted into the NVIDIA operator.
There are some discussions between SAP and NVIDIA on this topic. Please reach out internally to Dominic Kistner, who I am working with on this.