@dhague
Contributor

@dhague dhague commented May 14, 2025

No description provided.

@vlerenc

vlerenc commented May 14, 2025

Thank you @dhague (cc @afritzler). I will read it asap.

Comment on lines +102 to +105
Because the NVIDIA driver installer image is specific to each Garden
Linux version, each GPU node requires a label identifying this version,
for example **os-version: 1592.4.0**. Gardener does not take care of
adding such labels, so this becomes a chore for the operations team.

OK, this sounds like something rather trivial to add quickly (modulo cases of conflict).
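
As a rough sketch of what such automatic labelling could involve, the Go snippet below sets the **os-version** label from the excerpt above on a node's label map and refuses to overwrite a different existing value, which is one way to handle the conflict case. The helper name `applyOSVersionLabel`, the `example/pool` label, and the conflict policy are assumptions for illustration only, not existing Gardener behaviour.

```go
package main

import "fmt"

// osVersionLabel is the label key used in the example from the roadmap text.
const osVersionLabel = "os-version"

// applyOSVersionLabel is a hypothetical helper. It returns a copy of labels
// with os-version set to version. If the label already exists with a
// different value, it reports a conflict instead of overwriting it.
func applyOSVersionLabel(labels map[string]string, version string) (map[string]string, error) {
	if existing, ok := labels[osVersionLabel]; ok && existing != version {
		return nil, fmt.Errorf("label %s already set to %q, refusing to overwrite with %q",
			osVersionLabel, existing, version)
	}
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out[osVersionLabel] = version
	return out, nil
}

func main() {
	// "example/pool" is a made-up label standing in for whatever the node already carries.
	labels, err := applyOSVersionLabel(map[string]string{"example/pool": "gpu"}, "1592.4.0")
	if err != nil {
		fmt.Println("conflict:", err)
		return
	}
	fmt.Println(labels[osVersionLabel])
}
```

Returning an error on conflict, rather than silently overwriting, leaves the decision to the caller; a real implementation might prefer to overwrite or to log and skip.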


## Roadmap for the future

### Step 1 - Add Garden Linux support to the NVIDIA GPU Operator

How does that help after you said:

  • Either:
    • Fragile and slow if at runtime
      or:
    • License issue/violation if pre-built
  • Requires package manager/installation at runtime
  • As well as requiring an approval by its gatekeeper, NVIDIA itself

Comment on lines +227 to +230
This project supports having an S3 bucket, such that kernel modules are
still downloaded & compiled at runtime, but only once - the resulting
files are stored in the S3 bucket and the installer checks this bucket
for pre-built kernel modules. This has the advantages of the

Ah, so that's the way around the slow+fragile vs. license issue.

But what about the NVIDIA gatekeeper issue? The link above (Kinvolk/Flatcar Linux) only helps with the kernel modules, not the integration into the operator, does it?
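
For readers unfamiliar with the build-once approach quoted above, here is a minimal Go sketch of the check-bucket-then-build flow. Everything named here is hypothetical: the `ModuleStore` interface, the in-memory `memStore` standing in for the S3 bucket, the `cacheKey` naming scheme, and the placeholder version strings are assumptions, not the actual installer's code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound signals that no pre-built modules exist for a key.
var ErrNotFound = errors.New("not found")

// ModuleStore abstracts the bucket; a real implementation would talk to S3.
type ModuleStore interface {
	Get(key string) ([]byte, error)
	Put(key string, data []byte) error
}

// memStore is an in-memory stand-in for the bucket, used only in this sketch.
type memStore map[string][]byte

func (m memStore) Get(key string) ([]byte, error) {
	data, ok := m[key]
	if !ok {
		return nil, ErrNotFound
	}
	return data, nil
}

func (m memStore) Put(key string, data []byte) error {
	m[key] = data
	return nil
}

// cacheKey is a made-up naming scheme: one object per OS, kernel and driver version.
func cacheKey(osVersion, kernel, driver string) string {
	return fmt.Sprintf("gardenlinux-%s/%s/nvidia-%s.tar.gz", osVersion, kernel, driver)
}

// fetchOrBuild returns cached modules when present; otherwise it runs build
// once and stores the result so later nodes can skip compilation.
func fetchOrBuild(store ModuleStore, key string, build func() ([]byte, error)) ([]byte, error) {
	data, err := store.Get(key)
	if err == nil {
		return data, nil
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, err
	}
	if data, err = build(); err != nil {
		return nil, err
	}
	if err := store.Put(key, data); err != nil {
		return nil, err
	}
	return data, nil
}

func main() {
	store := memStore{}
	builds := 0
	build := func() ([]byte, error) { builds++; return []byte("modules"), nil }
	// The kernel and driver versions are placeholders, not real releases.
	key := cacheKey("1592.4.0", "kernel-x.y.z", "driver-a.b.c")
	for node := 0; node < 3; node++ {
		if _, err := fetchOrBuild(store, key, build); err != nil {
			panic(err)
		}
	}
	fmt.Println("builds:", builds) // prints "builds: 1"
}
```

Only the first node pays the compile cost; the sketch ignores concurrent first builds, which a real installer would need to handle.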

Contributor Author

As mentioned earlier, we might need to use and maintain forks of the NVIDIA GPU Operator repo and the NVIDIA GPU Driver Container repo. Given the structure of these repos, the forks would not be too difficult to keep in sync; that said, we are reasonably hopeful that NVIDIA will accept our PRs, as doing so will help them to sell GPUs to cloud providers using Gardener for their Kubernetes offering.

docs/roadmap.md Outdated
Garden Linux version) and would enable the NVIDIA Container runtime as an option
for worker pools.

### Step 5 - Consider extending the NVIDIA GPU Operator to support AMD & Intel GPUs

Huh? Wouldn't the answer be a resounding: never? Are we speaking about a fork (also above) or how do you see this ever happening?

Contributor Author

NVIDIA, Intel and AMD are all currently maintaining operators for supporting their GPUs on Kubernetes. This is not a competitive differentiator for any of them. Bringing them together would reduce the overhead for all, and improve the user experience.
Something similar is already happening with Project HAMi, which supports the GPUs of multiple vendors.
With that said, I agree that such a unification is somewhat unlikely.

An alternative "Step 5" might be that the Gardener GPU extension supports operators from multiple vendors, and the extension providerConfig could include CRs of type nvidia.com/v1alpha1/NVIDIADriver, amd.com/v1alpha1/DeviceConfig, deviceplugin.intel.com/v1/GpuDevicePlugin and others.
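
To make the alternative concrete, the following Go sketch shows how an extension might inspect a providerConfig that embeds vendor CRs and work out which vendor operators are involved. The `providerConfig` shape with a `resources` list and the `vendorsFor` helper are assumptions for illustration; only the three apiVersion/kind pairs come from the comment above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// typeMeta holds the two identifying fields every Kubernetes resource carries.
type typeMeta struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
}

// vendorByType maps the resource types named in the comment to a vendor.
var vendorByType = map[string]string{
	"nvidia.com/v1alpha1/NVIDIADriver":          "nvidia",
	"amd.com/v1alpha1/DeviceConfig":             "amd",
	"deviceplugin.intel.com/v1/GpuDevicePlugin": "intel",
}

// providerConfig is a hypothetical shape: a list of embedded vendor CRs.
type providerConfig struct {
	Resources []json.RawMessage `json:"resources"`
}

// vendorsFor returns the vendor of each embedded CR, in order, and rejects
// resource types the (hypothetical) extension does not know about.
func vendorsFor(raw []byte) ([]string, error) {
	var cfg providerConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return nil, err
	}
	vendors := make([]string, 0, len(cfg.Resources))
	for _, r := range cfg.Resources {
		var tm typeMeta
		if err := json.Unmarshal(r, &tm); err != nil {
			return nil, err
		}
		vendor, ok := vendorByType[tm.APIVersion+"/"+tm.Kind]
		if !ok {
			return nil, fmt.Errorf("unsupported resource %s/%s", tm.APIVersion, tm.Kind)
		}
		vendors = append(vendors, vendor)
	}
	return vendors, nil
}

func main() {
	cfg := []byte(`{"resources":[
		{"apiVersion":"nvidia.com/v1alpha1","kind":"NVIDIADriver"},
		{"apiVersion":"amd.com/v1alpha1","kind":"DeviceConfig"}
	]}`)
	vendors, err := vendorsFor(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(vendors) // prints "[nvidia amd]"
}
```

Keeping the CRs as raw JSON means the extension would not need to import each vendor's Go types just to route them.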

Comment on lines +211 to +212
worst case we would need to build a specific Garden Linux image to
support NVIDIA GPUs.

Ah, so pre-installed images then. Is that driver-version-independent? What kind of compatibility matrix/issues are to be expected?

Contributor Author

Regarding your first comment:
The configuration aspect is addressed in another comment above.

Regarding your second comment:
The specific Garden Linux images we are talking about here contain only the NVIDIA Container Runtime files (not the kernel modules for the driver) - these are Go binaries stored in /usr/bin and /usr/lib and are independent of the kernel and driver versions, so there is no compatibility matrix to consider (other than keeping the binaries more-or-less up to date).

3. Add support for Garden Linux in the [NVIDIA GPU
Operator](https://github.com/NVIDIA/gpu-operator)

Not a great deal needs to be done here - mostly adding a few lines of

But isn't NVIDIA the gatekeeper, and probably blocking that? I haven't seen support for anything but Ubuntu and Red Hat until now. Can you please share a detailed link where more operating systems are supported?

Member

Hey @vlerenc, there has been a shift of direction from NVIDIA. We will meet up with @dhague & @gehoern to align on that, likely today. Will keep you posted.


I am somewhat sceptical, but it would be good to see that, @pnpavlov.

Contributor Author

> Can you please share a detailed link where more operating systems are supported?

In terms of the container images for installing the driver, the NVIDIA repo has top-level folders for Azure Linux, Photon Linux and SLES15 in addition to Ubuntu and the various Red Hat OSes.

For the GPU operator the only place I could find OS-specific code (outside of test data & test code) is the following few lines from driver_volumes.go:

// RepoConfigPathMap indicates standard OS specific paths for repository configuration files
var RepoConfigPathMap = map[string]string{
	"centos": "/etc/yum.repos.d",
	"ubuntu": "/etc/apt/sources.list.d",
	"rhcos":  "/etc/yum.repos.d",
	"rhel":   "/etc/yum.repos.d",
}

// CertConfigPathMap indicates standard OS specific paths for ssl keys/certificates.
// Where Go looks for certs: https://golang.org/src/crypto/x509/root_linux.go
// Where OCP mounts proxy certs on RHCOS nodes:
// https://access.redhat.com/documentation/en-us/openshift_container_platform/4.3/html/authentication/ocp-certificates#proxy-certificates_ocp-certificates
var CertConfigPathMap = map[string]string{
	"centos": "/etc/pki/ca-trust/extracted/pem",
	"ubuntu": "/usr/local/share/ca-certificates",
	"rhcos":  "/etc/pki/ca-trust/extracted/pem",
	"rhel":   "/etc/pki/ca-trust/extracted/pem",
}
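
To illustrate how small such a change could be, here is a self-contained sketch that mirrors the two maps and adds one Garden Linux entry to each. The key `gardenlinux` and the Debian-style paths are assumptions (chosen because Garden Linux is Debian-based), not confirmed values from the operator, and `pathsFor` is a hypothetical helper.

```go
package main

import "fmt"

// Copies of the two maps quoted above, each with one hypothetical extra entry.
var repoConfigPathMap = map[string]string{
	"centos":      "/etc/yum.repos.d",
	"ubuntu":      "/etc/apt/sources.list.d",
	"rhcos":       "/etc/yum.repos.d",
	"rhel":        "/etc/yum.repos.d",
	"gardenlinux": "/etc/apt/sources.list.d", // assumption: same as other apt-based systems
}

var certConfigPathMap = map[string]string{
	"centos":      "/etc/pki/ca-trust/extracted/pem",
	"ubuntu":      "/usr/local/share/ca-certificates",
	"rhcos":       "/etc/pki/ca-trust/extracted/pem",
	"rhel":        "/etc/pki/ca-trust/extracted/pem",
	"gardenlinux": "/usr/local/share/ca-certificates", // assumption: Debian convention
}

// pathsFor looks up both paths for an OS ID and reports whether the OS is known.
func pathsFor(osID string) (string, string, bool) {
	repoPath, okRepo := repoConfigPathMap[osID]
	certPath, okCert := certConfigPathMap[osID]
	return repoPath, certPath, okRepo && okCert
}

func main() {
	repoPath, certPath, ok := pathsFor("gardenlinux")
	fmt.Println(repoPath, certPath, ok)
}
```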

Contributor Author

@dhague dhague May 15, 2025

Update: there is also some OS-specific code further down in the same file which includes some SLES-specific code.

@vlerenc

vlerenc commented May 15, 2025

Thank you @dhague. Realistically, do you believe you are talking here about an NVIDIA GPU operator fork (as NVIDIA is the gatekeeper and hasn't shown interest in supporting other operating systems in the past)? How much effort do you believe it is to maintain such a fork permanently (in addition to the other work)?

@dhague
Contributor Author

dhague commented May 15, 2025

> How much effort do you believe it is to maintain such a fork permanently (in addition to the other work)?

I don't think it would be too bad - I don't see many areas where there is a potential for merge conflicts, so we'd just need to do a merge every few months.

The biggest effort would be making sure that we include support for future versions of Garden Linux as they get released, but we have to do that anyway with our current approach and also if we have PRs accepted into the NVIDIA operator.

@JathavanSriramNVIDIA

There are some discussions between SAP and NVIDIA on this topic. Please reach out internally to Dominic Kistner, who I am working with on this.


6 participants