@mgsharm mgsharm commented Nov 13, 2025

Description of changes:

This PR adds AMD GPU DKMS driver support for Bottlerocket kernel 6.12, enabling users to run newer AMD GPU hardware with the latest out-of-tree drivers. It also includes systemd services to ensure proper driver initialization before dependent services start.

Dependencies

⚠️ Blocked by: #325

This PR requires AMD GPU firmware support to be merged first. The firmware files are necessary for the AMD GPU drivers to initialize properly.

Changes

  1. Disable In-Tree AMD GPU Driver

Why: Out-of-tree DKMS drivers typically support newer GPU hardware faster than in-tree drivers and receive updates independently of kernel releases.

What changed:

  • Disabled CONFIG_DRM_AMDGPU in both x86_64 and aarch64 kernel configs
  • Updated kernel patch 1006 to explicitly select DRM helper modules (DRM_DISPLAY_HDCP_HELPER and DRM_DISPLAY_HDMI_HELPER) that were previously bundled with the in-tree driver but are required as standalone modules for DKMS builds
  • With the in-tree driver disabled, the DKMS driver is the only amdgpu driver present, which rules out conflicts over the hardware
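
As a rough way to check the config side of this change, the sketch below tests a kernel config file for the symbols named above. The function names and the idea of pointing it at a config file are assumptions for illustration and are not part of this PR; only the three CONFIG_ symbols come from the description.

```shell
#!/bin/sh
# Hypothetical helpers; neither function is part of the PR.

# kconfig_enabled FILE SYMBOL: succeeds if SYMBOL is set to y or m in FILE.
kconfig_enabled() {
    grep -Eq "^$2=(y|m)\$" "$1"
}

# check_amdgpu_config FILE: succeeds when the in-tree driver is off and both
# DRM helpers named in the PR are on. Prints one line per failed expectation.
check_amdgpu_config() {
    status=0
    if kconfig_enabled "$1" CONFIG_DRM_AMDGPU; then
        echo "CONFIG_DRM_AMDGPU is still enabled"
        status=1
    fi
    for sym in CONFIG_DRM_DISPLAY_HDCP_HELPER CONFIG_DRM_DISPLAY_HDMI_HELPER; do
        if ! kconfig_enabled "$1" "$sym"; then
            echo "$sym is not enabled"
            status=1
        fi
    done
    return "$status"
}
```

A disabled symbol normally appears as `# CONFIG_DRM_AMDGPU is not set`, so the check treats both that form and a missing line as "disabled".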

  2. Add AMD GPU DKMS Kernel Module

What: New kmod-6.12-amdgpu package that builds AMD GPU drivers from upstream DKMS source.

Key features:

  • Sources drivers from AMD's official DKMS repository (version 30.20)
  • Dynamic kernel version detection: the package adapts to kernel updates automatically, with no hardcoded kernel version dependencies
  • Full cross-compilation support integrated with Bottlerocket's build toolchain
  • GPG signature verification of upstream driver packages
  • Includes all required kernel modules: amdgpu, amdkcl, amdttm, amddrm_*, amd-sched, amdxcp
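
The dynamic kernel version detection can be pictured with the sketch below, which reads the release string that kbuild writes to `include/config/kernel.release` in a prepared source tree and derives a module directory from it. The function name and its arguments are hypothetical; the actual package does this with RPM macros, not with this helper.

```shell
#!/bin/sh
# Hypothetical helper: print the module directory for the kernel whose
# prepared source tree lives at KERNEL_SOURCES.
# Usage: kmod_dir_for KERNEL_SOURCES LIBDIR
kmod_dir_for() {
    release_file="$1/include/config/kernel.release"
    if [ ! -r "$release_file" ]; then
        echo "missing $release_file" >&2
        return 1
    fi
    # kernel.release holds a single line with the kernel release string.
    release="$(cat "$release_file")"
    printf '%s/modules/%s\n' "$2" "$release"
}
```

Because the release string is read from the tree at build time, a kernel update changes the resulting path without any edit to the package itself.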

Testing

Build Validation

  • ✅ Package compiles successfully in Bottlerocket's cross-compilation environment
  • ✅ Cross-compilation verified: x86_64 → aarch64 and aarch64 → x86_64 builds both successful
  • ✅ AMI creation completed and instance boots correctly

Hardware Validation

Tested on an AMD GPU instance with 8 GPUs (PCI device ID 75a3, MI300X):

Driver Initialization

All GPUs initialized successfully:

  bash-5.1# dmesg | grep -E 'amdgpu.*initialized'
  [   40.434923] amdgpu 0000:51:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[65bff] ras_mask[65bff]
  [   42.672485] amdgpu 0000:51:00.0: amdgpu: SMU is initialized successfully!
  [   43.326766] amdgpu 0000:52:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[65bff] ras_mask[65bff]
  [   45.583236] amdgpu 0000:52:00.0: amdgpu: SMU is initialized successfully!
  # ... (8 GPUs total, all initialized successfully)

All GPUs detected by PCI:

  bash-5.1# lspci | grep -i 75a3 | wc -l
  8
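
To turn the manual check above into something repeatable, a sketch like the following counts the distinct PCI addresses that logged the SMU success message. The function name is hypothetical, and treating the SMU line as the marker for a fully initialized GPU is an assumption made here, not something the driver documents.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints how many
# distinct PCI addresses reported "SMU is initialized successfully".
count_initialized_gpus() {
    grep -E 'amdgpu [0-9a-f:.]+: amdgpu: SMU is initialized successfully' \
        | awk '{
              # The PCI address is the field right after the first "amdgpu".
              for (i = 1; i < NF; i++)
                  if ($i == "amdgpu") { print $(i + 1); break }
          }' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}
```

On the test instance this could be run as `dmesg | count_initialized_gpus` and compared against the `lspci` count.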

Package Installation

  bash-5.1# cat /var/lib/bottlerocket/inventory/application.json | grep -A 10 "kmod-6.12-amdgpu"
  {
    "Name": "kmod-6.12-amdgpu",
    "Publisher": "bottlerocket-kernel-kit",
    "Version": "30.10",
    "Release": "1.1763654568.d2bc9648.br1",
    "Epoch": "0",
    "InstalledTime": "2025-11-20T20:59:51Z",
    "ApplicationType": "Unspecified",
    "Architecture": "x86_64",
    "Url": "https://www.amd.com/",
    "Summary": "AMD GPU drivers for the 6.12 kernel"
  }

Service Status and Integration

All services running correctly:

  bash-5.1# systemctl list-units --type=service | grep -E "(kmod|driver|device)"
    amdgpu-drivers-loaded.service             loaded active exited  AMD GPU drivers configured
    amdgpu-drivers.service                    loaded active exited  AMD GPU driver detection and validation
    kmod-static-nodes.service                 loaded active exited  Create List of Static Device Nodes
    rocm-k8s-device-plugin.service            loaded active running Start ROCm kubernetes device plugin
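
The same listing can be checked mechanically. The sketch below reads `systemctl list-units --type=service` output on stdin and verifies that each unit named on the command line is shown as loaded and active. The function name is hypothetical; the unit names used with it would be the ones from the listing above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if every unit given as an argument
# appears in the listing on stdin with LOAD=loaded and ACTIVE=active.
units_all_active() {
    unit_listing="$(cat)"
    for unit in "$@"; do
        printf '%s\n' "$unit_listing" \
            | awk -v u="$unit" '
                  $1 == u && $2 == "loaded" && $3 == "active" { found = 1 }
                  END { exit !found }
              ' \
            || { echo "not active: $unit" >&2; return 1; }
    done
}
```

For example: `systemctl list-units --type=service | units_all_active amdgpu-drivers.service amdgpu-drivers-loaded.service`.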

Terms of Contribution

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@bcressey bcressey left a comment

When aarch64 fails, that's a sign that cross-compiling might not be set up correctly. It's not always true, but it's the theory you should work hard to disprove.

Based on that, I'd say you should get the aarch64 build at least to the point where it fails on a source-level incompatibility. Otherwise I don't trust that the x86_64 build is doing what we think, which would mean the configure / make logic may not be correct.

@mgsharm mgsharm force-pushed the amd-gpu-support-dkms branch 3 times, most recently from 9efa41a to f0a45e3 Compare November 19, 2025 09:11
@mgsharm mgsharm force-pushed the amd-gpu-support-dkms branch 2 times, most recently from f1f4319 to 677ca98 Compare November 20, 2025 22:02
@mgsharm mgsharm force-pushed the amd-gpu-support-dkms branch from 677ca98 to ed8a298 Compare November 21, 2025 04:17
@mgsharm mgsharm force-pushed the amd-gpu-support-dkms branch 2 times, most recently from 74d21fb to 3acf425 Compare November 24, 2025 02:34
@mgsharm mgsharm marked this pull request as ready for review November 24, 2025 06:24
@mgsharm mgsharm force-pushed the amd-gpu-support-dkms branch 2 times, most recently from 578101c to bc4b69a Compare November 25, 2025 21:59
@mgsharm mgsharm force-pushed the amd-gpu-support-dkms branch from bc4b69a to c7c08d7 Compare November 26, 2025 00:14
Disable CONFIG_DRM_AMDGPU and remove related kernel modules from
packaging to allow DKMS AMD driver to take precedence.

Signed-off-by: Gaurav Sharma <[email protected]>
- Add new kmod-6.12-amdgpu package with DKMS driver support
- Update kernel 1006 patch to include additional DRM helpers for AMD GPU

Signed-off-by: Gaurav Sharma <[email protected]>
@mgsharm mgsharm force-pushed the amd-gpu-support-dkms branch from c7c08d7 to 9509795 Compare November 26, 2025 00:29

mgsharm commented Nov 26, 2025

Added the linux-firmware dependency that the AMD GPU needs in order to initialize: Requires: %{_cross_os}linux-firmware-amdgpu

%global kernel_sources %{_cross_usrsrc}/kernels/6.12
%define _kernel_version %(cat %{kernel_sources}/include/config/kernel.release)
%global _cross_kmoddir %{_cross_libdir}/modules/%{_kernel_version}
%global amdgpu_kmoddir %{_cross_kmoddir}/kernel/drivers/extra/gpu/drm/amd/amdgpu

Please use updates instead of extra:

Suggested change
- %global amdgpu_kmoddir %{_cross_kmoddir}/kernel/drivers/extra/gpu/drm/amd/amdgpu
+ %global amdgpu_kmoddir %{_cross_kmoddir}/kernel/drivers/updates/gpu/drm/amd/amdgpu
