Configure Nvidia GPU drivers for use with PyTorch and CUDA

Cosmo edited this page Jun 5, 2024 · 1 revision

Introduction

JuNest is based on Arch Linux, a rolling-release Linux distribution whose packages track the latest upstream releases. As a result, the Nvidia drivers available through the Arch Linux package repository are usually newer than the drivers installed in the host environment. JuNest runs on the host's kernel, so the driver libraries inside JuNest must match the kernel module loaded on the host. A version mismatch breaks CUDA and tools built on it, such as PyTorch. For example:

>>> import torch
>>> torch.cuda.is_available()
/usr/lib/python3.10/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at  /build/python-pytorch/src/pytorch-1.9.0-opt-cuda/c10/cuda/CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0
False

The solution is to downgrade the Nvidia GPU driver in JuNest to match the version available in the host environment.

First, check the Nvidia GPU driver version available on the host:

cat /proc/driver/nvidia/version

Find the version string in the output. In this example, the version is 535.183.01.

NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
GCC version:  gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
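
The version string can also be pulled out with a small helper rather than read by eye. This is a minimal sketch, not part of the original tutorial: the function name `nvidia_host_version` is hypothetical, and it assumes the `NVRM version:` line has the same layout as the sample output above.

```shell
# Hypothetical helper: print the driver version found on the
# "NVRM version:" line. Reads from stdin so it can also be tried
# on a saved copy of /proc/driver/nvidia/version.
nvidia_host_version() {
    awk '/^NVRM version:/ {
        # The version is the first field made only of digits and dots
        # (with at least one dot), e.g. 535.183.01.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+(\.[0-9]+)+$/) { print $i; exit }
    }'
}

# Usage on the host:
#   nvidia_host_version < /proc/driver/nvidia/version
```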

Then, check the Nvidia GPU driver version available in JuNest:

pacman -Sii nvidia-dkms

And again, find the version string in the output. In this example, the version is 550.78.

Repository      : extra
Name            : nvidia-dkms
Version         : 550.78-1
Description     : NVIDIA drivers - module sources
Architecture    : x86_64
URL             : http://www.nvidia.com/
Licenses        : custom
Groups          : None
Provides        : NVIDIA-MODULE  nvidia
Depends On      : dkms  nvidia-utils=550.78  libglvnd
Optional Deps   : None
Required By     : python-cuda  python-cuda-docs
Optional For    : bumblebee
Conflicts With  : NVIDIA-MODULE  nvidia
Replaces        : None
Download Size   : 39.89 MiB
Installed Size  : 68.31 MiB
Packager        : Sven-Hendrik Haase <[email protected]>
Build Date      : Thu 02 May 2024 04:36:33 PM EDT
MD5 Sum         : None
SHA-256 Sum     : 760c143003ab755c348094b26e869435aae535449655724466a7f614ea57aef4
Signatures      : 39E4B877E62EB915
Extended Data   : None
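
Note that the pacman version carries a package release suffix (`-1`), which is not part of the driver version. The sketch below shows one way to strip it before comparing against the host. The function names `upstream_version` and `versions_match` are hypothetical, and the comparison is a plain string match.

```shell
# Hypothetical helper: reduce a pacman version string such as "550.78-1"
# to the upstream driver version "550.78".
upstream_version() {
    v=${1#*:}                 # drop an epoch prefix ("1:"), if present
    printf '%s\n' "${v%-*}"   # drop the trailing "-pkgrel", if present
}

# Succeeds when the package version ($1) matches the host version ($2).
versions_match() {
    [ "$(upstream_version "$1")" = "$2" ]
}

# Usage inside JuNest, with the host version from the previous step:
#   pkg=$(pacman -Si nvidia-dkms | awk '/^Version/ {print $3}')
#   versions_match "$pkg" 535.183.01 || echo "mismatch: $pkg"
```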

Now that we've confirmed the existence of a version mismatch, we need to install downgraded versions of the Nvidia GPU drivers in JuNest. The three packages to be downgraded are nvidia-dkms, nvidia-utils, and opencl-nvidia.

Here comes the tricky part. There are many ways to downgrade these packages, each with differing levels of complexity.

Table of Contents

  • Step 1: Check if the target version exists in the history of the Arch Linux package repository.
    • Exists? Go to Step 2.
    • Doesn't exist? Go to Step 3.
  • Step 2: Downgrade the packages directly using the downgrade program.
    • Successful? Done!
    • Unsuccessful? Go to Step 6.
  • Step 3: Check out the PKGBUILD of the closest matching driver version in the Arch Linux package repository.
  • Step 4: Modify the version string in the PKGBUILD to download the correct version of the driver.
  • Step 5: Run makepkg -s and install the package.
    • Successful? Done!
    • Unsuccessful? Go to Step 6.
  • Step 6: Downgrade the Linux kernel to match the host environment, and re-run the original command.

Step 1/2: Using downgrade

Install downgrade from the AUR:

yay -S downgrade

Display a list of past driver versions:

sudo downgrade nvidia-dkms

If the target version exists, press Ctrl+C to cancel, then downgrade the three packages in one command:

sudo downgrade nvidia-dkms nvidia-utils opencl-nvidia

Add packages to the ignore list to prevent updates in the future:

sudo nano /etc/pacman.conf

Uncomment the line that contains IgnorePkg, then change it to:

IgnorePkg = nvidia-dkms nvidia-utils opencl-nvidia
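
The same edit can be scripted instead of made by hand in nano. This is a minimal sketch under a few assumptions: GNU sed is available, the file contains a single (possibly commented-out) `IgnorePkg` line as in the stock pacman.conf, and any packages already listed on that line can be overwritten. The function name `set_ignorepkg` is hypothetical.

```shell
# Hypothetical helper: uncomment the IgnorePkg line in the given
# pacman.conf and set it to the three driver packages.
# Caution: this replaces whatever was listed on that line before.
set_ignorepkg() {
    conf=$1
    sed -i -E \
        's/^#?[[:space:]]*IgnorePkg[[:space:]]*=.*/IgnorePkg = nvidia-dkms nvidia-utils opencl-nvidia/' \
        "$conf"
}

# Usage (needs write access to the file):
#   set_ignorepkg /etc/pacman.conf
```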

Step 3/4/5: Manually build the packages with makepkg and a modified PKGBUILD

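Navigate to the nvidia-dkms page in the Arch Linux package repository. In the top-right corner, click ‘View Changes’ to view the package's commit history in the Arch Linux GitLab repository. The nvidia-dkms, nvidia-utils, and opencl-nvidia packages are all built from the nvidia-utils packaging repository, so that is the repository to clone.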

Find the commit hash of the closest version. In this example, where we want version 535.183.01 of the Nvidia driver packages, the closest available version is 535.113.01 and its commit hash is b43efedee3e5a44eb1af5c299f68403f791107a9.

Clone the repository and perform a hard reset to the desired commit hash:

git clone https://gitlab.archlinux.org/archlinux/packaging/packages/nvidia-utils.git
cd nvidia-utils
git reset --hard b43efedee3e5a44eb1af5c299f68403f791107a9

Inside the directory there is a PKGBUILD script. Modify the version string to download the correct version of the driver.

nano PKGBUILD

(Figure: sample PKGBUILD script before modification)

Two things are worth noting in the editor:

  • Set pkgver to the target version, that is, the driver version on the host (535.183.01 in this example).
  • After pkgver changes, the source URL changes with it, so the existing sha512sums no longer match. For simplicity, we can disable the check by replacing the original sha512sums entry with 'SKIP', at the cost of skipping integrity verification of the download.

(Figure: sample PKGBUILD script after modification)
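
The pkgver change can also be made non-interactively. The sketch below is an illustration only: the function name `set_pkgver` is hypothetical, it assumes GNU sed and a PKGBUILD with a single top-level `pkgver=` line, and it leaves the sha512sums entry to be edited by hand as described above.

```shell
# Hypothetical helper: rewrite the pkgver= line of a PKGBUILD.
#   $1 = path to the PKGBUILD, $2 = target driver version
set_pkgver() {
    # Accept only digits and dots so the value is safe to splice into sed.
    case $2 in
        ''|*[!0-9.]*) echo "unexpected version: $2" >&2; return 1 ;;
    esac
    sed -i -E "s/^pkgver=.*/pkgver=$2/" "$1"
}

# Usage, from inside the cloned nvidia-utils directory:
#   set_pkgver PKGBUILD 535.183.01
```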

Run makepkg with the modified PKGBUILD script:

makepkg -s

If the build is successful, you can install the resulting packages:

sudo pacman -U *.tar.zst

Add packages to the ignore list to prevent updates in the future:

sudo nano /etc/pacman.conf

Uncomment the line that contains IgnorePkg, then change it to:

IgnorePkg = nvidia-dkms nvidia-utils opencl-nvidia

Step 6: Downgrade the Linux kernel (if necessary)

The installation may fail because the older GPU driver is not always compatible with the latest Linux kernel, which Arch Linux installs by default. If this happens, you need to downgrade the Linux kernel as well.

Outside the JuNest environment, check the Linux kernel version of the host:

uname -r

In this example, the output is 5.4.0-126-generic, so we know that the GPU driver version installed on the host is compatible with Linux kernel version 5.4.0. Note that the kernel version in JuNest need not be exactly the same as the host's.
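
Since only the kernel series matters here, it can help to reduce the `uname -r` output to its major.minor part. This is a minimal sketch and the function name `kernel_series` is hypothetical. It assumes the release string starts with `major.minor`, as in the example above.

```shell
# Hypothetical helper: reduce a kernel release string such as
# "5.4.0-126-generic" to its series, "5.4".
# Uses the running kernel when no argument is given.
kernel_series() {
    release=${1:-$(uname -r)}
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of what remains
    printf '%s.%s\n' "$major" "$minor"
}

# Usage on the host:
#   kernel_series
```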

Then, downgrade the Linux kernel in JuNest:

sudo nano /etc/pacman.conf

At the end of the file, add these two lines to add the kernel-lts unofficial user repository:

[kernel-lts]
Server = https://repo.m2x.dev/current/$repo/$arch
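
As an alternative to editing the file in nano, the two lines can be appended from the shell. The sketch below is an assumption-laden illustration: the function name `add_kernel_lts_repo` is hypothetical, and it only checks for an existing `[kernel-lts]` header so that running it twice does not duplicate the section.

```shell
# Hypothetical helper: append the kernel-lts repository section to the
# given pacman.conf unless a [kernel-lts] header is already present.
add_kernel_lts_repo() {
    conf=$1
    grep -q '^\[kernel-lts\]' "$conf" && return 0
    # Single quotes keep $repo and $arch literal; pacman expands them itself.
    printf '\n[kernel-lts]\nServer = https://repo.m2x.dev/current/$repo/$arch\n' >> "$conf"
}

# Usage (needs write access to the file):
#   add_kernel_lts_repo /etc/pacman.conf
```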

Refresh the package list:

sudo pacman -Sy

List all the available Linux kernel versions:

sudo pacman -Ss 'linux-lts.*-headers'

According to the output, Linux kernel 5.4 is available, so we can install it with:

sudo pacman -S linux-lts54 linux-lts54-headers

Acknowledgements

Huge shoutout to @ayaka_45434 for posting the original version of this tutorial on Medium!