Configure Nvidia GPU drivers for use with PyTorch and CUDA
JuNest is based on Arch Linux, a rolling-release Linux distribution whose packages are always up to date. This means that, in most cases, the Nvidia drivers available through the Arch Linux package repository have a higher version number than the drivers installed in the host environment. This version mismatch between JuNest and the host environment breaks tools like PyTorch and CUDA. For example:
>>> import torch
>>> torch.cuda.is_available()
/usr/lib/python3.10/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /build/python-pytorch/src/pytorch-1.9.0-opt-cuda/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False
The solution is to downgrade the Nvidia GPU driver in JuNest to match the version available in the host environment.
First, check the Nvidia GPU driver version available on the host:
cat /proc/driver/nvidia/version
Find the version string in the output. In this example, the version is 535.183.01.
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.183.01 Sun May 12 19:39:15 UTC 2024
GCC version: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
Then, check the Nvidia GPU driver version available in JuNest:
pacman -Sii nvidia-dkms
Again, find the version string in the output. In this example, the version is 550.78.
Repository : extra
Name : nvidia-dkms
Version : 550.78-1
Description : NVIDIA drivers - module sources
Architecture : x86_64
URL : http://www.nvidia.com/
Licenses : custom
Groups : None
Provides : NVIDIA-MODULE nvidia
Depends On : dkms nvidia-utils=550.78 libglvnd
Optional Deps : None
Required By : python-cuda python-cuda-docs
Optional For : bumblebee
Conflicts With : NVIDIA-MODULE nvidia
Replaces : None
Download Size : 39.89 MiB
Installed Size : 68.31 MiB
Packager : Sven-Hendrik Haase <[email protected]>
Build Date : Thu 02 May 2024 04:36:33 PM EDT
MD5 Sum : None
SHA-256 Sum : 760c143003ab755c348094b26e869435aae535449655724466a7f614ea57aef4
Signatures : 39E4B877E62EB915
Extended Data : None
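The two checks above can also be scripted. The following is a minimal sketch, not part of the original tutorial: the helper names host_driver_version and repo_driver_version are hypothetical, and the parsing assumes the output formats shown in the examples above.

```shell
#!/bin/sh
# Hypothetical helpers; they assume the output formats shown above.

# Read the text of /proc/driver/nvidia/version on stdin and print the
# first dotted version number found on the "NVRM version:" line.
host_driver_version() {
    awk '/^NVRM version:/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+(\.[0-9]+)?$/) { print $i; exit }
    }'
}

# Read `pacman -Si` output on stdin and print the package version
# without the trailing "-<pkgrel>" suffix (550.78-1 becomes 550.78).
repo_driver_version() {
    sed -n 's/^Version *: *\(.*\)-[^-]*$/\1/p' | head -n 1
}

# Only run the comparison where both data sources exist.
if [ -r /proc/driver/nvidia/version ] && command -v pacman >/dev/null 2>&1; then
    host=$(host_driver_version < /proc/driver/nvidia/version)
    repo=$(pacman -Si nvidia-dkms | repo_driver_version)
    if [ "$host" = "$repo" ]; then
        echo "Driver versions match: $host"
    else
        echo "Version mismatch: host=$host, JuNest repo=$repo"
    fi
fi
```

Because /proc is shared with the host, the script can be run from inside JuNest. It only reports the two versions and changes nothing.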
Now that we've confirmed the existence of a version mismatch, we need to install downgraded versions of the Nvidia GPU drivers in JuNest. The three packages to be downgraded are nvidia-dkms, nvidia-utils, and opencl-nvidia.
Here comes the tricky part. There are many ways to downgrade these packages, each with differing levels of complexity.
- Step 1: Check if the target version exists in the history of the Arch Linux package repository.
  - Exists? Go to Step 2.
  - Doesn't exist? Go to Step 3.
- Step 2: Downgrade the packages directly using the downgrade program.
  - Successful? Done!
  - Unsuccessful? Go to Step 6.
- Step 3: Check out the PKGBUILD of the closest matching driver version in the Arch Linux package repository.
- Step 4: Modify the version string in the PKGBUILD to download the correct version of the driver.
- Step 5: Run makepkg -s and install the package.
  - Successful? Done!
  - Unsuccessful? Go to Step 6.
- Step 6: Downgrade the Linux kernel to match the host environment, and re-run the original command.
Install downgrade from the AUR:
yay -S downgrade
Display a list of past driver versions:
sudo downgrade nvidia-dkms
If the target version exists, press Ctrl+C to cancel, then downgrade the three packages in one command:
sudo downgrade nvidia-dkms nvidia-utils opencl-nvidia
Add packages to the ignore list to prevent updates in the future:
sudo nano /etc/pacman.conf
Uncomment the line that contains IgnorePkg, then change it to:
IgnorePkg = nvidia-dkms nvidia-utils opencl-nvidia
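If you prefer not to edit the file by hand, the same change can be made non-interactively. This is only a sketch: set_ignorepkg is a hypothetical helper name, and it assumes GNU sed and a pacman.conf that already contains an IgnorePkg line (commented or not), as the stock file does.

```shell
#!/bin/sh
# Hypothetical helper: set (or uncomment) the IgnorePkg line in a pacman
# configuration file. Usage: set_ignorepkg FILE PACKAGE...
# Assumes GNU sed (for -i) and an existing "IgnorePkg =" or "#IgnorePkg ="
# line, as in the stock pacman.conf.
set_ignorepkg() {
    conf=$1
    shift
    sed -E -i "s|^#?[[:space:]]*IgnorePkg[[:space:]]*=.*|IgnorePkg = $*|" "$conf"
}

# Example (uncomment to apply; needs write access to the file):
# set_ignorepkg /etc/pacman.conf nvidia-dkms nvidia-utils opencl-nvidia
```

Trying the function on a copy of pacman.conf first is a cheap way to confirm that the substitution does what you expect.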
Navigate to the nvidia-dkms page in the Arch Linux package repository. In the top-right corner, click ‘View Changes’ to view the package's commit history in the Arch Linux GitLab repository.
Find the commit hash of the closest version. In this example, where we want version 535.183.01 of the Nvidia driver packages, the closest available version is 535.113.01 and its commit hash is b43efedee3e5a44eb1af5c299f68403f791107a9.
Clone the repository and perform a hard reset to the desired commit hash:
git clone https://gitlab.archlinux.org/archlinux/packaging/packages/nvidia-utils.git
cd nvidia-utils
git reset --hard b43efedee3e5a44eb1af5c299f68403f791107a9
Inside the directory there is a PKGBUILD script. Modify the version string to download the correct version of the driver.
nano PKGBUILD
After opening the editor, there are two things worth noting here:
- Set pkgver to the target version, i.e. the driver version installed on the host (on my current machine it is 459.29.05).
- Changing pkgver changes the source URL, so the downloaded file no longer matches the recorded sha512sums. For simplicity, disable the check by replacing the original sha512sums entries with 'SKIP'.
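The pkgver edit can also be done from the command line. This is a minimal sketch, assuming GNU sed and a PKGBUILD with a single top-level pkgver= line; set_pkgver is a hypothetical helper name. Instead of editing sha512sums by hand, the commented example passes makepkg's --skipchecksums option, which is a swapped-in alternative to the 'SKIP' approach described above.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the pkgver= line of a PKGBUILD.
# Usage: set_pkgver PKGBUILD VERSION   (assumes GNU sed for -i)
set_pkgver() {
    sed -E -i "s/^pkgver=.*/pkgver=$2/" "$1"
}

# Example, run inside the cloned nvidia-utils directory:
# set_pkgver PKGBUILD 535.183.01
# makepkg -s --skipchecksums   # build without verifying source checksums
```

Either way, skipping checksum verification means the downloaded driver file is not validated, so only do this with a source URL you trust.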
Run makepkg with the modified PKGBUILD script:
makepkg -s
If the build is successful, you can install the resulting packages:
sudo pacman -U *.pkg.tar.zst
Add packages to the ignore list to prevent updates in the future:
sudo nano /etc/pacman.conf
Uncomment the line that contains IgnorePkg, then change it to:
IgnorePkg = nvidia-dkms nvidia-utils opencl-nvidia
The installation may fail because the old GPU driver may not be compatible with the latest version of the Linux kernel, which is the default choice of Arch Linux. If this happens, you need to downgrade the Linux kernel as well.
Outside the JuNest environment, check the Linux kernel version of the host:
uname -r
In this example, the output is 5.4.0-126-generic, so we know that the host's GPU driver (version 459.29.05 here) is compatible with Linux kernel version 5.4.0. Note that the kernel version in JuNest need not be exactly the same as the host's.
Then, downgrade the Linux kernel in JuNest:
sudo nano /etc/pacman.conf
At the end of the file, add these two lines to enable the unofficial kernel-lts user repository:
[kernel-lts]
Server = https://repo.m2x.dev/current/$repo/$arch
Refresh the package list:
sudo pacman -Sy
List all the available Linux kernel versions:
sudo pacman -Ss 'linux-lts.*-headers'
The output shows that Linux kernel 5.4 is available, so we can install it with:
sudo pacman -S linux-lts54 linux-lts54-headers
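To work out which kernel series to look for, the host's release string can be reduced to its major.minor part. This is a rough sketch: both helper names are hypothetical, and the package-name guess simply follows the linux-lts54 pattern seen above, which is an assumption and not a documented naming rule. Always confirm against the pacman -Ss output.

```shell
#!/bin/sh
# Hypothetical helpers. Print the major.minor series of a kernel release
# string such as the output of `uname -r`.
kernel_series() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Guess the kernel-lts package name for a release string, following the
# linux-lts54 naming seen above (an assumption, not a documented rule).
lts_package_name() {
    printf 'linux-lts%s\n' "$(kernel_series "$1" | tr -d .)"
}

# Show the series and the guessed package name for the running kernel.
kernel_series "$(uname -r)"
lts_package_name "$(uname -r)"
```

For the release string 5.4.0-126-generic from the example, the helpers give 5.4 and linux-lts54.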
Huge shoutout to @ayaka_45434 for posting the original version of this tutorial on Medium!