Solve the version conflicts between the Nvidia driver and CUDA toolkit

Dealing with Nvidia GPU drivers and CUDA software can get tricky. Sometimes, when you update CUDA or your Linux system, it messes up your GPU drivers. So, when that happens, we often have to hunt around online for fixes. It can take a bit of time and digging to find the right solution.

Some of the errors I have recently run into are related to Nvidia driver and CUDA failures:

The following packages have unmet dependencies:
 cuda-drivers-535 : Depends: nvidia-dkms-535 (>= 535.161.08)
                    Depends: nvidia-driver-535 (>= 535.161.08) but it is not going to be installed

UserWarning: CUDA initialization: CUDA unknown error – this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running

After spending a significant amount of time troubleshooting, I realized that a better grasp of how CUDA and the Nvidia driver interact would have let me fix the broken driver much faster. In this post, I explain GPU drivers and CUDA versions, and answer some related questions, so that others can work through these problems more quickly.

What is CUDA?

NVIDIA's GPU owes much of its success to the CUDA platform, which stands for Compute Unified Device Architecture. CUDA is a parallel computing platform and application programming interface (API) model developed by NVIDIA. It empowers developers to harness the computational prowess of NVIDIA GPUs (Graphics Processing Units) for a variety of tasks beyond traditional graphics rendering.

Key components of CUDA include:

CUDA Toolkit: This comprehensive development environment provided by NVIDIA equips developers with the necessary tools for building GPU-accelerated applications. It comprises libraries, development tools, compilers (such as nvcc), and runtime APIs.

CUDA C/C++: Extending the capabilities of C and C++ programming languages, CUDA introduces special keywords and constructs that facilitate code authoring for both CPU and GPU. This enables developers to delegate parallelizable portions of their code to the GPU for execution, resulting in significant speed enhancements across various applications.

Runtime API: CUDA furnishes a runtime API enabling developers to manage GPU devices, allocate memory on the GPU, launch kernels (parallel functions executed on the GPU), and synchronize operations between the CPU and GPU.

GPU Architecture: Modern NVIDIA GPUs have a massively parallel architecture, with thousands of cores that execute computations concurrently. CUDA lets developers use this parallelism to accelerate tasks such as scientific simulations, data analytics, image processing, and deep learning.

NVCC and Nvidia-SMI

Let's clarify two important command-line tools in the CUDA ecosystem: nvcc, the NVIDIA CUDA Compiler, and nvidia-smi, the NVIDIA System Management Interface. nvcc compiles CUDA-accelerated applications, while nvidia-smi monitors and manages NVIDIA GPU devices. Each reports a CUDA version, but not the same one. CUDA offers both a runtime API and a driver API: the version reported by nvcc is that of the installed toolkit (runtime API), while the version displayed by nvidia-smi is the highest CUDA version that the installed driver supports (driver API). Because the toolkit and the driver can be installed separately, the two numbers may differ, so check both when you diagnose a compatibility problem.
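To see both numbers side by side, something like the following can help. It is only a sketch: the two helper functions are names I made up for this post, and they assume the usual output formats (a "release X.Y" line from nvcc --version and a "CUDA Version: X.Y" field in the nvidia-smi header).

```shell
# Hypothetical helpers (not part of any NVIDIA tool).

# Extract the toolkit CUDA version from `nvcc --version` text on stdin.
nvcc_cuda_version() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Extract the driver-side CUDA version from the `nvidia-smi` header on stdin.
smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print each version if the tool is installed; otherwise say that it is missing.
if command -v nvcc >/dev/null 2>&1; then
    echo "toolkit (nvcc): $(nvcc --version | nvcc_cuda_version)"
else
    echo "nvcc not found on PATH"
fi
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "driver (nvidia-smi): $(nvidia-smi | smi_cuda_version)"
else
    echo "nvidia-smi not found on PATH"
fi
```

If one of the tools is missing from PATH, that alone is a useful hint about which half of the installation is broken.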

Typically, the driver API is installed by the GPU driver installer, which also provides nvidia-smi. The runtime API and nvcc, by contrast, come with the CUDA Toolkit installer. The toolkit installer does not need to be aware of the GPU driver, so you can install the CUDA Toolkit even on a machine without a GPU and still write and compile CUDA code there. It also means you can install multiple CUDA versions on a single machine and select the one you need. Note, however, that the driver is backward compatible: it supports its own CUDA version and the toolkit versions released before it. So in general, the CUDA version of the driver API must be higher than or equal to the CUDA version of the runtime API.
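The general rule above (driver-side CUDA version greater than or equal to the toolkit version) is easy to check in a script. Here is a minimal sketch; version_ge and check_pair are hypothetical helper names, and the version numbers in the example calls are just sample values, not readings from a real machine.

```shell
# Succeeds when dotted version $1 is greater than or equal to $2.
# Compares up to three numeric components, so 11.10 counts as newer than 11.9.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if ((x[i] + 0) > (y[i] + 0)) exit 0
            if ((x[i] + 0) < (y[i] + 0)) exit 1
        }
        exit 0
    }'
}

# $1 = CUDA version shown by nvidia-smi, $2 = CUDA version shown by nvcc.
check_pair() {
    if version_ge "$1" "$2"; then
        echo "ok: driver supports CUDA $1, toolkit is $2"
    else
        echo "mismatch: toolkit $2 is newer than driver-supported $1"
    fi
}

check_pair 12.2 11.0
check_pair 11.4 12.0
```

A "mismatch" result means you either need a newer driver or an older toolkit.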

The compatibility between CUDA versions and GPU driver versions is listed in Table 3 of https://docs.nvidia.com/deploy/cuda-compatibility/index.html .

Install different CUDA versions

Here are all the CUDA versions for installation:

https://developer.nvidia.com/cuda-toolkit-archive

Let us use CUDA Toolkit 12.0 as an example:

On the download page, pay close attention to the last option, Installer Type: choose runfile (local).

If you choose another option, such as deb, the installer may replace your newer GPU driver with an older one. The runfile, in contrast, lets you skip the driver during installation, so you keep the driver you already have. This matters whenever you have already installed the GPU driver separately.
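The runfile can also be run non-interactively. The sketch below only prints the command rather than running it. It assumes the --silent and --toolkit flags described in NVIDIA's Linux installation guide, which together install the toolkit without touching the driver; the helper name and the runfile name are placeholders, so substitute the file you actually downloaded.

```shell
# Print (do not run) a toolkit-only install command for a downloaded runfile.
# $1 = path to the .run file; the helper name is hypothetical.
toolkit_only_cmd() {
    printf 'sudo sh %s --silent --toolkit\n' "$1"
}

# Placeholder file name, not a real download.
toolkit_only_cmd ./cuda_linux.run
```

Printing the command first lets you review it before anything is installed.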

Install GPU Drivers:

sudo apt search nvidia-driver
sudo apt install nvidia-driver-510
sudo reboot
nvidia-smi
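If nvidia-smi still cannot talk to the driver after the reboot, a first check is whether the kernel module is loaded at all. When it is, the module exposes /proc/driver/nvidia/version. The helper below is a hypothetical sketch that reports on that file; the optional argument exists only so the function can be tried against another path.

```shell
# Report whether the NVIDIA kernel module appears to be loaded.
# $1 = version file to inspect (defaults to the path the loaded module exposes).
driver_status() {
    version_file="${1:-/proc/driver/nvidia/version}"
    if [ -r "$version_file" ]; then
        echo "loaded: $(head -n 1 "$version_file")"
    else
        echo "not loaded"
    fi
}

driver_status
```

"not loaded" right after installing a driver often means the module was never built for the running kernel, which is where the kernel headers come in.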

Multiple CUDA Version Switching

To begin with, you need to set up the CUDA environment variables for the actual version in use. Open the .bashrc file (vim ~/.bashrc) and add the following statements:

export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

These statements tell the shell to look for CUDA in the /usr/local/cuda directory. However, CUDA installation directories typically include a version number, such as cuda-11.0. The solution is a symbolic link, which is created as follows:

sudo ln -sfn /usr/local/cuda-11.0 /usr/local/cuda

After this is done, a symbolic link named cuda appears in the /usr/local/ directory, pointing to the cuda-11.0 folder. Accessing the link is equivalent to accessing cuda-11.0.

At this point, running nvcc --version will display something like:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Sun_Jan__9_22:14:01_CDT_2022
Cuda compilation tools, release 11.0, V11.0.218
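Switching to another installed toolkit then comes down to repointing that link. The function below is a sketch of one way to do it; switch_cuda is a name I made up, and the demo runs against a temporary directory so nothing under /usr/local is touched. For the real /usr/local you would need root privileges.

```shell
# Repoint <prefix>/cuda at <prefix>/cuda-<version>.
# $1 = version such as 11.0, $2 = install prefix (default /usr/local).
switch_cuda() {
    prefix="${2:-/usr/local}"
    target="$prefix/cuda-$1"
    if [ ! -d "$target" ]; then
        echo "no such toolkit: $target" >&2
        return 1
    fi
    # -f replaces an existing link; -n keeps ln from descending into the old target.
    ln -sfn "$target" "$prefix/cuda" && echo "cuda -> $target"
}

# Demo in a throwaway directory with two fake toolkit folders.
demo="$(mktemp -d)"
mkdir "$demo/cuda-11.0" "$demo/cuda-12.0"
switch_cuda 11.0 "$demo"
switch_cuda 12.0 "$demo"
readlink "$demo/cuda"
rm -rf "$demo"
```

After switching, run nvcc --version again to confirm that the release line changed.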

For instance, if you need to set up a deep learning environment with Python 3.9.8 + TensorFlow 2.7.0 + CUDA 11.0, follow these steps:

First, create a Python environment with Python 3.9.8 using Anaconda:

conda create -n myenv python=3.9.8
conda activate myenv

Then, install TensorFlow 2.7.0 using pip:

pip install tensorflow==2.7.0

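That's it! The environment works as long as the CUDA version behind /usr/local/cuda matches what the framework build expects, so check it against the version required by the author of the code you want to run and against the framework's tested-configuration table. (TensorFlow's published table lists 2.7.0 with CUDA 11.2 and cuDNN 8.1, so confirm the pairing before relying on CUDA 11.0.)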
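Before launching the framework, it can be worth confirming that the CUDA paths are actually visible in the activated shell, since changes to .bashrc only take effect in new shells. A small sketch follows; both function names are hypothetical, and it assumes the toolkit lives under $CUDA_HOME (falling back to /usr/local/cuda) with bin and lib64 subdirectories.

```shell
# Succeeds when the colon-separated list $1 contains the exact entry $2.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the CUDA bin and lib64 directories are on the search paths.
cuda_env_report() {
    cuda_dir="${CUDA_HOME:-/usr/local/cuda}"
    if path_contains "${PATH:-}" "$cuda_dir/bin"; then
        echo "PATH: ok"
    else
        echo "PATH: missing $cuda_dir/bin"
    fi
    if path_contains "${LD_LIBRARY_PATH:-}" "$cuda_dir/lib64"; then
        echo "LD_LIBRARY_PATH: ok"
    else
        echo "LD_LIBRARY_PATH: missing $cuda_dir/lib64"
    fi
}

cuda_env_report
```

A "missing" line usually means the shell was started before .bashrc was edited, or that the entry was written with a different path.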

Solve the driver and CUDA version problems

Now that we understand the relationship between the Nvidia driver and CUDA, the problems listed at the beginning become much easier to solve.

If you do not want to search the internet for a targeted fix, you can remove all Nvidia drivers and CUDA versions and reinstall them by following the previous steps. Here is one way to remove all Nvidia-related packages:

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'
sudo apt-get install linux-headers-$(uname -r)
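Before running the purge, it can help to list what is actually installed, so you know what will disappear. This sketch filters dpkg -l output for installed packages whose names start with nvidia-, libnvidia-, or cuda-; the function name is hypothetical, and the listing only runs on systems that have dpkg.

```shell
# Read `dpkg -l` style lines on stdin and print the names of installed ("ii")
# packages that start with nvidia-, libnvidia-, or cuda-.
nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /^(nvidia|libnvidia|cuda)-/ { print $2 }'
}

# Only query the package database on Debian/Ubuntu-style systems.
if command -v dpkg >/dev/null 2>&1; then
    dpkg -l | nvidia_packages
fi
```

Saving this list to a file before the purge also gives you a record of the versions that were installed.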
