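Problem 1: nvidia-smi fails with "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver": cause and fix
Scenario: the training server had become sluggish, so it was rebooted. When I ran the model again afterwards, CUDA was unavailable, and running nvidia-smi produced the following error: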
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver
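The cause is a Linux kernel upgrade that took effect with the reboot: the NVIDIA kernel module had been built for the old kernel and no longer loads under the new one. The driver files themselves are still on disk. I tried nvcc -V to confirm this; note that nvcc reports the CUDA toolkit version, not the driver version.
The nvcc command was not found, which means the toolkit was not installed, so I installed nvidia-cuda-toolkit with: sudo apt install nvidia-cuda-toolkit
Be aware that, as the dpkg log under Problem 2 shows, this command also pulled in libnvidia-compute-515, a user-space library from a different driver release than 470.94.
After installation, nvcc -V prints the CUDA toolkit version.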
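Before changing anything, it can help to collect the three facts that usually explain this error after a reboot. The following is a minimal sketch, not part of the original fix; the function name driver_report is made up for illustration, and it only reads state.

```shell
#!/bin/sh
# Sketch: report the running kernel, whether the nvidia module is loaded,
# and what DKMS knows about. driver_report is a hypothetical helper name.
driver_report() {
    echo "kernel: $(uname -r)"

    # /proc/modules lists loaded modules, one per line, name first.
    if [ -r /proc/modules ] && grep -q '^nvidia ' /proc/modules; then
        echo "module: loaded"
    else
        echo "module: not loaded"
    fi

    # dkms may not be installed yet (that is Step 1 of the fix).
    if command -v dkms >/dev/null 2>&1; then
        dkms status
    else
        echo "dkms: not installed"
    fi
}

driver_report
```

If the module is reported as not loaded and dkms status shows no nvidia build for the kernel printed on the first line, the situation most likely matches the one described here.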
After trying various solutions found online, the fix is as follows:
Step 1: install dkms:
sudo apt-get install dkms
Step 2: find the version of the driver that fails to load
ls -l /usr/src/
You should see an nvidia directory, here nvidia-470.94. If there is no such directory, download the matching driver first. Download link: xxxxx
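To pull just the version number out of that listing, something like the sketch below could be used. The function name nvidia_src_versions is hypothetical, and it assumes the driver sources sit in directories named nvidia-<version>, as in the example above.

```shell
#!/bin/sh
# Sketch: print the driver versions that have sources under a directory
# (default /usr/src). nvidia_src_versions is a hypothetical helper name.
nvidia_src_versions() {
    src_dir="${1:-/usr/src}"
    for d in "$src_dir"/nvidia-*; do
        [ -d "$d" ] || continue          # glob matched nothing, or not a directory
        basename "$d" | sed 's/^nvidia-//'
    done
}

nvidia_src_versions /usr/src
```

On the machine described here this would print 470.94; an empty result means the driver sources are missing and need to be downloaded first.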
Step 3: install the matching driver:
sudo /data/disk-2T/xxxx/softwares/cuda/NVIDIA-Linux-x86_64-470.94.run
Replace this path with the location where you downloaded the 470.94 .run file.
Alternatively, use this command:
sudo dkms install -m nvidia -v 470.94
The value after -v must be the NVIDIA driver version on your machine, as found in Step 2. If this install fails, see Problem 3 below.
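The dkms command can also be pointed at a specific kernel with -k, which makes explicit that the module is rebuilt for the kernel now running. This is a sketch under assumptions: DRIVER_VERSION and RUN are variables invented here, 470.94 is only the example version from this post, and by default nothing is installed.

```shell
#!/bin/sh
# Sketch: build the DKMS install command for the running kernel.
# DRIVER_VERSION and RUN are hypothetical variables; set RUN=1 to execute.
DRIVER_VERSION="${DRIVER_VERSION:-470.94}"
KERNEL="$(uname -r)"

dkms_cmd() {
    # -m module name, -v module version, -k target kernel
    printf 'dkms install -m nvidia -v %s -k %s\n' "$1" "$2"
}

if [ "${RUN:-0}" = "1" ]; then
    sudo dkms install -m nvidia -v "$DRIVER_VERSION" -k "$KERNEL"
else
    echo "would run: sudo $(dkms_cmd "$DRIVER_VERSION" "$KERNEL")"
fi
```

Printing the command first is a deliberate choice: it lets you check the version and kernel strings before anything is rebuilt.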
If the install succeeds, congratulations: nvidia-smi should now print its usual GPU status table.
If it still fails, continue with Problem 2.
Problem 2: NVIDIA GPU "Failed to initialize NVML: Driver/library version mismatch": fix
Reproducing the problem:
$ nvidia-smi
-->
Failed to initialize NVML: Driver/library version mismatch
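Analysis:
The version of the loaded NVIDIA kernel module does not match the version of the NVIDIA user-space libraries installed on the system.
Locating the problem:
1. Check the version of the loaded NVIDIA kernel module: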
cat /proc/driver/nvidia/version
Output:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.94 Mon Dec 6 22:42:02 UTC 2021
GCC version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
The kernel module version is 470.94, and it was built with the GCC shipped in Ubuntu 18.04.
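If you want the version as a bare string, for example to compare it in a script, the NVRM line can be parsed. This is a small sketch assuming the line has the format shown above; kmod_version is a name made up for illustration.

```shell
#!/bin/sh
# Sketch: extract the version that follows "Kernel Module" on the NVRM line.
# kmod_version is a hypothetical helper; it reads the file content on stdin.
kmod_version() {
    sed -n 's/.*Kernel Module *\([0-9][0-9.]*\).*/\1/p'
}

# Sample line copied from the output above.
sample='NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.94 Mon Dec 6 22:42:02 UTC 2021'
printf '%s\n' "$sample" | kmod_version    # prints 470.94

# On a live system:
#   kmod_version < /proc/driver/nvidia/version
```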
2. Check the package installation log for NVIDIA entries:
grep nvidia /var/log/dpkg.log
Output:
2022-06-24 11:45:00 install libnvidia-compute-515:amd64 <none> 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:00 status half-installed libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:04 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:04 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:18 install nvidia-cuda-dev:amd64 <none> 9.1.85-3ubuntu1
2022-06-24 11:45:18 status half-installed nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:34 status unpacked nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:34 status unpacked nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:34 install nvidia-cuda-doc:all <none> 9.1.85-3ubuntu1
2022-06-24 11:45:34 status half-installed nvidia-cuda-doc:all 9.1.85-3ubuntu1
2022-06-24 11:45:38 status unpacked nvidia-cuda-doc:all 9.1.85-3ubuntu1
2022-06-24 11:45:38 status unpacked nvidia-cuda-doc:all 9.1.85-3ubuntu1
2022-06-24 11:45:38 install nvidia-cuda-gdb:amd64 <none> 9.1.85-3ubuntu1
2022-06-24 11:45:38 status half-installed nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:38 status unpacked nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:38 status unpacked nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:38 install nvidia-profiler:amd64 <none> 9.1.85-3ubuntu1
2022-06-24 11:45:38 status half-installed nvidia-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:39 status unpacked nvidia-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:39 status unpacked nvidia-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:39 install nvidia-opencl-dev:amd64 <none> 9.1.85-3ubuntu1
2022-06-24 11:45:39 status half-installed nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:39 status unpacked nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:39 status unpacked nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:39 install nvidia-cuda-toolkit:amd64 <none> 9.1.85-3ubuntu1
2022-06-24 11:45:39 status half-installed nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:40 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:40 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:40 install nvidia-visual-profiler:amd64 <none> 9.1.85-3ubuntu1
2022-06-24 11:45:40 status half-installed nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:46 status unpacked nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:46 status unpacked nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:46 configure nvidia-cuda-doc:all 9.1.85-3ubuntu1 <none>
2022-06-24 11:45:46 status unpacked nvidia-cuda-doc:all 9.1.85-3ubuntu1
2022-06-24 11:45:46 status half-configured nvidia-cuda-doc:all 9.1.85-3ubuntu1
2022-06-24 11:45:46 status installed nvidia-cuda-doc:all 9.1.85-3ubuntu1
2022-06-24 11:45:47 configure libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1 <none>
2022-06-24 11:45:47 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:47 status unpacked libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:47 status half-configured libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:47 status installed libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1
2022-06-24 11:45:47 configure nvidia-opencl-dev:amd64 9.1.85-3ubuntu1 <none>
2022-06-24 11:45:47 status unpacked nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 status half-configured nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 status installed nvidia-opencl-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 configure nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1 <none>
2022-06-24 11:45:47 status unpacked nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 status half-configured nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 status installed nvidia-cuda-gdb:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 configure nvidia-profiler:amd64 9.1.85-3ubuntu1 <none>
2022-06-24 11:45:47 status unpacked nvidia-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 status half-configured nvidia-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 status installed nvidia-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 configure nvidia-visual-profiler:amd64 9.1.85-3ubuntu1 <none>
2022-06-24 11:45:47 status unpacked nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:47 status half-configured nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 status installed nvidia-visual-profiler:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 configure nvidia-cuda-dev:amd64 9.1.85-3ubuntu1 <none>
2022-06-24 11:45:48 status unpacked nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 status half-configured nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 status installed nvidia-cuda-dev:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 configure nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1 <none>
2022-06-24 11:45:48 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 status unpacked nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 status half-configured nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
2022-06-24 11:45:48 status installed nvidia-cuda-toolkit:amd64 9.1.85-3ubuntu1
This shows that the 515.48.07 libraries (libnvidia-compute-515), packaged for Ubuntu 18.04, were installed.
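The log is long because dpkg records every state change. To see only which NVIDIA packages were installed and when, the "install" rows can be filtered out. This is a sketch that relies on the field layout visible in the log above; nvidia_installs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: keep only "install" actions for packages whose name contains nvidia.
# dpkg.log fields: date time action package old-version new-version
nvidia_installs() {
    awk '$3 == "install" && $4 ~ /nvidia/ { print $1, $2, $4, $6 }'
}

# Three sample lines copied from the log above.
printf '%s\n' \
  '2022-06-24 11:45:00 install libnvidia-compute-515:amd64 <none> 515.48.07-0ubuntu0.18.04.1' \
  '2022-06-24 11:45:00 status half-installed libnvidia-compute-515:amd64 515.48.07-0ubuntu0.18.04.1' \
  '2022-06-24 11:45:39 install nvidia-cuda-toolkit:amd64 <none> 9.1.85-3ubuntu1' \
  | nvidia_installs

# On a live system:
#   nvidia_installs < /var/log/dpkg.log
```

For the sample this leaves two rows, one for libnvidia-compute-515 and one for nvidia-cuda-toolkit, which makes it easier to see that both arrived in the same apt run.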
3. List the installed NVIDIA packages:
dpkg --list | grep -i nvidia
Output:
ii libnvidia-compute-460-server:amd64 515.48.07-0ubuntu0.18.04.1 amd64 NVIDIA libcompute package
ii libnvidia-container-tools 1.0.5-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.5-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.1.4-1 amd64 NVIDIA container runtime
ii nvidia-container-toolkit 1.0.5-1 amd64 NVIDIA container runtime hook
ii nvidia-cuda-dev 9.1.85-3ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 9.1.85-3ubuntu1 all NVIDIA CUDA and OpenCL documentation
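- The system has NVIDIA 515 user-space packages for Ubuntu 18.04 installed.
- The root cause is that these 515.48.07 libraries do not match the loaded 470.94 kernel module.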
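The comparison behind this diagnosis can be sketched in a few lines. The helper names upstream_version and check_match are invented for illustration; the two version strings are the ones that appear in the outputs above.

```shell
#!/bin/sh
# Sketch: compare the kernel module version with a library package version.
upstream_version() {
    # Drop the Debian/Ubuntu revision: "515.48.07-0ubuntu0.18.04.1" -> "515.48.07"
    printf '%s\n' "${1%%-*}"
}

check_match() {
    kmod="$1"
    lib="$(upstream_version "$2")"
    if [ "$kmod" = "$lib" ]; then
        echo "match: $kmod"
    else
        echo "mismatch: kernel module $kmod, user-space library $lib"
        return 1
    fi
}

# Versions taken from /proc/driver/nvidia/version and dpkg.log above.
check_match 470.94 515.48.07-0ubuntu0.18.04.1 || true
```

A non-zero return on mismatch makes the function usable as a guard in a larger script.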
Solution:
Uninstall the driver:
sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove 'nvidia-*'
sudo apt-get purge 'nvidia*'
sudo apt-get purge 'libnvidia*'
After this, the following command should print nothing:
dpkg --list | grep -i nvidia
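A slightly stricter check is to look at the package-name column only and to show the dpkg status code, since rows marked rc are packages that were removed but still have configuration files. This is a sketch with a made-up helper name, remaining_nvidia_pkgs. Note also that patterns such as nvidia* match nvidia-container-toolkit and nvidia-container-runtime from the listing above, so those would likely need reinstalling if containers use the GPU.

```shell
#!/bin/sh
# Sketch: list packages whose name contains "nvidia", with their status code.
# dpkg --list rows: status name version architecture description
remaining_nvidia_pkgs() {
    awk '$2 ~ /nvidia/ { print $1, $2 }'
}

# Sample rows in dpkg --list format (the rc row is illustrative).
printf '%s\n' \
  'ii  libnvidia-container1:amd64  1.0.5-1  amd64  NVIDIA container runtime library' \
  'rc  nvidia-cuda-dev  9.1.85-3ubuntu1  amd64  NVIDIA CUDA development files' \
  'ii  gcc-7  7.5.0-3ubuntu1~18.04  amd64  GNU C compiler' \
  | remaining_nvidia_pkgs

# On a live system:
#   dpkg --list | remaining_nvidia_pkgs
```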
Reinstall:
sudo /data/disk-2T/xxxx/softwares/cuda/NVIDIA-Linux-x86_64-470.94.run
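When the installer finishes, check the result with nvidia-smi.
Problem 3: installation fails because of a gcc version mismatch
If your gcc is too old (ideally it should be newer than 7.3), that explains why the command sudo dkms install -m nvidia -v 470.94 above failed. Check the current gcc version: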
gcc --version
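The version check can be automated with a small comparison helper. This is a sketch assuming GNU sort (for -V); version_ge and MIN_GCC are names invented here, and 7.3 is simply the threshold suggested above, not an official requirement.

```shell
#!/bin/sh
# Sketch: test whether the default gcc meets a minimum version.
version_ge() {
    # True when $1 >= $2 under version ordering (GNU sort -V).
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_GCC="7.3"   # threshold suggested in this post (assumption, not a hard rule)

if command -v gcc >/dev/null 2>&1; then
    # -dumpfullversion gives e.g. 7.5.0 on newer gcc; -dumpversion is the fallback.
    cur="$(gcc -dumpfullversion -dumpversion)"
    if version_ge "$cur" "$MIN_GCC"; then
        echo "gcc $cur is new enough"
    else
        echo "gcc $cur is older than $MIN_GCC"
    fi
else
    echo "gcc not found"
fi
```

It may also be worth comparing against the compiler named in /proc/version, which shows the gcc that built the running kernel.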
gcc lives in /usr/bin; list every installed gcc and g++ with:
ls /usr/bin/gcc*
ls /usr/bin/g++*
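These commands list every installed gcc and g++ binary.
Register the versions you found as alternatives for gcc; the final number is the priority: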
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 20 --slave /usr/bin/g++ g++ /usr/bin/g++-7
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 10 --slave /usr/bin/g++ g++ /usr/bin/g++-9
After that, you can switch between gcc and g++ versions with:
sudo update-alternatives --config gcc
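The command lists the registered alternatives and prompts for a selection.
I chose gcc-7 here; entering 0 or 1 both select it. Done! 🙃
Reboot the machine and run nvidia-smi; it now connects to the driver.
That covers the three problems. If this helped you solve yours, a like would be appreciated! 🤞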