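The GPUs in our server were recently upgraded, finally going from 1080Ti to 3090. The catch was that the runtime environment I had set up before broke completely, and none of my old code would run = = So I spent a full week upgrading the whole server, and I am recording the pitfalls I hit along the way here.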
Summary
- Operating system: upgraded from Ubuntu 16.04 to Ubuntu 20.04
- CUDA: 11.5
- Nvidia-driver: 495.46
- cuDNN: 8.3.2
- TensorFlow: 1.15 (NVIDIA build)
- PyTorch: 1.10
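Note: At first I really did not want to upgrade the operating system, because I was afraid of bringing the whole server down and would have to make all kinds of backups. The upgrade became unavoidable because some of my older code has to run on TensorFlow 1.*, while the 3090 requires at least CUDA 11.*, and the official TensorFlow 1.* releases do not support CUDA 11.*. After a long search online I found Nvidia-Tensorflow, a gem that repackages TensorFlow 1.15 with CUDA 11 support, but after installing it I discovered that it in turn needs shared libraries that ship with Ubuntu 18.04 = = So in the end I decided to tear everything down and start over, and went ahead with the operating system upgrade.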
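The version reasoning in the note can be written out as a tiny check. This is only an illustrative sketch: the function names are made up for this post, "11.0" stands in for the "CUDA 11.*" floor mentioned above, and "10.0" is my assumption for the CUDA version the official TensorFlow 1.15 wheels were built against (taken from TensorFlow's tested build configurations, not from this post).

```python
def parse_version(text):
    """Turn a dotted version string such as '11.5' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def build_covers_gpu(build_cuda, gpu_min_cuda):
    """True if a framework build's CUDA version meets the GPU's minimum."""
    return parse_version(build_cuda) >= parse_version(gpu_min_cuda)


# Assumed values, see the paragraph above.
GPU_MIN_CUDA = "11.0"      # floor for the 3090, per the note
OFFICIAL_TF1_CUDA = "10.0" # assumption: official TensorFlow 1.15 wheels
SERVER_CUDA = "11.5"       # the CUDA version installed in this post
```

With these values, `build_covers_gpu(OFFICIAL_TF1_CUDA, GPU_MIN_CUDA)` is false and `build_covers_gpu(SERVER_CUDA, GPU_MIN_CUDA)` is true, which is the whole reason for switching to the NVIDIA build.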
Upgrading the operating system from Ubuntu 16.04 to 20.04
sudo apt-get update
sudo apt-get upgrade
sudo apt dist-upgrade
sudo apt autoremove
sudo reboot
sudo do-release-upgrade -c # check which release is available
sudo do-release-upgrade # upgrade from 16.04 to 18.04
sudo reboot
sudo do-release-upgrade -c # check which release is available
sudo do-release-upgrade # upgrade from 18.04 to 20.04
sudo reboot
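Since the upgrade happens in two hops, it can help to confirm which release the machine is actually on after each reboot. Below is a minimal sketch that reads the standard /etc/os-release file; the helper names are hypothetical, and VERSION_CODENAME may be missing on some systems, in which case `None` is returned for it.

```python
from pathlib import Path


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def current_release(path="/etc/os-release"):
    """Return (VERSION_ID, VERSION_CODENAME) for the running system."""
    fields = parse_os_release(Path(path).read_text())
    return fields.get("VERSION_ID"), fields.get("VERSION_CODENAME")
```

Calling `current_release()` after the first hop should report 18.04, and 20.04 after the second.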
- Possible error: Your python3 install is corrupted. Please fix the '/usr/bin/python3' symlink.
- Fix: reinstall Python and recreate the symlinks
sudo apt install --reinstall python3
sudo apt install --reinstall python
sudo update-alternatives --remove-all python
sudo update-alternatives --remove-all python3
sudo ln -sf /usr/bin/python2.7 /usr/bin/python
sudo ln -sf /usr/bin/python3.5 /usr/bin/python3
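To see whether the symlinks ended up where the upgrader expects, you can inspect them before rerunning do-release-upgrade. This is a rough sketch with a hypothetical helper name; it only reports what a path resolves to and does not change anything.

```python
import os


def describe_interpreter_link(path):
    """Return (is_symlink, resolved_target, target_exists) for a path."""
    is_link = os.path.islink(path)
    # realpath follows the whole symlink chain to the final file.
    target = os.path.realpath(path)
    return is_link, target, os.path.exists(target)
```

For example, `describe_interpreter_link("/usr/bin/python3")` on a repaired 16.04 system would be expected to resolve to /usr/bin/python3.5, matching the `ln -sf` commands above.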
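- Connection failed while downloading packages
- Fix: switch to the Tsinghua (TUNA) mirror by adding the entries for the release you are currently on to /etc/apt/sources.list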
# 16.04
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted universe multiverse
# 18.04
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ bionic-security main restricted universe multiverse
# 20.04
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-updates main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-backports main restricted universe multiverse
deb https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-security main restricted universe multiverse
# deb-src https://mirrors.tuna.tsinghua.edu.cn/ubuntu/ focal-security main restricted universe multiverse
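The three blocks above differ only in the release codename (xenial, bionic, focal), so they can also be generated rather than copied by hand. A small sketch, assuming the same mirror URL and components as listed above; the function name is made up for illustration, and the output still has to be written to /etc/apt/sources.list manually.

```python
MIRROR = "https://mirrors.tuna.tsinghua.edu.cn/ubuntu/"
COMPONENTS = "main restricted universe multiverse"
# Suffixes appended to the codename: base, updates, backports, security.
POCKETS = ("", "-updates", "-backports", "-security")


def tuna_sources(codename):
    """Build the sources.list text for one Ubuntu release codename."""
    lines = []
    for pocket in POCKETS:
        entry = f"{MIRROR} {codename}{pocket} {COMPONENTS}"
        lines.append(f"deb {entry}")
        # Source packages stay commented out, as in the lists above.
        lines.append(f"# deb-src {entry}")
    return "\n".join(lines) + "\n"
```

`print(tuna_sources("bionic"))` reproduces the 18.04 block, which is handy right after the first hop of the upgrade.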
Upgrading nvidia-driver, CUDA, and cuDNN
# Update nvidia-driver (495.46)
sudo /usr/bin/nvidia-uninstall
sudo service lightdm stop
sudo bash NVIDIA-Linux-x86_64-495.46.run
sudo service lightdm start
cat /proc/driver/nvidia/version
# Update CUDA (11.5)
sudo bash cuda_11.5.0_495.29.05_linux.run
# Add the following to ~/.bashrc
export PATH="/usr/local/cuda-11.5/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.5/lib64:$LD_LIBRARY_PATH"
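# Install cuDNN (8.3.2)
sudo tar -xf cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive.tar.xz
cd cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive # the archive unpacks into this directory
sudo cp include/cudnn*.h /usr/local/cuda-11.5/include
sudo cp -P lib/libcudnn* /usr/local/cuda-11.5/lib64 # -P keeps the library symlinks as symlinks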
sudo chmod a+r /usr/local/cuda-11.5/include/cudnn*.h
sudo chmod a+r /usr/local/cuda-11.5/lib64/libcudnn*
cat /usr/local/cuda-11.5/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
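The grep above prints the three version macros from cudnn_version.h. If you would rather get a single version tuple, for example in a setup check script, the same header can be parsed in Python. This is a sketch with a hypothetical function name; it assumes the header defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL as plain integers.

```python
import re


def cudnn_version_from_header(text):
    """Extract (major, minor, patch) from the text of cudnn_version.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines of the form: #define CUDNN_MAJOR 8
        match = re.search(
            r"^\s*#define\s+" + name + r"\s+(\d+)", text, re.MULTILINE
        )
        if match is None:
            raise ValueError(f"{name} not found in header text")
        parts.append(int(match.group(1)))
    return tuple(parts)
```

Reading /usr/local/cuda-11.5/include/cudnn_version.h and passing its contents to this function should give (8, 3, 2) for the install described here.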
Installing TensorFlow and PyTorch
conda create --name cuda11 python=3.6.5
source activate cuda11
pip install --upgrade pip setuptools requests
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple # switch to the Tsinghua mirror
# Install TensorFlow (1.15). Official TensorFlow 1.x does not support CUDA 11, so use the NVIDIA build
pip install nvidia-pyindex
pip install nvidia-tensorflow
# Test in Python
import tensorflow as tf
tf.test.is_gpu_available()
# Install PyTorch (1.10)
pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
# Test in Python
import torch
torch.cuda.is_available()
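The two tests above can be folded into one script that runs inside the conda environment. This is a minimal sketch: the function name and the `checks` parameter are made up for illustration, and it only relies on the same two calls used above (`tf.test.is_gpu_available()` and `torch.cuda.is_available()`). A framework that is not installed is reported as `None` instead of raising.

```python
import importlib

# Module name -> function that asks that module whether it can see a GPU.
DEFAULT_CHECKS = {
    "tensorflow": lambda mod: mod.test.is_gpu_available(),
    "torch": lambda mod: mod.cuda.is_available(),
}


def gpu_visible(checks=None):
    """Return {module_name: True/False/None}; None means not importable."""
    if checks is None:
        checks = DEFAULT_CHECKS
    report = {}
    for name, check in checks.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        report[name] = bool(check(module))
    return report
```

Running `print(gpu_visible())` in the cuda11 environment should show `True` for both frameworks if the driver, CUDA and cuDNN steps all went through.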