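These are my study notes for Mu Li's course Dive into Deep Learning (动手学深度学习), covering the chapters on convolutional neural networks.
From Fully Connected Layers to Convolutions
Summary:
- Translation invariance of images means that we process a local patch in the same way regardless of where it is located.
- Locality means that only a small neighbourhood of pixels is needed to compute the corresponding hidden representation.
- In image processing, convolutional layers usually need far fewer parameters than fully connected layers while still performing well.
- A convolutional neural network (CNN) is a special class of neural network that can contain multiple convolutional layers.
- Multiple input and output channels let the model capture several aspects of the image at each spatial location.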
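To make the parameter-saving claim concrete, the following sketch counts learnable parameters in plain Python. The helper names `dense_params` and `conv_params`, the 28×28 input, and the 5×5 kernel are illustrative assumptions, not values from the course.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def conv_params(c_in, c_out, k_h, k_w):
    """Weights plus biases of a 2D convolutional layer."""
    return c_out * c_in * k_h * k_w + c_out

# Hypothetical setting: a 28x28 single-channel image mapped to a 28x28 hidden map.
fc = dense_params(28 * 28, 28 * 28)
conv = conv_params(c_in=1, c_out=1, k_h=5, k_w=5)
print(fc, conv)  # 615440 26
```

The convolutional count does not depend on the image size at all, which is where the saving comes from.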
Image Convolution
import torch
from torch import nn
from d2l import torch as d2l

def corr2d(X, K):
    """Compute the 2D cross-correlation of input X with kernel K."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

class Conv2D(nn.Module):
    """A 2D convolutional layer built on corr2d with a learnable kernel and bias."""
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

# Edge detection: the kernel [1, -1] responds only where neighbouring columns differ.
X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])
Y = corr2d(X, K)
print(Y)
Padding and Stride
When the input image has shape $n_h \times n_w$ and the convolution kernel has shape $k_h \times k_w$, the output has shape $(n_h - k_h + 1) \times (n_w - k_w + 1)$.
If we pad a total of $p_h$ rows and $p_w$ columns (split evenly between top and bottom, and between left and right), the output shape becomes:
$(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1)$
If we additionally set the vertical stride to $s_h$ and the horizontal stride to $s_w$, the output shape is:
$\lfloor (n_h - k_h + p_h + s_h) / s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w) / s_w \rfloor$
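import torch
from torch import nn

def comp_conv2d(conv2d, x):
    # Add batch and channel dimensions: (h, w) -> (1, 1, h, w)
    x = x.reshape((1, 1) + x.shape)
    y = conv2d(x)
    # Drop the batch and channel dimensions again
    return y.reshape(y.shape[2:])

# padding=(0, 1) pads each side, so in the formula p_h = 0 and p_w = 2
conv2d = nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
X = torch.rand(size=(8, 8))
print(comp_conv2d(conv2d, X).shape)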
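The shape printed by this example can be checked by hand against the stride formula. The helper `conv_output_shape` below is a hypothetical name used only for this sketch; it takes the total padding per dimension, so PyTorch's per-side `padding=(0, 1)` corresponds to `p=(0, 2)`.

```python
def conv_output_shape(n, k, p=(0, 0), s=(1, 1)):
    """Output (height, width) from floor((n - k + p + s) / s) per dimension.

    n: input size, k: kernel size, p: TOTAL padding, s: stride.
    """
    return tuple((n_d - k_d + p_d + s_d) // s_d
                 for n_d, k_d, p_d, s_d in zip(n, k, p, s))

print(conv_output_shape((8, 8), (3, 5), p=(0, 2), s=(3, 4)))  # (2, 2)
print(conv_output_shape((8, 8), (3, 3), p=(2, 2)))            # (8, 8)
```

The second call shows the common case where a 3×3 kernel with one row or column of padding on each side keeps the input size.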
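Summary:
- Padding can increase the height and width of the output; it is often used to give the output the same height and width as the input.
- Stride can reduce the height and width of the output, for example to $\frac{1}{n}$ of the input height and width (where $n$ is an integer greater than 1).
- Padding and stride can be used to adjust the dimensionality of the data effectively.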
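Multiple Input and Output Channels
With multiple input channels, the convolution kernel has the same number of channels as the input. For each channel, the 2D input tensor is cross-correlated with the 2D kernel tensor of the matching channel, and the per-channel results are then summed to give the value at that position of the single output channel, as shown in the figure below:
With multiple output channels, each channel can be viewed as a response to a different feature. Let $c_i$ and $c_o$ denote the number of input and output channels. To produce the $c_o$ output channels, we create one kernel tensor of shape $c_i \times k_h \times k_w$ for each output channel, so the full kernel has shape $c_o \times c_i \times k_h \times k_w$.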
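A special case is the $1 \times 1$ convolutional layer. Because its height and width are both 1, it cannot capture interactions between neighbouring elements along the height and width dimensions; its only computation happens across channels, as shown in the figure below:
Such a layer leaves the height and width unchanged while changing the number of channels. Each output element is a linear combination of the input elements at the same spatial position, so the layer acts like a fully connected layer applied independently at every pixel: the input channels are the input nodes and the kernel entries are the corresponding weights.
For this reason, $1 \times 1$ convolutional layers are commonly used to adjust the number of channels between layers and to control model complexity.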
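The "fully connected layer at every pixel" view can be sketched in plain Python with nested lists. The function name `conv1x1` and the toy values are assumptions for illustration; the bias is left out.

```python
def conv1x1(X, W):
    """Apply a 1x1 convolution without bias.

    X: input as nested lists with shape c_i x h x w.
    W: weights with shape c_o x c_i (the 1x1 spatial dims are dropped).
    """
    c_i, h, w = len(X), len(X[0]), len(X[0][0])
    return [[[sum(W[o][c] * X[c][i][j] for c in range(c_i))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(W))]

# Two input channels of size 2x2, three output channels.
X = [[[1, 2], [3, 4]],
     [[10, 20], [30, 40]]]
W = [[1, 0], [0, 1], [1, 1]]
Y = conv1x1(X, W)
print(Y[2])  # [[11, 22], [33, 44]]
```

Each output pixel mixes only the channel values at the same position, so the height and width stay 2×2 while the channel count goes from 2 to 3.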
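Pooling Layers
Pooling layers mitigate the excessive sensitivity of convolution to pixel position, as in the example below:
The two common variants are max pooling and average pooling.
In PyTorch:
pool2d = nn.MaxPool2d((2, 3), padding=(1, 1), stride=(2, 3))
With multi-channel input, pooling is applied to each channel separately, so the number of output channels equals the number of input channels.
Summary:
- For the elements in a pooling window, max pooling outputs the maximum value in that window and average pooling outputs the mean.
- One of the main benefits of pooling is that it reduces the convolutional layer's excessive sensitivity to position.
- Padding and stride can be specified for pooling layers.
- Max pooling with a stride greater than 1 reduces the spatial dimensions.
- A pooling layer has the same number of output channels as input channels.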
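The two pooling modes can be sketched in a few lines of plain Python. This is a minimal version assuming stride 1 and no padding; the function name `pool2d` and the 3×3 input are illustrative choices.

```python
def pool2d(X, pool_size, mode="max"):
    """Slide a p_h x p_w window over X with stride 1 and no padding."""
    p_h, p_w = pool_size
    out = []
    for i in range(len(X) - p_h + 1):
        row = []
        for j in range(len(X[0]) - p_w + 1):
            window = [X[i + a][j + b] for a in range(p_h) for b in range(p_w)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out

X = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(pool2d(X, (2, 2)))         # [[4, 5], [7, 8]]
print(pool2d(X, (2, 2), "avg"))  # [[2.0, 3.0], [5.0, 6.0]]
```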
Convolutional Neural Networks (LeNet)
import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

class Reshape(torch.nn.Module):
    def forward(self, x):
        return x.view(-1, 1, 28, 28)

net = nn.Sequential(
    Reshape(),
    nn.Conv2d(1, 6, kernel_size=5, padding=2),
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5),
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

def evaluate_accuracy_gpu(net, data_iter, device=None):
    """Compute the accuracy of net on data_iter, on the device the model lives on."""
    if isinstance(net, torch.nn.Module):
        net.eval()  # switch to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    metric = d2l.Accumulator(2)  # correct predictions, total predictions
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]

def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print("training on:", device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)  # summed loss, correct predictions, samples
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

lr, num_epoch = 0.5, 20
train_ch6(net, train_iter, test_iter, num_epoch, lr, d2l.try_gpu())
plt.show()
loss 0.417, train acc 0.847, test acc 0.836
36145.0 examples/sec on cuda:0
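Summary:
- A convolutional neural network is a network that uses convolutional layers.
- In a CNN we combine convolutional layers, nonlinear activation functions, and pooling layers.
- To build a high-performance CNN, we usually arrange the convolutional layers so that the spatial resolution of the representation gradually decreases while the number of channels increases.
- In a traditional CNN, the representation encoded by the convolutional blocks is processed by one or more fully connected layers before the output.
- LeNet was one of the earliest published convolutional neural networks.
Deep Convolutional Neural Networks (AlexNet)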
import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(
    # Large 11x11 window with stride 4 to shrink the 224x224 input quickly
    nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    # 256 channels x 5 x 5 spatial positions = 6400 features
    nn.Linear(6400, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 10)
)

batch_size = 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
lr, num_epochs = 0.01, 10
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
plt.show()
This took a long time to run:
loss 0.328, train acc 0.881, test acc 0.881
666.9 examples/sec on cuda:0
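The input sizes of the first fully connected layers, `16 * 5 * 5` in LeNet and `6400` in AlexNet, follow from tracing the spatial size through the convolution and pooling layers. The sketch below does this in plain Python; `out_size` and `trace` are hypothetical helper names, and the padding argument is per side, matching how the `padding` argument is used in the networks above.

```python
def out_size(n, k, stride=1, pad=0):
    """Output length along one dimension; pad is per side, as in PyTorch."""
    return (n + 2 * pad - k) // stride + 1

def trace(n, layers):
    """Apply (kernel, stride, pad) layers in order and return the final size."""
    for k, stride, pad in layers:
        n = out_size(n, k, stride, pad)
    return n

# (kernel, stride, pad) of every conv and pooling layer, in order.
lenet = [(5, 1, 2), (2, 2, 0), (5, 1, 0), (2, 2, 0)]
alexnet = [(11, 4, 1), (3, 2, 0), (5, 1, 2), (3, 2, 0),
           (3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 2, 0)]

print(16 * trace(28, lenet) ** 2)      # 400
print(256 * trace(224, alexnet) ** 2)  # 6400
```

Both networks end with a 5×5 feature map, so the flattened sizes are 16 × 25 = 400 and 256 × 25 = 6400.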
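Networks Using Blocks (VGG)
VGG builds on the ideas of AlexNet: several convolutional layers followed by one pooling layer form a block. We can choose how many convolutional layers each block contains and how many blocks to stack; after the blocks have extracted features from the image, the result is passed through fully connected layers.
A VGG block consists of:
- several convolutional layers with padding that keeps the resolution unchanged
- a nonlinear activation function after each convolutional layer
- a single pooling layer at the end
The code is as follows: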
import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

def vgg(conv_arch):
    conv_blks = []
    in_channels = 1
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels
    return nn.Sequential(
        *conv_blks,
        nn.Flatten(),
        # Five blocks halve 224 five times: 224 / 2**5 = 7
        nn.Linear(out_channels * 7 * 7, 4096),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(4096, 4096),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(4096, 10)
    )

conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
# Divide the channel counts by 4 so the network trains faster on Fashion-MNIST
ratio = 4
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)

lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
plt.show()
loss 0.170, train acc 0.936, test acc 0.912
378.0 examples/sec on cuda:0
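Summary:
- VGG-11 builds the network from reusable convolutional blocks; different VGG models are defined by the number of convolutional layers and output channels in each block.
- Using blocks makes the network definition very concise and is an effective way to design complex networks.
- Experiments showed that deep and narrow convolutions (several layers of $3 \times 3$) work better than shallower and wider ones (for example fewer layers of $5 \times 5$).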
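One way to see why stacking $3 \times 3$ layers is attractive is to compare receptive field and weight count. The sketch below is only an illustration of that arithmetic, not a reproduction of the experiments; the helper names and the channel width of 64 are assumptions.

```python
def stack_receptive_field(num_layers, k):
    """Receptive field of num_layers stacked k x k convolutions with stride 1."""
    return 1 + num_layers * (k - 1)

def stack_weights(num_layers, k, channels):
    """Weight count (biases ignored) when every layer maps channels -> channels."""
    return num_layers * k * k * channels * channels

c = 64  # hypothetical channel width
print(stack_receptive_field(2, 3), stack_receptive_field(1, 5))  # 5 5
print(stack_weights(2, 3, c), stack_weights(1, 5, c))            # 73728 102400
```

Two $3 \times 3$ layers see the same 5×5 region as one $5 \times 5$ layer, with fewer weights and one extra nonlinearity in between.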
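Network in Network (NiN)
The networks so far share one trait: they all end with fully connected layers that process the feature representation, which makes the parameter count very large. NiN aims to replace the fully connected layers with other modules, namely **$1 \times 1$ convolutional layers**. One NiN block is therefore an ordinary convolutional layer followed by two $1 \times 1$ convolutional layers. After several NiN blocks, the number of channels is brought to the desired number of output classes, and a global average pooling layer then averages each channel over all spatial positions into a single scalar. With out_channels channels we obtain out_channels values, which can be passed through softmax to produce the output.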
import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU()
    )

net = nn.Sequential(
    nin_block(1, 96, kernel_size=11, strides=4, padding=0),
    nn.MaxPool2d(3, stride=2),
    nin_block(96, 256, kernel_size=5, strides=1, padding=2),
    nn.MaxPool2d(3, stride=2),
    nin_block(256, 384, kernel_size=3, strides=1, padding=1),
    nn.MaxPool2d(3, stride=2),
    nn.Dropout(p=0.5),
    # The last block outputs one channel per class (10 classes)
    nin_block(384, 10, kernel_size=3, strides=1, padding=1),
    # Global average pooling: (batch, 10, h, w) -> (batch, 10, 1, 1)
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten()
)

lr, num_epochs, batch_size = 0.1, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=224)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
plt.show()
loss 0.383, train acc 0.857, test acc 0.847
513.3 examples/sec on cuda:0
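Summary:
- NiN uses blocks made of one convolutional layer followed by several $1 \times 1$ convolutional layers. Such blocks can be used inside a CNN to allow more per-pixel nonlinearity.
- NiN removes the fully connected layers, which are prone to overfitting, and replaces them with global average pooling applied after the channel count has been reduced to the required number of outputs.
- Removing the fully connected layers reduces overfitting and significantly reduces the number of parameters.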
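The final step of NiN, global average pooling followed by softmax, can be sketched in plain Python. The function names and the tiny 3-channel feature map are assumptions made for illustration.

```python
import math

def global_avg_pool(X):
    """Average each channel of a c x h x w nested list down to one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in X]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical feature map with 3 channels (one per class) of size 2x2.
X = [[[1, 1], [1, 1]],
     [[0, 2], [4, 2]],
     [[0, 0], [0, 0]]]
logits = global_avg_pool(X)
print(logits)  # [1.0, 2.0, 0.0]
probs = softmax(logits)
print(probs.index(max(probs)))  # 1
```

No weights are involved in this step, which is why replacing the fully connected head removes so many parameters.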
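Networks with Parallel Connections (GoogLeNet)
In the networks above, each convolutional layer may use different hyperparameters, and because deep networks are so hard to interpret, it is difficult to say which kernel size is the one we actually need. GoogLeNet therefore introduces the Inception block, which runs several convolutional layers with different common hyperparameters in parallel, in the hope that combining multiple ways of extracting features yields the best result, as shown in the figure below:
Its concrete structure is: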
import torch
from matplotlib import pyplot as plt
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

class Inception(nn.Module):
    # c1 to c4 are the output channel counts of the four parallel paths
    def __init__(self, in_channels, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Path 1: a single 1x1 convolution
        self.p1_1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        # Path 2: 1x1 convolution followed by 3x3 convolution
        self.p2_1 = nn.Conv2d(in_channels, c2[0], kernel_size=1)
        self.p2_2 = nn.Conv2d(c2[0], c2[1], kernel_size=3, padding=1)
        # Path 3: 1x1 convolution followed by 5x5 convolution
        self.p3_1 = nn.Conv2d(in_channels, c3[0], kernel_size=1)
        self.p3_2 = nn.Conv2d(c3[0], c3[1], kernel_size=5, padding=2)
        # Path 4: 3x3 max pooling followed by 1x1 convolution
        self.p4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.p4_2 = nn.Conv2d(in_channels, c4, kernel_size=1)

    def forward(self, x):
        p1 = F.relu(self.p1_1(x))
        p2 = F.relu(self.p2_2(F.relu(self.p2_1(x))))
        p3 = F.relu(self.p3_2(F.relu(self.p3_1(x))))
        p4 = F.relu(self.p4_2(self.p4_1(x)))
        # Concatenate the four paths along the channel dimension
        return torch.cat((p1, p2, p3, p4), dim=1)

b1 = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b2 = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 192, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b3 = nn.Sequential(
    Inception(192, 64, (96, 128), (16, 32), 32),
    Inception(256, 128, (128, 192), (32, 96), 64),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b4 = nn.Sequential(
    Inception(480, 192, (96, 208), (16, 48), 64),
    Inception(512, 160, (112, 224), (24, 64), 64),
    Inception(512, 128, (128, 256), (24, 64), 64),
    Inception(512, 112, (144, 288), (32, 64), 64),
    Inception(528, 256, (160, 320), (32, 128), 128),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
)
b5 = nn.Sequential(
    Inception(832, 256, (160, 320), (32, 128), 128),
    Inception(832, 384, (192, 384), (48, 128), 128),
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten()
)
net = nn.Sequential(
    b1, b2, b3, b4, b5, nn.Linear(1024, 10)
)

lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
plt.show()
"""
x = torch.rand(size=(1,1,96,96))
for layer in net:
    x = layer(x)
    print(layer.__class__.__name__, 'output shape \t', x.shape)
"""
loss 0.284, train acc 0.891, test acc 0.884
731.9 examples/sec on cuda:0
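Summary:
- An Inception block is a subnetwork with four paths. It extracts information in parallel through convolutional layers with different window shapes and a max pooling layer, and uses $1 \times 1$ convolutions to reduce the channel dimension at each pixel, which lowers model complexity.
- GoogLeNet chains several carefully designed Inception blocks with other layers (convolutional and fully connected). The channel allocation ratios inside the Inception blocks were obtained through extensive experiments on ImageNet.
- GoogLeNet and its successors were for a time among the most effective models on ImageNet: they delivered similar test accuracy at lower computational cost.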
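Because the four paths are concatenated along the channel dimension, each Inception block's `in_channels` must equal the sum of the previous block's path outputs. The sketch below checks the nine blocks used in this chapter's GoogLeNet in plain Python; `inception_out` is a hypothetical helper name.

```python
def inception_out(c1, c2, c3, c4):
    """Output channels of an Inception block: the four paths are concatenated."""
    return c1 + c2[1] + c3[1] + c4

# (in_channels, c1, c2, c3, c4) for the nine Inception blocks in b3, b4, b5.
blocks = [
    (192, 64, (96, 128), (16, 32), 32),
    (256, 128, (128, 192), (32, 96), 64),
    (480, 192, (96, 208), (16, 48), 64),
    (512, 160, (112, 224), (24, 64), 64),
    (512, 128, (128, 256), (24, 64), 64),
    (512, 112, (144, 288), (32, 64), 64),
    (528, 256, (160, 320), (32, 128), 128),
    (832, 256, (160, 320), (32, 128), 128),
    (832, 384, (192, 384), (48, 128), 128),
]

channels = blocks[0][0]
for in_channels, *paths in blocks:
    assert in_channels == channels, "block input must match previous output"
    channels = inception_out(*paths)
print(channels)  # 1024
```

The final value of 1024 matches the input size of the last `nn.Linear(1024, 10)` layer.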
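Batch Normalization
During training, the layers near the output typically have larger gradients, whereas the gradients of the layers near the input become small after being multiplied through many layers during backpropagation. With a fixed learning rate, the early layers therefore update slowly and the later layers update quickly; when the later layers have nearly converged, changes in the early layers force them to be relearned.
The idea of batch normalization is to apply it after each convolutional or linear layer and normalize that layer's output towards some distribution (each layer learns its own). Constraining the outputs to a suitable distribution makes convergence faster.
Suppose the current minibatch $B$ contains the samples $\pmb{x} = (x_1, x_2, \ldots, x_n)$. Then:
$$\hat{\mu}_B = \frac{1}{\vert B \vert} \sum_{i \in B} x_i$$
$$\hat{\sigma}_B^2 = \frac{1}{\vert B \vert} \sum_{i \in B} (x_i - \hat{\mu}_B)^2 + \epsilon \quad (\epsilon \text{ prevents the variance from being } 0)$$
$$\mathrm{BN}(x_i) = \gamma \frac{x_i - \hat{\mu}_B}{\hat{\sigma}_B} + \beta$$
Here $\gamma$ and $\beta$ are two learnable parameters that can be viewed as the scale (standard deviation) and the mean of the distribution we normalize to.
Research suggests that batch normalization may work by injecting noise into each minibatch to control model complexity: because minibatches are sampled at random, their means and variances differ, which amounts to applying a random shift $\hat{\mu}_B$ and a random scale $\hat{\sigma}_B$ to each minibatch. Note that it therefore does not need to be combined with Dropout.
It can be applied to the output of a fully connected or convolutional layer, before the activation function, or to the input of such a layer:
- for fully connected layers, it acts on the feature dimension
- for convolutional layers, it acts on the channel dimension
When batch normalization is used during training, every batch normalization layer must also keep track of the mean and variance over the whole dataset (in practice, running estimates), so that samples can be normalized in the same way at prediction time.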
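Before the full implementation, the three formulas can be sketched for a minibatch of scalars in plain Python. The names `batch_norm_1d` and `update_moving`, the default `eps`, and the momentum of 0.9 are assumptions for this illustration.

```python
import math

def batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a minibatch of scalars, then scale by gamma and shift by beta."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    ys = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]
    return ys, mean, var

def update_moving(moving, batch_value, momentum=0.9):
    """Exponential moving average kept for the prediction-time statistics."""
    return momentum * moving + (1.0 - momentum) * batch_value

ys, mean, var = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
print(mean, var)                          # 2.5 1.25
print(abs(sum(ys) / len(ys)) < 1e-9)      # True: the output mean is about 0
print(round(update_moving(0.0, mean), 6)) # 0.25
```

At prediction time the moving statistics replace the per-batch mean and variance, which is why training and evaluation take different code paths.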
import torch
from matplotlib import pyplot as plt
from torch import nn
from d2l import torch as d2l

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    if not torch.is_grad_enabled():
        # Prediction mode: use the moving averages of mean and variance
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # Fully connected layer: statistics over the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # Convolutional layer: statistics per channel (dim 1)
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # Training mode: normalize with the current minibatch statistics
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the moving averages
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # scale and shift
    return Y, moving_mean.data, moving_var.data

class BatchNorm(nn.Module):
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # Learnable scale and shift
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # Non-learnable running statistics
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y

net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5),
                    BatchNorm(6, num_dims=4),
                    nn.Sigmoid(),
                    nn.MaxPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5),
                    BatchNorm(16, num_dims=4),
                    nn.Sigmoid(),
                    nn.MaxPool2d(kernel_size=2, stride=2),
                    nn.Flatten(),
                    nn.Linear(16 * 4 * 4, 120),
                    BatchNorm(120, num_dims=2),
                    nn.Sigmoid(),
                    nn.Linear(120, 84),
                    BatchNorm(84, num_dims=2),
                    nn.Sigmoid(),
                    nn.Linear(84, 10))

lr, num_epochs, batch_size = 1.0, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
plt.show()
loss 0.251, train acc 0.908, test acc 0.883
17375.8 examples/sec on cuda:0
The nn module also provides a concise built-in implementation (BatchNorm2d after convolutional layers, BatchNorm1d after fully connected layers):
net = nn.Sequential(nn.Conv2d(1, 6, kernel_size=5),
                    nn.BatchNorm2d(6),
                    nn.Sigmoid(),
                    nn.MaxPool2d(kernel_size=2, stride=2),
                    nn.Conv2d(6, 16, kernel_size=5),
                    nn.BatchNorm2d(16),
                    nn.Sigmoid(),
                    nn.MaxPool2d(kernel_size=2, stride=2),
                    nn.Flatten(),
                    nn.Linear(16 * 4 * 4, 120),
                    nn.BatchNorm1d(120),  # 2D input (batch, features) needs BatchNorm1d
                    nn.Sigmoid(),
                    nn.Linear(120, 84),
                    nn.BatchNorm1d(84),
                    nn.Sigmoid(),
                    nn.Linear(84, 10))
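Summary:
- During training, batch normalization uses the mean and standard deviation of each minibatch to keep adjusting the intermediate outputs of the network, making the intermediate outputs of each layer more stable.
- Batch normalization is applied slightly differently to fully connected and convolutional layers; pay attention to the dimension it acts on.
- Like Dropout, batch normalization computes differently in training mode and prediction mode.
- Batch normalization has many beneficial side effects, chiefly regularization.
Residual Networks (ResNet)
The question to discuss is: does adding more layers always improve accuracy further? Not necessarily. A deeper network is only guaranteed to be at least as good as the shallower one if the added layers can easily represent the identity mapping, so that the new model contains the old one.
ResNet is built on exactly this idea: each block adds its input to its output, so the block only has to learn the residual. The most concrete form is:
Adding the input of a block to its output requires the two to have the same shape. If the block changes the shape internally, the input must be transformed as well (with a $1 \times 1$ convolution) before the addition:
In general, a residual block that halves the height and width (stride 2) is followed by several residual blocks that keep them unchanged, which reduces the computation needed for later feature extraction:
The overall architecture is then:
The code is therefore: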
import torch
from matplotlib import pyplot as plt
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

class Residual(nn.Module):
    def __init__(self, input_channels, num_channels,
                 use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        if use_1x1conv:
            # 1x1 convolution so the shortcut matches the main path's shape
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X  # residual connection
        return F.relu(Y)

b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

def resnet_block(input_channels, num_channels, num_residuals, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            # First block of a stage halves height/width and changes the channel count
            blk.append(Residual(input_channels, num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels, num_channels))
    return blk

b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))
net = nn.Sequential(
    b1, b2, b3, b4, b5,
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 10)
)
"""
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)
"""
lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
plt.show()
loss 0.014, train acc 0.996, test acc 0.914
883.9 examples/sec on cuda:0
Mu Li later added a section on how gradients are computed in ResNet. In summary:
Suppose $y = f(x)$, where $f$ has parameters $w$. The update is
$$w = w - \lambda \frac{\partial y}{\partial w}$$
Now suppose we append another module, $y' = g(y) = g(f(x))$. The derivative of the output with respect to the parameters becomes
$$\frac{\partial y'}{\partial w} = \frac{\partial g(y)}{\partial y} \frac{\partial y}{\partial w}$$
If $g$ is a layer with strong fitting capacity (for example a fully connected layer), its output will be closer to the true output, and $\frac{\partial g(y)}{\partial y}$ will be small. This makes $\frac{\partial y'}{\partial w}$ small, so the layer $f(x)$ updates very slowly. The core problem is the multiplication: one small factor in the product is enough to cause vanishing gradients.
ResNet instead uses a residual connection, $y' = f(x) + g(f(x))$, so that
$$\frac{\partial y'}{\partial w} = \frac{\partial y}{\partial w} + \frac{\partial g(y)}{\partial y} \frac{\partial y}{\partial w}$$
Even if the second term is small, the first term still provides a sizeable gradient. This alleviates the vanishing gradient problem, so the layers close to the data can also be updated.
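The effect of the extra additive term can be checked numerically with scalar stand-ins for the two modules. This is a toy sketch under stated assumptions: $f(x) = w x$, $g(y) = a y$ with a small slope $a$, and a finite-difference helper named `grad`; none of these come from the lecture.

```python
def grad(fn, w, h=1e-6):
    """Central finite-difference estimate of d fn / d w."""
    return (fn(w + h) - fn(w - h)) / (2 * h)

x, a = 2.0, 0.01  # hypothetical input and a module g whose derivative is small

f = lambda w: w * x                  # y = f(x) with parameter w
g = lambda y: a * y                  # appended module with dg/dy = a
plain = lambda w: g(f(w))            # y' = g(f(x))
residual = lambda w: f(w) + g(f(w))  # y' = f(x) + g(f(x))

print(round(grad(plain, 1.0), 4))     # 0.02
print(round(grad(residual, 1.0), 4))  # 2.02
```

Without the shortcut the gradient reaching `w` is scaled down by `a`; with it, the gradient of `f` passes through unchanged and the small term is only added on top.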
Image Classification Competition
I first tried the ResNet11 model that Mu Li covered in class, which reached a little over 0.8 accuracy. The code is as follows:
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import os
from d2l import torch as d2l
import matplotlib.pyplot as plt
from LeavesDataset import LeavesDataset
The labels need to be processed first: each string label is converted to a class index, and a mapping in both directions is kept for later use:
label_dataorgin = pd.read_csv("dataset/classify-leaves/train.csv")
leaves_labels = sorted(list(set(label_dataorgin['label'])))
num_class = len(leaves_labels)
class_to_num = dict(zip(leaves_labels, range(num_class)))
num_to_class = {i: j for j, i in class_to_num.items()}
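The mapping logic can be tried on a toy label list without the dataset. The label names below are made up and only stand in for the `label` column of train.csv.

```python
# Hypothetical label column standing in for label_dataorgin['label'].
labels = ["maple", "oak", "maple", "birch", "oak"]

leaves_labels = sorted(set(labels))
class_to_num = dict(zip(leaves_labels, range(len(leaves_labels))))
num_to_class = {num: name for name, num in class_to_num.items()}

print(class_to_num)  # {'birch': 0, 'maple': 1, 'oak': 2}
encoded = [class_to_num[name] for name in labels]
print(encoded)       # [1, 2, 1, 0, 2]
print([num_to_class[i] for i in encoded] == labels)  # True
```

Sorting the unique labels first keeps the class indices stable between runs, which matters when a saved model is reloaded for prediction.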
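Next comes the dataset class. I ran into a problem here: if the dataset class is defined in the same file as the rest of the code, the later call to the d2l training function fails with an error saying the definition of the dataset class cannot be found. This is probably because the DataLoader worker processes (num_workers > 0) have to import the class themselves. The class therefore has to be defined in a separate file and imported; I put it in LeavesDataset.py:
import os
import numpy as np
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

# class_to_num is used in __getitem__ but is defined in the main script, so it is
# rebuilt here (assumption: same train.csv path as in the main script).
_labels = sorted(set(pd.read_csv("dataset/classify-leaves/train.csv")['label']))
class_to_num = dict(zip(_labels, range(len(_labels))))

class LeavesDataset(Dataset):
    def __init__(self, csv_path, file_path, mode='train', valid_ratio=0.2,
                 resize_height=256, resize_width=256):
        self.resize_height = resize_height
        self.resize_width = resize_width
        self.file_path = file_path
        self.mode = mode
        # header=None keeps the CSV header as row 0, so data rows start at index 1
        self.data_csv = pd.read_csv(csv_path, header=None)
        self.dataLength = len(self.data_csv.index) - 1
        self.trainLength = int(self.dataLength * (1 - valid_ratio))
        if mode == 'train':
            self.train_images = np.asarray(self.data_csv.iloc[1:self.trainLength, 0])
            self.train_labels = np.asarray(self.data_csv.iloc[1:self.trainLength, 1])
            self.image_arr = self.train_images
            self.label_arr = self.train_labels  # fixed: was assigned image_arr
        elif mode == 'valid':
            self.valid_images = np.asarray(self.data_csv.iloc[self.trainLength:, 0])
            self.valid_labels = np.asarray(self.data_csv.iloc[self.trainLength:, 1])
            self.image_arr = self.valid_images
            self.label_arr = self.valid_labels
        elif mode == 'test':
            self.test_images = np.asarray(self.data_csv.iloc[1:, 0])
            self.image_arr = self.test_images
        self.realLen_now = len(self.image_arr)
        print("Finished loading data in {} mode: {} samples".format(mode, self.realLen_now))

    def __getitem__(self, index):
        image_name = self.image_arr[index]
        img = Image.open(os.path.join(self.file_path, image_name))
        transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor()
        ])
        img = transform(img)
        if self.mode == 'test':
            return img
        else:
            label = self.label_arr[index]
            number_label = class_to_num[label]
            return img, number_label

    def __len__(self):
        return self.realLen_now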
The next step is to load each dataset:
train_path = "dataset/classify-leaves/train.csv"
test_path = "dataset/classify-leaves/test.csv"
img_path = "dataset/classify-leaves/"
train_dataset = LeavesDataset(train_path, img_path, mode='train')
valid_dataset = LeavesDataset(train_path, img_path, mode='valid')
test_dataset = LeavesDataset(test_path, img_path, mode='test')
batch_size = 64
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False, num_workers=5)
valid_loader = DataLoader(dataset=valid_dataset, batch_size=batch_size, shuffle=False, num_workers=5)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False, num_workers=5)
With the data in place, the next step is to define the model. I started with ResNet11:
# The leaf images are RGB, so the first convolution takes 3 input channels
b1 = nn.Sequential(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

def resnet_block(input_channels, num_channels, num_residuals, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(d2l.Residual(input_channels, num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(d2l.Residual(num_channels, num_channels))
    return blk

b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))
net = nn.Sequential(
    b1, b2, b3, b4, b5,
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(512, 176)  # 176 leaf classes
)
Because I wanted to save the model whenever it reached the required accuracy, I modified the training function:
def train_ch6_save(net, train_iter, test_iter, num_epochs, lr, device, best_acc):
    """Train a model with a GPU (defined in Chapter 6).
    Defined in :numref:`sec_lenet`"""
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')
    # Save the weights only if the final validation accuracy beats the threshold
    if test_acc > best_acc:
        print("Accuracy is high enough, saving the model!")
        torch.save(net.state_dict(), "Now_Best_Module.pth")
    else:
        print("Accuracy is too low, not saving the model")

lr, num_epochs, best_acc = 0.05, 25, 0.8
train_ch6_save(net, train_loader, valid_loader, num_epochs, lr, device=d2l.try_gpu(), best_acc=best_acc)
plt.show()
The result was:
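Next I wanted to make the ResNet deeper to increase model capacity. A ResNet50 implementation I found online turned out to be too large: after loading the model and then the data, GPU memory ran out even with a small batch_size. So I could only scale the model down a bit:
b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 256, 2))
b4 = nn.Sequential(*resnet_block(256, 512, 2))
b5 = nn.Sequential(*resnet_block(512, 2048, 3))
net = nn.Sequential(
    b1, b2, b3, b4, b5,
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(2048, 176)
)
After five hours of training it had overfitted:
loss 0.014, train acc 0.996, test acc 0.764
31.6 examples/sec on cuda:0
In the end I spent a whole day trying several models, and none of them beat the original ResNet11, so I decided to stick with it.
The complete code is therefore: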
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import os
from d2l import torch as d2l
import matplotlib.pyplot as plt
from tqdm import tqdm
from LeavesDataset import LeavesDataset

def resnet_block(input_channels, num_channels, num_residuals, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(d2l.Residual(input_channels, num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(d2l.Residual(num_channels, num_channels))
    return blk

def train_ch6_save(net, train_iter, test_iter, num_epochs, lr, device, best_acc):
    """Train a model with a GPU (defined in Chapter 6).
    Modified from the course's training function so that the model is saved after training.
    Defined in :numref:`sec_lenet`"""
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')
    if test_acc > best_acc:
        print("Accuracy is high enough, saving the model!")
        torch.save(net.state_dict(), "Now_Best_Module.pth")
    else:
        print("Accuracy is too low, not saving the model")

if __name__ == "__main__":
    # Build the label <-> index mappings
    label_dataorgin = pd.read_csv("dataset/classify-leaves/train.csv")
    leaves_labels = sorted(list(set(label_dataorgin['label'])))
    num_class = len(leaves_labels)
    class_to_num = dict(zip(leaves_labels, range(num_class)))
    num_to_class = {i: j for j, i in class_to_num.items()}

    train_path = "dataset/classify-leaves/train.csv"
    test_path = "dataset/classify-leaves/test.csv"
    img_path = "dataset/classify-leaves/"
    submission_path = "dataset/classify-leaves/submission.csv"
    train_dataset = LeavesDataset(train_path, img_path, mode='train')
    valid_dataset = LeavesDataset(train_path, img_path, mode='valid')
    test_dataset = LeavesDataset(test_path, img_path, mode='test')
    batch_size = 64
    train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=False, num_workers=5)
    valid_loader = DataLoader(dataset=valid_dataset, batch_size=batch_size, shuffle=False, num_workers=5)
    test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False, num_workers=5)

    b1 = nn.Sequential(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
                       nn.BatchNorm2d(64), nn.ReLU(),
                       nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
    b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
    b3 = nn.Sequential(*resnet_block(64, 128, 2))
    b4 = nn.Sequential(*resnet_block(128, 256, 2))
    b5 = nn.Sequential(*resnet_block(256, 512, 2))
    net = nn.Sequential(
        b1, b2, b3, b4, b5,
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(512, 176)
    )
    lr, num_epochs, best_acc = 0.02, 15, 0.85
    device = d2l.try_gpu()
    train_ch6_save(net, train_loader, valid_loader, num_epochs, lr, device=device, best_acc=best_acc)
    plt.show()

    # Assumes the checkpoint was written above, i.e. validation accuracy exceeded best_acc
    net.load_state_dict(torch.load("Now_Best_Module.pth"))
    net.to(device)
    net.eval()
    predictions = []
    for i, data in enumerate(test_loader):
        imgs = data.to(device)
        with torch.no_grad():
            logits = net(imgs)
        predictions.extend(logits.argmax(dim=-1).cpu().numpy().tolist())
    # Map predicted indices back to label strings
    preds = []
    for i in predictions:
        preds.append(num_to_class[i])
    test_csv = pd.read_csv(test_path)
    test_csv['label'] = pd.Series(preds)
    submission = pd.concat([test_csv['image'], test_csv['label']], axis=1)
    submission.to_csv(submission_path, index=False)
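The submission score was:
The result is not great, but I am still very happy! This is the first time I have completed a project from start to finish, and I learned a great deal. Only by building something from scratch yourself do you really find out where your gaps are, and that is how you improve.
Keep going!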