Implementing a two-layer neural network with numpy, PyTorch autograd, and torch.nn
We move step by step from manual gradients, to automatic differentiation, to a full model.
1 A two-layer network in numpy
A fully connected ReLU network with one hidden layer, no bias, and an L2 loss (h is the hidden layer, ReLU is the activation):
- $h = W_1 X + b_1$
- $h_{relu} = \max(0, h)$
- $\hat{y} = W_2 h_{relu} + b_2$
In the implementation below, $b_1$ and $b_2$ are both 0, i.e. there is no bias term.
This implementation uses numpy alone to compute the forward pass, the loss, and the backward pass.
A numpy ndarray is a plain n-dimensional array. It knows nothing about deep learning, gradients, or computation graphs; it is simply a data structure for numerical computation.
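For reference, the backward pass in the code below is just the chain rule applied to the L2 loss $L = \sum (\hat{y} - y)^2$, using the row-batch layout of the code ($h = X W_1$, $h_{relu} = \max(0, h)$, $\hat{y} = h_{relu} W_2$):

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y), \qquad \frac{\partial L}{\partial W_2} = h_{relu}^{\top}\,\frac{\partial L}{\partial \hat{y}}, \qquad \frac{\partial L}{\partial h_{relu}} = \frac{\partial L}{\partial \hat{y}}\,W_2^{\top},$$

$$\frac{\partial L}{\partial h} = \frac{\partial L}{\partial h_{relu}} \odot \mathbf{1}[h > 0], \qquad \frac{\partial L}{\partial W_1} = X^{\top}\,\frac{\partial L}{\partial h}.$$

Each line of the backward pass below computes exactly one of these terms.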
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10

# randomly generate some training data
X = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# randomly initialize the weights
W1 = np.random.randn(D_in, H)
W2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # forward pass
    h = X.dot(W1)
    h_relu = np.maximum(h, 0)
    y_hat = h_relu.dot(W2)

    # compute loss
    loss = np.square(y_hat - y).sum()
    print(t, loss)

    # backward pass
    grad_y_hat = 2.0 * (y_hat - y)
    grad_W2 = h_relu.T.dot(grad_y_hat)
    grad_h_relu = grad_y_hat.dot(W2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_W1 = X.T.dot(grad_h)

    # update the weights W1 and W2
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
0 36455139.29176882
1 35607818.495988876
2 36510242.60519045
3 32972837.109358862
4 23623067.52618093
5 13537226.736260608
6 6806959.784455631
7 3501526.30896816
8 2054356.1020693523
9 1400230.6793163505
...
490 3.4950278045838633e-06
491 3.3498523609301454e-06
492 3.210762995939165e-06
493 3.0774805749939447e-06
494 2.9500114328045522e-06
495 2.827652258736098e-06
496 2.710379907890261e-06
497 2.5980242077038875e-06
498 2.490359305069476e-06
499 2.387185101594446e-06
The loss keeps getting smaller. Now let's see how close the predictions are to the true values:
y_hat - y
array([[ 9.16825615e-06, -1.53964987e-05, 6.58365129e-06,
-3.08909604e-05, 1.05735798e-05, 1.73376919e-05,
2.63084233e-06, -1.11662576e-05, 1.06904464e-05,
-1.71528894e-05],
...
[-5.79062537e-06, -1.74789200e-05, 5.27619647e-06,
-7.82154474e-06, 3.39896752e-06, 1.08366770e-05,
8.28712496e-06, -8.88009103e-06, 5.78585909e-06,
-1.14913078e-05]])
The differences are tiny, so this implementation works.
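If you prefer a single number to eyeballing the array, one quick check (not in the original code) is the largest absolute difference:

print(np.abs(y_hat - y).max())   # on the order of 1e-5 for the run above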
2 A two-layer network with PyTorch autograd
2.1 Manual gradients
Compared with the code in section 1, the changes are:
- import numpy as np becomes import torch
- np.random.randn becomes torch.randn
- dot becomes mm
- h_relu is np.maximum(h, 0) in numpy and h.clamp(min=0) in torch, which clips the input from below at 0
- the loss uses np.square in numpy and .pow(2) in torch, and .item() is needed to convert the tensor to a Python number
- transpose is .T in numpy and .t() in torch
- copy() becomes clone() (see the small comparison sketch after this list)
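As a quick standalone illustration of these correspondences on toy tensors (a minimal sketch, independent of the training code below):

import numpy as np
import torch

a_np = np.array([[-1.0, 2.0], [3.0, -4.0]])
a_t  = torch.tensor([[-1.0, 2.0], [3.0, -4.0]])

print(np.maximum(a_np, 0))          # ReLU in numpy
print(a_t.clamp(min=0))             # the torch equivalent

print(a_np.T.dot(a_np))             # transpose + matrix multiply in numpy
print(a_t.t().mm(a_t))              # .t() + .mm() in torch

print(np.square(a_np).sum())        # scalar loss in numpy
print(a_t.pow(2).sum().item())      # .item() turns a 0-dim tensor into a Python number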
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# randomly generate some training data
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# randomly initialize the weights
W1 = torch.randn(D_in, H)
W2 = torch.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # forward pass
    h = X.mm(W1)
    h_relu = h.clamp(min=0)
    y_hat = h_relu.mm(W2)

    # compute loss
    loss = (y_hat - y).pow(2).sum().item()
    print(t, loss)

    # backward pass
    grad_y_hat = 2.0 * (y_hat - y)
    grad_W2 = h_relu.t().mm(grad_y_hat)
    grad_h_relu = grad_y_hat.mm(W2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_W1 = X.t().mm(grad_h)

    # update the weights W1 and W2
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
0 28398944.0
1 27809498.0
2 32215128.0
3 37019776.0
4 36226528.0
5 27777396.0
6 16156263.0
7 7798599.0
8 3615862.0
9 1881907.25
...
490 5.404536932474002e-05
491 5.3628453315468505e-05
492 5.282810889184475e-05
493 5.204257831792347e-05
494 5.149881326360628e-05
495 5.084666554466821e-05
496 4.9979411414824426e-05
497 4.938142956234515e-05
498 4.8661189794074744e-05
499 4.8014146159403026e-05
2.2 Automatic gradients (autograd)
Compared with the code in 2.1, the changes are:
- declare W1 and W2 as differentiable with requires_grad=True; if this flag is not passed it defaults to False (as for X and y), which saves memory
- in the forward pass, compute y_hat in a single line for convenience
- the loss should now stay a tensor, so the earlier .item() is removed here and moved to the line that prints the loss
- all the manual gradient computations are replaced by a single loss.backward()
- the hand-computed grad_W1 becomes W1.grad (and likewise for W2)
- after the parameter update, zero W1's gradient with W1.grad.zero_() (and likewise for W2)
- wrap the update in with torch.no_grad(): so that the updates of W1 and W2 are not recorded in the computation graph and do not waste memory (see the small standalone example after this list)
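The mechanics in this list can be seen on a tiny standalone example (a minimal sketch, not part of the training code):

import torch

w = torch.randn(3, requires_grad=True)   # leaf tensor tracked by autograd
x = torch.randn(3)                       # no requires_grad: treated as constant data

loss = (w * x).sum().pow(2)              # builds a computation graph
loss.backward()                          # fills w.grad with d(loss)/dw
print(w.grad)

with torch.no_grad():                    # the update is not recorded in the graph
    w -= 0.1 * w.grad
w.grad.zero_()                           # clear the gradient before the next backward()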
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

# randomly generate some training data
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

W1 = torch.randn(D_in, H, requires_grad=True)
W2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # forward pass
    y_hat = X.mm(W1).clamp(min=0).mm(W2)

    # compute loss
    loss = (y_hat - y).pow(2).sum()
    print(t, loss.item())

    # backward pass
    loss.backward()

    # update the weights W1 and W2
    with torch.no_grad():
        W1 -= learning_rate * W1.grad
        W2 -= learning_rate * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
0 28114322.0
1 22391836.0
2 19137772.0
3 16153970.0
4 12953562.0
5 9725695.0
6 6933768.5
7 4784875.0
8 3286503.0
9 2288213.25
...
490 3.3917171094799414e-05
491 3.35296499542892e-05
492 3.318845119792968e-05
493 3.276047937106341e-05
494 3.244510298827663e-05
495 3.209296482964419e-05
496 3.168126931996085e-05
497 3.1402159947901964e-05
498 3.097686203545891e-05
499 3.074205596931279e-05
3 A two-layer network with torch.nn
3.1 Default weight initialization
Compared with the code in 2.2, the changes are:
- add import torch.nn as nn alongside import torch (nn stands for neural network)
- W1 and W2 no longer need to be defined by hand; instead define model = torch.nn.Sequential(...), which chains a sequence of modules together
- computing y_hat afterwards is just model(X)
- the loss no longer has to be written out either: use loss_fn = nn.MSELoss(reduction='sum') and call that function below (see the small check after this list)
- the parameters to update can be fetched directly from the model with a for loop over model.parameters()
- zeroing the gradients works the same way: model.zero_grad()
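To confirm that nn.MSELoss(reduction='sum') computes the same quantity as the hand-written loss from section 2, here is a small check on random tensors (a sketch, not part of the training code):

import torch
import torch.nn as nn

y_hat = torch.randn(4, 3)
y = torch.randn(4, 3)

loss_fn = nn.MSELoss(reduction='sum')
print(loss_fn(y_hat, y).item())          # sum of squared errors
print((y_hat - y).pow(2).sum().item())   # same value, written by hand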
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# randomly generate some training data
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(500):
    # forward pass
    y_hat = model(X)

    # compute loss
    loss = loss_fn(y_hat, y)
    print(t, loss.item())

    # backward pass
    loss.backward()

    # update all model parameters
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
    model.zero_grad()
0 686.78662109375
1 686.2665405273438
2 685.7469482421875
3 685.2279663085938
4 684.7101440429688
5 684.1931762695312
6 683.6768188476562
7 683.1609497070312
8 682.6456909179688
9 682.130859375
...
490 496.4220275878906
491 496.12548828125
492 495.82940673828125
493 495.533203125
494 495.2373046875
495 494.94171142578125
496 494.6462707519531
497 494.35101318359375
498 494.0567321777344
499 493.7628479003906
model
Sequential(
(0): Linear(in_features=1000, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=10, bias=True)
)
model[0]
Linear(in_features=1000, out_features=100, bias=True)
model[0].weight
Parameter containing:
tensor([[-0.0147, -0.0315, 0.0085, ..., 0.0039, 0.0254, -0.0308],
[ 0.0046, 0.0125, 0.0128, ..., -0.0241, -0.0206, -0.0127],
[-0.0162, 0.0051, 0.0152, ..., -0.0280, -0.0133, 0.0079],
...,
[ 0.0239, 0.0237, -0.0025, ..., 0.0290, -0.0192, 0.0187],
[-0.0249, 0.0287, 0.0060, ..., -0.0198, 0.0007, 0.0209],
[ 0.0238, -0.0157, -0.0156, ..., 0.0105, 0.0057, -0.0189]],
requires_grad=True)
3.2 Normal weight initialization
The code in 3.1 learns very slowly, probably because the default initialization is a poor fit here, so we use torch.nn.init.normal_ to re-initialize the weights of layer 0 and layer 2 (the Linear layers) from a standard normal distribution:
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# randomly generate some training data
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# re-initialize the Linear weights from a standard normal distribution
torch.nn.init.normal_(model[0].weight)
torch.nn.init.normal_(model[2].weight)

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-6
for t in range(500):
    # forward pass
    y_hat = model(X)

    # compute loss
    loss = loss_fn(y_hat, y)
    print(t, loss.item())

    # backward pass
    loss.backward()

    # update all model parameters
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
    model.zero_grad()
0 34311500.0
1 32730668.0
2 33845940.0
3 31335464.0
4 23584192.0
5 14068799.0
6 7252735.5
7 3674312.0
8 2069563.0
9 1346445.75
...
490 7.143352559069172e-05
491 7.078371709212661e-05
492 7.009323599049821e-05
493 6.912354729138315e-05
494 6.783746357541531e-05
495 6.718340591760352e-05
496 6.611335265915841e-05
497 6.529116944875568e-05
498 6.444999598897994e-05
499 6.381605635397136e-05
model[0].weight
Parameter containing:
tensor([[ 0.1849, -0.2587, 1.6247, ..., -0.8608, -2.2139, -1.3076],
[-0.5197, 0.0600, 0.2141, ..., 0.0561, -0.1613, -0.3905],
[-0.5303, -0.1129, -0.2974, ..., -0.6166, -3.4082, 0.0969],
...,
[-0.4742, 0.2449, -1.5979, ..., -0.6195, -0.2970, -1.3764],
[-0.1131, 0.4973, 0.7679, ..., 0.1231, 0.6992, 0.4403],
[-0.1557, 0.8185, 0.7784, ..., -0.9993, 0.3424, -1.1116]],
requires_grad=True)
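The entries above are on the order of 1 because torch.nn.init.normal_ overwrites a tensor in place with draws from a normal distribution (mean 0, std 1 by default). A small standalone illustration (a sketch, separate from the training code):

import torch

layer = torch.nn.Linear(1000, 100)
print(layer.weight.std().item())        # small values from PyTorch's default init

torch.nn.init.normal_(layer.weight)     # overwrite in place with draws from N(0, 1)
print(layer.weight.std().item())        # now roughly 1

# a different mean/std can also be passed
torch.nn.init.normal_(layer.weight, mean=0.0, std=0.01)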
3.3 The optim package
The gradient descent above is still rather crude: the model weights are updated by hand. In this section we use the optim package to update the parameters for us; it provides many different optimization methods, including SGD+momentum, RMSProp, Adam, and more.
Building on 3.2, the changes are:
- with the optim package a typical learning_rate is 1e-4
- define an optimizer, here using Adam
- the whole parameter-update block is replaced by a single optimizer.step(), which updates all parameters in one step
- zero the gradients with optimizer.zero_grad() (a variant using plain SGD is sketched after this list)
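The same zero_grad / backward / step pattern works with any optimizer in the package; for instance, SGD with momentum could be dropped in like this (a minimal sketch with a stand-in one-layer model; the code below keeps Adam):

import torch

# a single Linear layer stands in for the model here, just for brevity
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss_fn = torch.nn.MSELoss(reduction='sum')

loss = loss_fn(model(x), y)
optimizer.zero_grad()   # clear any old gradients
loss.backward()         # compute fresh gradients
optimizer.step()        # apply one SGD+momentum update to all parameters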
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# randomly generate some training data
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H, bias=True),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    # forward pass
    y_hat = model(X)

    # compute loss
    loss = loss_fn(y_hat, y)
    print(t, loss.item())

    optimizer.zero_grad()

    # backward pass
    loss.backward()

    # update all model parameters
    optimizer.step()
0 677.295166015625
1 660.0888061523438
2 643.3673095703125
3 627.08642578125
4 611.1599731445312
5 595.6091918945312
6 580.5427856445312
7 565.9138793945312
8 551.620849609375
9 537.651123046875
...
490 9.944045586962602e-09
491 9.147494317574001e-09
492 8.492017755656889e-09
493 7.793811818146423e-09
494 7.225093412444039e-09
495 6.644597760896431e-09
496 6.126881668677697e-09
497 5.687876836191208e-09
498 5.240272660245182e-09
499 4.8260742069317075e-09
Here the weight-initialization lines have to be left out again, otherwise training goes very badly. (This is a bit puzzling; the behaviour changes again with a different learning_rate or optimizer, in which case the initialization may be needed after all.)
3.4 Defining your own model
Custom models are usually written by subclassing torch.nn.Module.
import torch
import torch.nn as nn

N, D_in, H, D_out = 64, 1000, 100, 10

# randomly generate some training data
X = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# define the two-layer network
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H, bias=False)
        self.linear2 = torch.nn.Linear(H, D_out, bias=False)

    def forward(self, x):
        y_hat = self.linear2(self.linear1(x).clamp(min=0))
        return y_hat

model = TwoLayerNet(D_in, H, D_out)

loss_fn = nn.MSELoss(reduction='sum')

learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(500):
    # forward pass
    y_hat = model(X)

    # compute loss
    loss = loss_fn(y_hat, y)
    print(t, loss.item())

    optimizer.zero_grad()

    # backward pass
    loss.backward()

    # update all model parameters
    optimizer.step()
0 713.7529296875
1 695.759033203125
2 678.2886352539062
3 661.2178344726562
4 644.5472412109375
5 628.3016357421875
6 612.5072021484375
7 597.1802978515625
8 582.385009765625
9 568.1029663085938
...
490 3.386985554243438e-07
491 3.155915919705876e-07
492 2.9405845225483063e-07
493 2.7391826051825774e-07
494 2.553651086145692e-07
495 2.379783694550497e-07
496 2.2159480295158573e-07
497 2.0649896725899453e-07
498 1.9220941283037973e-07
499 1.790194232853537e-07
Adam works very well here.
4 Summary
The steps are always the same: define the parameters, define the model, define the loss function, hand everything to an optimizer, and train.
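Condensed into a skeleton, the recipe looks like this (a sketch assembled from the code already shown above):

import torch
import torch.nn as nn

# 1. data / parameters
X, y = torch.randn(64, 1000), torch.randn(64, 10)

# 2. model
model = nn.Sequential(nn.Linear(1000, 100), nn.ReLU(), nn.Linear(100, 10))

# 3. loss function
loss_fn = nn.MSELoss(reduction='sum')

# 4. optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# 5. training loop
for t in range(500):
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()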