Deep Learning (30) Stochastic Gradient Descent VII: Multi-Layer Perceptron Gradients (Backpropagation)
Recap:
- Chain Rule
- Multi-output Perceptron
The output-layer gradient of the multi-output perceptron derived previously:

$$\frac{\partial E}{\partial w_{jk}} = (O_k - t_k)\,O_k(1 - O_k)\,x_j^0$$

Multi-Layer Perceptron
1. Multi-Layer Perceptron Model
In a multi-layer network, the input to the output layer is no longer the raw input $x_j^0$ but the output $x_j^J$ of the preceding hidden layer $J$, so the rule above becomes

$$\frac{\partial E}{\partial w_{jk}} = (O_k - t_k)\,O_k(1 - O_k)\,x_j^0 \;\to\; \frac{\partial E}{\partial w_{jk}} = (O_k - t_k)\,O_k(1 - O_k)\,x_j^J$$

Let

$$\delta_k^K = (O_k - t_k)\,O_k(1 - O_k)$$

Note: $\delta_k^K$ can be understood as an attribute of node $k$. Then

$$\frac{\partial E}{\partial w_{jk}} = \delta_k^K\,x_j^J$$
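To make the output-layer rule concrete, here is a minimal NumPy sketch. All names and shapes are illustrative assumptions (a sigmoid output layer and the MSE loss $E = \frac{1}{2}\sum_k (O_k - t_k)^2$), not code from the source:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed setup: x_J holds the outputs of hidden layer J,
# W_jk maps layer J to the output layer K, t are the targets.
rng = np.random.default_rng(0)
n_j, n_k = 4, 3
x_J = rng.random(n_j)                      # x_j^J
W_jk = rng.normal(size=(n_j, n_k))
t = rng.random(n_k)                        # t_k

O_K = sigmoid(x_J @ W_jk)                  # O_k, sigmoid output layer
delta_K = (O_K - t) * O_K * (1 - O_K)      # δ_k^K = (O_k - t_k) O_k (1 - O_k)
dE_dW_jk = np.outer(x_J, delta_K)          # ∂E/∂w_jk = δ_k^K · x_j^J

print(dE_dW_jk.shape)                      # (4, 3): one partial derivative per weight
```

The outer product produces exactly one partial derivative per weight $w_{jk}$, matching the element-wise formula above.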
2. Multi-Layer Perceptron Gradient
Applying the chain rule to a hidden-layer weight $w_{ij}$:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}}\,\frac{1}{2}\sum_{k\in K}(O_k - t_k)^2$$

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k\in K}(O_k - t_k)\,\frac{\partial O_k}{\partial w_{ij}}$$

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k\in K}(O_k - t_k)\,\frac{\partial\,\sigma(x_k)}{\partial w_{ij}}$$

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k\in K}(O_k - t_k)\,\frac{\partial\,\sigma(x_k)}{\partial x_k}\,\frac{\partial x_k}{\partial w_{ij}}$$

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k\in K}(O_k - t_k)\,\sigma(x_k)\bigl(1-\sigma(x_k)\bigr)\,\frac{\partial x_k}{\partial w_{ij}}$$

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k\in K}(O_k - t_k)\,O_k(1-O_k)\,\frac{\partial x_k}{\partial w_{ij}}$$

$$\frac{\partial E}{\partial w_{ij}} = \sum_{k\in K}(O_k - t_k)\,O_k(1-O_k)\,\frac{\partial x_k}{\partial O_j}\,\frac{\partial O_j}{\partial w_{ij}}$$

$$\because\; x_k^K = O_0^J w_{0k}^J + O_1^J w_{1k}^J + \cdots + O_j^J w_{jk}^J + \cdots + O_n^J w_{nk}^J$$

$$\therefore\; \frac{\partial E}{\partial w_{ij}} = \sum_{k\in K}(O_k - t_k)\,O_k(1-O_k)\,w_{jk}\,\frac{\partial O_j}{\partial w_{ij}}$$

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial O_j}{\partial w_{ij}}\sum_{k\in K}(O_k - t_k)\,O_k(1-O_k)\,w_{jk}$$

$$\because\; \frac{\partial O_j}{\partial w_{ij}} = \frac{\partial O_j}{\partial x_j}\,\frac{\partial x_j}{\partial w_{ij}} = O_j(1-O_j)\,\frac{\partial x_j}{\partial w_{ij}}$$

$$\therefore\; \frac{\partial E}{\partial w_{ij}} = O_j(1-O_j)\,\frac{\partial x_j}{\partial w_{ij}}\sum_{k\in K}(O_k - t_k)\,O_k(1-O_k)\,w_{jk}$$

$$\frac{\partial E}{\partial w_{ij}} = O_j(1-O_j)\,O_i\sum_{k\in K}(O_k - t_k)\,O_k(1-O_k)\,w_{jk}$$

$$\because\; (O_k - t_k)\,O_k(1-O_k) = \delta_k^K$$

$$\therefore\; \frac{\partial E}{\partial w_{ij}} = O_i\,O_j(1-O_j)\sum_{k\in K}\delta_k^K\,w_{jk}$$

Let

$$\delta_j^J = O_j(1-O_j)\sum_{k\in K}\delta_k^K\,w_{jk}$$

Then

$$\frac{\partial E}{\partial w_{ij}} = \delta_j^J\,O_i^I$$

Note: $\delta_j^J$ can be understood as measuring how much the current connection $w_{ij}$ contributes to the error function.
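The same two formulas translate directly into vectorized code. The sketch below is a minimal illustration under assumed names and shapes (layers $I \to J \to K$, sigmoid activations everywhere, MSE loss); it is not the source's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed network: layer I -> layer J -> output layer K, sigmoid activations,
# loss E = 1/2 * sum((O_k - t_k)^2).
rng = np.random.default_rng(1)
n_i, n_j, n_k = 5, 4, 3
O_I = rng.random(n_i)                    # O_i^I: outputs of layer I
W_ij = rng.normal(size=(n_i, n_j))       # weights I -> J
W_jk = rng.normal(size=(n_j, n_k))       # weights J -> K
t = rng.random(n_k)

O_J = sigmoid(O_I @ W_ij)                # O_j^J
O_K = sigmoid(O_J @ W_jk)                # O_k^K

delta_K = (O_K - t) * O_K * (1 - O_K)             # δ_k^K
delta_J = O_J * (1 - O_J) * (W_jk @ delta_K)      # δ_j^J = O_j(1-O_j) Σ_k δ_k^K w_jk
dE_dW_jk = np.outer(O_J, delta_K)                 # ∂E/∂w_jk = δ_k^K O_j^J
dE_dW_ij = np.outer(O_I, delta_J)                 # ∂E/∂w_ij = δ_j^J O_i^I
```

Note how `delta_J` reuses `delta_K`: the output-layer errors are first propagated backward through `W_jk` and only then scaled by the local sigmoid derivative.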
3. Summary of the Propagation Rules
- Output layer:
$$\frac{\partial E}{\partial w_{jk}} = \delta_k^{(K)}\,O_j$$
$$\delta_k^{(K)} = O_k(1-O_k)(O_k - t_k)$$
- Second-to-last layer:
$$\frac{\partial E}{\partial w_{ij}} = \delta_j^{(J)}\,O_i$$
$$\delta_j^{(J)} = O_j(1-O_j)\sum_k\delta_k^{(K)}\,w_{jk}$$
- Third-to-last layer:
$$\frac{\partial E}{\partial w_{ni}} = \delta_i^{(I)}\,O_n$$
$$\delta_i^{(I)} = O_i(1-O_i)\sum_j\delta_j^{(J)}\,w_{ij}$$

where $O_n$ is the input to the third-to-last layer, i.e. the output of the fourth-to-last layer.
Following this pattern, one only needs to loop backward through the network, computing $\delta_k^{(K)}$, $\delta_j^{(J)}$, $\delta_i^{(I)}$, and so on for every node of every layer, to obtain that layer's partial derivatives and hence the gradient of each layer's weight matrix $W$; the network parameters can then be optimized iteratively with gradient descent, as sketched below.
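The whole procedure fits in a short loop. The following NumPy sketch is a minimal illustration under stated assumptions (sigmoid activations, MSE loss, illustrative layer sizes and learning rate; the helper names `forward` and `backward` are ours, not the source's): it computes the δ's layer by layer, forms each weight gradient as an outer product, takes one gradient-descent step, and checks one backpropagated gradient against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Return the activations of every layer, starting with the input."""
    outs = [x]
    for W in weights:
        outs.append(sigmoid(outs[-1] @ W))
    return outs

def backward(outs, weights, t):
    """Propagate the deltas backward and return dE/dW for every layer."""
    O_out = outs[-1]
    delta = (O_out - t) * O_out * (1 - O_out)        # output-layer δ
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(outs[l], delta)          # ∂E/∂W_l = O_prev · δ
        if l > 0:
            O_prev = outs[l]
            delta = O_prev * (1 - O_prev) * (weights[l] @ delta)  # δ for the layer below
    return grads

# Illustrative 5 -> 4 -> 4 -> 3 network (assumed sizes).
rng = np.random.default_rng(2)
sizes = [5, 4, 4, 3]
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
x, t = rng.random(sizes[0]), rng.random(sizes[-1])

grads = backward(forward(x, weights), weights, t)

# One gradient-descent step with an assumed learning rate.
lr = 0.1
weights = [W - lr * g for W, g in zip(weights, grads)]

# Finite-difference check on a single weight: should match the backprop value.
def loss(ws):
    return 0.5 * np.sum((forward(x, ws)[-1] - t) ** 2)

eps = 1e-6
ws_plus = [W.copy() for W in weights]
ws_plus[0][0, 0] += eps
ws_minus = [W.copy() for W in weights]
ws_minus[0][0, 0] -= eps
numeric = (loss(ws_plus) - loss(ws_minus)) / (2 * eps)
analytic = backward(forward(x, weights), weights, t)[0][0, 0]
print(abs(numeric - analytic) < 1e-6)    # True: backprop agrees with the numeric gradient
```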
Reference: [1] 龙良曲, 《深度学习与TensorFlow2入门实战》 (Long Liangqu, *Deep Learning and TensorFlow 2 in Action*).