Cost Function and Backpropagation
Cost Function
Neural Network (Classification)
Given a training set of $m$ examples $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$
$L$ = total number of layers in the network
$s_l$ = number of units (not counting the bias unit) in layer $l$
$\underline{\text{Binary classification}}$
y=0 or 1
1 output unit
$\underline{\text{Multi-class classification}}$ (K classes)
$y \in \mathbb{R}^K$
K output units
Cost function
- Logistic regression:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
- Neural network: $h_\Theta(x)\in\mathbb{R}^K$, where $(h_\Theta(x))_i$ is the $i^{th}$ output.
$$J(\Theta) = -\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log(h_\Theta(x^{(i)}))_k + (1-y_k^{(i)})\log(1-h_\Theta(x^{(i)}))_k\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$
Note:
- the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
- the triple sum simply adds up the squares of all the individual Θs in the entire network.
- the i in the triple sum does not refer to training example i
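To make the double and triple sums concrete, here is a minimal NumPy sketch of the regularized cost above, assuming a 3-layer network; the names `sigmoid`, `nn_cost`, `Theta1`, and `Theta2` are hypothetical (not from the source), and `Y` is assumed to hold one-hot labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """Regularized cost J(Theta) for a 3-layer network.

    X: (m, n) inputs, Y: (m, K) one-hot labels.
    Theta1: (s2, n+1), Theta2: (K, s2+1); the first column multiplies the bias.
    """
    m = X.shape[0]

    # Forward propagation, prepending 1s for the bias units.
    a1 = np.hstack([np.ones((m, 1)), X])          # (m, n+1)
    a2 = sigmoid(a1 @ Theta1.T)                   # (m, s2)
    a2 = np.hstack([np.ones((m, 1)), a2])         # (m, s2+1)
    h  = sigmoid(a2 @ Theta2.T)                   # (m, K) = h_Theta(x)

    # Double sum over training examples and output units.
    J = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m

    # Triple sum over all weights, excluding the bias columns (j = 0).
    reg = (lam / (2 * m)) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    return J + reg
```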
Gradient computation: Backpropagation algorithm
$\delta_j^{(l)}$ = "error" of node $j$ in layer $l$
Backpropagation algorithm
Training set $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$
Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$); these serve as accumulators for computing the partial derivatives.
For $i = 1$ to $m$ (training example $(x^{(i)}, y^{(i)})$):
- Set $a^{(1)} = x^{(i)}$
- Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
- Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
- Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$
- $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$ (with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$)
After the loop, compute the matrices $D$ from the accumulated $\Delta$:
- $D_{i,j}^{(l)} := \dfrac{1}{m}\left(\Delta_{i,j}^{(l)} + \lambda\Theta_{i,j}^{(l)}\right)$, if $j \neq 0$
- $D_{i,j}^{(l)} := \dfrac{1}{m}\Delta_{i,j}^{(l)}$, if $j = 0$
We thus obtain $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$
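A minimal NumPy sketch of the loop above, again assuming a 3-layer network; the names `Theta1`, `Theta2`, `sigmoid`, and `backprop_gradients` are hypothetical. It follows the accumulate-then-average steps literally rather than being an optimized implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, Y, lam):
    """Compute D1, D2 with D^(l)_{ij} = dJ/dTheta^(l)_{ij} for a 3-layer network."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)   # accumulators, set to 0 for all l, i, j
    Delta2 = np.zeros_like(Theta2)

    for t in range(m):
        # Forward propagation for one example.
        a1 = np.concatenate(([1.0], X[t]))         # a^(1) = x^(t), bias prepended
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))  # a^(2)
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)                           # a^(3) = h_Theta(x^(t))

        # Output-layer error, then hidden-layer error (bias component dropped).
        d3 = a3 - Y[t]                                               # delta^(3)
        d2 = (Theta2.T @ d3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))   # delta^(2)

        # Accumulate: Delta^(l) += delta^(l+1) (a^(l))^T
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)

    # Average, and regularize every column except the bias column (j = 0).
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2
```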
Backpropagation Intuition
The procedure is always the same.
Forward propagation:
Starting from a training example $(x^{(i)}, y^{(i)})$ in the input layer, propagating to the first hidden layer computes the weighted sums of the input units, $z^{(2)}$; applying the sigmoid activation function then gives the activations $a^{(2)}$. Forward propagation continues layer by layer in the same way.
The weights mapping the first hidden layer to the second hidden layer are $\Theta^{(2)}$, which gives, for the first unit of the third layer:
$$z_1^{(3)} = \Theta_{10}^{(2)}\times 1 + \Theta_{11}^{(2)}\times a_1^{(2)} + \dots$$
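As a small illustration of this step, the sketch below (with made-up numbers and the hypothetical name `Theta2`) computes one unit's weighted sum element by element and checks it against the vectorized form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical small example: 2 units in layer 2, 2 units in layer 3.
a2 = np.array([0.6, 0.9])                 # activations a^(2) (without bias)
Theta2 = np.array([[0.1, 0.4, -0.2],      # rows: units of layer 3
                   [0.3, -0.5, 0.7]])     # columns: bias, a1^(2), a2^(2)

# Per-unit weighted sum, exactly as in the formula above:
z3_1 = Theta2[0, 0] * 1 + Theta2[0, 1] * a2[0] + Theta2[0, 2] * a2[1]

# Vectorized version for the whole layer:
z3 = Theta2 @ np.concatenate(([1.0], a2))   # z^(3)
a3 = sigmoid(z3)                            # a^(3)

assert np.isclose(z3[0], z3_1)
```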
Backpropagation runs in the opposite direction.
In the cost function:
$$J(\theta) = -\frac{1}{m}\sum_{t=1}^{m}\left[y_k^{(t)}\log(h_\theta(x^{(t)}))_k + (1-y_k^{(t)})\log(1-h_\theta(x^{(t)})_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\theta_{j,i}^2$$
The per-example cost
$$\mathrm{cost}(t) = y^{(t)}\log(h_\Theta(x^{(t)})) + (1-y^{(t)})\log(1-h_\Theta(x^{(t)}))$$
can be thought of as playing a role similar to a squared error: $\mathrm{cost}(t) \approx (h_\Theta(x^{(t)}) - y^{(t)})^2$.
Intuitively,
$$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\mathrm{cost}(t) \quad (\text{for } j \geq 0);
$$
that is, $\delta_j^{(l)}$ is the error associated with the activation of unit $j$ in layer $l$.
The computation starts from the last layer; for example, $\delta_1^{(4)} = a^{(4)} - y$ (the prediction minus the actual value).
Then propagate backward toward the earlier layers to compute $\delta^{(3)}$ and so on: each error is a $\Theta$-weighted combination of the next layer's errors, $\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}$ (up to the derivative factor shown in the algorithm above).
Note: $\delta$ values are computed only for the actual units of each layer; the bias units have no $\delta$.
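The intuition that $\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\mathrm{cost}(t)$ can be checked numerically. The sketch below uses hypothetical names (`cost_t` is the per-example cost with the leading minus sign included, so that $\delta^{(L)} = a^{(L)} - y$) and compares the backpropagated $\delta^{(2)}$ against a finite-difference derivative of the cost with respect to $z^{(2)}$ for a tiny random network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_t(z2, Theta2, y):
    """cost(t) as a function of the hidden layer's z^(2), for one example."""
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    h = sigmoid(Theta2 @ a2)                     # a^(3) = h_Theta(x^(t))
    return -(y * np.log(h) + (1 - y) * np.log(1 - h)).sum()

rng = np.random.default_rng(0)
Theta2 = rng.normal(size=(1, 4))    # 3 hidden units -> 1 output unit
z2 = rng.normal(size=3)
y = np.array([1.0])

# Backpropagated errors for this tiny network.
a2 = np.concatenate(([1.0], sigmoid(z2)))
d3 = sigmoid(Theta2 @ a2) - y                                  # delta^(3)
d2 = (Theta2.T @ d3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))     # delta^(2)

# Numerical check: delta_j^(2) ~= d cost(t) / d z_j^(2).
eps = 1e-5
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    num = (cost_t(z2 + e, Theta2, y) - cost_t(z2 - e, Theta2, y)) / (2 * eps)
    assert np.isclose(num, d2[j], atol=1e-6)
```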