Based on the loss function defined for logistic regression in Andrew Ng's machine learning course:
$$
J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]
$$
we take the partial derivative with respect to each parameter $\theta_{j}$. The derivation uses $h_{\theta}(x)=\sigma(\theta^{T} x)$ together with the sigmoid derivative identity $\sigma'(z)=\sigma(z)(1-\sigma(z))$, and proceeds as follows:
$$
\begin{aligned}
\frac{\partial}{\partial \theta_{j}} J(\theta) &=\frac{\partial}{\partial \theta_{j}} \frac{-1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \frac{\partial}{\partial \theta_{j}} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \frac{\partial}{\partial \theta_{j}} \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} \frac{\partial}{\partial \theta_{j}} h_{\theta}\left(x^{(i)}\right)}{h_{\theta}\left(x^{(i)}\right)}+\frac{\left(1-y^{(i)}\right) \frac{\partial}{\partial \theta_{j}}\left(1-h_{\theta}\left(x^{(i)}\right)\right)}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} \frac{\partial}{\partial \theta_{j}} \sigma\left(\theta^{T} x^{(i)}\right)}{h_{\theta}\left(x^{(i)}\right)}+\frac{\left(1-y^{(i)}\right) \frac{\partial}{\partial \theta_{j}}\left(1-\sigma\left(\theta^{T} x^{(i)}\right)\right)}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} \sigma\left(\theta^{T} x^{(i)}\right)\left(1-\sigma\left(\theta^{T} x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{h_{\theta}\left(x^{(i)}\right)}+\frac{-\left(1-y^{(i)}\right) \sigma\left(\theta^{T} x^{(i)}\right)\left(1-\sigma\left(\theta^{T} x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[\frac{y^{(i)} h_{\theta}\left(x^{(i)}\right)\left(1-h_{\theta}\left(x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{h_{\theta}\left(x^{(i)}\right)}-\frac{\left(1-y^{(i)}\right) h_{\theta}\left(x^{(i)}\right)\left(1-h_{\theta}\left(x^{(i)}\right)\right) \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)}}{1-h_{\theta}\left(x^{(i)}\right)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\left(1-h_{\theta}\left(x^{(i)}\right)\right) x_{j}^{(i)}-\left(1-y^{(i)}\right) h_{\theta}\left(x^{(i)}\right) x_{j}^{(i)}\right] \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\left(1-h_{\theta}\left(x^{(i)}\right)\right)-\left(1-y^{(i)}\right) h_{\theta}\left(x^{(i)}\right)\right] x_{j}^{(i)} \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}-y^{(i)} h_{\theta}\left(x^{(i)}\right)-h_{\theta}\left(x^{(i)}\right)+y^{(i)} h_{\theta}\left(x^{(i)}\right)\right] x_{j}^{(i)} \\
&=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}-h_{\theta}\left(x^{(i)}\right)\right] x_{j}^{(i)} \\
&=\frac{1}{m} \sum_{i=1}^{m}\left[h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right] x_{j}^{(i)}
\end{aligned}
$$
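The closed-form partial derivative above can be checked numerically. The sketch below (assuming NumPy, with made-up toy data) compares $\frac{1}{m}\sum_i\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}$ against central finite differences of $J(\theta)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    # Closed form derived above: (1/m) * sum[(h - y) * x_j]
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)

# Toy data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
theta = rng.normal(size=3)

# Numerical gradient via central differences
eps = 1e-6
num = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad(theta, X, y), num, atol=1e-6))
```

If the derivation were off by a sign or a factor, the two gradients would disagree immediately.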
Notice that this partial derivative has exactly the same form as the partial derivative of the linear regression loss with respect to $\theta$. The linear regression loss is defined as:
$$
J(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}
$$

Its partial derivative is:
$$
\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) \cdot x_{j}^{(i)}
$$
Vectorizing this expression:
$$
\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \overrightarrow{x_{j}}^{T}(X \theta-\vec{y})
$$

Stacking these components then gives the full gradient of the loss function:
$$
\nabla J(\theta)=\frac{1}{m} X^{T}(X \theta-\vec{y})
$$
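As a quick sanity check (a sketch assuming NumPy, with made-up data), the vectorized gradient $\frac{1}{m}X^{T}(X\theta-\vec{y})$ can be compared against the per-component sum $\frac{1}{m}\sum_i\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x_j^{(i)}$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))   # design matrix: m=20 examples, n=4 features
y = rng.normal(size=20)
theta = rng.normal(size=4)
m = len(y)

# Component-wise form: dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij
loop_grad = np.array([
    sum((X[i] @ theta - y[i]) * X[i, j] for i in range(m)) / m
    for j in range(X.shape[1])
])

# Vectorized form: (1/m) * X^T (X theta - y)
vec_grad = X.T @ (X @ theta - y) / m

print(np.allclose(loop_grad, vec_grad))
```

The vectorized form replaces the double loop with a single matrix product, which is both shorter and much faster for large $m$.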
Parameters are then updated using this gradient:
$$
\theta:=\theta-\frac{\alpha}{m} X^{T}(X \theta-\vec{y})
$$
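The update rule above can be sketched as a short gradient-descent loop. Note that for logistic regression $h_{\theta}=\sigma(X\theta)$ takes the place of $X\theta$ in the residual; the code below (a sketch assuming NumPy, with toy separable data) uses that form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.5, iters=2000):
    # theta := theta - (alpha/m) * X^T (sigmoid(X theta) - y)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Toy data: intercept column plus one feature; label is 1 when the feature is positive
X = np.column_stack([np.ones(8),
                     np.array([-3., -2., -1., -0.5, 0.5, 1., 2., 3.])])
y = (X[:, 1] > 0).astype(float)

theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) > 0.5).astype(float)
print((preds == y).all())
```

The step size `alpha` and iteration count are illustrative; in practice they are tuned, and convergence is monitored via the loss $J(\theta)$.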
For further details, see Andrew Ng's Machine Learning course.