Linear Regression
Assume the dataset is
$$D=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$$
In what follows we write:
$$X=(x_1,x_2,\dots,x_N)^T,\quad Y=(y_1,y_2,\dots,y_N)^T$$
Linear regression assumes:
$$f(w)=w^Tx$$
Least Squares
For this problem, we define the loss function as the squared error given by the 2-norm:
$$L(w)=\sum^N_{i=1}\|w^Tx_i-y_i\|^2_2$$
Expanding this:
$$L(w)=(w^Tx_1-y_1,\dots,w^Tx_N-y_N)\cdot(w^Tx_1-y_1,\dots,w^Tx_N-y_N)^T\\ =(w^TX^T-Y^T)\cdot(Xw-Y)\\ =w^TX^TXw-Y^TXw-w^TX^TY+Y^TY\\ =w^TX^TXw-2w^TX^TY+Y^TY$$
(The last step uses $Y^TXw=w^TX^TY$, since both are scalars.) The $\hat{w}$ that minimizes this value is:
$$\hat{w}=argmin_wL(w)\\ \longrightarrow\frac{\partial}{\partial{w}}L(w)=0\\ \longrightarrow2X^TX\hat{w}-2X^TY=0\\ \longrightarrow\hat{w}=(X^TX)^{-1}X^TY=X^+Y$$
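As a minimal sketch of this closed-form solution in NumPy (the data below is synthetic, invented purely for illustration); `np.linalg.lstsq` solves the same problem more robustly than forming the inverse:

```python
import numpy as np

# Synthetic data: Y = X w_true + small noise (illustrative only).
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))          # each row is one sample x_i^T
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + rng.normal(scale=0.1, size=N)

# Normal equations: solve (X^T X) w = X^T Y instead of inverting X^T X.
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq minimizes ||Xw - Y||^2 directly and is numerically safer.
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_hat)
print(w_lstsq)   # both ~ w_true
```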
In this expression, $(X^TX)^{-1}X^T$ is called the pseudo-inverse $X^+$. For an $X$ of full row rank or full column rank it can be computed directly, but for a rank-deficient sample matrix we need the singular value decomposition (SVD). Decomposing $X$ gives
$$X=U{\Sigma}V^T$$
from which the pseudo-inverse can be formed as $X^+=V\Sigma^+U^T$, where $\Sigma^+$ inverts only the nonzero singular values.
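A sketch of this SVD-based pseudo-inverse on a deliberately rank-deficient synthetic $X$; `np.linalg.pinv` performs the same construction internally:

```python
import numpy as np

# Build a rank-deficient X: the 4th column duplicates a direction, so rank 3 < 4.
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
X = np.column_stack([A, A[:, 0] + A[:, 1]])
Y = rng.normal(size=100)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
rcond = 1e-10
s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)  # invert only nonzero singular values
w_hat = Vt.T @ (s_inv * (U.T @ Y))                   # w_hat = X^+ Y

print(np.allclose(w_hat, np.linalg.pinv(X, rcond=rcond) @ Y))  # True
```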
Geometrically, least squares amounts to summing the squared distances between the model and the observed values. Suppose our samples span a $p$-dimensional space:
$$X=Span(x_1,\dots,x_N)$$
The model can then be written as
$$f(w)=X\beta$$
that is, some combination of $x_1,\dots,x_N$. Least squares asks that the distance between $Y$ and this model be as small as possible, so their difference should be perpendicular to the spanned space:
$$X^T\cdot(Y-X\beta)=0\longrightarrow\beta=(X^TX)^{-1}X^TY$$
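A quick numerical check of this orthogonality on made-up data: the residual of the least-squares fit should be (numerically) perpendicular to every column of $X$:

```python
import numpy as np

# Random illustrative data; no relationship between X and Y is needed here.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Y = rng.normal(size=50)

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
residual = Y - X @ beta
print(X.T @ residual)   # all entries ~0 up to floating-point error
```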
MLE with Gaussian Noise
The one-dimensional Gaussian distribution:
$$N(\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{(x-\mu)^2}{2\sigma^2})$$
The $p$-dimensional Gaussian distribution:
$$N(\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu))$$
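A sketch evaluating this density directly from the formula, checked against `scipy.stats.multivariate_normal` (SciPy assumed available; the $\mu$, $\Sigma$, and $x$ below are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical parameters: a positive-definite covariance and a query point.
p = 3
mu = np.zeros(p)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
x = np.array([0.5, -1.0, 0.2])

diff = x - mu
norm_const = 1.0 / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))
pdf_manual = norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

print(pdf_manual)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```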
MLE极大似然估计
θ
M
L
E
=
a
r
g
m
a
x
θ
P
(
x
∣
θ
)
\theta_{MLE}=argmax_{\theta}P(x|\theta)
θMLE?=argmaxθ?P(x∣θ) 对于一维的情况,记
$$y=w^Tx+\epsilon,\quad\epsilon\sim{N(0,\sigma^2)}$$
so that $y\sim{N(w^Tx,\sigma^2)}$.
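A minimal simulation sketch of this noise model, with hypothetical values of $w$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 200, 2
w = np.array([1.5, -0.7])   # hypothetical true weights
sigma = 0.3                 # hypothetical noise level

X = rng.normal(size=(N, p))
Y = X @ w + rng.normal(scale=sigma, size=N)   # each y_i ~ N(w^T x_i, sigma^2)
print(np.std(Y - X @ w))    # ~0.3, the noise standard deviation
```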
Substituting this into the maximum likelihood estimate:
$$L(w)=\log p(Y|X, w)=\log\prod_{i=1}^Np(y_i|x_i,w)\\ =\sum_{i=1}^N\log(\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}})\\ =\sum_{i=1}^N[\log\frac{1}{\sqrt{2\pi}}+\log\frac{1}{\sigma}-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}]$$
Since the first two terms do not depend on $w$, maximizing $L(w)$ over $w$ is exactly minimizing $\sum_{i=1}^N(y_i-w^Tx_i)^2$: MLE under Gaussian noise is equivalent to least squares.
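To illustrate the equivalence, the sketch below (synthetic data, $\sigma$ treated as known) maximizes the log-likelihood numerically and compares against the least-squares solution:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data from the Gaussian-noise model above.
rng = np.random.default_rng(3)
N, p = 200, 2
w_true = np.array([1.5, -0.7])
sigma = 0.3
X = rng.normal(size=(N, p))
Y = X @ w_true + rng.normal(scale=sigma, size=N)

def neg_log_likelihood(w):
    # Negative of L(w) from the derivation, with sigma fixed.
    resid = Y - X @ w
    return N * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(resid**2) / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
w_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w_mle)
print(w_ls)   # the two agree up to optimizer tolerance
```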
The same procedure yields the MLE of the Gaussian parameters themselves. For a sample $x_1,\dots,x_N\sim N(\mu,\sigma^2)$, we find $\mu_{MLE}$:
$$\mu_{MLE}=argmax_\mu{\log p(x|\theta)}\\ =argmax_\mu{\sum_{i=1}^N-\frac{(x_i-\mu)^2}{2\sigma^2}}\\ =argmin_\mu{\sum_{i=1}^N(x_i-\mu)^2}$$
Taking the derivative:
$$\frac{\partial}{\partial{\mu}}\sum_{i=1}^N(x_i-\mu)^2=\sum_{i=1}^N2(x_i-\mu)(-1)=0\\ \longrightarrow\sum_{i=1}^{N}(x_i-\mu)=0\\ \longrightarrow\sum_{i=1}^Nx_i=N\mu\\ \longrightarrow\mu_{MLE}=\frac{1}{N}\sum_{i=1}^Nx_i$$
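A quick numerical sanity check on a made-up sample: the minimizer of $\sum_i(x_i-\mu)^2$ over a grid coincides with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Evaluate the squared-deviation loss on a grid of candidate mu values.
mu_grid = np.linspace(0.0, 4.0, 2001)
losses = ((x[:, None] - mu_grid[None, :]) ** 2).sum(axis=0)
print(mu_grid[np.argmin(losses)])   # grid minimizer
print(x.mean())                     # mu_MLE, essentially the same value
```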
For $\sigma^2_{MLE}$, we can likewise obtain from the above (keeping only the terms that depend on $\sigma$):
$$\sigma^2_{MLE}=argmax_\sigma{\log P(x|\theta)}\\ =argmax_\sigma\sum_{i=1}^N(-\log\sigma-\frac{(x_i-\mu)^2}{2\sigma^2})$$
Taking the derivative:
$$\frac{\partial}{\partial\sigma}\sum_{i=1}^N(-\log\sigma-\frac{(x_i-\mu)^2}{2\sigma^2})=\sum_{i=1}^N(-\frac{1}{\sigma}-\frac{1}{2}(x_i-\mu)^2(-2)\sigma^{-3})=0\\ \longrightarrow\sum_{i=1}^N(-\frac{1}{\sigma}+(x_i-\mu)^2\sigma^{-3})=0\\ \longrightarrow\sum_{i=1}^N(-\sigma^2+(x_i-\mu)^2)=0\\ \longrightarrow\sigma^2_{MLE}=\frac{1}{N}\sum_{i=1}^N(x_i-\mu)^2$$
where $\mu$ is taken to be $\mu_{MLE}$.
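In code, $\sigma^2_{MLE}$ is just the `ddof=0` sample variance; a small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

mu_mle = x.mean()
sigma2_mle = np.mean((x - mu_mle) ** 2)
print(sigma2_mle, np.var(x))    # identical: np.var defaults to ddof=0
print(np.var(x, ddof=1))        # the 1/(N-1) version discussed below
```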
However, this $\sigma^2_{MLE}$ is a biased estimate. Taking the expectation of $\sigma^2_{MLE}$:
$$E[\sigma_{MLE}^2]=\frac{1}{N}\sum_{i=1}^NE[(x_i-\mu_{MLE})^2]\\ =\frac{1}{N}\sum_{i=1}^{N}E[x_i^2-2x_i\mu_{MLE}+\mu^2_{MLE}]\\ =\frac{1}{N}(\sum_{i=1}^NE[x_i^2]-2\sum_{i=1}^NE[x_i\mu_{MLE}]+\sum_{i=1}^NE[\mu^2_{MLE}])\\ =\frac{1}{N}(\sum_{i=1}^NE[x_i^2]-2NE[\mu^2_{MLE}]+NE[\mu^2_{MLE}])\\ =\frac{1}{N}\sum_{i=1}^NE[x_i^2]-E[\mu^2_{MLE}]\\ =E[\frac{1}{N}\sum_{i=1}^Nx^2_i-\mu^2]-E[\mu^2_{MLE}-\mu^2]$$
where the fourth line uses $\sum_{i=1}^Nx_i\mu_{MLE}=N\mu_{MLE}^2$. For the first term:
$$E[\frac{1}{N}\sum_{i=1}^Nx^2_i-\mu^2]=\frac{1}{N}\sum_{i=1}^N(E[x_i^2]-\mu^2)\\ =\frac{1}{N}\sum_{i=1}^NVar(x_i)\\ =\frac{1}{N}\sum_{i=1}^N\sigma^2=\sigma^2$$
For the second term:
$$E[\mu^2_{MLE}-\mu^2]=E[\mu^2_{MLE}]-E[\mu^2]\\ =E[\mu^2_{MLE}]-\mu^2=Var(\mu_{MLE})\\ =Var[\frac{1}{N}\sum^N_{i=1}x_i]=\frac{1}{N^2}\sum^N_{i=1}Var(x_i)\\ =\frac{1}{N^2}\sum_{i=1}^N\sigma^2=\frac{1}{N}\sigma^2$$
So finally:
$$E[\sigma^2_{MLE}]=\frac{N-1}{N}\sigma^2$$
and the unbiased estimate of $\sigma^2$ is:
$$\hat{\sigma}^2=\frac{1}{N-1}\sum_{i=1}^N(x_i-\mu_{MLE})^2$$
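A Monte Carlo sketch (simulation parameters made up) of the bias factor $\frac{N-1}{N}$ and of the unbiasedness of the $\frac{1}{N-1}$ estimator:

```python
import numpy as np

rng = np.random.default_rng(6)
N, trials, sigma2 = 10, 100_000, 4.0

# Many small Gaussian samples; compare the two variance estimators.
samples = rng.normal(scale=np.sqrt(sigma2), size=(trials, N))
sigma2_mle = np.var(samples, axis=1, ddof=0)       # 1/N, biased
sigma2_unbiased = np.var(samples, axis=1, ddof=1)  # 1/(N-1), unbiased

print(sigma2_mle.mean(), (N - 1) / N * sigma2)     # both ~3.6
print(sigma2_unbiased.mean())                      # ~4.0
```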