Introduction
I used to mix up regression and classification. The simplest way to keep them apart: classification targets a discrete output space, e.g. deciding from its features whether a given image shows a cat or a dog; regression targets a continuous output space, e.g. predicting a distance or a probability.
Least Squares Linear Regression
First-degree (linear) regression
We are given training data and the corresponding target values $(x_i, y_i)$ (each $x_i$ is a vector):

$$X=\left\{x_1 \in \mathbb{R}^d,\dots,x_n\right\}$$

$$Y=\left\{y_1 \in \mathbb{R},\dots,y_n\right\}$$

The linear regression equation is

$$x_i^T w + w_0 = y_i \quad \forall i = 1,\dots,n$$

This is the same equation as in least squares classification; the only difference is that the target values are continuous.
Here $w_0$ is an offset, i.e. the familiar bias term; we usually fold it into the weight vector, which is both convenient and easier to compute with.

The general procedure for least squares linear regression is as follows.

Step 1: Define
$$\hat{x}_i=\begin{bmatrix} x_i\\ 1\end{bmatrix},\qquad \hat{w}=\begin{bmatrix} w\\ w_0\end{bmatrix}$$

Step 2: Rewrite
$$\hat{x}_i^T\hat{w}=y_i \quad \forall i=1,\dots,n$$

Step 3: Matrix-vector notation
$$\hat{X}^T\hat{w}=y$$

where $\hat{X}=[\hat{x}_1,\dots,\hat{x}_n]$ (each $\hat{x}_i$ is a column vector) and $y=[y_1,\dots,y_n]^T$.

Step 4: Find the least squares solution

Minimizing the squared error $\sum_{i=1}^n(\hat{x}_i^T\hat{w}-y_i)^2$ and setting the gradient to zero gives the weights directly:

$$\hat{w}=(\hat{X}\hat{X}^T)^{-1}\hat{X}y$$

The most expensive part of this computation is inverting the $\mathbb{R}^{D\times D}$ matrix $\hat{X}\hat{X}^T$ (with $D=d+1$): computing the inverse directly costs $O(D^3)$, which becomes very expensive when $D$ is large. In that case two alternatives are commonly used:
1. Gradient descent
2. Work with fewer dimensions
Both of these should already be familiar, so I won't go into them here; a minimal sketch of the closed-form solution and of gradient descent follows.
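As a small illustration (not part of the original notes), here is a minimal NumPy sketch under my own toy assumptions (a noisy line $y=2x+1$): it augments the inputs with a constant feature, computes the closed-form least squares solution, and then recovers roughly the same weights by gradient descent, which avoids forming the $O(D^3)$ inverse.

import numpy as np

# Toy data (hypothetical): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(50, 1))                    # inputs, shape (n, d) with d = 1
y = 2 * x + 1 + 0.1 * rng.standard_normal((50, 1))      # targets, shape (n, 1)

# Steps 1-3: append the constant 1 so the bias is part of the weight vector
X_hat = np.vstack([x.T, np.ones((1, x.shape[0]))])      # shape (d+1, n)

# Step 4a: closed-form least squares solution w_hat = (X X^T)^{-1} X y
w_closed = np.linalg.solve(X_hat @ X_hat.T, X_hat @ y)

# Step 4b: gradient descent on (half) the mean squared error
w_gd = np.zeros((X_hat.shape[0], 1))
lr = 0.1
for _ in range(2000):
    grad = X_hat @ (X_hat.T @ w_gd - y) / x.shape[0]    # gradient of 0.5 * MSE
    w_gd -= lr * grad

print(w_closed.ravel(), w_gd.ravel())                   # both should be close to [2, 1]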
Polynomial Regression
This builds on least squares regression; the model has a similar form:

$$y(x)=w^T\phi(x)=\sum_{i=0}^{M}w_i\phi_i(x)$$

with $\phi_0(x)=1$. The $\phi_i(\cdot)$ are called basis functions, and the model is still linear in $w$.

Here $\phi(x)$ is a polynomial feature map, for example

$$\phi(x)=(1,x,x^2,x^3)^T$$

Fitting polynomials of different degrees gives different results (figure omitted): depending on the degree, the model can underfit or overfit. A short sketch of building these features follows.
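The following is a minimal sketch of the idea, not code from the original notes: it builds $\phi(x)=(1,x,\dots,x^M)^T$ for a few degrees and fits the weights by ordinary least squares. The toy sine data and the helper name poly_features are my own assumptions.

import numpy as np

def poly_features(x, degree):
    """Map a 1-D array of shape (n,) to Phi of shape (degree+1, n); rows are 1, x, x^2, ..., x^degree."""
    return np.vstack([x ** i for i in range(degree + 1)])

# Hypothetical toy data: a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(20)

for degree in (1, 3, 9):
    Phi = poly_features(x, degree)                      # (degree+1, n)
    w = np.linalg.lstsq(Phi.T, y, rcond=None)[0]        # least squares weights
    train_rmse = np.sqrt(np.mean((Phi.T @ w - y) ** 2))
    print(f"degree={degree}: train RMSE = {train_rmse:.3f}")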
Maximum Likelihood Approach to Regression
Probabilistic Regression
Probabilistic regression rests on two assumptions.

Assumption 1: the target values are generated by adding noise to the function output:

$$y = f(x,w)+\epsilon$$

where $y$ is the target value, $x$ the input, $f$ the regression function, $w$ the weights (parameters), and $\epsilon$ the noise.

Assumption 2: the noise is a Gaussian random variable:

$$\epsilon \sim N(0,\beta^{-1})$$

$$p(y\mid x,w,\beta)=N(y\mid f(x,w),\beta^{-1})$$

Here $f(x,w)$ is the mean and $\beta^{-1}$ is the variance ($\beta$ is the precision). Note that $y$ is now a random variable whose underlying distribution is $p(y\mid x,w,\beta)$.
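As a tiny illustration of these two assumptions (my own toy setup, with made-up "true" parameters and precision), here is how one would sample targets from this generative model:

import numpy as np

rng = np.random.default_rng(0)
beta = 25.0                                        # noise precision, so the noise std is 1/sqrt(beta) = 0.2
w_true = np.array([1.0, 2.0])                      # hypothetical "true" parameters

x = rng.uniform(-1, 1, 100)
f = w_true[0] + w_true[1] * x                      # f(x, w) for a linear model
eps = rng.normal(0.0, 1.0 / np.sqrt(beta), 100)    # epsilon ~ N(0, beta^{-1})
y = f + eps                                        # y = f(x, w) + epsilon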
In more detail: given data points $X=[x_1,\dots,x_n]\in \mathbb{R}^{d\times n}$ and the corresponding target values $Y=[y_1,\dots,y_n]^T$, the conditional likelihood (assuming the data are i.i.d.) is

$$p(y\mid X,w,\beta)=\prod_{i=1}^n N(y_i\mid f(x_i,w),\beta^{-1})$$

Plugging in the linear model:
$$p(y\mid X,w,\beta)=\prod_{i=1}^n N(y_i\mid w^T\phi(x_i),\beta^{-1})$$

where $w^T\phi(x_i)$ is the generalized linear regression model. We now maximize this likelihood with respect to $\beta$ and $w$. Working with the log-likelihood:

$$\ln p(y\mid X,w,\beta)=\frac{n}{2}\ln\beta-\frac{n}{2}\ln(2\pi)-\frac{\beta}{2}\sum_{i=1}^n\left(y_i-w^T\phi(x_i)\right)^2$$

Its gradient with respect to $w$ is

$$\nabla_w \ln p(y\mid X,w,\beta)=\beta\sum_{i=1}^n\left(y_i-w^T\phi(x_i)\right)\phi(x_i)$$

Defining $\Phi=[\phi(x_1),\dots,\phi(x_n)]$ and setting the gradient to zero gives

$$w_{ML}=(\Phi\Phi^T)^{-1}\Phi y$$

So the maximum likelihood weights coincide with the least squares solution from before, and we can draw the following conclusion:
Least squares regression is equivalent to assuming a Gaussian distribution over the targets.
So keep in mind that using least squares comes with this implicit assumption.

Compared with plain least squares, however, the maximum likelihood approach is more broadly applicable: besides the weights $w$, we can also use it to estimate $\beta$,

$$\frac{1}{\beta_{ML}}=\frac{1}{n}\sum_{i=1}^n\left(y_i-w_{ML}^T\phi(x_i)\right)^2$$

and we can quantify the uncertainty of our estimates (via a loss function). A short sketch of the two maximum likelihood estimates follows.
Loss Functions in Regression
Suppose we now get a new data point $x_t$. In least squares regression its predicted value is simply

$$y_t=\hat{x}_t^T\hat{w}$$

In maximum likelihood regression, however, what we compute is a probability, $p(y\mid x,w,\beta)$. How do we turn that into a concrete estimate of $y_t$? This is where the loss function comes in:
$$L:\ \mathbb{R}\times \mathbb{R} \rightarrow \mathbb{R}^+$$

$$(y_t,f(x_t))\rightarrow L(y_t,f(x_t))$$
We then minimize the expected loss:

$$E_{x,y\sim p(x,y)}[L]=\int\!\!\int L(y,f(x))\,p(x,y)\,dx\,dy$$

For example, with the squared loss $L(y,f(x))=(y-f(x))^2$, setting the functional derivative of the expected loss with respect to $f(x)$ to zero gives

$$f^*(x)=\int y\,p(y\mid x)\,dy=E[y\mid x]$$

So under the squared loss, the optimal regression function is the mean of the posterior $p(y\mid x)$, i.e. $E[y\mid x]$; this is called the mean prediction. For our generalized linear regression model this becomes:
$$f(x)=\int y\,N(y\mid w^T\phi(x),\beta^{-1})\,dy=w^T\phi(x)$$

which connects back to the earlier results. A quick numerical check of the mean-prediction result follows.
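As a small Monte Carlo sanity check (my own toy setup, not from the original notes): under the squared loss, predicting the conditional mean $E[y\mid x]$ gives a smaller average loss than a shifted predictor.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100_000)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)   # E[y|x] = sin(2*pi*x)

mean_pred = np.sin(2 * np.pi * x)        # the conditional mean
shifted_pred = mean_pred + 0.2           # some other (shifted) predictor

print(np.mean((y - mean_pred) ** 2))     # ~ 0.09  (the noise variance)
print(np.mean((y - shifted_pred) ** 2))  # ~ 0.13  (noise variance + 0.2^2)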
Bayesian Linear Regression
Back to the original concern: we want to avoid overfitting and instability, yet maximum likelihood can still overfit (think of the extreme case of a single data point). Bayesian linear regression addresses this.

We place a prior on the parameters $w$ to control the instability:

$$p(w\mid X,y)\propto p(y\mid X,w)\,p(w)$$

where:
$p(w)$ is the prior over the parameters; $p(y\mid X,w)$ is the likelihood of the targets given the data and parameters (defined as before); and $p(w\mid X,y)$ is the posterior over the parameters.
Note: what we obtain here is no longer a single value for the parameters, but a probability distribution over them.
The simplest choice is a Gaussian prior on the parameters $w$:

$$w\sim p(w\mid\alpha)=N(w\mid 0,\alpha^{-1}I)$$

The precision $\alpha$ acts as a "soft" constraint that keeps the solution from becoming unstable. Setting the mean to $0$ is just for computational convenience later on; it could be any other value. The posterior then satisfies

$$p(w\mid X,y)\propto \prod_{i=1}^n N(y_i\mid w^T\phi(x_i),\beta^{-1})\cdot N(w\mid 0,\alpha^{-1}I)$$

There are several ways to work with this posterior over the parameters; here are two:
Maximum a Posteriori (MAP)
Take the log posterior and maximize it:

$$\ln p(w\mid X,y)= -\frac{\beta}{2}\sum_{i=1}^n\left(y_i-w^T\phi(x_i)\right)^2-\frac{\alpha}{2}w^Tw+\text{const}$$

Differentiating with respect to $w$ and setting the result to zero,

$$\beta\sum_{i=1}^n\left(y_i-w^T\phi(x_i)\right)\phi(x_i)-\alpha w=0$$

which gives the parameter estimate

$$w_{map}=\left(\Phi \Phi^T+\frac{\alpha}{\beta}I\right)^{-1}\Phi y$$

Compared with the earlier result there is an extra $\frac{\alpha}{\beta}I$ term: the prior regularizes the pseudo-inverse. This is also known as ridge regression; a small sketch of this closed form follows.
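Here is a minimal NumPy sketch of the MAP / ridge estimate; the toy sine data and the values of alpha and beta are my own choices, not from the original notes.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)

degree = 9
Phi = np.vstack([x ** i for i in range(degree + 1)])     # Phi is (D, n), columns are phi(x_i)

alpha, beta = 1e-3, 25.0                                  # prior precision, noise precision
D = Phi.shape[0]
w_map = np.linalg.solve(Phi @ Phi.T + (alpha / beta) * np.eye(D), Phi @ y)

print(w_map)   # regularized weights; without the alpha/beta term the 9th-degree fit would chase the noise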
Since we are on the topic, it is worth mentioning the closely related LASSO regression. LASSO is very similar to ridge regression; the difference is the regularization term (LASSO uses $\ell_1$ regularization, ridge uses $\ell_2$). Both constrain the parameters and thus prevent overfitting. In addition, LASSO can drive the parameters of less useful features exactly to zero, yielding a sparse solution; in other words, it performs dimensionality reduction (feature selection) during training. A small comparison sketch follows.
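To make the sparsity point concrete, here is a sketch using scikit-learn's Ridge and Lasso (a library not used elsewhere in these notes); the toy setup, in which only 2 of 10 features actually matter, and the penalty strengths are my own assumptions.

from sklearn.linear_model import Ridge, Lasso
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(100)

ridge = Ridge(alpha=1.0).fit(X, y)   # l2 penalty: shrinks all coefficients, none exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)   # l1 penalty: irrelevant coefficients are driven to exactly 0

print(np.round(ridge.coef_, 3))      # small but nonzero values on the irrelevant features
print(np.round(lasso.coef_, 3))      # mostly exact zeros -> sparse solution / feature selection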
(Figure omitted: a plot that visually contrasts ridge regression with plain least squares regression.)
Comparing MAP with Regularized Least Squares
Now add a regularization term to the earlier least squares objective:

$$\sum_{i=1}^n\left(\hat{x}_i^T\hat{w}-y_i\right)^2+\lambda\lVert\hat{w}\rVert^2$$

Solving for $w$ gives a new estimate:

$$\hat{w}=(\hat{X}\hat{X}^T+\lambda I)^{-1}\hat{X}y$$

where $\lambda = \alpha/\beta$. In other words, adding a regularization term $\lambda$ to least squares regression amounts to assuming Gaussian noise on the targets and a Gaussian prior on the parameters.

Taking the 9th-degree polynomial regression from before and fitting it with different values of $\lambda$ (figure omitted), we can see that $\lambda=\alpha/\beta$ controls the model complexity and hence the degree of overfitting. A small sketch of this experiment follows.
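Below is a minimal sketch of that experiment; the toy sine data, the degree-9 feature map, and the particular lambda values are my own picks, not from the original notes.

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 10)
x_test = rng.uniform(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, 100)

def features(x, degree=9):
    return np.vstack([x ** i for i in range(degree + 1)])   # (degree+1, n)

def rmse(w, x, y):
    return np.sqrt(np.mean((features(x).T @ w - y) ** 2))

Phi = features(x_train)
for lam in (0.0, 1e-6, 1e-2, 1.0):
    w = np.linalg.solve(Phi @ Phi.T + lam * np.eye(Phi.shape[0]), Phi @ y_train)
    print(f"lambda={lam:g}: train RMSE={rmse(w, x_train, y_train):.3f}, "
          f"test RMSE={rmse(w, x_test, y_test):.3f}")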
Full Bayesian Regression
Taking a different view: we do not actually need a point estimate of the parameters $w$ at all; what we really want is to predict the target value for a new input from the training data. That means we can work with the marginal:

$$p(y_t\mid x_t,X,y)=\int p(y_t,w\mid x_t,X,y)\,dw$$

Expanding the joint with the product rule, this can be rewritten as

$$p(y_t\mid x_t,X,y)=\int p(y_t\mid x_t,w)\,p(w\mid X,y)\,dw$$

If the prior and likelihood are Gaussian, this predictive distribution is also Gaussian:

$$p(y_t\mid x_t,X,y)=N\!\left(y_t\mid m_N^T\phi(x_t),\ \sigma_N^2(x_t)\right)$$

with

$$m_N=\beta S_N\Phi y,\qquad S_N^{-1}=\alpha I+\beta\Phi\Phi^T,\qquad \sigma_N^2(x_t)=\frac{1}{\beta}+\phi(x_t)^T S_N\phi(x_t)$$

A small sketch of computing this predictive distribution follows.
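Here is a minimal sketch of the Gaussian posterior predictive; the toy data, the degree-5 polynomial features, and the values of alpha and beta are my own assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)

degree = 5
Phi = np.vstack([x ** i for i in range(degree + 1)])           # (D, n), columns are phi(x_i)
alpha, beta = 2.0, 25.0

S_N = np.linalg.inv(alpha * np.eye(Phi.shape[0]) + beta * Phi @ Phi.T)
m_N = beta * S_N @ Phi @ y                                     # posterior mean of w

# Predictive mean and variance at new inputs x_t
x_t = np.linspace(0, 1, 5)
phi_t = np.vstack([x_t ** i for i in range(degree + 1)])       # (D, 5)
pred_mean = phi_t.T @ m_N
pred_var = 1.0 / beta + np.sum(phi_t * (S_N @ phi_t), axis=0)  # diagonal of phi_t^T S_N phi_t

print(pred_mean)
print(np.sqrt(pred_var))   # predictive standard deviation at each x_t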
(Appendix) Homework Code
This roughly implements everything covered in this post. (The big picture of the code should be fine, but some details may still be off; I got tired of fixing it, so it stays like this for now.) (a):
import numpy as np
import matplotlib.pyplot as plt

# X_train, X_test, y_train, y_test are assumed to be loaded elsewhere, with shapes (N, 1).

def linear_features(X_train, X_test, y_train, y_test, _lambda=0.01):
    """
    param:
        X_train(ndarray): shape = (N, 1)
        y_train(ndarray): shape = (N, 1)
        _lambda(float): ridge coefficient
    return:
        rmse_train(float), rmse_test(float)
    """
    def add_dim_bias(X):
        """
        [x1, x2, x3, ..., xn] -> [[x1, 1], [x2, 1], ..., [xn, 1]] (then transposed)
        param:
            X(ndarray): data, shape (N, 1)
        return:
            new_X(ndarray): data with bias feature, shape (2, N)
        """
        N_x = X.shape[0]
        bias_dim = np.ones((N_x, 1))
        new_X = np.c_[X, bias_dim].T
        return new_X

    def loss_fn(X, y, N):
        """Calculate the RMSE of the predictions X.T @ w against y."""
        rmse = np.sqrt(np.sum((X.T @ w - y) ** 2) / N)
        return rmse

    X_train = add_dim_bias(X_train)
    X_test = add_dim_bias(X_test)
    N_train = X_train.shape[1]
    N_test = X_test.shape[1]
    # Regularized least squares solution: w = (X X^T + lambda I)^{-1} X y, shape (2, 1)
    w = np.linalg.inv(X_train @ X_train.T + _lambda * np.eye(X_train.shape[0])) @ X_train @ y_train
    rmse_train = loss_fn(X_train, y_train, N_train)
    rmse_test = loss_fn(X_test, y_test, N_test)

    # Plot the training points and the fitted line
    plt.figure()
    x = np.linspace(X_train[0, :].min(), X_train[0, :].max(), 100)
    x = add_dim_bias(x)
    for i in range(X_train.shape[1]):
        plt.scatter(X_train[0, i], y_train[i, 0], marker="o", color="black")
    plt.plot(x[0].flatten(), (w.T @ x).flatten(), color="blue")
    plt.title("linear_features")
    plt.show()
    return rmse_train, rmse_test

def main_a():
    rmse_train, rmse_test = linear_features(X_train, X_test, y_train, y_test, _lambda=0.01)
    print("root mean squared error of the training data: ", rmse_train)
    print("root mean squared error of the test data: ", rmse_test)
(b):
def polynomial_features(X_train, X_test, y_train, y_test, _lambda=0.01, degrees=[2, 3, 4]):
    """
    param:
        X_train(ndarray): shape (N, 1)
        y_train(ndarray): shape (N, 1)
        _lambda(float): ridge coefficient
        degrees(list): list of polynomial degrees to fit
    """
    def add_degree(X, degree):
        # Build polynomial features (1, x, x^2, ..., x^degree), returned with shape (degree+1, N)
        N = X.shape[0]
        new_X = np.ones((N, 1))
        for i in range(1, degree + 1):
            temp = X ** i
            new_X = np.c_[new_X, temp]
        new_X = new_X.T
        return new_X

    def loss_fn(w, X, y):
        N = len(y)
        rmse = np.sqrt(np.sum((X.T @ w - y) ** 2) / N)
        return rmse

    w_lst = []
    rmse_train = []
    rmse_test = []
    # Fit a ridge-regularized polynomial for each degree
    for i in range(len(degrees)):
        tmp_X = add_degree(X_train, degrees[i])
        tmp_w = np.linalg.inv(tmp_X @ tmp_X.T + _lambda * np.eye(tmp_X.shape[0])) @ tmp_X @ y_train
        w_lst.append(tmp_w)
    for i in range(len(degrees)):
        tmp_X_train = add_degree(X_train, degrees[i])
        tmp_X_test = add_degree(X_test, degrees[i])
        tmp_rmse_train = loss_fn(w_lst[i], tmp_X_train, y_train)
        tmp_rmse_test = loss_fn(w_lst[i], tmp_X_test, y_test)
        rmse_train.append(tmp_rmse_train)
        rmse_test.append(tmp_rmse_test)
    for i in range(len(degrees)):
        print("polynomials of degrees={}, root mean squared error of the training data is: {}".format(degrees[i], rmse_train[i]))
    for i in range(len(degrees)):
        print("polynomials of degrees={}, root mean squared error of the test data is: {}".format(degrees[i], rmse_test[i]))
    # Plot each fitted polynomial together with the training points
    for i in range(len(w_lst)):
        x = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), 100)
        plt.figure()
        for j in range(X_train.shape[0]):
            plt.scatter(X_train[j, 0], y_train[j, 0], marker="o", color="black")
        x = add_degree(x, degrees[i])
        plt.plot(x[1].flatten(), (w_lst[i].T @ x).flatten(), color="blue")
        plt.title("polynomial_features with degree = {}".format(degrees[i]))
        plt.show()

def main_b():
    polynomial_features(X_train, X_test, y_train, y_test)
(c):
def bayesian_linear_regression(X_train, y_train, X_test, y_test, mu=0, sigma=0.1, _lambda=0.01, std=[1, 2, 3]):
    def add_dim_bias(X):
        """
        [x1, x2, x3, ..., xn] -> [[x1, 1], [x2, 1], ..., [xn, 1]] (then transposed)
        param:
            X(ndarray): data, shape (N, 1)
        return:
            new_X(ndarray): data with bias feature, shape (2, N)
        """
        N_x = X.shape[0]
        bias_dim = np.ones((N_x, 1))
        new_X = np.c_[X, bias_dim].T
        return new_X

    def prediction_mu_sigma(X):
        # Predictive mean X^T w and variance 1/beta + x^T S_N x with S_N = (alpha I + beta X X^T)^{-1}
        mu_or_pred = X.T @ w  # (N, 1)
        sigma_2 = 1 / beta + X.T @ np.linalg.inv(alpha * np.eye(X.shape[0]) + beta * X @ X.T) @ X
        # Add an extra sigma to the diagonal of the predictive covariance
        sigma_2 += sigma * np.eye(sigma_2.shape[1])
        return mu_or_pred, np.sqrt(sigma_2.diagonal())

    def loss_fn(X, y):
        N = len(y)
        pred, _ = prediction_mu_sigma(X)
        rmse = np.sqrt(np.sum((pred - y) ** 2) / N)
        return rmse

    def log_likelihood_fn(X, y):
        n = len(y)
        # Average Gaussian log-likelihood; the residual y - X^T w has shape (n, 1)
        log_likelihood = n / 2 * (np.log(beta) - np.log(2 * np.pi)) - beta / 2 * (np.linalg.norm(y - X.T @ w) ** 2)
        average_ll = log_likelihood / n
        return average_ll

    beta = 1 / (sigma ** 2)           # noise precision
    alpha = _lambda * beta            # prior precision, so that alpha / beta = _lambda
    X_train = add_dim_bias(X_train)   # (2, N_train), y_train (N_train, 1)
    X_test = add_dim_bias(X_test)     # (2, N_test),  y_test  (N_test, 1)
    # MAP / ridge solution: w = (X X^T + lambda I)^{-1} X y, shape (2, 1)
    w = np.linalg.inv(X_train @ X_train.T + _lambda * np.eye(X_train.shape[0])) @ X_train @ y_train
    rmse_train = loss_fn(X_train, y_train)
    rmse_test = loss_fn(X_test, y_test)
    log_ll_train = log_likelihood_fn(X_train, y_train)
    log_ll_test = log_likelihood_fn(X_test, y_test)
    print("the RMSE of the train data is: {}".format(rmse_train))
    print("the RMSE of the test data is: {}".format(rmse_test))
    print("the average log-likelihood of the train data is: {}".format(log_ll_train))
    print("the average log-likelihood of the test data is: {}".format(log_ll_test))

    # Plot the predictive mean with 1/2/3 standard-deviation bands
    pred, x_std = prediction_mu_sigma(X_train)
    plt.figure()
    for j in range(X_train.shape[1]):
        plt.scatter(X_train[0, j], y_train[j, 0], marker="o", color="black")
    plt.plot(X_train[0], pred.flatten(), color="blue")
    for i in range(len(std)):
        plt.fill_between(X_train[0], pred.flatten() + std[i] * x_std, pred.flatten() - std[i] * x_std,
                         color="blue", alpha=0.2)
    plt.title("bayesian_linear_regression")
    plt.show()

def main_c():
    bayesian_linear_regression(X_train, y_train, X_test, y_test)
(d):
def squared_exponential_features(X_train, y_train, X_test, y_test, _lambda=0.01, k=20, sigma=0.1, beta=10, std=[1, 2, 3]):
    def new_data(X):
        # Map x to (1, exp(-beta/2 (x - a_1)^2), ..., exp(-beta/2 (x - a_k)^2)) with centers a_j = 0.1 j - 1,
        # returned with shape (k+1, N)
        X_poly = np.ones(X.shape)
        for j in range(1, k + 1):
            X_poly = np.hstack((X_poly, np.exp(-beta / 2 * np.power(X - (j * 0.1 - 1), 2))))
        return X_poly.T

    def prediction_mu_sigma(X):
        mu_or_pred = X.T @ w
        sigma_2 = 1 / beta + X.T @ np.linalg.inv(alpha * np.eye(X.shape[0]) + beta * X @ X.T) @ X
        sigma_2 += sigma * np.eye(sigma_2.shape[1])
        return mu_or_pred, np.sqrt(sigma_2.diagonal())

    def loss_fn(X, y):
        N = len(y)
        pred, _ = prediction_mu_sigma(X)
        rmse = np.sqrt(np.sum((pred - y) ** 2) / N)
        return rmse

    def log_likelihood_fn(X, y):
        _beta = 1 / (sigma ** 2)   # noise precision (distinct from the feature-width beta above)
        n = len(y)
        log_likelihood = n / 2 * (np.log(_beta) - np.log(2 * np.pi)) - _beta / 2 * (np.linalg.norm(y - X.T @ w) ** 2)
        average_ll = log_likelihood / n
        return average_ll

    orignal_data = X_train.copy()
    X_train = new_data(X_train)
    X_test = new_data(X_test)
    alpha = _lambda * beta
    w = np.linalg.pinv(X_train @ X_train.T + _lambda * np.eye(X_train.shape[0])) @ X_train @ y_train
    rmse_train = loss_fn(X_train, y_train)
    rmse_test = loss_fn(X_test, y_test)
    log_ll_train = log_likelihood_fn(X_train, y_train)
    log_ll_test = log_likelihood_fn(X_test, y_test)
    print("the RMSE of the train data is: {}".format(rmse_train))
    print("the RMSE of the test data is: {}".format(rmse_test))
    print("the average log-likelihood of the train data is: {}".format(log_ll_train))
    print("the average log-likelihood of the test data is: {}".format(log_ll_test))

    # Sort by the original inputs so the plotted curve and bands are drawn left to right
    pred, x_std = prediction_mu_sigma(X_train)
    sorted_idx = np.argsort(orignal_data[:, 0])
    orignal_data = orignal_data[sorted_idx]
    pred = pred.flatten()[sorted_idx]
    x_std = x_std[sorted_idx]
    y_train = y_train[sorted_idx]
    plt.figure()
    for j in range(orignal_data.shape[0]):
        plt.scatter(orignal_data[j, 0], y_train[j, 0], marker="o", color="black")
    plt.plot(orignal_data[:, 0], pred, color="blue")
    for i in range(len(std)):
        plt.fill_between(orignal_data[:, 0], pred + std[i] * x_std, pred - std[i] * x_std,
                         color="blue", alpha=0.2)
    plt.title("squared_exponential_features")
    plt.show()
(Figures omitted: the resulting plots.)