Supervised Learning: Linear Regression
1: Algorithm Principles
Multiple linear regression: each x represents one feature
Model:
$$
\left[ \begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix} \right] = \left[ \begin{matrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{matrix} \right] \left[ \begin{matrix} w_0 \\ w_1 \\ \vdots \\ w_p \end{matrix} \right] = w_0 + XW
$$
Here n is the number of samples and p the number of features. In the last expression, X is the n×p feature matrix without the column of ones and W = (w_1, ..., w_p)^T.
Goal: find the best W
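Method: least squares
Key concept: the loss function
Loss function: the worse the model performs, the more information is lost. Here the loss is the sum of squared errors (SSE), i.e. the sum of squared differences between the predicted and the observed values.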
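To make the least-squares idea concrete, here is a minimal sketch that fits the weights with NumPy alone and checks that moving away from the solution increases the SSE. The synthetic data (y = 2x - 5 plus noise), the helper `sse`, and the variable names are assumptions chosen for illustration, not part of the original notes.

```python
import numpy as np

# Synthetic data, assumed for illustration: y = 2x - 5 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 * x - 5 + rng.normal(size=50)

# Design matrix with a leading column of ones so that w[0] is the intercept.
A = np.column_stack([np.ones_like(x), x])

# Least squares: choose w to minimise SSE = sum((y - A @ w) ** 2).
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(weights):
    """Sum of squared errors between observed y and predictions A @ weights."""
    residuals = y - A @ weights
    return float(residuals @ residuals)

print("intercept, slope:", w)
print("SSE at the least-squares solution:", sse(w))
print("SSE at a perturbed solution:", sse(w + np.array([0.1, 0.0])))
```

Any other choice of weights gives a larger SSE than the least-squares solution, which is what "best W" means in this setting.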
Code for simple (one-variable) linear regression:
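import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate 50 noisy samples around the line y = 2x - 5
np.random.seed(100)
x = np.random.rand(50)*10
np.random.seed(120)
y = 2*x - 5 + np.random.randn(50)
plt.plot(x,y,'.')
# scikit-learn expects a 2-D feature matrix, so reshape x to shape (50, 1)
lr = LinearRegression(fit_intercept=True)
x_x = x.reshape(-1,1)
lr = lr.fit(x_x,y)
lr.coef_       # fitted slope, expected to be close to 2
lr.intercept_  # fitted intercept, expected to be close to -5
# Draw the fitted line over the original points
x_prt = np.linspace(0,10,100)
x_prt = x_prt.reshape(-1,1)
y_prt = lr.predict(x_prt)
plt.plot(x_prt,y_prt)
plt.plot(x,y,'.')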
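2: House Price Prediction Demo
Measures of model fit:
- MSE: mean squared error (SSE is the sum of squared errors; MSE = SSE / m, where m is the number of samples)
- R²: the proportion of the variance in the target that the model explains (see 3.3)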
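The relationship MSE = SSE / m can be checked by hand. The sketch below uses four made-up observed and predicted values (hypothetical numbers, not taken from the housing data).

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

m = len(y_true)                        # number of samples
sse = np.sum((y_true - y_pred) ** 2)   # sum of squared errors
mse = sse / m                          # mean squared error

print("SSE =", sse)
print("MSE =", mse)
```

Because MSE divides by the sample count, it stays comparable between a training set and a test set of different sizes, which SSE does not.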
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing
# Load the California housing data: features as a DataFrame, target as an array
house_data = fetch_california_housing()
x_house = pd.DataFrame(house_data.data,columns=house_data.feature_names)
print(x_house.head())
y_house = house_data.target
print(y_house)
# Hold out 30% of the samples as a test set
Xtrain,Xtest,Ytrain,Ytest = train_test_split(x_house,y_house,test_size=0.3,random_state=420)
lr = LinearRegression()
lr = lr.fit(Xtrain,Ytrain)
3: Model Evaluation
3.1 MSE (mean squared error)
MSE metric:
from sklearn.metrics import mean_squared_error
# MSE on the training set
y_yuce = lr.predict(Xtrain)
mse = mean_squared_error(Ytrain,y_yuce)
print('Training-set MSE =',mse)
# MSE on the test set
y_test_yuce = lr.predict(Xtest)
mse_test = mean_squared_error(Ytest,y_test_yuce)
print('Test-set MSE =',mse_test)
MSE with cross-validation:
ls2 = LinearRegression()
# The scorer returns the negated MSE, so every fold score is <= 0
mse = cross_val_score(ls2,Xtrain,Ytrain,cv=10,scoring='neg_mean_squared_error')
mse.mean()
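The `neg_mean_squared_error` scorer reports the MSE with its sign flipped, because scikit-learn scorers follow a "higher is better" convention. The sketch below shows this on synthetic data (the data and the names `neg_mse` and `cv_mse` are assumptions for illustration) and negates the mean to recover the usual MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, assumed for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.5, size=200)

# One negated MSE per fold; each value is <= 0.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=10,
                          scoring='neg_mean_squared_error')

cv_mse = -neg_mse.mean()   # flip the sign to recover the usual MSE
print("fold scores:", neg_mse)
print("cross-validated MSE:", cv_mse)
```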
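How to list the valid values of the scoring parameter:
import sklearn.metrics
# get_scorer_names() is available from scikit-learn 1.0; the older
# sklearn.metrics.SCORERS dictionary is no longer present in recent releases
sorted(sklearn.metrics.get_scorer_names())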
3.2 MAE (mean absolute error; serves a similar purpose to MSE, so either one can be used)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(Ytrain,y_yuce)
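3.3 R²
R² is based on variance: it measures how much of the information (variance) in the data the model captures
The closer R² is to 1, the better the model fits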
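As a rough illustration of what R² measures, the sketch below computes it by hand as 1 - SSE/SST on four made-up values (hypothetical numbers, not from the housing data). It also shows that always predicting the mean gives R² = 0.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

sse = np.sum((y_true - y_pred) ** 2)          # variation the model fails to explain
sst = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - sse / sst
print("R^2 =", r2)

# Predicting the mean for every sample gives R^2 = 0 by construction.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / sst
print("baseline R^2 =", r2_baseline)
```

A model that does worse than the mean baseline produces a negative R², so the value is bounded above by 1 but not below by 0.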
Option 1:
from sklearn.metrics import r2_score
r2 = r2_score(Ytrain,y_yuce)
print('Training-set R² =',r2)
r2_test = r2_score(Ytest,y_test_yuce)
print('Test-set R² =',r2_test)
Option 2:
# For a regressor, score() returns R²
lr.score(Xtrain,Ytrain)
lr.score(Xtest,Ytest)
Cross-validation:
lr2 = LinearRegression()
cross_val_score(lr2,Xtrain,Ytrain,cv=10,scoring='r2').mean()
3.4 Inspecting the model coefficients
Coefficients w
lr.coef_
list(zip(x_house.columns,lr.coef_))
Intercept
lr.intercept_
3.5 Model formula
From the (rounded) coefficients in 3.4:
y ≈ 0.43×MedInc+0.01×HouseAge-0.1×AveRooms+0.62×AveBedrms+0.0000005×Population-0.003×AveOccup-0.41×Latitude-0.42×Longitude-36.25
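The formula is just the fitted coefficients and intercept written out. A small sketch on synthetic data (two hypothetical features, not the housing columns) confirms that evaluating the formula by hand reproduces what `predict` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two hypothetical features, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 0.4 * X[:, 0] - 0.1 * X[:, 1] + 3.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Writing out y = w1*x1 + w2*x2 + intercept by hand ...
manual = X @ model.coef_ + model.intercept_

# ... gives the same numbers as model.predict.
print(np.allclose(manual, model.predict(X)))
```

With rounded coefficients, as in the formula above, the hand-computed values will only approximate the model's predictions.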
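4: Training Again After Standardizing the Data
Standardization: removes the effect of differing units and scales across features
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
# Learn each feature's mean and standard deviation from the training set only
Xtrain_std = std.fit_transform(Xtrain)
lr_std = LinearRegression()
lr_std = lr_std.fit(Xtrain_std,Ytrain)
lr_std.score(Xtrain_std,Ytrain)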
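To evaluate a standardized model on held-out data, the test set has to be transformed with the statistics learned from the training set. The sketch below does this on synthetic features with very different scales (all data and names are assumptions for illustration). It also shows that, for ordinary least squares, standardization changes the coefficients but not R².

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, assumed for illustration.
rng = np.random.default_rng(0)
scales = np.array([1.0, 1000.0])
X_train = rng.normal(size=(200, 2)) * scales
X_test = rng.normal(size=(50, 2)) * scales
true_w = np.array([2.0, 0.003])
y_train = X_train @ true_w + rng.normal(scale=0.1, size=200)
y_test = X_test @ true_w + rng.normal(scale=0.1, size=50)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # learn mean/std from training data
X_test_std = scaler.transform(X_test)         # reuse them; do not refit on test data

raw = LinearRegression().fit(X_train, y_train)
scaled = LinearRegression().fit(X_train_std, y_train)

# Same R^2 on the test set, but the scaled coefficients are comparable in size.
print(raw.score(X_test, y_test), scaled.score(X_test_std, y_test))
print(raw.coef_, scaled.coef_)
```

The practical benefit here is interpretability: after standardization the coefficient magnitudes can be compared across features.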
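5: Plotting the Fit
# True test-set values, sorted in ascending order
plt.scatter(range(len(Ytest)),sorted(Ytest),s=2,label='True')
# Reorder the predictions with the same ordering so the two point sets line up
index_sort = np.argsort(Ytest)
y2 = lr.predict(Xtest)
y3 = y2[index_sort]
plt.scatter(range(len(Ytest)),y3,s=1,label='Predict',alpha=0.3)
plt.legend()
plt.show()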
6: Multicollinearity
Multicollinearity means that some features are highly correlated with one another
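One simple way to spot highly correlated features is to look at the pairwise correlation matrix and at the condition number of the feature matrix. The sketch below uses synthetic features in which `x2` is nearly a copy of `x1`; the data and names are assumptions for illustration.

```python
import numpy as np

# Synthetic features, assumed for illustration: x2 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.01, size=500)   # highly correlated with x1
x3 = rng.normal(size=500)                    # independent feature
X = np.column_stack([x1, x2, x3])

# Pairwise Pearson correlations between the columns (features).
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 3))

# A large condition number is another warning sign: the least-squares
# solution becomes very sensitive to noise in the data.
cond_all = np.linalg.cond(X)
cond_reduced = np.linalg.cond(X[:, [0, 2]])   # same matrix without x2
print("condition number with x2:", cond_all)
print("condition number without x2:", cond_reduced)
```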
from sklearn.preprocessing import PolynomialFeatures
# Degree-4 polynomial terms of the original features are strongly correlated with each other
pl = PolynomialFeatures(degree=4).fit(x_house,y_house)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pl.get_feature_names_out()
x_trans = pl.transform(x_house)
Xtrain,Xtest,Ytrain,Ytest = train_test_split(x_trans,y_house,test_size=0.3,random_state=420)
result = LinearRegression().fit(Xtrain,Ytrain)
result.coef_
[*zip(pl.get_feature_names_out(x_house.columns),result.coef_)]
result.score(Xtrain,Ytrain)