Linear Regression – Boston Housing Prices
A few days ago, the machine learning instructor assigned us a homework exercise.
Lab Requirements
- Write a linear regression model (update the parameters with gradient descent; use mean squared error as the cost function; add a function that computes the accuracy, i.e. the R² score, since this is regression)
- Use the Boston housing dataset as the experimental data (load_boston; see the dataset description)
- Train models with both the hand-written linear regressor and sklearn's (see the API docs for sklearn's linear models)
- Improve the model with polynomial regression
Lab Procedure
First, we import the libraries we need (PyCharm makes it easy to auto-import the required classes, so this step isn't worth dwelling on):
from sklearn import preprocessing
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
1. Load the dataset, split the data, and print basic information about the training and test sets
boston = load_boston()
print(boston.keys())
print(boston['data'].shape,boston['target'].shape)
X = preprocessing.scale(boston['data'])
Y = preprocessing.scale(boston['target'])
data_size = Y.shape[0]
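One caveat: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the snippet above needs an older scikit-learn. On newer versions the data can be fetched from the original source instead; here is a sketch adapted from scikit-learn's deprecation notice (assumes pandas is available):
import pandas as pd
# the raw file stores each record across two physical lines
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
target = raw_df.values[1::2, 2]                                     # house prices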
preprocessing.scale standardizes the data (the default is axis=0), giving each feature zero mean and unit variance; for a walkthrough of the source, see https://blog.csdn.net/qikaihuting/article/details/82633882
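As a quick sanity check (a minimal sketch, reusing X from above), every scaled column should have mean ≈ 0 and standard deviation ≈ 1:
print(X.mean(axis=0).round(6))  # ~0 for every feature
print(X.std(axis=0).round(6))   # ~1 for every feature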
2. Implement linear regression
class LinerRegression:
    def __init__(self, learning_rate=0.01, max_iter=100):
        self.max_iter = max_iter
        self.lr = learning_rate
        self.theta = None        # weights and bias, initialized in fit()
        self.coef_ = None
        self.intercept_ = None
        self.loss_arr = []       # loss recorded after every iteration

    def fit(self, x, y):
        if x.shape[0] != y.shape[0]:
            raise Exception("Error! X.shape and Y.shape are incompatible")
        # append a column of ones so the bias is folded into theta
        self.x = np.hstack([x, np.ones((len(x), 1))])
        self.y = y
        self.theta = np.zeros(self.x.shape[1])
        for i in range(self.max_iter):
            self._train_step()
            self.loss_arr.append(self.loss())
        self.coef_ = self.theta[:-1]
        self.intercept_ = self.theta[-1]

    def _train_step(self):
        # one gradient-descent update: theta <- theta - lr * gradient
        self.theta = self.theta - self.lr * self._batch_gradient_descent(self.x, self.y)

    def _batch_gradient_descent(self, x, y):
        # gradient of the MSE: X^T (X theta - y) / m; the constant factor 2
        # from Eq. (3.10) is absorbed into the learning rate
        inner = x.T.dot(x.dot(self.theta) - y)
        return inner / x.shape[0]

    def loss(self, y_true=None, y_predict=None):
        # mean squared error; defaults to the stored training data
        if y_true is None or y_predict is None:
            y_true = self.y
            y_predict = self.x.dot(self.theta)
        return np.mean((y_true - y_predict) ** 2)

    def predict(self, x_p):
        if x_p.shape[1] != len(self.coef_):
            raise Exception('the feature number of X_predict must equal X_train')
        x = np.hstack([x_p, np.ones((len(x_p), 1))])
        return x.dot(self.theta)

    def score(self, y_true, y_pred):
        # R^2 = 1 - RSS / TSS
        total = np.sum((y_true - np.mean(y_true)) ** 2)
        residual = np.sum((y_true - y_pred) ** 2)
        return 1 - residual / total
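As a quick correctness check (a sketch, reusing the scaled X and Y from above), the θ learned by gradient descent should approach the closed-form least-squares solution, given a generous iteration budget:
# compare gradient descent against np.linalg.lstsq on the augmented matrix
Xb = np.hstack([X, np.ones((len(X), 1))])          # same augmentation as fit()
theta_closed, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
model = LinerRegression(learning_rate=0.1, max_iter=5000)
model.fit(X, Y)
print(np.allclose(model.theta, theta_closed, atol=1e-3))  # True once converged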
The loss function is the mean squared error (the standard choice for regression tasks); see Eq. (2.2) of the watermelon book (Zhou Zhihua's Machine Learning):
E = \frac{1}{m}\sum_{i = 1}^{m}(f(x_i)-y_i)^2
Gradient descent follows the gradient given in Eq. (3.10) of the same book:
\frac{\partial E}{\partial \theta} = 2 X^T(X\theta - y)
Here X is the training matrix augmented with a column of ones and \theta stacks the weights and the bias, so X\theta is the vector of predictions (this is exactly what the np.hstack call in fit() sets up).
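In code, each _train_step applies the standard gradient-descent update rule; note that the constant factor 2 from Eq. (3.10) is absorbed into the learning rate \alpha:
\theta \leftarrow \theta - \alpha \cdot \frac{1}{m} X^T(X\theta - y)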
The model is scored with R^2 = 1 - RSS / TSS,
where
- TSS (Total Sum of Squares) is the sum of squared deviations of the actual values from their mean, i.e. the total variation of the target
- RSS (Residual Sum of Squares) is the sum of squared deviations of the actual values from the predictions, i.e. the variation the model fails to explain
R² is at most 1 (it can go negative for a model worse than always predicting the mean); the closer to 1, the better the fit.
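Written out in full, with \hat{y}_i the predictions and \bar{y} the mean of the actual values:
R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2}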
3. Train models with both the hand-written linear regressor and sklearn's, and print the score on the test set to 3 decimal places
shuffled_index = np.random.permutation(data_size)
x = X[shuffled_index]
y = Y[shuffled_index]
split_index = int(data_size * 0.7)
x_train = x[:split_index]
y_train = y[:split_index]
x_test = x[split_index:]
y_test = y[split_index:]
regr = LinerRegression(learning_rate=0.1, max_iter=150)
regr.fit(x_train, y_train)
print('loss:\t{:.3f}'.format(regr.loss()))
print('weights:\t' + str(regr.coef_))
print('intercept:\t{:.3f}'.format(regr.intercept_))
y_pred = regr.predict(x_test)                        # evaluate on the held-out split
print('score:\t{:.3f}'.format(regr.score(y_test, y_pred)))
plt.scatter(np.arange(len(regr.loss_arr)), regr.loss_arr, marker='o', c='red')
plt.show()
The loss during training evolves as shown in the plot.
The score (R²) comes out to 0.72.
Next we use sklearn's off-the-shelf LinearRegression; what it does internally mirrors the hand-written version above.
x_train, x_test, y_train, y_test = train_test_split(boston['data'], boston['target'], test_size=0.3)
# note: scaling each split independently lets test statistics leak into the
# preprocessing; the sketch after this block shows the stricter version
x_train = preprocessing.scale(x_train)
x_test = preprocessing.scale(x_test)
y_train = preprocessing.scale(y_train)
y_test = preprocessing.scale(y_test)
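A leak-free alternative to the four scale() calls above (a sketch using StandardScaler, fit on the training split only; preprocessing is already imported):
x_scaler = preprocessing.StandardScaler().fit(x_train)   # learn mean/std on train only
x_train = x_scaler.transform(x_train)
x_test = x_scaler.transform(x_test)                      # reuse the train statistics
y_scaler = preprocessing.StandardScaler().fit(y_train.reshape(-1, 1))
y_train = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()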
lr = LinearRegression()
lr.fit(x_train, y_train)
print('weights:\t', lr.coef_)
print('intercept:\t', lr.intercept_)
score = lr.score(x_test, y_test)   # R^2 on the test split
print('score:\t%.3f' % score)
The score (R²) is 0.76.
4. Optimize the model with polynomial regression
Use PolynomialFeatures() to add polynomial and interaction features to the training set. With degree=2, two features [a, b] expand to [1, a, b, a², ab, b²], so the 13 Boston features become 105 columns:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
lr = LinearRegression()
poly = PolynomialFeatures(degree=2)
print('x_train.shape:',x_train.shape)
x_train_poly = poly.fit_transform(x_train)
print('x_train_poly.shape:',x_train_poly.shape)
lr.fit(x_train_poly, y_train)
y_p = lr.predict(x_train_poly)   # note: predictions on the training split
print("score:")
print(r2_score(y_train, y_p))
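The r2_score above is computed on the training split. To check generalization, evaluate on the test split as well (a sketch, reusing x_test and y_test from the split above):
x_test_poly = poly.transform(x_test)   # reuse the transformer fit on x_train
print('test R^2: %.3f' % r2_score(y_test, lr.predict(x_test_poly)))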
With the polynomial features, the training-set score climbs to 0.9 or more.
Mistakes I Ran Into
When I first got the assignment, without thinking twice I assumed it was plain univariate linear regression, y = wx + b: I split the 13 features into 13 separate datasets and trained each one against y independently. The result looked like this:
That's right, each color is one feature (┓( ´∀` )┏)
Taken individually, some features weren't that far off; you could at least recognize a linear fit in the plot.
But some of them were just...