Datawhale Study Group, Session 27: Ensemble Learning

This study follows the tutorial videos of the instructor, 萌弟. This post is a learning log, and questions or discussion are welcome at any time. Some parts may still be incomplete; they will be filled in as my knowledge accumulates.

Start date: July 23, 2021. Last updated: July 25, 2021 (Task 6: Boosting)
I. Boosting Concepts
1. Core idea
- Learn from the same training set repeatedly.
- Lower the test error by reducing bias; this is the essential difference from Bagging, which works by reducing variance.
2. Common models (see the instantiation sketch after this list)
- AdaBoost
- Gradient Boosting (GBDT)
- XGBoost
- LightGBM
- CatBoost
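
All of these expose essentially the same fit/predict interface. A minimal instantiation sketch, assuming the xgboost, lightgbm, and catboost packages are installed (the hyperparameters shown are illustrative, not tuned):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier        # pip install xgboost
from lightgbm import LGBMClassifier      # pip install lightgbm
from catboost import CatBoostClassifier  # pip install catboost

# each of these supports fit(X, y) / predict(X) on (n_samples, n_features) data
models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(n_estimators=100),
    "LightGBM": LGBMClassifier(n_estimators=100),
    "CatBoost": CatBoostClassifier(iterations=100, verbose=0),
}
```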
3. Theoretical origin
- Strong learnability vs. weak learnability
  - Weakly learnable: the error rate is below 1/2, i.e. accuracy only slightly better than random guessing
  - Strongly learnable: a polynomial-time learning algorithm whose accuracy can be made very high
- In the PAC (probably approximately correct) learning framework, strong learnability and weak learnability are equivalent.
  - This means that once a weak learning algorithm is found, it can be "boosted" into a strong one.
- Most boosting methods work by repeatedly changing the probability distribution of the training data (the weights of individual samples) and calling the weak learning algorithm on each reweighted distribution, producing a sequence of weak classifiers.
- Every boosting method must therefore answer two key questions (see the numpy sketch after this list):
  - How should the data distribution (the sample weights) be changed at each round?
  - How should the individual weak classifiers be combined?
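
A minimal numpy sketch of the reweighting idea, with made-up labels and predictions purely for illustration: after one round, the misclassified sample carries half of the total weight, so the next weak learner is forced to focus on it.

```python
import numpy as np

y = np.array([1, -1, 1, 1])             # true labels
pred = np.array([1, 1, 1, 1])           # a weak classifier that errs on sample 1
w = np.ones(4) / 4                      # current distribution over samples

err = np.sum(w * (pred != y))           # weighted error: 0.25
alpha = 0.5 * np.log((1 - err) / err)   # classifier coefficient: ~0.549
w = w * np.exp(-alpha * y * pred)       # up-weight the mistake, down-weight the rest
w /= w.sum()                            # renormalize to a distribution
print(w)                                # [1/6, 1/2, 1/6, 1/6]
```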
II. AdaBoost
1. Basic idea
- Increase the weights of the samples the previous round's classifier misclassified, and decrease the weights of the correctly classified samples.
- Combine the weak classifiers by a weighted vote.
- AdaBoost effectively increases the model's complexity and thereby reduces bias.
2. Basic steps
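The standard procedure, for N training samples with labels y_i in {-1, +1} and M rounds (this is exactly what the code below implements):
- Initialize uniform sample weights: w_i = 1/N.
- For each round m = 1, ..., M:
  - Fit a weak classifier G_m(x) to the training data under the current weights.
  - Compute its weighted error: e_m = sum_i w_i * I(G_m(x_i) != y_i).
  - Compute its voting coefficient: alpha_m = 0.5 * ln((1 - e_m) / e_m), so a smaller error earns a larger vote.
  - Update the weights, w_i <- w_i * exp(-alpha_m * y_i * G_m(x_i)), and renormalize them to sum to 1; misclassified samples gain weight.
- Output the final classifier: G(x) = sign(sum_m alpha_m * G_m(x)).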
3. Code implementation
- Dataset: diabetes.csv, a binary classification problem (8 features plus a 0/1 outcome column)
- Weak classifier: a decision stump (a single-split, depth-one decision tree)
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


class DecisionStump:
    """Decision stump: thresholds a single feature.

    Expects x with shape (n_features, n_samples), i.e. already transposed,
    and labels y in {-1, +1}.
    """

    def __init__(self, x, y):
        self.x = np.array(x)
        self.y = np.array(y)
        self.n = self.x.shape[0]      # number of features
        self.w = None                 # sample weights, set in train()
        self.threshold_value = 0      # best threshold found
        self.threshold_pos = 0        # index of the feature that is split
        self.threshold_res = 0        # split direction, in {-1, +1}

    def train(self, w, steps=1000):
        """Search every feature and both split directions for the threshold
        with the smallest weighted error.

        w is a length-n_samples vector of sample weights.
        threshold_value is the threshold, threshold_pos the feature index,
        and threshold_tag the direction in {-1, +1}: with tag +1, samples at
        or above the threshold are predicted +1 (ties fall on the +1 side);
        with tag -1 the prediction is mirrored.
        """
        min_err = float("inf")
        threshold_value = 0
        threshold_pos = 0
        threshold_tag = 0
        self.w = np.array(w)
        for i in range(self.n):
            for tag in (1, -1):                      # try both directions
                value, err = self.findmin(i, tag, steps)
                if err < min_err:
                    min_err = err
                    threshold_value = value
                    threshold_pos = i
                    threshold_tag = tag
        self.threshold_value = threshold_value
        self.threshold_pos = threshold_pos
        self.threshold_res = threshold_tag
        print(self.threshold_value, self.threshold_pos, self.threshold_res)
        return min_err

    def findmin(self, i, tag, steps):
        """Scan `steps` evenly spaced thresholds over feature i's range and
        return (best threshold, its weighted error)."""
        bottom = np.min(self.x[i, :])
        up = np.max(self.x[i, :])
        min_err = float("inf")
        value = 0
        st = (up - bottom) / steps                   # step size of the scan
        for t in np.arange(bottom, up, st):
            pred = self.predintrain(self.x, i, t, tag).flatten()
            err = np.sum((pred != self.y) * self.w)  # weighted error
            if err < min_err:
                min_err = err
                value = t
        return value, min_err

    def predintrain(self, test_set, i, t, tag):
        """Predict with a candidate stump (feature i, threshold t, direction tag)."""
        test_set = np.array(test_set).reshape(self.n, -1)
        pre_y = np.ones((test_set.shape[1], 1))
        pre_y[test_set[i, :] * tag < t * tag] = -1
        return pre_y

    def pred(self, test_x):
        """Predict with the trained stump; test_x has shape (n_features, n_samples)."""
        test_x = np.array(test_x).reshape(self.n, -1)
        pre_y = np.ones((test_x.shape[1], 1))
        pre_y[test_x[self.threshold_pos, :] * self.threshold_res <
              self.threshold_value * self.threshold_res] = -1
        return pre_y
```
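
A quick sanity check of the stump on hypothetical toy data (one feature, four samples, separable at 2.5), just to show the expected shapes and label convention:

```python
toy_x = [[1.0, 2.0, 3.0, 4.0]]       # shape (n_features, n_samples)
toy_y = [-1, -1, 1, 1]
stump = DecisionStump(toy_x, toy_y)
err = stump.train(w=np.ones(4) / 4)  # uniform weights; err should be 0.0
print(stump.pred(toy_x).flatten())   # expected: [-1. -1.  1.  1.]
```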
- Define the AdaBoost class, which trains the stumps round by round and combines them by a weighted vote:

```python
class AdaBoost:
    """AdaBoost with a pluggable weak learner (decision stump by default)."""

    def __init__(self, x, y, weaker=DecisionStump):
        self.x = np.array(x)                 # shape (n_features, n_samples)
        self.y = np.array(y).flatten("F")    # labels in {-1, +1}
        self.weaker = weaker
        self.sums = np.zeros(self.y.shape)   # running weighted vote on the training set
        # initialize the sample weights uniformly: w_i = 1 / n_samples
        self.w = np.ones(self.x.shape[1]) / self.x.shape[1]
        self.q = 0                           # index of the last trained weak classifier
        self.g = {}                          # weak classifiers, g[i]
        self.alpha = {}                      # their voting coefficients, alpha[i]

    def train(self, m=5):
        """Train up to m weak classifiers, stopping early once the ensemble's
        training error reaches zero."""
        for i in range(m):
            # fit the i-th weak classifier on the current weight distribution
            self.g[i] = self.weaker(self.x, self.y)
            e = self.g[i].train(self.w)      # weighted training error e_i
            e = np.clip(e, 1e-10, 1 - 1e-10) # guard against e == 0 or e == 1
            # voting coefficient: alpha_i = 0.5 * ln((1 - e_i) / e_i)
            self.alpha[i] = 0.5 * np.log((1 - e) / e)
            res = self.g[i].pred(self.x)
            print("weak classifier acc", accuracy_score(self.y, res.flatten()))
            print("=" * 54)
            # reweight: misclassified samples (y * pred = -1) gain weight
            z = self.w * np.exp(-self.alpha[i] * self.y * res.flatten("F"))
            self.w = z / z.sum()             # renormalize to a distribution
            self.q = i
            if self.errorcnt(i) == 0:
                print("%d weak classifiers reduced the training error to 0" % (i + 1))
                break

    def errorcnt(self, t):
        """Add classifier t's vote to the running sum and return how many
        training samples the current ensemble misclassifies."""
        self.sums = self.sums + self.g[t].pred(self.x).flatten("F") * self.alpha[t]
        pre_y = np.zeros(self.sums.shape)
        pre_y[self.sums >= 0] = 1
        pre_y[self.sums < 0] = -1
        return (pre_y != self.y).sum()

    def pred(self, test_x):
        """Final classifier: the sign of the alpha-weighted vote."""
        test_x = np.array(test_x)
        sums = np.zeros(test_x.shape[1])
        for i in range(self.q + 1):
            sums = sums + self.g[i].pred(test_x).flatten("F") * self.alpha[i]
        pre_y = np.zeros(sums.shape)
        pre_y[sums >= 0] = 1
        pre_y[sums < 0] = -1
        return pre_y
```
- Load the data, adapt it to the expected shape and label convention, then train and evaluate:

```python
# load the diabetes dataset: 8 features, binary outcome in column 8
data_set = pd.read_csv("diabetes.csv")
x = data_set.values[:, :8]
y = data_set.values[:, 8]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=0)

# the classes above expect data of shape (n_features, n_samples) and labels
# in {-1, +1}, so transpose the features and map label 0 -> -1
x_train = x_train.transpose()
x_test = x_test.transpose()
y_train[y_train == 0] = -1
y_test[y_test == 0] = -1
print(x_train.shape, x_test.shape)

ada = AdaBoost(x_train, y_train)
ada.train(10)
y_pre = ada.pred(x_test)
print("total test", len(y_pre))
print("correctly predicted", len(y_pre[y_pre == y_test]))
print("acc", accuracy_score(y_test, y_pre))
```
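
As a cross-check, the same split can be fed to scikit-learn's built-in AdaBoostClassifier. Note that sklearn expects data as (n_samples, n_features), so the transpose is undone; it also accepts the {-1, +1} labels directly:

```python
from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=10, random_state=0)
clf.fit(x_train.transpose(), y_train)   # samples back in rows
print("sklearn acc", clf.score(x_test.transpose(), y_test))
```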
III. References
- https://github.com/datawhalechina/ensemble-learning
- https://www.bilibili.com/video/BV1Mb4y1o7ck?t=470