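Machine Learning (the "watermelon book"), Exercise 8.5: a Python implementation of decision-tree-based Bagging
8.5 Implement Bagging with decision stumps as the base learners, train a Bagging ensemble on watermelon dataset 3.0a, and compare it with Figure 8.6.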
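A note up front: through my own mistake I misread "watermelon dataset 3.0a" as "watermelon dataset 3.0". The two datasets differ considerably, and everything discussed in this article is based on dataset 3.0 (the code loads 西瓜数据集3.0.csv). The core idea of the algorithm is unchanged, so please treat the article as a reference only.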
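Reference blog post:
Machine Learning (the "watermelon book"), Exercise 4.3: a Python implementation of a decision tree based on information-entropy splitting (simple and comprehensive)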
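1. Core idea of the model
The Bagging model discussed here consists of 7 decision tree learners. Training sets are constructed by bootstrap sampling, and the final prediction is decided by a vote among the 7 learners.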
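The voting step can be illustrated with a minimal sketch. The function name `majority_vote` and the prediction table are made up for illustration and are not part of the article's code; labels follow the convention 1 = good melon, 0 = bad melon.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-learner predictions into one label per sample.

    predictions: one list of 0/1 labels per learner, all of equal length.
    """
    n_samples = len(predictions[0])
    result = []
    for i in range(n_samples):
        # Count the labels that the learners assigned to sample i.
        votes = Counter(p[i] for p in predictions)
        result.append(votes.most_common(1)[0][0])
    return result


# Hypothetical predictions from 7 learners on 3 samples.
preds = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(majority_vote(preds))  # [1, 0, 0]
```

With an odd number of learners and two classes, a tie cannot occur, which is one practical reason to pick 7 rather than an even number.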
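Since the focus of this exercise is the Bagging algorithm itself, the decision trees are built with Python's sklearn library.
The training set is constructed with the bootstrap method: 17 random draws with replacement yield a training set of 17 records, and the original dataset serves as the test set.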
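As a rough illustration of bootstrap sampling, the sketch below draws 17 row indices with replacement and lists the rows that were never drawn. The helper name `bootstrap_indices` and the fixed seed are assumptions made for this example only.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]


n = 17
rng = random.Random(0)  # fixed seed, used here only for reproducibility
sample = bootstrap_indices(n, rng)
# Rows that were never drawn do not appear in this training set.
out_of_bag = sorted(set(range(n)) - set(sample))
print('sampled indices:', sample)
print('rows never drawn:', out_of_bag)
# Probability that a given row is never drawn in n draws.
p_out = (1 - 1 / n) ** n
print('P(row never drawn):', round(p_out, 4))
```

Because some rows are drawn more than once and others not at all, each bootstrap sample differs from the original 17 rows, and that is what gives the base learners their diversity.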
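Most of the attributes in this dataset are discrete, so they have to be encoded before training; see the reference blog post mentioned above for details.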
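To make the encoding step concrete, here is a simplified pure-Python stand-in for one-hot encoding: string-valued attributes become 0/1 indicator columns named `attribute=value`, and numeric attributes are kept as numbers. It is only a sketch of the idea, not sklearn's implementation, and the attribute names and values are invented.

```python
def one_hot_encode(rows):
    """Encode a list of dicts into numeric vectors.

    String values become 0/1 indicator columns; numeric values are kept.
    """
    columns = set()
    for row in rows:
        for name, value in row.items():
            columns.add(f'{name}={value}' if isinstance(value, str) else name)
    columns = sorted(columns)
    encoded = []
    for row in rows:
        vector = [0.0] * len(columns)
        for name, value in row.items():
            if isinstance(value, str):
                vector[columns.index(f'{name}={value}')] = 1.0
            else:
                vector[columns.index(name)] = float(value)
        encoded.append(vector)
    return columns, encoded


# Hypothetical rows with made-up attribute names and values.
rows = [
    {'color': 'green', 'density': 0.5},
    {'color': 'black', 'density': 0.75},
]
columns, encoded = one_hot_encode(rows)
print(columns)  # ['color=black', 'color=green', 'density']
print(encoded)  # [[0.0, 1.0, 0.5], [1.0, 0.0, 0.75]]
```

Note that the column set depends on which attribute values appear in the rows being encoded, so training and test data should go through the same fitted encoder.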
2. Code implementation
'''
8.5 Implement Bagging with decision stumps as the base learners, train a Bagging
ensemble on watermelon dataset 3.0a, and compare it with Figure 8.6.
'''
import csv
import random

from sklearn import preprocessing
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer


class Bagging:
    def __init__(self, filename='西瓜数据集3.0.csv', n_estimators=7):
        # loadData() and build_trainset() create all data attributes on the
        # instance, replacing the shared class-level lists.
        self.loadData(filename)
        n = len(self.test_labelList)
        self.bagging_vote_good = [0] * n
        self.bagging_vote_bad = [0] * n
        for _ in range(n_estimators):
            # Draw a fresh bootstrap sample for every base learner; training
            # all learners on one sample would make them nearly identical.
            self.build_trainset()
            # No depth limit is set, so this is a full tree; max_depth=1 would
            # give the decision stump that the exercise asks for.
            clf = tree.DecisionTreeClassifier(criterion='entropy')
            clf.fit(self.train_dummyX, self.train_dummyY)
            predict = clf.predict(self.test_dummyX)
            self.count_vote(predict)
        print('Votes for good melon:', self.bagging_vote_good)
        print('Votes for bad melon:', self.bagging_vote_bad)
        correct = 0
        for i in range(n):
            good, bad = self.bagging_vote_good[i], self.bagging_vote_bad[i]
            if (self.test_dummyY[i] == 1 and good > bad) or (self.test_dummyY[i] == 0 and good < bad):
                correct += 1
        print('Accuracy:', correct / n)

    def count_vote(self, predict):
        # Tally one learner's prediction for every test sample (1 = good melon).
        for i in range(len(predict)):
            if predict[i] == 1:
                self.bagging_vote_good[i] += 1
            else:
                self.bagging_vote_bad[i] += 1

    def is_number(self, n):
        # True if n parses as a float and is not NaN (NaN != NaN).
        try:
            num = float(n)
        except ValueError:
            return False
        return num == num

    def encoder(self, featureList, labelList):
        # One-hot encode string attributes (numeric ones pass through) and
        # binarize the labels.
        vec = DictVectorizer()
        dummyX = vec.fit_transform(featureList).toarray()
        lb = preprocessing.LabelBinarizer()
        dummyY = lb.fit_transform(labelList)
        return dummyX, dummyY

    def loadData(self, filename):
        featureList = []
        labelList = []
        with open(filename, 'r', encoding='GBK') as data:
            reader = csv.reader(data)
            headers = next(reader)
            for row in reader:
                labelList.append(row[len(row) - 1])
                rowDict = {}
                # Column 0 is skipped; the last column is the label.
                for i in range(1, len(row) - 1):
                    if self.is_number(row[i]):
                        rowDict[headers[i]] = float(row[i])
                    else:
                        rowDict[headers[i]] = row[i]
                featureList.append(rowDict)
        self.test_featrueList = featureList
        self.test_labelList = labelList
        self.test_dummyX, dummyY = self.encoder(featureList, labelList)
        self.test_dummyY = dummyY.ravel()

    def build_trainset(self):
        # Bootstrap: draw n rows with replacement. Rows are taken from the
        # already-encoded arrays so train and test share the same columns.
        n = len(self.test_labelList)
        positions = [random.randint(0, n - 1) for _ in range(n)]
        self.train_featureList = [self.test_featrueList[pos] for pos in positions]
        self.train_labelList = [self.test_labelList[pos] for pos in positions]
        self.train_dummyX = self.test_dummyX[positions]
        self.train_dummyY = self.test_dummyY[positions]


if __name__ == '__main__':
    bagging = Bagging()
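The exercise asks for decision stumps, whereas the listing above trains decision trees without a depth limit. As a hedged illustration of what a stump-based Bagging ensemble could look like, here is a minimal pure-Python sketch for numeric features. All function names (`fit_stump`, `predict_stump`, `bagging_fit`, `bagging_predict`) and the toy data are invented for this example; the data is not taken from the watermelon dataset.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for left_label in (0, 1):
                for right_label in (0, 1):
                    errors = sum(
                        (left_label if row[j] <= t else right_label) != label
                        for row, label in zip(X, y)
                    )
                    if best is None or errors < best[0]:
                        best = (errors, j, t, left_label, right_label)
    if best is None:
        # All rows are identical, so no split exists: predict the majority label.
        majority = Counter(y).most_common(1)[0][0]
        return (0, 0.0, majority, majority)
    return best[1:]


def predict_stump(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


def bagging_fit(X, y, n_estimators=7, seed=0):
    """Train one stump per bootstrap sample of (X, y)."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return stumps


def bagging_predict(stumps, row):
    """Majority vote over the stumps' predictions for one row."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]


# Made-up toy data: two numeric features, labels 0/1.
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.7, 0.8], [0.8, 0.7], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]
stumps = bagging_fit(X, y)
print([bagging_predict(stumps, row) for row in X])
```

Each stump makes a single axis-parallel cut, so an ensemble of stumps yields a staircase-like decision boundary rather than the finer partition a full tree can produce.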