XGBoost in Practice: Binary Classification
1 Basic usage of xgboost
"""上面是libsvm的数据存储格式, 也是一种常用的格式,存储的稀疏数据。
第一列是label. a:b a表示index, b表示在该index下的数值, 这就类似于one-hot"""
import numpy as np
import scipy.sparse
import pickle
import xgboost as xgb
# DMatrix is XGBoost's internal data structure; it can read libsvm files directly
dtrain = xgb.DMatrix('./xgbdata/agaricus.txt.train')
dtest = xgb.DMatrix('./xgbdata/agaricus.txt.test')
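For reference, lines in a libsvm file look roughly like the two illustrative lines below (made-up values, not the actual contents of agaricus.txt.train): the label comes first, followed by sparse index:value pairs.
1 3:1 10:1 11:1 21:1
0 5:1 9:1 34:1 56:1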
You don't need to memorize every parameter; the commonly used ones are enough. It helps to map each parameter back to the math described earlier to see what it actually controls, which makes tuning easier. Once the parameters are set, you just train and evaluate, and the workflow feels very much like a sklearn model:
"""paramet setting"""
param = {
'max_depth': 2,
'eta': 1,
'silent': 1,
'objective': 'binary:logistic'
}
watch_list = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 5
model = xgb.train(params=param, dtrain=dtrain, num_boost_round=num_round, evals=watch_list)
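As noted above, usage is close to sklearn; here is a minimal sketch of the same setup through the sklearn-style wrapper xgboost.XGBClassifier (X_train, y_train, X_test are hypothetical numpy arrays, since the wrapper takes arrays rather than DMatrix objects):
from xgboost import XGBClassifier

clf = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=5, objective='binary:logistic')
clf.fit(X_train, y_train)          # X_train / y_train: hypothetical feature matrix and label vector
y_pred = clf.predict(X_test)       # X_test: hypothetical test feature matrix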
Then comes prediction:
"""预测"""
pred = model.predict(dtest)
from sklearn.metrics import accuracy_score
predict_label = [round(values) for values in pred]
accuracy_score(labels, predict_label)
Saving the model is also worth knowing:
"""两种方式: 第一种, pickle的序列化和反序列化"""
pickle.dump(model, open('./model/xgb1.pkl', 'wb'))
model1 = pickle.load(open('./model/xgb1.pkl', 'rb'))
model1.predict(dtest)
"""第二种模型的存储与导入方式 - sklearn的joblib"""
from sklearn.externals import joblib
joblib.dump(model, './model/xgb.pkl')
model2 = joblib.load('./model/xgb.pkl')
model2.predict(dtest)
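Besides pickle and joblib, XGBoost has its own save/load API, which avoids pickle compatibility issues across versions; a minimal sketch (the './model/xgb.json' path is just an example):
model.save_model('./model/xgb.json')     # native serialization; JSON format inferred from the extension in recent versions
model3 = xgb.Booster()
model3.load_model('./model/xgb.json')
model3.predict(dtest)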
2 Cross validation with xgb.cv
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'}
num_round = 5
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=3)
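xgb.cv returns the per-round evaluation history (a pandas DataFrame when pandas is installed), so it is worth capturing the return value; a minimal sketch:
cv_results = xgb.cv(param, dtrain, num_round, nfold=5, metrics={'error'}, seed=3)
print(cv_results)                             # train-error-mean / test-error-mean etc. for each round
print(cv_results['test-error-mean'].min())    # best mean test error across the rounds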
3 Adjusting sample weights
This targets class-imbalanced data: you can re-weight the samples at training time. For xgb.cv, pass the fpreproc argument, a preprocessing function applied to each fold before training.
def preproc(dtrain, dtest, param):
    """Runs on each CV fold before training: weight positives by the negative/positive ratio."""
    labels = dtrain.get_label()
    ratio = float(np.sum(labels == 0)) / np.sum(labels == 1)
    param['scale_pos_weight'] = ratio    # the parameter name is scale_pos_weight
    return (dtrain, dtest, param)
xgb.cv(param, dtrain, num_round, nfold=5, metrics={'auc'}, seed=3, fpreproc=preproc)
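The same idea also works without fpreproc for a single xgb.train call: compute the ratio once from the training labels and set scale_pos_weight directly (a minimal sketch):
labels = dtrain.get_label()
param_weighted = dict(param)    # copy the base parameters
param_weighted['scale_pos_weight'] = float(np.sum(labels == 0)) / np.sum(labels == 1)   # up-weight the positive class
model_w = xgb.train(param_weighted, dtrain, num_boost_round=num_round, evals=watch_list)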
4 Custom objective (loss) function
In a competition you are often handed a specific evaluation criterion and may want to optimize it directly, which means replacing xgboost's loss function. Note that a custom objective must supply the first- and second-order derivatives (gradient and hessian) of the loss with respect to the raw prediction. For the logistic loss with p = 1/(1 + exp(-x)), these are p - y and p * (1 - p), which is exactly what logregobj below returns.
def logregobj(pred, dtrain):
    """Custom logistic objective: pred is the raw margin, so apply the sigmoid first."""
    labels = dtrain.get_label()
    pred = 1.0 / (1.0 + np.exp(-pred))   # sigmoid
    grad = pred - labels                 # first-order derivative
    hess = pred * (1.0 - pred)           # second-order derivative
    return grad, hess

def evalerror(pred, dtrain):
    """Custom evaluation metric: error rate on raw margins, thresholded at 0."""
    labels = dtrain.get_label()
    return 'error', float(sum(labels != (pred > 0.0))) / len(labels)
At training time, just pass the custom objective and metric:
param = {'max_depth': 2, 'eta': 1, 'silent': 1}
model = xgb.train(param, dtrain, num_round, watch_list, obj=logregobj, feval=evalerror)
xgb.cv(param, dtrain, num_round, nfold=5, seed=3, obj=logregobj, feval=evalerror)
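One caveat: with a custom objective the model outputs raw margins rather than probabilities, so if you need probabilities you apply the sigmoid yourself; a minimal sketch:
raw_pred = model.predict(dtest)                # raw margins when a custom objective is used
prob = 1.0 / (1.0 + np.exp(-raw_pred))         # convert to probabilities manually
predict_label = (prob > 0.5).astype(int)       # threshold at 0.5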
5 Predicting with only the first n trees: ntree_limit
Too many trees can overfit. In that case you can predict using only the first n trees by setting ntree_limit in predict:
pred1 = model.predict(dtest, ntree_limit=1)   # use only the first tree
evalerror(pred1, dtest)
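In recent XGBoost releases ntree_limit is deprecated in favor of iteration_range; assuming a reasonably new version, the equivalent call looks roughly like this:
pred1 = model.predict(dtest, iteration_range=(0, 1))   # use only the first boosting round
evalerror(pred1, dtest)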
6 Plotting feature importance: plot_importance
from xgboost import plot_importance
import matplotlib.pyplot as plt

plot_importance(model, max_num_features=10)   # plot the 10 most important features
plt.show()
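If you want the numbers rather than a plot, the booster's get_score method returns the same information; a small sketch (importance_type can also be 'gain' or 'cover'):
scores = model.get_score(importance_type='weight')    # mapping from feature name to importance
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10])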
7 Likewise, you can tune parameters with sklearn's GridSearchCV
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(x_train, y_train)   # x_train / y_train: your training features and labels
print("best: %f using %s" %(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))
That wraps up the hands-on part. The key point is how to use it: xgboost feels very much like a sklearn model, with the same .fit() / .predict() style of workflow. The difference is that xgboost has many parameters, so tuning can get complicated; once you understand the theory you at least know what each parameter does, and the tuning skill itself comes from experience, so experiment, try things, and keep summarizing what works.