LightGBM is an open-source machine learning framework from Microsoft that implements gradient boosting with decision trees, in the same family as XGBoost. It is designed for distributed training and efficient handling of large datasets. Its main advantages:
- Faster training speed, often with accuracy comparable to or better than XGBoost
- Lower memory usage: the histogram algorithm buckets continuous features into discrete bins, which is what makes training fast and memory-light
- Higher accuracy from leaf-wise rather than level-wise tree growth, which speeds up convergence of the objective and captures finer patterns in complex trees; overfitting is controlled with the num_leaves and max_depth hyperparameters (see the sketch after this list)
- Support for parallel, distributed, and GPU training
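As a quick sketch (the values are illustrative, not tuned recommendations), num_leaves and max_depth are typically set together so that leaf-wise growth stays bounded:

params = {
    'num_leaves': 31,  # primary capacity control under leaf-wise growth
    'max_depth': 7,    # keeping num_leaves below 2**max_depth curbs overfitting
}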
Characteristics of LightGBM
- XGBoost splits a decision tree on one variable at a time, exploring different cut points on that variable (a level-wise tree-growth strategy)
- LightGBM instead splits at whichever leaf yields the best improvement in fit (a leaf-wise tree-growth strategy)
This lets LightGBM reach a good fit on the data quickly and produce models that can stand in for XGBoost. Algorithmically, XGBoost traverses the tree's split structure as a graph in breadth-first (BFS) order, while LightGBM grows it depth-first (DFS).
Installation
conda install -c conda-forge lightgbm
python3.6 -m pip install lightgbm
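Either command installs the same package; a quick sanity check that the install worked:

import lightgbm
print(lightgbm.__version__)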
Basic usage
LightGBM exposes a number of APIs for training. The sections below cover some of the commonly used ones, with usage examples: https://lightgbm.readthedocs.io/en/v3.3.2/Python-API.html
lightgbm.train
An example parameter dictionary for binary classification (note the key for the evaluation metric is 'metric', not 'metrics'):

parameters = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',    # gradient boosted decision trees
    'objective': 'binary',      # binary classification
    'metric': 'auc',            # evaluation metric
    'num_leaves': 32,
    'feature_fraction': 0.8,    # fraction of features sampled per tree
    'bagging_fraction': 0.8,    # fraction of rows sampled
    'bagging_freq': 5,          # re-sample rows every 5 iterations
    'seed': 2022,
    'bagging_seed': 1,
    'feature_fraction_seed': 7,
    'min_data_in_leaf': 20,
    'n_jobs': -1,
    'verbose': -1,
}
lightgbm.train(
    params,
    train_set,
    num_boost_round=100,
    valid_sets=None,
    valid_names=None,
    fobj=None,
    feval=None,
    init_model=None,
    feature_name='auto',
    categorical_feature='auto',
    early_stopping_rounds=None,
    evals_result=None,
    verbose_eval='warn',
    learning_rates=None,
    keep_training_booster=False,
    callbacks=None)
Parameter | Description
---|---
params | Hyperparameters for training, e.g. the learning rate and evaluation metric
train_set | The training set (a lgb.Dataset)
num_boost_round | Number of boosting iterations
valid_sets | Validation sets; commonly valid_sets = [valid_set, train_set]
verbose_eval | How often evaluation results on valid_sets are printed; an int N prints every N rounds
early_stopping_rounds | Stop training once the validation score has not improved for this many consecutive rounds, e.g. 100
evals_result | Dict that stores all evaluation results of all the items in valid_sets; typically used to plot the loss curve over iterations
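Before the full cross-validation example below, a minimal sketch of how these pieces fit together (X, y, X_val, and y_val stand in for an assumed feature matrix and label vectors; `parameters` is the dict defined above):

import lightgbm as lgb

train_set = lgb.Dataset(X, label=y)
valid_set = lgb.Dataset(X_val, label=y_val, reference=train_set)  # reference keeps binning consistent
model = lgb.train(parameters, train_set, num_boost_round=100,
                  valid_sets=[valid_set, train_set])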
Example: training a binary classifier with lightgbm.train and K-fold cross-validation
import os
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, precision_score, recall_score

# `data` is assumed to be a pandas DataFrame with feature columns plus a
# 'label' column that is NaN for the unlabeled test rows.
X_train, X_test = data[~data['label'].isna()], data[data['label'].isna()]
Y_train = X_train['label']
features = [c for c in X_train.columns if c != 'label']  # assumption: every non-label column is a feature
num_round = 1000  # assumed upper bound on boosting rounds; early stopping picks the effective number

KF = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
parameters = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 32,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'seed': 2022,
    'bagging_seed': 1,
    'feature_fraction_seed': 7,
    'min_data_in_leaf': 20,
    'n_jobs': -1,
    'verbose': -1,
}
lgb_result = np.zeros(len(X_train))  # holds the out-of-fold predictions
os.makedirs('model', exist_ok=True)
for fold_, (trn_idx, val_idx) in enumerate(KF.split(X_train.values, Y_train.values)):
    print("fold {} of {}".format(fold_ + 1, KF.n_splits))
    trn_data = lgb.Dataset(X_train.iloc[trn_idx][features], label=Y_train.iloc[trn_idx])
    val_data = lgb.Dataset(X_train.iloc[val_idx][features], label=Y_train.iloc[val_idx])
    evaluation_result = {}
    model = lgb.train(
        params=parameters,
        train_set=trn_data,
        num_boost_round=num_round,
        valid_sets=[trn_data, val_data],
        verbose_eval=500,
        early_stopping_rounds=100,
        evals_result=evaluation_result
    )
    lgb_result[val_idx] = model.predict(X_train.iloc[val_idx][features], num_iteration=model.best_iteration)
    model.save_model(f'model/model_{fold_}.txt')
    lgb.plot_metric(evaluation_result, metric='auc')  # AUC curves recorded during training

# Predictions from the last fold's model; lgb_result above is the proper
# out-of-fold estimate.
train_predict = model.predict(X_train[features], num_iteration=model.best_iteration)
test_predict = model.predict(X_test[features], num_iteration=model.best_iteration)

print('Train Precision score: {}'.format(precision_score(Y_train, [1 if i >= 0.5 else 0 for i in train_predict])))
print('Train Recall score: {}'.format(recall_score(Y_train, [1 if i >= 0.5 else 0 for i in train_predict])))
print('Train AUC score: {}'.format(roc_auc_score(Y_train, train_predict)))
print('Train F1 score: {}\n'.format(f1_score(Y_train, [1 if i >= 0.5 else 0 for i in train_predict])))

# The test metrics below assume ground-truth labels Y_test exist for the test
# split; with the truly unlabeled test rows constructed above they cannot be computed.
print('Test Precision score: {}'.format(precision_score(Y_test, [1 if i >= 0.5 else 0 for i in test_predict])))
print('Test Recall score: {}'.format(recall_score(Y_test, [1 if i >= 0.5 else 0 for i in test_predict])))
print('Test AUC score: {}'.format(roc_auc_score(Y_test, test_predict)))
print('Test F1 score: {}'.format(f1_score(Y_test, [1 if i >= 0.5 else 0 for i in test_predict])))
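Because lgb_result collects each fold's out-of-fold predictions, scoring it against Y_train gives an honest cross-validated estimate, unlike the train metrics above, which reuse the final fold's model on data it has already seen:

print('OOF AUC score: {}'.format(roc_auc_score(Y_train, lgb_result)))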
Hyperparameter tuning
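One common approach is a grid search over the scikit-learn wrapper; a minimal sketch (the grid values and the X, y variables are assumptions, not tuned recommendations):

import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    'num_leaves': [15, 31, 63],
    'learning_rate': [0.01, 0.05, 0.1],
    'min_child_samples': [10, 20, 40],  # sklearn-API alias of min_data_in_leaf
}
search = GridSearchCV(
    lgb.LGBMClassifier(objective='binary', n_estimators=200, random_state=2022),
    param_grid, scoring='roc_auc', cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)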
Visualization
lightgbm.plot_importance(booster, ax=None, height=0.2, xlim=None, ylim=None,
    title='Feature importance', xlabel='Feature importance', ylabel='Features',
    importance_type='auto', max_num_features=None, ignore_zero=True,
    figsize=None, dpi=None, grid=True, precision=3, **kwargs)
lightgbm.plot_importance(model, max_num_features=10)
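A complete usage sketch (assumes matplotlib is installed; `model` is a trained Booster as above):

import matplotlib.pyplot as plt

lgb.plot_importance(model, max_num_features=10)
plt.tight_layout()
plt.show()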
Saving / loading models
model = lgb.train(...)  # a trained Booster, as in the example above
model.save_model(
    filename,
    num_iteration=None,
    start_iteration=0,
    importance_type='split'
)
# MODEL_PATH and MODEL_NAME are placeholders for your own directory and file name
model.save_model(os.path.join(MODEL_PATH, MODEL_NAME),
                 num_iteration=model.best_iteration)
lightgbm.Booster(
    params=None,
    train_set=None,
    model_file=None,
    model_str=None)
def load_model(model_path):
    if not os.path.exists(model_path):
        return None
    try:
        model = lgb.Booster(model_file=model_path)
    except lgb.basic.LightGBMError:  # raised when the file cannot be parsed as a model
        print('Failed to load model, path:', model_path)
        return None
    return model
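Putting save and load together, using one of the per-fold files written in the training example above:

booster = load_model('model/model_0.txt')
if booster is not None:
    preds = booster.predict(X_test[features])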
- Alternatively, serialize the model with the joblib library
Note: the saved file conventionally uses a .pkl extension

import joblib  # sklearn.externals.joblib was removed from recent scikit-learn; use the standalone joblib package
joblib.dump(model, 'model.pkl')
model = joblib.load('model.pkl')
Y_pred = model.predict(X_test, num_iteration=model.best_iteration)  # a Booster exposes best_iteration; best_iteration_ is the sklearn-wrapper attribute
Model conversion
References
https://lightgbm.readthedocs.io/en/v3.3.2/Python-API.html