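AutoML (TPOT): Building a Regression Model
I. How TPOT works
TPOT is an AutoML tool that uses a genetic algorithm to generate pipeline code. It is written in Python and built on top of scikit-learn.
**Core idea:** a genetic algorithm searches over features and models. Goal: optimize features, model choice, and hyperparameters together, and generate the main body of the code.
Given structured (tabular) data, TPOT automatically compares and tunes many models. After a set number of iterations it returns the best pipeline it found and can write that pipeline, with its parameters, to a Python script. Only the data-loading interface in that script needs adjusting before the model can be used.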
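To make the idea of a genetic search concrete, here is a minimal toy sketch using only the standard library. It is not TPOT's implementation: the search space, the `fitness` function (a stand-in for a cross-validated score), and all helper names are hypothetical, and TPOT's real search operates on whole scikit-learn pipelines.

```python
import random

# Hypothetical search space: each "pipeline" is just a dict of two settings.
SEARCH_SPACE = {
    "max_depth": list(range(1, 11)),
    "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.5],
}


def fitness(candidate):
    """Toy score standing in for a cross-validated model score (higher is better)."""
    # Assumed optimum at max_depth=5, learning_rate=0.1, purely for illustration.
    return -abs(candidate["max_depth"] - 5) - 10 * abs(candidate["learning_rate"] - 0.1)


def random_candidate(rng):
    """Draw every setting at random from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Build a child by taking each setting from one of the two parents."""
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}


def mutate(candidate, rng):
    """Copy the candidate and re-draw one randomly chosen setting."""
    child = dict(candidate)
    name = rng.choice(list(SEARCH_SPACE))
    child[name] = rng.choice(SEARCH_SPACE[name])
    return child


def evolve(generations=20, population_size=10, seed=0):
    """Run the toy search and return the best candidate plus the best score per generation."""
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]  # the top half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), history


best, history = evolve()
print(best, fitness(best))
```

Because the top half of each generation is carried over unchanged, the best score never gets worse from one generation to the next; mutation and crossover supply the new candidates.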
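II. Notes on using TPOT
- Clean the data and do any necessary feature processing before modeling with TPOT.
- TPOT currently supports supervised learning only.
- The classifiers TPOT currently supports are mainly naive Bayes, decision trees, tree ensembles, SVM, KNN, linear models, and xgboost.
- The regressors TPOT currently supports are mainly decision trees, tree ensembles, linear models, and xgboost.
- TPOT applies further preprocessing to the input data, such as binarization, clustering, dimensionality reduction, standardization, normalization, and one-hot encoding.
- Depending on model performance, TPOT performs feature selection on the inputs, including tree-based selection, variance-based selection, and percentile selection based on the F-value.
- The export() method writes the result out as a .py file in the form of a scikit-learn pipeline.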
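As an illustration of the kind of cleaning meant in the first note, the following sketch imputes missing values with the column median and drops constant columns using NumPy. The function name `basic_clean` and the sample matrix are hypothetical; real data will usually need more than this (for example, encoding categorical columns as numbers).

```python
import numpy as np


def basic_clean(X):
    """Median-impute NaNs column by column, then drop zero-variance columns."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)      # per-column median, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]          # fill each gap with its column median
    keep = X.var(axis=0) > 0               # a constant column carries no signal
    return X[:, keep], keep


# Hypothetical raw feature matrix with gaps and one constant column.
X_raw = np.array([
    [1.0, 7.0, np.nan],
    [2.0, 7.0, 10.0],
    [np.nan, 7.0, 30.0],
])
X_clean, kept = basic_clean(X_raw)
print(X_clean)
print(kept)
```

A column that is entirely NaN has no median, so it would need to be dropped before calling a function like this.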
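III. Worked example
This uses the bundled Boston housing price example: supplying the features and the target is enough to obtain a well-performing model and its parameters.
1. Building the features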
from tpot import TPOTRegressor
# load_boston was removed in scikit-learn 1.2, so this needs an older release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
housing = load_boston()
# Hold out 30% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=22)
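2. Generating the pipeline script
# Evolve a population of 50 pipelines for 30 generations.
tpot = TPOTRegressor(generations=30, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
# Score the best pipeline on the held-out test set.
print(tpot.score(X_test, y_test))
# Write the best pipeline out as a standalone Python script.
tpot.export('tpot_pipeline.py')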
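The generations and population_size settings determine how long the search runs. According to the TPOT documentation, the search evaluates population_size + generations × offspring_size pipelines, with offspring_size defaulting to population_size. The helper below is a hypothetical back-of-the-envelope calculator, and the 5-fold cross-validation figure is an assumption about the default setting.

```python
def pipelines_evaluated(generations, population_size, offspring_size=None):
    """Estimate how many pipelines a TPOT run evaluates in total."""
    if offspring_size is None:
        offspring_size = population_size  # assumed default: same as population_size
    return population_size + generations * offspring_size


total = pipelines_evaluated(generations=30, population_size=50)
cv_folds = 5  # assumption: default 5-fold cross-validation
print(total)             # 1550
print(total * cv_folds)  # 7750 model fits
```

For a quick trial run, much smaller values of generations and population_size keep the run time manageable.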
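3. Predicting with the parameters from the pipeline script
Data handling is the same as in the training script; simply take the exported_pipeline definition from the generated script and use it as the model. From there it works like any scikit-learn estimator: call **.fit** and **.predict** to train and predict. After TPOT scored the candidates, it produced the following pipeline: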
exported_pipeline = make_pipeline(
    SelectPercentile(score_func=f_regression, percentile=98),
    SelectFwe(score_func=f_regression, alpha=0.044),
    XGBRegressor(learning_rate=0.1, max_depth=5, min_child_weight=4, n_estimators=100, n_jobs=1, objective="reg:squarederror", subsample=0.55, verbosity=0)
)
# Fix the random seed on every step that accepts one.
set_param_recursive(exported_pipeline.steps, 'random_state', 42)
# Saved before fitting, so the file stores the pipeline configuration, not a trained model.
joblib.dump(exported_pipeline, 'Baseline_train_model.pkl')
Fitting the pipeline and predicting gives the following results:
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
OUT[]:
model MSE is 11.27
model MAPE is 11.61 %
model R2 is 81.45 %
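For reference, the three metrics reported above can be written out in a few lines of plain Python. This is a sketch of the standard definitions with hypothetical toy values, not the scoring code used in this article; it also shows that R2 depends on which argument is treated as the true values.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (undefined if a true value is 0)."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, chosen so the arithmetic is easy to check by hand.
y_true = [2.0, 4.0, 6.0]
y_pred = [1.0, 4.0, 8.0]
print(round(mse(y_true, y_pred), 4))   # 1.6667
print(round(mape(y_true, y_pred), 2))  # 27.78
print(round(r2(y_true, y_pred), 4))    # 0.375
print(round(r2(y_pred, y_true), 4))    # 0.7973 (arguments swapped)
```

MSE is symmetric in its two arguments, but MAPE and R2 are not, so the true values should always be passed first.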
The full code:
import joblib
import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.feature_selection import SelectFwe, SelectPercentile, f_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive
from xgboost import XGBRegressor


def mape(true, predicted):
    """Compute MAPE as a percentage, rounded to two decimals."""
    inside_sum = np.abs(predicted - true) / true
    return round(100 * np.sum(inside_sum) / inside_sum.size, 2)


def get_mape(y_true, y_pred):
    """Compute mean absolute percentage error (MAPE)."""
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
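

def score_print(results, testing_target):
    """
    Print evaluation metrics for the predictions.

    The metrics take (y_true, y_pred) in that order. R2 and MAPE are not
    symmetric, so the true targets are passed first.
    """
    print(f"model MSE is {round(mean_squared_error(testing_target, results), 2)}")
    print(f"model MAPE is {mape(testing_target, results)} %")
    print(f"model R2 is {round(r2_score(testing_target, results) * 100, 2)} %")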


if __name__ == '__main__':
    housing = load_boston()
    tpot_data = housing.target
    features = housing.data
    training_features, testing_features, training_target, testing_target = train_test_split(features, tpot_data, random_state=42)

    # Pipeline definition taken from the script exported by TPOT.
    exported_pipeline = make_pipeline(
        SelectPercentile(score_func=f_regression, percentile=98),
        SelectFwe(score_func=f_regression, alpha=0.044),
        XGBRegressor(learning_rate=0.1, max_depth=5, min_child_weight=4, n_estimators=100, n_jobs=1, objective="reg:squarederror", subsample=0.55, verbosity=0)
    )
    set_param_recursive(exported_pipeline.steps, 'random_state', 42)
    # Saved before fitting: the file holds the pipeline configuration only.
    joblib.dump(exported_pipeline, 'Baseline_train_fenlie1_model.pkl')

    exported_pipeline.fit(training_features, training_target)
    results = exported_pipeline.predict(testing_features)
    score_print(results, testing_target)
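4. Loading the pkl model file for prediction
Load the model parameter file saved earlier with joblib, then use it for prediction. Because the pipeline was saved before it was fitted, it is fitted again after loading.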
def load_model_predict(model_path, training_features, training_target, testing_features, testing_target):
    """Load a pipeline saved with joblib, fit it, and predict."""
    exported_pipeline_load = joblib.load(model_path)
    set_param_recursive(exported_pipeline_load.steps, 'random_state', 42)
    # The file was written before fitting, so train the pipeline before predicting.
    exported_pipeline_load.fit(training_features, training_target)
    results = exported_pipeline_load.predict(testing_features)
    score_print(results, testing_target)
    return results
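If retraining after every load is not wanted, one alternative is to call joblib.dump after fit, so that the file contains the trained state. The sketch below illustrates this with synthetic data and a simple StandardScaler + Ridge pipeline; both are stand-ins chosen for the example, not the pipeline TPOT exported, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for the housing features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipeline.fit(X, y)                                  # fit first ...

model_path = os.path.join(tempfile.mkdtemp(), "fitted_pipeline.pkl")
joblib.dump(pipeline, model_path)                   # ... then persist the fitted state

restored = joblib.load(model_path)                  # no refit needed after loading
print(np.allclose(restored.predict(X), pipeline.predict(X)))
```

A file saved this way should be loaded with the same library versions that wrote it, since pickled estimators are not guaranteed to be portable across releases.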