Hyperparameter Tuning for Machine Learning with Python Scikit-learn

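Hi everyone! I'm a young enthusiast who enjoys digging into algorithms, machine learning, and computational biology. My CSDN blog is: 一骑代码走天涯
If you like my notes, please follow, like, and bookmark. If something is wrong or could be improved, let me know in the comments. 😄

When doing machine learning in Python, we usually reach for the scikit-learn package. Besides a wide range of machine learning models, it ships many tools for evaluating and improving model performance, including tools for hyperparameter tuning. This post is a short record of how I use those tools when training a model.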

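1. RandomizedSearchCV and GridSearchCV

Scikit-learn offers two main classes for tuning: RandomizedSearchCV and GridSearchCV.

How do they differ?
RandomizedSearchCV:
Given a range of hyperparameter values, it randomly samples n_iter combinations (a number set by the user) and tests only those. The range is passed as a dict through param_distributions. It suits cases where you need to find a good combination inside a very large search space.

GridSearchCV:
Given a range of hyperparameter values, it enumerates every combination and tests them one by one. The range is passed as a dict through param_grid. It suits cases where the search space is already small or narrowly focused.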
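
To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It only illustrates the idea and is not scikit-learn's implementation; the toy search space and the helper names all_combinations and random_combinations are made up for this example.

```python
import itertools
import random

# Hypothetical toy search space, used only for illustration.
space = {
    "max_depth": [3, 6, 9],
    "n_estimators": [50, 100],
}

def all_combinations(space):
    """Every combination of the listed values (what a grid search would try)."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]

def random_combinations(space, n_iter, seed=0):
    """n_iter distinct combinations drawn at random (what a random search would try)."""
    rng = random.Random(seed)
    return rng.sample(all_combinations(space), n_iter)

grid_candidates = all_combinations(space)
random_candidates = random_combinations(space, n_iter=4)

print(len(grid_candidates))    # 3 x 2 = 6 candidates
print(len(random_candidates))  # 4 candidates
```

The grid grows multiplicatively with every parameter you add, while the random search always costs exactly n_iter candidates, which is why it scales better to large spaces.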

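Note: according to the version 0.24.2 documentation, there are also two more classes, HalvingGridSearchCV and HalvingRandomSearchCV. In theory they are faster than the two above, but for now they are still experimental.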
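
For reference, a hedged sketch of how the experimental halving search might be used. Because the feature is experimental, scikit-learn requires an explicit enabling import first. The synthetic data from make_regression and the small parameter grid are assumptions chosen only to keep the example quick; they are not from my own project.

```python
# This enabling import is required while the halving searches are experimental.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Synthetic data, used only so the sketch runs on its own.
X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)

# Deliberately tiny grid (3 x 3 = 9 candidates) for illustration.
param_grid = {
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 2, 4],
}

halving_search = HalvingGridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid=param_grid,
    factor=3,        # keep roughly the best third of candidates each round
    cv=3,
    random_state=0,
)
halving_search.fit(X, y)

# Best combination found after the successive rounds.
print(halving_search.best_params_)
```

The idea is that all candidates start on a small number of samples and only the better ones are re-tested with more data, so weak combinations are discarded cheaply.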

2. Training and tuning a model

Let's take a random forest as an example:

RandomizedSearchCV:
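The search space below contains (2x10x3x5x4x10) = 12000 combinations, but because n_iter is set to 100, only 100 randomly drawn combinations are tested and the best of those is kept. RandomizedSearchCV beats GridSearchCV here because it saves time: it can find a near-optimal answer in a much shorter time.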
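
The numbers in this paragraph can be checked with a little arithmetic. The sketch below uses the six factors quoted above, together with n_iter = 100 and the cv = 3 used in this post's search; the variable names are just for illustration.

```python
import math

# Number of candidate values per hyperparameter, as quoted in the text.
values_per_parameter = [2, 10, 3, 5, 4, 10]
n_combinations = math.prod(values_per_parameter)

n_iter = 100  # combinations sampled by the random search
cv = 3        # cross-validation folds per combination

random_search_fits = n_iter * cv        # model fits during the random search
grid_search_fits = n_combinations * cv  # fits an exhaustive search would need

print(n_combinations)                     # 12000
print(random_search_fits)                 # 300
print(grid_search_fits)                   # 36000
print(f"{n_iter / n_combinations:.2%}")   # 0.83%
```

So the random search trains 300 models instead of 36000, at the price of looking at under 1% of the space. With the default refit=True, both searches add one final fit of the best combination on the full training set.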

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

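# Features and Labels stand for your own feature matrix and target vector.
X_train, X_test, y_train, y_test = train_test_split(Features, Labels, test_size=0.3)
# Create the parameter grid
random_grid = {
 'bootstrap': [True, False],
 'max_depth': [20, 40, 60, 80, 100, 200, 400, 600, 800, None],
 # 1.0 uses all features (what "auto" meant for regressors); newer releases no longer accept "auto"
 'max_features': [1.0, "sqrt", "log2"],
 'min_samples_leaf': [1, 2, 4, 8, 12],
 # integer values of min_samples_split must be at least 2
 'min_samples_split': [2, 5, 10, 20],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
}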
# Define random forest model.
rf = RandomForestRegressor()
# Instantiate the random search model
rand_search = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, n_jobs = -1)
rand_search.fit(X_train, y_train)
# Retrieve the best-performing model
best_model = rand_search.best_estimator_
# Print out the hyperparameters in this model
print(best_model)
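
As a variation, param_distributions does not have to contain only lists: scikit-learn also accepts objects with an rvs method, such as scipy.stats distributions, and samples from them directly. The sketch below assumes synthetic data from make_regression and deliberately small settings so it runs quickly; the ranges are illustrative, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

param_distributions = {
    "max_depth": randint(3, 21),        # integers from 3 to 20 inclusive
    "min_samples_leaf": randint(1, 9),  # integers from 1 to 8 inclusive
    "n_estimators": [20, 40],           # plain lists can be mixed in
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # small on purpose, to keep the example fast
    cv=3,
    random_state=0,    # makes the sampled combinations reproducible
)
search.fit(X, y)

# The winning combination as a plain dict.
print(search.best_params_)
```

Sampling from a distribution avoids having to hand-pick a list of values for each parameter, which can be handy when you have no strong prior about where the good values lie.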

GridSearchCV:
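When the search range is narrow, or each parameter has only a few options, GridSearchCV can search the space exhaustively. The example below has only (4x4x3x3) = 144 combinations.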

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Features and Labels stand for your own feature matrix and target vector.
X_train, X_test, y_train, y_test = train_test_split(Features, Labels, test_size=0.3)
param_grid = {
    'max_depth': [3, 6, 9, 12],
    'max_features': [2, 3, 4, 5],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [50, 100, 150]
}
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(X_train, y_train)
# Retrieve the best-performing model
best_model = grid_search.best_estimator_
# Print out the hyperparameters in this model
print(best_model)
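
After fitting, the search object holds more than the best model. The sketch below shows a few attributes I find useful, plus a check on the held-out test split. It assumes synthetic data from make_regression and a tiny grid so that it runs on its own; the same attributes exist on RandomizedSearchCV.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Deliberately tiny grid (2 x 2 = 4 candidates) for illustration.
param_grid = {"max_depth": [3, 6], "n_estimators": [20, 40]}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
)
grid_search.fit(X_train, y_train)

# Winning hyperparameters as a dict, and their mean cross-validated score
# (R^2 by default for a regressor).
print(grid_search.best_params_)
print(grid_search.best_score_)

# Score of the refitted best model on data the search never saw.
test_r2 = grid_search.score(X_test, y_test)
print(test_r2)

# cv_results_ has one entry per tested combination.
n_candidates = len(grid_search.cv_results_["params"])
print(n_candidates)  # 4
```

Comparing best_score_ with the test score is a quick sanity check: a large gap suggests the chosen hyperparameters fit the cross-validation folds better than they generalize.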

Added: 2021-07-14 23:01:27  Updated: 2021-07-14 23:02:00
