Mastering Python: Data Modeling with Scikit-Learn
I. Introduction
sklearn covers six major areas of functionality: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.
sklearn provides fundamental methods for both supervised and unsupervised learning.
sklearn's objects fall roughly into two categories: estimators and transformers. An estimator is a model, used for prediction on data (classification or regression); a transformer processes data, e.g. standardization, dimensionality reduction, and feature selection.
An estimator typically provides three methods: fit(), score(), and predict(). fit() trains the model; score() evaluates the model; predict() makes predictions on data and outputs the predicted labels.
A transformer typically provides three methods: fit(), transform(), and fit_transform(). fit() computes the transformation from the data; transform() applies the already-computed transformation to data; fit_transform() computes the transformation and then applies it to the input data in one step.
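As a quick illustration of both APIs, here is a minimal sketch (the iris dataset, StandardScaler, and KNeighborsClassifier are chosen purely as examples):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Transformer: fit() learns the scaling parameters, transform() applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
# fit_transform() does both steps at once
X_scaled2 = StandardScaler().fit_transform(X)

# Estimator: fit() trains, predict() outputs labels, score() evaluates
model = KNeighborsClassifier().fit(X_scaled, y)
labels = model.predict(X_scaled)
accuracy = model.score(X_scaled, y)
print(accuracy)
```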
II. Basic Modeling Workflow
The basic modeling workflow consists of: loading the dataset, splitting the dataset, preprocessing the data, and evaluating the model.
1. Loading a Dataset
The sklearn library includes a datasets module that bundles the classic datasets commonly used in data analysis.
| Function | Description |
| --- | --- |
| load_boston([return_X_y]) | Load the Boston house-price dataset (regression) |
| load_diabetes([return_X_y]) | Load and return the diabetes dataset (regression) |
| load_digits([n_class, return_X_y]) | Load and return the digits dataset (classification) |
| load_breast_cancer([return_X_y]) | Load and return the Wisconsin breast cancer dataset (classification) |
| load_iris([return_X_y]) | Load and return the iris dataset (classification) |
| load_wine([return_X_y]) | Load and return the wine dataset (classification) |
| load_linnerud([return_X_y]) | Load and return the Linnerud dataset (multivariate regression) |
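Each loader returns a Bunch object by default; passing return_X_y=True returns the data and labels as a tuple of arrays instead. A small sketch:

```python
from sklearn.datasets import load_iris

# Default: a Bunch object with .data, .target, etc.
iris = load_iris()
print(iris.data.shape, iris.target.shape)

# return_X_y=True: just the (data, labels) tuple
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```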
sklearn also supports loading real-world and external datasets. The main options are:
Load CSV, Excel, JSON, SQL, and similar data via pandas.io
Load .mat and .arff files via scipy.io
Load image or video data via skimage.io or imageio, converting it to NumPy arrays
Read WAV audio data via scipy.io.wavfile.read()
(I don't fully understand all of the above yet; for now it's enough to know these options exist, so I can pick them up later without being lost.)
Code along:
from sklearn.datasets import load_iris
iris=load_iris()
print(len(iris))
print(type(iris))
print(dir(iris))
7
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']
DESCR holds the basic description of the dataset object, including information such as the number of instances and the number of attributes.
print(iris.DESCR)
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
print(iris.data)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 ...
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
(150 rows in total; output truncated here)
print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.filename)
D:\anaconda3\lib\site-packages\sklearn\datasets\data\iris.csv
print(iris.target_names)
['setosa' 'versicolor' 'virginica']
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
2. Splitting the Dataset
Splitting a dataset means dividing the loaded data into the subsets required for modeling.
Large datasets are generally split into a training set, a test set, and a validation set;
small datasets can instead be split using k-fold cross-validation.
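For the k-fold approach mentioned above, sklearn provides KFold in the model_selection module; a minimal sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# 5-fold split: each iteration yields train/test index arrays
kf = KFold(n_splits=5, shuffle=True, random_state=42)
sizes = []
for train_idx, test_idx in kf.split(X):
    sizes.append((len(train_idx), len(test_idx)))
print(sizes)  # each fold: 120 train, 30 test
```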
Splitting is usually done with the train_test_split(*arrays, **options) function, which splits arrays or matrices into random training and test subsets.
Parameters of train_test_split(*arrays, **options):
| Parameter | Description |
| --- | --- |
| *arrays | One or more datasets to split. For classification or regression, pass both the data and the labels; for clustering, pass only the data. No default. |
| test_size | float, int, or None; the size of the test set. If float, must be between 0 and 1 and represents the proportion of the test set; if int, it is the absolute number of test samples. |
| train_size | float, int, or None; the size of the training set. If None, it is set to the complement of test_size. |
| random_state | int; the random seed. The same seed produces the same random split. Default None. |
| shuffle | bool; whether to shuffle the data before splitting. If False, stratify must be None. |
| stratify | array or None; if not None, the data is split in a stratified fashion, using this as the class labels. |
Code along:
from sklearn.datasets import load_iris
iris_data=load_iris()['data']
iris_target=load_iris()['target']
print(iris_data.shape)
(150, 4)
from sklearn.model_selection import train_test_split
iris_data_train,iris_data_test,iris_target_train,iris_target_test=train_test_split(iris_data,iris_target,test_size=0.2,random_state=42)
print(iris_data_train.shape)
print(iris_data_test.shape)
print(iris_target_train.shape)
print(iris_target_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
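When the class proportions should be preserved in both subsets, pass the labels via stratify (a variation on the split above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Each of the 3 classes contributes 20% of its samples to the test set
print(np.bincount(y_test))
```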
3. Preprocessing the Dataset
Preprocessing uses sklearn transformers to carry out operations such as scaling and dimensionality reduction on the data; only a brief overview is given here.
① Min-max normalization
The preprocessing module in sklearn provides many functions for data preparation, including the MinMaxScaler class for min-max normalization.
Code along:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
Scaler=MinMaxScaler().fit(iris_data_train)
iris_trainScaler=Scaler.transform(iris_data_train)
iris_testScaler=Scaler.transform(iris_data_test)
print(np.min(iris_data_train))
print(np.min(iris_trainScaler))
print(np.max(iris_data_train))
print(np.max(iris_trainScaler))
print(np.min(iris_data_test))
print(np.min(iris_testScaler))
print(np.max(iris_data_test))
print(np.max(iris_testScaler))
0.1
0.0
7.7
1.0
0.1
0.0
7.9
1.0588235294117647
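Note that the scaled test maximum above is about 1.059, slightly greater than 1: the scaler's minimum and maximum come from the training set only, so a test value larger than anything seen in training maps above 1. Per feature, MinMaxScaler applies x' = (x - min_train) / (max_train - min_train), which can be checked by hand on a toy feature:

```python
# Verifying min-max scaling by hand on a toy feature
import numpy as np

train = np.array([1.0, 3.0, 5.0])
test = np.array([6.0])  # larger than any training value

lo, hi = train.min(), train.max()
scaled_train = (train - lo) / (hi - lo)
scaled_test = (test - lo) / (hi - lo)

print(scaled_train)  # [0.  0.5 1. ]
print(scaled_test)   # [1.25] -- exceeds 1, as in the iris example above
```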
② PCA dimensionality reduction
The goal of dimensionality reduction is to simplify the data without losing too much of its information. It works as follows:
Code along:
from sklearn.decomposition import PCA
pca_model =PCA().fit(iris_trainScaler)
iris_trainPca=pca_model.transform(iris_trainScaler)
iris_testPca=pca_model.transform(iris_testScaler)
print(iris_trainScaler.shape)
print(iris_trainPca.shape)
print(iris_testScaler.shape)
print(iris_testPca.shape)
(120, 4)
(120, 4)
(30, 4)
(30, 4)
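The default PCA() keeps all components, which is why the shapes above are unchanged. To actually reduce the dimensionality, pass n_components; explained_variance_ratio_ then shows how much of the variance each retained component captures. A standalone sketch on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X, _ = load_iris(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

# Keep only the first 2 principal components
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
```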
4. Model Evaluation
Once a model has been built, it should be given a basic evaluation; the evaluation results can, to some extent, reveal problems with the model.
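As a minimal end-to-end example of this step (the model and metric are chosen purely for illustration), a classifier can be trained on the split data and evaluated with score() or the sklearn.metrics module:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# score() returns the mean accuracy on the given data
print(model.score(X_test, y_test))
# equivalently, via the metrics module
print(accuracy_score(y_test, model.predict(X_test)))
```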
-----------------Updated 2021-11-15-------------some parts remain unresolved; updates to follow--------