
[Artificial Intelligence] Mastering Python: An Overview of the Basic Scikit-Learn Data Modeling Workflow


I. Introduction

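sklearn covers six main areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

sklearn includes basic methods for both supervised and unsupervised learning.

The objects in sklearn fall roughly into two groups: estimators and transformers. An estimator is a model, used for prediction tasks such as classification and regression. A transformer processes data, for example standardization, dimensionality reduction, and feature selection.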

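Estimators typically provide three methods: fit(), score(), and predict(). fit() trains the model, score() evaluates the trained model, and predict() runs the model on data and returns the predicted labels.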
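
As a minimal sketch of these three methods, the snippet below trains a classifier on the iris dataset. KNeighborsClassifier and n_neighbors=5 are arbitrary choices made for illustration; any sklearn classifier would be used the same way.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# KNeighborsClassifier is an illustrative choice of estimator, not a recommendation
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y)                 # train the model
accuracy = model.score(X, y)    # mean accuracy on the data passed in (here: the training data)
labels = model.predict(X[:3])   # predicted labels for the first three samples

print(accuracy)
print(labels)
```

Scoring on the training data, as done here for brevity, gives an optimistic number; a held-out test set is the usual choice.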

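Transformers typically provide three methods: fit(), transform(), and fit_transform(). fit() learns the transformation from the data; transform() applies the learned transformation and returns the transformed data; fit_transform() does both in one call on the same input, which is equivalent to fit() followed by transform(). By default the result is returned as a new array rather than written back into the input.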
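
The relationship between the three methods can be checked on a tiny array. This is a sketch that assumes StandardScaler as the transformer and a made-up 3x2 input.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data, used only for illustration
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_before = X.copy()

scaler = StandardScaler()
scaler.fit(X)                       # learn the per-column mean and standard deviation
X_two_step = scaler.transform(X)    # apply the learned transformation

X_one_step = StandardScaler().fit_transform(X)  # fit and transform in a single call

print(np.allclose(X_two_step, X_one_step))  # both routes give the same result
print(np.array_equal(X, X_before))          # the input array itself is left unchanged
```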

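II. Basic Data Modeling Workflow

The basic data modeling workflow consists of four steps: loading the dataset, splitting the dataset, preprocessing the dataset, and evaluating the model.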
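
The four steps can be sketched end to end as follows. The choice of MinMaxScaler and KNeighborsClassifier, and the use of make_pipeline to chain them, are assumptions made for this illustration rather than part of the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# 1. Load the dataset
X, y = load_iris(return_X_y=True)

# 2. Split it into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Preprocess and train: the pipeline fits the scaler on the training data only
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
pipeline.fit(X_train, y_train)

# 4. Evaluate on the held-out test data
test_accuracy = pipeline.score(X_test, y_test)
print(test_accuracy)
```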

1. Loading the dataset

sklearn ships with a datasets module that contains classic datasets commonly used in data analysis.

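Function	Description
load_boston([return_X_y])	Load the Boston house-price dataset (regression). Deprecated in scikit-learn 1.0 and removed in 1.2.
load_diabetes([return_X_y])	Load and return the diabetes dataset (regression)
load_digits([n_class, return_X_y])	Load and return the digits dataset (classification)
load_breast_cancer([return_X_y])	Load and return the Wisconsin breast cancer dataset (classification)
load_iris([return_X_y])	Load and return the iris dataset (classification)
load_wine([return_X_y])	Load and return the wine dataset (classification)
load_linnerud([return_X_y])	Load and return the linnerud dataset (multi-output regression)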
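
The return_X_y parameter controls what a loader returns. A short sketch, using load_iris and load_wine as examples:

```python
from sklearn.datasets import load_iris, load_wine

# Default: a Bunch object holding the data, the target, and metadata
iris = load_iris()
print(iris.data.shape, iris.target.shape)

# return_X_y=True: only the (data, target) tuple
X, y = load_wine(return_X_y=True)
print(X.shape, y.shape)
```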

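sklearn also works with real-world and external datasets. The main ways to load them are:

Load CSV, Excel, JSON, SQL, and similar data with pandas.io

Load .mat and .arff files with scipy.io

Load images or video with skimage.io or Imageio, and convert the data to NumPy arrays

Read WAV audio with scipy.io.wavfile.read()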
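
As a sketch of the pandas route, the snippet below parses CSV text and converts it to the NumPy arrays that sklearn expects. The CSV content and column names are made up for illustration; an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = """sepal_length,sepal_width,label
5.1,3.5,0
7.0,3.2,1
6.3,3.3,2
"""

# For a real file, pass its path instead of the StringIO object
df = pd.read_csv(io.StringIO(csv_text))

X = df[["sepal_length", "sepal_width"]].to_numpy()  # feature matrix
y = df["label"].to_numpy()                           # label vector
print(X.shape, y.shape)
```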

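(I don't fully understand the options above myself yet. It is enough to know they exist, so that I can learn them on the spot when I run into them later without feeling lost.)

Type along with the code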

from sklearn.datasets import load_iris
iris = load_iris()  # import load_iris() from the datasets module and load the dataset

print(len(iris))
print(type(iris))
print(dir(iris))  # list the basic attributes

7
<class 'sklearn.utils.Bunch'>
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']

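DESCR is the basic description of the dataset object. It includes basic information about the data, such as the number of instances and the number of attributes.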

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

print(iris.data)  # view the data directly through the data attribute

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.9 1.5]
 [5.5 2.3 4.  1.3]
 [6.5 2.8 4.6 1.5]
 [5.7 2.8 4.5 1.3]
 [6.3 3.3 4.7 1.6]
 [4.9 2.4 3.3 1. ]
 [6.6 2.9 4.6 1.3]
 [5.2 2.7 3.9 1.4]
 [5.  2.  3.5 1. ]
 [5.9 3.  4.2 1.5]
 [6.  2.2 4.  1. ]
 [6.1 2.9 4.7 1.4]
 [5.6 2.9 3.6 1.3]
 [6.7 3.1 4.4 1.4]
 [5.6 3.  4.5 1.5]
 [5.8 2.7 4.1 1. ]
 [6.2 2.2 4.5 1.5]
 [5.6 2.5 3.9 1.1]
 [5.9 3.2 4.8 1.8]
 [6.1 2.8 4.  1.3]
 [6.3 2.5 4.9 1.5]
 [6.1 2.8 4.7 1.2]
 [6.4 2.9 4.3 1.3]
 [6.6 3.  4.4 1.4]
 [6.8 2.8 4.8 1.4]
 [6.7 3.  5.  1.7]
 [6.  2.9 4.5 1.5]
 [5.7 2.6 3.5 1. ]
 [5.5 2.4 3.8 1.1]
 [5.5 2.4 3.7 1. ]
 [5.8 2.7 3.9 1.2]
 [6.  2.7 5.1 1.6]
 [5.4 3.  4.5 1.5]
 [6.  3.4 4.5 1.6]
 [6.7 3.1 4.7 1.5]
 [6.3 2.3 4.4 1.3]
 [5.6 3.  4.1 1.3]
 [5.5 2.5 4.  1.3]
 [5.5 2.6 4.4 1.2]
 [6.1 3.  4.6 1.4]
 [5.8 2.6 4.  1.2]
 [5.  2.3 3.3 1. ]
 [5.6 2.7 4.2 1.3]
 [5.7 3.  4.2 1.2]
 [5.7 2.9 4.2 1.3]
 [6.2 2.9 4.3 1.3]
 [5.1 2.5 3.  1.1]
 [5.7 2.8 4.1 1.3]
 [6.3 3.3 6.  2.5]
 [5.8 2.7 5.1 1.9]
 [7.1 3.  5.9 2.1]
 [6.3 2.9 5.6 1.8]
 [6.5 3.  5.8 2.2]
 [7.6 3.  6.6 2.1]
 [4.9 2.5 4.5 1.7]
 [7.3 2.9 6.3 1.8]
 [6.7 2.5 5.8 1.8]
 [7.2 3.6 6.1 2.5]
 [6.5 3.2 5.1 2. ]
 [6.4 2.7 5.3 1.9]
 [6.8 3.  5.5 2.1]
 [5.7 2.5 5.  2. ]
 [5.8 2.8 5.1 2.4]
 [6.4 3.2 5.3 2.3]
 [6.5 3.  5.5 1.8]
 [7.7 3.8 6.7 2.2]
 [7.7 2.6 6.9 2.3]
 [6.  2.2 5.  1.5]
 [6.9 3.2 5.7 2.3]
 [5.6 2.8 4.9 2. ]
 [7.7 2.8 6.7 2. ]
 [6.3 2.7 4.9 1.8]
 [6.7 3.3 5.7 2.1]
 [7.2 3.2 6.  1.8]
 [6.2 2.8 4.8 1.8]
 [6.1 3.  4.9 1.8]
 [6.4 2.8 5.6 2.1]
 [7.2 3.  5.8 1.6]
 [7.4 2.8 6.1 1.9]
 [7.9 3.8 6.4 2. ]
 [6.4 2.8 5.6 2.2]
 [6.3 2.8 5.1 1.5]
 [6.1 2.6 5.6 1.4]
 [7.7 3.  6.1 2.3]
 [6.3 3.4 5.6 2.4]
 [6.4 3.1 5.5 1.8]
 [6.  3.  4.8 1.8]
 [6.9 3.1 5.4 2.1]
 [6.7 3.1 5.6 2.4]
 [6.9 3.1 5.1 2.3]
 [5.8 2.7 5.1 1.9]
 [6.8 3.2 5.9 2.3]
 [6.7 3.3 5.7 2.5]
 [6.7 3.  5.2 2.3]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]

print(iris.feature_names)  # view the feature names

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print(iris.filename)  # view the source file

D:\anaconda3\lib\site-packages\sklearn\datasets\data\iris.csv

print(iris.target_names)  # view the label names

['setosa' 'versicolor' 'virginica']

print(iris.target)  # view the label values

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

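2. Splitting the dataset

Splitting a dataset means dividing the loaded data into subsets as the task requires.

Large datasets are typically split into a training set, a test set, and a validation set;

for small datasets, k-fold cross-validation can be used instead.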
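
A minimal sketch of k-fold cross-validation on the iris dataset. The value k=5 and the KNeighborsClassifier model are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# shuffle=True because the iris samples are stored ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each fold serves once as the test set while the other four are used for training
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=kfold)

print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy over the 5 folds
```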

Splitting is usually done with train_test_split(*arrays, **options), which splits arrays or matrices into random train and test subsets.

The train_test_split(*arrays, **options) function:

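Parameter	Description
*arrays	Accepts one or more datasets to split. For classification or regression, pass both the data and the labels; for clustering, pass only the data. No default.
test_size	Accepts a float, an int, or None; the size of the test set. A float must be between 0.0 and 1.0 and gives the proportion of samples in the test set. An int gives the absolute number of test samples. If None, it is set to the complement of train_size; if train_size is also None, it defaults to 0.25.
train_size	Accepts a float, an int, or None; the size of the training set, interpreted the same way as test_size. If None, it is set to the complement of test_size.
random_state	Accepts an int used as the random seed; the same seed produces the same split. Defaults to None.
shuffle	Accepts a bool; whether to shuffle the data before splitting. Defaults to True. If shuffle is False, stratify must be None.
stratify	Accepts an array or None. If not None, the data is split in a stratified fashion, using the passed array as the class labels.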
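
The effect of stratify can be seen by counting the classes on each side of the split. This sketch assumes the iris dataset and a 20% test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.bincount(y_train))  # samples per class in the training set
print(np.bincount(y_test))   # samples per class in the test set
```

Because iris has 50 samples in each of its 3 classes, the stratified split puts 40 of each class in the training set and 10 of each in the test set.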

Type along with the code

# Load the dataset
from sklearn.datasets import load_iris
# Extract the feature values and the labels
iris = load_iris()
iris_data = iris['data']
iris_target = iris['target']
print(iris_data.shape)  # check the shape of the data

(150, 4)

# Split the dataset: the test set takes 20%, with random seed 42
from sklearn.model_selection import train_test_split
iris_data_train, iris_data_test, iris_target_train, iris_target_test = train_test_split(
    iris_data, iris_target, test_size=0.2, random_state=42)

# Check the shapes of the resulting subsets
print(iris_data_train.shape)
print(iris_data_test.shape)
print(iris_target_train.shape)
print(iris_target_test.shape)

(120, 4)
(30, 4)
(120,)
(30,)

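3. Preprocessing the dataset

Preprocessing here means using sklearn transformers for data preprocessing, dimensionality reduction, and related operations. Only a brief illustration is given.

① Min-max normalization

sklearn's preprocessing module provides many tools for data processing. It includes the MinMaxScaler class, which performs min-max normalization.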

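import numpy as np
from sklearn.preprocessing import MinMaxScaler  # import the scaler class
Scaler = MinMaxScaler().fit(iris_data_train)  # fit() learns the scaling rule from the training set

# Apply the fitted Scaler's transform() to both the training set and the test set
iris_trainScaler = Scaler.transform(iris_data_train)
iris_testScaler = Scaler.transform(iris_data_test)

print(np.min(iris_data_train))   # training-set minimum before scaling
print(np.min(iris_trainScaler))  # training-set minimum after scaling
print(np.max(iris_data_train))   # training-set maximum before scaling
print(np.max(iris_trainScaler))  # training-set maximum after scaling
print(np.min(iris_data_test))    # test-set minimum before scaling
print(np.min(iris_testScaler))   # test-set minimum after scaling
print(np.max(iris_data_test))    # test-set maximum before scaling
print(np.max(iris_testScaler))   # test-set maximum after scaling

# After scaling, the training data maps entirely into [0, 1], while a small part of the
# test data falls outside [0, 1], because the scaling rule was learned from the training set only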

0.1
0.0
7.7
1.0
0.1
0.0
7.9
1.0588235294117647
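
Why a test value can land outside [0, 1] becomes clear when the scaling is done by hand. MinMaxScaler computes (x - min) / (max - min) for each column, with min and max taken from the data it was fitted on. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: the test rows contain values outside the training range
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[0.0, 150.0], [4.0, 250.0]])

scaler = MinMaxScaler().fit(train)

# Same computation by hand, per column, using the training minimum and maximum
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (test - col_min) / (col_max - col_min)

print(scaler.transform(test))
print(np.allclose(scaler.transform(test), manual))
```

In the first column the training range is 1 to 3, so the test values 0 and 4 map to -0.5 and 1.5.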


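② PCA dimensionality reduction

The goal of dimensionality reduction is to simplify the data without losing too much information. The steps are as follows:

Type along with the code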

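# PCA dimensionality reduction
from sklearn.decomposition import PCA  # import the PCA class
pca_model = PCA().fit(iris_trainScaler)  # fit() learns the projection from the scaled training set

# Apply the learned projection to the training data and the test data
iris_trainPca = pca_model.transform(iris_trainScaler)
iris_testPca = pca_model.transform(iris_testScaler)

print(iris_trainScaler.shape)
print(iris_trainPca.shape)
print(iris_testScaler.shape)
print(iris_testPca.shape)
# The book says this reduces the amount of data and speeds up later processing, but the shapes did not shrink =_=!
# The reason: PCA() with the default n_components=None keeps all 4 components.
# Pass a smaller value, e.g. PCA(n_components=2), to actually reduce the dimensionality.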
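
A sketch of PCA that does shrink the data: with n_components set, the transformed arrays have that many columns. The value n_components=2 is an illustrative choice, and the variable names differ from the walkthrough above so that the snippet runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale with statistics learned from the training set only
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# n_components=2 is an illustrative choice; the default (None) keeps all components
pca = PCA(n_components=2).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(X_train_pca.shape)   # (120, 2)
print(X_test_pca.shape)    # (30, 2)
print(pca.explained_variance_ratio_.sum())  # share of the variance kept by the 2 components
```

The explained_variance_ratio_ attribute gives a way to judge how much information a given n_components retains.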