5. KFold, StratifiedKFold, StratifiedShuffleSplit, GroupKFold: how they differ, plus a Stratified Group KFold implementation

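In machine learning you generally should not train and evaluate on the entire dataset at once; cross-validation is used instead. Rotating which part of the data is held out reduces the influence of any single lucky or unlucky split, makes overfitting easier to detect, and lets a model learn from limited data while still giving an estimate of how well it generalizes. The splitters used most often in sklearn are KFold, StratifiedKFold, StratifiedShuffleSplit and GroupKFold. The sections below explain how they differ, one by one, using the small df shown below. (In practice n_splits is usually 5 or 10.)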

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold,\
            StratifiedShuffleSplit, GroupKFold, GroupShuffleSplit
          
          
df2 = pd.DataFrame([[6.5, 1, 2],
            [8, 1, 0],
            [61, 2, 1],
            [54, 0, 1],
            [78, 0, 1],
            [119, 2, 2],
            [111, 1, 2],
            [23, 0, 0],
            [31, 2, 0]], columns=['h', 'w', 'class'])
df2

       h  w  class
0    6.5  1      2
1    8.0  1      0
2   61.0  2      1
3   54.0  0      1
4   78.0  0      1
5  119.0  2      2
6  111.0  1      2
7   23.0  0      0
8   31.0  2      0

1. Using KFold

X = df2.drop(['class'], axis=1)
y = df2['class']
folder = KFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in folder.split(X, y):
    print("KFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
    # print(X.iloc[train_idx], y.iloc[train_idx], '\n', X.iloc[test_idx], y.iloc[test_idx])
===================================================================
KFold Splitting:
Train index: [0 1 3 5 6 8] | test index: [2 4 7]
KFold Splitting:
Train index: [0 2 3 4 7 8] | test index: [1 5 6]
KFold Splitting:
Train index: [1 2 4 5 6 7] | test index: [0 3 8]

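Note that the split returns indices into the data, not the data itself. Looking only at the test indices for now, the folds are not balanced across the values of class: the first test fold [2, 4, 7] has classes 1, 1, 0. The same is true of the train indices, whose classes are 2, 0, 1, 2, 2, 0. This is often not acceptable, because we usually want the target classes to be evenly represented in every train and validation set.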

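Interestingly, if you try n_splits=8 or 9, you will see that the number of test indices per fold depends on the number of splits. For example, with n_splits=8 the first fold has n_samples // n_splits + 1 = 2 test indices and the remaining folds have 1 each. The documentation states the rule: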

The first n_samples % n_splits folds have size n_samples // n_splits + 1, other folds have size n_samples // n_splits, where n_samples is the number of samples.

(source: KFold documentation)
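The fold-size rule quoted above can be checked directly. The following is a minimal sketch that assumes a stand-in array of 9 samples (the same length as the toy df) rather than the df itself; the name fold_sizes_8 is only illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data: 9 samples, the same length as the toy df above
X = np.arange(9).reshape(-1, 1)

# Print the test-fold sizes for a few values of n_splits
for n_splits in (3, 8, 9):
    sizes = [len(test_idx) for _, test_idx in KFold(n_splits=n_splits).split(X)]
    print(n_splits, sizes)

# With 9 samples and 8 folds, 9 % 8 = 1 fold gets 9 // 8 + 1 = 2 samples
fold_sizes_8 = [len(test_idx) for _, test_idx in KFold(n_splits=8).split(X)]
```

The fold sizes depend only on n_samples and n_splits, so turning shuffle on changes which samples land in each fold but not how many.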

So KFold does not split evenly by target class. What if the dataset must be split according to the target classes? That is what StratifiedKFold is for.

2. Using StratifiedKFold

sfolder = StratifiedKFold(n_splits=3, random_state=2020, shuffle=True)
for train_idx, test_idx in sfolder.split(X, y):
    print("StratifiedKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))

======================================================
StratifiedKFold Splitting:
Train index: [0 3 4 5 7 8] | test index: [1 2 6]
StratifiedKFold Splitting:
Train index: [1 2 3 5 6 8] | test index: [0 4 7]
StratifiedKFold Splitting:
Train index: [0 1 2 4 6 7] | test index: [3 5 8]
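One way to confirm that the folds are balanced is to count the classes in each test fold. This is a small sketch, assuming the class column of the toy df is copied into a NumPy array; test_class_counts is an illustrative name, not part of the original code.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# The 'class' column of the toy df, copied by hand
y = np.array([2, 0, 1, 1, 1, 2, 2, 0, 0])
# The feature values do not influence a stratified split
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=2020)
test_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many samples of each class landed in this test fold
    counts = Counter(y[test_idx].tolist())
    test_class_counts.append(dict(counts))
    print(test_idx, dict(counts))
```

With three samples per class and three folds, every test fold should receive exactly one sample of each class.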

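This time the first test fold is [1 2 6], whose classes are 0, 1, 2, one sample per class; the train indices can be checked the same way. In other words, the target classes are evenly distributed in every split. But some data has another kind of structure: if the feature column w in df also represents a category, what if we want rows sharing the same value of w to stay together, in the spirit of df.groupby? GroupKFold does this.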

3. Using GroupKFold

gfolder = GroupKFold(n_splits=3)
for train_idx, test_idx in gfolder.split(X, y, groups=X['w']):
    print("GroupKFold Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))

========================================================================
GroupKFold Splitting:
Train index: [0 1 3 4 6 7] | test index: [2 5 8]
GroupKFold Splitting:
Train index: [2 3 4 5 7 8] | test index: [0 1 6]
GroupKFold Splitting:
Train index: [0 1 2 5 6 8] | test index: [3 4 7]

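Here the first test fold is [2 5 8], where w is 2 for every row, and [0 1 6] has w equal to 1, so the data is split group by group and no group appears in both the train and test sets of a fold. Try groups=y to see what happens.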
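The group separation can also be verified programmatically. The sketch below assumes the w column of the toy df is copied into a NumPy array and collects, for each fold, the groups shared by train and test; overlaps is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# The 'w' column of the toy df, used as the group label
groups = np.array([1, 1, 2, 0, 0, 2, 1, 0, 2])
# The feature values do not influence a group split
X = np.zeros((len(groups), 1))

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    # Groups present on both sides of the split (expected to be empty)
    overlaps.append(train_groups & test_groups)
    print(sorted(train_groups), sorted(test_groups))
```

An empty intersection in every fold is the property that prevents related rows from leaking between train and test.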

4. Using StratifiedShuffleSplit

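StratifiedShuffleSplit is a hybrid of StratifiedKFold and ShuffleSplit. Its biggest difference from StratifiedKFold is that every split is drawn independently, so the same sample can appear in more than one test set: in the output below the first test set is [1 5 4] and the second is [8 0 4], both containing index 4. Two splits could even turn out identical; the documentation says it does "not guarantee that all folds will be different".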

# test_size must be >= the number of classes; test sets of different splits may overlap
shuffle_split = StratifiedShuffleSplit(n_splits=3, random_state=2020, test_size=3)
for train_idx, test_idx in shuffle_split.split(X, y):
    print("StratifiedShuffleSplit Splitting:")
    print('Train index: %s | test index: %s' % (train_idx, test_idx))
====================================================================
StratifiedShuffleSplit Splitting:
Train index: [8 2 3 0 6 7] | test index: [1 5 4]
StratifiedShuffleSplit Splitting:
Train index: [3 1 6 2 7 5] | test index: [8 0 4]
StratifiedShuffleSplit Splitting:
Train index: [1 8 2 6 0 4] | test index: [7 3 5]
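To make the difference from StratifiedKFold concrete, one can count how often each sample is used for testing. This is a rough sketch; count_test_usage is a hypothetical helper written for this illustration, and the class labels are copied by hand from the toy df.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

# The 'class' column of the toy df, copied by hand
y = np.array([2, 0, 1, 1, 1, 2, 2, 0, 0])
X = np.zeros((len(y), 1))

def count_test_usage(splitter):
    """Hypothetical helper: how many times each sample lands in a test set."""
    usage = np.zeros(len(y), dtype=int)
    for _, test_idx in splitter.split(X, y):
        usage[test_idx] += 1
    return usage

kfold_usage = count_test_usage(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0))
shuffle_usage = count_test_usage(
    StratifiedShuffleSplit(n_splits=3, test_size=3, random_state=0))

print("StratifiedKFold       :", kfold_usage)
print("StratifiedShuffleSplit:", shuffle_usage)
```

StratifiedKFold tests every sample exactly once, whereas StratifiedShuffleSplit only guarantees the size of each test set, so some samples may be tested several times and others never.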

5. Implementing Stratified Group KFold

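Many datasets today are heavily imbalanced, and training may require a split that respects both a grouping column and the distribution of the target column. Stratified Group KFold was proposed for this; it can be seen as a hybrid of GroupKFold and StratifiedKFold.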

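The code below comes from stratifiedgroupkfold and uses the sklearn iris dataset. An extra ID column is added so that groups=df['ID'], and the goal is that after splitting, y in the train and validation sets still follows the same distribution as in the original dataset.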

import numpy as np
import pandas as pd
import random
from sklearn.model_selection import GroupKFold
from collections import Counter, defaultdict
from sklearn.datasets import load_iris

def read_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['target'] = iris.target

    # Define a new ID column to act as the group label
    list_id = ['A', 'B', 'C', 'D', 'E']
    df['ID'] = np.random.choice(list_id, len(df))

    features = iris.feature_names
    return df, features

df, features = read_data()
print(df.sample(6))

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target ID
133                6.3               2.8                5.1               1.5       2  C
21                 5.1               3.7                1.5               0.4       0  A
84                 5.4               3.0                4.5               1.5       1  A
62                 6.0               2.2                4.0               1.0       1  D
5                  5.4               3.9                1.7               0.4       0  B
132                6.4               2.8                5.6               2.2       2  E

Step-by-step implementation of StratifiedGroupKFold:

def count_y(y, groups):
    """Count how many samples of each y class fall in each group."""
    unique_num = np.max(y) + 1
    # A missing key defaults to np.zeros(unique_num)
    y_counts_per_group = defaultdict(lambda : np.zeros(unique_num))

    for label, g  in zip(y, groups):
        y_counts_per_group[g][label] += 1

    # defaultdict(<function__main__.<lambda>>,
    # {'A': array([5., 9., 8.]),
    # 'B': array([11., 12., 10.]),
    # 'C': array([13., 8., 8.]),
    # 'D': array([9., 11., 11.]),
    # 'E': array([12., 10., 13.])})
    return y_counts_per_group

def StratiiedGroupKFold(X, y, groups, features, k, seed=None):
    """
    Stratified group k-fold: yield the train / validation indices of each fold.
    :param X: dataset X (must still contain the 'ID' column)
    :param y: target y
    :param groups: groups that must not be split across train and validation
    :param features: feature column names (not needed to compute the split)
    :param k: number of splits (n_splits)
    :param seed: seed for resampling the validation indices
    """
    max_y = np.max(y)
    # Dictionary of per-group class counts
    y_counts_per_group = count_y(y, groups)
    # Seed once so that the resampling below is reproducible
    rng = np.random.default_rng(seed)
    gf = GroupKFold(n_splits=k)
    for train_idx, val_idx in gf.split(X, y, groups):
        # IDs (groups) present in the train and validation parts of this fold
        id_train = X.iloc[train_idx]['ID'].unique()
        y_val = y.iloc[val_idx]
        id_val = X.iloc[val_idx]['ID'].unique()

        # Count each y class in the training and validation sets
        y_counts_train = np.zeros(max_y + 1)
        y_counts_val = np.zeros(max_y + 1)
        for group_id in id_train:
            y_counts_train += y_counts_per_group[group_id]
        for group_id in id_val:
            y_counts_val += y_counts_per_group[group_id]

        # Ratio of each class in train relative to its most frequent class
        numratio_train = y_counts_train / np.max(y_counts_train)
        # Stratified counts: the validation count of train's most frequent class,
        # scaled by numratio_train and rounded up
        stratified_count = np.ceil(y_counts_val[np.argmax(y_counts_train)] * numratio_train).astype(int)

        # Resample validation indices per class (with replacement, so duplicates can occur)
        val_idx = np.array([])
        for num in range(max_y + 1):
            candidates = y_val[y_val == num].index.to_numpy()
            val_idx = np.append(val_idx, rng.choice(candidates, stratified_count[num]))
        val_idx = val_idx.astype(int)

        yield train_idx, val_idx

Let's look at the result of the split:

def get_distribution(y_vals):
    """Return the share of each class in y_vals as a percentage string."""
    y_distribut = Counter(y_vals)
    y_vals_sum = sum(y_distribut.values())
    return [f'{y_distribut[i]/y_vals_sum:.2%}' for i in range(np.max(y_vals) + 1)]

X = df.drop('target', axis=1)
y = df['target']
groups = df['ID']

distribution = [get_distribution(y)]
index = ['all dataset']

# Inspect each fold; columns must be an ordered list (a set would scramble the labels)
for fold, (train_idx, val_idx) in enumerate(StratiiedGroupKFold(X, y, groups, features, k=3, seed=2020)):
    print(f'Train ID - fold {fold:1d}:{groups[train_idx].unique()}\
       Test ID - fold {fold:1d}:{groups[val_idx].unique()}')

    distribution.append(get_distribution(y[train_idx]))
    index.append(f'train set - fold{fold:1d}')
    distribution.append(get_distribution(y[val_idx]))
    index.append(f'valid set - fold{fold:1d}')
print(pd.DataFrame(distribution, index=index, columns=[f' Label{l:2d}' for l in range(np.max(y)+1)]))

Train ID - fold 0:['B' 'A' 'C' 'D']   Test ID - fold 0:['E']
Train ID - fold 1:['A' 'D' 'E']       Test ID - fold 1:['B' 'C']
Train ID - fold 2:['B' 'C' 'E']       Test ID - fold 2:['A' 'D']
                   Label 0  Label 1  Label 2
all dataset         33.33%   33.33%   33.33%
train set - fold0   32.48%   31.62%   35.90%
valid set - fold0   33.33%   33.33%   33.33%
train set - fold1   34.44%   33.33%   32.22%
valid set - fold1   33.93%   33.93%   32.14%
train set - fold2   33.33%   35.48%   31.18%
valid set - fold2   33.33%   35.42%   31.25%

General-purpose implementation:

def stratified_group_k_fold(X, y, groups, k, seed=None):
    labels_num = np.max(y) + 1
    # Class counts per group, and class counts over the whole dataset
    y_counts_per_group = defaultdict(lambda: np.zeros(labels_num))
    y_distr = Counter()
    for label, g in zip(y, groups):
        y_counts_per_group[g][label] += 1
        y_distr[label] += 1

    # Class counts and group membership accumulated for each fold
    y_counts_per_fold = defaultdict(lambda: np.zeros(labels_num))
    groups_per_fold = defaultdict(set)

    def eval_y_counts_per_fold(y_counts, fold):
        # Tentatively add the group to `fold`, then measure how unevenly each
        # label is spread over the k folds (mean of the per-label std)
        y_counts_per_fold[fold] += y_counts
        std_per_label = []
        for label in range(labels_num):
            label_std = np.std([y_counts_per_fold[i][label] / y_distr[label] for i in range(k)])
            std_per_label.append(label_std)
        # Undo the tentative assignment
        y_counts_per_fold[fold] -= y_counts
        return np.mean(std_per_label)

    # Shuffle for randomness, then process the most imbalanced groups first
    groups_and_y_counts = list(y_counts_per_group.items())
    random.Random(seed).shuffle(groups_and_y_counts)

    for g, y_counts in sorted(groups_and_y_counts, key=lambda x: -np.std(x[1])):
        # Greedily assign the group to the fold that keeps the labels most balanced
        best_fold = None
        min_eval = None
        for i in range(k):
            fold_eval = eval_y_counts_per_fold(y_counts, i)
            if min_eval is None or fold_eval < min_eval:
                min_eval = fold_eval
                best_fold = i
        y_counts_per_fold[best_fold] += y_counts
        groups_per_fold[best_fold].add(g)

    all_groups = set(groups)
    for i in range(k):
        train_groups = all_groups - groups_per_fold[i]
        test_groups = groups_per_fold[i]

        # Convert group membership back into positional sample indices
        train_indices = [idx for idx, g in enumerate(groups) if g in train_groups]
        test_indices = [idx for idx, g in enumerate(groups) if g in test_groups]

        yield train_indices, test_indices