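Introduction to Binning
Binning is a data preprocessing operation that converts a continuous variable into a categorical one. This article introduces two ways to bin data: hand-written if conditions and pd.cut.
Data Preparation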
import numpy as np
import pandas as pd

# Draw 20 random integer scores in the range [25, 100)
score_list = np.random.randint(25, 100, size=20)
df = pd.DataFrame()
df['score'] = score_list
print(df)
    score
0      66
1      44
2      30
3      28
4      98
5      48
6      81
7      67
8      46
9      78
10     72
11     30
12     30
13     34
14     40
15     37
16     93
17     62
18     52
19     30
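Because the scores are drawn at random, every run produces a different table. The following is a minimal sketch, assuming you want the same scores on every run; the seed value 42 and the use of NumPy's `default_rng` generator are assumptions for illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd

# Assumption: any fixed seed works; 42 is an arbitrary choice
rng = np.random.default_rng(42)

# Same range as the original code: integers in [25, 100), 20 values
score_list = rng.integers(25, 100, size=20)

df = pd.DataFrame({"score": score_list})
print(df)
```

With a fixed seed, the binning results in the later sections can be compared against the same input data.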
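Binning with if statements
def get_score_level(x):
    # Scores up to 59 map to 'A', 60 to 70 map to 'B', everything above maps to 'C'
    if x <= 59:
        return 'A'
    if x <= 70:
        return 'B'
    return 'C'

# Apply the function to every score and store the label in a new column
df.loc[:, "age_type"] = df['score'].apply(get_score_level)
print(df)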
    score age_type
0      66        B
1      44        A
2      30        A
3      28        A
4      98        C
5      48        A
6      81        C
7      67        B
8      46        A
9      78        C
10     72        C
11     30        A
12     30        A
13     34        A
14     40        A
15     37        A
16     93        C
17     62        B
18     52        A
19     30        A
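Binning with pd.cut
# Bin edges; by default pd.cut uses right-closed intervals: (0, 59], (59, 70], (70, 100]
bins = [0, 59, 70, 100]
df['Categories'] = pd.cut(df['score'], bins, labels=['A', 'B', 'C'])
print(df)
# Note: the scores are random, so the output below (from a separate run,
# without the age_type column) differs from the data shown earlier.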
    score Categories
0      83          C
1      99          C
2      58          A
3      90          C
4      44          A
5      73          C
6      37          A
7      51          A
8      41          A
9      90          C
10     76          C
11     41          A
12     42          A
13     63          B
14     80          C
15     36          A
16     55          A
17     53          A
18     42          A
19     68          B
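The two methods should assign the same labels, since the if conditions and the bin edges describe the same boundaries. The sketch below checks this on the 20 scores printed in the data preparation section; the column names `by_if` and `by_cut` and the variable names `same`, `counts`, and `edge_labels` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Scores copied from the first printed table in this article
scores = [66, 44, 30, 28, 98, 48, 81, 67, 46, 78,
          72, 30, 30, 34, 40, 37, 93, 62, 52, 30]
df = pd.DataFrame({"score": scores})

def get_score_level(x):
    if x <= 59:
        return "A"
    if x <= 70:
        return "B"
    return "C"

bins = [0, 59, 70, 100]
labels = ["A", "B", "C"]

# Label each score with both methods
df["by_if"] = df["score"].apply(get_score_level)
df["by_cut"] = pd.cut(df["score"], bins, labels=labels)

# pd.cut returns a categorical column, so convert to str before comparing
same = (df["by_if"] == df["by_cut"].astype(str)).all()
print(same)  # True

# Number of scores in each bin
counts = df["by_cut"].value_counts(sort=False)
print(counts)

# Boundary values: 59 falls in (0, 59], 60 and 70 in (59, 70], 71 in (70, 100]
edge_labels = list(pd.cut([59, 60, 70, 71], bins, labels=labels))
print(edge_labels)  # ['A', 'B', 'B', 'C']
```

Note that a score of 0 or below, or above 100, would fall outside the bin edges and pd.cut would return NaN for it, whereas the if-based function would still return 'A' or 'C'.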