在python环境中实现 R for Data Science https://r4ds.had.co.nz/ 中的5,12,13三章中的功能
1.Data transformation数据转换
参考python dfply文档https://github.com/kieferk/dfply/blob/master/README.md.
!pip install dfply
from dfply import *
import numpy as np
import pandas as pd
加载内置的数据集diamonds数据集,数据集共53940行,有carat、cut、color、clarity、depth、table、price、x、y、z共10列,对应每个钻石的一些参数值。
diamonds.head()
| carat | cut | color | clarity | depth | table | price | x | y | z |
---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
---|
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
---|
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
---|
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
---|
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
---|
1.1 Filter rows过滤筛选
diamonds >> filter_by(X.cut == 'Ideal') >> head()
| carat | cut | color | clarity | depth | table | price | x | y | z |
---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
---|
11 | 0.23 | Ideal | J | VS1 | 62.8 | 56.0 | 340 | 3.93 | 3.90 | 2.46 |
---|
13 | 0.31 | Ideal | J | SI2 | 62.2 | 54.0 | 344 | 4.35 | 4.37 | 2.71 |
---|
16 | 0.30 | Ideal | I | SI2 | 62.0 | 54.0 | 348 | 4.31 | 4.34 | 2.68 |
---|
39 | 0.33 | Ideal | I | SI2 | 61.8 | 55.0 | 403 | 4.49 | 4.51 | 2.78 |
---|
管道函数是将上一步的结果直接传参给下一步的函数,从而省略了中间的赋值步骤,可以大量减少内存中的对象,节省内存。
diamonds >> filter_by(X.cut == 'Ideal', X.price < 337)
| carat | cut | color | clarity | depth | table | price | x | y | z |
---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
---|
diamonds >> mask(X.cut == 'Ideal', X.color == 'E', X.table < 55, X.price < 500)
| carat | cut | color | clarity | depth | table | price | x | y | z |
---|
26683 | 0.33 | Ideal | E | SI2 | 62.2 | 54.0 | 427 | 4.44 | 4.46 | 2.77 |
---|
32297 | 0.34 | Ideal | E | SI2 | 62.4 | 54.0 | 454 | 4.49 | 4.52 | 2.81 |
---|
40928 | 0.30 | Ideal | E | SI1 | 61.6 | 54.0 | 499 | 4.32 | 4.35 | 2.67 |
---|
50623 | 0.30 | Ideal | E | SI2 | 62.1 | 54.0 | 401 | 4.32 | 4.35 | 2.69 |
---|
50625 | 0.30 | Ideal | E | SI2 | 62.0 | 54.0 | 401 | 4.33 | 4.35 | 2.69 |
---|
1.2 Arrange rows排列
diamonds >> arrange(X.table, ascending=False) >> head(5)
| carat | cut | color | clarity | depth | table | price | x | y | z |
---|
24932 | 2.01 | Fair | F | SI1 | 58.6 | 95.0 | 13387 | 8.32 | 8.31 | 4.87 |
---|
50773 | 0.81 | Fair | F | SI2 | 68.8 | 79.0 | 2301 | 5.26 | 5.20 | 3.58 |
---|
51342 | 0.79 | Fair | G | SI1 | 65.3 | 76.0 | 2362 | 5.52 | 5.13 | 3.35 |
---|
52860 | 0.50 | Fair | E | VS2 | 79.0 | 73.0 | 2579 | 5.21 | 5.18 | 4.09 |
---|
49375 | 0.70 | Fair | H | VS1 | 62.0 | 73.0 | 2100 | 5.65 | 5.54 | 3.47 |
---|
1.3 Select columns选择
diamonds >> select(X.cut, 'price', X.x) >> head(3)
| cut | price | x |
---|
0 | Ideal | 326 | 3.95 |
---|
1 | Premium | 326 | 3.89 |
---|
2 | Good | 327 | 4.05 |
---|
diamonds >> select(1, X.price, ['x', 'y']) >> head(2)
| cut | price | x | y |
---|
0 | Ideal | 326 | 3.95 | 3.98 |
---|
1 | Premium | 326 | 3.89 | 3.84 |
---|
starts_with(‘c’) :查找以字符c前缀开头的列; ends_with(‘c’):查找以字符c后缀结束的列; contains(‘c’):筛选出包包字符c的列; everything():所有列。
diamonds >> select(starts_with('c')) >> head(3)
| carat | cut | color | clarity |
---|
0 | 0.23 | Ideal | E | SI2 |
---|
1 | 0.21 | Premium | E | SI1 |
---|
2 | 0.23 | Good | E | VS1 |
---|
1.4 Add new variables添加新变量
可以使用mutate()函数创建新变量,在一次调用中可创建多个变量
diamonds >> mutate(x_plus_y=X.x + X.y) >> select(columns_from('x')) >> head(3)
| x | y | z | x_plus_y |
---|
0 | 3.95 | 3.98 | 2.43 | 7.93 |
---|
1 | 3.89 | 3.84 | 2.31 | 7.73 |
---|
2 | 4.05 | 4.07 | 2.31 | 8.12 |
---|
diamonds >> transmute(x_plus_y=X.x + X.y, y_div_z=(X.y / X.z)) >> head(3)
| x_plus_y | y_div_z |
---|
0 | 7.93 | 1.637860 |
---|
1 | 7.73 | 1.662338 |
---|
2 | 8.12 | 1.761905 |
---|
1.5 Grouped summaries分组汇总
diamonds >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
| price_mean | price_std |
---|
0 | 3932.799722 | 3989.439738 |
---|
diamonds >> group_by('cut') >> summarize(price_mean=X.price.mean(), price_std=X.price.std())
| cut | price_mean | price_std |
---|
0 | Fair | 4358.757764 | 3560.386612 |
---|
1 | Good | 3928.864452 | 3681.589584 |
---|
2 | Ideal | 3457.541970 | 3808.401172 |
---|
3 | Premium | 4584.257704 | 4349.204961 |
---|
4 | Very Good | 3981.759891 | 3935.862161 |
---|
diamonds >> summarize_each([np.mean, np.var], X.price, 'depth')
| price_mean | price_var | depth_mean | depth_var |
---|
0 | 3932.799722 | 1.591533e+07 | 61.749405 | 2.052366 |
---|
diamonds >> group_by(X.cut) >> summarize_each([np.mean, np.var], X.price, 4)
| cut | price_mean | price_var | depth_mean | depth_var |
---|
0 | Fair | 4358.757764 | 1.266848e+07 | 64.041677 | 13.266319 |
---|
1 | Good | 3928.864452 | 1.355134e+07 | 62.365879 | 4.705224 |
---|
2 | Ideal | 3457.541970 | 1.450325e+07 | 61.709401 | 0.516274 |
---|
3 | Premium | 4584.257704 | 1.891421e+07 | 61.264673 | 1.342755 |
---|
4 | Very Good | 3981.759891 | 1.548973e+07 | 61.818275 | 1.900466 |
---|
2.Tidy data数据整理
2.1Pivoting旋转
import pandas as pd
import numpy as np
mydata=pd.DataFrame({
"Name":["苹果","谷歌","脸书","亚马逊","腾讯"],
"Conpany":["Apple","Google","Facebook","Amozon","Tencent"],
"Sale2013":[5000,3500,2300,2100,3100],
"Sale2014":[5050,3800,2900,2500,3300],
"Sale2015":[5050,3800,2900,2500,3300],
"Sale2016":[5050,3800,2900,2500,3300]
})
mydata
| Name | Conpany | Sale2013 | Sale2014 | Sale2015 | Sale2016 |
---|
0 | 苹果 | Apple | 5000 | 5050 | 5050 | 5050 |
---|
1 | 谷歌 | Google | 3500 | 3800 | 3800 | 3800 |
---|
2 | 脸书 | Facebook | 2300 | 2900 | 2900 | 2900 |
---|
3 | 亚马逊 | Amozon | 2100 | 2500 | 2500 | 2500 |
---|
4 | 腾讯 | Tencent | 3100 | 3300 | 3300 | 3300 |
---|
python中melt函数(数据宽转长)与R中pivot_longer函数类似,通过减少列并将它们转换为值来延长数据框
mydata1=mydata.melt(id_vars=["Name","Conpany"],
var_name="Year",
value_name="Sale" )
mydata1
| Name | Conpany | Year | Sale |
---|
0 | 苹果 | Apple | Sale2013 | 5000 |
---|
1 | 谷歌 | Google | Sale2013 | 3500 |
---|
2 | 脸书 | Facebook | Sale2013 | 2300 |
---|
3 | 亚马逊 | Amozon | Sale2013 | 2100 |
---|
4 | 腾讯 | Tencent | Sale2013 | 3100 |
---|
5 | 苹果 | Apple | Sale2014 | 5050 |
---|
R中pivot_wider函数实现的功能可用python中pivot_table函数(数据长转宽)代替,跨多列传播键值对
mydata1.pivot_table(index=["Name","Conpany"],
columns=["Year"],
values=["Sale"])
| | Sale |
---|
| Year | Sale2013 | Sale2014 | Sale2015 | Sale2016 |
---|
Name | Conpany | | | | |
---|
亚马逊 | Amozon | 2100 | 2500 | 2500 | 2500 |
---|
脸书 | Facebook | 2300 | 2900 | 2900 | 2900 |
---|
腾讯 | Tencent | 3100 | 3300 | 3300 | 3300 |
---|
苹果 | Apple | 5000 | 5050 | 5050 | 5050 |
---|
谷歌 | Google | 3500 | 3800 | 3800 | 3800 |
---|
也可使用plydata库进行数据整理
python中的plydata库基于 R 中的 dplyr、tidyr 和 forcats 包,许多函数名称都是直接借用过来的
https://plydata.readthedocs.io/en/stable/api.html.
2.2 Separating and uniting分离与融合
df = pd.DataFrame({'alpha': 1,'x': ['a,1', 'b,2', 'c,3'],'zeta': 6})
print(df)
df >> separate('x', into=['A', 'B'], remove=False)
| alpha | x | zeta | A | B |
---|
0 | 1 | a,1 | 6 | a | 1 |
---|
1 | 1 | b,2 | 6 | b | 2 |
---|
2 | 1 | c,3 | 6 | c | 3 |
---|
df = pd.DataFrame({'c1': [1, 2, 3, 4, None],'c2': list('abcde'),'c3': list('vwxyz')})
df >> unite('c1c2', 'c1', 'c2', na_rm=True)
[‘c1’, ‘c2’] _ True maintain
| c3 | c1c2 |
---|
0 | v | 1.0_a |
---|
1 | w | 2.0_b |
---|
2 | x | 3.0_c |
---|
3 | y | 4.0_d |
---|
4 | z | NaN |
---|
2.3 Missing values缺失值
df
| c1 | c2 | c3 |
---|
0 | 1.0 | a | v |
---|
1 | 2.0 | b | w |
---|
2 | 3.0 | c | x |
---|
3 | 4.0 | d | y |
---|
4 | NaN | e | z |
---|
print(df.isnull())
print(df.isnull().sum())
删除缺失值可用df.dropna(),也可以给这个方法传入how="all"参数,只有在整行为空的前提下才删除
df.fillna('*')
| c1 | c2 | c3 |
---|
0 | 1.0 | a | v |
---|
1 | 2.0 | b | w |
---|
2 | 3.0 | c | x |
---|
3 | 4.0 | d | y |
---|
4 | * | e | z |
---|
3.Relational data关系数据
3.1 pd.merge(),pd.concat()
pd.merge()官方文档 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html?highlight=merge#pandas.merge.
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
print(df1)
print(df2)
df1.merge(df2, how='inner', on='a')
df1.merge(df2, how='cross')
| a_x | b | a_y | c |
---|
0 | foo | 1 | foo | 3 |
---|
1 | foo | 1 | baz | 4 |
---|
2 | bar | 2 | foo | 3 |
---|
3 | bar | 2 | baz | 4 |
---|
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
columns=['letter', 'number', 'animal'])
pd.concat([df1, df3], join="inner")
3.2运用dfply库中的join系列函数
a = pd.DataFrame({'x1':['A','B','C'],'x2':[1,2,3]})
b = pd.DataFrame({'x1':['A','B','D'],'x3':[True,False,True]})
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-HDlnjDWV-1639040492974)(attachment:image.png)]
a >> inner_join(b, by='x1')
a >> outer_join(b, by='x1')
a >> left_join(b, by='x1')
a >> right_join(b, by='x1')
a >> semi_join(b, by='x1')
a >> anti_join(b, by='x1')
|