Modules used:
import pandas as pd
import numpy as np
First, let's look at the raw table data (the data is meaningless; it's only here to experiment with).
Get all the values:
[[Timestamp('2021-07-01 00:00:00') 'France' 44.0 72000.0 'No']
[Timestamp('2021-08-01 00:00:00') 'Spain' 27.0 48000.0 'Yes']
[Timestamp('2021-07-01 00:00:00') 'Germany' 30.0 54000.0 'No']
[Timestamp('2021-09-15 00:00:00') 'Spain' 38.0 61000.0 'No']
[Timestamp('2021-09-15 00:00:00') 'Germany' 40.0 nan 'Yes']
[Timestamp('2021-07-01 00:00:00') 'France' 35.0 58000.0 'Yes']
[Timestamp('2021-08-01 00:00:00') 'Spain' nan 52000.0 'No']
[Timestamp('2021-09-15 00:00:00') 'France' 48.0 79000.0 'Yes']
[Timestamp('2021-09-15 00:00:00') 'Germany' 50.0 83000.0 'No']
[Timestamp('2021-07-01 00:00:00') 'France' 37.0 67000.0 'Yes']]
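The post doesn't show how `df` was loaded, so here is a minimal sketch that rebuilds it by hand from the values printed above. The name `rows` and the hand-typed construction are my own assumptions; the original data most likely came from a file.

```python
import numpy as np
import pandas as pd

# Rows copied from the printed values above; `rows` is an illustrative name.
rows = [
    ("2021-07-01", "France", 44.0, 72000.0, "No"),
    ("2021-08-01", "Spain", 27.0, 48000.0, "Yes"),
    ("2021-07-01", "Germany", 30.0, 54000.0, "No"),
    ("2021-09-15", "Spain", 38.0, 61000.0, "No"),
    ("2021-09-15", "Germany", 40.0, np.nan, "Yes"),
    ("2021-07-01", "France", 35.0, 58000.0, "Yes"),
    ("2021-08-01", "Spain", np.nan, 52000.0, "No"),
    ("2021-09-15", "France", 48.0, 79000.0, "Yes"),
    ("2021-09-15", "Germany", 50.0, 83000.0, "No"),
    ("2021-07-01", "France", 37.0, 67000.0, "Yes"),
]
df = pd.DataFrame(rows, columns=["Date", "Country", "Age", "Salary", "Purchased"])
df["Date"] = pd.to_datetime(df["Date"])  # parse the date strings into Timestamps

print(df.values)  # every cell as a 2-D object array
```

With `df` built this way, the later snippets in this post can be run as written.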
Take 5 random sample rows:
print(df.sample(5))
Output:
        Date  Country   Age   Salary Purchased
8 2021-09-15  Germany  50.0  83000.0        No
3 2021-09-15    Spain  38.0  61000.0        No
5 2021-07-01   France  35.0  58000.0       Yes
1 2021-08-01    Spain  27.0  48000.0       Yes
6 2021-08-01    Spain   NaN  52000.0        No
Extract the month from each date into a new column, then take 5 samples to check the result:
df.insert(1, "月份", df["Date"].apply(lambda x: x.month))  # "月份" means "month"
print(df.sample(5))
        Date  月份  Country   Age   Salary Purchased
9 2021-07-01   7   France  37.0  67000.0       Yes
6 2021-08-01   8    Spain   NaN  52000.0        No
7 2021-09-15   9   France  48.0  79000.0       Yes
2 2021-07-01   7  Germany  30.0  54000.0        No
3 2021-09-15   9    Spain  38.0  61000.0        No
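As an aside, the month can also be pulled out with the vectorized `.dt` accessor instead of `apply`. This is a small sketch on a stand-in frame (`demo` is a made-up name, using three dates from the data above), not a change to the code in this post:

```python
import pandas as pd

# Stand-in frame with three of the dates used in this post.
demo = pd.DataFrame({"Date": pd.to_datetime(["2021-07-01", "2021-08-01", "2021-09-15"])})

via_apply = demo["Date"].apply(lambda x: x.month)  # element-wise, as in the post
via_dt = demo["Date"].dt.month                     # vectorized datetime accessor

print(via_apply.tolist())  # [7, 8, 9]
print(via_dt.tolist())     # [7, 8, 9]
```

Both give the same month numbers. Also note that `df.insert` raises a `ValueError` if the column already exists, so re-running the insert line on the same `df` will fail.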
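PS: Why take only 5 samples?
Answer: when the dataset is large, printing just a few sample rows is a quick way to check the result and saves time.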
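One thing to keep in mind: `sample` picks rows at random, so the output changes on every run. If you want the same rows each time, passing `random_state` is one option. A rough sketch on a made-up 100-row frame (`demo` and the seed `42` are arbitrary choices of mine):

```python
import pandas as pd

# Made-up frame standing in for a large dataset.
demo = pd.DataFrame({"n": range(100)})

a = demo.sample(5, random_state=42)  # fixed seed: same 5 rows every run
b = demo.sample(5, random_state=42)
print(a.equals(b))  # True

print(demo.head(5))  # alternative: just look at the first 5 rows
```

`head` is deterministic too, but it only ever shows the top of the table, while `sample` gives a look at rows from all over.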
Now for the main event: build a pivot table in Python that totals the salaries of each country by month.
# aggfunc="sum" avoids the FutureWarning that newer pandas versions raise for np.sum
df1 = pd.pivot_table(df, index="Country", columns="月份",
                     values="Salary", aggfunc="sum")
print(df1)
Result:
月份              7         8        9
Country
France   197000.0       NaN  79000.0
Germany   54000.0       NaN  83000.0
Spain         NaN  100000.0  61000.0
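To see what the pivot table is doing under the hood, here is a sketch that gets the same totals with `groupby` plus `unstack`, and also shows `fill_value=0` for the empty cells. The frame is rebuilt inline from the country, month, and salary values in this post so the snippet runs on its own; the names `demo`, `pivot`, `grouped`, and `filled` are just for illustration.

```python
import numpy as np
import pandas as pd

# Country / month / salary triples taken from the data in this post.
demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "月份": [7, 8, 7, 9, 9, 7, 8, 9, 9, 7],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
})

# Same pivot as in the post.
pivot = pd.pivot_table(demo, index="Country", columns="月份",
                       values="Salary", aggfunc="sum")

# Equivalent route: group by both keys, sum, then move months into columns.
grouped = demo.groupby(["Country", "月份"])["Salary"].sum().unstack()

# Replace the empty country/month combinations with 0 instead of NaN.
filled = pd.pivot_table(demo, index="Country", columns="月份",
                        values="Salary", aggfunc="sum", fill_value=0)

print(pivot)
print(grouped)
print(filled)
```

The NaN cells in the result simply mean there were no rows for that country in that month (for example, France has no August rows), which is why `fill_value=0` can make the table easier to read.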