0. 前言
前面我们经历了十分钟学废pandas,相信大家一定都学废了(狗头保命),下面我们开始学习pandas基础知识。
1. Series
import pandas as pd
import numpy as np
s = pd.Series([1, 3, 6, np.nan, 44, 1])
print(s)
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
2.DataFrame
2.1 DataFrame的简单运用
dates = pd.date_range('2018-08-19', periods=6)
'''
numpy.random.randn(d0, d1,.... ,dn)是从标准正太分布中返回一个或多个样本值。
numpy.random.rand(d0, d1,...,dn)的随机样本位于[0,1]中
(6, 4)表示六行四列数据
'''
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['a', 'b', 'c','d'])
print(df)
a b c d
2018-08-19 -0.193563 0.774822 0.791951 -0.001489
2018-08-20 1.383536 0.013180 -1.013866 0.277929
2018-08-21 0.194067 -0.112442 0.537806 0.775922
2018-08-22 -1.257753 -1.241477 1.099022 0.487283
2018-08-23 -0.383184 -0.299835 -1.212893 0.884345
2018-08-24 0.691404 -1.207610 -0.168567 0.642692
print(df['b'])
2018-08-19 0.774822
2018-08-20 0.013180
2018-08-21 -0.112442
2018-08-22 -1.241477
2018-08-23 -0.299835
2018-08-24 -1.207610
Freq: D, Name: b, dtype: float64
df1 = pd.DataFrame(np.arange(12).reshape(3,4))
print(df1)
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
df2 = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': pd.Timestamp('20180819'),
'C': pd.Series([1, 6, 9, 10], dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(['test', 'train', 'test', 'train']),
'F': 'foo'
})
print(df2)
A B C D E F
0 1 2018-08-19 1.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
3 4 2018-08-19 10.0 3 train foo
print(df2.index)
RangeIndex(start=0, stop=4, step=1)
print(df2.columns)
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
print(df2.values)
[[1 Timestamp('2018-08-19 00:00:00') 1.0 3 'test' 'foo']
[2 Timestamp('2018-08-19 00:00:00') 6.0 3 'train' 'foo']
[3 Timestamp('2018-08-19 00:00:00') 9.0 3 'test' 'foo']
[4 Timestamp('2018-08-19 00:00:00') 10.0 3 'train' 'foo']]
'''
对于四分位数的确定,有不同的方法,另外一种方法基于N-1 基础。即
Q1的位置=1+(n-1)x 0.25
Q2的位置=1+(n-1)x 0.5
Q3的位置=1+(n-1)x 0.75
'''
print(df2.describe())
A C D
count 4.000000 4.000000 4.0
mean 2.500000 6.500000 3.0
std 1.290994 4.041452 0.0
min 1.000000 1.000000 3.0
25% 1.750000 4.750000 3.0
50% 2.500000 7.500000 3.0
75% 3.250000 9.250000 3.0
max 4.000000 10.000000 3.0
print(df2.T)
0 1 2 \
A 1 2 3
B 2018-08-19 00:00:00 2018-08-19 00:00:00 2018-08-19 00:00:00
C 1 6 9
D 3 3 3
E test train test
F foo foo foo
3
A 4
B 2018-08-19 00:00:00
C 10
D 3
E train
F foo
'''
axis=1表示行
axis=0表示列
默认ascenging(升序)为True
ascending=True表示升序,ascending=False表示降序
下面两行分别表示按行升序和按行降序
'''
print(df2.sort_index(axis=1, ascending=True))
A B C D E F
0 1 2018-08-19 1.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
3 4 2018-08-19 10.0 3 train foo
print(df2.sort_index(axis=1, ascending=False))
F E D C B A
0 foo test 3 1.0 2018-08-19 1
1 foo train 3 6.0 2018-08-19 2
2 foo test 3 9.0 2018-08-19 3
3 foo train 3 10.0 2018-08-19 4
print(df2.sort_index(axis=0, ascending=False))
A B C D E F
3 4 2018-08-19 10.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
0 1 2018-08-19 1.0 3 test foo
print(df2.sort_index(axis=0, ascending=True))
A B C D E F
0 1 2018-08-19 1.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
3 4 2018-08-19 10.0 3 train foo
print(df2.sort_values(by='C', ascending=False))
A B C D E F
3 4 2018-08-19 10.0 3 train foo
2 3 2018-08-19 9.0 3 test foo
1 2 2018-08-19 6.0 3 train foo
0 1 2018-08-19 1.0 3 test foo
3. 筛选
3.1 实战筛选
import pandas as pd
import numpy as np
dates = pd.date_range('20180819', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
2018-08-19 0 1 2 3
2018-08-20 4 5 6 7
2018-08-21 8 9 10 11
2018-08-22 12 13 14 15
2018-08-23 16 17 18 19
2018-08-24 20 21 22 23
print(df['A'])
2018-08-19 0
2018-08-20 4
2018-08-21 8
2018-08-22 12
2018-08-23 16
2018-08-24 20
Freq: D, Name: A, dtype: int32
print(df.A)
2018-08-19 0
2018-08-20 4
2018-08-21 8
2018-08-22 12
2018-08-23 16
2018-08-24 20
Freq: D, Name: A, dtype: int32
print(df[0:3])
A B C D
2018-08-19 0 1 2 3
2018-08-20 4 5 6 7
2018-08-21 8 9 10 11
print(df['2018-08-19': '2018-08-21'])
A B C D
2018-08-19 0 1 2 3
2018-08-20 4 5 6 7
2018-08-21 8 9 10 11
print(df.loc['20180819'])
A 0
B 1
C 2
D 3
Name: 2018-08-19 00:00:00, dtype: int32
print(df.loc[:, 'A':'B'])
A B
2018-08-19 0 1
2018-08-20 4 5
2018-08-21 8 9
2018-08-22 12 13
2018-08-23 16 17
2018-08-24 20 21
print(df.loc[:, ['A', 'B']])
A B
2018-08-19 0 1
2018-08-20 4 5
2018-08-21 8 9
2018-08-22 12 13
2018-08-23 16 17
2018-08-24 20 21
print(df.loc['20180819', ['A', 'B']])
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
print(df.iloc[3, 1])
13
print(df.iloc[3:5, 1:3])
B C
2018-08-22 13 14
2018-08-23 17 18
print(df.iloc[[1, 3, 5], 1:3])
B C
2018-08-20 5 6
2018-08-22 13 14
2018-08-24 21 22
print(df.ix[:3, ['A', 'C']])
A C
2018-08-19 0 2
2018-08-20 4 6
2018-08-21 8 10
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
print(df.iloc[:3, [0, 2]])
A C
2018-08-19 0 2
2018-08-20 4 6
2018-08-21 8 10
print(df[df.A>8])
A B C D
2018-08-22 12 13 14 15
2018-08-23 16 17 18 19
2018-08-24 20 21 22 23
print(df.loc[df.A>8])
A B C D
2018-08-22 12 13 14 15
2018-08-23 16 17 18 19
2018-08-24 20 21 22 23
3.2 筛选总结
1. iloc与ix区别
总结:
相同点:iloc可以取相应的值,操作方便,与ix操作类似。
不同点:ix可以混合选择,可以填入column对应的字符选择,而iloc只能采用index索引,对于列数较多的情况下,ix要方便操作许多。
2. loc与iloc区别
总结:
相同点:都可以索引出块数据
不同点:iloc可以检索对应值,两者操作不同。
3. ix与loc、iloc三者的区别
总结:ix是混合loc与iloc操作
如下:对比三者操作,输出结果相同
print(df.loc['20180819', 'A':'B'])
print(df.iloc[0, 0:2])
print(df.ix[0, 'A':'B'])
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
A 0
B 1
Name: 2018-08-19 00:00:00, dtype: int32
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:3: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
This is separate from the ipykernel package so we can avoid doing imports until
4.Pandas设置值
4.1 创建数据
import pandas as pd
import numpy as np
dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
2018-08-20 0 1 2 3
2018-08-21 4 5 6 7
2018-08-22 8 9 10 11
2018-08-23 12 13 14 15
2018-08-24 16 17 18 19
2018-08-25 20 21 22 23
4.2 根据位置设置loc与iloc
df.iloc[2, 2] = 111
df.loc['20180820', 'B'] = 2222
print(df)
A B C D
2018-08-20 0 2222 2 3
2018-08-21 4 5 6 7
2018-08-22 8 9 111 11
2018-08-23 12 13 14 15
2018-08-24 16 17 18 19
2018-08-25 20 21 22 23
4.3 根据条件设置
df.B[df.A>4] = 0
print(df)
A B C D
2018-08-20 0 2222 2 3
2018-08-21 4 5 6 7
2018-08-22 8 0 111 11
2018-08-23 12 0 14 15
2018-08-24 16 0 18 19
2018-08-25 20 0 22 23
df.B.loc[df.A>4] = 0
print(df)
A B C D
2018-08-20 0 2222 2 3
2018-08-21 4 5 6 7
2018-08-22 8 0 111 11
2018-08-23 12 0 14 15
2018-08-24 16 0 18 19
2018-08-25 20 0 22 23
4.4 按行或列设置
df['F'] = np.nan
print(df)
A B C D F
2018-08-20 0 2222 2 3 NaN
2018-08-21 4 5 6 7 NaN
2018-08-22 8 0 111 11 NaN
2018-08-23 12 0 14 15 NaN
2018-08-24 16 0 18 19 NaN
2018-08-25 20 0 22 23 NaN
4.5 添加Series序列(长度必须对齐)
df['E'] = pd.Series([1, 2, 3, 4, 5, 6], index = pd.date_range('20180820', periods=6))
print(df)
A B C D F E
2018-08-20 0 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
4.6 设定某行某列为特定值
df.ix['20180820', 'A'] = 56
print(df)
A B C D F E
2018-08-20 56 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
df.loc['20180820', 'A'] = 67
print(df)
A B C D F E
2018-08-20 67 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
df.iloc[0, 0] = 76
print(df)
A B C D F E
2018-08-20 76 2222 2 3 NaN 1
2018-08-21 4 5 6 7 NaN 2
2018-08-22 8 0 111 11 NaN 3
2018-08-23 12 0 14 15 NaN 4
2018-08-24 16 0 18 19 NaN 5
2018-08-25 20 0 22 23 NaN 6
4.7 修改一整行数据
df.iloc[1] = np.nan
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 8.0 0.0 111.0 11.0 NaN 3.0
2018-08-23 12.0 0.0 14.0 15.0 NaN 4.0
2018-08-24 16.0 0.0 18.0 19.0 NaN 5.0
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
df.loc['20180823'] = np.nan
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 8.0 0.0 111.0 11.0 NaN 3.0
2018-08-23 NaN NaN NaN NaN NaN NaN
2018-08-24 16.0 0.0 18.0 19.0 NaN 5.0
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
df.ix[2] = np.nan
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 NaN NaN NaN NaN NaN NaN
2018-08-23 NaN NaN NaN NaN NaN NaN
2018-08-24 16.0 0.0 18.0 19.0 NaN 5.0
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
"""Entry point for launching an IPython kernel.
df.ix['20180824'] = np.nan
print(df)
A B C D F E
2018-08-20 76.0 2222.0 2.0 3.0 NaN 1.0
2018-08-21 NaN NaN NaN NaN NaN NaN
2018-08-22 NaN NaN NaN NaN NaN NaN
2018-08-23 NaN NaN NaN NaN NaN NaN
2018-08-24 NaN NaN NaN NaN NaN NaN
2018-08-25 20.0 0.0 22.0 23.0 NaN 6.0
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
"""Entry point for launching an IPython kernel.
5. Pandas处理丢失数据
5.1 创建含NaN的矩阵
import pandas as pd
import numpy as np
dates = pd.date_range('20180820', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=['A', 'B', 'C', 'D'])
print(df)
A B C D
2018-08-20 0 1 2 3
2018-08-21 4 5 6 7
2018-08-22 8 9 10 11
2018-08-23 12 13 14 15
2018-08-24 16 17 18 19
2018-08-25 20 21 22 23
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan
print(df)
A B C D
2018-08-20 0 NaN 2.0 3
2018-08-21 4 5.0 NaN 7
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
5.2 删除掉有NaN的行或者列
print(df.dropna())
A B C D
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
print(df.dropna(
axis = 0,
how = 'any'
))
A B C D
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
print(df.dropna(
axis = 1,
how = 'any'
))
A D
2018-08-20 0 3
2018-08-21 4 7
2018-08-22 8 11
2018-08-23 12 15
2018-08-24 16 19
2018-08-25 20 23
5.3 替换NaN值为0或其他
print(df.fillna(value=0))
A B C D
2018-08-20 0 0.0 2.0 3
2018-08-21 4 5.0 0.0 7
2018-08-22 8 9.0 10.0 11
2018-08-23 12 13.0 14.0 15
2018-08-24 16 17.0 18.0 19
2018-08-25 20 21.0 22.0 23
5.4 是否有缺失数据NaN
print(df.isnull())
A B C D
2018-08-20 False True False False
2018-08-21 False False True False
2018-08-22 False False False False
2018-08-23 False False False False
2018-08-24 False False False False
2018-08-25 False False False False
print(df.isna())
A B C D
2018-08-20 False True False False
2018-08-21 False False True False
2018-08-22 False False False False
2018-08-23 False False False False
2018-08-24 False False False False
2018-08-25 False False False False
print(df.isnull().any())
A False
B True
C True
D False
dtype: bool
print(np.any(df.isnull()==True))
True
6. Pandas导入导出
6.1 导入数据
import pandas as pd
data = pd.read_csv('student.csv')
print(data)
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 N 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
print(data.head(3))
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
print(data.tail(3))
Student ID name age gender
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
6.2 导出数据
data.to_pickle('student.pickle')
print(pd.read_pickle('student.pickle'))
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 N 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
7. Pandas合并操作
7.1 Pandas合并concat
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
print(df1)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
print(df2)
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
print(df3)
a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
res = pd.concat([df1, df2, df3], axis=0)
print(res)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print(res)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
print(df1)
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
print(df2)
b c d e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
'''
join='outer',函数默认为join='outer'。此方法是依照column来做纵向合并,有相同的column上下合并到一起,
其他独自的column各自成列,原来没有值的位置皆为NaN填充
'''
res = pd.concat([df1, df2], axis=0, join='outer')
print(res)
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:6: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
res = pd.concat([df1, df2], axis=0, join='outer', ignore_index=True)
print(res)
a b c d e
0 0.0 0.0 0.0 0.0 NaN
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 NaN 1.0 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0 1.0
5 NaN 1.0 1.0 1.0 1.0
D:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
res = pd.concat([df1, df2], axis=0, join='inner')
print(res)
b c d
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
2 1.0 1.0 1.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
print(df1)
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
print(df2)
b c d e
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
res = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
print(res)
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
res = pd.concat([df1, df2], axis=1)
print(res)
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
res = df1.append(df2, ignore_index=True)
print(res)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
res = df1.append([df2, df3], ignore_index=True)
print(res)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
res = df1.append(s1, ignore_index=True)
print(res)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 2.0 3.0 4.0
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
res1 = df1.append([df2, df3], ignore_index=True)
print(res)
print(res1)
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
7.2 Pandas合并merge
7.2.1 定义资料集并打印出来
import pandas as pd
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3
print(right)
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3
7.2.2 依据key column合并,并打印
res = pd.merge(left, right, on='key')
print(res)
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
key1 key2 A B
0 K0 K0 A0 B0
1 K0 K1 A1 B1
2 K1 K0 A2 B2
3 K2 K1 A3 B3
print(right)
key1 key2 C D
0 K0 K0 C0 D0
1 K1 K0 C1 D1
2 K1 K0 C2 D2
3 K2 K0 C3 D3
7.2.3 两列合并
res = pd.merge(left, right, on=['key1', 'key2'], how='inner')
print(res)
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
res = pd.merge(left, right, on=['key1', 'key2'], how='outer')
print(res)
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
5 K2 K0 NaN NaN C3 D3
res = pd.merge(left, right, on=['key1', 'key2'], how='left')
print(res)
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K0 K1 A1 B1 NaN NaN
2 K1 K0 A2 B2 C1 D1
3 K1 K0 A2 B2 C2 D2
4 K2 K1 A3 B3 NaN NaN
res = pd.merge(left, right, on=['key1', 'key2'], how='right')
print(res)
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3
7.2.4 Indicator设置合并列名称
df1 = pd.DataFrame({'coll': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'coll': [1, 2, 2], 'col_right': [2, 2, 2]})
print(df1)
coll col_left
0 0 a
1 1 b
print(df2)
coll col_right
0 1 2
1 2 2
2 2 2
res = pd.merge(df1, df2, on='coll', how='outer', indicator=True)
print(res)
coll col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
res = pd.merge(df1, df2, on='coll', how='outer', indicator='indicator_column')
print(res)
coll col_left col_right indicator_column
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 2 NaN 2.0 right_only
7.2.5 依据index合并
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index = ['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index = ['K0', 'K2', 'K3'])
print(left)
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
print(right)
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
res = pd.merge(left, right, left_index=True, right_index=True, how='outer')
print(res)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
res = pd.merge(left, right, left_index=True, right_index=True, how='inner')
print(res)
A B C D
K0 A0 B0 C0 D0
K2 A2 B2 C2 D2
7.2.6 解决overlapping的问题
boys = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'age': [1, 2, 3]})
girls = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'age': [4, 5, 6]})
print(boys)
k age
0 K0 1
1 K1 2
2 K2 3
print(girls)
k age
0 K0 4
1 K0 5
2 K3 6
res = pd.merge(boys, girls, on='k', suffixes=['_boy', '_girl'], how='inner')
print(res)
k age_boy age_girl
0 K0 1 4
1 K0 1 5
8. Pandas plot出图
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.Series(np.random.randn(1000), index=np.arange(1000))
print(data)
0 -0.900869
1 -2.240660
2 -1.341060
3 1.167607
4 0.537295
5 2.060912
6 -0.652946
7 -0.197946
8 -1.487262
9 -0.142167
10 2.138304
11 0.211346
12 -0.292327
13 1.120665
14 0.075635
15 0.512769
16 -0.545445
17 1.144968
18 0.739835
19 1.120818
20 -0.280899
21 0.370114
22 -0.261241
23 0.177805
24 -0.999936
25 0.242468
26 0.303800
27 3.315516
28 -2.445023
29 -0.962314
...
970 -1.474397
971 -0.155966
972 0.194949
973 0.171376
974 -0.630451
975 1.410388
976 1.096179
977 0.484153
978 0.731605
979 0.554297
980 0.867232
981 0.644251
982 0.572517
983 -0.073557
984 0.514066
985 2.157163
986 0.610739
987 -1.196603
988 -1.367025
989 -0.839110
990 -0.446147
991 -1.591431
992 0.416488
993 0.351122
994 -1.084580
995 0.020307
996 -2.021584
997 -0.107877
998 0.307315
999 -0.429553
Length: 1000, dtype: float64
print(data.cumsum())
0 -0.900869
1 -3.141529
2 -4.482588
3 -3.314981
4 -2.777686
5 -0.716774
6 -1.369720
7 -1.567666
8 -3.054928
9 -3.197095
10 -1.058791
11 -0.847445
12 -1.139772
13 -0.019107
14 0.056529
15 0.569298
16 0.023853
17 1.168821
18 1.908656
19 3.029474
20 2.748575
21 3.118689
22 2.857448
23 3.035252
24 2.035316
25 2.277784
26 2.581584
27 5.897101
28 3.452077
29 2.489764
...
970 -2.727075
971 -2.883041
972 -2.688092
973 -2.516716
974 -3.147167
975 -1.736779
976 -0.640599
977 -0.156446
978 0.575159
979 1.129456
980 1.996689
981 2.640940
982 3.213457
983 3.139900
984 3.653966
985 5.811129
986 6.421868
987 5.225265
988 3.858240
989 3.019130
990 2.572983
991 0.981551
992 1.398040
993 1.749162
994 0.664583
995 0.684890
996 -1.336694
997 -1.444571
998 -1.137256
999 -1.566809
Length: 1000, dtype: float64
data.plot()
plt.show()
输出结果如下图:
data = pd.DataFrame(
np.random.randn(1000, 4),
index = np.arange(1000),
columns = list("ABCD")
)
data.cumsum()
data.plot()
plt.show()
输出结果如下图:
ax = data.plot.scatter(x='A', y='B', color='DarkBlue', label='Class1')
data.plot.scatter(x='A', y='C', color='LightGreen', label='Class2', ax=ax)
plt.show()
输出结果如下图:
9. 结语
pandas基础更新完毕,下个博客一起做题练习一下。各位珍惜暑假时间,多学习呀!!!这个时间,是真实的,你把握得住(狗头保命)。
|