
[Python Knowledge Base] Basic pandas operations commonly used in project work

import numpy as np
import pandas  as pd

Object creation

  • Create a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
  • Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
dates = pd.date_range("2022-06-07",periods=6)
dates
DatetimeIndex(['2022-06-07', '2022-06-08', '2022-06-09', '2022-06-10',
               '2022-06-11', '2022-06-12'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list("ABCD"))
df
                   A         B         C         D
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
  • Create a DataFrame by passing a dictionary of objects that can be converted into a Series-like structure:
df2 = pd.DataFrame(
{
    "A":1.0,
    "B":pd.Timestamp("20130102"),
    "C":pd.Series(1,index=list(range(4)),dtype="float32"),
    "D":np.array([3]*4,dtype="int32"),
    "E":pd.Categorical(["test","train","test","train"]),
    "F":"foo",
})
df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
df3 = pd.DataFrame(
{
    "A":1.0,
    "B":pd.Timestamp("20130102"),
    "C":pd.Series(1,index=list(range(5)),dtype="float32"),
    "D":np.array([3]*5,dtype="int32"),
    "E":pd.Categorical(["test","train","test","train","a"]),
    "F":"foo",
})
df3
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
4  1.0 2013-01-02  1.0  3      a  foo
# The columns of the resulting DataFrame have different dtypes
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
  • If you are using IPython, tab completion for column names (as well as public attributes) is enabled automatically.
df2.A
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
df2.abs
<bound method NDFrame.abs of      A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo>
df2.add
<bound method flex_arith_method_FRAME.<locals>.f of      A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo>
df2.all
<bound method NDFrame._add_numeric_operations.<locals>.all of      A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo>

Viewing data

# First five rows (head() defaults to n=5)
df.head()
                   A         B         C         D
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
# First three rows
df.head(3)
                   A         B         C         D
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
# Last five rows (tail() defaults to n=5)
df.tail()
                   A         B         C         D
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
# Last three rows
df.tail(3)
                   A         B         C         D
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
# Show the index
df.index
DatetimeIndex(['2022-06-07', '2022-06-08', '2022-06-09', '2022-06-10',
               '2022-06-11', '2022-06-12'],
              dtype='datetime64[ns]', freq='D')
# Show the columns
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
  • Note: DataFrame.to_numpy() gives a NumPy representation of the underlying data. This can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: a NumPy array has one dtype for the entire array, while a pandas DataFrame has one dtype per column. When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
  • For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and does not require copying data:
df.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object
df.to_numpy()
array([[ 0.40526325,  0.46532668,  0.07694617, -0.3115456 ],
       [ 0.06912909,  0.9769407 , -0.28743027,  1.08426954],
       [-0.20022708,  1.17280586,  1.34307017,  0.56144631],
       [-0.34616439, -1.60996101,  1.18171013,  0.04600243],
       [-1.83349661, -0.26301183,  0.36815984,  0.16598165],
       [-0.61690579,  0.95554251, -0.60358546,  0.89023561]])
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
df2.to_numpy()
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
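The dtype rule described above can be checked directly. The following is a minimal sketch, assuming two made-up frames (`homogeneous` and `mixed` are illustrative names, not part of the original notebook): a frame whose columns share one dtype converts to an array of that dtype, while a frame mixing floats and strings falls back to object.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, for illustration only.
homogeneous = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

arr_float = homogeneous.to_numpy()   # every column is float64, so the array is float64
arr_object = mixed.to_numpy()        # float + string columns fall back to object

print(arr_float.dtype)
print(arr_object.dtype)

# Selecting the numeric columns first avoids the object fallback.
numeric_only = mixed.select_dtypes(include="number").to_numpy()
print(numeric_only.dtype)
```

If only the numeric part of a mixed frame is needed downstream, selecting it before the conversion keeps the result in a compact numeric dtype.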
df.index
DatetimeIndex(['2022-06-07', '2022-06-08', '2022-06-09', '2022-06-10',
               '2022-06-11', '2022-06-12'],
              dtype='datetime64[ns]', freq='D')
df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
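  • DataFrame.to_numpy() does not include the index or column labels in the output.

  • describe() shows a quick statistical summary of your data:

# describe() is typically used for a statistical overview of the data. Its output rows are: count (number of non-null values), mean, std (standard deviation), min, 25% (first quartile), 50% (median), 75% (third quartile) and max.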
df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.420400  0.282940  0.346478  0.406065
std    0.775990  1.062101  0.783376  0.533062
min   -1.833497 -1.609961 -0.603585 -0.311546
25%   -0.549220 -0.080927 -0.196336  0.075997
50%   -0.273196  0.710435  0.222553  0.363714
75%    0.001790  0.971591  0.978323  0.808038
max    0.405263  1.172806  1.343070  1.084270
# Transpose the data
df.T
   2022-06-07  2022-06-08  2022-06-09  2022-06-10  2022-06-11  2022-06-12
A    0.405263    0.069129   -0.200227   -0.346164   -1.833497   -0.616906
B    0.465327    0.976941    1.172806   -1.609961   -0.263012    0.955543
C    0.076946   -0.287430    1.343070    1.181710    0.368160   -0.603585
D   -0.311546    1.084270    0.561446    0.046002    0.165982    0.890236
df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df2.describe()
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
df2.T
                     0                    1                    2                    3
A                  1.0                  1.0                  1.0                  1.0
B  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00  2013-01-02 00:00:00
C                  1.0                  1.0                  1.0                  1.0
D                    3                    3                    3                    3
E                 test                train                 test                train
F                  foo                  foo                  foo                  foo
# sort_index() sorts by index labels rather than by values
# axis=0 sorts by the row index (the dates on the left); axis=1 sorts by the column labels (the header row on top)
# ascending=False means descending order
df.sort_index(axis=1,ascending=False)
                   D         C         B         A
2022-06-07 -0.311546  0.076946  0.465327  0.405263
2022-06-08  1.084270 -0.287430  0.976941  0.069129
2022-06-09  0.561446  1.343070  1.172806 -0.200227
2022-06-10  0.046002  1.181710 -1.609961 -0.346164
2022-06-11  0.165982  0.368160 -0.263012 -1.833497
2022-06-12  0.890236 -0.603585  0.955543 -0.616906
df.sort_index(axis=0,ascending=False)
                   A         B         C         D
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-07  0.405263  0.465327  0.076946 -0.311546
df.sort_index(axis=0,ascending=True)
                   A         B         C         D
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
# Transpose df, then sort the column labels (the dates) in descending order
df.T.sort_index(axis=1,ascending=False)
   2022-06-12  2022-06-11  2022-06-10  2022-06-09  2022-06-08  2022-06-07
A   -0.616906   -1.833497   -0.346164   -0.200227    0.069129    0.405263
B    0.955543   -0.263012   -1.609961    1.172806    0.976941    0.465327
C   -0.603585    0.368160    1.181710    1.343070   -0.287430    0.076946
D    0.890236    0.165982    0.046002    0.561446    1.084270   -0.311546
# Sort by values
df.sort_values(by="B")
                   A         B         C         D
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
# Signature of DataFrame.sort_values, for reference:
# df.sort_values(
#     by,
#     axis: 'Axis' = 0,
#     ascending=True,
#     inplace: 'bool' = False,
#     kind: 'str' = 'quicksort',
#     na_position: 'str' = 'last',
#     ignore_index: 'bool' = False,
#     key: 'ValueKeyFunc' = None,
# )
df.sort_values(axis=1,by="2022-06-10")
                   B         A         D         C
2022-06-07  0.465327  0.405263 -0.311546  0.076946
2022-06-08  0.976941  0.069129  1.084270 -0.287430
2022-06-09  1.172806 -0.200227  0.561446  1.343070
2022-06-10 -1.609961 -0.346164  0.046002  1.181710
2022-06-11 -0.263012 -1.833497  0.165982  0.368160
2022-06-12  0.955543 -0.616906  0.890236 -0.603585
df.sort_values(axis=0,by="B")
                   A         B         C         D
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
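The difference between sorting by labels and sorting by values is easier to see on a tiny frame. This is a minimal sketch, assuming a made-up frame `scores` (a hypothetical name and hypothetical data); it also shows that sort_values accepts several columns, each with its own direction.

```python
import pandas as pd

# Hypothetical small frame with a deliberately shuffled index.
scores = pd.DataFrame(
    {"name": ["c", "a", "b", "d"], "group": [2, 1, 2, 1], "score": [70, 90, 60, 85]},
    index=[3, 1, 2, 0],
)

by_index = scores.sort_index()          # order rows by their index labels
by_cols = scores.sort_index(axis=1)     # order columns by their labels
# Sort by group ascending, then by score descending within each group.
by_values = scores.sort_values(by=["group", "score"], ascending=[True, False])

print(by_values)
```

All three calls return new objects and leave `scores` unchanged, which matches the behaviour seen in the outputs above.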

Selection

  • While standard Python/NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code we recommend the optimized pandas data access methods .at, .iat, .loc and .iloc.

Getting

  • Selecting a single column, which yields a Series, equivalent to df.A:
df["A"]
2022-06-07    0.405263
2022-06-08    0.069129
2022-06-09   -0.200227
2022-06-10   -0.346164
2022-06-11   -1.833497
2022-06-12   -0.616906
Freq: D, Name: A, dtype: float64
df["D"]
2022-06-07   -0.311546
2022-06-08    1.084270
2022-06-09    0.561446
2022-06-10    0.046002
2022-06-11    0.165982
2022-06-12    0.890236
Freq: D, Name: D, dtype: float64
  • Selecting via [], which slices the rows:
df[0:3]
                   A         B         C         D
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
df["20220608":"20220611"]
                   A         B        C         D
2022-06-08  0.069129  0.976941 -0.28743  1.084270
2022-06-09 -0.200227  1.172806  1.34307  0.561446
2022-06-10 -0.346164 -1.609961  1.18171  0.046002
2022-06-11 -1.833497 -0.263012  0.36816  0.165982

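Selection by label

  • See the "Selection by label" section of the pandas documentation for more information.

  • For getting a cross section using a label: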

df.loc[dates[0]]
A    0.405263
B    0.465327
C    0.076946
D   -0.311546
Name: 2022-06-07 00:00:00, dtype: float64
  • Note that .loc is an indexer and is used with square brackets; calling df.loc() only returns the indexer object itself. Other rows are selected by label in the same way as above:
df.loc()
<pandas.core.indexing._LocIndexer at 0x162f252a130>
df.loc[dates[1]]
A    0.069129
B    0.976941
C   -0.287430
D    1.084270
Name: 2022-06-08 00:00:00, dtype: float64
df.loc[dates[2]]
A   -0.200227
B    1.172806
C    1.343070
D    0.561446
Name: 2022-06-09 00:00:00, dtype: float64
  • Selecting on a multi-axis by label:
df.loc[:,]
                   A         B         C         D
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
df.loc[:,["A","B"]]
                   A         B
2022-06-07  0.405263  0.465327
2022-06-08  0.069129  0.976941
2022-06-09 -0.200227  1.172806
2022-06-10 -0.346164 -1.609961
2022-06-11 -1.833497 -0.263012
2022-06-12 -0.616906  0.955543
  • Selecting with lists of labels returns only the listed rows and columns (a label slice such as "20220607":"20220609" would instead include both endpoints and everything in between):
df.loc[["20220607","20220609"],["C","D"]]
                   C         D
2022-06-07  0.076946 -0.311546
2022-06-09  1.343070  0.561446
df.loc[["20220607","20220611"],["A","D"]]
                   A         D
2022-06-07  0.405263 -0.311546
2022-06-11 -1.833497  0.165982
df.loc[["20220611"],["A","D"]]
                   A         D
2022-06-11 -1.833497  0.165982
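  • Passing a single row label instead of a list reduces the dimensions of the returned object (a Series instead of a DataFrame):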
df.loc["20220611",["A","D"]]
A   -1.833497
D    0.165982
Name: 2022-06-11 00:00:00, dtype: float64
  • For getting a scalar value:
df.loc[dates[1],"A"]
0.06912908863219207
df.loc[dates[1],"C"]
-0.28743026681864575
  • For getting fast access to a scalar (equivalent to the prior method):
df.at[dates[0],"A"]
0.40526325343260083
df.at[dates[1],"C"]
-0.28743026681864575
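One detail of label-based selection that is easy to miss is that label slices include both endpoints, unlike positional slices. The sketch below uses a deterministic stand-in frame (`demo` is an assumed name, filled with np.arange instead of the random values above) to compare the two and to confirm that .loc and .at return the same scalar.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-06-07", periods=6)
# Deterministic stand-in for the random df used in this article.
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

label_slice = demo.loc["2022-06-08":"2022-06-10", ["A", "B"]]  # both endpoints included
pos_slice = demo.iloc[1:3, 0:2]                                  # end position excluded

scalar_loc = demo.loc[dates[1], "C"]
scalar_at = demo.at[dates[1], "C"]   # same value through the scalar-only accessor

print(label_slice.shape, pos_slice.shape, scalar_loc, scalar_at)
```

.at only accepts a single row label and a single column label, which is why it can skip the more general machinery of .loc.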

Selection by position

  • Select via the position of the passed integers:
df.iloc[3]
A   -0.346164
B   -1.609961
C    1.181710
D    0.046002
Name: 2022-06-10 00:00:00, dtype: float64
df.iloc[4]
A   -1.833497
B   -0.263012
C    0.368160
D    0.165982
Name: 2022-06-11 00:00:00, dtype: float64
  • By integer slices, acting similar to NumPy/Python:
df.iloc[3:5]
                   A         B        C         D
2022-06-10 -0.346164 -1.609961  1.18171  0.046002
2022-06-11 -1.833497 -0.263012  0.36816  0.165982
df.iloc[3:5,0:2]
                   A         B
2022-06-10 -0.346164 -1.609961
2022-06-11 -1.833497 -0.263012
  • By lists of integer position locations, similar to the NumPy/Python style:
df.iloc[[1,2,4],[0,2]]
                   A        C
2022-06-08  0.069129 -0.28743
2022-06-09 -0.200227  1.34307
2022-06-11 -1.833497  0.36816
  • For slicing rows explicitly:
df.iloc[1:3,:]
                   A         B        C         D
2022-06-08  0.069129  0.976941 -0.28743  1.084270
2022-06-09 -0.200227  1.172806  1.34307  0.561446
df.iloc[1:3,2:3]
                  C
2022-06-08 -0.28743
2022-06-09  1.34307
  • For slicing columns explicitly:
df.iloc[:,1:3]
                   B         C
2022-06-07  0.465327  0.076946
2022-06-08  0.976941 -0.287430
2022-06-09  1.172806  1.343070
2022-06-10 -1.609961  1.181710
2022-06-11 -0.263012  0.368160
2022-06-12  0.955543 -0.603585
df.iloc[1:4,1:3]
                   B        C
2022-06-08  0.976941 -0.28743
2022-06-09  1.172806  1.34307
2022-06-10 -1.609961  1.18171
  • For getting a value explicitly:
df.iloc[1,1]
0.9769407016879463
df.iloc[3,3]
0.04600243177073029
  • For getting fast access to a scalar (equivalent to the prior method):
df.iat[1,1]
0.9769407016879463
df.iat[3,3]
0.04600243177073029

Boolean indexing

  • Using a single column's values to select data:
df["A"]
2022-06-07    0.405263
2022-06-08    0.069129
2022-06-09   -0.200227
2022-06-10   -0.346164
2022-06-11   -1.833497
2022-06-12   -0.616906
Freq: D, Name: A, dtype: float64
df["A"]>0
2022-06-07     True
2022-06-08     True
2022-06-09    False
2022-06-10    False
2022-06-11    False
2022-06-12    False
Freq: D, Name: A, dtype: bool
df[df["A"]>0]
                   A         B         C         D
2022-06-07  0.405263  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
df[df["B"] < 0]
                   A         B        C         D
2022-06-10 -0.346164 -1.609961  1.18171  0.046002
2022-06-11 -1.833497 -0.263012  0.36816  0.165982
  • Using the isin() method for filtering:
df3 = df.copy()
df3["E"] = ["zero","one","two","three","four","five"]
df3
                   A         B         C         D      E
2022-06-07  0.405263  0.465327  0.076946 -0.311546   zero
2022-06-08  0.069129  0.976941 -0.287430  1.084270    one
2022-06-09 -0.200227  1.172806  1.343070  0.561446    two
2022-06-10 -0.346164 -1.609961  1.181710  0.046002  three
2022-06-11 -1.833497 -0.263012  0.368160  0.165982   four
2022-06-12 -0.616906  0.955543 -0.603585  0.890236   five
df3["E"]
2022-06-07     zero
2022-06-08      one
2022-06-09      two
2022-06-10    three
2022-06-11     four
2022-06-12     five
Freq: D, Name: E, dtype: object
df3["E"].isin(["two"])
2022-06-07    False
2022-06-08    False
2022-06-09     True
2022-06-10    False
2022-06-11    False
2022-06-12    False
Freq: D, Name: E, dtype: bool
df3["E"].isin(["two","four"])
2022-06-07    False
2022-06-08    False
2022-06-09     True
2022-06-10    False
2022-06-11     True
2022-06-12    False
Freq: D, Name: E, dtype: bool
df3[df3["E"].isin(["two","four"])]
                   A         B        C         D     E
2022-06-09 -0.200227  1.172806  1.34307  0.561446   two
2022-06-11 -1.833497 -0.263012  0.36816  0.165982  four
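Boolean masks can also be combined. The following is a minimal sketch on a made-up frame (`demo` and its values are assumptions for illustration): conditions are joined with & (and), | (or) and ~ (not), and each comparison needs its own parentheses because these operators bind more tightly than the comparisons.

```python
import pandas as pd

# Hypothetical data, loosely modelled on the frame used above.
demo = pd.DataFrame(
    {
        "A": [0.4, 0.1, -0.2, -0.3],
        "B": [0.5, 1.0, 1.2, -1.6],
        "E": ["zero", "one", "two", "three"],
    }
)

both = demo[(demo["A"] > 0) & (demo["B"] > 0.6)]      # rows meeting both conditions
not_in = demo[~demo["E"].isin(["two", "three"])]      # rows whose E is NOT in the list
same = demo.query("A > 0 and B > 0.6")                # the same filter written as a string

print(both)
print(not_in)
```

query() is convenient for longer conditions, while the explicit mask form makes it easy to reuse an intermediate mask.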

Setting

  • Setting a new column automatically aligns the data by the indexes:
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range("20220607",periods=6))
s1
2022-06-07    1
2022-06-08    2
2022-06-09    3
2022-06-10    4
2022-06-11    5
2022-06-12    6
Freq: D, dtype: int64
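The alignment itself only becomes visible once a Series is actually assigned as a column. Here is a minimal sketch, assuming a stand-in frame `demo` and a partial, shuffled Series `s_partial` (both hypothetical names): values land on the rows with matching labels, and rows without a match receive NaN.

```python
import pandas as pd

dates = pd.date_range("20220607", periods=6)
demo = pd.DataFrame({"A": range(6)}, index=dates)

# s_partial covers only three of the six dates and is deliberately out of order.
s_partial = pd.Series([30, 10, 20], index=[dates[2], dates[0], dates[1]])

demo["F"] = s_partial   # matched by index label, not by position
print(demo)
```

Because matching is done on labels, the order of the Series does not matter, and the column becomes float so that it can hold the NaN entries.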
  • Setting values by label:
df.at[dates[0],"A"] = 0
df
                   A         B         C         D
2022-06-07  0.000000  0.465327  0.076946 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
  • Setting values by position (the first-row values of B and C shown below had already been set to 0 in steps that are not shown):
df
                   A         B         C         D
2022-06-07  0.000000  0.000000  0.000000 -0.311546
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
df.iat[0,3]=0
df
                   A         B         C         D
2022-06-07  0.000000  0.000000  0.000000  0.000000
2022-06-08  0.069129  0.976941 -0.287430  1.084270
2022-06-09 -0.200227  1.172806  1.343070  0.561446
2022-06-10 -0.346164 -1.609961  1.181710  0.046002
2022-06-11 -1.833497 -0.263012  0.368160  0.165982
2022-06-12 -0.616906  0.955543 -0.603585  0.890236
  • Setting by assigning with a NumPy array:
df.loc[:,"D"]
2022-06-07    0.000000
2022-06-08    1.084270
2022-06-09    0.561446
2022-06-10    0.046002
2022-06-11    0.165982
2022-06-12    0.890236
Freq: D, Name: D, dtype: float64
df.loc[:,"D"] = np.array([5] * len(df))
df
                   A         B         C  D
2022-06-07  0.000000  0.000000  0.000000  5
2022-06-08  0.069129  0.976941 -0.287430  5
2022-06-09 -0.200227  1.172806  1.343070  5
2022-06-10 -0.346164 -1.609961  1.181710  5
2022-06-11 -1.833497 -0.263012  0.368160  5
2022-06-12 -0.616906  0.955543 -0.603585  5
df.loc[:,"F"] = np.array([i for i in range(6)])
df
                   A         B         C  D  F
2022-06-07  0.000000  0.000000  0.000000  5  0
2022-06-08  0.069129  0.976941 -0.287430  5  1
2022-06-09 -0.200227  1.172806  1.343070  5  2
2022-06-10 -0.346164 -1.609961  1.181710  5  3
2022-06-11 -1.833497 -0.263012  0.368160  5  4
2022-06-12 -0.616906  0.955543 -0.603585  5  5
df.loc[:,"E"] = np.array([5] * len(df))
df
                   A         B         C  D  F  E
2022-06-07  0.000000  0.000000  0.000000  5  0  5
2022-06-08  0.069129  0.976941 -0.287430  5  1  5
2022-06-09 -0.200227  1.172806  1.343070  5  2  5
2022-06-10 -0.346164 -1.609961  1.181710  5  3  5
2022-06-11 -1.833497 -0.263012  0.368160  5  4  5
2022-06-12 -0.616906  0.955543 -0.603585  5  5  5
  • A where operation with setting:
df4 = df.copy()
df4 > 0
                A      B      C     D      F     E
2022-06-07  False  False  False  True  False  True
2022-06-08   True   True  False  True   True  True
2022-06-09  False   True   True  True   True  True
2022-06-10  False  False   True  True   True  True
2022-06-11  False  False   True  True   True  True
2022-06-12  False   True  False  True   True  True
df4[df4>0]
                   A         B        C  D    F  E
2022-06-07       NaN       NaN      NaN  5  NaN  5
2022-06-08  0.069129  0.976941      NaN  5  1.0  5
2022-06-09       NaN  1.172806  1.34307  5  2.0  5
2022-06-10       NaN       NaN  1.18171  5  3.0  5
2022-06-11       NaN       NaN  0.36816  5  4.0  5
2022-06-12       NaN  0.955543      NaN  5  5.0  5
-df4
                   A         B         C  D  F  E
2022-06-07 -0.000000 -0.000000 -0.000000 -5  0 -5
2022-06-08 -0.069129 -0.976941  0.287430 -5 -1 -5
2022-06-09  0.200227 -1.172806 -1.343070 -5 -2 -5
2022-06-10  0.346164  1.609961 -1.181710 -5 -3 -5
2022-06-11  1.833497  0.263012 -0.368160 -5 -4 -5
2022-06-12  0.616906 -0.955543  0.603585 -5 -5 -5
# Values that satisfy the condition (> 0) are replaced by their negation; all other values are left unchanged
df4[df4>0] = -df4
df4
                   A         B         C  D  F  E
2022-06-07  0.000000  0.000000  0.000000 -5  0 -5
2022-06-08 -0.069129 -0.976941 -0.287430 -5 -1 -5
2022-06-09 -0.200227 -1.172806 -1.343070 -5 -2 -5
2022-06-10 -0.346164 -1.609961 -1.181710 -5 -3 -5
2022-06-11 -1.833497 -0.263012 -0.368160 -5 -4 -5
2022-06-12 -0.616906 -0.955543 -0.603585 -5 -5 -5
df6 = df.copy()
df6[df6>0] = -df6
df6
                   A         B         C  D  F  E
2022-06-07  0.000000  0.000000  0.000000 -5  0 -5
2022-06-08 -0.069129 -0.976941 -0.287430 -5 -1 -5
2022-06-09 -0.200227 -1.172806 -1.343070 -5 -2 -5
2022-06-10 -0.346164 -1.609961 -1.181710 -5 -3 -5
2022-06-11 -1.833497 -0.263012 -0.368160 -5 -4 -5
2022-06-12 -0.616906 -0.955543 -0.603585 -5 -5 -5
df7 = -df6
df7
                   A         B         C  D  F  E
2022-06-07 -0.000000 -0.000000 -0.000000  5  0  5
2022-06-08  0.069129  0.976941  0.287430  5  1  5
2022-06-09  0.200227  1.172806  1.343070  5  2  5
2022-06-10  0.346164  1.609961  1.181710  5  3  5
2022-06-11  1.833497  0.263012  0.368160  5  4  5
2022-06-12  0.616906  0.955543  0.603585  5  5  5
df7[df7>0]
                   A         B         C  D    F  E
2022-06-07       NaN       NaN       NaN  5  NaN  5
2022-06-08  0.069129  0.976941  0.287430  5  1.0  5
2022-06-09  0.200227  1.172806  1.343070  5  2.0  5
2022-06-10  0.346164  1.609961  1.181710  5  3.0  5
2022-06-11  1.833497  0.263012  0.368160  5  4.0  5
2022-06-12  0.616906  0.955543  0.603585  5  5.0  5
df7[df7>0]=0
df7
              A    B    C  D  F  E
2022-06-07 -0.0 -0.0 -0.0  0  0  0
2022-06-08  0.0  0.0  0.0  0  0  0
2022-06-09  0.0  0.0  0.0  0  0  0
2022-06-10  0.0  0.0  0.0  0  0  0
2022-06-11  0.0  0.0  0.0  0  0  0
2022-06-12  0.0  0.0  0.0  0  0  0
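The masked assignments in this section modify the frame in place. As a non-mutating alternative, the sketch below uses the where() and mask() methods on a small made-up frame (`demo` is an assumed name): where() keeps values where the condition holds, and mask() replaces values where it holds.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

kept = demo.where(demo > 0)            # keep where True, NaN elsewhere (like demo[demo > 0])
replaced = demo.where(demo > 0, 0.0)   # keep where True, 0.0 elsewhere
flipped = demo.mask(demo > 0, -demo)   # negate where True (like demo[demo > 0] = -demo)

print(flipped)
```

Both methods return new frames, so `demo` keeps its original values and the intermediate results can be compared side by side.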

Missing data

  • pandas primarily uses the value np.nan to represent missing data. By default it is not included in computations.

  • Reindexing allows you to change, add or delete the index on a specified axis. This returns a copy of the data:

# Note: df already contains a column "E", so appending "E" again produces a duplicate E column rather than a new all-NaN column
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ["E"])
df1.loc[dates[0] :dates[1],"E"] = 1
df1
                   A         B        C  D  F  E  E
2022-06-07  0.000000  0.000000  0.00000  5  0  1  1
2022-06-08  0.069129  0.976941 -0.28743  5  1  1  1
2022-06-09 -0.200227  1.172806  1.34307  5  2  5  5
2022-06-10 -0.346164 -1.609961  1.18171  5  3  5  5
  • To drop any rows that have missing data (df1 has no missing values yet, so nothing is dropped):
df1.dropna(how="any")
                   A         B        C  D  F  E  E
2022-06-07  0.000000  0.000000  0.00000  5  0  1  1
2022-06-08  0.069129  0.976941 -0.28743  5  1  1  1
2022-06-09 -0.200227  1.172806  1.34307  5  2  5  5
2022-06-10 -0.346164 -1.609961  1.18171  5  3  5  5
# Keep only the positive values; everything else becomes NaN
df1 = df1[df1>0]
df1
                   A         B        C  D    F  E  E
2022-06-07       NaN       NaN      NaN  5  NaN  1  1
2022-06-08  0.069129  0.976941      NaN  5  1.0  1  1
2022-06-09       NaN  1.172806  1.34307  5  2.0  5  5
2022-06-10       NaN       NaN  1.18171  5  3.0  5  5
# Every row now contains at least one NaN, so the result is empty
df1.dropna(how="any")
Empty DataFrame
Columns: [A, B, C, D, F, E, E]
Index: []
# Parsed as -(df6[(-df6) > 0]): keep the strictly negative entries of df6, then negate them
df8 = -df6[-df6>0]
df8
                   A         B         C  D    F  E
2022-06-07       NaN       NaN       NaN  5  NaN  5
2022-06-08  0.069129  0.976941  0.287430  5  1.0  5
2022-06-09  0.200227  1.172806  1.343070  5  2.0  5
2022-06-10  0.346164  1.609961  1.181710  5  3.0  5
2022-06-11  1.833497  0.263012  0.368160  5  4.0  5
2022-06-12  0.616906  0.955543  0.603585  5  5.0  5
# Drop any rows that have missing data
df8.dropna(how="any")
                   A         B         C  D    F  E
2022-06-08  0.069129  0.976941  0.287430  5  1.0  5
2022-06-09  0.200227  1.172806  1.343070  5  2.0  5
2022-06-10  0.346164  1.609961  1.181710  5  3.0  5
2022-06-11  1.833497  0.263012  0.368160  5  4.0  5
2022-06-12  0.616906  0.955543  0.603585  5  5.0  5
  • Filling missing data:
df1.fillna(value=5)
                   A         B        C  D    F  E  E
2022-06-07  5.000000  5.000000  5.00000  5  5.0  1  1
2022-06-08  0.069129  0.976941  5.00000  5  1.0  1  1
2022-06-09  5.000000  1.172806  1.34307  5  2.0  5  5
2022-06-10  5.000000  5.000000  1.18171  5  3.0  5  5
  • To get the boolean mask where values are nan:
pd.isna(df1)
                A      B      C      D      F      E      E
2022-06-07   True   True   True  False   True  False  False
2022-06-08  False  False   True  False  False  False  False
2022-06-09   True  False  False  False  False  False  False
2022-06-10   True   True  False  False  False  False  False
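In practice the three tools above are often used in a more targeted way. The following is a minimal sketch on a made-up frame (`demo` and its values are assumptions): counting missing values per column, dropping rows based on a single column only, and filling each column with its own value.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0], "C": [7.0, 8.0, 9.0]}
)

na_per_column = demo.isna().sum()                         # number of missing values per column
rows_with_a = demo.dropna(subset=["A"])                   # drop rows only when A is missing
filled = demo.fillna({"A": 0.0, "B": demo["B"].mean()})   # a different fill value per column

print(na_per_column)
print(filled)
```

Passing a dict to fillna() avoids overwriting every column with the same constant, which is rarely what a real dataset needs.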

Operations

Stats

  • Operations in general exclude missing data.

  • Performing a descriptive statistic:

df.mean()
A   -0.487944
B    0.205386
C    0.333654
D    5.000000
F    2.500000
E    5.000000
dtype: float64
  • The same operation on the other axis:
df.mean(1)
2022-06-07    1.666667
2022-06-08    1.959773
2022-06-09    2.385941
2022-06-10    2.037597
2022-06-11    2.045275
2022-06-12    2.455842
Freq: D, dtype: float64
df.mean(0)
A   -0.487944
B    0.205386
C    0.333654
D    5.000000
F    2.500000
E    5.000000
dtype: float64
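  • Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension: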
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
s
2022-06-07    NaN
2022-06-08    NaN
2022-06-09    1.0
2022-06-10    3.0
2022-06-11    5.0
2022-06-12    NaN
Freq: D, dtype: float64
pd.Series([1,3,5,np.nan,6,8],index=dates).shift(3)
2022-06-07    NaN
2022-06-08    NaN
2022-06-09    NaN
2022-06-10    1.0
2022-06-11    3.0
2022-06-12    5.0
Freq: D, dtype: float64
pd.Series([1,3,5,np.nan,6,8],index=dates).shift(1)
2022-06-07    NaN
2022-06-08    1.0
2022-06-09    3.0
2022-06-10    5.0
2022-06-11    NaN
2022-06-12    6.0
Freq: D, dtype: float64
pd.Series([1,3,5,np.nan,6,8],index=dates)
2022-06-07    1.0
2022-06-08    3.0
2022-06-09    5.0
2022-06-10    NaN
2022-06-11    6.0
2022-06-12    8.0
Freq: D, dtype: float64
dates
DatetimeIndex(['2022-06-07', '2022-06-08', '2022-06-09', '2022-06-10',
               '2022-06-11', '2022-06-12'],
              dtype='datetime64[ns]', freq='D')
pd.Series([1,3,5,np.nan,7,8],index=dates)
2022-06-07    1.0
2022-06-08    3.0
2022-06-09    5.0
2022-06-10    NaN
2022-06-11    7.0
2022-06-12    8.0
Freq: D, dtype: float64
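# By default, shift() moves the values one period along the row direction
# shift(2) moves them two periods along the row direction; the vacated positions become NaN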
s = pd.Series([1,3,5,np.nan,7,8],index=dates).shift(2)
s
2022-06-07    NaN
2022-06-08    NaN
2022-06-09    1.0
2022-06-10    3.0
2022-06-11    5.0
2022-06-12    NaN
Freq: D, dtype: float64
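The shifted Series is only half of the alignment story; the broadcast happens when it is combined with a DataFrame. Below is a minimal sketch, assuming a deterministic stand-in frame `demo` (not the article's random df): sub() with axis="index" subtracts the Series from every column, matching rows by index label.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2022-06-07", periods=6)
# Deterministic stand-in frame: rows are [0, 1], [2, 3], ..., [10, 11].
demo = pd.DataFrame(np.arange(12, dtype=float).reshape(6, 2), index=dates, columns=["A", "B"])
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

# Subtract s from every column, aligned on the row index.
# Rows where s is NaN come out as NaN in the result.
diff = demo.sub(s, axis="index")
print(diff)
```

Without axis="index", the subtraction would try to align the Series labels with the column labels instead, which is not what is wanted here.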

Apply

  • Applying functions to the data:
np.cumsum
<function numpy.cumsum(a, axis=None, dtype=None, out=None)>
df.apply(np.cumsum)
                   A         B         C   D   F   E
2022-06-07  0.000000  0.000000  0.000000   5   0   5
2022-06-08  0.069129  0.976941 -0.287430  10   1  10
2022-06-09 -0.131098  2.149747  1.055640  15   3  15
2022-06-10 -0.477262  0.539786  2.237350  20   6  20
2022-06-11 -2.310759  0.276774  2.605510  25  10  25
2022-06-12 -2.927665  1.232316  2.001924  30  15  30
df.apply(lambda x: x.max() - x.min())
A    1.902626
B    2.782767
C    1.946656
D    0.000000
F    5.000000
E    0.000000
dtype: float64
df.apply(lambda x: x.max() - x.min(),axis=1)
2022-06-07    5.000000
2022-06-08    5.287430
2022-06-09    5.200227
2022-06-10    6.609961
2022-06-11    6.833497
2022-06-12    5.616906
Freq: D, dtype: float64
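To make the effect of `axis` in `apply` easier to verify by hand, here is a small sketch on a made-up frame (the name `frame` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with round numbers.
frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 40]})

# Default axis=0: the function receives one column at a time.
col_range = frame.apply(lambda x: x.max() - x.min())

# axis=1: the function receives one row at a time.
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

# A function that returns an array of the same length keeps the frame's shape.
running = frame.apply(np.cumsum)

print(col_range, row_range, running, sep="\n")
```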
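  • A Series is a one-dimensional data structure made up of an index and values.
  • A DataFrame is a two-dimensional structure; in addition to an index and values, it also has columns.
  • How they relate:
  • A DataFrame is made up of multiple Series: any single row or column taken out of it is a Series.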
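The relationship can be checked directly. This is a minimal sketch with a made-up frame (`frame`, its labels, and its values are illustrative assumptions):

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]}, index=["x", "y", "z"]
)

col = frame["A"]      # one column: a Series indexed by the row labels
row = frame.loc["y"]  # one row: a Series indexed by the column labels

print(type(col).__name__, list(col.index))
print(type(row).__name__, list(row.index))
```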

Histogramming

s = pd.Series(np.random.randint(0,7,size=10))
s
0    2
1    5
2    2
3    4
4    1
5    3
6    0
7    2
8    5
9    4
dtype: int32
s.value_counts()
2    3
5    2
4    2
1    1
3    1
0    1
dtype: int64
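As a small follow-up sketch, the same values can be counted in value order or as relative frequencies. The values below are copied from the output above rather than drawn at random, so the result is reproducible:

```python
import pandas as pd

s = pd.Series([2, 5, 2, 4, 1, 3, 0, 2, 5, 4])

# Order the histogram by value instead of by descending count.
counts = s.value_counts().sort_index()

# normalize=True returns relative frequencies instead of raw counts.
shares = s.value_counts(normalize=True)

print(counts)
print(shares)
```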

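String methods

  • Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern matching in str generally uses regular expressions by default.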
s = pd.Series(["A","B","C","Aaba","Baca",np.nan,"CABA","dog","cat"])
s
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
s.str
<pandas.core.strings.accessor.StringMethods at 0x26552ed2a90>
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
s.str.upper()
0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object
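The remark about regular expressions can be illustrated with `str.contains`. This is a sketch on the same data; the pattern `^[AB]` is only an example:

```python
import numpy as np
import pandas as pd

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

# The pattern is treated as a regular expression by default:
# "^[AB]" means "starts with A or B". na=False maps missing values to False.
starts_ab = s.str.contains(r"^[AB]", na=False)

# regex=False searches for the literal characters "^[AB]" instead.
literal = s.str.contains("^[AB]", regex=False, na=False)

print(starts_ab.tolist())
print(literal.tolist())
```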

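Merge

  • pandas provides various facilities for easily combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational algebra functionality for join/merge-type operations.

  • Concatenating pandas objects together with concat():

# np.random.randn returns a float, or an N-dimensional array of floats, sampled from the standard normal distribution.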
df = pd.DataFrame(np.random.randn(10,4))
df
          0         1         2         3
0  0.879358 -0.162415 -0.122199 -1.436661
1 -0.090463  0.173721 -0.425374 -0.509393
2 -1.155403 -1.351560  0.032734  0.085148
3 -0.808055 -1.637611  0.382922  0.525315
4  0.659453 -0.851103  0.214721  1.031853
5  0.532633  1.506630  1.476901 -1.016453
6  0.860219  3.015384  1.003056 -2.795348
7  0.580518 -2.575408  1.470146 -1.946652
8 -1.104715  0.954115  0.479431  1.001990
9  0.709469 -1.613924  0.424452 -0.641368
pd.DataFrame(np.random.randn(10,4))
          0         1         2         3
0 -1.231020  0.062966  0.248977 -2.006465
1 -0.121096 -0.790854  1.270002  0.437691
2 -1.342012 -0.213068 -0.632990 -0.454876
3 -2.299231 -0.449179  0.799823  1.320912
4 -0.214516 -0.759868 -0.509929  0.125942
5  1.743264 -0.047220  0.532117  0.087455
6 -0.172050  0.387625  0.903231  1.419179
7  0.610765 -0.666323 -0.396873  0.956829
8 -0.740147  1.397083  0.360241  0.106912
9 -0.402985  1.289189 -0.202836 -1.308507
# Split the frame into three pieces by row
df[:3]
          0         1         2         3
0  0.879358 -0.162415 -0.122199 -1.436661
1 -0.090463  0.173721 -0.425374 -0.509393
2 -1.155403 -1.351560  0.032734  0.085148
df[3:7]
          0         1         2         3
3 -0.808055 -1.637611  0.382922  0.525315
4  0.659453 -0.851103  0.214721  1.031853
5  0.532633  1.506630  1.476901 -1.016453
6  0.860219  3.015384  1.003056 -2.795348
df[7:]
          0         1         2         3
7  0.580518 -2.575408  1.470146 -1.946652
8 -1.104715  0.954115  0.479431  1.001990
9  0.709469 -1.613924  0.424452 -0.641368
pieces=[df[:3],df[3:7],df[7:]]
pieces
[          0         1         2         3
 0  0.879358 -0.162415 -0.122199 -1.436661
 1 -0.090463  0.173721 -0.425374 -0.509393
 2 -1.155403 -1.351560  0.032734  0.085148,
           0         1         2         3
 3 -0.808055 -1.637611  0.382922  0.525315
 4  0.659453 -0.851103  0.214721  1.031853
 5  0.532633  1.506630  1.476901 -1.016453
 6  0.860219  3.015384  1.003056 -2.795348,
           0         1         2         3
 7  0.580518 -2.575408  1.470146 -1.946652
 8 -1.104715  0.954115  0.479431  1.001990
 9  0.709469 -1.613924  0.424452 -0.641368]
pd.concat(pieces)
          0         1         2         3
0  0.879358 -0.162415 -0.122199 -1.436661
1 -0.090463  0.173721 -0.425374 -0.509393
2 -1.155403 -1.351560  0.032734  0.085148
3 -0.808055 -1.637611  0.382922  0.525315
4  0.659453 -0.851103  0.214721  1.031853
5  0.532633  1.506630  1.476901 -1.016453
6  0.860219  3.015384  1.003056 -2.795348
7  0.580518 -2.575408  1.470146 -1.946652
8 -1.104715  0.954115  0.479431  1.001990
9  0.709469 -1.613924  0.424452 -0.641368
a = [df[3:]]
a
[          0         1         2         3
 3 -0.808055 -1.637611  0.382922  0.525315
 4  0.659453 -0.851103  0.214721  1.031853
 5  0.532633  1.506630  1.476901 -1.016453
 6  0.860219  3.015384  1.003056 -2.795348
 7  0.580518 -2.575408  1.470146 -1.946652
 8 -1.104715  0.954115  0.479431  1.001990
 9  0.709469 -1.613924  0.424452 -0.641368]
pd.concat(a)
          0         1         2         3
3 -0.808055 -1.637611  0.382922  0.525315
4  0.659453 -0.851103  0.214721  1.031853
5  0.532633  1.506630  1.476901 -1.016453
6  0.860219  3.015384  1.003056 -2.795348
7  0.580518 -2.575408  1.470146 -1.946652
8 -1.104715  0.954115  0.479431  1.001990
9  0.709469 -1.613924  0.424452 -0.641368
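  • [Note]
  • Adding a column to a DataFrame is relatively fast. Adding a row, however, requires a copy and may be expensive. It is recommended to pass a pre-built list of records to the DataFrame constructor rather than building a DataFrame by iteratively appending records to it.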
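The recommendation in the note can be sketched as follows. The record fields `id` and `square` are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Collect the rows in a plain Python list first (cheap appends).
records = []
for i in range(5):
    records.append({"id": i, "square": i * i})

# Build the DataFrame once at the end, instead of growing it row by row.
frame = pd.DataFrame(records)
print(frame)
```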

Join

  • SQL-style merges
left = pd.DataFrame({"key":["foo","foo"],"lval":[1,2]})
left
   key  lval
0  foo     1
1  foo     2
right = pd.DataFrame({"key":["foo","foo"],"rval":[4,5]})
right
   key  rval
0  foo     4
1  foo     5
pd.merge(left,right,on="key")
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
# pd.merge(left,right,on="rval") raises an error, because "rval" is not a column shared by both frames
  • Another example:
left = pd.DataFrame({"key":["foo","bar"],"lval":[1,2]})
left
   key  lval
0  foo     1
1  bar     2
right = pd.DataFrame({"key":["foo","bar"],"rval":[4,5]})
right
   key  rval
0  foo     4
1  bar     5
pd.merge(left,right,on="key")
   key  lval  rval
0  foo     1     4
1  bar     2     5
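Both examples above use the default inner join. As a sketch of what happens when the keys do not fully overlap, the following uses a made-up key `"baz"` in `right` (an assumption for illustration) together with the `how` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})  # "baz" is hypothetical

# Default how="inner": only keys present in both frames survive.
inner = pd.merge(left, right, on="key")

# how="outer" keeps every key; indicator=True adds a "_merge" column
# recording where each row came from.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```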

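Grouping

  • By "group by" we mean a process involving one or more of the following steps:
  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure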
df = pd.DataFrame({
    "A":["foo","bar","foo","bar","foo","bar","foo","foo"],
    "B":["zero","one","two","there","four","five","six","seven"],
    "C":np.random.randn(8),
    "D":np.random.randn(8),
    }
)
df
     A      B         C         D
0  foo   zero  0.729545  0.301263
1  bar    one  1.603889  0.458280
2  foo    two  0.633382  1.820535
3  bar  there -0.723170  1.917200
4  foo   four  0.581405  0.961305
5  bar   five -1.414755  0.986130
6  foo    six  0.577222  0.851816
7  foo  seven -1.318073  0.757913
  • Grouping and then applying the sum() function to the resulting groups:
df.groupby("A")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002655C66AFD0>
df.groupby("A").sum()
            C         D
A
bar -0.534035  3.361610
foo  1.203481  4.692833
df.index
RangeIndex(start=0, stop=8, step=1)
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.groupby("D").mean()
                 C
D
0.301263  0.729545
0.458280  1.603889
0.757913 -1.318073
0.851816  0.577222
0.961305  0.581405
0.986130 -1.414755
1.820535  0.633382
1.917200 -0.723170
df.groupby("D").sum()
                 C
D
0.301263  0.729545
0.458280  1.603889
0.757913 -1.318073
0.851816  0.577222
0.961305  0.581405
0.986130 -1.414755
1.820535  0.633382
1.917200 -0.723170
df.sum()
A           foobarfoobarfoobarfoofoo
B    zeroonetwotherefourfivesixseven
C                           0.669445
D                           8.054443
dtype: object
df.groupby("B").sum()
              C         D
B
five  -1.414755  0.986130
four   0.581405  0.961305
one    1.603889  0.458280
seven -1.318073  0.757913
six    0.577222  0.851816
there -0.723170  1.917200
two    0.633382  1.820535
zero   0.729545  0.301263
  • Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
df.groupby(["A","B"]).sum()
                  C         D
A   B
bar five  -1.414755  0.986130
    one    1.603889  0.458280
    there -0.723170  1.917200
foo four   0.581405  0.961305
    seven -1.318073  0.757913
    six    0.577222  0.851816
    two    0.633382  1.820535
    zero   0.729545  0.301263
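The three steps (split, apply, combine) can also be spelled out by hand. This is a sketch on a small made-up frame, not the random data above, comparing the manual version with the built-in one:

```python
import pandas as pd

# Hypothetical data with round numbers.
frame = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]}
)

# Split: iterate over (key, sub-frame) pairs.
# Apply: sum column C within each group.
# Combine: collect the per-group results into one Series.
manual = pd.Series(
    {key: group["C"].sum() for key, group in frame.groupby("A")}
)

# The built-in aggregation performs the same three steps in one call.
builtin = frame.groupby("A")["C"].sum()

print(manual)
print(builtin)
```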

Reshaping

Stack

tuples = list(
zip(
    *[
        ["bar","bar","baz","baz","foo","foo","qux","que"],
        ["one","two","one","two","one","two","one","there"]
     ]
    )
)
tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('que', 'there')]
# Four ways to construct a MultiIndex:
# from_tuples: from a list of tuples, one tuple per index entry
# from_arrays: from parallel arrays, paired element by element
# from_product: from the Cartesian product ("all combinations") of the given iterables
# from_frame: from the columns of a DataFrame
index = pd.MultiIndex.from_tuples(tuples,names=["first","second"])
index
MultiIndex([('bar',   'one'),
            ('bar',   'two'),
            ('baz',   'one'),
            ('baz',   'two'),
            ('foo',   'one'),
            ('foo',   'two'),
            ('qux',   'one'),
            ('que', 'there')],
           names=['first', 'second'])
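As a sketch of the four constructors listed in the comments, the following builds the same small index four ways. It uses a regular subset of the labels (two values per level) so that `from_product` applies as well:

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")]
names = ["first", "second"]

from_tuples = pd.MultiIndex.from_tuples(tuples, names=names)

# Parallel arrays: position i of each array forms entry i.
from_arrays = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]], names=names
)

# Cartesian product of the level values.
from_product = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=names
)

# Column names of the frame become the level names.
from_frame = pd.MultiIndex.from_frame(pd.DataFrame(tuples, columns=names))

print(from_tuples.equals(from_arrays), from_tuples.equals(from_product),
      from_tuples.equals(from_frame))
```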
df = pd.DataFrame(np.random.randn(8,2),index=index,columns=["A","B"])
df
                     A         B
first second
bar   one    -0.484529  1.071371
      two    -0.295187  1.255282
baz   one    -0.341345  0.790919
      two    -1.297284 -1.285871
foo   one    -0.598789 -0.624319
      two    -0.297379 -2.395637
qux   one    -0.194991  0.614675
que   there   0.866328  0.098695
df2 = df[:4]
df2
                     A         B
first second
bar   one    -0.484529  1.071371
      two    -0.295187  1.255282
baz   one    -0.341345  0.790919
      two    -1.297284 -1.285871
  • The stack() method "compresses" a level in the DataFrame's columns:
stacked = df2.stack()
stacked
first  second   
bar    one     A   -0.484529
               B    1.071371
       two     A   -0.295187
               B    1.255282
baz    one     A   -0.341345
               B    0.790919
       two     A   -1.297284
               B   -1.285871
dtype: float64
  • With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
                     A         B
first second
bar   one    -0.484529  1.071371
      two    -0.295187  1.255282
baz   one    -0.341345  0.790919
      two    -1.297284 -1.285871
stacked.unstack().unstack()
               A                   B
second       one       two       one       two
first
bar    -0.484529 -0.295187  1.071371  1.255282
baz    -0.341345 -1.297284  0.790919 -1.285871
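  • stack() turns a two-dimensional table into a one-dimensional one
  • unstack() is the inverse of stack(): it turns a one-dimensional table back into a two-dimensional one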
stacked.unstack(1)
second        one       two
first
bar   A -0.484529 -0.295187
      B  1.071371  1.255282
baz   A -0.341345 -1.297284
      B  0.790919 -1.285871
stacked.unstack(2)
                     A         B
first second
bar   one    -0.484529  1.071371
      two    -0.295187  1.255282
baz   one    -0.341345  0.790919
      two    -1.297284 -1.285871
stacked.unstack(0)
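The round trip between stack() and unstack() can be verified on fixed numbers. This is a sketch in which `frame` and its values are illustrative assumptions; recent pandas versions may emit a FutureWarning about the stack() implementation, which does not affect the result here:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]}, index=index
)

stacked = frame.stack()       # columns A/B become the innermost index level
restored = stacked.unstack()  # default: move the last level back to columns

# A level can also be selected by name (or by position, as in unstack(1)).
by_second = stacked.unstack("second")

print(restored.equals(frame))
print(by_second)
```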

Pivot tables

df = pd.DataFrame({
    "A":["one","one","two","three"] * 3,
    "B":["A","B","C"] * 4,
    "C":["foo","foo","foo","bar","bar","bar"]*2,
    "D": np.random.randn(12),
    "E": np.random.randn(12),
})
df
        A  B    C         D         E
0     one  A  foo  0.013846  0.444128
1     one  B  foo  1.785051 -0.880777
2     two  C  foo  2.020651 -1.403231
3   three  A  bar -0.623111 -0.053250
4     one  B  bar -0.022848 -0.821333
5     one  C  bar -0.962751  0.691853
6     two  A  foo  0.991734 -1.796295
7   three  B  foo -0.326107 -1.437360
8     one  C  foo  1.634899 -0.036184
9     one  A  bar -0.099110  1.219143
10    two  B  bar  0.140044  2.462987
11  three  C  bar  1.043458 -0.416262
  • Producing a pivot table:
pd.pivot_table(df,values="D",index=["A","B"],columns=["C"])
C             bar       foo
A     B
one   A -0.099110  0.013846
      B -0.022848  1.785051
      C -0.962751  1.634899
three A -0.623111       NaN
      B       NaN -0.326107
      C  1.043458       NaN
two   A       NaN  0.991734
      B  0.140044       NaN
      C       NaN  2.020651
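In the table above every (A, B, C) combination occurs at most once, so the aggregation is not visible. The sketch below uses a made-up frame with a repeated combination to show that `pivot_table` averages by default and fills missing combinations with NaN, and that the same numbers come out of groupby followed by unstack:

```python
import pandas as pd

# Hypothetical data: ("two", "foo") occurs twice, ("two", "bar") never.
frame = pd.DataFrame({
    "A": ["one", "one", "two", "two"],
    "C": ["foo", "bar", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0],
})

# Default aggregation is the mean of the values in each cell.
table = pd.pivot_table(frame, values="D", index="A", columns="C")

# Equivalent route: group, aggregate, then move "C" into the columns.
same = frame.groupby(["A", "C"])["D"].mean().unstack("C")

print(table)
```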

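Time series

  • pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.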
rng = pd.date_range("6/8/2022",periods=100,freq="S")
rng
DatetimeIndex(['2022-06-08 00:00:00', '2022-06-08 00:00:01',
               '2022-06-08 00:00:02', '2022-06-08 00:00:03',
               '2022-06-08 00:00:04', '2022-06-08 00:00:05',
               '2022-06-08 00:00:06', '2022-06-08 00:00:07',
               '2022-06-08 00:00:08', '2022-06-08 00:00:09',
               '2022-06-08 00:00:10', '2022-06-08 00:00:11',
               '2022-06-08 00:00:12', '2022-06-08 00:00:13',
               '2022-06-08 00:00:14', '2022-06-08 00:00:15',
               '2022-06-08 00:00:16', '2022-06-08 00:00:17',
               '2022-06-08 00:00:18', '2022-06-08 00:00:19',
               '2022-06-08 00:00:20', '2022-06-08 00:00:21',
               '2022-06-08 00:00:22', '2022-06-08 00:00:23',
               '2022-06-08 00:00:24', '2022-06-08 00:00:25',
               '2022-06-08 00:00:26', '2022-06-08 00:00:27',
               '2022-06-08 00:00:28', '2022-06-08 00:00:29',
               '2022-06-08 00:00:30', '2022-06-08 00:00:31',
               '2022-06-08 00:00:32', '2022-06-08 00:00:33',
               '2022-06-08 00:00:34', '2022-06-08 00:00:35',
               '2022-06-08 00:00:36', '2022-06-08 00:00:37',
               '2022-06-08 00:00:38', '2022-06-08 00:00:39',
               '2022-06-08 00:00:40', '2022-06-08 00:00:41',
               '2022-06-08 00:00:42', '2022-06-08 00:00:43',
               '2022-06-08 00:00:44', '2022-06-08 00:00:45',
               '2022-06-08 00:00:46', '2022-06-08 00:00:47',
               '2022-06-08 00:00:48', '2022-06-08 00:00:49',
               '2022-06-08 00:00:50', '2022-06-08 00:00:51',
               '2022-06-08 00:00:52', '2022-06-08 00:00:53',
               '2022-06-08 00:00:54', '2022-06-08 00:00:55',
               '2022-06-08 00:00:56', '2022-06-08 00:00:57',
               '2022-06-08 00:00:58', '2022-06-08 00:00:59',
               '2022-06-08 00:01:00', '2022-06-08 00:01:01',
               '2022-06-08 00:01:02', '2022-06-08 00:01:03',
               '2022-06-08 00:01:04', '2022-06-08 00:01:05',
               '2022-06-08 00:01:06', '2022-06-08 00:01:07',
               '2022-06-08 00:01:08', '2022-06-08 00:01:09',
               '2022-06-08 00:01:10', '2022-06-08 00:01:11',
               '2022-06-08 00:01:12', '2022-06-08 00:01:13',
               '2022-06-08 00:01:14', '2022-06-08 00:01:15',
               '2022-06-08 00:01:16', '2022-06-08 00:01:17',
               '2022-06-08 00:01:18', '2022-06-08 00:01:19',
               '2022-06-08 00:01:20', '2022-06-08 00:01:21',
               '2022-06-08 00:01:22', '2022-06-08 00:01:23',
               '2022-06-08 00:01:24', '2022-06-08 00:01:25',
               '2022-06-08 00:01:26', '2022-06-08 00:01:27',
               '2022-06-08 00:01:28', '2022-06-08 00:01:29',
               '2022-06-08 00:01:30', '2022-06-08 00:01:31',
               '2022-06-08 00:01:32', '2022-06-08 00:01:33',
               '2022-06-08 00:01:34', '2022-06-08 00:01:35',
               '2022-06-08 00:01:36', '2022-06-08 00:01:37',
               '2022-06-08 00:01:38', '2022-06-08 00:01:39'],
              dtype='datetime64[ns]', freq='S')
ts = pd.Series(np.random.randint(0,500,len(rng)),index=rng)
ts
2022-06-08 00:00:00    355
2022-06-08 00:00:01    109
2022-06-08 00:00:02    457
2022-06-08 00:00:03    481
2022-06-08 00:00:04    220
                      ... 
2022-06-08 00:01:35    104
2022-06-08 00:01:36    461
2022-06-08 00:01:37    176
2022-06-08 00:01:38     37
2022-06-08 00:01:39     26
Freq: S, Length: 100, dtype: int32
# Resampling means converting a time series from one frequency to another
ts.resample("5Min").sum()
2022-06-08    24384
Freq: 5T, dtype: int32
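Because the 100 seconds above all fall into a single 5-minute bin, the binning itself is hard to see. The following sketch uses made-up one-minute data (values 0 to 9) that spans two bins; the lowercase alias `"min"` is used here because it is accepted by both older and newer pandas versions:

```python
import pandas as pd

# Ten one-minute readings starting at midnight (illustrative values).
rng = pd.date_range("2022-06-08", periods=10, freq="min")
ts = pd.Series(range(10), index=rng)

# Each 5-minute bin is labelled by its left edge and includes that edge.
five_min = ts.resample("5min").sum()
print(five_min)
```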
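  • Time zone representation
# Signature of pd.date_range, shown for reference only (not meant to be run as-is):
# pd.date_range(
#     start=None,       # start of the range
#     end=None,         # end of the range
#     periods=None,     # number of timestamps to generate
#     freq=None,        # spacing between timestamps
#     tz=None,          # time zone
#     normalize=False,  # whether to normalize start/end to midnight
#     name=None,        # name of the resulting index
#     closed=None,      # whether the endpoints are included (newer pandas uses inclusive instead)
#     **kwargs,
# )
# freq="D" means calendar-day frequency; freq="S" means one-second frequency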
rng = pd.date_range("6/8/2022 14:30",periods=3,freq="D")
rng
DatetimeIndex(['2022-06-08 14:30:00', '2022-06-09 14:30:00',
               '2022-06-10 14:30:00'],
              dtype='datetime64[ns]', freq='D')
pd.date_range("6/8/2022 14:30",periods=6,freq="D")
DatetimeIndex(['2022-06-08 14:30:00', '2022-06-09 14:30:00',
               '2022-06-10 14:30:00', '2022-06-11 14:30:00',
               '2022-06-12 14:30:00', '2022-06-13 14:30:00'],
              dtype='datetime64[ns]', freq='D')
# freq="T" means one-minute frequency
pd.date_range("6/8/2022 14:30",periods=6,freq="T")
DatetimeIndex(['2022-06-08 14:30:00', '2022-06-08 14:31:00',
               '2022-06-08 14:32:00', '2022-06-08 14:33:00',
               '2022-06-08 14:34:00', '2022-06-08 14:35:00'],
              dtype='datetime64[ns]', freq='T')
rng = pd.date_range("6/8/2022 14:30",periods=6,freq="D")
rng
DatetimeIndex(['2022-06-08 14:30:00', '2022-06-09 14:30:00',
               '2022-06-10 14:30:00', '2022-06-11 14:30:00',
               '2022-06-12 14:30:00', '2022-06-13 14:30:00'],
              dtype='datetime64[ns]', freq='D')
ts = pd.Series(np.random.randn(len(rng)),rng)
ts
2022-06-08 14:30:00   -1.849361
2022-06-09 14:30:00    1.354631
2022-06-10 14:30:00    0.412876
2022-06-11 14:30:00    1.465844
2022-06-12 14:30:00    0.665059
2022-06-13 14:30:00    2.036140
Freq: D, dtype: float64
# UTC: Coordinated Universal Time, the single time standard used worldwide
ts_utc = ts.tz_localize("UTC")
ts_utc
2022-06-08 14:30:00+00:00   -1.849361
2022-06-09 14:30:00+00:00    1.354631
2022-06-10 14:30:00+00:00    0.412876
2022-06-11 14:30:00+00:00    1.465844
2022-06-12 14:30:00+00:00    0.665059
2022-06-13 14:30:00+00:00    2.036140
Freq: D, dtype: float64
  • Converting to another time zone:
ts_utc.tz_convert("US/Eastern")
2022-06-08 10:30:00-04:00   -1.849361
2022-06-09 10:30:00-04:00    1.354631
2022-06-10 10:30:00-04:00    0.412876
2022-06-11 10:30:00-04:00    1.465844
2022-06-12 10:30:00-04:00    0.665059
2022-06-13 10:30:00-04:00    2.036140
Freq: D, dtype: float64
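The difference between tz_localize and tz_convert can be sketched with two made-up timestamps, one in summer and one in winter. The zone name "America/New_York" is used here as the IANA name for the zone that "US/Eastern" refers to:

```python
import pandas as pd

naive = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2022-06-08 14:30", "2022-12-08 14:30"]),
)

# tz_localize attaches a zone; the wall-clock times stay the same.
utc = naive.tz_localize("UTC")

# tz_convert keeps the instants and changes the wall-clock times.
eastern = utc.tz_convert("America/New_York")

print(eastern)
```

The summer timestamp is shifted by four hours and the winter one by five, since the conversion takes daylight saving time into account.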
  • Converting between time span representations:
rng = pd.date_range("6/8/2022",periods=5,freq="M")
rng
DatetimeIndex(['2022-06-30', '2022-07-31', '2022-08-31', '2022-09-30',
               '2022-10-31'],
              dtype='datetime64[ns]', freq='M')
pd.date_range("6/8/2022",periods=8,freq="M")
DatetimeIndex(['2022-06-30', '2022-07-31', '2022-08-31', '2022-09-30',
               '2022-10-31', '2022-11-30', '2022-12-31', '2023-01-31'],
              dtype='datetime64[ns]', freq='M')
ts = pd.Series(np.random.randn(len(rng)),index=rng)
ts
2022-06-30    1.211478
2022-07-31    0.086780
2022-08-31   -0.373740
2022-09-30   -1.602854
2022-10-31    0.428028
Freq: M, dtype: float64
# to_period() converts timestamps into time spans (periods) such as days, weeks, months, or quarters; here the monthly frequency is taken from the index
ps = ts.to_period()
ps
2022-06    1.211478
2022-07    0.086780
2022-08   -0.373740
2022-09   -1.602854
2022-10    0.428028
Freq: M, dtype: float64
# to_timestamp() converts the periods back to timestamps; by default each period maps to its start time
ps.to_timestamp()
2022-06-01    1.211478
2022-07-01    0.086780
2022-08-01   -0.373740
2022-09-01   -1.602854
2022-10-01    0.428028
Freq: MS, dtype: float64
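  • Converting between period and timestamp enables some convenient arithmetic functions to be used.

  • In the following example, we convert a quarterly frequency with the year ending in November to 9am of the first day of the month following the quarter end:

# pd.period_range generates a PeriodIndex that can be used as the index of a Series.
# freq="Q-NOV": quarterly periods, with the year ending in November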
prng = pd.period_range("2021Q1","2022Q4",freq="Q-NOV")
prng
PeriodIndex(['2021Q1', '2021Q2', '2021Q3', '2021Q4', '2022Q1', '2022Q2',
             '2022Q3', '2022Q4'],
            dtype='period[Q-NOV]')
ts = pd.Series(np.random.randn(len(prng)),prng)
ts
2021Q1   -0.506796
2021Q2   -0.481430
2021Q3   -0.078390
2021Q4   -0.080919
2022Q1    0.057916
2022Q2   -0.151808
2022Q3   -0.936490
2022Q4   -0.320068
Freq: Q-NOV, dtype: float64
ts.index = (prng.asfreq("M","e") + 1).asfreq("H","s") + 9
ts.index
PeriodIndex(['2021-03-01 09:00', '2021-06-01 09:00', '2021-09-01 09:00',
             '2021-12-01 09:00', '2022-03-01 09:00', '2022-06-01 09:00',
             '2022-09-01 09:00', '2022-12-01 09:00'],
            dtype='period[H]')
ts.head()
2021-03-01 09:00   -0.506796
2021-06-01 09:00   -0.481430
2021-09-01 09:00   -0.078390
2021-12-01 09:00   -0.080919
2022-03-01 09:00    0.057916
Freq: H, dtype: float64
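A simpler sketch of the period/timestamp conversions, using three made-up monthly periods, shows what the `how` parameter of `to_timestamp` controls:

```python
import pandas as pd

# Monthly periods: each label covers a whole month, not a single instant.
periods = pd.period_range("2022-06", periods=3, freq="M")
ps = pd.Series([1.0, 2.0, 3.0], index=periods)

starts = ps.to_timestamp()         # default how="start": first day of the month
ends = ps.to_timestamp(how="end")  # last moment of the month
back = starts.to_period("M")       # timestamps back to monthly periods

print(starts.index)
print(ends.index)
```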

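Categoricals

  • pandas can include categorical data in a DataFrame. For full details, see the pandas documentation on categorical data.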
df = pd.DataFrame(
{"id":[1,2,3,4,5,6],"raw_grade":["a","b","b","a","a","e"]}
)
df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e
  • Converting the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
  • Renaming the categories to more meaningful names:
df["grade"] = df["grade"].cat.rename_categories(["very good","good","very bad"])
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
  • Reordering the categories and simultaneously adding the missing categories (methods under Series.cat return a new Series by default):
df["grade"] = df["grade"].cat.set_categories(
["very bad","bad","medium","good","very good"]
)
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
  • Sorting is per order in the categories, not lexical order:
df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
  • Grouping by a categorical column also shows empty categories:
df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
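The contrast between category order and lexical order can be sketched on a standalone Series. It uses the same grades as above, with `rename_categories` for the renaming step:

```python
import pandas as pd

grades = pd.Series(["a", "b", "b", "a", "a", "e"]).astype("category")
grades = grades.cat.rename_categories(["very good", "good", "very bad"])
grades = grades.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Sorting a categorical follows the order of the categories...
ordered = grades.sort_values()

# ...whereas plain strings sort lexically.
as_text = grades.astype(str).sort_values()

# Unused categories are still reported, with a count of 0.
counts = grades.value_counts(sort=False)

print(ordered.tolist())
print(as_text.tolist())
print(counts)
```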

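Plotting

  • We use the standard convention for referencing the matplotlib API:
import matplotlib.pyplot as plt
plt.close("all")
  • The close() function closes figure windows; passing "all" closes every open figure: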
ts = pd.Series(np.random.randn(1000),index=pd.date_range("6/8/2022",periods=1000))
ts = ts.cumsum()
ts
2022-06-08    -0.416538
2022-06-09    -1.186893
2022-06-10    -0.974144
2022-06-11    -0.929173
2022-06-12     0.371832
                ...    
2025-02-27    14.514577
2025-02-28    15.186525
2025-03-01    15.595083
2025-03-02    16.554780
2025-03-03    17.165945
Freq: D, Length: 1000, dtype: float64
ts.plot()
<AxesSubplot:>

[Figure output_382_1.png: line plot of the cumulative-sum series ts; image unavailable in the source]

  • When running under Jupyter Notebook, the plot appears when plot() is called. Otherwise, use matplotlib.pyplot.show to display it or matplotlib.pyplot.savefig to write it to a file.
plt.show()
  • On a DataFrame, the plot() method is a convenient way to plot all of the columns with labels:
ts.index
DatetimeIndex(['2022-06-08', '2022-06-09', '2022-06-10', '2022-06-11',
               '2022-06-12', '2022-06-13', '2022-06-14', '2022-06-15',
               '2022-06-16', '2022-06-17',
               ...
               '2025-02-22', '2025-02-23', '2025-02-24', '2025-02-25',
               '2025-02-26', '2025-02-27', '2025-02-28', '2025-03-01',
               '2025-03-02', '2025-03-03'],
              dtype='datetime64[ns]', length=1000, freq='D')
df = pd.DataFrame(
 np.random.randn(1000,4),index=ts.index,columns=["A","B","C","D"]
)
df
                   A         B         C         D
2022-06-08 -1.411527 -0.124331 -0.748194  0.795625
2022-06-09  0.327356  1.127876 -0.176681 -0.140429
2022-06-10 -0.546087  0.056621  0.879618  0.111533
2022-06-11 -0.723865 -1.197658 -0.134488  0.762858
2022-06-12 -0.584152 -0.205798 -0.457109  0.613583
...              ...       ...       ...       ...
2025-02-27  0.952618  0.809016 -1.256770  0.544052
2025-02-28 -0.325551 -1.333431 -2.593479  0.753844
2025-03-01  0.072350  0.950298  1.112801  0.644935
2025-03-02 -0.149229 -0.704682 -1.647990  0.780895
2025-03-03  0.944789  0.680362  0.892620 -1.074460

1000 rows × 4 columns

# df.cumsum() accumulates down each column: every row holds the running total of all rows up to and including it
df.cumsum()
                   A          B          C         D
2022-06-08 -1.411527  -0.124331  -0.748194  0.795625
2022-06-09 -1.084171   1.003545  -0.924875  0.655196
2022-06-10 -1.630259   1.060166  -0.045257  0.766730
2022-06-11 -2.354124  -0.137492  -0.179745  1.529588
2022-06-12 -2.938276  -0.343290  -0.636854  2.143171
...              ...        ...        ...       ...
2025-02-27  5.425325  33.513824  33.972694  0.586048
2025-02-28  5.099775  32.180394  31.379215  1.339892
2025-03-01  5.172124  33.130692  32.492016  1.984828
2025-03-02  5.022895  32.426009  30.844026  2.765723
2025-03-03  5.967683  33.106372  31.736647  1.691263

1000 rows × 4 columns

plt.figure()
<Figure size 432x288 with 0 Axes>
df.plot()
<AxesSubplot:>

[Figure output_392_1.png: line plot of columns A, B, C, and D of df; image unavailable in the source]

plt.legend(loc='best')
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
<matplotlib.legend.Legend at 0x265646b7d90>

[Figure output_393_2.png; image unavailable in the source]

Getting data in/out

CSV

  • Writing to a csv file:
df.to_csv("foo.csv")
  • Reading from a csv file:
pd.read_csv("foo.csv")
     Unnamed: 0         A         B         C         D
0    2022-06-08 -1.411527 -0.124331 -0.748194  0.795625
1    2022-06-09  0.327356  1.127876 -0.176681 -0.140429
2    2022-06-10 -0.546087  0.056621  0.879618  0.111533
3    2022-06-11 -0.723865 -1.197658 -0.134488  0.762858
4    2022-06-12 -0.584152 -0.205798 -0.457109  0.613583
..          ...       ...       ...       ...       ...
995  2025-02-27  0.952618  0.809016 -1.256770  0.544052
996  2025-02-28 -0.325551 -1.333431 -2.593479  0.753844
997  2025-03-01  0.072350  0.950298  1.112801  0.644935
998  2025-03-02 -0.149229 -0.704682 -1.647990  0.780895
999  2025-03-03  0.944789  0.680362  0.892620 -1.074460

1000 rows × 5 columns
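The extra "Unnamed: 0" column is the date index that to_csv wrote as the first column. One way to get the index back is sketched below; it uses a tiny made-up frame and an in-memory buffer in place of "foo.csv" so that no file is written:

```python
import io
import pandas as pd

frame = pd.DataFrame(
    {"A": [1.5, 2.5]},
    index=pd.to_datetime(["2022-06-08", "2022-06-09"]),
)

buffer = io.StringIO()  # in-memory stand-in for "foo.csv"
frame.to_csv(buffer)

# Plain read: the old index comes back as an ordinary column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# index_col=0 uses the first column as the index again;
# parse_dates=True parses that index as dates.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(list(plain.columns))
print(restored.index)
```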

HDF5

  • Reading and writing to HDFStores
  • Writing to an HDF5 store:
df.to_hdf("foo.h5",key="df")
  • Reading from an HDF5 store:
pd.read_hdf("foo.h5","df")
                   A         B         C         D
2022-06-08 -1.411527 -0.124331 -0.748194  0.795625
2022-06-09  0.327356  1.127876 -0.176681 -0.140429
2022-06-10 -0.546087  0.056621  0.879618  0.111533
2022-06-11 -0.723865 -1.197658 -0.134488  0.762858
2022-06-12 -0.584152 -0.205798 -0.457109  0.613583
...              ...       ...       ...       ...
2025-02-27  0.952618  0.809016 -1.256770  0.544052
2025-02-28 -0.325551 -1.333431 -2.593479  0.753844
2025-03-01  0.072350  0.950298  1.112801  0.644935
2025-03-02 -0.149229 -0.704682 -1.647990  0.780895
2025-03-03  0.944789  0.680362  0.892620 -1.074460

1000 rows × 4 columns

Excel

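Reading and writing to MS Excel

  • Writing to an excel file:
# excel_writer: target path or ExcelWriter object
# sheet_name: name of the sheet to write to
# index: defaults to True, which writes the row index; index=False omits it
df.to_excel("foo.xlsx",sheet_name="Sheet1")
  • Reading from an excel file: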
pd.read_excel("foo.xlsx","Sheet1",index_col=None,na_values=['NA'])
     Unnamed: 0         A         B         C         D
0    2022-06-08 -1.411527 -0.124331 -0.748194  0.795625
1    2022-06-09  0.327356  1.127876 -0.176681 -0.140429
2    2022-06-10 -0.546087  0.056621  0.879618  0.111533
3    2022-06-11 -0.723865 -1.197658 -0.134488  0.762858
4    2022-06-12 -0.584152 -0.205798 -0.457109  0.613583
..          ...       ...       ...       ...       ...
995  2025-02-27  0.952618  0.809016 -1.256770  0.544052
996  2025-02-28 -0.325551 -1.333431 -2.593479  0.753844
997  2025-03-01  0.072350  0.950298  1.112801  0.644935
998  2025-03-02 -0.149229 -0.704682 -1.647990  0.780895
999  2025-03-03  0.944789  0.680362  0.892620 -1.074460

1000 rows × 5 columns

Added: 2022-06-18 23:23:49  Updated: 2022-06-18 23:24:21
 