A DataFrame has both a row index and a column index; it can be viewed as a dictionary of Series that all share a single index.
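The "dictionary of Series" view can be illustrated with a minimal sketch. The names a, b and the sample values are assumptions made up for illustration; they are not part of the original notes.

```python
import pandas as pd

# Two Series that share the same row labels (hypothetical sample data).
a = pd.Series([1.0, 2.0, 3.0], index=['one', 'two', 'three'])
b = pd.Series([10.0, 20.0, 30.0], index=['one', 'two', 'three'])

# Building a DataFrame from a dict of Series: the keys become column names.
df = pd.DataFrame({'a': a, 'b': b})

# Dict-style lookup by column name gives back a Series on the shared index.
col = df['a']
print(type(col).__name__)
print(col.index.equals(df.index))
```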
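I. Column indexing:
1. df['col_name']: select column data by column name
Selection is by column name: selecting a single column returns a Series, and selecting several columns returns a DataFrame.
df[] is normally used to select columns, with the column name(s) inside the brackets. For this reason columns are usually given explicit names rather than left as the default integer labels, which are easy to confuse with the index.
- Selecting a single column returns a Series, and print shows it in Series format;
- Selecting multiple columns returns a DataFrame, and print shows it in DataFrame format;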
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
index=['one', 'two', 'three'],
columns=['a', 'b', 'c', 'd'])
print("df = ", df)
print('-' * 100)
data1 = df['a']
data2 = df[['a', 'c']]
print("data1 = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print('-' * 100)
print("data2 = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))
Output:
df = a b c d
one 12.427304 39.089892 22.467365 22.711018
two 50.808058 67.916443 39.312617 95.227642
three 3.399731 57.874266 45.771234 99.649908
----------------------------------------------------------------------------------------------------
data1 =
one 12.427304
two 50.808058
three 3.399731
Name: a, dtype: float64
type(data1) = <class 'pandas.core.series.Series'>
----------------------------------------------------------------------------------------------------
data2 =
a c
one 12.427304 22.467365
two 50.808058 39.312617
three 3.399731 45.771234
type(data2) = <class 'pandas.core.frame.DataFrame'>
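One detail that follows from the rule above: what decides the return type is whether a list is passed, not how many names it contains. The sketch below uses deterministic np.arange data (an assumption, chosen instead of random values so the result is reproducible).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])

single = df['a']             # a bare column name returns a Series
single_as_frame = df[['a']]  # a one-element list still returns a DataFrame

print(type(single).__name__)
print(type(single_as_frame).__name__, single_as_frame.shape)
```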
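II. Row indexing
Rows are selected by index: selecting a single row returns a Series, and selecting several rows returns a DataFrame.
1. df[1:3]: select rows by slicing on positional index
Row slice syntax: df[row_index_start : row_index_end]
When df[] contains integers it selects rows, but only as a slice; a single integer is not accepted (df[0] raises a KeyError unless a column happens to be named 0).
- The result is a DataFrame even when only one row is selected.
df[] cannot select a row by a single index label (df['one'] is looked up as a column name).
- df[] with an integer slice selects rows by their default positional index [end excluded].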
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['one', 'two', 'four', 'three'],
columns=['a', 'b', 'c', 'd'])
print("df1 = \n{0}\ntype(df1) = {1}".format(df1, type(df1)))
print('-' * 50)
df2 = df1[1:3]
print("Multi-row indexing: df2 = df1[1:3] = \n{0}\ntype(df2) = {1}".format(df2, type(df2)))
print('-' * 100)
Output:
df1 =
a b c d
one 14.697748 84.130102 75.636127 65.541925
two 25.242130 53.488123 45.072336 24.906057
four 94.686317 88.176227 67.092432 35.897882
three 36.527603 70.150568 27.110961 45.964728
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
Multi-row indexing: df2 = df1[1:3] =
a b c d
two 25.242130 53.488123 45.072336 24.906057
four 94.686317 88.176227 67.092432 35.897882
type(df2) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
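The restrictions on df[] described in this subsection can be checked with a short sketch. It assumes a frame whose column names are strings, so neither 0 nor 'one' is a column; the errors dict is a hypothetical helper used only to record what happened.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'four', 'three'],
                  columns=['a', 'b', 'c', 'd'])

# A single integer or a single row label is looked up as a column name,
# so both lookups fail here because no such column exists.
errors = {}
for key in (0, 'one'):
    try:
        df[key]
    except KeyError:
        errors[key] = 'KeyError'

# An integer slice selects rows by position and always returns a DataFrame.
one_row = df[0:1]

print(errors)
print(type(one_row).__name__, one_row.shape)
```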
2. df.loc[]: select rows by index label
2.1 Single-label indexing: df.loc[1], df.loc['one']
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['one', 'two', 'three', 'four'],
columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
columns=['a', 'b', 'c', 'd'])
print("df1 = \n{0}\ntype(df1) = {1}".format(df1, type(df1)))
print('-' * 50)
print("df2 = \n{0}\ntype(df2) = {1}".format(df2, type(df2)))
print('-' * 100)
data1 = df1.loc['one']
data2 = df2.loc[1]
print("Single-label indexing: data1 = \ndf1.loc['one'] = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print('-' * 50)
print("Single-label indexing: data2 = \ndf2.loc[1] = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))
print('-' * 100)
Output:
df1 =
a b c d
one 93.037642 52.895322 42.547540 95.435676
two 24.088954 56.966169 79.185705 48.582922
three 76.162602 32.962263 41.853371 99.138612
four 24.979909 10.191909 27.335317 20.452524
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df2 =
a b c d
0 21.656858 31.404614 88.520987 41.839721
1 26.884644 9.943081 91.739139 81.479288
2 96.522109 71.673956 55.843560 38.131336
3 73.574839 93.350715 89.358183 45.521198
type(df2) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
Single-label indexing: data1 =
df1.loc['one'] =
a 93.037642
b 52.895322
c 42.547540
d 95.435676
Name: one, dtype: float64
type(data1) = <class 'pandas.core.series.Series'>
--------------------------------------------------
Single-label indexing: data2 =
df2.loc[1] =
a 26.884644
b 9.943081
c 91.739139
d 81.479288
Name: 1, dtype: float64
type(data2) = <class 'pandas.core.series.Series'>
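2.2 Multi-label indexing: df.loc[[3, 2, 1]], df.loc[['one', 'three']]
df.loc[]
- df.loc[]: selects rows by index label [end included when slicing]
- Key point: df.loc[label] selects rows by index, and works with both an explicitly specified index and the default integer index
- An int passed to loc is treated as an index label, not as a positional offset;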
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['one', 'two', 'three', 'four'],
columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
columns=['a', 'b', 'c', 'd'])
print("df1 = \n{0}\ntype(df1) = {1}".format(df1, type(df1)))
print('-' * 50)
print("df2 = \n{0}\ntype(df2) = {1}".format(df2, type(df2)))
print('-' * 100)
data3 = df1.loc[['two', 'three']]
data4 = df2.loc[[3, 2, 1]]
print("Multi-label indexing: data3 = \ndf1.loc[['two', 'three']] = \n{0}\ntype(data3) = {1}".format(data3, type(data3)))
print('-' * 50)
print("Multi-label indexing: data4 = \ndf2.loc[[3, 2, 1]] = \n{0}\ntype(data4) = {1}".format(data4, type(data4)))
print('-' * 100)
Output:
df1 =
a b c d
one 93.037642 52.895322 42.547540 95.435676
two 24.088954 56.966169 79.185705 48.582922
three 76.162602 32.962263 41.853371 99.138612
four 24.979909 10.191909 27.335317 20.452524
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df2 =
a b c d
0 21.656858 31.404614 88.520987 41.839721
1 26.884644 9.943081 91.739139 81.479288
2 96.522109 71.673956 55.843560 38.131336
3 73.574839 93.350715 89.358183 45.521198
type(df2) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
Multi-label indexing: data3 =
df1.loc[['two', 'three']] =
a b c d
two 24.088954 56.966169 79.185705 48.582922
three 76.162602 32.962263 41.853371 99.138612
type(data3) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
Multi-label indexing: data4 =
df2.loc[[3, 2, 1]] =
a b c d
3 73.574839 93.350715 89.358183 45.521198
2 96.522109 71.673956 55.843560 38.131336
1 26.884644 9.943081 91.739139 81.479288
type(data4) = <class 'pandas.core.frame.DataFrame'>
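The point that an int passed to loc is a label rather than a position is easiest to see when the integer index is shuffled. The following is a minimal sketch; the index order [2, 3, 1, 0] and the np.arange data are assumptions picked for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=[2, 3, 1, 0],
                  columns=['a', 'b', 'c', 'd'])

by_label = df.loc[0]      # the row whose label is 0, which is the last row here
by_position = df.iloc[0]  # the first row, whose label is 2

print(by_label.name, by_label['a'])
print(by_position.name, by_position['a'])
```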
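2.3 Multi-row indexing: df.loc[1:3], df.loc['one' : 'three']
df.loc[]
- df.loc[] slices rows by index label, and the end label is included [end included]
- The slice follows the order in which the labels appear in the index, not their sort order; it works with both an explicitly specified index and the default integer index
- An int in the slice is treated as an index label, not as a positional offset;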
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['one', 'two', 'four', 'three'],
columns=['a', 'b', 'c', 'd'])
print("df1 = \n{0}\ntype(df1) = {1}".format(df1, type(df1)))
print('-' * 50)
data3 = df1.loc['one':'three']
print("Multi-row indexing: data3 = \ndf1.loc['one':'three'] = \n{0}\ntype(data3) = {1}".format(data3, type(data3)))
print('-' * 100)
df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=[2, 3, 1, 0],
columns=['a', 'b', 'c', 'd'])
print("df2 = \n{0}\ntype(df2) = {1}".format(df2, type(df2)))
print('-' * 50)
data4 = df2.loc[1:0]
print("Multi-row indexing: data4 = \ndf2.loc[1:0] = \n{0}\ntype(data4) = {1}".format(data4, type(data4)))
print('-' * 100)
Output:
df1 =
a b c d
one 23.078737 9.156431 0.439799 64.906356
two 91.265745 12.581287 96.020470 95.584070
four 95.501825 57.346484 73.247475 58.338678
three 29.789496 95.718276 30.426301 94.037233
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
Multi-row indexing: data3 =
df1.loc['one':'three'] =
a b c d
one 23.078737 9.156431 0.439799 64.906356
two 91.265745 12.581287 96.020470 95.584070
four 95.501825 57.346484 73.247475 58.338678
three 29.789496 95.718276 30.426301 94.037233
type(data3) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
df2 =
a b c d
2 56.269699 50.504353 25.417596 65.251456
3 93.865887 64.941123 92.814477 36.161766
1 45.549343 58.922568 84.581590 75.418238
0 38.015140 24.555141 41.455598 92.127194
type(df2) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
Multi-row indexing: data4 =
df2.loc[1:0] =
a b c d
1 45.549343 58.922568 84.581590 75.418238
0 38.015140 24.555141 41.455598 92.127194
type(data4) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
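The end-inclusive behavior of a loc slice is worth contrasting with the end-exclusive positional slice from earlier. This is a small sketch on deterministic data (the np.arange values are an assumption used for reproducibility).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['one', 'two', 'three', 'four'],
                  columns=['a', 'b', 'c', 'd'])

label_slice = df.loc['one':'three']  # label slice: 'three' is included
position_slice = df[0:2]             # positional slice: position 2 is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```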
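3. df.iloc[]: select rows by positional index
- df.iloc[] selects rows by integer position (from 0 to length-1 along the axis)
- It works like list indexing: the order is the DataFrame's integer position, counted from 0
3.1 Single-position indexing: df.iloc[-1]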
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['two', 'three', 'one', 'four'],
columns=['a', 'b', 'c', 'd'])
print("df = \n{0}\ntype(df) = {1}".format(df, type(df)))
print('-' * 100)
data1 = df.iloc[0]
data2 = df.iloc[-1]
print("data1 = df.iloc[0] = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print('-' * 50)
print("data2 = df.iloc[-1] = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))
Output:
df =
a b c d
two 24.860960 54.227202 18.018653 38.724716
three 69.652166 97.651980 19.959022 24.155129
one 31.995642 59.591356 76.431234 44.830302
four 33.716382 32.102688 20.937836 73.288219
type(df) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
data1 = df.iloc[0] =
a 24.860960
b 54.227202
c 18.018653
d 38.724716
Name: two, dtype: float64
type(data1) = <class 'pandas.core.series.Series'>
--------------------------------------------------
data2 = df.iloc[-1] =
a 33.716382
b 32.102688
c 20.937836
d 73.288219
Name: four, dtype: float64
type(data2) = <class 'pandas.core.series.Series'>
3.2 Multi-position indexing: df.iloc[[3, 2, 1]]
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['two', 'three', 'one', 'four'],
columns=['a', 'b', 'c', 'd'])
print("df = \n{0}\ntype(df) = {1}".format(df, type(df)))
print('-' * 100)
data1 = df.iloc[[0, 2]]
data2 = df.iloc[[3, 2, 1]]
print("data1 = df.iloc[[0, 2]] = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print('-' * 50)
print("data2 = df.iloc[[3, 2, 1]] = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))
Output:
df =
a b c d
two 4.708440 22.863487 26.806435 3.948613
three 43.713656 75.754603 54.269785 64.708510
one 49.566989 89.956527 26.388450 54.651651
four 31.750995 81.558108 45.912672 40.851126
type(df) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
data1 = df.iloc[[0, 2]] =
a b c d
two 4.708440 22.863487 26.806435 3.948613
one 49.566989 89.956527 26.388450 54.651651
type(data1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
data2 = df.iloc[[3, 2, 1]] =
a b c d
four 31.750995 81.558108 45.912672 40.851126
one 49.566989 89.956527 26.388450 54.651651
three 43.713656 75.754603 54.269785 64.708510
type(data2) = <class 'pandas.core.frame.DataFrame'>
3.3 Slice indexing: df.iloc[1:3]
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['two', 'three', 'one', 'four'],
columns=['a', 'b', 'c', 'd'])
print("df = \n{0}\ntype(df) = {1}".format(df, type(df)))
print('-' * 100)
data1 = df.iloc[0:2]
data2 = df.iloc[1:3]
print("data1 = df.iloc[0:2] = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print('-' * 50)
print("data2 = df.iloc[1:3] = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))
Output:
df =
a b c d
two 58.579340 15.748046 11.903897 90.569674
three 64.781174 49.745905 5.778577 99.143819
one 96.295298 61.041770 61.024144 9.930110
four 36.892635 26.641423 16.890470 43.553212
type(df) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
data1 = df.iloc[0:2] =
a b c d
two 58.579340 15.748046 11.903897 90.569674
three 64.781174 49.745905 5.778577 99.143819
type(data1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
data2 = df.iloc[1:3] =
a b c d
three 64.781174 49.745905 5.778577 99.143819
one 96.295298 61.041770 61.024144 9.930110
type(data2) = <class 'pandas.core.frame.DataFrame'>
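Since iloc is described above as behaving like list indexing, the same out-of-range rules appear to carry over: a single position past the end raises an error, while a slice is simply truncated. A minimal sketch (the variable names single_oob and slice_oob are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  index=['two', 'three', 'one', 'four'],
                  columns=['a', 'b', 'c', 'd'])

# A single out-of-range position raises IndexError, as with a Python list.
try:
    df.iloc[10]
    single_oob = 'no error'
except IndexError:
    single_oob = 'IndexError'

# An out-of-range slice end is clipped to the rows that exist.
slice_oob = df.iloc[2:10]

print(single_oob)
print(list(slice_oob.index))
```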
III. Indexing rows and columns together
1. Select target rows and target columns: df.loc[['b', 'c'], ['y', 8]]
df =
x y z 8 9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
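- df.loc[:, ['y', 8]]: selects all rows of columns 'y' and 8;
- df.loc[:, 'y':8]: listed in these notes as an incorrect expression (prefer the explicit list above when column labels mix str and int);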
import numpy as np
import pandas as pd
df = pd.DataFrame(np.nan,
index=list('abcde'),
columns=['x', 'y', 'z', 8, 9])
print("df = \n", df)
print("-" * 100)
data1 = df.loc[['b', 'c']]
print("data1 = \n", data1)
print("-" * 50)
data2 = df.loc[:, ['y', 8]]
print("data2 = \n", data2)
print("-" * 50)
data3 = df.loc[['b', 'c'], ['y', 8]]
print("data3 = \n", data3)
Output:
df =
x y z 8 9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
----------------------------------------------------------------------------------------------------
data1 =
x y z 8 9
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
--------------------------------------------------
data2 =
y 8
a NaN NaN
b NaN NaN
c NaN NaN
d NaN NaN
e NaN NaN
--------------------------------------------------
data3 =
y 8
b NaN NaN
c NaN NaN
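2. Select columns first, then rows
Selecting columns first and then rows amounts to filtering the fields of a dataset first and then choosing which records to keep.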
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
index=['one', 'two', 'three', 'four'],
columns=['a', 'b', 'c', 'd'])
print("df = \n", df)
print('-' * 100)
data1 = df['a'].loc[['one', 'three']]
print("data1 = df['a'].loc[['one', 'three']] = \n", df['a'].loc[['one', 'three']])
print('-' * 50)
data2 = df[['b', 'c', 'd']]
print("data2 = df[['b', 'c', 'd']] = \n", data2)
print('-' * 100)
data = df['a'] < 50
print("data = df['a'] < 50 = \n", data)
print('-' * 50)
data3 = df[df['a'] < 50]
print("data3 = df[df['a'] < 50] = \n", data3)
print('-' * 50)
data4 = df[df['a'] < 50].iloc[1]
print("data4 = df[df['a'] < 50].iloc[1] = \n", data4)
Output:
df =
a b c d
one 9.835341 90.198909 41.946498 57.696927
two 42.118455 92.361098 12.128027 58.962167
three 57.007146 18.977019 92.999803 47.113144
four 97.706270 99.227877 4.032991 27.748419
----------------------------------------------------------------------------------------------------
data1 = df['a'].loc[['one', 'three']] =
one 9.835341
three 57.007146
Name: a, dtype: float64
--------------------------------------------------
data2 = df[['b', 'c', 'd']] =
b c d
one 90.198909 41.946498 57.696927
two 92.361098 12.128027 58.962167
three 18.977019 92.999803 47.113144
four 99.227877 4.032991 27.748419
----------------------------------------------------------------------------------------------------
data = df['a'] < 50 =
one True
two True
three False
four False
Name: a, dtype: bool
--------------------------------------------------
data3 = df[df['a'] < 50] =
a b c d
one 9.835341 90.198909 41.946498 57.696927
two 42.118455 92.361098 12.128027 58.962167
--------------------------------------------------
data4 = df[df['a'] < 50].iloc[1] =
a 42.118455
b 92.361098
c 12.128027
d 58.962167
Name: two, dtype: float64
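The two-step "filter, then pick" pattern above can also be written as a single loc call that takes a boolean mask for the rows and a list of names for the columns. This is an alternative sketch rather than the approach used in these notes, and the sample values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 40, 60, 90],
                   'b': [1, 2, 3, 4],
                   'c': [5, 6, 7, 8]},
                  index=['one', 'two', 'three', 'four'])

# Rows where column 'a' is below 50, restricted to columns 'b' and 'c'.
subset = df.loc[df['a'] < 50, ['b', 'c']]
print(subset)
```

Doing both selections in one call avoids building an intermediate frame, which also matters when the result is later assigned to.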
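3. Row slice & column slice: df.iloc[1:3, 2:6]
Mixing positional and label information: given a DataFrame, how do we extract everything up to and including row 'c' that also falls within the first 4 columns?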
iloc[num_of_row_start : num_of_row_end, num_of_column_start : num_of_column_end]
import numpy as np
import pandas as pd
df = pd.DataFrame(np.nan,
index=list('abcde'),
columns=['x', 'y', 'z', 8, 9])
print("df = \n", df)
print("-" * 100)
df_select = df.iloc[:df.index.get_loc('c') + 1, :4]
print("df_select = \n", df_select)
Output:
df =
x y z 8 9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
----------------------------------------------------------------------------------------------------
df_select =
x y z 8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN
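get_loc (as of pandas 0.24.1) is an Index method that returns the integer position of a label in the index. Note that because the slice excludes num_of_row_end, + 1 is needed to include row 'c'.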
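As an alternative to the get_loc arithmetic, the label part and the positional part can be chained: an end-inclusive loc slice for the rows, then iloc for the columns. The sketch below assumes unique row labels and uses np.arange data instead of NaN so the two results are easy to compare; via_get_loc and via_chain are hypothetical names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(25).reshape(5, 5),
                  index=list('abcde'),
                  columns=['x', 'y', 'z', 8, 9])

# Approach from the notes: convert the label to a position, then add 1.
via_get_loc = df.iloc[:df.index.get_loc('c') + 1, :4]

# Chained alternative: the loc slice already includes 'c', so no + 1.
via_chain = df.loc[:'c'].iloc[:, :4]

print(via_chain)
print(via_get_loc.equals(via_chain))
```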
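IV. Difference between df.loc[] and df.iloc[]
First, a brief introduction to the two:
- loc uses index labels to select the rows (or columns) you want [label-oriented]
- iloc uses positions along the index (so it only accepts integer arguments) to select the rows (or columns) you want [position-oriented].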
import numpy as np
import pandas as pd
s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
print("s = \n", s)
Output:
s =
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
dtype: float64
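Let's try extracting data with the integer 3:
- s.iloc[:3] returns the first 3 rows (3 is treated as a position);
- s.loc[:3] returns the first 8 rows (3 is treated as a label, and the slice runs up to and including that label);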
import numpy as np
import pandas as pd
s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
print("s.iloc[:3] = \n", s.iloc[:3])
print("-" * 50)
print("s.loc[:3] = \n", s.loc[:3])
Output:
s.iloc[:3] =
49 NaN
48 NaN
47 NaN
dtype: float64
--------------------------------------------------
s.loc[:3] =
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
dtype: float64
What happens if we use an integer that is not in the index, such as 6?
- s.iloc[:6], of course, returns the first 6 rows.
import numpy as np
import pandas as pd
s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
print("s.iloc[:6] = \n", s.iloc[:6])
Output:
s.iloc[:6] =
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
dtype: float64
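- But s.loc[:6] raises a KeyError, because 6 is not an element of the index (and since the index is not sorted, pandas cannot infer where the slice should end).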
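The KeyError case can be reproduced with a short sketch. It also shows that after sort_index() the same slice succeeds, because on a sorted index pandas can place a missing label between its neighbors. The outcome and sorted_slice names are hypothetical.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])

# On the unsorted index, slicing up to a missing label fails.
try:
    s.loc[:6]
    outcome = 'no error'
except KeyError:
    outcome = 'KeyError'

# On a sorted index the slice stops at the last label not greater than 6.
sorted_slice = s.sort_index().loc[:6]

print(outcome)
print(list(sorted_slice.index))
```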
Reference: Python笔记:df.loc[]和df.iloc[]的区别