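Contents
I. Basic operations
1 Import pandas and read a CSV file with pd.read_csv()
2 df.head(), df.info(), df.index, df.columns, df.values
3 .set_index()
4 Series arithmetic and summary statistics
① Adding 10 adds 10 to every element
② age.mean(), age.max(), age.min()
II. Indexing
1 Select one column and take the first 5 rows: the first 5 values of Age
2 Select two columns and take the first 5 rows: the first 5 rows of Age and Fare
3 .loc selects by label; .iloc selects by position
Selecting by position: .iloc
Selecting by label: .loc[]
4 Slicing also works when labels are used as the index
5 Modify data by assigning directly to a single cell
6 Boolean indexing
III. Using groupby
1 Grouped sum
2 .groupby() does roughly what the following for loop does
3 Works together with NumPy through .aggregate()
IV. Numeric operations
1 In pandas, .sum() sums down each column by default; to sum across each row, pass axis=1
2 df.mean(), .max(), .min(), .median()
3 Bivariate statistics
① .cov(): covariance
② .corr(): correlation coefficient
4 .value_counts() counts how often each distinct value occurs
5 Counting with .count()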
I. Basic operations
1 Import pandas and read a CSV file with pd.read_csv()
# Import the pandas library
import pandas as pd
# Read the CSV file
df = pd.read_csv('泰坦尼克号数据.csv')
2 df.head(), df.info(), df.index, df.columns, df.values
df.head(): returns the first 5 rows of df by default; pass a number to choose how many rows to show
df.info(): prints a summary of df
df.index: the index of df
df.columns: the column names
df.values: the values in df as a 2-D array
① df.head()
In [25]:
df[['PassengerId','Survived','Pclass','Name']].head()
Out[25]:
| | PassengerId | Survived | Pclass | Name |
|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry |
② df.info()
# Shows each column's name, non-null count and dtype, plus the DataFrame's memory usage
In [24]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
③ Get the index, the column names, and the values of df
# Get the index of df
In [11]:
df.index
Out[11]:
RangeIndex(start=0, stop=891, step=1)
# Get the column names of df
In [12]:
df.columns
Out[12]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
# Get the values in df as a 2-D array
In [14]:
df.values
Out[14]:
array([[1, 0, 3, ..., 7.25, nan, 'S'],
[2, 1, 1, ..., 71.2833, 'C85', 'C'],
[3, 1, 3, ..., 7.925, nan, 'S'],
...,
[889, 0, 3, ..., 23.45, nan, 'S'],
[890, 1, 1, ..., 30.0, 'C148', 'C'],
[891, 0, 3, ..., 7.75, nan, 'Q']], dtype=object)
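# Get the first row of the array (df.values[0] indexes rows, not columns); object is the dtype pandas uses for strings and mixed-type data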
In [15]:
df.values[0]
Out[15]:
array([1, 0, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0,
'A/5 21171', 7.25, nan, 'S'], dtype=object)
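To make the row/column distinction concrete, here is a minimal sketch on a tiny stand-in DataFrame. The name `demo` and its values are made up for illustration; they are not read from the CSV file.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic data (illustrative values only)
demo = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['A', 'B', 'C'],
    'Fare': [7.25, 71.2833, 7.925],
})

values = demo.values        # 2-D NumPy array; dtype is object because the column types are mixed
first_row = values[0]       # first row: one passenger's full record
first_col = values[:, 0]    # first column: every PassengerId

print(values.shape)         # (3, 3)
print(list(first_row))
print(list(first_col))
```

Indexing the array with a single integer picks a row; a column needs the `[:, j]` form.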
3 .set_index()
The index can be chosen explicitly with .set_index(); with inplace=True the operation modifies the original df
In [35]:
df.set_index('Name',inplace = True)
In [38]:
age = df['Age'][:5]
# age is now a Series indexed by Name, holding the first 5 rows
Out[38]:
Name
Braund, Mr. Owen Harris 22.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 38.0
Heikkinen, Miss. Laina 26.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0
Allen, Mr. William Henry 35.0
Name: Age, dtype: float64
# age is a Series indexed by Name, so a name can be used to look up the corresponding age
In [39]:
age['Braund, Mr. Owen Harris']
Out[39]:
22.0
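As an alternative to inplace=True, .set_index() can also be used in its returning form, which leaves the original frame untouched. The sketch below assumes a small made-up frame (`demo`, with invented names and ages) rather than the Titanic file.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid'], 'Age': [22.0, 38.0, 26.0]})

indexed = demo.set_index('Name')      # returns a new DataFrame; demo itself is unchanged
age_of_bob = indexed['Age']['Bob']    # label lookup on the resulting Series
restored = indexed.reset_index()      # moves Name back into an ordinary column

print(list(demo.columns))      # ['Name', 'Age']
print(list(indexed.columns))   # ['Age']
print(age_of_bob)              # 38.0
```

Keeping the original frame around makes it easy to go back to the default integer index later.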
4 Series arithmetic and summary statistics
① Adding 10 adds 10 to every element
In [40]:
age = age + 10
age
Out[40]:
Name
Braund, Mr. Owen Harris 32.0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 48.0
Heikkinen, Miss. Laina 36.0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 45.0
Allen, Mr. William Henry 45.0
Name: Age, dtype: float64
② age.mean(), age.max(), age.min()
In [41]:
age.mean()
Out[41]:
41.2
In [42]:
age.max()
Out[42]:
48.0
In [43]:
age.min()
Out[43]:
32.0
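The Age column contains missing values (714 non-null out of 891 in the df.info() output above), so it helps to know how Series arithmetic and statistics treat NaN. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series([22.0, np.nan, 26.0], index=['Ann', 'Bob', 'Cid'])

older = ages + 10                         # element-wise addition; NaN stays NaN
mean_age = older.mean()                   # NaN is skipped by default: (32 + 36) / 2
mean_strict = older.mean(skipna=False)    # NaN propagates when skipna=False

print(mean_age)      # 34.0
print(mean_strict)   # nan
```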
③ .describe() summary statistics
In [44]:
df.describe()
Out[44]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
II. Indexing
In [46]:
import pandas as pd
df = pd.read_csv('泰坦尼克号数据.csv')
df.head()
Out[25]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1 Select one column and take the first 5 rows: the first 5 values of Age
# Select one column and take the first 5 rows: the first 5 values of Age
In [48]:
df['Age'][:5]
Out[48]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
2 Select two columns and take the first 5 rows: the first 5 rows of Age and Fare
In [49]:
df[['Age','Fare']][:5]
Out[49]:
Age Fare
0 22.0 7.2500
1 38.0 71.2833
2 26.0 7.9250
3 35.0 53.1000
4 35.0 8.0500
3 .loc selects by label; .iloc selects by position
Selecting by position: .iloc
# Get the first row; df[0] does not work here because it looks for a column labeled 0
In [52]:
df.iloc[0]
Out[52]:
PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
Age 22
SibSp 1
Parch 0
Ticket A/5 21171
Fare 7.25
Cabin NaN
Embarked S
Name: 0, dtype: object
df.iloc[0:5] # rows 1 through 5 (positions 0 to 4)
df.iloc[0:5,0:4] # rows 1 through 5, columns 1 through 4
Out[57]:
| | PassengerId | Survived | Pclass | Name |
|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry |
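.iloc always counts positions, whatever the index labels happen to be. The sketch below uses a made-up frame whose index labels (10, 20, 30, 40) deliberately differ from the positions, so the two cannot be confused.

```python
import pandas as pd

# Hypothetical data; the index labels are chosen to differ from the positions
demo = pd.DataFrame(
    {'Age': [22.0, 38.0, 26.0, 35.0], 'Fare': [7.25, 71.2833, 7.925, 53.1]},
    index=[10, 20, 30, 40],
)

first_two = demo.iloc[0:2]   # positions 0 and 1, i.e. the rows labeled 10 and 20
cell = demo.iloc[1, 1]       # second row, second column
last_row = demo.iloc[-1]     # negative positions count from the end

print(list(first_two.index))   # [10, 20]
print(cell)                    # 71.2833
print(last_row['Age'])         # 35.0
```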
Selecting by label: .loc[]
df = df.set_index('Name') # make the Name column the index
In [60]:
df.loc['Heikkinen, Miss. Laina']
Out[60]:
PassengerId 3
Survived 1
Pclass 3
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: Heikkinen, Miss. Laina, dtype: object
# Get the Fare value for Heikkinen, Miss. Laina
In [61]:
df.loc['Heikkinen, Miss. Laina','Fare']
Out[61]:
7.925
# Get all the information for Heikkinen, Miss. Laina
In [62]:
df.loc['Heikkinen, Miss. Laina',:]
Out[62]:
PassengerId 3
Survived 1
Pclass 3
Sex female
Age 26
SibSp 0
Parch 0
Ticket STON/O2. 3101282
Fare 7.925
Cabin NaN
Embarked S
Name: Heikkinen, Miss. Laina, dtype: object
4 Slicing also works when labels are used as the index (both endpoints are included)
In [63]:
df.loc['Heikkinen, Miss. Laina':'Allen, Mr. William Henry',:]
Out[63]:
| Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.100 | C123 | S |
| Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.050 | NaN | S |
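Label slices and position slices treat the stop value differently, which is easy to trip over. A minimal comparison on an invented frame (the names and fares are placeholders):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame(
    {'Fare': [7.25, 71.2833, 7.925, 53.1]},
    index=['Ann', 'Bob', 'Cid', 'Dee'],
)

by_label = demo.loc['Bob':'Cid']   # label slice: both 'Bob' and 'Cid' are included
by_position = demo.iloc[1:2]       # position slice: the stop position 2 is excluded

print(len(by_label))       # 2
print(len(by_position))    # 1
```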
5 Modify data by assigning directly to a single cell
In [64]:
df.loc['Heikkinen, Miss. Laina','Fare'] = 1000
In [65]:
df.head()
Out[65]:
| Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| Heikkinen, Miss. Laina | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 1000.0000 | NaN | S |
| Futrelle, Mrs. Jacques Heath (Lily May Peel) | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| Allen, Mr. William Henry | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
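The same .loc assignment also accepts a boolean mask in the row position, which updates every matching row at once. A sketch on made-up data (the labels P1 to P3 and the values are placeholders):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame(
    {'Sex': ['male', 'female', 'male'], 'Fare': [7.25, 71.2833, 8.05]},
    index=['P1', 'P2', 'P3'],
)

demo.loc['P2', 'Fare'] = 1000.0                   # single cell, as in the example above
demo.loc[demo['Sex'] == 'male', 'Fare'] = 0.0     # every row where the mask is True

print(list(demo['Fare']))   # [0.0, 1000.0, 0.0]
```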
6 Boolean indexing
In [66]:
df['Fare'] > 40
Out[66]:
Name
Braund, Mr. Owen Harris False
Cumings, Mrs. John Bradley (Florence Briggs Thayer) True
Heikkinen, Miss. Laina True
Futrelle, Mrs. Jacques Heath (Lily May Peel) True
Allen, Mr. William Henry False
...
Montvila, Rev. Juozas False
Graham, Miss. Margaret Edith False
Johnston, Miss. Catherine Helen "Carrie" False
Behr, Mr. Karl Howell False
Dooley, Mr. Patrick False
Name: Fare, Length: 891, dtype: bool
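# Rows with Fare > 40: first 5 rows and first 4 columns (the output below comes from a df with the default integer index, before set_index('Name'))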
In [35]:
df[df['Fare'] > 40].iloc[:5,0:4]
Out[35]:
PassengerId Survived Pclass Name
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th...
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
6 7 0 1 McCarthy, Mr. Timothy J
27 28 0 1 Fortune, Mr. Charles Alexander
31 32 1 1 Spencer, Mrs. William Augustus (Marie Eugenie)
In [77]:
# The two expressions below give the same result; the .loc form does it in a single indexing step.
df[df['Sex'] == 'male']['Age'].mean()
df.loc[df['Sex'] == 'male','Age'].mean()
Out[77]:
30.72664459161148
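Several conditions can be combined into one mask with & (and) or | (or). Each comparison needs its own parentheses because & binds more tightly than == and >. A minimal sketch on invented values:

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'male'],
    'Age': [22.0, 38.0, 54.0, 40.0],
    'Fare': [7.25, 71.2833, 51.8625, 80.0],
})

mask = (demo['Sex'] == 'male') & (demo['Fare'] > 40)   # element-wise AND of two boolean Series
selected = demo.loc[mask, 'Age']                        # ages of the rows that satisfy both

print(list(mask))         # [False, False, True, True]
print(selected.mean())    # 47.0
```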
III. Using groupby
In [1]:
import pandas as pd
In [3]:
data1 = {'key':['A','B','C','A','B','C','A','B','C'],
'data':[1,4,5,2,5,7,4,5,5]}
df = pd.DataFrame(data1)
df
Out[3]:
key data
0 A 1
1 B 4
2 C 5
3 A 2
4 B 5
5 C 7
6 A 4
7 B 5
8 C 5
1 Grouped sum
In [13]:
data2_gro = df.groupby('key')
data2_gro.sum()
Out[13]:
data
key
A 7
B 14
C 17
2 .groupby() does roughly what the following for loop does
In [11]:
for key in ['A','B','C']:
    print(key,df[df['key'] == key].loc[:,'data'].sum())
A 7
B 14
C 17
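The equivalence between the loop and .groupby() can be checked directly. The sketch below rebuilds the same key/data frame as above and compares the two results; the variable names `grouped` and `manual` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'data': [1, 4, 5, 2, 5, 7, 4, 5, 5],
})

# One call: split by key, sum each group, combine into a Series
grouped = df.groupby('key')['data'].sum()

# The same computation spelled out with boolean indexing, one key at a time
manual = {key: df.loc[df['key'] == key, 'data'].sum() for key in df['key'].unique()}

print(grouped.to_dict())   # {'A': 7, 'B': 14, 'C': 17}
print(manual == grouped.to_dict())
```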
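3 Works together with NumPy through .aggregate(), as shown below (recent pandas versions recommend the string names 'sum' and 'mean' over np.sum and np.mean):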
In [15]:
import numpy as np
df.groupby('key').aggregate(np.sum)
Out[15]:
data
key
A 7
B 14
C 17
In [16]:
df.groupby('key').aggregate(np.mean)
Out[16]:
data
key
A 2.333333
B 4.666667
C 5.666667
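.agg() (an alias of .aggregate()) also accepts a list of function names, which computes several statistics per group in one pass. A minimal sketch on the same key/data frame:

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'data': [1, 4, 5, 2, 5, 7, 4, 5, 5],
})

# One row per key, one column per requested statistic
summary = df.groupby('key')['data'].agg(['sum', 'mean', 'max'])

print(summary)
```

The string names select pandas' own implementations, so no NumPy import is needed for this form.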
In [18]:
df = pd.read_csv('泰坦尼克号数据.csv')
df.head()
Out[18]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [20]:
df.groupby('Sex')['Age'].mean()
Out[20]:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
In [24]:
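# Selecting several columns with a bare tuple is deprecated (an error in pandas 2.0+); prefer the list form used in the next cell
df.groupby('Sex')['Age','Fare'].aggregate(np.mean)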
Out[24]:
Age Fare
Sex
female 27.915709 44.479818
male 30.726645 25.523893
In [25]:
df.groupby('Sex')[['Age','Fare']].aggregate(np.mean)
Out[25]:
Age Fare
Sex
female 27.915709 44.479818
male 30.726645 25.523893
IV. Numeric operations
In [29]:
import pandas as pd
In [32]:
df = pd.DataFrame([[1,2,4],[3,6,9]],index = ['a','b'],columns = ['A','B','C'])
df
Out[32]:
A B C
a 1 2 4
b 3 6 9
1 In pandas, .sum() sums down each column by default; to sum across each row, pass axis=1
In [33]:
df.sum()
Out[33]:
A 4
B 8
C 13
dtype: int64
In [34]:
df.sum(axis=1)
Out[34]:
a 7
b 18
dtype: int64
In [35]:
df.sum('columns') # sum across each row
Out[35]:
a 7
b 18
dtype: int64
In [38]:
df.sum('index') # sum down each column
Out[38]:
A 4
B 8
C 13
dtype: int64
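One way to remember the axis argument is that it names the dimension that gets collapsed: axis=0 (or 'index') collapses the rows and leaves one value per column, while axis=1 (or 'columns') collapses the columns and leaves one value per row. A small check on the same 2x3 frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 4], [3, 6, 9]], index=['a', 'b'], columns=['A', 'B', 'C'])

col_totals = df.sum(axis=0)   # rows collapsed: result is indexed by column name
row_totals = df.sum(axis=1)   # columns collapsed: result is indexed by row label

print(list(col_totals.index))   # ['A', 'B', 'C']
print(list(row_totals.index))   # ['a', 'b']
print(list(row_totals))         # [7, 18]
```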
2 df.mean(), .max(), .min(), .median()
In [39]:
df.mean()
Out[39]:
A 2.0
B 4.0
C 6.5
dtype: float64
In [40]:
df.max()
Out[40]:
A 3
B 6
C 9
dtype: int64
In [41]:
df.min()
Out[41]:
A 1
B 2
C 4
dtype: int64
In [42]:
df.median()
Out[42]:
A 2.0
B 4.0
C 6.5
dtype: float64
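3 Bivariate statistics
① .cov(): covariance
# Covariance matrix (the output below is for the Titanic DataFrame, so df must be reloaded from the CSV first; recent pandas versions may also need numeric_only=True because of the text columns)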
In [47]:
df.cov()
Out[47]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 66231.000000 | -0.626966 | -7.561798 | 138.696504 | -16.325843 | -0.342697 | 161.883369 |
| Survived | -0.626966 | 0.236772 | -0.137703 | -0.551296 | -0.018954 | 0.032017 | 6.221787 |
| Pclass | -7.561798 | -0.137703 | 0.699015 | -4.496004 | 0.076599 | 0.012429 | -22.830196 |
| Age | 138.696504 | -0.551296 | -4.496004 | 211.019125 | -4.163334 | -2.344191 | 73.849030 |
| SibSp | -16.325843 | -0.018954 | 0.076599 | -4.163334 | 1.216043 | 0.368739 | 8.748734 |
| Parch | -0.342697 | 0.032017 | 0.012429 | -2.344191 | 0.368739 | 0.649728 | 8.661052 |
| Fare | 161.883369 | 6.221787 | -22.830196 | 73.849030 | 8.748734 | 8.661052 | 2469.436846 |
② .corr(): correlation coefficient
# Correlation coefficient
In [48]:
df.corr()
Out[48]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.000000 | -0.369226 | 0.083081 | 0.018443 | -0.549500 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.000000 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.000000 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.000000 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.549500 | 0.096067 | 0.159651 | 0.216225 | 1.000000 |
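The two statistics are related: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on a small made-up frame with no missing values (with missing values pandas works pairwise, so the relation has to be checked on the rows where both columns are present).

```python
import numpy as np
import pandas as pd

# Hypothetical complete data (no NaN) for illustration only
demo = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
})

cov_af = demo.cov().loc['Age', 'Fare']                              # sample covariance (ddof=1)
corr_manual = cov_af / (demo['Age'].std() * demo['Fare'].std())     # normalise by both sample std devs
corr_pandas = demo.corr().loc['Age', 'Fare']                        # Pearson correlation from pandas

print(np.isclose(corr_manual, corr_pandas))
```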
4 .value_counts() counts how often each distinct value occurs
Two commonly used parameters are ascending and bins
In [55]:
df['Age'].value_counts() # sorted by count in descending order by default
Out[55]:
24.00 30
22.00 27
18.00 26
19.00 25
30.00 25
..
55.50 1
70.50 1
66.00 1
23.50 1
0.42 1
Name: Age, Length: 88, dtype: int64
In [57]:
df['Age'].value_counts(ascending = True) # pass ascending = True for ascending order
Out[57]:
0.42 1
23.50 1
66.00 1
70.50 1
55.50 1
..
30.00 25
19.00 25
18.00 26
22.00 27
24.00 30
Name: Age, Length: 88, dtype: int64
In [62]:
df['Age'].value_counts(ascending = True,bins = 5) # bins groups the values into 5 equal-width intervals
Out[62]:
(64.084, 80.0] 11
(48.168, 64.084] 69
(0.339, 16.336] 100
(32.252, 48.168] 188
(16.336, 32.252] 346
Name: Age, dtype: int64
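Two further parameters of .value_counts() are worth knowing when a column has missing values: dropna controls whether NaN gets its own bucket, and normalize returns shares instead of counts. A minimal sketch on invented ages:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, one of them missing
ages = pd.Series([22.0, 22.0, 38.0, np.nan, 26.0])

counts = ages.value_counts()                  # NaN is excluded by default
with_nan = ages.value_counts(dropna=False)    # NaN counted as its own bucket
shares = ages.value_counts(normalize=True)    # fractions of the non-null values

print(counts.index[0])   # 22.0
print(counts.sum())      # 4
print(with_nan.sum())    # 5
print(shares[22.0])      # 0.5
```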
5 Counting with .count()
In [63]:
df['Age'].count() # number of non-null values
Out[63]:
714
In [64]:
df['Pclass'].count()
Out[64]:
891
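Because .count() skips missing values, comparing it with len() gives the number of missing entries; for Age above that is 891 - 714 = 177. The sketch below shows the same relation on a small invented Series.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, two of them missing
ages = pd.Series([22.0, np.nan, 26.0, np.nan])

non_null = ages.count()        # values that are not NaN
total = len(ages)              # all entries, missing or not
missing = ages.isna().sum()    # NaN entries counted directly

print(non_null, total, missing)   # 2 4 2
```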