[人工智能] Pandas操作总结

开发: C++知识库 Java知识库 JavaScript Python PHP知识库人工智能区块链大数据移动开发嵌入式开发工具数据结构与算法开发测试游戏开发网络协议系统运维
教程: HTML教程 CSS教程 JavaScript教程 Go语言教程 JQuery教程 VUE教程 VUE3教程 Bootstrap教程 SQL数据库教程 C语言教程 C++教程 Java教程 Python教程 Python3教程 C#教程
数码: 电脑笔记本显卡显示器固态硬盘硬盘耳机手机 iphone vivo oppo 小米华为单反装机图拉丁

-> 人工智能 -> Pandas操作总结 -> 正文阅读

[人工智能]Pandas操作总结

3.1Pandas

3.1.1 Series

class pandas.Series(data = None, index = None, dtype = None, name = None, copy = False, fastpath = False)

data 表示传入的数据
index 表示索引
dtype 数据类型，默认会自己判断
name 设置名称
copy 拷贝数据，默认为 False

//通过传入列表创建series对象
import pandas as pd
a = ["Google", "Runoob", "Wiki"]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)

//通过字典(键值对)创建Series
import pandas as pd
sites = {1: "Google", 2: "Runoob", 3: "Wiki"}
myvar = pd.Series(sites)
print(myvar)

注：以下均为jupyter中代码实例

In [1]:

import pandas as pd                       # 导入pandas库
ser_obj = pd.Series([1, 2, 3, 4, 5])      # 创建Series类对象
ser_obj

Out[1]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [2]:

# 创建Series类对象，并指定索引
ser_obj = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
ser_obj

Out[2]:

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [3]:

year_data = {2001: 17.8, 2002: 20.1, 2003: 16.5}
ser_obj2 = pd.Series(year_data)   # 创建Series类对象
ser_obj2

Out[3]:

2001    17.8
2002    20.1
2003    16.5
dtype: float64

In [4]:

ser_obj.index         # 获取ser_obj的索引

Out[4]:

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [5]:

ser_obj.values       # 获取ser_obj的数据

Out[5]:

array([1, 2, 3, 4, 5], dtype=int64)

In [5]:

ser_obj[3]            # 获取位置索引3对应的数据

Out[5]:

In [6]:

ser_obj * 2

Out[6]:

a     2
b     4
c     6
d     8
e    10
dtype: int64

3.1.2 DataFrame

pandas.DataFrame(data, index, columns, dtype, copy)

data：一组数据(ndarray,series, map, lists, dict 等类型)
index：索引值，或者可以称为行标签
columns：列标签，默认为 RangeIndex (0, 1, 2, …, n)
dtype：数据类型
copy：拷贝数据，默认为 False

In [7]:

import numpy as np
import pandas as pd

demo_arr = np.array([['a', 'b', 'c'], ['d', 'e', 'f']]) 
# 创建数组
df_obj = pd.DataFrame(demo_arr)    # 基于数组创建DataFrame对象
df_obj

Out[7]:

	0	1	2
0	a	b	c
1	d	e	f

In [8]:

# 创建DataFrame对象，指定列索引
df_obj1 = pd.DataFrame(demo_arr, columns=['No1', 'No2', 'No3'])
df_obj1

Out[8]:

	No1	No2	No3
0	a	b	c
1	d	e	f

In [10]:

element = df_obj1['No2']  # 通过列索引的方式获取一列数据
element

Out[10]:

0    b
1    e
Name: No2, dtype: object

In [11]:

type(element)                # 查看返回结果的类型

Out[11]:

pandas.core.series.Series

In [11]:

element = df_obj1.No2  # 通过属性获取列数据
element

Out[11]:

0    b
1    e
Name: No2, dtype: object

In [12]:

type(element)           # 查看返回结果的类型

Out[12]:

pandas.core.series.Series

In [13]:

df_obj1['No4'] = ['g', 'h']
df_obj1

Out[13]:

	No1	No2	No3	No4
0	a	b	c	g
1	d	e	f	h

In [14]:

del df_obj1['No4']
df_obj1

Out[14]:

	No1	No2	No3
0	a	b	c
1	d	e	f

3.2 索引操作及高级索引

3.2.1 索引对象

In [15]:

import pandas as pd
ser_obj = pd.Series(range(5), index=['a','b','c','d','e'])
ser_index = ser_obj.index
ser_index

Out[15]:

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [16]:

ser_obj

Out[16]:

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [17]:

ser_index['2'] = 'cc'  # (执行时，将注释打开，便可以看到错误信息)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-17-3d779dc501cd> in <module>
----> 1 ser_index['2'] = 'cc'  # (执行时，将注释打开，便可以看到错误信息)

c:\users\dell\anaconda3\envs\pyg\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   4082 
   4083     def __setitem__(self, key, value):
-> 4084         raise TypeError("Index does not support mutable operations")
   4085 
   4086     def __getitem__(self, key):

TypeError: Index does not support mutable operations

In [22]:

ser_obj1 = pd.Series(range(3), index=['a','b','c'])
ser_obj2 = pd.Series(['a','b','c'], index=ser_obj1.index)
ser_obj2.index is ser_obj1.index

Out[22]:

True

In [19]:

ser_obj1

Out[19]:

a    0
b    1
c    2
dtype: int64

In [20]:

ser_obj2

Out[20]:

a    a
b    b
c    c
dtype: object

3.2.2 重置索引

In [23]:

import pandas as pd


ser_obj = pd.Series([1, 2, 3, 4, 5], index=['c', 'd', 'a', 'b', 'e'])
ser_obj

Out[23]:

c    1
d    2
a    3
b    4
e    5
dtype: int64

In [24]:

# 重新索引
ser_obj2 = ser_obj.reindex(['a', 'b', 'c', 'd', 'e', 'f']) 
ser_obj2

Out[24]:

a    3.0
b    4.0
c    1.0
d    2.0
e    5.0
f    NaN
dtype: float64

In [21]:

# 重新索引时指定填充的缺失值
ser_obj2 = ser_obj.reindex(['a', 'b', 'c', 'd', 'e', 'f'], fill_value = 6)
ser_obj2

Out[21]:

a    3
b    4
c    1
d    2
e    5
f    6
dtype: int64

In [25]:

# 创建Series对象，并为其指定索引
ser_obj3 = pd.Series([1, 3, 5, 7], index=[0, 2, 4, 6])
ser_obj3

Out[25]:

0    1
2    3
4    5
6    7
dtype: int64

In [27]:

ser_obj3.reindex(range(6), method = 'ffill') # 重新索引，前向填充值

Out[27]:

0    1
1    1
2    3
3    3
4    5
5    5
dtype: int64

In [28]:

ser_obj3.reindex(range(6), method = 'bfill')# 重新索引，后向填充值

Out[28]:

0    1
1    3
2    3
3    5
4    5
5    7
dtype: int64

3.2.3 索引操作

In [25]:

import pandas as pd


ser_obj = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
ser_obj[2]       # 使用索引位置获取数据

Out[25]:

In [26]:

ser_obj['c']    # 使用索引名称获取数据

Out[26]:

In [27]:

ser_obj[2: 4]           # 使用位置索引进行切片（我们看到区间是左闭右开的，这种语法符合python的一贯风格）

Out[27]:

c    3
d    4
dtype: int64

In [28]:

ser_obj['c': 'e']      # 使用索引名称进行切片

Out[28]:

c    3
d    4
e    5
dtype: int64

In [29]:

ser_obj[[0, 2, 4]]          # 通过不连续位置索引获取数据集

Out[29]:

c    1
a    3
e    5
dtype: int64

In [30]:

ser_obj[['a', 'c', 'd']]   # 通过不连续索引名称获取数据集

Out[30]:

a    1
c    3
d    4
dtype: int64

In [30]:

ser_bool = ser_obj > 2         # 创建布尔型Series对象
ser_bool

Out[30]:

c    False
d    False
a     True
b     True
e     True
dtype: bool

In [32]:

ser_obj[ser_bool]               # 获取结果为True的数据

Out[32]:

c    3
d    4
e    5
dtype: int64

In [31]:

arr = np.arange(12).reshape(3, 4)
df_obj = pd.DataFrame(arr, columns=['a', 'b', 'c', 'd'])
df_obj

Out[31]:

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11

In [34]:

df_obj['b']

Out[34]:

0    1
1    5
2    9
Name: b, dtype: int32

In [35]:

type(df_obj['b'])

Out[35]:

pandas.core.series.Series

In [36]:

df_obj[['b', 'd']]        # 获取不连续的Series对象

Out[36]:

	b	d
0	1	3
1	5	7
2	9	11

In [37]:

df_obj[: 2]               # 使用切片获取第0~1行的数据

Out[37]:

	a	b	c	d
0	0	1	2	3
1	4	5	6	7

In [38]:

# 使用多个切片先通过行索引获取第0~2行的数据，再通过不连续列索引获取第b、d列的数据
df_obj[: 3][['b', 'd']]

Out[38]:

	b	d
0	1	3
1	5	7
2	9	11

用loc和iloc花式索引

In [32]:

arr = np.arange(16).reshape(4, 4)
dataframe_obj = pd.DataFrame(arr, columns=['a', 'b', 'c', 'd'])
dataframe_obj

Out[32]:

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15

In [33]:

dataframe_obj.loc[:, ["c", "a"]]

Out[33]:

	c	a
0	2	0
1	6	4
2	10	8
3	14	12

In [34]:

dataframe_obj.iloc[:, [2, 0]]

Out[34]:

	c	a
0	2	0
1	6	4
2	10	8
3	14	12

In [35]:

dataframe_obj.loc[1:2, ['b','c']]#注意和iloc区别，loc对于行索引是双闭区间,iloc为左闭右开

Out[35]:

	b	c
1	5	6
2	9	10

In [36]:

dataframe_obj.iloc[1:3, [1, 2]]//iloc行索引为左闭右开

Out[36]:

	b	c
1	5	6
2	9	10

3.3 算术运算与数据对齐

In [37]:

obj_one = pd.Series(range(10, 13), index=range(3)) 
obj_one

Out[37]:

0    10
1    11
2    12
dtype: int64

In [38]:

obj_two = pd.Series(range(20, 25), index=range(5))
obj_two

Out[38]:

0    20
1    21
2    22
3    23
4    24
dtype: int64

In [39]:

obj_one + obj_two

Out[39]:

0    30.0
1    32.0
2    34.0
3     NaN
4     NaN
dtype: float64
//可以看到默认填充为NaN

In [40]:

obj_one.add(obj_two, fill_value = 0)   # 执行加法运算，补充缺失值

Out[40]:

0    30.0
1    32.0
2    34.0
3    23.0
4    24.0
dtype: float64
//注意:这里补充的缺失值是obj_one的缺失值填充而不是对加完的数据进行填充

3.4 数据排序

3.4.1 按索引排序

In [41]:

import pandas as pd
ser_obj = pd.Series(range(10, 15), index=[5, 3, 1, 3, 2])
ser_obj

Out[41]:

5    10
3    11
1    12
3    13
2    14
dtype: int64

In [42]:

ser_obj.sort_index()        # 按索引进行升序排列

Out[42]:

1    12
2    14
3    11
3    13
5    10
dtype: int64

In [43]:

ser_obj.sort_index(ascending = False)  # 按索引进行降序排列

Out[43]:

5    10
3    11
3    13
2    14
1    12
dtype: int64

In [44]:

import pandas as pd
import numpy as np
df_obj = pd.DataFrame(np.arange(9).reshape(3, 3), index=[4, 3, 5])
df_obj

Out[44]:

	0	1	2
4	0	1	2
3	3	4	5
5	6	7	8

In [45]:

df_obj.sort_index()                      # 按索引升序排列

Out[45]:

	0	1	2
3	3	4	5
4	0	1	2
5	6	7	8

In [46]:

df_obj.sort_index(ascending = False)     # 按索引降序排列

Out[46]:

	0	1	2
5	6	7	8
4	0	1	2
3	3	4	5

3.4.2 按值排序

In [49]:

ser_obj = pd.Series([4, np.nan, 6, np.nan, -3, 2])
ser_obj

Out[49]:

0    4.0
1    NaN
2    6.0
3    NaN
4   -3.0
5    2.0
dtype: float64

In [50]:

ser_obj.sort_values()   # 按值升序排列

Out[50]:

4   -3.0
5    2.0
0    4.0
2    6.0
1    NaN
3    NaN
dtype: float64

In [51]:

df_obj = pd.DataFrame([[0.4, -0.1, -0.3, 0.0], 
                       [0.2, 0.6, -0.1, -0.7],
                       [0.8, 0.6, -0.5, 0.1]])
df_obj

Out[51]:

	0	1	2	3
0	0.4	-0.1	-0.3	0.0
1	0.2	0.6	-0.1	-0.7
2	0.8	0.6	-0.5	0.1

In [53]:

df_obj.sort_values(by = 2)  # 对列索引值为2的数据进行排序

Out[53]:

	0	1	2	3
2	0.8	0.6	-0.5	0.1
0	0.4	-0.1	-0.3	0.0
1	0.2	0.6	-0.1	-0.7

3.5 统计计算与描述

3.5.1 常用的统计计算

In [54]:

df_obj = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
df_obj

Out[54]:

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11

In [59]:

df_obj.sum()          # 计算每列元素的和

Out[59]:

a    12
b    15
c    18
d    21
dtype: int64

In [60]:

df_obj.max()         # 获取每列的最大值

Out[60]:

a     8
b     9
c    10
d    11
dtype: int32

In [61]:

df_obj.min(axis=1)   # 沿着横向轴，获取每行的最小值，axis默认为0

Out[61]:

0    0
1    4
2    8
dtype: int32

3.5.2 统计描述（descript）

In [62]:

df_obj = pd.DataFrame([[12, 6, -11, 19], 
                       [-1, 7, 50, 36],
                       [5, 9, 23, 28]])
df_obj

Out[62]:

	0	1	2	3
0	12	6	-11	19
1	-1	7	50	36
2	5	9	23	28

In [55]:

df_obj.describe()#提供数据各种维度的描述

Out[55]:

	a	b	c	d
count	3.0	3.0	3.0	3.0
mean	4.0	5.0	6.0	7.0
std	4.0	4.0	4.0	4.0
min	0.0	1.0	2.0	3.0
25%	2.0	3.0	4.0	5.0
50%	4.0	5.0	6.0	7.0
75%	6.0	7.0	8.0	9.0
max	8.0	9.0	10.0	11.0

3.6 层次化索引

3.6.1 认识层次化索引

In [56]:

import numpy as np
import pandas as pd


mulitindex_series = pd.Series([15848,13472,12073.8,7813,7446,6444,15230,8269],
                              index=[['河北省','河北省','河北省','河北省',
                                      '河南省','河南省','河南省','河南省'],
                                     ['石家庄市','唐山市','邯郸市','秦皇岛市',
                                      '郑州市','开封市','洛阳市','新乡市']])
mulitindex_series

Out[56]:

河北省  石家庄市    15848.0
     唐山市     13472.0
     邯郸市     12073.8
     秦皇岛市     7813.0
河南省  郑州市      7446.0
     开封市      6444.0
     洛阳市     15230.0
     新乡市      8269.0
dtype: float64

In [65]:

import pandas as pd
from pandas import DataFrame,Series
# 占地面积为增加的列索引
mulitindex_df = DataFrame({'占地面积':[15848,13472,12073.8,7813,
                                   7446,6444,15230,8269]},
                          index=[['河北省','河北省','河北省','河北省',
                                  '河南省','河南省','河南省','河南省'],
                                 ['石家庄市','唐山市','邯郸市','秦皇岛市',
                                  '郑州市','开封市','洛阳市','新乡市']])
mulitindex_df

Out[65]:

		占地面积
河北省	石家庄市	15848.0
	唐山市	13472.0
	邯郸市	12073.8
	秦皇岛市	7813.0
河南省	郑州市	7446.0
	开封市	6444.0
	洛阳市	15230.0
	新乡市	8269.0

In [57]:

from pandas import MultiIndex
# 创建包含多个元组的列表
list_tuples = [('A','A1'), ('A','A2'), ('B','B1'),
               ('B','B2'), ('B','B3')]
# 根据元组列表创建一个MultiIndex对象
multi_index = MultiIndex.from_tuples(tuples=list_tuples, 
                                     names=[ '外层索引', '内层索引'])
multi_index

Out[57]:

MultiIndex([('A', 'A1'),
            ('A', 'A2'),
            ('B', 'B1'),
            ('B', 'B2'),
            ('B', 'B3')],
           names=['外层索引', '内层索引'])

In [58]:

# 导入所需要的包
import pandas as pd

values = [[1, 2, 3], [8, 5, 7], [4, 7, 7], [5, 5, 4], [4, 9, 9]]
df_indexs = pd.DataFrame(data=values, index=multi_index)
df_indexs

Out[58]:

		0	1	2
外层索引	内层索引
A	A1	1	2	3
	A2	8	5	7
B	B1	4	7	7
	B2	5	5	4
	B3	4	9	9

In [59]:

from pandas import MultiIndex
# 根据列表创建一个MultiIndex对象
multi_array = MultiIndex.from_arrays(arrays =[['A', 'B', 'A', 'B', 'B'], 
                                              ['A1', 'A2', 'B1', 'B2', 'B3']],
                                     names=['外层索引','内层索引'])
multi_array

Out[59]:

MultiIndex([('A', 'A1'),
            ('B', 'A2'),
            ('A', 'B1'),
            ('B', 'B2'),
            ('B', 'B3')],
           names=['外层索引', '内层索引'])

In [60]:

# 导入所需要的包
import pandas as pd
import numpy as np
values = np.array([[1, 2, 3], [8, 5, 7], [4, 7, 7],
                   [5, 5, 4], [4, 9, 9]])
df_array = pd.DataFrame(data=values, index=multi_array)
df_array

Out[60]:

		0	1	2
外层索引	内层索引
A	A1	1	2	3
B	A2	8	5	7
A	B1	4	7	7
B	B2	5	5	4
	B3	4	9	9

In [61]:

from pandas import MultiIndex
import pandas as pd
numbers = [0, 1, 2]
colors = ['green', 'purple']
multi_product = pd.MultiIndex.from_product([numbers, colors], 
                                           names=['number', 'color'])
multi_product

Out[61]:

MultiIndex([(0,  'green'),
            (0, 'purple'),
            (1,  'green'),
            (1, 'purple'),
            (2,  'green'),
            (2, 'purple')],
           names=['number', 'color'])

In [62]:

# 导入所需要的包
import pandas as pd
# 使用变量values接收DataFrame对象的值
values = np.array([[7, 5], [6, 6], [3, 1], [5, 5], [4, 5], [5, 3]])
df_product = pd.DataFrame(data=values, index=multi_product)
df_product

Out[62]:

		0	1
number	color
0	green	7	5
	purple	6	6
1	green	3	1
	purple	5	5
2	green	4	5
	purple	5	5

3.6.2 层次化索引的操作

In [63]:

from pandas import Series, DataFrame
ser_obj = Series([50, 60, 40, 94, 63, 101, 200, 56, 45],
                 index=[['小说', '小说', '小说',
                         '散文随笔', '散文随笔', '散文随笔',
                         '传记', '传记', '传记'],
                        ['高山上的小邮局', '失踪的总统', '绿毛水怪',
                         '皮囊', '浮生六记', '自在独行',
                         '梅西', '老舍自传', '库里传']])
ser_obj

Out[63]:

小说    高山上的小邮局     50
      失踪的总统       60
      绿毛水怪        40
散文随笔  皮囊          94
      浮生六记        63
      自在独行       101
传记    梅西         200
      老舍自传        56
      库里传         45
dtype: int64

In [64]:

ser_obj['小说']     # 获取所有外层索引为“小说”的数据

Out[64]:

高山上的小邮局    50
失踪的总统      60
绿毛水怪       40
dtype: int64

In [65]:

ser_obj[:,'自在独行']       # 获取内层索引对应的数据

Out[65]:

散文随笔    101
dtype: int64

In [66]:

ser_obj.swaplevel()               # 交换外层索引与内层索引位置

Out[66]:

高山上的小邮局  小说       50
失踪的总统    小说       60
绿毛水怪     小说       40
皮囊       散文随笔     94
浮生六记     散文随笔     63
自在独行     散文随笔    101
梅西       传记      200
老舍自传     传记       56
库里传      传记       45
dtype: int64

In [67]:

from pandas import DataFrame,Series
df_obj = DataFrame({'str':['a','b','d','e','f','k','d','s','l'],
                    'num':[1, 2, 4, 5, 3, 2, 6, 2, 3]},
                   index=[['A', 'A', 'A', 'C', 'C', 'C', 'B', 'B', 'B'],
                          [1, 3, 2, 3, 1, 2, 4, 5, 8]])
df_obj

Out[67]:

		str	num
A	1	a	1
3	3	b	2
	2	d	4
C	3	e	5
	1	f	3
	2	k	2
B	4	d	6
5	5	s	2
	8	I	3

In [68]:

df_obj.sort_index()         # 按索引排序

Out[68]:

		str	num
A	1	a	1
2	2	d	4
	3	b	2
B	4	d	6
	5	s	2
	8	I	3
C	1	f	3
	2	k	2
	3	e	5

3.7 读写数据操作

3.7.1 读写文本文件

to_csv(path_or_buf=None,sep=',',na_rep='',float=format=None,columns=None,header=True,index=True,index_label=None,mode='w', ...)#写入csv

path_or_buf 文件路径
sep 分隔符，默认用逗号隔开

index 默认为True，如为False则不会显示索引

read_csv(filepath_or_buf,sep=',',delimiter=None,header='infer',index_col=None,usecols=None,prefix=None, ...)#读csv文件

header 指定行数用来作为列名
name 用于结果的列名列表

注意：读TXT文件也可用read_csv()或read_table(),二者的区别主要在于分隔符不同，前者用逗号，后者用\t作分割符

In [70]:

import pandas as pd


df = pd.DataFrame({'one_name':[1,2,3], 'two_name':[4,5,6]})
df

Out[70]:

	one_name	two_name
0	1	4
1	2	5
2	3	6

In [71]:

# 将df对象写入到csv格式的文件中
df.to_csv(r'itc.csv',index=False)
'写入完毕'

Out[71]:

'写入完毕'

In [73]:

# import pandas as pd
file = open(r"itc.csv")
# 读取指定目录下的csv格式的文件
file_data = pd.read_csv(file)
file_data

Out[73]:

	one_name	two_name
0	1	4
1	2	5
2	3	6

In [74]:

import pandas as pd
file = open(r'itcast.txt',encoding = 'utf-8')
data = pd.read_table(file)
data

Out[74]:

	AAA
0	不不不

3.7.2 读写Excel文件

In [75]:

import pandas as pd
df1 = pd.DataFrame({'col1': ['传', '智'], 'col2': ['播', '客']})
df1.to_excel(r'itcast.xlsx', 'python基础班')
'写入完毕'

Out[75]:

'写入完毕'

3.7.3 读取HTML表格数据

In [81]:

import pandas as pd
import requests


html_data = requests.get('http://kaoshi.edu.sina.com.cn/college/majorlist/')
#html_data = requests.get('http://kaoshi.edu.sina.com.cn/college/m/')用到了request库
html_table_data = pd.read_html(html_data.content,encoding='utf-8')
html_table_data[1]

Out[81]:

	0	1	2	3	4
0	专业名称	专业代码	专业大类	专业小类	操作
1	哲学类	0101	哲学	哲学类	开设院校加入对比
2	哲学	010101	哲学	哲学类	开设院校加入对比
3	逻辑学	010102	哲学	哲学类	开设院校加入对比
4	宗教学	010103	哲学	哲学类	开设院校加入对比
5	伦理学	010104	哲学	哲学类	开设院校加入对比
6	经济学类	0201	经济学	经济学类	开设院校加入对比
7	经济学	020101	经济学	经济学类	开设院校加入对比
8	经济统计学	020102	经济学	经济学类	开设院校加入对比
9	国民经济管理	020103	经济学	经济学类	开设院校加入对比
10	资源与环境经济学	020104	经济学	经济学类	开设院校加入对比
11	商务经济学	020105	经济学	经济学类	开设院校加入对比
12	能源经济	020106	经济学	经济学类	开设院校加入对比
13	劳动经济学	020107	经济学	经济学类	开设院校加入对比
14	经济工程	020108	经济学	经济学类	开设院校加入对比
15	数字经济	020109	经济学	经济学类	开设院校加入对比
16	财政学类	0202	经济学	财政学类	开设院校加入对比
17	财政学	020201	经济学	财政学类	开设院校加入对比
18	税收学	020202	经济学	财政学类	开设院校加入对比
19	金融学类	0203	经济学	金融学类	开设院校加入对比
20	金融学	020301	经济学	金融学类	开设院校加入对比

3.7.4读写数据库

In [83]:

import pandas as pd
from pandas import DataFrame
from sqlalchemy import create_engine
'''  mysql账号为root  密码为123456 数据名：info  
数据表名称：person_info
engine = create_engine('mysql+mysqlconnector://root:123456@127.0.0.1/info')
pd.read_sql('person_info',engine)
'''

mysql账号：root

密码：123456

数据名：info

数据表名称：person_info

engine = create_engine(‘mysql+mysqlconnector://root:123456@127.0.0.1/info’)

pd.read_sql(‘person_info’,engine)

In [84]:

engine = create_engine('mssql+pymssql://teboho:teboho@127.0.0.1:1433/Teboho')


connection = engine.raw_connection()

In [85]:

sql = 'select * from C_test'
pd.read_sql(sql,con = connection )

Out[85]:

	id	name	sex
0	1002	??si	??
1	1001	??si	??

In [86]:

engine = create_engine('mssql+pymssql://db2018:db2018@210.44.125.12:1433/DB2018')
connection = engine.raw_connection()

In [88]:

sql = 'select * from course'
pd.read_sql(sql,con = connection )

Out[88]:

	course_id	title	dept_name	credits
0	BIO-101	Intro. to Biology	Biology	4.0
1	BIO-301	Genetics	Biology	4.0
2	BIO-399	Computational Biology	Biology	3.0
3	CS-101	Intro. to Computer Science	Comp. Sci.	4.0
4	CS-190	Game Design	Comp. Sci.	4.0
5	CS-315	Robotics	Comp. Sci.	3.0
6	CS-319	Image Processing	Comp. Sci.	3.0
7	CS-347	Database System Concepts	Comp. Sci.	3.0
8	EE-181	Intro. to Digital Systems	Elec. Eng.	3.0
9	FIN-201	Investment Banking	Finance	3.0
10	HIS-351	World History	History	3.0
11	MU-199	Music Video Production	Music	3.0
12	PHY-101	Physical Principles	Physics	4.0

In [89]:

import pymssql

In [90]:

connect = pymssql.connect('localhost', 'teboho', 'teboho', 'Teboho',charset='utf8') #服务器名,账户,密码,数据库名
if connect:
    print('success')
success

In [38]:

cursor = connect.cursor()   #创建一个游标对象,python里的sql语句都要通过cursor来执行
cursor.execute("create table C_test(id varchar(20), name varchar(20), sex varchar(4))")   #执行sql语句
connect.commit()  #提交
cursor.close()   #关闭游标  #关闭连接

In [51]:

cursor = connect.cursor()   #创建一个游标对象,python里的sql语句都要通过cursor来执行
sql = "insert into C_test (id, name, sex)values(1001, '张si', '女')".encode('utf-8')
cursor.execute(sql)   #执行sql语句
connect.commit()  #提交
cursor.close()

In [52]:

cursor = connect.cursor()   #创建一个游标对象,python里的sql语句都要通过cursor来执行
sql = "select name, sex from C_test".encode('utf-8')
cursor.execute(sql)   #执行sql语句
row = cursor.fetchone()  #读取查询结果,
while row:              #循环读取所有结果
    print("Name=%s, Sex=%s" % (row[0],row[1]))   #输出结果
    row = cursor.fetchone()

cursor.close()   
connect.close()
Name=??si, Sex=??
Name=??si, Sex=??

In [86]:

import pandas as pd
from pandas import DataFrame,Series
from sqlalchemy import create_engine
# mysql账号为root  密码为123456 数据名：info
# 数据表名称：person_info
# 创建数据库引擎
# mysql+pymysql 表示使用Mysql数据库的pymysql驱动
engine = create_engine('mysql+mysqlconnector://root:123456@127.0.0.1/info')
sql = 'select * from person_info where id >3;'
pd.read_sql(sql,engine)

Out[86]:

	id	name	age	height	gender
0	4	刘华	59	175	男
1	5	王贤	18	172	女
2	6	周平	36	None	男
3	7	程坤	27	181	男
4	8	李平	38	160	女

In [87]:

from pandas import DataFrame,Series
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import *
df = DataFrame({"班级":["一年级","二年级","三年级","四年级"],
                              "男生人数":[25,23,27,30],
                              "女生人数":[19,17,20,20]})
# 创建数据库引擎
# mysql+pymysql 表示使用Mysql数据库的pymysql驱动
# 账号：root 密码：123456 数据库名：studnets_info
# 数据表的名称： students
engine=create_engine('mysql+mysqlconnector://root:123456@127.0.0.1/students_info')
df.to_sql('students',engine)

案例—读取北京市2006~2018年高考分数线表格信息及分析

In [91]:

import pandas as pd
# 指定文件的路径
file_path = 'scores.xlsx'
# 指定列标签的索引列表
df_obj = pd.read_excel(file_path, header=[0, 1],engine = 'openpyxl')
df_obj

Out[91]:

	Unnamed: 0_level_0	一本分数线	二本分数线
	Unnamed: 0_level_1	文科	理科	文科	理科
0	2018	576	532	488	432
1	2017	555	537	468	439
2	2016	583	548	532	494
3	2015	579	548	527	495
4	2014	565	543	507	495
5	2013	549	550	494	505
6	2012	495	477	446	433
7	2011	524	484	481	435
8	2010	524	494	474	441
9	2009	532	501	489	459
10	2008	515	502	472	455
11	2007	528	531	489	478
12	2006	516	528	476	476

原因是最近xlrd更新到了2.0.1版本，只支持.xls文件。所以pandas.read_excel(‘xxx.xlsx’)会报错。

可以安装旧版xlrd，在cmd中运行：

pip uninstall xlrd pip install xlrd==1.2.0

也可以用openpyxl代替xlrd打开.xlsx文件：

df=pandas.read_excel(‘data.xlsx’,engine=‘openpyxl’)

In [58]:

sorted_obj = df_obj.sort_index(ascending = False)
sorted_obj

Out[58]:

	Unnamed: 0_level_0	一本分数线	二本分数线
	Unnamed: 0_level_1	文科	理科	文科	理科
12	2006	516	528	476	476
11	2007	528	531	489	478
10	2008	515	502	472	455
9	2009	532	501	489	459
8	2010	524	494	474	441
7	2011	524	484	481	435
6	2012	495	477	446	433
5	2013	549	550	494	505
4	2014	565	543	507	495
3	2015	579	548	527	495
2	2016	583	548	532	494
1	2017	555	537	468	439
0	2018	576	532	488	432

In [90]:

sorted_obj.max()

Out[90]:

一本分数线  文科    583
       理科    550
二本分数线  文科    532
       理科    505
dtype: int64

In [91]:

sorted_obj.min()

Out[91]:

一本分数线  文科    495
       理科    477
二本分数线  文科    446
       理科    432
dtype: int64

In [92]:

result1 = sorted_obj["一本分数线", "文科"].ptp() # 文科分数差
result1

Out[92]:

In [93]:

result2 = sorted_obj["一本分数线", "理科"].ptp()  
result2

Out[93]:

In [94]:

result3 = sorted_obj["二本分数线", "文科"].ptp()  
result3

Out[94]:

In [95]:

result4 = sorted_obj["二本分数线", "理科"].ptp()  
result4

Out[95]:

In [96]:

ser_obj1 = sorted_obj['一本分数线','文科']
ser_obj1[2018] - ser_obj1[2017]

Out[96]:

In [97]:

ser_obj2 = sorted_obj['一本分数线','理科']
ser_obj2[2018] - ser_obj2[2017]

Out[97]:

-5

In [98]:

ser_obj3 = sorted_obj['二本分数线','文科']
ser_obj3[2018] - ser_obj3[2017]

Out[98]:

In [99]:

ser_obj4 = sorted_obj['二本分数线','理科']
ser_obj4[2018] - ser_obj4[2017]

Out[99]:

-7

In [100]:

sorted_obj.describe()

Out[100]:

	一本分数线	二本分数线
	文科	理科	文科	理科
count	13.000000	13.000000	13.000000	13.000000
mean	541.615385	521.153846	487.923077	464.384615
std	28.150010	25.986683	23.567144	27.274953
min	495.000000	477.000000	446.000000	432.000000
25%	524.000000	501.000000	474.000000	439.000000
50%	532.000000	531.000000	488.000000	459.000000
75%	565.000000	543.000000	494.000000	494.000000
max	583.000000	550.000000	532.000000	505.000000

人工智能最新文章

2022吴恩达机器学习课程——第二课（神经网

第十五章规则学习

FixMatch: Simplifying Semi-Supervised Le

数据挖掘Java——Kmeans算法的实现

大脑皮层的分割方法

【翻译】GPT-3是如何工作的

论文笔记:TEACHTEXT: CrossModal Generaliz

python从零学（六）

详解Python 3.x 导入(import)

【答读者问27】backtrader不支持最新版本的

加:2021-11-23 12:20:39 更:2021-11-23 12:20:54

360图书馆购物三丰科技阅读网日历万年历 2025年7日历

-2025/7/12 11:05:57-

图片自动播放器
↓图片自动播放器↓

TxT小说阅读器
↓语音阅读,小说下载,古典文学↓

一键清除垃圾
↓轻轻一点,清除系统垃圾↓

图片批量下载器
↓批量下载图片,美女图库↓

网站联系: qq:121756557 email:121756557@qq.com IT数码