Study Notes on the Python Data Science Handbook

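Preface

Software installation notes

Miniconda can be downloaded from: Miniconda (Conda documentation). With Miniconda, however, you have to install each Python package yourself, which is not beginner-friendly. Using Anaconda directly is recommended.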

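Chapter 1

1.4 IPython Magic Commands

1.4.1 Pasting Code Blocks: %paste and %cpaste

%paste and %cpaste are not available in Jupyter Notebook (they do not appear in the list of magic functions printed by %lsmagic either). The error is:

UsageError: Line magic function `%paste` not found.

In testing, both commands work in the IPython terminal.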

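1.7 Shell-Related Magic Commands

The temporary directory could not be deleted at this step (this section was presumably run in ipython started from the Anaconda Powershell Prompt):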

In [20]: rm -r tmp
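
When the shell-style `rm -r` does not work in a given terminal, one shell-independent option is to delete the directory from Python itself. The following is a minimal sketch using only the standard library; the directory layout is created here purely for demonstration and is not part of the book's example.

```python
import shutil
import tempfile
from pathlib import Path

# Build a throwaway directory tree that stands in for the notes' "tmp" folder.
base = Path(tempfile.mkdtemp())
tmp_dir = base / "tmp"
(tmp_dir / "nested").mkdir(parents=True)
(tmp_dir / "nested" / "file.txt").write_text("scratch data")

# Similar in spirit to `rm -r tmp`, but independent of the host shell.
shutil.rmtree(tmp_dir)
removed = not tmp_dir.exists()
print(removed)

# Clean up the enclosing temporary directory as well.
shutil.rmtree(base)
```

Inside IPython the same call can be typed directly at the prompt, for example shutil.rmtree('tmp').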

1.9 Profiling and Timing Code

1.9.3 Line-by-Line Profiling with %lprun

Installing line-profiler under Python 3.7 requires Visual Studio 2017 to be available.

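Chapter 2

2.4 Aggregations: Min, Max, and Everything in Between

2.4.3 Example: What Is the Average Height of US Presidents?

In[13]: !head -4 data/president_heights.csv

On Windows, use the type command to view the file contents instead (note that type prints the whole file rather than only the first four lines):

In[13]: !type data\president_heights.csv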
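
A platform-independent way to peek at the first few lines is to read them in Python. This is a small sketch; the helper name head and the sample rows are made up for illustration and are not the contents of president_heights.csv.

```python
import tempfile
from itertools import islice
from pathlib import Path


def head(path, n=4, encoding="utf-8"):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demonstration on a small stand-in file with placeholder rows.
sample = Path(tempfile.mkdtemp()) / "sample.csv"
sample.write_text(
    "order,name,height(cm)\n1,A,180\n2,B,170\n3,C,175\n4,D,182\n",
    encoding="utf-8",
)
first_lines = head(sample, 4)
print("\n".join(first_lines))
```

With the book's data in place, head('data/president_heights.csv') would play the role of !head -4 on any operating system.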

Chapter 3

3.6 Hierarchical Indexing

3.6.2 Methods of MultiIndex Creation

In[17]: pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                      labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Out[17]: MultiIndex(levels=[['a', 'b'], [1, 2]],
                    codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

d:\Users\Administrator\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead

In current versions of pandas, the 'labels' keyword has been replaced by 'codes'.
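
For reference, the same index built with the codes keyword looks like this. The comparison with MultiIndex.from_product is added here only as a sanity check and is not part of the book's cell.

```python
import pandas as pd

# Same construction as the book's cell, with 'labels' renamed to 'codes'.
index = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                      codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
print(index.tolist())

# The Cartesian-product constructor yields the same four entries.
same = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
print(index.equals(same))
```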

3.7 Combining Datasets: Concat and Append

3.7.2 Simple Concatenation with pd.concat

In current versions, axis='col' must be changed to axis='columns':

In[8]: df3 = make_df('AB', [0, 1])
       df4 = make_df('CD', [0, 1])
       print(df3); print(df4); print(pd.concat([df3, df4], axis='columns'))

3.9 Aggregation and Grouping

3.9.1 Planets Data

Downloading the planets data through Seaborn failed:

In[2]: import seaborn as sns
       planets = sns.load_dataset('planets')

URLError: <urlopen error [Errno 11004] getaddrinfo failed>

Changing the computer's DNS server to 114.114.114.114 may fix this.

3.11 Vectorized String Operations

3.11.3 Example: Recipe Database

Build a single string that joins the JSON objects from every line, then read all the data with pd.read_json:

In[20]: # read the entire file into a Python array
        with open('data/recipeitems-latest.json', 'r') as f:
            # Extract each line
            data = (line.strip() for line in f)
            # Reformat so each line is the element of a list
            data_json = "[{0}]".format(','.join(data))

This raises the following error, because open() falls back to the system default encoding (gbk here) when none is given:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 4058: illegal multibyte sequence

It needs to be changed to:

In[20]: # read the entire file into a Python array
        with open('data/recipeitems-latest.json', 'r', encoding='UTF-8') as f:
            # Extract each line
            data = (line.strip() for line in f)
            # Reformat so each line is the element of a list
            data_json = "[{0}]".format(','.join(data))
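
The effect of the explicit encoding can be reproduced on a tiny file. This sketch uses only the standard library; the two records are placeholders, not the real recipe data.

```python
import json
import tempfile
from pathlib import Path

# Write a small line-delimited JSON file in UTF-8 (placeholder records).
path = Path(tempfile.mkdtemp()) / "recipes.json"
path.write_text(
    '{"name": "Crème brûlée"}\n{"name": "Jalapeño dip"}\n',
    encoding="utf-8",
)

# Passing encoding explicitly avoids depending on the platform default.
with open(path, "r", encoding="utf-8") as f:
    data = (line.strip() for line in f)
    # The generator reads from f, so the join must happen inside the with block.
    data_json = "[{0}]".format(",".join(data))

records = json.loads(data_json)
print([r["name"] for r in records])
```

As an alternative to joining the lines by hand, pd.read_json accepts lines=True for line-delimited files, together with an encoding argument.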

3.12 Working with Time Series

3.12.5 Resampling, Shifting, and Windowing

Importing financial data from Google or Yahoo Finance with the pandas-datareader package failed:

In[25]: from pandas_datareader import data
        goog = data.DataReader('GOOG', start='2004', end='2016',
                               data_source='google')

NotImplementedError: data_source='google' is not implemented

After switching to data_source='yahoo':

ReadTimeout: HTTPSConnectionPool(host='finance.yahoo.com', port=443): Read timed out. (read timeout=30)

3.12.7 Example: Visualizing Seattle Bicycle Counts

In[36]: data.columns = ['West', 'East']
        data['Total'] = data.eval('West + East')

Because the data set currently available already includes a total column, this cell is changed to:

In[36]: data.columns = ['Total', 'East', 'West']

3.13 High-Performance Pandas: eval() and query()

3.13.1 Motivating query() and eval(): Compound Expressions

Generating random numbers with NumPy failed:

In[1]: import numpy as np
       rng = np.random.RandomState(42)
       x = rng.rand(1E6)
       y = rng.rand(1E6)

TypeError: 'float' object cannot be interpreted as an integer

1E6 is a float literal, so the size has to be written as an integer:

x = rng.rand(1000000)

y = rng.rand(1000000)
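
If the scientific notation is preferred for readability, the value can be converted explicitly. A minimal sketch:

```python
import numpy as np

rng = np.random.RandomState(42)

# 1e6 is a float; rand() needs integer dimensions, so convert it first.
n = int(1e6)
x = rng.rand(n)
y = rng.rand(n)
print(x.shape, y.shape)
```

Python also allows underscores in integer literals, so 1_000_000 is another readable spelling of the same size.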

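Chapter 4

4.1 General Matplotlib Tips

4.1.3 show() or No show()? How to Display Your Plots

2. Plotting from an IPython shell

Running the %matplotlib magic command after starting ipython raises an error:

In[1]: %matplotlib

AttributeError: 'NoneType' object has no attribute 'lower'

For now, plots are displayed only in the IPython Notebook, using %matplotlib inline or %matplotlib notebook.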

4.5 Visualizing Errors

4.5.2 Continuous Errors

Importing the Gaussian process regression class failed:

In[1]: from sklearn.gaussian_process import GaussianProcess

ImportError: cannot import name 'GaussianProcess' from 'sklearn.gaussian_process' (d:\Users\Administrator\Anaconda3\lib\site-packages\sklearn\gaussian_process\__init__.py)
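
In recent Scikit-Learn releases the legacy GaussianProcess class is gone, and GaussianProcessRegressor is the closest replacement. The sketch below reworks the book's toy fit along those lines; the RBF kernel and the random_state value are assumptions made for this example, not settings taken from the book.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def model(x):
    """Toy target function used to generate the sample points."""
    return x * np.sin(x)


xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Kernel choice is an assumption for this sketch.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), random_state=0)
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
# return_std=True gives the predictive standard deviation at each point.
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std  # roughly a 95% band under a Gaussian assumption
```

The arrays yfit and dyfit can then be passed to plt.fill_between(xfit, yfit - dyfit, yfit + dyfit) to draw the continuous error band.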

4.13 Customizing Matplotlib: Configurations and Stylesheets

4.13.1 Plot Customization by Hand

Switching to a gray background raises an exception:

In[3]: # use a gray background
       ax = plt.axes(axisbg='#E6E6E6')
       ax.set_axisbelow(True)

AttributeError: 'AxesSubplot' object has no property 'axisbg'

The first call needs to be changed to:

In[3]: ax = plt.axes(facecolor='#E6E6E6')

4.15 Geographic Data with Basemap

Loading Basemap fails:

In[1]: from mpl_toolkits.basemap import Basemap

This raises KeyError: 'PROJ_LIB'. An environment variable needs to be added on the local system:

Variable name: PROJ_LIB

Variable value: D:\Users\Administrator\Anaconda3\Library\share

4.16 Visualization with Seaborn

4.16.2 Exploring Seaborn Plots

1. Histograms, KDE, and densities

When plotting the histograms:

In[6]: for col in 'xy':
           plt.hist(data[col], normed=True, alpha=0.5)

In newer versions of matplotlib, normed has been replaced by density, and the error is:

AttributeError: 'Rectangle' object has no property 'normed'

The call can be changed to:

plt.hist(data[col], density=True, alpha=0.5)
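
A self-contained version of the corrected cell is sketched below. The data generation follows the book's setup for this section; the Agg backend and the area check are additions for this example so that it runs without a display and shows what density=True means.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
data = rng.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])

areas = {}
for col in 'xy':
    counts, bins, _ = plt.hist(data[col], density=True, alpha=0.5)
    # With density=True the bar heights are scaled so the total area is 1.
    areas[col] = float(np.sum(counts * np.diff(bins)))

print(areas)
```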

When drawing the two-dimensional visualization of the data:

In[9]: sns.kdeplot(data);

d:\Users\Administrator\Anaconda3\lib\site-packages\seaborn\distributions.py:679: UserWarning: Passing a 2D dataset for a bivariate plot is deprecated in favor of kdeplot(x, y), and it will cause an error in future versions. Please update your code.

  warnings.warn(warn_msg, UserWarning)

In newer environments this raises an error, and no solution has been found yet:

ValueError: If using all scalar values, you must pass an index

4.16.3 Example: Exploring Marathon Finishing Times

Converting the strings to a time type:

In[25]: def convert_time(s):
            h, m, s = map(int, s.split(':'))
            return pd.datetools.timedelta(hours=h, minutes=m, seconds=s)

This raises:

AttributeError: module 'pandas' has no attribute 'datetools'

The custom function can be dropped in favor of calling pd.to_timedelta() directly. That is, change the converters argument in the next cell to:

converters={'split': pd.to_timedelta, 'final': pd.to_timedelta}

Later, when converting the times to seconds:

In[27]: data['split_sec'] = data['split'].astype(int) / 1E9
        data['final_sec'] = data['final'].astype(int) / 1E9

This raises the following error, since int resolves to a 32-bit integer on this platform:

TypeError: cannot astype a timedelta from [timedelta64[ns]] to [int32]

This can be changed to:

In[27]: data['split_sec'] = data['split'].astype(np.int64) / 1E9
        data['final_sec'] = data['final'].astype(np.int64) / 1E9
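
Another way to get seconds, which avoids the integer cast altogether, is the .dt.total_seconds() accessor. This is a sketch on placeholder times, not the marathon data set.

```python
import pandas as pd

# Placeholder times in HH:MM:SS form, standing in for the marathon columns.
raw = pd.DataFrame({'split': ['01:05:38', '01:06:26'],
                    'final': ['02:08:51', '02:09:28']})

# Convert both string columns to timedelta values.
data = raw.apply(pd.to_timedelta)

# total_seconds() does not depend on the platform's integer width.
data['split_sec'] = data['split'].dt.total_seconds()
data['final_sec'] = data['final'].dt.total_seconds()
print(data[['split_sec', 'final_sec']])
```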

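Chapter 5

5.2 Introducing Scikit-Learn

5.2.2 Scikit-Learn's Estimator API

3. Supervised learning example: Iris classification

Splitting the data set with the helper function:

In[15]: from sklearn.cross_validation import train_test_split

The module no longer exists, and the error is:

ModuleNotFoundError: No module named 'sklearn.cross_validation'

Import the function from its current module instead:

In[15]: from sklearn.model_selection import train_test_split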

5. Unsupervised learning example: Iris clustering

Importing the Gaussian mixture model:

In[20]: from sklearn.mixture import GMM

This raises:

ImportError: cannot import name 'GMM' from 'sklearn.mixture'

It should be changed to:

In[20]: from sklearn.mixture import GaussianMixture      # 1. Choose the model class
        model = GaussianMixture(n_components=3,
                                covariance_type='full')  # 2. Instantiate the model with hyperparameters
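
To show the renamed class end to end, here is a small sketch that fits and predicts with GaussianMixture. The synthetic blobs stand in for the iris features, and random_state is added only to make the run repeatable.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Three well-separated synthetic blobs (placeholder for the iris features).
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

model = GaussianMixture(n_components=3, covariance_type='full',
                        random_state=0)   # choose and configure the model
model.fit(X)                              # fit to data; no labels are used
y_gmm = model.predict(X)                  # assign a cluster label to each row
print(np.unique(y_gmm))
```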

5.2.3 Application: Exploring Handwritten Digits

2. Unsupervised learning: Dimensionality reduction

In[20]: plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
                    edgecolor='none', alpha=0.5,
                    cmap=plt.cm.get_cmap('spectral', 10))

This raises:

ValueError: Colormap spectral is not recognized.

Capitalizing the colormap name makes the call run (note that 'Spectral' is a different colormap from the old 'spectral', which was renamed 'nipy_spectral'):

                    cmap=plt.cm.get_cmap('Spectral', 10))
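
To keep the colors of the book's figure, the renamed colormap can be requested by its current name. The sketch below uses plt.get_cmap, which takes the same (name, number of colors) arguments; the scatter data is random placeholder data rather than the projected digits.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# 'nipy_spectral' is the current name of the colormap formerly called 'spectral'.
cmap = plt.get_cmap('nipy_spectral', 10)

rng = np.random.RandomState(0)
points = rng.rand(100, 2)                 # placeholder 2D projection
labels = rng.randint(0, 10, size=100)     # placeholder digit labels

plt.scatter(points[:, 0], points[:, 1], c=labels,
            edgecolor='none', alpha=0.5, cmap=cmap)
plt.colorbar(label='digit label', ticks=range(10))
```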

3. Classification on digits

In[32]: test_images = xtest.reshape(-1, 8, 8)

The error is:

NameError: name 'xtest' is not defined

The variable defined earlier is 'Xtest', so this should be:

In[32]: test_images = Xtest.reshape(-1, 8, 8)

5.3 Hyperparameters and Model Validation

5.3.1 Thinking About Model Validation

3. Cross-validation

Calling leave-one-out (LOO) cross-validation:

In[8]: from sklearn.model_selection import LeaveOneOut
       scores = cross_val_score(model, X, y, cv=LeaveOneOut(len(X)))

This raises:

TypeError: LeaveOneOut() takes no arguments

Remove the argument:

In[8]: scores = cross_val_score(model, X, y, cv=LeaveOneOut())
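
A complete, runnable version of the corrected call might look like the sketch below. The iris data and the one-nearest-neighbor classifier are chosen here as a plausible stand-in for the model, X, and y defined earlier in the book's section.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# LeaveOneOut() infers the number of splits from the data passed in.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())
```

Each trial holds out a single sample, so there is one score per sample and every score is either 0.0 or 1.0.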

5.3.2 Selecting the Best Model

2. Validation curves in Scikit-Learn

Importing the validation curve helper:

In[13]: from sklearn.learning_curve import validation_curve

This raises:

ModuleNotFoundError: No module named 'sklearn.learning_curve'

It is now:

In[13]: from sklearn.model_selection import validation_curve

5.3.3 Learning Curves

Learning curves in Scikit-Learn

The learning curve import has the same problem:

In[17]: from sklearn.learning_curve import learning_curve

It should be changed to:

In[17]: from sklearn.model_selection import learning_curve

5.3.4 Validation in Practice: Grid Search

Importing the grid search meta-estimator:

In[18]: from sklearn.grid_search import GridSearchCV

The error is:

ModuleNotFoundError: No module named 'sklearn.grid_search'

This is likewise changed to:

In[18]: from sklearn.model_selection import GridSearchCV
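
The three relocated helpers can be exercised together as follows. This sketch uses a Ridge model on synthetic data, both chosen only for illustration; in recent Scikit-Learn releases param_name and param_range are, to my understanding, keyword-only arguments of validation_curve, so they are passed by name here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import (GridSearchCV, learning_curve,
                                     validation_curve)

rng = np.random.RandomState(0)
X = rng.rand(60, 1)                          # synthetic feature
y = 3 * X.ravel() + 0.1 * rng.randn(60)      # noisy linear target

alphas = [0.01, 0.1, 1.0]

# Scores for each alpha across 5 folds: arrays of shape (3, 5).
train_score, val_score = validation_curve(
    Ridge(), X, y, param_name='alpha', param_range=alphas, cv=5)

# Scores for 4 training-set sizes across 5 folds: arrays of shape (4, 5).
sizes, train_lc, val_lc = learning_curve(
    Ridge(), X, y, cv=5, train_sizes=np.linspace(0.3, 1.0, 4))

# Exhaustive search over the same alpha values.
grid = GridSearchCV(Ridge(), {'alpha': alphas}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```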

When plotting the result:

In[21]: plt.plot(X_test.ravel(), y_test, hold=True);

The error is:

AttributeError: 'Line2D' object has no property 'hold'

The hold argument can simply be removed:

In[21]: plt.plot(X_test.ravel(), y_test);

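5.6 In Depth: Linear Regression

5.6.4 Example: Predicting Bicycle Traffic

Computing the daily bicycle traffic:

In[15]: daily = counts.resample('d').sum()
        daily['Total'] = daily.sum(axis=1)
        daily = daily[['Total']] # remove other columns

Because the data currently in use already has a total column, the cell is modified accordingly:

In[15]: daily = counts.resample('d').sum()
        daily = daily[['Fremont Bridge Total']] # remove other columns
        daily.columns = ['Total']

Building the linear regression model:

In[22]: column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday',
                        'daylight_hrs', 'PRCP', 'dry day', 'Temp(C)', 'annual']
        X = daily[column_names]
        y = daily['Total']
        model = LinearRegression(fit_intercept=False)
        model.fit(X, y)

This raises:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The two raw data sets cover different time spans, so joining them produces missing values. Adding a statement that drops the rows containing missing values, before X and y are built, fixes this:

daily.dropna(inplace=True)

The following statement can also be used to check whether the data contains missing values:

print(np.isnan(daily).any())

The example in the book actually uses the mean of the east and west counts rather than the total traffic.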
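
How the mismatch in time spans produces NaN values, and how dropna removes them, can be seen on a toy pair of tables. The numbers and dates below are invented for illustration and are not the Seattle data.

```python
import pandas as pd

# Counts cover three days; the weather table covers only the first two.
counts = pd.DataFrame({'Total': [100.0, 120.0, 90.0]},
                      index=pd.date_range('2020-01-01', periods=3))
weather = pd.DataFrame({'PRCP': [0.0, 0.2]},
                       index=pd.date_range('2020-01-01', periods=2))

# The left join leaves PRCP missing on the day without weather data.
daily = counts.join(weather)
has_nan_before = bool(daily.isna().any().any())

# Drop the incomplete rows before fitting a model.
daily = daily.dropna()
has_nan_after = bool(daily.isna().any().any())
print(has_nan_before, has_nan_after, len(daily))
```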

5.7 In Depth: Support Vector Machines

5.7.3 Example: Face Recognition

Importing RandomizedPCA:

In[20]: from sklearn.decomposition import RandomizedPCA

ImportError: cannot import name 'RandomizedPCA' from 'sklearn.decomposition' (d:\Users\Administrator\Anaconda3\lib\site-packages\sklearn\decomposition\__init__.py)

There is no longer a separate RandomizedPCA class; PCA can be used in its place:

In[20]: from sklearn.decomposition import PCA as RandomizedPCA
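
The alias keeps the rest of the book's code unchanged, but on its own it lets PCA pick its solver automatically. To request the randomized solver explicitly, PCA accepts svd_solver='randomized'. A minimal sketch on random placeholder data (not the faces data set):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(200, 50)   # placeholder for the flattened face images

# svd_solver='randomized' selects the randomized SVD that RandomizedPCA used.
pca = PCA(n_components=10, whiten=True, svd_solver='randomized',
          random_state=42)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```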