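1. Data preparation
- Data scale: data grouping, data sampling (especially necessary when handling large datasets)
- Data types: numerical data, categorical data (be very clear about the structure of the data: continuous? discrete? ordered?)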
- Data quality: outliers, missing values
Data grouping: groupby
import numpy as np
import pandas as pd

# Two key columns and two columns of random normal values
df = pd.DataFrame({'key1':['a','a','b','b','a'],
                   'key2':['one','two','one','two','one'],
                   'data1':np.random.normal(size=5),
                   'data2':np.random.normal(size=5)})
df
df['data1'].groupby(df['key1']).mean()
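The same pattern extends to grouping by several keys and computing several statistics at once. A minimal sketch, assuming a fixed seed and the variable name `summary`, neither of which appears in the original:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': rng.normal(size=5),
    'data2': rng.normal(size=5),
})

# Group by both keys, then compute the mean and row count of data1 per group.
summary = df.groupby(['key1', 'key2'])['data1'].agg(['mean', 'count'])
print(summary)
```

Grouping by a list of columns yields one row per key combination, which is often easier to read than chaining several single-key groupings.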
Data sampling: sample (especially necessary when handling large datasets)
import random
x = np.arange(1,100)
y = random.sample(list(x),10)
y
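When the data already lives in a DataFrame, pandas can sample rows directly. A short sketch, assuming a hypothetical table `big` standing in for a large dataset and an arbitrary `random_state`:

```python
import numpy as np
import pandas as pd

big = pd.DataFrame({'value': np.arange(1, 100)})  # hypothetical stand-in for a large table

# Draw 10 rows without replacement; random_state makes the draw repeatable.
subset = big.sample(n=10, random_state=42)

# Or draw a fixed fraction of the rows instead of a fixed count.
tenth = big.sample(frac=0.1, random_state=42)
print(len(subset), len(tenth))
```

Fixing `random_state` is worth doing while exploring, so that a chart can be regenerated from the same subset.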
Outliers and missing values
Handling NA values: dropna, fillna
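x = np.arange(0,100)
# Keep only the values strictly between 90 and 100; combining the two
# conditions with | would match every element
y = x[(x>90)&(x<100)]
y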
# Copy first: plain `y = x` would alias x, so the loop would overwrite x as well
y = x.copy()
for i in np.arange(0,100):
    if i >= 0 and i<=90:
        y[i] = 0
    else:
        y[i] = i
y
np.array([0 if i>= 10 and i <= 90 else i for i in range(0,100)])
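The list comprehension can also be written as a vectorized expression. A minimal sketch using `np.where`, with the same 10 to 90 range as the comprehension:

```python
import numpy as np

x = np.arange(0, 100)

# Zero out the values in [10, 90] and keep the rest, without a Python loop.
# np.where builds a new array, so x itself is left unchanged.
y = np.where((x >= 10) & (x <= 90), 0, x)
print(y)
```

On large arrays this form avoids per-element Python overhead, which matters more as the data grows.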
from numpy import nan
data = pd.Series([1,nan,2,nan,3,nan])
data
data.dropna()
data.fillna(4)
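A constant such as 4 is rarely the right fill value in practice. The sketch below shows two common alternatives on the same series; the choice of mean filling and linear interpolation is an assumption for illustration, not something the original prescribes:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, np.nan, 3, np.nan])

n_missing = data.isna().sum()            # count the gaps before choosing a strategy
mean_filled = data.fillna(data.mean())   # replace gaps with the mean of the observed values
interpolated = data.interpolate()        # linear interpolation between neighbouring values
print(n_missing)
print(mean_filled.tolist())
print(interpolated.tolist())
```

Mean filling preserves the overall average, while interpolation suits ordered data such as time series.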
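2. Choosing the chart
Data visualization usually deals with three kinds of problems:
- Correlation analysis: scatter plot, line plot (scatter, plot)
- Distribution analysis: histogram, density plot (hist, gaussian_kde, plot)
- Categorical analysis: bar chart, box plot (bar, boxplot)
Correlation analysis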
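import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

# Load the breast cancer dataset into a DataFrame with named feature columns
bc = datasets.load_breast_cancer()
df = pd.DataFrame(bc.data,columns=bc.feature_names)
# Keyword arguments: recent seaborn releases no longer accept x, y, data positionally
sns.lmplot(x='mean concavity',y='mean symmetry',data=df,
           height=5,aspect=1.5,fit_reg=True,order=4)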
sns.lmplot(x='mean concavity',y='mean concave points',data=df,
           height=5,aspect=1.5,fit_reg=True,order=4)
Distribution analysis
plt.figure(figsize=(7.5,5))
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(df['mean concavity'],bins=10)
plt.ylabel('Histogram')
plt.twinx()  # second y-axis so the KDE gets its own scale
sns.kdeplot(df['mean concavity'],cumulative=False)
plt.ylabel('KDE')
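To see what the density curve represents, the following sketch computes a one-dimensional Gaussian kernel density estimate by hand with numpy. The function name `gaussian_kde_1d`, the synthetic samples, and the bandwidth choice (Scott's rule, the default in scipy's `gaussian_kde`) are assumptions made for illustration; plotting libraries may differ in detail:

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=200)  # synthetic data, not the cancer dataset
grid = np.linspace(-6, 6, 601)

# Scott's rule: sample standard deviation times n^(-1/5)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum over the grid).
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a spikier curve and a larger one a smoother curve, which is why the KDE and the histogram can tell slightly different stories about the same column.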
Categorical analysis
plt.figure(figsize=(7.5,5))
sns.boxplot(x='mean concavity',data=df)
df['mean concavity'].describe()
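The quartiles reported by `describe()` are the same ones a box plot draws. A small sketch of how they translate into whisker limits, assuming the common 1.5 × IQR rule and a made-up series with one extreme value:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # synthetic data with one extreme value

stats = values.describe()
q1, q3 = stats['25%'], stats['75%']
iqr = q3 - q1                                   # interquartile range: the height of the box
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # points beyond these limits are drawn as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

Points flagged this way are candidates for a closer look, not automatically errors.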
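3. Iterating on the analysis
- Choose a fitting model: OLS, fit (OLS = ordinary least squares; fit = fitting a model to the data)
- Assess the quality of the fit: summary_table gives a statistical summary
- Determine the data distribution: hist
- Identify the key intervals: quartile (the upper and lower quartiles of the distribution, and the intervals between quantiles)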
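The steps in this list can be sketched end to end on synthetic data. This example uses numpy's `polyfit` as a stand-in for the OLS tooling named above, and the data, seed, and variable names are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)  # synthetic: true slope 2, intercept 1

# Choose and fit the model: a degree-1 least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)

# Assess the fit: share of the variance in y explained by the line.
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

# Identify the key intervals: the quartiles of y.
q1, median, q3 = np.percentile(y, [25, 50, 75])
print(slope, intercept, r_squared, (q1, median, q3))
```

An R² near 1 means the straight line captures most of the variation, and the quartiles show where the central half of the values sits.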
sns.lmplot(x='mean concavity',y='mean concave points',data=df,
           height=5,aspect=1.5,fit_reg=True,order=1)
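plt.figure(figsize=(7.5,5))
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(df['mean concavity'],bins=10)
plt.ylabel('Histogram')
plt.twinx()  # second y-axis so the KDE gets its own scale
# The kernel argument is deprecated in recent seaborn; Gaussian is the only supported kernel
sns.kdeplot(df['mean concavity'],cumulative=False)
plt.ylabel('KDE')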
plt.figure(figsize=(7.5,5))
sns.boxplot(x='mean concavity',data=df)
- Box plot (boxplot)
Use it to observe the symmetry and skewness of a distribution.
Limitation of box plots: they can distort your understanding of the data, because different datasets can produce the same box plot.
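To make that limitation concrete, here is a small sketch with two made-up arrays (both hypothetical, chosen by hand) that share the same five-number summary, and would therefore draw the same box, even though one is evenly spread and the other is clustered:

```python
import numpy as np

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    return np.percentile(values, [0, 25, 50, 75, 100])

evenly_spread = np.array([0, 2.5, 5, 7.5, 10])
clustered = np.array([0, 0, 2.5, 5, 5, 5, 7.5, 10, 10])  # piles up at 0, 5 and 10

same_box = np.allclose(five_number_summary(evenly_spread),
                       five_number_summary(clustered))
print(five_number_summary(evenly_spread))
print(five_number_summary(clustered))
print(same_box)
```

A histogram or density plot of the two arrays would reveal the difference that the box plot hides.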
"""
Edward Tufte uses this example from Anscombe to show 4 datasets of x
and y that have the same mean, standard deviation, and regression
line, but which are qualitatively different.
"""
x=[10,8,13,9,11,14,6,4,12,7,5]
x4=[8,8,8,8,8,8,8,19,8,8,8]
y1=[8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68]
y2=[9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74]
y3=[7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73]
y4=[6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89]
df=pd.DataFrame({'x':x,'y1':y1,'y2':y2,'y3':y3,'y4':y4,})
df[['y1','y2','y3','y4']].describe().loc['mean':'std']
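The similarity goes beyond the mean and standard deviation. The sketch below fits a straight line to each of the four datasets and computes the correlation coefficient; it assumes numpy's `polyfit` and `corrcoef` as the tools (the original does not say which to use) and pairs `y4` with `x4`:

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

pairs = {'y1': (x, y1), 'y2': (x, y2), 'y3': (x, y3), 'y4': (x4, y4)}

fits = {}
for name, (xs, ys) in pairs.items():
    slope, intercept = np.polyfit(xs, ys, deg=1)   # least-squares line
    r = np.corrcoef(xs, ys)[0, 1]                  # Pearson correlation
    fits[name] = (slope, intercept, r)
    print(f'{name}: slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}')
```

All four fits come out nearly identical, which is exactly why the numbers alone cannot replace looking at the scatter plots.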
What drives iterative analysis depends not only on the data itself but also on the analyst's perspective.
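4. Presenting conclusions
- Get into the habit of describing what a chart shows
- Ask a good question, draw a good chart, give a good conclusion
Conclusions: (1) Per-capita GDP is positively correlated with oil consumption. (2) Neither per-capita GDP nor oil consumption is related to a country's population.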
5. Summary
[Image: summary diagram]
6. Exercise
Plot the scatter plots and first-order fitted lines for Edward Tufte's example, as in the figure below. (The answer will be published in the next post.)
[Image: expected result]