Plotting with Pandas
Introduction to data visualization with Pandas
Univariate visualization with Pandas
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#aa5500"># Load the data</span>
<span style="color:#770088">import</span> <span style="color:#000000">pandas</span> <span style="color:#770088">as</span> <span style="color:#000000">pd</span>
<span style="color:#000000">reviews</span> = <span style="color:#000000">pd</span>.<span style="color:#000000">read_csv</span>(<span style="color:#aa1111">"data/winemag-data_first150k.csv"</span>, <span style="color:#000000">index_col</span>=<span style="color:#116644">0</span>)
<span style="color:#000000">reviews</span>.<span style="color:#000000">head</span>(<span style="color:#116644">3</span>)</span></span>
Output:
| | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US | This tremendous 100% varietal wine hails from ... | Martha's Vineyard | 96 | 235.0 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
| 1 | Spain | Ripe aromas of fig, blackberry and cassis are ... | Carodorum Selección Especial Reserva | 96 | 110.0 | Northern Spain | Toro | NaN | Tinta de Toro | Bodega Carmen Rodríguez |
| 2 | US | Mac Watson honors the memory of a wine once ma... | Special Selected Late Harvest | 96 | 90.0 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
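Bar charts and categorical data
Bar charts are very flexible:
- The height of a bar can represent anything, as long as it is a number.
- Each bar can represent anything, as long as it is a category.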
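As a minimal sketch of both points, the snippet below draws a bar chart from a tiny made-up Series: the index holds the categories and the values hold the bar heights. The flavor names and numbers are invented for illustration and are not part of the wine dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import pandas as pd

# Hypothetical data: categories in the index, numeric heights in the values
sales = pd.Series({"vanilla": 30, "chocolate": 45, "mint": 12})

# One bar is drawn per category, in index order
ax = sales.plot.bar()

# The drawn rectangles carry the heights taken from the Series values
bar_heights = [patch.get_height() for patch in ax.patches]
```

Any Series whose index is categorical and whose values are numeric can be plotted this way, which is why `value_counts()` output feeds directly into `plot.bar()`.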
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#aa5500"># figsize: size of the figure; fontsize: font size of the tick labels; color: bar colors</span>
<span style="color:#000000">text_kwargs</span> = <span style="color:#3300aa">dict</span>(
    <span style="color:#000000">figsize</span>=(<span style="color:#116644">16</span>,<span style="color:#116644">8</span>),
    <span style="color:#000000">fontsize</span>=<span style="color:#116644">20</span>,
    <span style="color:#000000">color</span>=[<span style="color:#aa1111">'b'</span>,<span style="color:#aa1111">'orange'</span>,<span style="color:#aa1111">'g'</span>,<span style="color:#aa1111">'r'</span>,<span style="color:#aa1111">'purple'</span>,<span style="color:#aa1111">'brown'</span>,<span style="color:#aa1111">'pink'</span>,<span style="color:#aa1111">'gray'</span>,<span style="color:#aa1111">'cyan'</span>,<span style="color:#aa1111">'y'</span>]
)

<span style="color:#aa5500"># Count how often each province appears, keep the top 10, and plot; **text_kwargs unpacks the dict into keyword arguments</span>
<span style="color:#000000">reviews</span>[<span style="color:#aa1111">'province'</span>].<span style="color:#000000">value_counts</span>().<span style="color:#000000">head</span>(<span style="color:#116644">10</span>).<span style="color:#000000">plot</span>.<span style="color:#000000">bar</span>(<span style="color:#981a1a">**</span><span style="color:#000000">text_kwargs</span>)</span></span>
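Output:
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#aa5500"># Count how often each province appears, keep the top 10, then divide by the total number of rows to get each province's share of the reviewed wines</span>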
(<span style="color:#000000">reviews</span>[<span style="color:#aa1111">'province'</span>].<span style="color:#000000">value_counts</span>().<span style="color:#000000">head</span>(<span style="color:#116644">10</span>) <span style="color:#981a1a">/</span> <span style="color:#3300aa">len</span>(<span style="color:#000000">reviews</span>)).<span style="color:#000000">plot</span>.<span style="color:#000000">bar</span>(<span style="color:#981a1a">**</span><span style="color:#000000">text_kwargs</span>)</span></span>
Output: California produces almost a third of the wines reviewed in Wine Magazine!
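Dividing by `len(reviews)` can also be written with the `normalize=True` option of `value_counts`. One difference is worth keeping in mind: `value_counts` skips missing values by default, so the two versions agree only when the column contains no `NaN`. The sketch below uses a small made-up Series rather than the wine data.

```python
import pandas as pd

# Hypothetical stand-in for reviews['province'], including one missing value
province = pd.Series(["California", "California", "Tuscany", "Bordeaux", None])

# Share of all rows, as in the tutorial: counts divided by the total row count
share_of_rows = province.value_counts() / len(province)

# Share of non-missing rows: normalize=True divides by the number of non-null values
share_of_known = province.value_counts(normalize=True)
```

Here California accounts for 2 of 5 rows but 2 of 4 known provinces, so the choice of denominator should match the question being asked.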
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#aa5500"># Count how many wines received each score, sort by score, then plot</span>
<span style="color:#000000">reviews</span>[<span style="color:#aa1111">'points'</span>].<span style="color:#000000">value_counts</span>().<span style="color:#000000">sort_index</span>().<span style="color:#000000">plot</span>.<span style="color:#000000">bar</span>(<span style="color:#981a1a">**</span><span style="color:#000000">text_kwargs</span>)</span></span>
Output:
Line charts
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#000000">reviews</span>[<span style="color:#aa1111">'points'</span>].<span style="color:#000000">value_counts</span>().<span style="color:#000000">sort_index</span>().<span style="color:#000000">plot</span>.<span style="color:#000000">line</span>()</span></span>
Output:
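`value_counts` returns its result ordered by frequency, which is why the examples call `sort_index()` before plotting: a line chart only reads correctly when the x-axis follows the natural order of the scores. A small sketch with made-up scores:

```python
import pandas as pd

# Hypothetical scores standing in for reviews['points']
points = pd.Series([88, 87, 88, 90, 87, 88])

# Default ordering: most frequent score first
by_frequency = points.value_counts()

# Ordered by the score itself, which is what a line chart needs on its x-axis
by_score = points.value_counts().sort_index()
```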
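- Difference between bar charts and line charts
- Quick exercise: bar chart or line chart?
  - Sales volume for each of 5 ice cream flavors.
  - Monthly sales volume of different domestic car brands.
  - Students' exam scores, ranging from 0 to 100.
Area charts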
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#000000">reviews</span>[<span style="color:#aa1111">'points'</span>].<span style="color:#000000">value_counts</span>().<span style="color:#000000">sort_index</span>().<span style="color:#000000">plot</span>.<span style="color:#000000">area</span>()</span></span>
Output:
Histograms
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#aa5500"># Keep the rows with price below 200, select the price column, and plot</span>
<span style="color:#000000">reviews</span>[<span style="color:#000000">reviews</span>[<span style="color:#aa1111">'price'</span>] <span style="color:#981a1a"><</span> <span style="color:#116644">200</span>][<span style="color:#aa1111">'price'</span>].<span style="color:#000000">plot</span>.<span style="color:#000000">hist</span>()</span></span>
Output:
<span style="background-color:#f8f8f8"><span style="color:#333333"><span style="color:#000000">reviews</span>[<span style="color:#aa1111">'price'</span>].<span style="color:#000000">plot</span>.<span style="color:#000000">hist</span>()</span></span>
Output:
<span style="background-color:#f8f8f8"><span style="color:#333333"># Look at the most expensive wines
reviews[reviews['price'] > 1500]
</span></span>
Output:
| | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|
| 13318 | US | The nose on this single-vineyard wine from a s... | Roger Rose Vineyard | 91 | 2013.0 | California | Arroyo Seco | Central Coast | Chardonnay | Blair |
| 34920 | France | A big, powerful wine that sums up the richness... | NaN | 99 | 2300.0 | Bordeaux | Pauillac | NaN | Bordeaux-style Red Blend | Chateau Latour |
| 34922 | France | A massive wine for Margaux, packed with tannin... | NaN | 98 | 1900.0 | Bordeaux | Margaux | NaN | Bordeaux-style Red Blend | Chateau Margaux |
<span style="background-color:#f8f8f8"><span style="color:#333333">reviews.shape
</span></span>
Output: 150,930 rows in total
<span style="background-color:#f8f8f8">(150930, 10)
</span>
<span style="background-color:#f8f8f8"><span style="color:#333333">reviews[reviews['price'] >500].shape
</span></span>
Output: only 73 rows have a price above 500
<span style="background-color:#f8f8f8">(73, 10)
</span>
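The two `shape` calls show that very expensive wines are rare (73 of 150,930 rows, roughly 0.05%), which is why the unfiltered price histogram is dominated by a single bar. The same comparison can be expressed as a proportion by taking the mean of a boolean mask. The prices below are made up for illustration.

```python
import pandas as pd

# Hypothetical prices with one extreme outlier and one missing value
price = pd.Series([12.0, 20.0, 35.0, 18.0, 2300.0, None, 45.0, 60.0])

# Boolean mask; a missing price compares as False
expensive = price > 500

# Number of rows above the threshold
n_expensive = int(expensive.sum())

# Fraction of all rows (including the row with a missing price)
share_expensive = expensive.mean()
```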
<span style="background-color:#f8f8f8"><span style="color:#333333">reviews['points'].plot.hist()
</span></span>
Output:
Pie charts
<span style="background-color:#f8f8f8"><span style="color:#333333">reviews['province'].value_counts().head(10).plot.pie()
</span></span>
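Output:
Bivariate visualization with Pandas
Scatter plots
- The simplest plot for two variables is the scatter plot: each point encodes the values of two variables.
<span style="background-color:#f8f8f8"># Wines priced below 100: take a random sample of 100 rows and plot score against price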
reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')
</span>
Output:
<span style="background-color:#f8f8f8">reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points', figsize=(14,8), fontsize = 16)
</span>
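Because `sample(100)` draws a different subset on every call, each scatter plot above shows different points. Passing `random_state` makes the sample reproducible. A sketch on a made-up frame (the column values are invented):

```python
import pandas as pd

# Hypothetical stand-in for the filtered reviews frame
df = pd.DataFrame({
    "price": range(10, 60),
    "points": [80 + i % 20 for i in range(50)],
})

# The same seed always returns the same rows
first = df.sample(5, random_state=42)
second = df.sample(5, random_state=42)
```

The seed value itself is arbitrary; what matters is reusing it when two plots should show the same sample.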
<span style="background-color:#f8f8f8">from matplotlib import pyplot as plt
# Create the figure and the axes
fig, axes = plt.subplots(ncols=1, figsize=(20,10))
# Draw the pandas plot on the given axes
reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points', figsize=(14,8), fontsize=16, ax=axes)
# Change the label text and font size through the axes
axes.set_xlabel('price', fontdict={'fontsize':16})
</span>
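Output: price and score are somewhat correlated; that is, more expensive wines tend to score higher.
- Note that we had to sample the data, drawing 100 rows from the full dataset. If all of the roughly 150,000 rows are drawn on one scatter plot, many points overlap and the plot becomes hard to read:
<span style="background-color:#f8f8f8">reviews[reviews['price'] < 100].plot.scatter(x='price', y='points',figsize=(12,8))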
</span>
Output:
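Sampling is one way to deal with overplotting. Another common option is to keep every row and make the markers semi-transparent, so that dense regions appear darker. `alpha` is a standard matplotlib scatter argument that pandas passes through, and `s` sets the marker size. The data below is randomly generated, not the wine data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import numpy as np
import pandas as pd

# Hypothetical data: 5000 random (price, points) pairs
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.integers(5, 100, size=5000),
    "points": rng.integers(80, 101, size=5000),
})

# Every row is drawn; overlapping markers add up to darker areas
ax = df.plot.scatter(x="price", y="points", alpha=0.1, s=10, figsize=(12, 8))
```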
Hexplots (hexagonal bin plots)
<span style="background-color:#f8f8f8"><span style="color:#333333">reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15, figsize=(14,8))
</span></span>
Output:
<span style="background-color:#f8f8f8"><span style="color:#333333">fig, axes = plt.subplots(ncols=1, figsize = (12,8))
reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15,ax = axes)
axes.set_xticks([0,20,40,60,80,100])
</span></span>
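Output:
- The data in this plot is directly comparable to the scatter plot, but the hexplot conveys more information.
- From the hexplot we can see that the bottles reviewed by Wine Magazine cluster around 87.5 points and a price of about 20 dollars.
- Hexplots and scatter plots can be applied to combinations of interval variables and ordinal categorical variables.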
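A hexplot is essentially a two-dimensional histogram: it counts how many rows fall into each cell and colors the cell by that count. The sketch below reproduces only the counting step, using rectangular price bins instead of hexagons (a simplification, not what `hexbin` does internally), on made-up data. It shows how the densest cell can be read off numerically.

```python
import pandas as pd

# Hypothetical (price, points) pairs
df = pd.DataFrame({
    "price":  [18, 20, 22, 24, 55, 60, 21, 90],
    "points": [86, 87, 87, 87, 92, 93, 88, 95],
})

# Bucket prices into 10-dollar-wide bins: [0, 10), [10, 20), [20, 30), ...
df["price_bin"] = pd.cut(df["price"], bins=range(0, 101, 10), right=False)

# Count rows per (price bin, points) cell, keeping only cells that occur
cell_counts = df.groupby(["price_bin", "points"], observed=True).size()

# The cell with the most rows: a (price interval, points) pair
densest_cell = cell_counts.idxmax()
```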
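Stacked plots
- Besides scatter plots, stacked plots can also be used to show two variables.
- A stacked plot is a chart that plots one variable on top of another.
- Next, we use stacked plots to look at the five most common wine varieties.
<span style="background-color:#f8f8f8"><span style="color:#333333"># Group by wine variety to find the five most common varieties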
reviews.groupby(['variety'])['country'].count().sort_values(ascending = False)
</span></span>
Output:
<span style="background-color:#f8f8f8">variety
Chardonnay 14482
Pinot Noir 14288
Cabernet Sauvignon 12800
Red Blend 10061
Bordeaux-style Red Blend 7347
...
Chinuri 1
Petit Meslier 1
Espadeiro 1
Parraleta 1
Erbaluce 1
Name: country, Length: 632, dtype: int64
</span>
<span style="background-color:#f8f8f8"><span style="color:#333333">top_5_wine = reviews[reviews.variety.isin(['Chardonnay','Pinot Noir','Cabernet Sauvignon','Red Blend','Bordeaux-style Red Blend'])]
</span></span>
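The five variety names above are typed in by hand from the `groupby` output. As a sketch, the same list can be derived with `value_counts`, which avoids typos and keeps working if the data changes. The tiny frame and the names `reviews_demo`, `top_names`, and `top_wine` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the reviews frame
reviews_demo = pd.DataFrame({
    "variety": ["Chardonnay", "Pinot Noir", "Chardonnay", "Merlot",
                "Pinot Noir", "Chardonnay", "Riesling"],
    "points": [88, 90, 87, 85, 91, 89, 86],
})

# Names of the two most common varieties (head(5) would be used on the real data)
top_names = reviews_demo["variety"].value_counts().head(2).index

# Keep only the rows whose variety is among the most common ones
top_wine = reviews_demo[reviews_demo["variety"].isin(top_names)]
```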
<span style="background-color:#f8f8f8"><span style="color:#333333"># Count reviews per score and variety with a pivot table
wine_counts = top_5_wine.pivot_table(index=['points'], columns=['variety'], values='country', aggfunc='count')
# Rename the columns
wine_counts.columns = ['Bordeaux-style Red Blend','Cabernet Sauvignon','Chardonnay','Pinot Noir','Red Blend']
wine_counts
</span></span>
Output:
| points | Bordeaux-style Red Blend | Cabernet Sauvignon | Chardonnay | Pinot Noir | Red Blend |
|---|---|---|---|---|---|
| 80 | 5.0 | 89.0 | 70.0 | 36.0 | 75.0 |
| 81 | 23.0 | 160.0 | 154.0 | 83.0 | 108.0 |
| 82 | 83.0 | 436.0 | 523.0 | 296.0 | 233.0 |
| 83 | 122.0 | 571.0 | 686.0 | 350.0 | 366.0 |
| 84 | 334.0 | 925.0 | 1170.0 | 757.0 | 623.0 |
| 85 | 379.0 | 1058.0 | 1299.0 | 903.0 | 608.0 |
| 86 | 467.0 | 1205.0 | 1525.0 | 1260.0 | 919.0 |
| 87 | 679.0 | 1589.0 | 1887.0 | 1784.0 | 1375.0 |
| 88 | 741.0 | 1160.0 | 1513.0 | 1586.0 | 1366.0 |
| 89 | 724.0 | 920.0 | 1039.0 | 1223.0 | 1013.0 |
| 90 | 901.0 | 1341.0 | 1435.0 | 1646.0 | 1131.0 |
| 91 | 821.0 | 825.0 | 938.0 | 1124.0 | 881.0 |
| 92 | 733.0 | 1018.0 | 953.0 | 1173.0 | 620.0 |
| 93 | 556.0 | 653.0 | 671.0 | 992.0 | 395.0 |
| 94 | 338.0 | 440.0 | 325.0 | 621.0 | 190.0 |
| 95 | 220.0 | 233.0 | 188.0 | 269.0 | 88.0 |
| 96 | 124.0 | 102.0 | 73.0 | 105.0 | 30.0 |
| 97 | 67.0 | 56.0 | 23.0 | 56.0 | 27.0 |
| 98 | 22.0 | 12.0 | 7.0 | 20.0 | 6.0 |
| 99 | 8.0 | 4.0 | NaN | 2.0 | 5.0 |
| 100 | NaN | 3.0 | 3.0 | 2.0 | 2.0 |
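The `NaN` entries mark score and variety combinations that never occur, and they are also why the counts are displayed as floats. If zeros are preferred, `pd.crosstab` is one alternative that produces an integer count table directly. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical reviews: one score and one variety per row
demo = pd.DataFrame({
    "points":  [87, 87, 88, 88, 88, 90],
    "variety": ["Chardonnay", "Pinot Noir", "Chardonnay",
                "Chardonnay", "Pinot Noir", "Pinot Noir"],
})

# Rows: points, columns: variety, cells: number of reviews (0 when a pair is absent)
counts = pd.crosstab(demo["points"], demo["variety"])
```

The column labels come from the data, so no manual renaming step is needed with this approach.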
<span style="background-color:#f8f8f8"><span style="color:#333333">wine_counts.plot.bar(stacked=True)
</span></span>
Output:
<span style="background-color:#f8f8f8"><span style="color:#333333">wine_counts.plot.area()
</span></span>
Output:
<span style="background-color:#f8f8f8"><span style="color:#333333">wine_counts.plot.line()
</span></span>
Output:
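The stacked charts above show absolute counts, so their shape is dominated by how many wines received each score. To compare the mix of varieties at each score instead, each row can be divided by its total before plotting. A sketch on a small made-up count table (the name `wine_counts_demo` and its numbers are not from the wine data):

```python
import pandas as pd

# Hypothetical count table shaped like wine_counts: rows are scores, columns are varieties
wine_counts_demo = pd.DataFrame(
    {"Chardonnay": [10, 30, 5], "Pinot Noir": [30, 30, 15]},
    index=pd.Index([86, 87, 88], name="points"),
)

# Divide every row by its row sum so that each row adds up to 1
shares = wine_counts_demo.div(wine_counts_demo.sum(axis=1), axis=0)

# shares.plot.bar(stacked=True) would then draw bars of equal total height
```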
Summary