Contents
1. In-Class Exercise
Steps:
Problem encountered:
Cause:
Solution:
Result:
Code:
2. Extension Exercise
New step:
Result:
Code:
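1. In-Class Exercise
Segment the text of China's 14th Five-Year Plan and draw a word cloud from it.
Steps:
- Read the text and segment it with jieba.cut(txt, cut_all=False) (accurate mode), which returns a generator of words
- Iterate over the words and count their frequencies in a dictionary, then filter out words that carry no meaning
- Store the key-value pairs in a list one by one, sort it in descending order with list.sort() and list.reverse(), and rebuild the dictionary
- Set the color scheme, shape (mask), font, and other parameters, then call generate_from_frequencies() to build the word cloud from the frequencies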
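The counting and sorting steps above can also be expressed with collections.Counter from the standard library. This is a minimal sketch, not the code used in this exercise: the function name top_frequencies and the sample tokens are made up for illustration, and in practice the tokens would come from jieba.cut().

```python
from collections import Counter


def top_frequencies(tokens, min_count=1):
    """Count tokens and return a dict ordered from most to least frequent."""
    counts = Counter(tokens)
    # most_common() yields (token, count) pairs sorted by count, descending
    return {tok: n for tok, n in counts.most_common() if n >= min_count}


# Hypothetical tokens standing in for the output of jieba.cut()
sample = ["发展", "创新", "发展", "建设", "发展", "创新"]
print(top_frequencies(sample))
```

Because dicts keep insertion order in Python 3.7 and later, the returned dict stays in descending order of frequency.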
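Problem encountered:
Opening the .txt file (UTF-8 encoded) with: textFile = open("text.txt", "r").read()
raised the error: UnicodeDecodeError: 'gbk' codec can't decode byte 0xad in position 2: illegal multibyte sequence
Cause:
The encoding parameter of open() defaults to None, in which case Python falls back to the platform's locale encoding (GBK on this system, as the error message shows) rather than UTF-8. Decoding a UTF-8 file as GBK fails, so encoding="utf-8" has to be passed explicitly.
Solution:
textFile = open("text.txt", "r", encoding="utf-8").read()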
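To make the fix easy to reuse, the explicit encoding can be wrapped in a small helper. The sketch below is only illustrative: the name read_utf8 and the temporary demo.txt file are assumptions, not part of the exercise.

```python
import os
import tempfile


def read_utf8(path):
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Write a small UTF-8 file to a temporary folder, then read it back
text = "十四五规划"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    print(read_utf8(path) == text)
```

Using a with block also closes the file automatically, which the one-line open(...).read() form leaves to the garbage collector.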
Result:
Code:
from wordcloud import WordCloud as wc
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from imageio import imread
import jieba

with open('China145.txt', 'r', encoding='utf-8') as f:
    rword = f.read()

seg_list = jieba.cut(rword, cut_all=False)  # generator of words (accurate mode)
tf = {}
for seg in seg_list:
    if seg in tf:
        tf[seg] += 1
    else:
        tf[seg] = 1  # count word frequencies

with open('stopword.txt', 'r', encoding='utf-8') as sw:
    # Assumes stop words are separated by whitespace or newlines;
    # a set gives exact matches instead of substring matches
    stopword = set(sw.read().split())

word = list(tf.keys())  # copy the keys so entries can be removed while looping
for seg in word:
    if tf[seg] < 5 or len(seg) < 2 or seg in stopword or "一" in seg:
        tf.pop(seg)  # drop rare words, single characters, and stop words

word, num, data = list(tf.keys()), list(tf.values()), []
for i in range(len(tf)):
    data.append((num[i], word[i]))  # store (count, word) pairs in data
data.sort()     # ascending order
data.reverse()  # reverse to get the descending order we need

tf_sorted = {}
for i in range(len(data)):
    tf_sorted[data[i][1]] = data[i][0]  # rebuild the dictionary

font = r'C:\Users\ZZX\AppData\Local\Microsoft\Windows\Fonts\STZHONGS.TTF'
mask = imread("heart.png")
colormaps = colors.ListedColormap(['#FF0000', '#FF7F50', '#FFE4C4'])
# width and height are ignored when a mask is supplied
mywc = wc(font_path=font, width=1600, height=900, max_words=300,
          background_color='white', colormap=colormaps,
          mask=mask).generate_from_frequencies(tf_sorted)
plt.axis('off')   # hide the axes
plt.imshow(mywc)  # draws the image onto the axes; nothing appears until plt.show()
plt.show()
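The filtering step in the code above copies the keys into a list first, because a dict cannot change size while it is being iterated. An alternative is to build a new dict with a comprehension. The sketch below is an assumption-laden illustration, not the author's code: filter_frequencies, its parameters, and the sample data are all hypothetical. Holding the stop words in a set makes `in` an exact-match test, whereas `in` on a plain string matches any substring.

```python
def filter_frequencies(tf, stopwords, min_count=5, min_len=2):
    """Return a new dict without rare, too-short, or stop words."""
    return {
        word: count
        for word, count in tf.items()
        if count >= min_count and len(word) >= min_len and word not in stopwords
    }


# Hypothetical counts and stop words for illustration
tf = {"发展": 12, "的": 40, "我们": 9, "创新": 7, "试点": 2}
stopwords = {"我们", "的"}
print(filter_frequencies(tf, stopwords))
```

The original dict is left untouched, so the unfiltered counts remain available for inspection.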
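2. Extension Exercise
Segment the text of Journey to the West and draw a word cloud as an at-a-glance overview of the book.
New step:
Color the word cloud according to the colors of the image:
from wordcloud import WordCloud, ImageColorGenerator
image_colors = ImageColorGenerator(mask)
plt.imshow(mywc.recolor(color_func=image_colors))  # mywc is the generated WordCloud instance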
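According to the wordcloud documentation, ImageColorGenerator colors each word with the mean color of the rectangle it occupies in the source image. The pure-Python sketch below illustrates only that averaging idea on a tiny nested-list "image"; mean_color and the 2x2 sample are hypothetical and are not the library's implementation.

```python
def mean_color(image, top, left, height, width):
    """Average the RGB values of a rectangular patch of a nested-list image."""
    pixels = [
        image[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    ]
    n = len(pixels)
    # Integer-average each of the three channels over the patch
    return tuple(sum(p[ch] for p in pixels) // n for ch in range(3))


# Hypothetical 2x2 image: red left column, blue right column
image = [
    [(255, 0, 0), (0, 0, 255)],
    [(255, 0, 0), (0, 0, 255)],
]
print(mean_color(image, 0, 0, 2, 1))  # patch covering the left column only
print(mean_color(image, 0, 0, 2, 2))  # patch covering the whole image
```

A word placed over the red column would come out red, while a word spanning both columns would get a blend of the two colors.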
Result:
Code:
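from wordcloud import WordCloud as wc
import matplotlib.pyplot as plt
from wordcloud import ImageColorGenerator
from PIL import Image
import numpy as np
import jieba

with open('Journey to the West.txt', 'r', encoding='utf-8') as f:
    rword = f.read()

seg_list = jieba.cut(rword, cut_all=False)  # generator of words (accurate mode)
tf = {}
for seg in seg_list:
    if seg in tf:
        tf[seg] += 1
    else:
        tf[seg] = 1  # count word frequencies

with open('stopword.txt', 'r', encoding='utf-8') as sw:
    # Assumes stop words are separated by whitespace or newlines;
    # a set gives exact matches instead of substring matches
    stopword = set(sw.read().split())

word = list(tf.keys())  # copy the keys so entries can be removed while looping
for seg in word:
    if tf[seg] < 5 or len(seg) < 2 or seg in stopword or "一" in seg:
        tf.pop(seg)  # drop rare words, single characters, and stop words

word, num, data = list(tf.keys()), list(tf.values()), []
for i in range(len(tf)):
    data.append((num[i], word[i]))  # store (count, word) pairs in data
data.sort()     # ascending order
data.reverse()  # reverse to get descending order

tf_sorted = {}
for i in range(len(data)):
    tf_sorted[data[i][1]] = data[i][0]  # rebuild the dictionary

font = r'C:\Users\ZZX\AppData\Local\Microsoft\Windows\Fonts\STZHONGS.TTF'
mask = np.array(Image.open("photo.png"))  # the image serves as both mask and color source
# width and height are ignored when a mask is supplied
mywc = wc(width=600, height=600, max_words=300, font_path=font,
          background_color='white', mask=mask).generate_from_frequencies(tf_sorted)
image_colors = ImageColorGenerator(mask)
plt.imshow(mywc.recolor(color_func=image_colors))  # recolor words from the image
plt.axis('off')  # hide the axes
plt.show()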