Using a TF-IDF Model to Compute Chinese and English Text Similarity
1. English Text Similarity
- Test documents (the imports used by all snippets below come first)
from gensim import corpora, models, similarities
import numpy as np

documents = [
"Is there anything good playing?",
"let's meet at the movie theater entrance tonight. Don't be late.",
"Are you going to the movie theater with me tonight?",
"I get a lump in my throat whenever I see a tragic movie.",
"you're just too emotional.",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
]
- Remove stop words; English does not need a word-segmentation step
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
print(texts)
[['is', 'there', 'anything', 'good', 'playing?'], ["let's", 'meet', 'at', 'movie', 'theater', 'entrance', 'tonight.', "don't", 'be', 'late.'], ['are', 'you', 'going', 'movie', 'theater', 'with', 'me', 'tonight?'], ['i', 'get', 'lump', 'my', 'throat', 'whenever', 'i', 'see', 'tragic', 'movie.'], ["you're", 'just', 'too', 'emotional.'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering']]
- Build the dictionary, i.e. assign each token a unique integer ID
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)
Dictionary(44 unique tokens: ['anything', 'good', 'is', 'playing?', 'there']...)
{'anything': 0, 'good': 1, 'is': 2, 'playing?': 3, 'there': 4, 'at': 5, 'be': 6, "don't": 7, 'entrance': 8, 'late.': 9, "let's": 10, 'meet': 11, 'movie': 12, 'theater': 13, 'tonight.': 14, 'are': 15, 'going': 16, 'me': 17, 'tonight?': 18, 'with': 19, 'you': 20, 'get': 21, 'i': 22, 'lump': 23, 'movie.': 24, 'my': 25, 'see': 26, 'throat': 27, 'tragic': 28, 'whenever': 29, 'emotional.': 30, 'just': 31, 'too': 32, "you're": 33, 'graph': 34, 'intersection': 35, 'paths': 36, 'trees': 37, 'iv': 38, 'minors': 39, 'ordering': 40, 'quasi': 41, 'well': 42, 'widths': 43}
- Convert the tokenized documents into sparse bag-of-words vectors
doc2bow represents each document as a list of (token_id, count) pairs; in the corpus output below, (0, 1) means the token with ID 0 occurs once in that document.
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(12, 1), (13, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)], [(21, 1), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(30, 1), (31, 1), (32, 1), (33, 1)], [(34, 1), (35, 1), (36, 1), (37, 1)], [(34, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]]
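To make the pairs concrete, a bag-of-words vector can be decoded back into tokens with the dictionary built above (a small illustrative sketch):
# decode the first document's BoW vector into (token, count) pairs
print([(dictionary[token_id], count) for token_id, count in corpus[0]])
# [('anything', 1), ('good', 1), ('is', 1), ('playing?', 1), ('there', 1)]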
- Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
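To see what the model produces, the weights of a single document can be inspected; note that gensim's TfidfModel L2-normalizes each document vector by default:
# print each token of document 0 with its TF-IDF weight
for token_id, weight in tfidf[corpus[0]]:
    print(dictionary[token_id], round(weight, 4))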
- Query with a test sentence. Note that doc2bow silently ignores tokens that are not in the dictionary ('I' and 'like' here, since the training texts were lowercased), so vec_bow ends up holding only the ID for 'movie'.
query = 'I like movie'
vec_bow = dictionary.doc2bow(query.split())
print(vec_bow)
vec_tfidf = tfidf[vec_bow]
print(vec_tfidf)
- Compute the cosine similarity between the query and every document in the corpus (MatrixSimilarity builds a dense in-memory index, which is fine for a corpus this small)
similarity = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims = similarity[vec_tfidf]
print(sims)
As you can see, the query is similar to documents 1 and 2 (indexing starts at 0).
[0. 0.21666655 0.24635962 0. 0. 0. 0. ]
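If more than the single best match is needed, the scores can be ranked with plain Python (a small sketch):
# rank all documents by similarity, highest first
sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims_sorted)  # [(2, 0.2463...), (1, 0.2166...), (0, 0.0), ...]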
- Extract the highest similarity score
max_loc = np.argmax(sims)
print(max_loc)
max_sim = sims[max_loc]
print(max_sim)
Now that we have the index, we can map it back to the original documents list to recover the sentence, as shown below; task complete.
2
0.24635962
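The index maps straight back into the documents list:
print(documents[max_loc])
# Are you going to the movie theater with me tonight?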
Complete code
from gensim import corpora, models, similarities
import numpy as np
documents = [
"Is there anything good playing?",
"let's meet at the movie theater entrance tonight. Don't be late.",
"Are you going to the movie theater with me tonight?",
"I get a lump in my throat whenever I see a tragic movie.",
"you're just too emotional.",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
query = 'I like movie'
vec_bow = dictionary.doc2bow(query.split())
vec_tfidf = tfidf[vec_bow]
similarity = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims = similarity[vec_tfidf]
max_loc = np.argmax(sims)
print(max_loc)
max_sim = sims[max_loc]
print(max_sim)
print(documents[max_loc])
2. Chinese Text Similarity
The detailed steps were covered above, so here is the code directly; the only extra step for Chinese is word segmentation, everything else is the same.
import jieba
import numpy as np
from gensim import corpora, models, similarities
def _get_stop_words():
    """
    Read the stop-words file.
    :return: a set of stop words
    """
    with open('stop_words.txt', encoding='utf-8') as f:
        stopwords = f.read()
    # 'shi' is a noise token in the test data below; add it to the stop list manually
    return set(stopwords.split()) | {'shi'}
def delete_stopwords(documents):
    """
    Segment each document with jieba and remove stop words.
    :return: a list of token lists
    """
    stopwords_list = _get_stop_words()
    # precise-mode segmentation (cut_all=False) with HMM for out-of-vocabulary words
    cut_words_list = [jieba.lcut(i, cut_all=False, HMM=True) for i in documents]
    del_stop_words_list = []
    for word_list in cut_words_list:
        del_stop_words_list.append([word for word in word_list if word not in stopwords_list])
    return del_stop_words_list
documents = [
"今天去打篮球吗?",
"明天晚上八点半的电影,准时到shi",
"最近有新上映的电影,挺好看的,改天去看吗?",
"明天天气要下雨.",
"今天太热了.",
"工作好累啊,不想努力了",
"跟着自己的内心走",
]
texts = delete_stopwords(documents)
print(texts)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
query = '好想 去 电影院 看 电影'  # pre-segmented by hand; see the jieba note below
vec_bow = dictionary.doc2bow(query.split())
vec_tfidf = tfidf[vec_bow]
similarity = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims = similarity[vec_tfidf]
max_loc = np.argmax(sims)
print(max_loc)
max_sim = sims[max_loc]
[['今天', '去', '打篮球'], ['明天', '八点半', '电影', '准时'], ['新', '上映', '电影', '挺', '好看', '改天', '去', '看'], ['明天', '天气', '下雨'], ['今天', '太热'], ['好累', '不想'], ['跟着', '内心', '走']]
2
This shows that the query '好想 去 电影院 看 电影' is most similar to the document "最近有新上映的电影,挺好看的,改天去看吗?".
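In practice the query would be segmented the same way as the corpus instead of by hand, for example:
# segment the raw query with jieba before building its BoW vector
query = '好想去电影院看电影'
vec_bow = dictionary.doc2bow(jieba.lcut(query))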
Summary
In practice, we can set a similarity threshold to filter out texts whose similarity is too low, as sketched below.
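A minimal sketch, assuming a hypothetical cutoff of 0.2:
# keep only documents whose similarity to the query clears the threshold
threshold = 0.2  # hypothetical value; tune it for your data
hits = [(i, float(s)) for i, s in enumerate(sims) if s >= threshold]
print(hits)  # [(index, score), ...] pairs above the cutoff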