Using a TF-IDF Model to Compute Chinese and English Text Similarity
1. English Text Similarity
- Test documents (the imports used by all snippets below come first)
from gensim import corpora, models, similarities
import numpy as np

documents = [
"Is there anything good playing?",
"let's meet at the movie theater entrance tonight. Don't be late.",
"Are you going to the movie theater with me tonight?",
"I get a lump in my throat whenever I see a tragic movie.",
"you're just too emotional.",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
]
- Remove stop words; English does not need a word-segmentation step
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
print(texts)
[['is', 'there', 'anything', 'good', 'playing?'], ["let's", 'meet', 'at', 'movie', 'theater', 'entrance', 'tonight.', "don't", 'be', 'late.'], ['are', 'you', 'going', 'movie', 'theater', 'with', 'me', 'tonight?'], ['i', 'get', 'lump', 'my', 'throat', 'whenever', 'i', 'see', 'tragic', 'movie.'], ["you're", 'just', 'too', 'emotional.'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering']]
- Build the dictionary, i.e. assign each token a unique integer ID
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)
Dictionary(44 unique tokens: ['anything', 'good', 'is', 'playing?', 'there']...)
{'anything': 0, 'good': 1, 'is': 2, 'playing?': 3, 'there': 4, 'at': 5, 'be': 6, "don't": 7, 'entrance': 8, 'late.': 9, "let's": 10, 'meet': 11, 'movie': 12, 'theater': 13, 'tonight.': 14, 'are': 15, 'going': 16, 'me': 17, 'tonight?': 18, 'with': 19, 'you': 20, 'get': 21, 'i': 22, 'lump': 23, 'movie.': 24, 'my': 25, 'see': 26, 'throat': 27, 'tragic': 28, 'whenever': 29, 'emotional.': 30, 'just': 31, 'too': 32, "you're": 33, 'graph': 34, 'intersection': 35, 'paths': 36, 'trees': 37, 'iv': 38, 'minors': 39, 'ordering': 40, 'quasi': 41, 'well': 42, 'widths': 43}
- Convert the tokenized documents into sparse bag-of-words vectors
doc2bow represents each document as a list of (token_id, count) pairs; in the corpus output below, (0, 1) means the token with ID 0 occurs once in that document.
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], [(5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)], [(12, 1), (13, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1)], [(21, 1), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)], [(30, 1), (31, 1), (32, 1), (33, 1)], [(34, 1), (35, 1), (36, 1), (37, 1)], [(34, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1)]]
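To make the pairs concrete, a bag-of-words vector can be decoded back into tokens with the dictionary built above (a small illustrative sketch):
# decode the first document's BoW vector into (token, count) pairs
print([(dictionary[token_id], count) for token_id, count in corpus[0]])
# [('anything', 1), ('good', 1), ('is', 1), ('playing?', 1), ('there', 1)]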
- Train the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
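To see what the model produces, the weights of a single document can be inspected; note that gensim's TfidfModel L2-normalizes each document vector by default:
# print each token of document 0 with its TF-IDF weight
for token_id, weight in tfidf[corpus[0]]:
    print(dictionary[token_id], round(weight, 4))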
- Query with a test sentence. Note that doc2bow silently ignores tokens that are not in the dictionary ('I' and 'like' here, since the training texts were lowercased), so vec_bow ends up holding only the ID for 'movie'.
query = 'I like movie'
vec_bow = dictionary.doc2bow(query.split())
print(vec_bow)
vec_tfidf = tfidf[vec_bow]
print(vec_tfidf)
- Compute the cosine similarity between the query and every document in the corpus (MatrixSimilarity builds a dense in-memory index, which is fine for a corpus this small)
similarity = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims = similarity[vec_tfidf]
print(sims)
As you can see, the query is similar to documents 1 and 2 (indexing starts at 0).
[0. 0.21666655 0.24635962 0. 0. 0. 0. ]
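If more than the single best match is needed, the scores can be ranked with plain Python (a small sketch):
# rank all documents by similarity, highest first
sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims_sorted)  # [(2, 0.2463...), (1, 0.2166...), (0, 0.0), ...]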
- Extract the highest similarity score
max_loc = np.argmax(sims)
print(max_loc)
max_sim = sims[max_loc]
print(max_sim)
Now that we have the index, we can map it back to the original documents list to recover the sentence, as shown below; task complete.
2
0.24635962
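The index maps straight back into the documents list:
print(documents[max_loc])
# Are you going to the movie theater with me tonight?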
Complete code
from gensim import corpora, models, similarities
import numpy as np
documents = [
"Is there anything good playing?",
"let's meet at the movie theater entrance tonight. Don't be late.",
"Are you going to the movie theater with me tonight?",
"I get a lump in my throat whenever I see a tragic movie.",
"you're just too emotional.",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)
dictionary = corpora.Dictionary(texts)
print(dictionary)
print(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
query = 'I like movie'
vec_bow = dictionary.doc2bow(query.split())
vec_tfidf = tfidf[vec_bow]
similarity = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims = similarity[vec_tfidf]
max_loc = np.argmax(sims)
print(max_loc)
max_sim = sims[max_loc]
print(max_sim)
print(documents[max_loc])
2. Chinese Text Similarity
The detailed steps were covered above, so here is the code directly; the only extra step for Chinese is word segmentation, everything else is the same.
import jieba
import numpy as np
from gensim import corpora, models, similarities
def _get_stop_words():
    """
    Read the stop-words file.
    :return: a set of stop words
    """
    with open('stop_words.txt', encoding='utf-8') as f:
        stopwords = f.read()
    # 'shi' is a noise token in the test data below; add it to the stop list manually
    return set(stopwords.split()) | {'shi'}
def delete_stopwords(documents):
    """
    Segment each document with jieba and remove stop words.
    :return: a list of token lists
    """
    stopwords_list = _get_stop_words()
    # precise-mode segmentation (cut_all=False) with HMM for out-of-vocabulary words
    cut_words_list = [jieba.lcut(i, cut_all=False, HMM=True) for i in documents]
    del_stop_words_list = []
    for word_list in cut_words_list:
        del_stop_words_list.append([word for word in word_list if word not in stopwords_list])
    return del_stop_words_list
documents = [
"今天去打篮球吗?",
"明天晚上八点半的电影,准时到shi",
"最近有新上映的电影,挺好看的,改天去看吗?",
"明天天气要下雨.",
"今天太热了.",
"工作好累啊,不想努力了",
"跟着自己的内心走",
]
texts = delete_stopwords(documents)
print(texts)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
query = '好想 去 电影院 看 电影'  # pre-segmented by hand; see the jieba note below
vec_bow = dictionary.doc2bow(query.split())
vec_tfidf = tfidf[vec_bow]
similarity = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
sims = similarity[vec_tfidf]
max_loc = np.argmax(sims)
print(max_loc)
max_sim = sims[max_loc]
[['今天', '去', '打篮球'], ['明天', '八点半', '电影', '准时'], ['新', '上映', '电影', '挺', '好看', '改天', '去', '看'], ['明天', '天气', '下雨'], ['今天', '太热'], ['好累', '不想'], ['跟着', '内心', '走']]
2
This shows that the query '好想 去 电影院 看 电影' is most similar to the document "最近有新上映的电影,挺好看的,改天去看吗?".
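In practice the query would be segmented the same way as the corpus instead of by hand, for example:
# segment the raw query with jieba before building its BoW vector
query = '好想去电影院看电影'
vec_bow = dictionary.doc2bow(jieba.lcut(query))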
Summary
In practice, we can set a similarity threshold to filter out texts whose similarity is too low, as sketched below.
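A minimal sketch, assuming a hypothetical cutoff of 0.2:
# keep only documents whose similarity to the query clears the threshold
threshold = 0.2  # hypothetical value; tune it for your data
hits = [(i, float(s)) for i, s in enumerate(sims) if s >= threshold]
print(hits)  # [(index, score), ...] pairs above the cutoff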