Contents
Introduction
Method 1: SpaCy
Method 2: Sentence Transformers
Method 3: scipy
Method 4: torch
Method 5: TFHub Universal Sentence Encoder
Method 6: TF-IDF
References
Introduction
Most of the libraries below are good choices for semantic similarity comparison. Their pretrained models let you generate word or sentence vectors and compare those, instead of comparing words directly.
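Every method below ultimately boils down to cosine similarity between vectors. As a minimal sketch of that computation (the vectors here are made-up toy values, not real embeddings):

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))  # 1.0, same direction
print(cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))  # 0.0, orthogonal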
Method 1: SpaCy
References
Linguistic Features · spaCy Usage Documentation
Download the model
To use en_core_web_md, download it with python -m spacy download en_core_web_md; to use en_core_web_lg, download it with python -m spacy download en_core_web_lg. The large model is about 830 MB and noticeably slower, so the medium model is a good choice.
python -m spacy download en_core_web_lg
Code
import spacy

# Load the large English model (use "en_core_web_md" for the medium one).
nlp = spacy.load("en_core_web_lg")
doc1 = nlp(u'the person wear red T-shirt')
doc2 = nlp(u'this person is walking')
doc3 = nlp(u'the boy wear red T-shirt')
# Doc.similarity returns the cosine similarity of the averaged word vectors.
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))
Result
0.7003971105290047
0.9671912343259517
0.6121211244876517
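If you prefer the smaller medium model mentioned above, the same API works unchanged; a sketch assuming en_core_web_md has already been downloaded (scores will differ slightly from the large model):

import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp(u'the person wear red T-shirt')
doc3 = nlp(u'the boy wear red T-shirt')
# similarity() is the cosine similarity of the averaged token vectors.
print(doc1.similarity(doc3))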
Method 2: Sentence Transformers
References
GitHub - UKPLab/sentence-transformers: Multilingual Sentence & Image Embeddings with BERT
Semantic Textual Similarity — Sentence-Transformers documentation
Code
Install the library first; the pretrained embedding model is downloaded automatically the first time it is used.
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
sentences = [
    'the person wear red T-shirt',
    'this person is walking',
    'the boy wear red T-shirt'
]
# encode() returns one fixed-size embedding vector per sentence (a numpy array).
sentence_embeddings = model.encode(sentences)
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")
Output
Sentence: the person wear red T-shirt
Embedding: [ 1.31643847e-01 -4.20616418e-01 ... 8.13076794e-01 -4.64620918e-01]
Sentence: this person is walking
Embedding: [-3.52878094e-01 -5.04286848e-02 ... -2.36091137e-01 -6.77282438e-02]
Sentence: the boy wear red T-shirt
Embedding: [-2.36365378e-01 -8.49713564e-01 ... 1.06414437e+00 -2.70157874e-01]
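sentence-transformers also ships a small utility for cosine similarity, so you can score the pairs without leaving the library. A sketch, assuming a recent version that exposes util.cos_sim (older releases call it util.pytorch_cos_sim):

from sentence_transformers import util

# 3x3 matrix of pairwise cosine similarities between the embeddings above.
cosine_scores = util.cos_sim(sentence_embeddings, sentence_embeddings)
print(cosine_scores[0][1])
print(cosine_scores[0][2])
print(cosine_scores[1][2])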
Method 3: scipy
Code
from scipy.spatial import distance

# Reuses sentence_embeddings from Method 2. distance.cosine returns the cosine
# distance (1 - similarity), so subtracting from 1 recovers the similarity.
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[1]))
print(1 - distance.cosine(sentence_embeddings[0], sentence_embeddings[2]))
print(1 - distance.cosine(sentence_embeddings[1], sentence_embeddings[2]))
Output
0.4643629193305969
0.9069876074790955
0.3275738060474396
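If you want all pairwise scores in one call rather than one pair at a time, scikit-learn's cosine_similarity computes the full matrix; a sketch reusing the embeddings from Method 2:

from sklearn.metrics.pairwise import cosine_similarity

# Entry [i][j] is the cosine similarity between sentence i and sentence j.
print(cosine_similarity(sentence_embeddings))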
Method 4: torch
Code
import torch

# Reuses sentence_embeddings from Method 2, converted to torch tensors.
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
b = torch.from_numpy(sentence_embeddings)
print(cos(b[0], b[1]))
print(cos(b[0], b[2]))
print(cos(b[1], b[2]))
Output
tensor(0.4644)
tensor(0.9070)
tensor(0.3276)
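With dim=1 the same module scores several pairs in a single batched call; a sketch reusing the tensor b from above:

import torch

# Compare sentence 0 against sentences 1 and 2 at once along the feature dimension.
cos_batch = torch.nn.CosineSimilarity(dim=1, eps=1e-6)
print(cos_batch(b[0].expand(2, -1), b[1:]))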
Method 5: TFHub Universal Sentence Encoder
https://tfhub.dev/google/universal-sentence-encoder/4
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
This model is large, about 1 GB, and appears slower than the other options. It also produces sentence embeddings.
Code
import tensorflow_hub as hub

# Downloads (and caches) the ~1 GB Universal Sentence Encoder model from TF Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "the person wear red T-shirt",
    "this person is walking",
    "the boy wear red T-shirt"
])
print(embeddings)

from scipy.spatial import distance
print(1 - distance.cosine(embeddings[0], embeddings[1]))
print(1 - distance.cosine(embeddings[0], embeddings[2]))
print(1 - distance.cosine(embeddings[1], embeddings[2]))
Output
tf.Tensor(
[[ 0.063188 0.07063895 -0.05998802 ... -0.01409875 0.01863449
0.01505797]
[-0.06786212 0.01993554 0.03236153 ... 0.05772103 0.01787272
0.01740014]
[ 0.05379306 0.07613157 -0.05256693 ... -0.01256405 0.0213196
-0.00262441]], shape=(3, 512), dtype=float32)
0.15320375561714172
0.8592830896377563
0.09080004692077637
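The linked colab scores sentences with a plain inner product, which works because the Universal Sentence Encoder embeddings are approximately unit-length; a sketch assuming eager execution (the TF2 default):

import numpy as np

# Approximate pairwise similarity matrix, as in the official example notebook.
print(np.inner(embeddings, embeddings))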
Other embeddings
https://github.com/facebookresearch/InferSent
GitHub - Tiiiger/bert_score: BERT score for text generation
Method 6: TF-IDF
Install the scikit-learn package
pip install scikit-learn
Code
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I 'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
print(vect.vocabulary_)
print('tfidf:', tfidf)
# TfidfVectorizer L2-normalizes each row by default, so the dot product of the
# matrix with its transpose is the pairwise cosine-similarity matrix.
pairwise_similarity = tfidf * tfidf.T
print("pairwise_similarity:", pairwise_similarity)
print(pairwise_similarity.toarray())

import numpy as np
arr = pairwise_similarity.toarray()
# Mask the diagonal so a document does not match itself.
np.fill_diagonal(arr, np.nan)
input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)
print(input_idx)
# Index of the most similar other document.
result_idx = np.nanargmax(arr[input_idx])
print(corpus[result_idx])
Output
{'like': 9, 'apple': 0, 'day': 4, 'keeps': 7, 'doctor': 6, 'away': 1, 'compare': 3, 'orange': 10, 'prefer': 11, 'scikit': 12, 'learn': 8, 'docs': 5, 'blue': 2}
tfidf: (0, 0) 0.5564505207186616
(0, 9) 0.830880748357988
(1, 1) 0.4741246485558491
(1, 6) 0.4741246485558491
(1, 7) 0.4741246485558491
(1, 4) 0.4741246485558491
(1, 0) 0.31752680284846835
(2, 10) 0.48624041659157047
(2, 3) 0.7260444301457811
(2, 0) 0.48624041659157047
(3, 8) 0.4864843177105593
(3, 12) 0.4864843177105593
(3, 11) 0.6029847724484662
(3, 10) 0.40382592962643526
(4, 2) 0.516373967614865
(4, 5) 0.516373967614865
(4, 8) 0.4166072657167829
(4, 12) 0.4166072657167829
(4, 10) 0.3458216642191991
pairwise_similarity: (0, 2) 0.27056873300683837
(0, 1) 0.17668795478716204
(0, 0) 0.9999999999999998
(1, 2) 0.1543943648960287
(1, 0) 0.17668795478716204
(1, 1) 0.9999999999999999
(2, 1) 0.1543943648960287
(2, 0) 0.27056873300683837
(2, 4) 0.16815247007633355
(2, 3) 0.1963564882520361
(2, 2) 1.0
(3, 2) 0.1963564882520361
(3, 4) 0.5449975578692606
(3, 3) 0.9999999999999999
(4, 2) 0.16815247007633355
(4, 3) 0.5449975578692606
(4, 4) 1.0
[[1. 0.17668795 0.27056873 0. 0. ]
[0.17668795 1. 0.15439436 0. 0. ]
[0.27056873 0.15439436 1. 0.19635649 0.16815247]
[0. 0. 0.19635649 1. 0.54499756]
[0. 0. 0.16815247 0.54499756 1. ]]
4
I prefer scikit-learn to Orange
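To score a new query that is not in the corpus, reuse the fitted vectorizer with transform instead of refitting; a sketch with a made-up query string:

query = "an apple a day"                              # hypothetical new document
query_vec = vect.transform([query])                   # reuse the vocabulary learned above
scores = (tfidf * query_vec.T).toarray().ravel()      # cosine similarities against each corpus document
print(corpus[int(np.argmax(scores))])                 # most similar corpus document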
The advantage of this method is speed.
References
python - How to compute the similarity between two text documents? - Stack Overflow
References
How to compute the similarity between two text documents?
https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity
https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632
scipy.spatial.distance.cosine — SciPy v0.14.0 Reference Guide
https://www.tensorflow.org/api_docs/python/tf/keras/losses/CosineSimilarity
deep learning - is there a way to check similarity between two full sentences in python? - Stack Overflow
NLP Town