Jaccard similarity
The Jaccard similarity $J(A,B)$ measures how similar two finite sample sets $A$ and $B$ are:
$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}=\frac{|A\cap B|}{|A|+|B|-|A\cap B|}$$
Jaccard distance:
$$d_j(A,B)=1-J(A,B)=\frac{|A\cup B|-|A\cap B|}{|A\cup B|}=\frac{|A\,\triangle\,B|}{|A\cup B|}$$
where $A\,\triangle\,B$ is the symmetric difference of $A$ and $B$.
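When $A=B$, the Jaccard similarity is 1; when $|A\cap B|=0$, the Jaccard similarity is 0.
The Jaccard similarity takes values in $[0,1]$, and a larger value means the two sets are more similar. The code is as follows: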
```python
import jieba

def Jaccard(words1, words2):
    # Segment both texts with jieba and keep the distinct tokens
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # |A ∩ B|
    interNum = len(words1_cut & words2_cut)
    # |A ∩ B| / (|A| + |B| - |A ∩ B|); "/" is true division in Python 3
    return interNum / (len(words1_cut) + len(words2_cut) - interNum)
```
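The distance $d_j$ defined earlier has no code of its own, so here is a minimal sketch of it. It assumes the texts are already tokenized (no jieba call), and the function name `jaccard_distance` and the sample tokens are hypothetical. Treating two empty inputs as distance 0 is also an assumption, chosen to agree with the $A=B$ case.

```python
def jaccard_distance(a, b):
    """Jaccard distance |A Δ B| / |A ∪ B| for two iterables of tokens."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as identical (distance 0)
        return 0.0
    # The ^ operator on sets is the symmetric difference A Δ B
    return len(set_a ^ set_b) / len(union)


# Hypothetical pre-tokenized inputs: 2 shared tokens, 7 distinct tokens overall
tokens_a = ["I", "like", "natural", "language", "processing"]
tokens_b = ["I", "like", "machine", "learning"]
print(jaccard_distance(tokens_a, tokens_b))
```

With 2 shared tokens out of 7 distinct ones, the symmetric difference has 5 elements, so the distance is 5/7 and the similarity is 2/7.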
Cosine similarity
$$\cos(X,Y)=\frac{X\cdot Y}{|X|\,|Y|}$$
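The cosine formula comes without code, so the following is one possible sketch rather than the author's implementation. It assumes $X$ and $Y$ are term-frequency vectors built from pre-tokenized text; the function name `cosine_similarity` and the choice to return 0 for an empty input are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens1, tokens2):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    vec1, vec2 = Counter(tokens1), Counter(tokens2)
    # Dot product X·Y: a Counter returns 0 for tokens it has not seen
    dot = sum(count * vec2[token] for token, count in vec1.items())
    # Euclidean norms |X| and |Y|
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        # Assumption: similarity with an empty text is defined as 0
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical inputs: vectors (a=1, b=2) and (b=1, c=1)
print(cosine_similarity(["a", "b", "b"], ["b", "c"]))
```

Unlike the set-based measures in this post, this version takes repeated tokens into account, because the vectors store counts instead of membership.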
Dice coefficient
$$s=\frac{2\,|A\cap B|}{|A|+|B|}$$
```python
import jieba

def Dice(words1, words2):
    # Segment both texts with jieba and keep the distinct tokens
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # |A ∩ B|
    interNum = len(words1_cut & words2_cut)
    # 2|A ∩ B| / (|A| + |B|); "/" is true division in Python 3
    return 2 * interNum / (len(words1_cut) + len(words2_cut))
```
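The Dice coefficient and the Jaccard similarity are built from the same quantities, so one can be written in terms of the other. The short derivation below is an added illustration, not part of the original post. It uses $|A|+|B|=|A\cup B|+|A\cap B|$ and assumes $|A\cup B|>0$.

$$s=\frac{2\,|A\cap B|}{|A|+|B|}=\frac{2\,|A\cap B|}{|A\cup B|+|A\cap B|}=\frac{2\,J(A,B)}{1+J(A,B)}$$

Since $2J/(1+J)\ge J$ on $[0,1]$, the Dice coefficient is never smaller than the Jaccard similarity for the same pair of sets, and the two rank pairs of sets in the same order.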
Overlap coefficient
$$\mathrm{overlap}(X,Y)=\frac{|X\cap Y|}{\min(|X|,|Y|)}$$
```python
import jieba

def overlap(words1, words2):
    # Segment both texts with jieba and keep the distinct tokens
    words1_cut, words2_cut = set(jieba.cut(words1)), set(jieba.cut(words2))
    # |X ∩ Y|
    interNum = len(words1_cut & words2_cut)
    # |X ∩ Y| / min(|X|, |Y|); the formula has no factor of 2
    return interNum / min(len(words1_cut), len(words2_cut))
```
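To see how the three set-based measures differ on the same input, the sketch below computes them side by side, following the formulas as written above. It assumes pre-tokenized, non-empty inputs; the helper name `set_coefficients` and the sample tokens are hypothetical.

```python
def set_coefficients(tokens1, tokens2):
    """Return (Jaccard, Dice, overlap) for two non-empty token collections."""
    a, b = set(tokens1), set(tokens2)
    inter = len(a & b)
    jaccard = inter / len(a | b)
    dice = 2 * inter / (len(a) + len(b))
    overlap = inter / min(len(a), len(b))
    return jaccard, dice, overlap


# Hypothetical inputs: the short text is fully contained in the long one
short_text = ["deep", "learning"]
long_text = ["deep", "learning", "for", "text", "matching"]
print(set_coefficients(short_text, long_text))
```

Here the shorter token set is a subset of the longer one, so the overlap coefficient is 1 while the Jaccard similarity is only 2/5 and the Dice coefficient is 4/7. This is why the overlap coefficient suits cases where containment, not equality, should count as a full match.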