一、DBSCAN的不足
DBSCAN 是基于密度聚类的代表性方法,可以识别任意形状的簇和噪音点。它的两个输入参数Eps和MinPts是全局参数,使得DBSCAN不能识别不同密度的簇。对于高密度簇的核心点,在较小的Eps邻域内就可以有至少MinPts个点;对于低密度簇的核心点,在较大的Eps邻域内才可以有MinPts个点。如下图所示,基于全局参数的DBSCAN聚类结果更倾向于A、B和C。
二、OPTICS 的相关概念
OPTICS算法解决了DBSCAN不能识别多密度簇的问题。 在DBSCAN的相关定义上,OPTICS引进了两个新的定义。
1.p的核距离 2.p关于o的可达距离
1.对象p的核距离
c
_
d
i
s
t
(
p
)
=
{
U
n
d
e
f
i
n
e
d
i
f
?
C
a
r
d
(
N
?
(
p
)
)
<
M
i
n
P
t
s
M
i
n
P
t
s
_
d
i
s
t
(
p
)
i
f
?
C
a
r
d
(
N
?
(
p
)
)
>
=
M
i
n
P
t
s
c\_dist(p)=\left\{ \begin{aligned} Undefined\quad\quad if\ Card(N_\epsilon(p))<MinPts \\ MinPts\_dist(p)\quad\quad if\ Card(N_\epsilon(p))>=MinPts \\ \end{aligned} \right.
c_dist(p)={Undefinedif?Card(N??(p))<MinPtsMinPts_dist(p)if?Card(N??(p))>=MinPts? 其中
C
a
r
d
(
N
?
(
p
)
Card(N_\epsilon(p)
Card(N??(p)是
?
\epsilon
?邻域内点的个数。
M
i
n
P
t
s
_
d
i
s
t
(
p
)
MinPts\_dist(p)
MinPts_dist(p)是p在
?
\epsilon
?邻域内成为核心点的最小距离。
2.p关于o的可达距离
r
_
d
i
s
t
(
p
)
=
{
U
n
d
e
f
i
n
e
d
i
f
?
C
a
r
d
(
N
?
(
o
)
)
<
M
i
n
P
t
s
m
a
x
(
c
_
d
i
s
t
(
p
)
,
d
i
s
t
(
p
,
o
)
)
i
f
?
C
a
r
d
(
N
?
(
o
)
)
>
=
M
i
n
P
t
s
r\_dist(p)=\left\{ \begin{aligned} Undefined\quad\quad if\ Card(N_\epsilon(o))<MinPts \\ max(c\_dist(p),dist(p,o))\quad\quad if\ Card(N_\epsilon(o))>=MinPts \\ \end{aligned} \right.
r_dist(p)={Undefinedif?Card(N??(o))<MinPtsmax(c_dist(p),dist(p,o))if?Card(N??(o))>=MinPts? 如果
C
a
r
d
(
N
?
(
p
)
)
<
M
i
n
P
t
s
Card(N_\epsilon(p))<MinPts
Card(N??(p))<MinPts,那么P在
?
\epsilon
?邻域内不可能成为核心点。 从o到p的可达距离不可能小于o的核心距离,否则,从o不能直接密度可达p。下图说明了核距离和可达距离的定义。
三、OPTICS伪代码
1.初始化 ordered_objects=[] //用于保存 r_dists=[] //用于保存每个对象到与它最近的核心点的可达距离 c_dists=[] //用于保存每个对象的核距离 orderedseeds=[] //
2.伪代码 OPTICS(setofObjects,Eps,MinPts)
\quad
For
i
i
i From 1 To setofObjects.size
\quad
\quad
object=setofObjects.get(
i
i
i) //获得第
i
i
i个对象
\quad
\quad
如果object未访问
\quad
\quad
\quad
调用函数ExpandClusterOrder(setofObjects,object,Eps,MinPts,) end
ExpandClusterOrder(setofObjects,object,Eps,MinPts)
\quad
获得object的Eps邻域neighbors//
\quad
将object的访问状态改为已访问
\quad
令object的可达距离r_dist=undefined //假定undefined比任何定义的距离都大
\quad
计算object.c_dist//
\quad
ordered_objects.append(object)//将object加入Ordered_Objects中
\quad
r_dists.append(r_dist)
\quad
c_dists.append(c_dist)
\quad
如果 object的c_dist != undefined //说明object关于Eps,MinPts是核心点
\quad\quad
调用函数update_orderSeeds(neighbors,object,c_dist)//更新orderSeeds
\quad \quad
WHILE orderSeeds NOT Empty
\quad \quad\quad
current_object=orderSeeds.first()//从orderSeeds中弹出第一个元素
\quad \quad\quad
获得current_object的Eps邻域//
\quad \quad\quad
将current_object的访问状态改为已访问//
\quad \quad\quad
计算current_object.c_dist
\quad \quad\quad
ordered_objects.append(current_object)
\quad \quad\quad
r_dists.append(curent_object.r_dist)
\quad \quad\quad
c_dists.append(curent_object.c_dist) //
\quad \quad\quad
如果 current_object.c_dist不等于undefined //
\quad \quad\quad\quad
调用函数update_orderSeeds(neighbors,current_object,c_dist) end
update_orderSeeds(neighbors,center_object,c_dist)
\quad
For all object FROM neighbors
\quad \quad
如果object未访问
\quad \quad\quad
new_r_dist=max(c_dist,dist(object,center_object))
\quad\quad\quad
如果object.r_dist=undefined
\quad\quad\quad\quad
令object.r_dist=new_r_dist
\quad\quad\quad\quad
将object根据r_dist插入到orderSeed中的适当位置//orderSeeds是按r_dist从小到大排列的
\quad\quad\quad
否则//说明object已经在orderSeed中
\quad\quad\quad\quad
如果new_r_dist<object.r_dist
\quad\quad\quad\quad\quad
object.r_dist=new_r_dist
\quad\quad\quad\quad\quad
将object根据r_dist插入到orderSeeds中的适当位置 end
ExtractDBSCAN-Clustering(ordered_objects,Eps1,MinPts) //这里要求Eps1<=Eps
\quad
k=1
\quad
clusterID=-1
\quad
For
i
i
i FROM 1 to ordered_objects.size
\quad\quad
获得ordered_objects第
i
i
i个对象object
\quad\quad
如果object.r_dist>Eps1 //假定undefined>Eps
\quad\quad\quad
如果object.c_dist<=Eps1
\quad\quad\quad\quad
clusterID=k
\quad\quad\quad\quad
m_i=clusterID //m_i表示第I个对象所属的簇类型
\quad\quad\quad\quad
k+=1
\quad\quad\quad
否则
\quad\quad\quad\quad
m_i=-1// -1表示是噪音点
\quad\quad
否则
\quad\quad\quad
m_i=clusterID end
四、python 代码
ordered=[]
c_dists=[]
r_dists=[]
orderedSeeds=[]
orderedSeeds_rdist=[]
def optics(X,eps,minpts):
unvisited=(X.copy()).tolist()
for i in range(len(X)):
obj=X[i]
if obj.tolist() in unvisited:
expandClusterOrder(X,obj,eps,minpts,unvisited)
def expandClusterOrder(X,obj,eps,minpts,unvisited):
neighbor=computeNeighbor(X,obj,eps)
unvisited.remove(obj.tolist())
r_dist=float('inf')
c_dist=computeC_dsit(neighbor,obj,minpts)
ordered.append(obj)
r_dists.append(r_dist)
c_dists.append(c_dist)
if c_dist!=float('inf'):
update_orderedSeeds(neighbor,obj,c_dist,unvisited)
while orderedSeeds:
cur_obj=np.array(orderedSeeds.pop(0))
r_dist=orderedSeeds_rdist.pop(0)
neighbor=computeNeighbor(X,cur_obj,eps)
if cur_obj.tolist() in unvisited:
unvisited.remove(cur_obj.tolist())
c_dist=computeC_dsit(neighbor,cur_obj,minpts)
ordered.append(cur_obj)
r_dists.append(r_dist)
c_dists.append(c_dist)
if c_dist!=float('inf'):
update_orderedSeeds(neighbor,cur_obj,c_dist,unvisited)
def computeNeighbor(X,obj,eps):
"""返回array形式"""
res=X[np.linalg.norm(X-obj,axis=1)<=eps]
return res
def computeC_dsit(neighbor,obj,minpts):
if len(neighbor)>=minpts:
n_dist=np.linalg.norm(neighbor-obj,axis=1)
c_dist=n_dist[np.argsort(n_dist)[minpts-1]]
return c_dist
else:
return float('inf')
def computeDist(obj1,obj2):
return np.linalg.norm(obj1-obj2)
def update_orderedSeeds(neighbor,obj,c_dist,unvisited):
for j in range(len(neighbor)):
cur_obj=neighbor[j]
if cur_obj.tolist() in unvisited:
new_r_dist=max(c_dist,computeDist(cur_obj,obj))
if cur_obj.tolist() not in orderedSeeds:
r_dist=new_r_dist
orderedSeeds_rdist.append(r_dist)
orderedSeeds_rdist.sort()
orderedSeeds.insert(orderedSeeds_rdist.index(r_dist),cur_obj.tolist())
else:
old_r_dist=orderedSeeds_rdist[orderedSeeds.index(cur_obj.tolist())]
if new_r_dist<old_r_dist:
r_dist=new_r_dist
orderedSeeds.remove(cur_obj.tolist())
orderedSeeds_rdist.remove(old_r_dist)
orderedSeeds_rdist.append(r_dist)
orderedSeeds_rdist.sort()
orderedSeeds.insert(orderedSeeds_rdist.index(r_dist),cur_obj.tolist())
def extractDBSCAN_clustering(ordered,eps1,minpts):
res=[]
clusterID=-1
k=1
for i in range(len(ordered)):
obj=ordered[i]
if r_dists[i]==float('inf') or r_dists[i]>eps1:
if c_dists[i]<=eps1:
clusterID=k
k+=1
m_i=clusterID
else:
m_i=-1
else:
m_i=clusterID
res.append(m_i)
return res
五、应用到数据中
1.数据集一
from sklearn.datasets import make_blobs
center = [[1, 1], [-1, -1], [1, -1]]
cluster_std = 0.35
X, Y = make_blobs(n_samples=300, centers=center,
n_features=2, cluster_std=cluster_std, random_state=1)
plt.scatter(X[:,0],X[:,1],c=Y)
optics(X,2,5)
ordered=np.array(ordered)
plt.plot(r_dists)
从图中可以看出,数据集分为三类,有三个波谷。
clusterID=extractDBSCAN_clustering(ordered,0.4,5)
set(clusterID).
plt.scatter(ordered[:,0],ordered[:,1],c=clusterID)
数据集二
center = [[1, 1], [-1, -1], [2, -2]]
cluster_std = [0.35, 0.1, 0.8]
X2, Y2 = make_blobs(n_samples=300, centers=center,
n_features=2, cluster_std=cluster_std, random_state=0)
plt.scatter(X2[:,0],X2[:,1],c=Y2)
ordered=[]
c_dists=[]
r_dists=[]
orderedSeeds=[]
orderedSeeds_rdist=[]
optics(X2,2,5)
ordered=np.array(ordered)
plt.plot(r_dists)
clusterID=extractDBSCAN_clustering(ordered,0.7,5)
set(clusterID)
plt.scatter(ordered[:,0],ordered[:,1],c=clusterID)
六、简述OPTICS的思想
给定MinPts,可以求出对象o成为核心点的最小邻域。该最小邻域所对应的半径就是核距离(core_distance)。给定Epsilon,如果core_distance < =Epsilon,那么对象o关于Eps,MinPts是核心点。那么如何判断对象o形成一个新的簇,还是属于已存在的某一簇中?此时就要看核心点o到与它最近的核心点p的距离(就是可达距离reachability_distance),如果reachability_distance<=Epsilon,那么就把核心点o归到p所在的簇中,否则以核心点o建立一个新的簇。
|