1. What Is Clustering

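Cluster analysis, or clustering, is an unsupervised machine learning task. It involves automatically discovering natural groupings in data. Unlike supervised learning (such as predictive modeling), clustering algorithms interpret only the input data and find natural groups, or clusters, in feature space.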

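A cluster is often a dense region in feature space where examples from the domain (observations, or rows of data) are closer to that cluster than to any other cluster. A cluster may have a center (the centroid), which is either a sample or a point in feature space, and it may have a boundary or extent.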
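To make the ideas of a centroid and an extent concrete, here is a minimal sketch. The points and labels are made up for illustration, and measuring the extent as the largest distance from the centroid is just one possible choice, not a standard definition.

```python
import numpy as np

# Hypothetical 2-D points with hand-assigned cluster labels (illustration only).
points = np.array([
    [0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],   # cluster 0
    [10.0, 10.0], [12.0, 10.0], [11.0, 13.0],         # cluster 1
])
labels = np.array([0, 0, 0, 0, 1, 1, 1])

centroids = {}
extents = {}
for k in np.unique(labels):
    members = points[labels == k]
    # The centroid is the mean of the cluster's members.
    centroids[k] = members.mean(axis=0)
    # One simple notion of extent: the largest distance from the centroid.
    extents[k] = np.linalg.norm(members - centroids[k], axis=1).max()
    print(k, centroids[k], extents[k])
```

Note that the centroid of cluster 0 is a point in feature space that is not itself one of the samples.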

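Clustering can be helpful as a data analysis activity for learning more about the problem domain, so-called pattern discovery or knowledge discovery. For example: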

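A phylogenetic tree can be considered the result of a manual cluster analysis;
Separating normal data from outliers or anomalies may be considered a clustering problem;
Separating customers into groups based on their natural behavior is a clustering problem, referred to as market segmentation.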

2. Clustering Algorithms

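There are many types of clustering algorithms. Many of them use similarity or distance measures between examples in feature space to discover dense regions of observations. For that reason, it is often good practice to scale the data before using a clustering algorithm.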
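As a rough illustration of why scaling matters for distance-based methods, the sketch below stretches one feature by an arbitrary factor of 1000 (an assumption made only for the demonstration) and then standardizes the data with scikit-learn's StandardScaler. Other scalers, such as MinMaxScaler, may suit some datasets better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# Exaggerate the scale of the second feature (illustrative factor).
X_skewed = X * np.array([1.0, 1000.0])

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_skewed)

print("std before scaling:", X_skewed.std(axis=0))
print("std after scaling: ", X_scaled.std(axis=0))
```

Without scaling, Euclidean distances on X_skewed would be dominated almost entirely by the second feature.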

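Some clustering algorithms require you to specify or guess the number of clusters to discover in the data, whereas others require you to specify a minimum distance between observations within which examples may be considered "close" or "connected". Cluster analysis is therefore an iterative process in which a subjective evaluation of the identified clusters is fed back into changes to the algorithm configuration until a desired or appropriate result is achieved. The scikit-learn library provides a suite of clustering algorithms to choose from. Ten of the more popular ones are listed below: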

Affinity Propagation
Agglomerative Clustering
BIRCH
DBSCAN
K-Means
Mini-Batch K-Means
Mean Shift
OPTICS
Spectral Clustering
Gaussian Mixture

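Each algorithm takes a different approach to the challenge of discovering natural groups in data. There is no best clustering algorithm, and no easy way to find the best algorithm for a given dataset.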
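Because no single algorithm is best, one practical approach is to run several of them on the same data and compare what they find. The following is a minimal sketch of that idea using four of the scikit-learn estimators listed above. The hyperparameter values (for example `eps=0.30` for DBSCAN and `threshold=0.01` for Birch) are illustrative assumptions, not recommended settings.

```python
from numpy import unique
from sklearn.cluster import AgglomerativeClustering, Birch, DBSCAN, KMeans
from sklearn.datasets import make_classification

X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# Candidate models; all hyperparameter values are illustrative guesses.
models = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=4),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=2),
    "Birch": Birch(threshold=0.01, n_clusters=2),
    "DBSCAN": DBSCAN(eps=0.30, min_samples=9),
}

results = {}
for name, model in models.items():
    # fit_predict returns one label per example; DBSCAN uses -1 for noise.
    yhat = model.fit_predict(X)
    results[name] = unique(yhat)
    print(name, "found labels:", results[name])
```

The label sets alone do not say which clustering is better; they are only a starting point for the subjective evaluation described above.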

2.1 Clustering Dataset

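We will use the make_classification() function to create a test binary classification dataset. The dataset will have 1,000 examples, with two input features and one cluster per class. The clusters are visually obvious in two dimensions, so we can plot the data with a scatter plot and color the points by their assigned cluster.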

# synthetic classification dataset
from numpy import where
from sklearn.datasets import make_classification
from matplotlib import pyplot

# define the dataset
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# create a scatter plot for the samples of each class
for class_value in range(2):
    # get the row indexes of the examples in this class
    row_ix = where(y == class_value)[0]
    # add these samples to the scatter plot
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

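Running the example creates the synthetic clustering dataset and then a scatter plot of the input data, with points colored by class label (the idealized clusters). We can clearly see two distinct groups of data in two dimensions, and we hope that an automatic clustering algorithm can detect these groupings.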

2.2 Affinity Propagation

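Affinity propagation involves finding a set of exemplars that best summarize the data. It takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges.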

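It is implemented via the AffinityPropagation class. The main configuration to tune is the "damping" parameter, set to a value from 0.5 up to (but not including) 1, and perhaps also "preference".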

# affinity propagation clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot

# define the dataset
X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# define the model
model = AffinityPropagation(damping=0.9)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)

# retrieve the unique clusters
clusters = unique(yhat)
# create a scatter plot for the samples of each cluster
for cluster in clusters:
    # get the row indexes of the examples in this cluster
    row_ix = where(yhat == cluster)[0]
    # add these samples to the scatter plot
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
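
Since "damping" and "preference" are the settings to tune, a small loop over a few candidate values can show how sensitive the number of discovered clusters is to them. This is only a sketch: the candidate values, the reduced sample size of 300 (chosen to keep the run short), and the fixed `random_state` are assumptions made for illustration. Affinity propagation may also fail to converge for some settings, in which case scikit-learn emits a warning.

```python
from numpy import unique
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_classification

# A smaller sample than the main example, to keep the run short (assumption).
X, _ = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# Illustrative (damping, preference) pairs; preference=None uses the default,
# which is the median of the input similarities.
configs = [(0.5, None), (0.9, None), (0.9, -50.0)]

results = []
for damping, preference in configs:
    model = AffinityPropagation(
        damping=damping, preference=preference, random_state=4)
    yhat = model.fit_predict(X)
    n_found = len(unique(yhat))
    results.append((damping, preference, n_found))
    print(f"damping={damping}, preference={preference}: {n_found} clusters")
```

Lower (more negative) preference values generally lead to fewer exemplars, and therefore fewer clusters.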


2.3 Agglomerative Clustering

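Agglomerative clustering involves merging examples until the desired number of clusters is reached. It is part of the broader class of hierarchical clustering methods. It is implemented via the AgglomerativeClustering class, and the main configuration to tune is "n_clusters", an estimate of the number of clusters in the data.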

# agglomerative clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import AgglomerativeClustering
from matplotlib import pyplot

# define the dataset
X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# define the model
model = AgglomerativeClustering(n_clusters=2)
# fit the model and predict the clusters
yhat = model.fit_predict(X)
# retrieve the unique clusters
clusters = unique(yhat)
# create a scatter plot for the samples of each cluster
for cluster in clusters:
    # get the row indexes of the examples in this cluster
    row_ix = where(yhat == cluster)[0]
    # add these samples to the scatter plot
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
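
Which examples get merged at each step depends on the linkage criterion, which the example above leaves at its default ("ward"). As a hedged sketch, the loop below tries the four linkage options that AgglomerativeClustering accepts and reports the resulting cluster sizes; it is meant for exploration, not as a recommendation of any one option.

```python
from numpy import bincount
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification

X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

sizes = {}
for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    yhat = model.fit_predict(X)
    # bincount gives the number of examples assigned to each cluster label.
    sizes[linkage] = bincount(yhat)
    print(linkage, sizes[linkage])
```

Very unbalanced sizes for a given linkage are a hint that it is splitting off a few outlying points rather than finding the two main groups.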


2.4 K-Means

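K-means clustering may be the most widely known clustering algorithm. It involves assigning examples to clusters so as to minimize the variance within each cluster. It is implemented via the KMeans class, and the main configuration to tune is "n_clusters", set to the estimated number of clusters in the data.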

# k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot

# define the dataset
X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve the unique clusters
clusters = unique(yhat)

# create a scatter plot for the samples of each cluster
for cluster in clusters:
    # get the row indexes of the examples in this cluster
    row_ix = where(yhat == cluster)[0]
    # add these samples to the scatter plot
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
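
The quantity that K-means tries to minimize can be inspected directly. In scikit-learn the fitted model exposes it as `inertia_`, the sum of squared distances from each example to its nearest cluster center. The sketch below recomputes that value by hand with NumPy as a sanity check; fixing `n_init` and `random_state` is an assumption made here for repeatability.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

model = KMeans(n_clusters=2, n_init=10, random_state=4).fit(X)

# Squared Euclidean distance from every example to every cluster center,
# giving an array of shape (n_samples, n_clusters).
sq_dists = ((X[:, None, :] - model.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Each example contributes its squared distance to the nearest center.
manual_inertia = sq_dists.min(axis=1).sum()

print("inertia_ from scikit-learn:", model.inertia_)
print("inertia computed by hand:  ", manual_inertia)
```

Dividing this sum by the number of examples gives an average within-cluster squared distance, which is one way to read the "variance within each cluster" that the algorithm reduces.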


2.5 Mini-Batch K-Means

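Mini-Batch K-Means is a modified version of K-means that updates the cluster centroids using small batches of samples rather than the entire dataset. This can make the updates much faster on large datasets, and perhaps more robust to statistical noise. It is implemented via the MiniBatchKMeans class, and the main configuration to tune is again "n_clusters".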

# mini-batch k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot

# define the dataset
X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# define the model
model = MiniBatchKMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve the unique clusters
clusters = unique(yhat)

# create a scatter plot for the samples of each cluster
for cluster in clusters:
    # get the row indexes of the examples in this cluster
    row_ix = where(yhat == cluster)[0]
    # add these samples to the scatter plot
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
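
To get a feel for the trade-off between the two variants, the following sketch fits both KMeans and MiniBatchKMeans on the same data and prints the inertia (sum of squared distances to the nearest center) each one reaches. The `batch_size=100` value and the fixed seeds are illustrative assumptions. On a dataset this small any speed difference is negligible, so only the quality of the solution is compared.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_classification

X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# Full-batch K-means: every update uses all 1,000 examples.
full = KMeans(n_clusters=2, n_init=10, random_state=4).fit(X)

# Mini-batch K-means: every update uses a random batch of 100 examples
# (batch_size is an illustrative choice).
mini = MiniBatchKMeans(
    n_clusters=2, batch_size=100, n_init=3, random_state=4).fit(X)

print("KMeans inertia:         ", full.inertia_)
print("MiniBatchKMeans inertia:", mini.inertia_)
```

The mini-batch result is typically close to, but not necessarily as low as, the full-batch result.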


2.6 Spectral Clustering

Spectral clustering is a general class of clustering methods, drawn from linear algebra.

It is implemented via the SpectralClustering class. The main hyperparameter to tune is "n_clusters", which specifies the estimated number of clusters in the data.

# spectral clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import SpectralClustering
from matplotlib import pyplot

# define the dataset
X, _ = make_classification(
    n_samples=1000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=4)

# define the model
model = SpectralClustering(n_clusters=2)
# fit the model and predict the clusters
yhat = model.fit_predict(X)
# retrieve the unique clusters
clusters = unique(yhat)

# create a scatter plot for the samples of each cluster
for cluster in clusters:
    # get the row indexes of the examples in this cluster
    row_ix = where(yhat == cluster)[0]
    # add these samples to the scatter plot
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()
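
To show where the linear algebra comes in, here is a from-scratch sketch of one common spectral clustering recipe: build an affinity matrix, form the normalized graph Laplacian, take its eigenvectors with the smallest eigenvalues, and run K-means on them. This is not scikit-learn's implementation. The RBF affinity with `gamma = 1.0`, the well-separated `make_blobs` data, and the row normalization step are all assumptions chosen to keep the illustration simple.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Two well-separated blobs (illustrative data, not the dataset used above).
X, y_true = make_blobs(
    n_samples=200,
    centers=[[-5.0, 0.0], [5.0, 0.0]],
    cluster_std=1.0,
    random_state=4)
n_clusters = 2

# 1. Affinity matrix from an RBF kernel; gamma is an illustrative choice.
gamma = 1.0
W = np.exp(-gamma * pairwise_distances(X, metric="sqeuclidean"))
np.fill_diagonal(W, 0.0)

# 2. Symmetric normalized Laplacian: L = I - D^(-1/2) W D^(-1/2).
degree = W.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degree)
L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# 3. Eigenvectors for the smallest eigenvalues (eigh sorts them ascending).
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, :n_clusters]

# 4. Normalize each row to unit length, then cluster the rows with K-means.
embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=4).fit_predict(embedding)

# Cluster label numbering is arbitrary, so check agreement up to a swap.
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print("agreement with the generating blobs:", agreement)
```

The dense eigendecomposition used here scales poorly with the number of examples, so this form is only suitable for small datasets.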