Hierarchical Clustering

In this section we look at hierarchical clustering, whose goal is to build a hierarchy of clusters: reading from bottom to top, single points are glued into small clusters, and small clusters merge into larger ones. Unlike partitional methods such as K-means or Gaussian mixture models, which require the number of clusters K in advance, hierarchical clustering typically produces a dendrogram that lets us inspect the clustering at every level (each level corresponding to a different number of clusters), thereby avoiding the drawback of a fixed K.

Types of Hierarchical Clustering

Hierarchical clustering methods fall into two types according to how the hierarchy is built:

  • Agglomerative: bottom-up; start from singleton clusters and repeatedly merge the closest clusters

  • Divisive: top-down; start with all points in one cluster and gradually split off the most distant subsets into new clusters
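The agglomerative strategy can be sketched in a few lines. The following is a minimal illustration (not the SLINK implementation used later): it repeatedly merges the two closest clusters, with the linkage criterion passed in as a function.

```python
def naive_agglomerative(points, d, linkage, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    A straightforward O(n^3) illustration; `linkage` maps two clusters
    (given the point metric `d`) to a cluster-to-cluster distance.
    """
    clusters = [[p] for p in points]  # bottom level: every point is its own cluster
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j], d)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # glue the closest pair together
    return clusters

# single linkage on 1-D points: two well-separated groups
pts = [0.0, 0.1, 0.2, 5.0, 5.1]
single = lambda X, Y, d: min(d(x, y) for x in X for y in Y)
clusters = naive_agglomerative(pts, lambda a, b: abs(a - b), single, 2)
print(clusters)  # → [[0.0, 0.1, 0.2], [5.0, 5.1]]
```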

Metrics and Cluster Distances

At this point we need a precise definition of "near" and "far". We start from the metric between individual points and build up to the distance between clusters (the linkage criterion).

Given a metric $d$ between points, we can define the distance between clusters A and B. Common choices include:

  1. Maximum distance (complete-linkage clustering): $\max\{ d(a,b) : a \in A, b\in B \}$

  2. Minimum distance (single-linkage clustering): $\min\{ d(a,b) : a \in A, b\in B \}$

  3. Average distance (UPGMA): $ \frac {1} {n\times m} \sum \limits_ {a \in A, b\in B} d(a,b)$, where $n,m$ are the cardinalities of A and B, i.e. the number of points they contain

  4. Centroid distance (UPGMC): $d(C_A,C_B)$, where $C_A, C_B$ are the centroids (mean points) of A and B

  5. Energy distance: $ \frac {2} {n\times m} \sum \limits_ {a \in A, b\in B} \vert a-b\vert_2 - \frac{1}{n^2} \sum \limits_{a_i,a_j \in A} \vert a_i -a_j \vert_2 -\frac{1}{m^2} \sum \limits_{b_i,b_j \in B} \vert b_i -b_j \vert_2 $
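The first three linkage criteria above are straightforward to write down in code. A short sketch (with a plain Euclidean point metric; the function names are chosen for illustration):

```python
import math

def euclidean(a, b):
    """Plain Euclidean metric between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(A, B, d):
    """Maximum distance: the largest pairwise distance between the clusters."""
    return max(d(a, b) for a in A for b in B)

def single_linkage(A, B, d):
    """Minimum distance: the smallest pairwise distance between the clusters."""
    return min(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    """UPGMA: the mean of all n*m pairwise distances."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
print(single_linkage(A, B, euclidean))    # → 3.0, between (0, 0) and (3, 0)
print(complete_linkage(A, B, euclidean))  # between (0, 1) and (4, 0)
```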

The SLINK Algorithm and Its Complexity

As for the complexity of agglomerative clustering, it cannot be lower than $ O(n^2) $, since we must compute the distance between every pair of data points. Here we introduce the SLINK algorithm, proposed by R. Sibson in 1973, which runs in $ O(n^2) $ time and $ O(n) $ space.

Number the n data points from 0 to $ n-1 $. SLINK computes two arrays A and B: A(i) is the lowest level (the height in the dendrogram, which is essentially a distance between clusters) at which point i is merged into a cluster together with at least one point of larger index, and B(i) is the largest-index point in the cluster containing i at that level.

Once we have these two arrays:

  1. Start from the smallest value $ a_0 $ in A (it may occur several times); for every point in the corresponding index set $ \{i\} $ (a set, since there may be several), merge it with its B(i) into one cluster, where they remain from then on; every point not involved forms a singleton cluster;

  2. Remove $ a_0 $ from A and find the smallest remaining value; for every point in the corresponding index set $ \{i\} $, merge it with its B(i) into one cluster, where they remain from then on; clusters not involved (possibly singletons) stay unchanged.

  3. Repeat step 2 until all points belong to a single cluster.
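As a small worked example of this reconstruction, take the 1-D points 0, 1, 10 under single linkage: {0, 1} merge at height 1, and that cluster joins {10} at height 9. The pointer representation below is worked out by hand from those merges; replaying them in order of increasing height recovers the hierarchy.

```python
# Hand-worked pointer representation for the 1-D points 0, 1, 10
A = [1.0, 9.0, float('inf')]  # merge heights per point
B = [1, 2, 2]                 # largest index in the cluster joined at that height

clusters = {i: {i} for i in range(len(A))}
# replay the merges bottom-up, in order of increasing height (steps 1-3 above)
for i in sorted(range(len(A)), key=lambda i: A[i]):
    if A[i] == float('inf'):
        break  # the largest-index point is never merged with a larger index
    merged = clusters[i] | clusters[B[i]]
    for p in merged:
        clusters[p] = merged
    print(sorted(merged), 'formed at height', A[i])
# → [0, 1] formed at height 1.0
# → [0, 1, 2] formed at height 9.0
```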

Strengths and Weaknesses

Compared with K-means, hierarchical clustering gives the same result on every run, can be stopped at any level, and handles a wider range of data types (for example, strings can be clustered by choosing a suitable metric), but at the cost of higher algorithmic complexity.
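For instance, strings can be clustered by taking an edit distance as the point-to-point metric. A minimal sketch of the standard Levenshtein distance (insertion, deletion, and substitution, each at cost 1):

```python
def levenshtein(s, t):
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(t) + 1))  # distances from s[:0] to every prefix of t
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution (free on match)
        prev = cur
    return prev[-1]

words = ['cat', 'cart', 'bat']
for a in words:
    for b in words:
        if a < b:
            print(a, b, levenshtein(a, b))
```

Plugging this in as the metric $d$ lets any of the linkage criteria above operate on clusters of strings.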

SLINK Implementation


In [1]:
from math import sqrt
inf = float('inf')


def metrics(a, b, met='euclidean'):
    """Distance between two numeric data points.

    Args:
        a, b(array-like) : - the two points to compare
        met(string) : - metric to use, "euclidean", "squared", "manhattan" or "max"
    Returns:
        distance(float) : - distance between the two numeric points
    """
    if len(a) != len(b):
        raise ValueError("two vectors are not of the same dimension")
    k = 0
    for i in range(len(a)):
        if met in ('euclidean', 'squared'):
            k += (a[i] - b[i])**2
        elif met == 'manhattan':
            k += abs(a[i] - b[i])
        elif met == 'max':
            k = max(k, abs(a[i] - b[i]))
    if met == 'euclidean':
        k = sqrt(k)
    return k

def _SLINK_recur(Dataset, A=None, B=None, dist=metrics, **kws):
    """Recursive formulation of SLINK.

    Args:
        Dataset(iterables) :- dataset
        A, B(iterables) :- pointer representations of the dendrogram of the first k elements
        dist(func) :- metric to use
        **kws : - other parameters of the metric function
    Returns:
        tuple: - pointer representations of the dendrogram of the first k+1 elements,
                with A noting the lowest level at which i is no longer the last point in its cluster and
                B storing the last point in the cluster which i then joins
    """
    if A is None and B is None:
        k = 1
        A = [inf]
        B = [0]
    elif len(A) == len(B):
        k = len(A)
    else:
        raise ValueError('A and B are not of the same dimension')
    n = len(Dataset)
    if k >= n:
        return (A, B)
    B.append(k)
    A.append(inf)
    # distances from the new point k to each of the first k points
    M = [dist(Dataset[i], Dataset[k], **kws) for i in range(k)]
    for i in range(k):
        if A[i] >= M[i]:
            M[B[i]] = min(M[B[i]], A[i])
            A[i] = M[i]
            B[i] = k
        else:
            M[B[i]] = min(M[B[i]], M[i])
    for i in range(k):
        if A[i] >= A[B[i]]:
            B[i] = k
    return _SLINK_recur(Dataset, A, B, dist, **kws)


def SLINK(Dataset, d):
    """Execute the SLINK algorithm (iterative version).

    Args:
        Dataset(iterables) : - list of data points, which are themselves lists
        d(int) : - dimension of the data points (kept for the interface; not used)

    Returns :
        res(iterables) : - list of triples sorted by the second element;
                                the first element is the index of the point,
                                the other two are the pointer representation of the dendrogram, noting the
                                lowest level at which i is no longer the last point in its cluster and
                                the last point in the cluster which i then joins
        Heights(iterables) : - list of the second elements of the triples in res
    """
    n = len(Dataset)
    # initialisation: every point starts as the last point of its own cluster
    A = [inf] * n
    B = [0] * n

    for k in range(1, n):
        B[k] = k
        A[k] = inf
        # distances from the new point k to each of the first k points
        M = [metrics(Dataset[i], Dataset[k]) for i in range(k)]
        for i in range(k):
            if A[i] >= M[i]:
                M[B[i]] = min(M[B[i]], A[i])
                A[i] = M[i]
                B[i] = k
            else:
                M[B[i]] = min(M[B[i]], M[i])
        for i in range(k):
            if A[i] >= A[B[i]]:
                B[i] = k
    res = [(index, i, j) for index, (i, j) in enumerate(zip(A, B))]
    res = sorted(res, key=lambda x: x[1])
    Heights = [triple[1] for triple in res]
    return (res, Heights)


def display_from_SLINK(res, Heights):
    '''Display the clustering layer by layer,

    according to the pointer representation of the dendrogram.

    Args:
        res(iterables) :- list of triples sorted by the second element;
                                    the first element is the index of the point,
                                    the other two are the pointer representation of the dendrogram, noting the
                                    lowest level at which i is no longer the last point in its cluster and
                                    the last point in the cluster which i then joins

        Heights(iterables) :- list of the second elements of the triples in res
    '''
    num_points = len(Heights)
    Clustering = [[i] for i in range(num_points)]
    j = 0
    k = 0
    while j < num_points - 1:
        # one layer of clustering: merge every pair recorded at the same height
        i = j
        k += 1
        print(f'Merge {k}:')
        while j < num_points - 1 and Heights[j] == Heights[i]:
            Clustering[res[j][2]] += Clustering[res[j][0]]
            print(Clustering[res[j][2]])
            j += 1

In [7]:
Dataset = [[i**2, i] for i in range(10)]
d = 2

In [13]:
import random
random.shuffle(Dataset)

In [17]:
for index, data in enumerate(Dataset):
    print(index, data)


0 [81, 9]
1 [36, 6]
2 [4, 2]
3 [49, 7]
4 [0, 0]
5 [1, 1]
6 [64, 8]
7 [16, 4]
8 [9, 3]
9 [25, 5]

In [15]:
res, Heights = SLINK(Dataset, d)

In [16]:
res # [(point index, height at which the point first joins a cluster containing a larger index, largest index in that cluster at that moment)]


Out[16]:
[(4, 1.4142135623730951, 5),
 (2, 3.1622776601683795, 5),
 (5, 5.0990195135927845, 8),
 (7, 7.0710678118654755, 8),
 (8, 9.055385138137417, 9),
 (1, 11.045361017187261, 9),
 (3, 13.038404810405298, 9),
 (6, 15.033296378372908, 9),
 (0, 17.029386365926403, 9),
 (9, inf, 9)]

In [18]:
Heights


Out[18]:
[1.4142135623730951,
 3.1622776601683795,
 5.0990195135927845,
 7.0710678118654755,
 9.055385138137417,
 11.045361017187261,
 13.038404810405298,
 15.033296378372908,
 17.029386365926403,
 inf]

In [19]:
display_from_SLINK(res, Heights)


Merge 1:
[5, 4]
Merge 2:
[5, 4, 2]
Merge 3:
[8, 5, 4, 2]
Merge 4:
[8, 5, 4, 2, 7]
Merge 5:
[9, 8, 5, 4, 2, 7]
Merge 6:
[9, 8, 5, 4, 2, 7, 1]
Merge 7:
[9, 8, 5, 4, 2, 7, 1, 3]
Merge 8:
[9, 8, 5, 4, 2, 7, 1, 3, 6]
Merge 9:
[9, 8, 5, 4, 2, 7, 1, 3, 6, 0]

Hierarchical Clustering with the sklearn Interface

sklearn provides the interface sklearn.cluster.AgglomerativeClustering, which implements agglomerative hierarchical clustering by recursively merging samples.

For details, see the sklearn clustering documentation.
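A minimal usage sketch (the data and parameter choices here are illustrative): clustering two well-separated groups of 2-D points into two clusters with single linkage.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# two well-separated groups on a line
X = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [5.0, 0], [5.1, 0]])
model = AgglomerativeClustering(n_clusters=2, linkage='single')
labels = model.fit_predict(X)
print(labels)  # first three points share one label, last two the other
```

Instead of a fixed `n_clusters`, one can also pass `distance_threshold` to cut the dendrogram at a chosen height, which mirrors reading the SLINK heights above.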