In this section we look at hierarchical clustering, whose goal is to build a hierarchy of clusters that reflects, from bottom to top, how single points glue together into small clusters and small clusters merge into larger ones. Unlike partitioning methods such as K-means or Gaussian mixture models, which require the number of clusters K to be fixed in advance, hierarchical clustering generally produces a dendrogram that lets us read off the clustering at every level (each level corresponding to a different number of clusters), thereby avoiding the drawback of a fixed K.
Hierarchical clustering methods come in two kinds, according to how the hierarchy is built:
Agglomerative: bottom-up; starting from singleton clusters, repeatedly merge the closest clusters
Divisive: top-down; starting with all points in one cluster, repeatedly split off the most distant subsets into new clusters
At this point we need a careful definition of what "near" and "far" mean. We start from the metric between individual points and work up to the distance between clusters (the linkage criterion).
Given a metric $d$ between points, we can define the distance between clusters A and B. Common choices include the following (a short sketch in code follows the list):
Maximum distance (complete-linkage clustering): $\max\{ d(a,b) : a \in A, b\in B \}$
Minimum distance (single-linkage clustering): $\min\{ d(a,b) : a \in A, b\in B \}$
Average distance (UPGMA): $\frac{1}{n\times m} \sum\limits_{a \in A, b\in B} d(a,b)$, where $n, m$ are the cardinalities of A and B, i.e. the numbers of points they contain
Centroid distance (UPGMC): $d(C_A, C_B)$, where $C_A, C_B$ are the centroids of A and B
Energy distance: $\frac{2}{n\times m} \sum\limits_{a \in A, b\in B} \vert a-b \vert_2 - \frac{1}{n^2} \sum\limits_{a_i,a_j \in A} \vert a_i - a_j \vert_2 - \frac{1}{m^2} \sum\limits_{b_i,b_j \in B} \vert b_i - b_j \vert_2$
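As a minimal sketch of the first three linkage criteria above (the names euclidean and cluster_dist are illustrative, assuming points are plain Python lists):

def euclidean(a, b):
    # straight-line distance between two numeric points
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_dist(A, B, linkage='single'):
    # distance between clusters A and B under a chosen linkage criterion
    pairs = [euclidean(a, b) for a in A for b in B]
    if linkage == 'complete':   # maximum over all cross-cluster pairs
        return max(pairs)
    if linkage == 'single':     # minimum over all cross-cluster pairs
        return min(pairs)
    if linkage == 'average':    # UPGMA: mean over all cross-cluster pairs
        return sum(pairs) / (len(A) * len(B))
    raise ValueError('unknown linkage')

For example, cluster_dist([[0], [1]], [[3]], 'single') gives 2.0, while 'complete' gives 3.0.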
As for the complexity of agglomerative clustering, it cannot be below $O(n^2)$, since we must compute the distance between every pair of data points. Here we introduce the SLINK algorithm, proposed by R. Sibson in 1972, whose time complexity is $O(n^2)$ and whose space complexity is $O(n)$.
Number the n data points from 0 to $n-1$. SLINK computes two arrays A and B: A(i) is the level at which point i is first merged into a cluster that contains a point with a larger index (on the dendrogram this is a height, in essence a distance between clusters), and B(i) is the largest index among the points of that cluster at that level.
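As a small worked example: for the three 1-D points $x_0 = 0$, $x_1 = 1$, $x_2 = 3$ under the Euclidean metric, $\{0,1\}$ merges at height 1 and then absorbs $\{2\}$ at height 2 (single linkage), so A = [1, 2, inf] and B = [1, 2, 2]; point 2, having the largest index, is by convention never merged into a larger-index cluster, hence A(2) = inf.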
Once these two arrays are available, we can recover the clustering at every level; the function display_from_SLINK below implements this procedure, and a worked example follows the steps.
1. Start from the smallest value $a_0$ in A (it may occur several times); for each point in the corresponding index set $\{i\}$ (a set, since there may be several such points), merge it with its B(i) into one cluster, in which they remain from then on; every point not involved forms a singleton cluster.
2. Remove $a_0$ from A and take the smallest remaining value; for each point in the corresponding index set $\{i\}$, merge it with its B(i) into one cluster, in which they remain from then on; clusters not involved (possibly singletons) stay unchanged.
3. Repeat step 2 until all points belong to a single cluster.
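Continuing the example above with A = [1, 2, inf] and B = [1, 2, 2]: the smallest height is 1, so point 0 merges with B(0) = 1, giving the clusters {0, 1} and {2}; the next height is 2, so point 1 (and with it its whole cluster) merges with B(1) = 2, giving {0, 1, 2}.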
Compared with K-means, hierarchical clustering returns the same result on every run, can be stopped at any level, and handles a richer range of data types; for example, strings can be clustered by choosing a suitable metric (a sketch follows), though the algorithmic complexity is higher.
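As a hypothetical illustration of that last point: hierarchical clustering only needs a metric, so string data can be handled by swapping in, say, the Levenshtein edit distance for the numeric metric defined below.

def levenshtein(a, b):
    # classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

For instance, levenshtein('kitten', 'sitting') is 3.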
In [1]:
from math import sqrt
inf = float('inf')
def metrics(a, b, met='euclidean'):
    """definition of distance between numeric data points
    Args:
        a, b(array-like) : - two points to compare
        met(string) : - used metric, "euclidean", "squared", "manhattan" or "max"
    Returns:
        distance(float) : - distance between two numeric points
    """
    if len(a) != len(b):
        raise ValueError("two vectors are not of the same dimension")
    try:
        k = 0
        for i in range(len(a)):
            # note: the test must be `met in (...)`; the form
            # `met == 'euclidean' or 'squared'` is always true and
            # would corrupt the other metrics
            if met in ('euclidean', 'squared'):
                k += (a[i] - b[i])**2
            elif met == 'manhattan':
                k += abs(a[i] - b[i])
            elif met == 'max':
                k = max(k, abs(a[i] - b[i]))
        if met == 'euclidean':
            k = sqrt(k)
        return k
    except TypeError:
        print("Not all data points are numeric")
def _SLINK_recur(Dataset, A=None, B=None, dist=metrics, **kws):
    """recursion of SLINK
    Args:
        Dataset(iterables) : - dataset
        A, B(iterables) : - pointer representation of the dendrogram of the first k elements
        dist(func) : - used metric
        **kws : - other parameters of the metric function
    Returns:
        List : - pointer representation of the dendrogram of the first k+1 elements,
            with A noting the lowest level at which i is no longer the last point in its cluster and
            B storing the last point in the cluster which i then joins
    """
    if A is None and B is None:
        # pointer representation of the one-point dendrogram {0}
        k = 1
        A = [inf]
        B = [0]
    else:
        if len(A) != len(B):
            raise ValueError('A and B are not of the same dimension')
        k = len(A)
    n = len(Dataset)
    if k >= n:
        return A, B
    # insert point k: for now it is the last point of its own cluster
    B.append(k)
    A.append(inf)
    # M[i] = distance from point i to the new point k
    M = [dist(Dataset[i], Dataset[k], **kws) for i in range(k)]
    for i in range(k):
        if A[i] >= M[i]:
            M[B[i]] = min(M[B[i]], A[i])
            A[i] = M[i]
            B[i] = k
        else:
            M[B[i]] = min(M[B[i]], M[i])
    for i in range(k):
        if A[i] >= A[B[i]]:
            B[i] = k
    return _SLINK_recur(Dataset, A, B, dist=dist, **kws)
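# Note: _SLINK_recur adds one stack frame per data point, so CPython's default
# recursion limit (about 1000) caps the usable dataset size; the iterative
# SLINK below performs the same pointer updates without recursion.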
def SLINK(Dataset, d):
    """function to execute the SLINK algo
    Args:
        Dataset(iterables) : - list of data points, which are themselves lists
        d(int) : - dimension of the data points (kept for the interface, unused below)
    Returns:
        res(iterables) : - list of triples sorted by the second element;
            the first element is the index of the point,
            the other two are the pointer representation of the dendrogram noting the
            lowest level at which i is no longer the last point in its cluster and
            the last point in the cluster which i then joins
        Heights(iterables) : - list of the second elements of the triples in res
    """
    n = len(Dataset)
    # initialisation: pointer representation of the one-point dendrogram {0}
    A = [inf] * n
    B = [0] * n
    for k in range(1, n):
        # insert point k: for now it is the last point of its own cluster
        B[k] = k
        A[k] = inf
        # M[i] = distance from point i to the new point k
        M = [0] * (k + 1)
        for i in range(k):
            M[i] = metrics(Dataset[i], Dataset[k])
        for i in range(k):
            if A[i] >= M[i]:
                M[B[i]] = min(M[B[i]], A[i])
                A[i] = M[i]
                B[i] = k
            else:
                M[B[i]] = min(M[B[i]], M[i])
        for i in range(k):
            if A[i] >= A[B[i]]:
                B[i] = k
    res = [(index, i, j) for index, (i, j) in enumerate(zip(A, B))]
    res = sorted(res, key=lambda x: x[1])
    Heights = [triple[1] for triple in res]
    return res, Heights
def display_from_SLINK(res, Heights):
    '''function to display the clustering layer by layer
    according to the pointer representation of the dendrogram
    Args:
        res(iterables) : - list of triples sorted by the second element;
            the first element is the index of the point,
            the other two are the pointer representation of the dendrogram noting the
            lowest level at which i is no longer the last point in its cluster and
            the last point in the cluster which i then joins
        Heights(iterables) : - list of the second elements of the triples in res
    '''
    num_points = len(Heights)
    # start from singleton clusters
    Clustering = [[i] for i in range(num_points)]
    j = 0
    k = 0
    while j < num_points - 1:
        # one layer of the clustering: all merges happening at the same height
        i = j
        k += 1
        print('merge step', k)
        while j < num_points - 1 and Heights[j] == Heights[i]:
            # point res[j][0] joins the cluster whose last point is res[j][2]
            Clustering[res[j][2]] += Clustering[res[j][0]]
            print(Clustering[res[j][2]])
            j += 1
In [7]:
Dataset = [[i**2, i] for i in range(10)]
d = 2
In [13]:
import random
random.shuffle(Dataset)
In [17]:
for index, data in enumerate(Dataset):
    print(index, data)
In [15]:
res, Heights = SLINK(Dataset, d)
In [16]:
res  # list of triples (point index, height at which the point is first merged into a cluster containing a larger index, largest index in that cluster at that level)
Out[16]:
In [18]:
Heights
Out[18]:
In [19]:
display_from_SLINK(res, Heights)
sklearn offers the interface sklearn.cluster.AgglomerativeClustering, which implements agglomerative hierarchical clustering by recursively merging samples. See the sklearn clustering documentation for details.
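As a minimal sketch of that interface (the parameter values here are illustrative):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array(Dataset)                     # reuse the toy dataset above
model = AgglomerativeClustering(n_clusters=3, linkage='single')
labels = model.fit_predict(X)             # one cluster label per point
print(labels)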