Notebook version: 1.0 (Jan, 23, 2017)
Author: Jerónimo Arenas García (jarenas@tsc.uc3m.es)
Jesús Cid Sueiro (jcid@tsc.uc3m.es)
Changes: v.1.0 - First complete version. Based on a Matlab version by Jerónimo Arenas
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
In [27]:
import scipy.io
data = scipy.io.loadmat('dataP3.mat')
X = data['X'].T
true_model = data['true_model']
true_labels = data['true_labels']
n_clusters = true_model[0][0][0][0][0]
mx = true_model[0][0][1]
P = true_model[0][0][2]
V = true_model[0][0][3][0]
# centers = [[-5, 0], [0, 1.5], [5, -1], [2, 2], [-3, 6], [0.5, -3]]
# X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
# T = [[0.4, 0.3], [-0.4, 0.8]]
# centers = [[-5, 0], [0, 1.5], [5, -1], [2, 2], [-3, 6], [0.5, -3]]
# X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
# T = [[0.4, 0.3], [-0.4, 0.8]]
# for i in range(6):
# X = np.dot(X, T)
# plt.scatter(X[:, 0], X[:, 1], s=30);
# plt.axis('equal')
# plt.show()
The file contains the three following variables:
X
: A matrix of size ($2 \times 4000$) with the training data.true_model
: A Matlab struct with the model used for generation of the synthetic data. The fields of the struct provide the following information about the model: number of groups, the mean and covariance matrices of the multidimensional normal distributions that characterize the class-conditional pdfs, and the {\em a priori} probability of each group.true_labels
: A vector of length $4000$ with the information about which group has generated each of the training patterns.Note that, since this is a clustering problem, the information provided by variables true_model
and true_labels
is normally not known (not at least all of it). Precisely, the objective of the clustering problem is to obtain such information from the available data.
In [28]:
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=true_labels, s=30, cmap='rainbow')
plt.axis('equal')
plt.show()
In [ ]:
In [36]:
from sklearn.cluster import KMeans
est = KMeans(4) # 4 clusters
est.fit(X)
y_kmeans = est.predict(X)
plt.scatter(X[:,0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');
plt.axis('equal')
plt.show()
Compute the overall distortion: $$J = \sum_{i=1}^c \sum_{{\bf x}_k \in {\cal C}_i} \|{\bf x}_k - {\bf \mu}_i\|^2,$$ where $c$ is the number of groups, the second sum runs over the training patterns in class ${\cal C}_i$, and ${\bf \mu}_i$ is the calculated mean vector for the $i$th class, $i=1,\dots,c$.
In [ ]:
Here, we show one of the possible solutions that can be found. The overall distortion for this case is $70.87$.
\centerline{\includegraphics[width=8cm]{k_means_groups}}
We can see that one of the original groups has been roughly split into three subgroups. It is also interesting to see that, even though the centers of the blue and yellow groups have been correctly placed, since data assignment is done based on minimum distance (i.e., covariance information is ignored) some of the points that were generated by the blue' group are incorrectly assigned to the
yellow' group.
In [ ]:
In [ ]:
In [ ]:
The left figure provides the results for this exercise. The fact that the average distortion over 10 runs and the result of the best run differ proves the non-deterministic behavior of the $k$-means algorithm. For completeness, the right figure displays the log-likelihood of the best model obtained for each value of $c$ (i.e., the best out of the 10 different initializations).
\centerline{\includegraphics[width=7cm]{distortion_kmeans} \includegraphics[width=7cm]{likelihood_kmeans}}
The overall distortion shows a decreasing behavior as the number of groups is increased. We can also see that the gain is very important when $c$ changes from $1$ to $4$, and after that the gain is not that noticeable. This behavior can be used as a heuristic to select the number of groups.
Analyze the evolution of the log-likelihood as a function of the number of groups using the same initialization described in Subsection \ref{subsec:EM_kmeans}. Plot the achieved log-likelihood for the number of groups in the range from 1 to 12.
Analyze the behavior of the BIC (Bayesian Information Criterion), which is given by $$\text{BIC} = -2 L + n_p \log(n)$$ where $L$ is the log-likelihood of the model, $n_p$ is the number of parameters of the model, and $n$ is the number of available training patterns.
\begin{solution} In order to calculate the BIC criterion we simply need to take into account that the number of parameters is given by $n_p = 6 c$, i.e., six parameters are needed per class, 2 for the mean vector, 3 for the covariance matrix, and 1 for the {\em a priori} probability of the class. The following figures show the log-likelihood and the BIC as functions of the number of groups. According to the figure, the BIC criterion would suggest to use 4 groups, which coincides with the true number of group of the generative model. \centerline{\includegraphics[width=8cm]{logL_ngroups} \includegraphics[width=8cm]{BIC_ngroups}} \end{solution}The first one consists of 4 compact clusters generated from a Gaussian distribution. This is the kind of dataset that are best suited to centroid-based clustering algorithms like $K$-means. If the goal of the clustering algorithm is to minimize the intra-cluster distances and find a representative prototype or centroid for each cluster, $K$-means may be a good option.
In [2]:
from sklearn.datasets.samples_generator import make_blobs
from sklearn.utils import shuffle
N = 300
nc = 4
Xs, ys = make_blobs(n_samples=N, centers=nc,
random_state=6, cluster_std=0.60, shuffle = False)
X, y = shuffle(Xs, ys, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=30);
plt.axis('equal')
plt.show()
Note that we have computed two data matrices:
Note that both matrices contain the same data (rows) but in different order. The sorted matrix will be usefull later for illustration purposes, but keep in mind that, in a real clustering application, vector ${\bf y}$ is unknown (learning is not supervised), and only a data matrix with an arbitrary ordering (like ${\bf X}$) will be available.
The second dataset contains two concentric rings. One could expect from a clustering algorithm to identify two different clusters, one per each ring of points. If this is the case, $K$-means or any other algorithm focused on minimizing distances to some cluster centroids is not a good choice.
In [3]:
from sklearn.datasets.samples_generator import make_circles
X2s, y2s = make_circles(n_samples=N, factor=.5, noise=.05, shuffle=False)
X2, y2 = shuffle(X2s, y2s, random_state=0)
plt.scatter(X2[:, 0], X2[:, 1], s=30)
plt.axis('equal')
plt.show()
Note, again, that we have computed both the sorted (${\bf X}_{2s}$) and the shuffled (${\bf X}_2$) versions of the dataset in the code above.
In [4]:
from sklearn.cluster import KMeans
# <SOL>
est = KMeans(n_clusters=4)
clusters = est.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=clusters, s=30, cmap='rainbow')
plt.axis('equal')
clusters = est.fit_predict(X2)
plt.figure()
plt.scatter(X2[:, 0], X2[:, 1], c=clusters, s=30, cmap='rainbow')
plt.axis('equal')
plt.show()
# </SOL>
Spectral clustering algorithms are focused on connectivity: clusters are determined by maximizing some measure of intra-cluster connectivity and maximizing some form of inter-cluster connectivity.
To implement a spectral clustering algorithm we must specify a similarity measure between data points. In this session, we will use the rbf kernel, that computes the similarity between ${\bf x}$ and ${\bf y}$ as:
$$\kappa({\bf x},{\bf y}) = \exp(-\gamma \|{\bf x}-{\bf y}\|^2)$$Other similarity functions can be used, like the kernel functions implemented in Scikit-learn (see the metrics module).
For a dataset ${\cal S} = \{{\bf x}^{(0)},\ldots,{\bf x}^{(N-1)}\}$, the $N\times N$ affinity matrix ${\bf K}$ contains the similarity measure between each pair of samples. Thus, its components are
$$K_{ij} = \kappa\left({\bf x}^{(i)}, {\bf x}^{(j)}\right)$$The following fragment of code illustrates all pairs of distances between any two points in the dataset.
In [5]:
from sklearn.metrics.pairwise import rbf_kernel
gamma = 0.5
K = rbf_kernel(X, X, gamma=gamma)
In [6]:
plt.imshow(K, cmap='hot')
plt.colorbar()
plt.title('RBF Affinity Matrix for gamma = ' + str(gamma))
plt.grid('off')
plt.show()
Despite the apparent randomness of the affinity matrix, it contains some hidden structure, that we can uncover by visualizing the affinity matrix computed with the sorted data matrix, ${\bf X}_s$.
In [7]:
Ks = rbf_kernel(Xs, Xs, gamma=gamma)
plt.imshow(Ks, cmap='hot')
plt.colorbar()
plt.title('RBF Affinity Matrix for gamma = ' + str(gamma))
plt.grid('off')
plt.show()
Note that, despite their completely different appearance, both affinity matrices contain the same values, but with a different order of rows and columns.
For this dataset, the sorted affinity matrix is almost block diagonal. Note, also, that the block-wise form of this matrix depends on parameter $\gamma$.
In [ ]:
Out from the diagonal block, similarities are close to zero. We can enforze a block diagonal structure be setting to zero the small similarity values.
For instance, by thresholding ${\bf K}_s$ with threshold $t$, we get the truncated (and sorted) affinity matrix $$ \overline{K}_{s,ij} = K_{s,ij} \cdot \text{u}(K_{s,ij} - t) $$
(where $\text{u}()$ is the step function) which is block diagonal.
In [8]:
# <SOL>
t = 0.001
Kt = K*(K>t) # Truncated affinity matrix
Kst = Ks*(Ks>t) # Truncated and sorted affinity matrix
# </SOL>
Any similarity matrix defines a weighted graph in such a way that the weight of the edge linking ${\bf x}_i$ and ${\bf x}_j$ is $K_{ij}$.
If $K$ is a full matrix, the graph is fully connected (there is and edge connecting every pair of nodes). But we can get a more interesting sparse graph by setting to zero the edges with a small weights.
For instance, let us visualize the graph for the truncated affinity matrix $\overline{\bf K}$ with threshold $t$. You can also check the effect of increasing or decreasing $t$.
In [9]:
import networkx as nx
G = nx.from_numpy_matrix(Kt)
graphplot = nx.draw(G, X, node_size=40, width=0.5,)
plt.axis('equal')
plt.show()
Note that, for this dataset, the graph connects edges from the same cluster only. Therefore, the number of diagonal blocks in $\overline{\bf K}_s$ is equal to the number of connected components in the graph.
Note, also, the graph does not depend on the sample ordering in the data matrix: the graphs for any matrix ${\bf K}$ and its sorted version ${\bf K}_s$ are the same.
The Laplacian matrix of a given affinity matrix ${\bf K}$ is given by $${\bf L} = {\bf D} - {\bf K}$$ where ${\bf D}$ is the diagonal degree matrix given by $$D_{ii}=\sum^{n}_{j} K_{ij}$$
If ${\bf L}$ is a block diagonal with blocks ${\bf L}_0, {\bf L}_1, \ldots, {\bf L}_{c-1}$, then it has at least $c$ orthogonal eigenvectors with zero eigenvalue: indeed, each block ${\bf L}_i$ is the Laplacian matrix of the graph containing the samples in the $i$ connected component, therefore, according to property P2,
$${\bf L}_i \cdot {\bf 1}_{N_i} = {\bf 0}_{N_i}$$where $N_i$ is the number of samples in the $i$-th connected component.
Therefore, if $${\bf v}_i = \left(\begin{array}{l} {\bf 0}_{N_0} \\ \vdots \\ {\bf 0}_{N_{i-1}} \\ {\bf 1}_{N_i} \\ {\bf 0}_{N_{i+1}} \\ \vdots \\ {\bf 0}_{N_{c-1}} \end{array} \right) $$ then $${\bf L} \cdot {\bf v}_{i} = {\bf 0}_{N}$$
We can compute the Laplacian matrix for the given dataset and visualize the eigenvalues:
In [10]:
Dst = np.diag(np.sum(Kst, axis=1))
Lst = Dst - Kst
# Next, we compute the eigenvalues of the matrix
w = np.linalg.eigvalsh(Lst)
plt.figure()
plt.plot(w, marker='.');
plt.title('Eigenvalues of the matrix')
plt.show()
Verify that ${\bf 1}_N$ is an eigenvector with zero eigenvalues. To do so, compute ${\bf L}_{st} \cdot {\bf 1}_N$ and verify that its euclidean norm is close to zero (it may be not exactly zero due to finite precission errors).
Verify that vectors ${\bf v}_i$ defined above (that you can compute using vi = (ys==i)
) also have zero eigenvalue.
In [11]:
# <SOL>
print np.linalg.norm(Lst.dot(np.ones((N,1))))
for i in range(nc):
vi = (ys==i)
print np.linalg.norm(Lst.dot(vi))
# </SOL>
Verify that the spectral properties of the Laplacian matrix computed from ${\bf K}_{st}$ still apply using the unsorted matrix, ${\bf K}_t$: compute ${\bf L}_{t} \cdot {\bf v}'_{i}$, where ${\bf v}'_i$ is a binary vector with components equal to 1 at the positions corresponding to samples in cluster $i$ (that you can compute using vi = (y==i)
)), and verify that its euclidean norm is close to zero.
In [12]:
# <SOL>
Dt = np.diag(np.sum(Kt, axis=1))
Lt = Dt - Kt
print np.linalg.norm(Lt.dot(np.ones((N,1))))
for i in range(nc):
vi = (y==i)
print np.linalg.norm(Lt.dot(vi))
# </SOL>
Note that the position of 1's in eigenvectors ${\bf v}_i$ points out the samples in the $i$-th connected component. This suggest the following tentative clustering algorithm:
This is the grounding idea of some spectral clustering algorithms. In this precise form, this algorithm does not usually work, for several reasons that we will discuss next, but with some modifications it becomes a powerfull method.
One of the reasons why the algorithm above may not work is that vectors ${\bf v}'_0, \ldots,{\bf v}'_{c-1}$ are not the only zero eigenvectors or ${\bf L}_t$: any linear combination of them is also a zero eigenvector. Eigenvector computation algorithms may return a different set of orthogonal eigenvectors.
However, one can expect that eigenvector should have similar component in the positions corresponding to samples in the same connected component.
In [13]:
wst, vst = np.linalg.eigh(Lst)
for n in range(nc):
plt.plot(vst[:,n], '.-')
In [14]:
# <SOL>
D = np.diag(np.sum(K, axis=1))
L = D - K
w, v = np.linalg.eigh(L)
for n in range(nc):
plt.plot(v[:,n], '.-')
# </SOL>
Note that, despite the eigenvector components can not be used as a straighforward cluster indicator, they are strongly informative of the clustering structure.
Therfore we can define vectors ${\bf z}^{(n)} = (v_{n0}, \ldots, v_{n,c-1})$ and apply a centroid based algorithm (like $K$-means) to identify all points with similar eigenvector components. The corresponding samples in ${\bf X}$ become the final clusters of the spectral clustering algorithm.
One possible way to identify the cluster structure is to apply a $K-means algorithm over the eigenvector coordinates. The steps of the spectral clustering algorithm become the following
Summarizing, the steps of the spectral clustering algorithm for a data matrix ${\bf X}$ are the following:
In [15]:
# <SOL>
g = 20
t = 0.1
K2 = rbf_kernel(X2, X2, gamma=g)
K2t = K2*(K2>t)
G2 = nx.from_numpy_matrix(K2t)
graphplot = nx.draw(G2, X2, node_size=40, width=0.5)
plt.axis('equal')
plt.show()
# </SOL>
In [16]:
# <SOL>
D2t = np.diag(np.sum(K2t, axis=1))
L2t = D2t - K2t
w2t, v2t = np.linalg.eigh(L2t)
Z2t = v2t[:,0:2]
plt.scatter(Z2t[:,0], Z2t[:,1], s=20)
plt.show()
# <SOL>
In [17]:
from sklearn.cluster import KMeans
est = KMeans(n_clusters=2)
clusters = est.fit_predict(Z2t)
In [18]:
plt.scatter(X2[:, 0], X2[:, 1], c=clusters, s=50, cmap='rainbow')
plt.axis('equal')
plt.show()
The spectral clustering algorithm in Scikit-learn requires the number of clusters to be specified. It works well for a small number of clusters but is not advised when using many clusters and/or data.
Finally, we are going to run spectral clustering on both datasets. Spend a few minutes figuring out the meaning of parameters of the Spectral Clustering implementation of Scikit-learn:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
Note that there is not equivalent parameter to our threshold $t$, which has been useful for the graph representations. However, playing with $\gamma$ should be enough to get a good clustering.
The following piece of code executes the algorithm with an 'rbf' kernel. You can manually adjust the number of clusters and the parameter of the kernel to study the behavior of the algorithm. When you are done, you can also:
In [19]:
from sklearn.cluster import SpectralClustering
n_clusters = 4
gamma = .1 # Warning do not exceed gamma=100
SpClus = SpectralClustering(n_clusters=n_clusters,affinity='rbf',
gamma=gamma)
SpClus.fit(X)
plt.scatter(X[:, 0], X[:, 1], c=SpClus.labels_.astype(np.int), s=50,
cmap='rainbow')
plt.axis('equal')
plt.show()
In [20]:
nc = 2
gamma = 50 #Warning do not exceed gamma=300
SpClus = SpectralClustering(n_clusters=nc, affinity='rbf', gamma=gamma)
SpClus.fit(X2)
plt.scatter(X2[:, 0], X2[:, 1], c=SpClus.labels_.astype(np.int), s=50,
cmap='rainbow')
plt.axis('equal')
plt.show()
In [21]:
nc = 5
SpClus = SpectralClustering(n_clusters=nc, affinity='nearest_neighbors')
SpClus.fit(X2)
plt.scatter(X2[:, 0], X2[:, 1], c=SpClus.labels_.astype(np.int), s=50,
cmap='rainbow')
plt.axis('equal')
plt.show()
Bottom-up approach:
In practice, this creates a hierarchical tree, that can be visualized with a dendogram. We can cut the tree at different levels, in each case obtaining a different number of clusters.
We merge the two closest clusters, where the distance between clusters is defined as:
Hierarchical clustering may lead to clusters of very different sizes. Complete linkage is the worst strategy, while Ward gives the most regular sizes. However, the affinity (or distance used in clustering) cannot be varied with Ward, thus for non Euclidean metrics, average linkage is a good alternative.
There are at least three different implementations of the algorithm:
complete',
ward', and `average' linkage methods. Allows for the definition of connectivity constraints