In [1]:
import numpy as np
from matplotlib import pyplot as plt

# configure plotting
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8, 5)
matplotlib.rcParams['text.usetex'] = True
matplotlib.rcParams['font.size'] = 16
matplotlib.rcParams['font.family'] = 'serif'

np.random.seed(0)
import sys
sys.path.append('/home/james/work/gpclust/')

Mixture of Gaussians with GPclust

James Hensman, November 2014

This is a very simple demo to show how to use GPclust to build a mixture of Gaussians. We'll grab some toy data, fit the model, and do some very simple analysis of the posterior. We'll also have a look at the truncated Dirichlet process approximation, and how to run a merge-split method to find the optimal number of clusters. Finally we'll examine the effect of changing the priors over the cluster components.


In [2]:
X = np.load('twoD_clustering_example.numpy')
print X.shape


(506, 2)

We have a two-dimensional dataset, with 506 data points.
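Before we build any model, a quick scatter plot gives a feel for the cluster structure (just a minimal matplotlib sketch, no GPclust required):

plt.scatter(X[:, 0], X[:, 1], s=10, alpha=0.6)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')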

Building the model is as simple as importing the GPclust library and calling the class constructor with our data. The argument K specifies how many clusters to use.


In [3]:
import GPclust
m = GPclust.MOG(X, K=10)

For two-dimensional data, there's a handy built-in plot function:


In [4]:
m.plot()


Here, the data are colored according to the (most probable) assigned cluster, and the heavy line(s) show the contours of the model's probability density. Thin lines show the contours of the probability density of each component.

We can see that the model has not been fitted yet! The variational approximation to the posterior is initialized randomly, i.e. all data are randomly assigned across the clusters. To fit the approximation, call the optimize routine, and plot again.
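To confirm that the initialization really is random, we can peek at the assignment-probability matrix m.phi (described in more detail below); a quick sketch, assuming phi has one row per datum and one column per cluster:

hard = m.phi.argmax(1)                      # most probable cluster for each datum
print 'initial cluster sizes:', np.bincount(hard, minlength=10)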


In [5]:
m.optimize()


iteration 65 bound=1403.95251615 grad=4.21732408197e-07, beta=0.238937658726 vb converged (ftol)

In [6]:
m.plot()


Dirichlet process priors and the merge-split method

Instead of using a fixed number of clusters, we can place a Dirichlet process prior over the component weights: effectively a prior over an infinite number of clusters. The variational approximation is truncated, so we still only model a finite number of clusters, but now we can optimize for the number of components.

To help explore the search space, GPclust has a try_split function. This re-initializes the posterior with one cluster 'split' into two. Attempted splits are only accepted if the bound on the marginal likelihood increases. A wrapper around the try_split() method, systematic_splits(), repeatedly attempts splits.

Let's build a MOG with a DP prior, truncated at four components:


In [7]:
m = GPclust.MOG(X, K=4, prior_Z='DP', alpha=10.)
m.optimize()
m.plot()


iteration 54 bound=1144.56640134 grad=5.47203243145e-07, beta=0.583726432908 vb converged (gtol)

With only four components, the fit is quite poor. To attempt to split the $i^\textrm{th}$ component, call try_split(i). The method returns True if the split was successful, and displays the current status after each split attempt.


In [8]:
m.try_split(1)


attempting to split cluster  1
iteration 55 bound=1203.35998974 grad=2.65813830543e-07, beta=0.241221760971 vb converged (ftol)
split suceeded, bound changed by:  58.7935883981 , 1  new clusters (K=5)
optimizing new split to convergence:
iteration 1 bound=1203.3599898 grad=8.82237524687e-08, beta=0 vb converged (ftol)
Out[8]:
True

In [9]:
m.plot()


We can see that in attempting to split the first cluster, the bound on the marginal likelihood has increased and the move was accepted. To save having to repeatedly call the try_split method, there's a helpful systematic_splits() function, which iterates through each cluster and attempts to split it until the marginal-likelihood bound fails to increase. Here we'll turn the output off, as it can get a bit long.


In [10]:
m.systematic_splits(verbose=False)
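
Roughly speaking, systematic_splits() is doing something like the following sketch. This is only schematic, not a replacement for the built-in call, and it assumes the current number of components can be read off as m.phi.shape[1]:

improved = True
while improved:
    improved = False
    for i in range(m.phi.shape[1]):   # loop over the current components
        if m.try_split(i):            # a successful split increases the bound
            improved = True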

In [11]:
m.plot()


Getting at the parameters of the posterior

There are perhaps three main tasks that we'd like to do with the fitted model:

  • See which data are assigned to the same cluster
  • Examine the means (and variances) of the cluster components
  • Predict the probability density at a new point

The posterior over data assignments is stored in a matrix called phi. This matrix is NxK, and each element $\phi_{nk}$ represents the probability that the $n^\textrm{th}$ datum is assigned to the $k^\textrm{th}$ component. Here we'll visualize the posterior assignment probabilities in a heatmap-style plot.


In [12]:
phi = m.phi
_=plt.matshow(phi.T, cmap=plt.cm.hot, vmin=0, vmax=1, aspect='auto')
plt.xlabel('data index')
plt.ylabel('cluster index')
_=plt.colorbar()
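
The first task in the list above, seeing which data share a cluster, amounts to taking the argmax of each row of phi. A short sketch:

labels = phi.argmax(1)                                   # hard assignment for each datum
print 'cluster sizes:', np.bincount(labels)
print 'data sharing a cluster with datum 0:', np.where(labels == labels[0])[0][:10]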


The variational posterior for each of the components in the MOG model is a Gaussian-Wishart distribution. These are stored in the model as follows:


In [13]:
print 'first component mean:', m.mun[:,0]
print 'first component covariance:\n', m.Sns[:,:,0]


first component mean: [ 0.07422406  0.02492751]
first component covariance:
[[ 0.0158891  -0.03426786]
 [-0.03426786  0.10578672]]
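
The indexing above suggests that m.mun stores one column per component, so we can overlay all of the posterior component means on the data; a small sketch under that assumption:

plt.scatter(X[:, 0], X[:, 1], s=8, alpha=0.4)
plt.plot(m.mun[0, :], m.mun[1, :], 'kx', mew=2, ms=10)   # posterior means, one per component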

Finally, to obtain the predictive density of the model at any point, we can use m.predict. To get the density under each of the components, we can use m.predict_components.


In [14]:
test_point = np.array([[0.1, 0.0]])
density = m.predict(test_point)
print 'model density:',density

cw_density = m.predict_components(test_point)
print 'density under each component:', cw_density.round(3)


model density: [ 6.13456833]
density under each component: [[  5.58920000e+01   2.00000000e-02   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.05150000e+01   0.00000000e+00]]
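
To get a picture of the full predictive density, we can evaluate m.predict over a grid. A sketch, assuming m.predict accepts an (N, 2) array of test locations (as in the cell above) and returns one density value per row:

xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 60),
                     np.linspace(X[:, 1].min(), X[:, 1].max(), 60))
grid = np.vstack([xx.ravel(), yy.ravel()]).T        # (3600, 2) test locations
density = m.predict(grid)
plt.contour(xx, yy, density.reshape(xx.shape))
plt.scatter(X[:, 0], X[:, 1], s=6, alpha=0.3)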

Adjusting the prior parameters

The MOG model has priors over the component means and covariances (Gaussian-Wishart) and a Dirichlet (or Dirichlet-process) prior over the mixing proportions. We'll use the 2D dataset to illustrate how to change these prior parameters and the effect they have.

First, let's place a strong prior over the cluster covariances that forces them to be small, and choose a large concentration parameter, which makes the Dirichlet process prefer a large number of clusters.


In [17]:
m = GPclust.MOG(X, K=10, prior_S=np.eye(2)*5e-3, prior_v=30., prior_Z='DP', alpha=100)
m.optimize(verbose=False)
m.systematic_splits(verbose=False)
m.systematic_splits(verbose=False)
m.plot()
print 'bound on the marginal likelihood:', m.log_likelihood()


 bound on the marginal likelihood: 2913.89812473

We see that the model finds a very large number of small clusters! Now let's see what happens if we relax the concentration parameter, and reduce the strength of the Wishart prior:


In [18]:
m = GPclust.MOG(X, K=10, prior_S=np.eye(2)*5e-3, prior_v=10., prior_Z='DP', alpha=0.1)
m.optimize(verbose=False)
m.systematic_splits(verbose=False)
m.systematic_splits(verbose=False)
m.plot()
print 'bound on the marginal likelihood:', m.log_likelihood()


bound on the marginal likelihood: 1519.14956219

The MOG also takes a prior mean parameter (which defaults to the mean of the data) and a prior concentration parameter as part of the Gaussian-Wishart prior. We leave experimenting with these to the reader.
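As a starting point for such experiments, here is a small sketch that sticks to the constructor arguments already introduced above: it refits the model for a few arbitrary settings of the Wishart degrees of freedom and compares the resulting marginal-likelihood bounds:

for v in [5., 10., 30.]:
    m = GPclust.MOG(X, K=10, prior_S=np.eye(2)*5e-3, prior_v=v, prior_Z='DP', alpha=1.)
    m.optimize(verbose=False)
    m.systematic_splits(verbose=False)
    print 'prior_v =', v, ' bound:', m.log_likelihood()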

