Gaussian Mixture Models and Expectation Maximisation in Shogun

By Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de. Based on the GMM framework from the Google Summer of Code 2011 project of Alesis Novik - https://github.com/alesis

This notebook is about learning and using Gaussian Mixture Models (GMM) in Shogun. Below, we demonstrate how to use them for sampling, for density estimation via Expectation Maximisation (EM), and for clustering.

Note that Shogun's interfaces for mixture models are deprecated and will soon be replaced by more intuitive and efficient ones. This notebook therefore contains a bit of Python glue code in some places to compensate. However, all computations are done within Shogun itself.

Finite Mixture Models (skip if you just want code examples)

We begin by giving some intuition about mixture models. Consider an unobserved (or latent) discrete random variable $s$ taking one of $k$ states with probabilities $\text{Pr}(s=i)=\pi_i$ for $1\leq i \leq k$, and $k$ random variables $x_i|s_i$ with arbitrary densities or distributions, which are conditionally independent of each other given the state of $s$. In the finite mixture model, we model the probability or density of a single point $x$ being generated by the weighted mixture of the $x_i|s_i$

$$ p(x)=\sum_{i=1}^k\text{Pr}(s=i)p(x|s=i)=\sum_{i=1}^k \pi_i p(x|s=i) $$

which is simply the marginalisation over the latent variable $s$. Note that $\sum_{i=1}^k\pi_i=1$.

For example, for the Gaussian mixture model (GMM), we get (introducing a collection of parameters $\theta:=\{\boldsymbol{\mu}_i, \Sigma_i\}_{i=1}^k$ that collects the mean and covariance parameters of the $k$ individual Gaussian distributions)

$$ p(x|\theta)=\sum_{i=1}^k \pi_i \mathcal{N}(x|\boldsymbol{\mu}_i,\Sigma_i) $$
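To make this formula concrete, here is a small, self-contained NumPy sketch (independent of the Shogun code used in the rest of this notebook) that evaluates such a mixture density at a point; the weights, means, and covariances below are made-up values purely for illustration.


In [ ]:
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # density of N(x | mu, Sigma)
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff.dot(np.linalg.solve(Sigma, diff)))

# made-up parameters, purely for illustration
pi = np.array([0.5, 0.3, 0.2])                          # mixture weights, sum to one
mus = [np.array([-5., -4.]), np.array([7., 3.]), np.array([0., 0.])]
Sigmas = [np.eye(2), np.eye(2), 2 * np.eye(2)]

# p(x|theta) = sum_i pi_i N(x | mu_i, Sigma_i)
x = np.array([0., 0.])
p_x = sum(pi[i] * gaussian_pdf(x, mus[i], Sigmas[i]) for i in range(3))
print("mixture density at x: %f" % p_x)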

Note that any set of probability distributions on the same domain can be combined into such a mixture model. Note again that $s$ is an unobserved discrete random variable, i.e. we model the data as being generated from some weighted combination of baseline distributions. Interesting problems now are

  • Learning the weights $\text{Pr}(s=i)=\pi_i$ from data
  • Learning the parameters $\theta$ from data for a fixed family of $x_i|s_i$, for example for the GMM
  • Using the learned model (which is a density estimate) for clustering or classification

All of these problems are in the context of unsupervised learning since the algorithm only sees the plain data and no information on its structure.

Expectation Maximisation

Expectation Maximisation (EM) is a powerful method for learning any form of latent variable model, and it can be applied to the Gaussian mixture model. Standard methods such as maximum likelihood are not straightforward for latent variable models in general, while EM can almost always be applied. However, it may converge to local optima and does not guarantee a globally optimal solution (this can be mitigated with some tricks, as we will see later). While the general idea of EM stays the same for all models it is used on, the individual steps depend on the particular model.

The basic idea in EM is to maximise a lower bound, typically called the free energy, on the log-likelihood of the model. It does so by repeatedly performing two steps

  • The E-step optimises the free energy with respect to the latent variables $s_i$, holding the parameters $\theta$ fixed. This is done by setting the distribution over $s$ to the posterior given the observed data.
  • The M-step optimises the free energy with respect to the parameters $\theta$, holding the distribution over the $s_i$ fixed. This is done via maximum likelihood.

It can be shown that this procedure never decreases the likelihood, and that its stationary points (i.e. points where neither the E-step nor the M-step produces a change) correspond to local maxima of the model's likelihood. See the references for more details on the procedure and on how to obtain a lower bound on the log-likelihood. There exist many different flavours of EM, including variants where only subsets of the model are iterated over at a time. There is no learning rate such as a step size to tune, which is both good and bad: there is nothing to tweak, but convergence can be slow.
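To make the two steps concrete for the GMM case, the following is a didactic NumPy sketch of the vanilla EM updates (this is not Shogun's implementation, which we use below; the function name em_gmm_sketch and the initialisation scheme are illustrative choices). The E-step computes the responsibilities $r_{ni}=\text{Pr}(s=i|x_n)$, and the M-step re-estimates weights, means, and covariances from them.


In [ ]:
import numpy as np

def em_gmm_sketch(X, k, num_iter=50, seed=0):
    """Didactic EM for a GMM on data X (one point per row); not Shogun's implementation."""
    rng = np.random.RandomState(seed)
    n, d = X.shape

    # initialisation: uniform weights, random data points as means, data covariance for all components
    pi = np.ones(k) / k
    mus = X[rng.choice(n, k, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    def gaussian_pdf(X, mu, Sigma):
        # N(x | mu, Sigma) evaluated for every row of X
        diff = X - mu
        solved = np.linalg.solve(Sigma, diff.T).T
        norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        return norm_const * np.exp(-0.5 * np.sum(diff * solved, axis=1))

    for _ in range(num_iter):
        # E-step: responsibilities r[n, i] proportional to pi_i N(x_n | mu_i, Sigma_i)
        r = np.column_stack([pi[i] * gaussian_pdf(X, mus[i], Sigmas[i]) for i in range(k)])
        r /= r.sum(axis=1, keepdims=True)

        # M-step: maximum likelihood estimates given the responsibilities
        Nk = r.sum(axis=0)
        pi = Nk / n
        mus = np.dot(r.T, X) / Nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            Sigmas[i] = np.dot((r[:, i, None] * diff).T, diff) / Nk[i] + 1e-6 * np.eye(d)

    return pi, mus, Sigmas

In the rest of this notebook we rely on Shogun's built-in EM training instead; the sketch above is only meant to make the updates explicit.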

Mixtures of Gaussians in Shogun

The main class for GMMs in Shogun is CGMM, which provides an interface for setting up a model and sampling from it, as well as for learning the model (the $\pi_i$ and parameters $\theta$) via EM. It inherits from CDistribution, the base class for distributions in Shogun, and combines multiple single-distribution instances into a mixture.

We start by creating a GMM instance, sampling from it, and computing the log-likelihood of the model for some points, as well as the log-likelihood of each individual component. All of this is done in two dimensions so that we can plot the results, but it generalises easily to higher (or lower) dimensions.

Let's sample, and illustrate the difference between knowing and not knowing the latent variable that indicates the component.


In [ ]:
%pylab inline
%matplotlib inline
# import all Shogun classes
from modshogun import *

In [ ]:
from matplotlib.patches import Ellipse

# a tool for visualisation
def get_gaussian_ellipse_artist(mean, cov, nstd=1.96, color="red", linewidth=3):
    """
    Returns an ellipse artist for nstd times the standard deviation of this
    Gaussian, specified by mean and covariance
    """
    # compute eigenvalues (ordered)
    vals, vecs = eigh(cov)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]
    
    theta = numpy.degrees(arctan2(*vecs[:, 0][::-1]))

    # width and height are "full" widths, not radius
    width, height = 2 * nstd * sqrt(vals)
    e = Ellipse(xy=mean, width=width, height=height, angle=theta, \
               edgecolor=color, fill=False, linewidth=linewidth)
    
    return e

Set up the model in Shogun


In [ ]:
# create mixture of three Gaussians
num_components=3
num_max_samples=100
gmm=GMM(num_components)

dimension=2

# set means (TODO interface should be to construct mixture from individuals with set parameters)
means=zeros((num_components, dimension))
means[0]=[-5.0, -4.0]
means[1]=[7.0, 3.0]
means[2]=[0, 0.]
[gmm.set_nth_mean(means[i], i) for i in range(num_components)]

# set covariances
covs=zeros((num_components, dimension, dimension))
covs[0]=array([[2, 1.3],[1.3, 3]])  # covariance matrices must be symmetric
covs[1]=array([[1.3, -0.8],[-0.8, 1.3]])
covs[2]=array([[2.5, .8],[0.8, 2.5]])
[gmm.set_nth_cov(covs[i],i) for i in range(num_components)]

# set mixture coefficients, these have to sum to one (TODO these should be initialised automatically)
weights=array([0.5, 0.3, 0.2])
gmm.set_coef(weights)

Sampling from mixture models

Sampling is extremely easy since every instance of the CDistribution class in Shogun allows sampling from it (if implemented).


In [ ]:
# now sample from each component separately first, then from the joint model
hold(True)
colors=["red", "green", "blue"]
for i in range(num_components):
    # draw a number of samples from current component and plot
    num_samples=int(rand()*num_max_samples)+1
    
    # emulate sampling from one component (TODO fix interface of GMM to handle this)
    w=zeros(num_components)
    w[i]=1.
    gmm.set_coef(w)
    
    # sample and plot (TODO fix interface to have loop within)
    X=array([gmm.sample() for _ in range(num_samples)])
    plot(X[:,0], X[:,1], "o", color=colors[i])
    
    # draw 95% ellipse for current component
    gca().add_artist(get_gaussian_ellipse_artist(means[i], covs[i], color=colors[i]))
    
hold(False)
_=title("%dD Gaussian Mixture Model with %d components" % (dimension, num_components))
 
# reset the mixture coefficients, since we overwrote them to sample from individual components
gmm.set_coef(weights)

Evaluating densities in mixture models

Next, let us visualise the density of the joint model (which is a convex combination of the densities of the individual components). Note the similarity between the calls, since all distributions, including the mixture, implement the CDistribution interface.


In [ ]:
# generate a grid over the full space and evaluate the model's log-density on it
resolution=100
Xs=linspace(-10,10, resolution)
Ys=linspace(-8,6, resolution)
pairs=asarray([(x,y) for x in Xs for y in Ys])
# cluster() returns the per-component log-likelihood terms, followed by the model log-likelihood at index num_components
D=asarray([gmm.cluster(pairs[i])[num_components] for i in range(len(pairs))]).reshape(resolution,resolution)

figure(figsize=(18,5))
subplot(1,2,1)
pcolor(Xs,Ys,D)
xlim([-10,10])
ylim([-8,6])
title("Log-Likelihood of GMM")

subplot(1,2,2)
pcolor(Xs,Ys,exp(D))
xlim([-10,10])
ylim([-8,6])
_=title("Likelihood of GMM")

Density estimation with mixture models

Now let us draw samples from the mixture model itself rather than from individual components. This is the situation that usually occurs in practice: someone gives you a bunch of data with no labels attached to it at all. Our job is now to find structure in the data, which we will do with a GMM.


In [ ]:
# sample and plot (TODO fix interface to have loop within)
X=array([gmm.sample() for _ in range(num_max_samples)])
plot(X[:,0], X[:,1], "o")
_=title("Samples from GMM")

Imagine you did not know the true generating process of this data. What would you think just looking at it? There are clearly at least two components (or clusters) that might have generated this data, but three also looks reasonable. So let us try to learn a Gaussian mixture model on those.


In [ ]:
def estimate_gmm(X, num_components):
    # bring data into shogun representation (note that Shogun data is in column vector form, so transpose)
    features=RealFeatures(X.T)
    
    gmm_est=GMM(num_components)
    gmm_est.set_features(features)
    
    # learn GMM
    gmm_est.train_em()
    
    return gmm_est

So far so good. Now let's plot the density of this GMM using the code from above.


In [ ]:
component_numbers=[2,3]

# plot true likelihood
D_true=asarray([gmm.cluster(pairs[i])[num_components] for i in range(len(pairs))]).reshape(resolution,resolution)
figure(figsize=(18,5))
subplot(1,len(component_numbers)+1,1)
pcolor(Xs,Ys,exp(D_true))
xlim([-10,10])
ylim([-8,6])
title("True likelihood")

for n in range(len(component_numbers)):
    # TODO get rid of these hacks and offer nice interface from Shogun
    
    # learn GMM with EM
    gmm_est=estimate_gmm(X, component_numbers[n])
    
    # evaluate at a grid of points
    D_est=asarray([gmm_est.cluster(pairs[i])[component_numbers[n]] for i in range(len(pairs))]).reshape(resolution,resolution)
    
    # visualise densities
    subplot(1,len(component_numbers)+1,n+2)
    pcolor(Xs,Ys,exp(D_est))
    xlim([-10,10])
    ylim([-8,6])
    _=title("Estimated likelihood for EM with %d components"%component_numbers[n])

It is also possible to access the individual components of the mixture distribution. In our case, we can for example draw 95% ellipses for each of the Gaussians using the method from above. We will do this (and more) below.

On local minima of EM

It seems that three components give a density that is closest to the original one. While two components also do a reasonable job here, it sometimes happens that the upper two Gaussians are grouped into one (KMeans is used to initialise the cluster centres if this is not done by hand, using a random cluster initialisation); re-run the cell a couple of times to see this. This illustrates how EM can get stuck in a local minimum. We will do this below, where it might well happen that all runs produce the same result or different ones - there are no guarantees. Note that it is easily possible to initialise EM by specifying the parameters of the mixture components, as we did when creating the original model above.
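The following is a hedged sketch of such a manual initialisation, reusing the true means, covariances, and weights from above as the starting point. Whether train_em actually starts from these parameters rather than running its own KMeans initialisation may depend on the Shogun version, so treat this as illustrative only.


In [ ]:
# initialise EM by hand instead of relying on the KMeans initialisation (sketch)
gmm_manual=GMM(num_components)
gmm_manual.set_features(RealFeatures(X.T))

# start EM from the true parameters (any sensible guess would do)
[gmm_manual.set_nth_mean(means[i], i) for i in range(num_components)]
[gmm_manual.set_nth_cov(covs[i], i) for i in range(num_components)]
gmm_manual.set_coef(weights)

likelihood=gmm_manual.train_em()
print("likelihood of manually initialised EM run: %.2f" % likelihood)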

One way to decide which of multiple converged EM instances to use is to simply run many of them (with different initialisations) and then choose the one with the largest likelihood. WARNING: Do not select the number of components like this, as the model will overfit.


In [ ]:
# function to draw ellipses for all components of a GMM
def visualise_gmm(gmm, color="blue"):

    for i in range(gmm.get_num_components()):
        component=Gaussian.obtain_from_generic(gmm.get_component(i))
        gca().add_artist(get_gaussian_ellipse_artist(component.get_mean(), component.get_cov(), color=color))

In [ ]:
# multiple runs to illustrate that the random initialisation matters
for _ in range(3):
    figure(figsize=(18,5))
    
    subplot(1, len(component_numbers)+1, 1)
    plot(X[:,0],X[:,1], 'o')
    visualise_gmm(gmm, color="blue")  # show the true model's components (not an estimate)
    title("True components")
    
    for i in range(len(component_numbers)):
        gmm_est=estimate_gmm(X, component_numbers[i])
        
        subplot(1, len(component_numbers)+1, i+2)
        plot(X[:,0],X[:,1], 'o')
        visualise_gmm(gmm_est, color=colors[i])
        
        # TODO add a method to get likelihood of full model, retraining is inefficient
        likelihood=gmm_est.train_em()
        _=title("Estimated likelihood: %.2f (%d components)"%(likelihood,component_numbers[i]))

Clustering with mixture models

Recall that our initial goal was not to visualise mixture models (although that is already pretty cool) but to find clusters in a given set of points. All we need to do for this is to evaluate the log-likelihood of every point under every learned component and then pick the largest one. Shogun can do both. Below, we will illustrate both cases, obtaining a cluster index, and evaluating the log-likelihood for every point under each component.


In [ ]:
def cluster_and_visualise(gmm_est):
    # obtain cluster index for each point of the training data
    # TODO another hack here: Shogun should allow to pass multiple points and only return the index
    # as the likelihood can be computed via the individual components
    # In addition, argmax should be computed for us, although log-pdf for all components should also be possible
    clusters=asarray([argmax(gmm_est.cluster(x)[:gmm_est.get_num_components()]) for x in X])
    
    # visualise points by cluster
    hold(True)
    for i in range(gmm_est.get_num_components()):
        indices=clusters==i
        plot(X[indices,0],X[indices,1], 'o', color=colors[i])
        
    hold(False)

# learn gmm again
gmm_est=estimate_gmm(X, num_components)
    
figure(figsize=(18,5))
subplot(121)
cluster_and_visualise(gmm)
title("Clustering under true GMM")
subplot(122)
cluster_and_visualise(gmm_est)
_=title("Clustering under estimated GMM")

These are the clusterings obtained with the true mixture model and with the one learned via EM. There is a slight subtlety here: even the model under which the data was generated will not cluster the data correctly where the components overlap. This is because the cluster with the largest probability is always chosen, which leaves no room for ambiguity. If you are interested in cases where data overlaps, you should look at the log-likelihood of each point under each component and consider taking "draws" into account in the decision, i.e. cases where the probabilities of two different clusters are almost equal.
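One way to quantify this ambiguity is to turn the per-component values returned by cluster into normalised responsibilities per point. Below is a hedged sketch; it assumes, as the calls above suggest, that the first get_num_components() entries returned by cluster are the per-component log-likelihood terms (if these include the log mixture weights, the result is the exact posterior over components; otherwise it is a likelihood-proportional soft assignment). The 0.1 threshold is an arbitrary illustrative choice.


In [ ]:
# soft assignments: normalised per-component weights for each point (sketch)
def soft_assignments(gmm_model, X):
    k=gmm_model.get_num_components()
    R=zeros((len(X), k))
    for n in range(len(X)):
        log_terms=asarray(gmm_model.cluster(X[n]))[:k]
        # normalise in a numerically stable way
        p=exp(log_terms-log_terms.max())
        R[n]=p/p.sum()
    return R

R=soft_assignments(gmm_est, X)

# points whose two largest responsibilities are close are ambiguous
sorted_R=sort(R, axis=1)
ambiguous=sum(sorted_R[:,-1]-sorted_R[:,-2]<0.1)
print("number of ambiguous points: %d" % ambiguous)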

Below we plot all points, coloured by their likelihood under each component.


In [ ]:
figure(figsize=(18,5))

for comp_idx in range(num_components):
    subplot(1,num_components,comp_idx+1)
    
    # evaluate likelihood under the current component
    # TODO Shogun should do the loop and allow to specify component indices to evaluate pdf for
    # TODO distribution interface should be the same everywhere
    component=Gaussian.obtain_from_generic(gmm.get_component(comp_idx))
    cluster_likelihoods=asarray([component.compute_PDF(X[i]) for i in range(len(X))])
    
    # rescale likelihoods to [0,1] for colouring
    cluster_likelihoods-=cluster_likelihoods.min()
    cluster_likelihoods/=cluster_likelihoods.max()
    
    # plot, coloured by likelihood value
    cm=get_cmap("jet")
    hold(True)
    for j in range(len(X)):
        color = cm(cluster_likelihoods[j])
        plot(X[j,0], X[j,1]  ,"o", color=color) 
    hold(False)
    title("Data coloured by likelihood for component %d" % comp_idx)

Note how the lower-left and middle clusters overlap in the sense that points at their intersection have similar likelihoods under both components. If you do not care about this and are just interested in a partitioning of the space, simply choose the maximum.

Below we plot the space partitioning for a hard clustering.


In [ ]:
# compute cluster index for every point in space
D_est=asarray([gmm_est.cluster(pairs[i])[:num_components].argmax() for i in range(len(pairs))]).reshape(resolution,resolution)

# visualise clustering
cluster_and_visualise(gmm_est)

# visualise space partitioning
hold(True)
pcolor(Xs,Ys,D_est)
hold(False)

Coming soon

  • Split and merge Expectation Maximisation

TODO (bugfixes, interface issues, see also all the TODOs in the notebook)

  • Check Cholesky computation
  • Likelihoods should sum to one
  • Sample multiple points at once
  • Rename the cluster method to compute_log_likelihoods
  • Pull the model likelihood out of sample
  • Proper mixture model interface with a classical distribution and EM interface (E and M methods)
  • Dimension check in set_coef, and dimension checks in general
  • Check for initialisation before sampling
  • Eigen3-ize Gaussian and remove GaussianDistribution

References

[1]: Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.

[2]: Bishop, C. M., & others. (2006). Pattern recognition and machine learning. Springer New York.

[3]: Gatsby Computational Neurocience Unit lecture notes for Machine Learning (link)