Expectation Maximization and MNIST

Nazarov Ivan, 101мНОД (ИССА)


In [ ]:
## Add JS-based table of contents
from IPython.display import HTML as add_TOC
add_TOC( u"""<h1 id="tocheading">Table of Contents</h1></br><div id="toc"></div>
<script src="https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js"></script></br></hr></br>""" )

Preamble


In [ ]:
import os, time as tm, warnings
warnings.filterwarnings( "ignore" )
# from IPython.core.display import HTML
from IPython.display import display, HTML

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed( 569034853  )
## This is the correct way to use the random number generator,
##  since it allows finer control.
rand = np.random.RandomState( np.random.randint( 0x7FFFFFFF ) )

EM and MNIST

The $\TeX$ markup used here relies on the "align*" environment and thus may not render correctly when viewed through nbViewer.

Before proceeding, it seems pedagogically worthwhile (at least for myself) to review the EM algorithm and demonstrate its "correctness", so to speak.

A brief description of the EM algorithm

The EM algorithm seeks to maximize the likelihood by means of successive application of two steps: the E-step and the M-step.

For any probability measure $Q$ on the space of latent variables $Z$ with density $q$ the following holds:
\begin{align*}
\log p(X|\Theta) &= \int q(Z) \log p(X|\Theta)\, dZ = \mathbb{E}_{Z\sim q} \log p(X|\Theta) \\
&= \mathbb{E}_{Z\sim q} \log \frac{p(X,Z|\Theta)}{p(Z|X,\Theta)}
 = \mathbb{E}_{Z\sim q} \log \frac{q(Z)}{p(Z|X,\Theta)}
 + \mathbb{E}_{Z\sim q} \log \frac{p(X,Z|\Theta)}{q(Z)} \\
&= KL\bigl(q\|p(\cdot|X,\Theta)\bigr) + \mathcal{L}\bigl(q, \Theta\bigr)\,,
\end{align*}

since by Bayes' theorem $p(X,Z|\Theta) = p(Z|X,\Theta) p(X|\Theta)$. Call this identity the "master equation".

Now note that since the Kullback-Leibler divergence is always non-negative, one has the following inequality: $$\log p(X|\Theta) \geq \mathcal{L}\bigl(q, \Theta\bigr) \,.$$
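For completeness, the non-negativity of the Kullback-Leibler divergence follows from Jensen's inequality applied to the concave logarithm: $$ -KL\bigl(q\|p\bigr) = \mathbb{E}_{Z\sim q} \log \frac{p(Z)}{q(Z)} \leq \log \mathbb{E}_{Z\sim q} \frac{p(Z)}{q(Z)} = \log \int p(Z)\, dZ = \log 1 = 0 \,.$$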

Let's try to make the lower bound as large as possible by changing $\Theta$ and varying $q$. First note that the left-hand side of the master equation does not depend on $q$, whence maximization of $\mathcal{L}$ with respect to $q$ (with $\Theta$ fixed) is equivalent to minimization of $KL\bigl(q\|p(\cdot|X,\Theta)\bigr)$ with respect to $q$. Since $q$ is arbitrary, the minimizer $q^*_\Theta$ is given by $q^*(Z|\Theta) = p(Z|X,\Theta)$ for all $Z$.

Now at the optimal distribution $q^*_\Theta$ the master equation becomes $$ \log p(X|\Theta) = \mathcal{L}\bigl(q^*_\Theta, \Theta\bigr) = \mathbb{E}_{Z\sim q^*_\Theta} \log \frac{p(X,Z|\Theta)}{q^*(Z|\Theta)} = \mathbb{E}_{Z\sim q^*_\Theta} \log p(X,Z|\Theta) - \mathbb{E}_{Z\sim q^*_\Theta} \log q^*(Z|\Theta) \,, $$ for any $\Theta$. Thus the problem of log-likelihood maximization reduces to that of maximizing the sum of expectations on the right-hand side.

This new problem does not seem to be tractable in general since the optimization parameters $\Theta$ affect both the expected log-likelihood $\log p(X,Z|\Theta)$ under $Z\sim q^*_\Theta$ and the entropy of the optimal distribution of the latent variables $Z$.

An iterative procedure that alternates between the computation of $q^*_\Theta$ and the maximization over $\Theta$ might nevertheless be effective. Consider the following:

  • E-step: considering $\Theta_i$ as given and fixed find $q^*_{\Theta_i} = \mathop{\text{argmin}}_q\,\, KL\bigl(q\|p(\cdot|X,\Theta_i)\bigr)$ and set $q_{i+1} = q^*_{\Theta_i}$;
  • M-step: considering $q_{i+1}$ as given, solve $\mathcal{L}(q_{i+1},\Theta) \to \mathop{\text{max}}_\Theta$, where $$ \mathcal{L}(q,\Theta) = \mathbb{E}_{Z\sim q} \log p(X,Z|\Theta) - \mathbb{E}_{Z\sim q} \log q(Z) \,.$$

The fact that $q_{i+1}$ is considered fixed makes the optimization of $\mathcal{L}(q_{i+1},\Theta)$ equivalent to maximization of the expected log-likelihood, since the entropy term is fixed. Therefore the M-step becomes:

  • given $q_{i+1}$ find $\Theta^*_{i+1} = \mathop{\text{argmax}}_\Theta\,\, \mathbb{E}_{Z\sim q_{i+1}} \log p(X,Z|\Theta)$ and put $\Theta_{i+1} = \Theta^*_{i+1}$.
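Before specializing to MNIST, the alternation itself can be illustrated on a toy example. The sketch below (an illustration only, not part of the assignment) runs EM for a two-component univariate Gaussian mixture with known unit variances; the Bernoulli-mixture implementation actually used in this notebook appears further down.


In [ ]:
## A self-contained toy illustration of the E/M alternation (NOT the MNIST model):
##  EM for a two-component univariate Gaussian mixture with known unit variances.
import numpy as np

def toy_em( x, n_iter = 50 ) :
    pi, mu = np.array( [ 0.5, 0.5 ] ), np.array( [ -1.0, 1.0 ] )
    for _ in range( n_iter ) :
## E-step: responsibilities q_{sk} = p( z_s = k | x_s, pi, mu )
        dens = pi[ np.newaxis, : ] * np.exp( -0.5 * ( x[ :, np.newaxis ] - mu[ np.newaxis, : ] ) ** 2 )
        q = dens / dens.sum( axis = 1, keepdims = True )
## M-step: weights and means maximizing the expected complete-data log-likelihood
        pi, mu = q.mean( axis = 0 ), ( q * x[ :, np.newaxis ] ).sum( axis = 0 ) / q.sum( axis = 0 )
    return pi, mu

## Example: a balanced mixture centred at -2 and +2
print( toy_em( np.concatenate( ( np.random.randn( 500 ) - 2, np.random.randn( 500 ) + 2 ) ) ) )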

Now, if the latent variables are mutually independent, then the optimal $q$ must be factorizable into marginal densities and:

\begin{align*} KL\bigl(q\|p(\cdot|X,\Theta)\bigr) &= \mathbb{E}_{Z\sim q} \log q(Z) - \sum_j \mathbb{E}_{z_j\sim q_j} \log p(z_j|X,\Theta) \\ &= \sum_j \mathbb{E}_{z_j\sim q_j} \log q_j(z_j) - \sum_j \mathbb{E}_{z_j\sim q_j} \log p(z_j|X,\Theta) = \sum_j KL\bigl(q_j\|p_j(\cdot|X,\Theta)\bigr) \,, \end{align*}

where $q_j$ is the marginal density of $z_j$ in $q(Z)$ (the last term in the first line comes from the Fubini theorem).

Therefore the E-step could be reduced to a set of minimization problems with respect to one-dimensional density functions: $$ q_j^* = \mathop{\text{argmin}}_{q_j}\,\, KL\bigl(q_j\|p_j(\cdot|X,\Theta)\bigr) \,, $$ since the Kullback-Leibler divergence in this case is additively separable.

Correctness

Recall that the master equation is an identity: for all densities $q$ on $Z$ and for all admissible parameters $\Theta$ $$ \log p(X|\Theta) = KL\bigl(q\|p(\cdot|X,\Theta)\bigr) + \mathcal{L}\bigl(q, \Theta\bigr) \,.$$

Hence if after the E-step the Kullback-Leibler divergence is reduced: $$ KL\bigl(q'\|p(\cdot|X,\Theta)\bigr) \leq KL\bigl(q\|p(\cdot|X,\Theta)\bigr) \,,$$ then for the same set of parameters $\Theta$ one has $$ \mathcal{L}(q,\Theta) \leq \mathcal{L}(q',\Theta) \,.$$

Just after the E-step one has $q_{i+1} = p(Z|X,\Theta_i)$, whence $KL\bigl(q_{i+1}\|p(\cdot|X,\Theta_i)\bigr) = 0$. In turn, this implies via the master equation that the following equality holds: $$ \log p(X|\Theta_i) = \mathcal{L}(q_{i+1},\Theta_i) \,.$$

After the M-step, since $\Theta_{i+1}$ is a maximizer, or at least an "improver" of $\mathcal{L}(q_{i+1},\Theta)$ compared to its value at $(q_{i+1},\Theta_i)$, one has $$ \mathcal{L}(q_{i+1},\Theta_i) \leq \mathcal{L}(q_{i+1},\Theta_{i+1}) \,.$$

Therefore the effect of a single complete round of EM on the log-likelihood itself is: $$ \log p(X|\Theta_i) = \mathcal{L}(q_{i+1},\Theta_i) \leq \mathcal{L}(q_{i+1},\Theta_{i+1}) \leq \mathcal{L}(q_{i+2},\Theta_{i+1}) = \log p(X|\Theta_{i+1}) \,,$$ where each equality comes from the E-step of the respective round. This implies that EM indeed iteratively improves the log-likelihood.

Note that in the general case, without attaining zero Kullback-Leibler divergence at the E-step, one cannot be sure that the real log-likelihood is improved by each iteration; one can only say that $$ \mathcal{L}(q_{i+1},\Theta_i) \leq \log p(X|\Theta_i) \,,$$ which does not reveal a relationship with $\log p(X|\Theta_{i+1})$. And without the guarantee that EM improves the log-likelihood towards the maximum one cannot be sure about the consistency of the estimators. The key question is whether the lower bound $\mathcal{L}(q,\Theta)$ is any good.

Application of the EM to MNIST data

Each image is a random element in a discrete probability space $\Omega = \{0,1\}^{N\times M}$ with product-measure $$ \mathbb{P}(\omega) = \prod_{i=1}^N\prod_{j=1}^M \theta_{ij}^{\omega_{ij}} (1-\theta_{ij})^{1-\omega_{ij}} \,,$$ for any $\omega\in \Omega$. In particular $M=N=28$. Basically each bit of the image is independent of any other bit and each one is a Bernoulli random variable with parameter $\theta_{ij}$: $\omega_{ij}\sim \text{Bern}(\theta_{ij})$.
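For illustration, the log-probability of a single binarized image under this product measure can be computed along the following lines (the $\theta$ values below are made up for the example; the vectorized implementation used by the model appears later in the notebook).


In [ ]:
## Illustration: log-probability of one binary image under the product-Bernoulli
##  measure with parameter matrix theta (the values here are made up).
import numpy as np

N, M = 28, 28
theta = np.full( ( N, M ), 0.1 )                 # a hypothetical parameter matrix
omega = ( np.random.rand( N, M ) < 0.1 ) * 1     # a random binary "image"
## log P(omega) = sum_ij [ omega_ij log theta_ij + (1 - omega_ij) log(1 - theta_ij) ]
print( np.sum( np.where( omega > 0, np.log( theta ), np.log( 1 - theta ) ) ) )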

Let's apply the EM algorithm to this dataset. The proposed model is the following.

Consider a mixture model of discrete probability spaces. Suppose there are $K$ components in the mixture. Then each image is distributed according to the following law: $$p(\omega|\Theta) = \sum_{k=1}^K \pi_k p_k(\omega|\theta_k) = \sum_{k=1}^K \pi_k \prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{\omega_{ij}} (1-\theta_{kij})^{1-\omega_{ij}}\,,$$ where $\theta_{kij}$ is the parameter of the probability distribution of the $(i,j)$-th random variable (pixel) in the $k$-th class, and $\pi_k$ is the (prior) probability that the $k$-th mixture component generates a random element, with $\sum_{k=1}^K \pi_k = 1$.

Suppose $X=(x_s)_{s=1}^n \in \Omega^n$ is the dataset. The log-likelihood is given by $$ \log p(X|\Theta) = \sum_{s=1}^n \log \sum_{k=1}^K \pi_k \prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{x_{sij}} (1-\theta_{kij})^{1-x_{sij}} \,,$$ where $x_{sij}\in\{0,1\}$ is the value of the $(i,j)$-th pixel of the $s$-th observation.

If the source components $Z=(z_s)_{s=1}^n$ of the mixture at each data point were known, then the log-likelihood would have been $$ \log p(X,Z|\Theta) = \sum_{s=1}^n \log \prod_{k=1}^K \Bigl[ \pi_k \prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{x_{sij}} (1-\theta_{kij})^{1-x_{sij}} \Bigr]^{1_{z_s = k}} \,,$$ where $1_{z_s = k}$ is the indicator, taking the value $1$ if $z_s = k$ and $0$ otherwise ($1_{\{k\}}(z_s)$ is another notation).

The log-likelihood simplifies to $$ \log p(X,Z|\Theta) = \sum_{s=1}^n \sum_{k=1}^K 1_{z_s = k} \Bigl( \log \pi_k + \sum_{i=1}^N \sum_{j=1}^M \bigl( x_{sij} \log \theta_{kij} + (1-x_{sij}) \log (1-\theta_{kij}) \bigr) \Bigr) \,,$$ and further into a more separable form $$ \log p(X,Z|\Theta) = \sum_{s=1}^n \sum_{k=1}^K 1_{z_s = k} \log \pi_k + \sum_{s=1}^n \sum_{k=1}^K 1_{z_s = k} \Bigl( \sum_{i=1}^N \sum_{j=1}^M x_{sij} \log \theta_{kij} + \sum_{i=1}^N \sum_{j=1}^M (1-x_{sij}) \log (1-\theta_{kij}) \Bigr) \,.$$

The expected log-likelihood under $z_s\sim q_s$ with $\mathbb{P}(z_s=k|X) = q_{sk}$ is given by $$ \mathbb{E}\log p(X,Z|\Theta) = \sum_{s=1}^n \sum_{k=1}^K q_{sk} \log \pi_k + \sum_{s=1}^n \sum_{k=1}^K q_{sk} \sum_{i=1}^N \sum_{j=1}^M \bigl( x_{sij} \log \theta_{kij} + (1-x_{sij}) \log (1-\theta_{kij}) \bigr) \,.$$

Analytic solution: E-step

At the E-step one must compute $\hat{q}_{sk} = \mathbb{P}(z_s=k|X,\Theta)$ based on the value of $\Theta = ((\pi_k), (\theta_{kij}))$: $$\hat{q}_{sk} = \frac{p(x_s|z_s=k,\Theta)\, p(z_s=k)}{\sum_{l=1}^K p(x_s|z_s=l,\Theta)\, p(z_s=l)} \propto \pi_k \prod_{i=1}^N \prod_{j=1}^M \theta_{kij}^{x_{sij}} (1-\theta_{kij})^{1-x_{sij}} \,,$$ and $$ q^*(Z) = \prod_{s=1}^n \hat{q}_{s z_s} \,.$$ Note that the denominator is the likelihood of the $s$-th observation, so the sum of its logarithms over $s$ gives the log-likelihood of the data.

In order to improve numerical stability and avoid numerical underflow it is better to use the following procedure for computing the conditional probability: $$ l_{sk} = \sum_{i=1}^N \sum_{j=1}^M \log \bigl( \theta_{kij}^{x_{sij}} (1-\theta_{kij})^{1-x_{sij}} \bigr) \,,$$ set $l^*_s = \max_k l_{sk}$ and compute the log-sum $$ \hat{l}_s = \log \sum_{k=1}^K \text{exp}\Bigl\{ ( l_{sk} - l^*_s ) + \log \pi_k \Bigr\} \,,$$ and then compute the conditional distribution: $$ \hat{q}_{sk} = \text{exp}\Bigl\{ l_{sk} + \log \pi_k - ( \hat{l}_s + l^*_s ) \Bigr\} \,.$$

This seemingly redundant subtraction and addition of $l^*_s$ helps avoid underflow during the numerical exponentiation. After this sanitization the E-step's optimal distribution is numerically accurate.

Suppose $l^*_s \gg l_{sk}$ for all $k$ except the maximizing one, $k^*$ say, for which $l_{sk^*} = l^*_s$, and drop the prior terms $\log \pi_k$ for clarity. Then the remaining terms underflow at the sum-exp step, so for some very small $\epsilon \geq 0$ one has $$ \hat{l}_s = \log (1+\epsilon) \,,$$ whence $$ \hat{q}_{sk} = \text{exp}\bigl\{l_{sk} - ( \hat{l}_s + l^*_s ) \bigr\} = (1+\epsilon)^{-1} \cdot \text{exp}\bigl\{l_{sk} - l^*_s \bigr\} \,.$$ For $k=k^*$ one has $\hat{q}_{sk} = \frac{1}{1+\epsilon}\approx 1$, and for $k\neq k^*$ -- $\hat{q}_{sk} = \frac{\eta}{1+\epsilon} \approx 0$ for some extremely small $\eta>0$.
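Here is a tiny numerical illustration of why the max-subtraction matters (the per-class log-likelihood values below are made up for the example).


In [ ]:
## Illustration of the max-subtraction ("log-sum-exp") trick on made-up numbers.
import numpy as np

l_sk = np.array( [ -1000.0, -1001.0, -1050.0 ] )     # hypothetical per-class log-likelihoods
## Naive exponentiation underflows to zero and the normalization degenerates to 0/0.
print( np.exp( l_sk ) / np.exp( l_sk ).sum( ) )      # [ nan  nan  nan ]
## Subtracting the largest exponent keeps at least one term equal to one.
l_star = l_sk.max( )
print( np.exp( l_sk - l_star ) / np.exp( l_sk - l_star ).sum( ) )  # approx. [ 0.731  0.269  0. ]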

The variables in the code have the following dimensions:

  • $\theta \in [0,1]^{K\times (N\times M)}$;
  • $\pi \in [0,1]^{1\times K}$;
  • $x \in \{0,1\}^{n\times (N\times M)}$;
  • $z \in [0,1]^{n\times K}$.
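As a quick sanity check of these conventions (with random placeholder arrays only), the contraction $\sum_s z_{sk}\, x_{sij}$ used in the M-step code further below indeed produces a $K\times(N\times M)$ array:


In [ ]:
## Shape sanity check with random placeholder arrays (illustration only).
import numpy as np

n, K, N, M = 5, 3, 28, 28
theta = np.random.uniform( size = ( K, N * M ) )    # K x (N*M)
pi    = np.full( ( 1, K ), 1.0 / K )                # 1 x K
x     = ( np.random.rand( n, N * M ) < 0.5 ) * 1    # n x (N*M)
z     = np.full( ( n, K ), 1.0 / K )                # n x K
print( np.tensordot( z, x, ( 0, 0 ) ).shape )       # (3, 784), i.e. the shape of theta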

Wrappers required for the assignment.


In [ ]:
## A bunch of wrappers to match the task specifications
def posterior( x, clusters ) :
    pi = np.ones( clusters.shape[ 0 ], dtype = np.float ) / clusters.shape[ 0 ]
    q, ll = __posterior( x, theta = clusters, pi = pi )
    return q

## The likelihood is a byproduct of the E-step's minimization of Kullback-Leibler
def likelihood( x, clusters ) :
    pi = np.ones( clusters.shape[ 0 ], dtype = np.float ) / clusters.shape[ 0 ]
    q, ll = __posterior( x, theta = clusters, pi = pi )
    return np.sum( ll )

Classify using the maximum aposteriori rule.


In [ ]:
## Classifier
def classify( x, theta, pi = None ) :
    pi = pi if pi is not None else np.ones( theta.shape[ 0 ], dtype = np.float ) / theta.shape[ 0 ]
## Compute the posterior probabilities of the data
    q_sk, ll_s = __posterior( x, theta = theta, pi = pi )
## Classify according to max posterior:
    c_s = np.argmax( q_sk, axis = 1 )
    return c_s, q_sk, ll_s

A procedure to compute the log-likelihood of each observation with respect to each mixture component. Used in the posterior computation.


In [ ]:
def __component_likelihood( x, theta ) :
## Unfortunately sometimes there can be negative machine zeros, which
##  spoil the log-likelihood computation by poisoning with NANs.
## That is why the theta array is restricted to [0,1].
    theta_clipped = np.clip( theta, 0.0, 1.0 )
## Iterate over classes
    ll_sk = np.zeros( ( x.shape[ 0 ], theta.shape[ 0 ] ), dtype = np.float )
## Make a binary mask of the data
    mask = x > 0
    for k in xrange( theta.shape[ 0 ] ) :
## Note that the power formulation is just a mathematically convenient way of
##  writing \theta if x=1 or (1-\theta) otherwise.
        ll_sk[ :, k ] = np.sum( np.where( mask,
            np.log( theta_clipped[ k ] ), np.log( 1 - theta_clipped[ k ] ) ), axis = ( 1, ) )
    return ll_sk

The actual procedure for computing the E-step: the conditional distribution and the log-likelihood scores.


In [ ]:
## The core procedure for computing the conditional density of classes
def __posterior( x, theta, pi ) :
## Get the log-likelihoods of each observation in each mixture component.
    ll_sk = __component_likelihood( x, theta )
## Find the largest unnormalized probability.
    llstar_s = np.reshape( np.max( ll_sk, axis = ( 1, ) ), ( ll_sk.shape[ 0 ], 1 ) )
## Subtract the largest exponent
    ll_sk -= llstar_s
## In the rare case when the largest exponent is -Inf, force the differences
##  to zero. This effectively treats such observations as having uniform likelihood
##  across classes. This way the priors don't get masked by really small numbers.
##  I could've used ``np.nan_to_num( ll_sk - llstar_s )'' but it actually copies
##  the ll_sk array.
    ll_sk[ np.isnan( ll_sk ) ] = 0.0
## Don't forget to add the log-prior probability (Numpy broadcasting applies!).
##  Adding priors before dealing with infinities would mask them and yield
##  incorrect estimates of the log-likelihoods!
    ll_sk += np.log( np.reshape( pi, ( 1, ll_sk.shape[ 1 ] ) ) )
## Compute the log-sum-exp of the individual log-likelihoods. Negative infinities
##  resolve to 0.0 while the largest exponent resolves to a one. This step cannot
##  produce NaNs
    ll_s = np.reshape( np.log( np.sum( np.exp( ll_sk ), axis = ( 1, ) ) ), ( ll_sk.shape[ 0 ], 1 ) )
## The sum-exp could never be anything lower than 1, since at least one
##  element of each row of ll_sk has to be lstar_s, whence the respective
##  difference should be zero and the exponent -- 1. Thus even if the
##  rest of the sum is close to machine zero, the logarithm would still
##  return 0.
## Normalise the likelihoods to get conditional probability, and compute
##  the sum of the log-denominator, which is the log-likelihood.
    return np.exp( ll_sk - ll_s ), ll_s + llstar_s

Analytic solution: M-step

At the M-step for some fixed $q(Z)$ one solves $\mathbb{E}\log p(X,Z|\Theta)\to \max_\Theta$ subject to $\sum_{k=1}^K \pi_k = 1$ which is a convex optimization problem with respect to $\Theta$, since the log-likelihood as a linear combination of convex functions is convex. The first order condition is $\sum_{s=1}^n \frac{q_{sk}}{\pi_k} - \lambda = 0$ for all $k=1,\ldots,K$, whence $ \lambda = \sum_{s=1}^n \sum_{l=1}^K q_{sl} = n $ and finally $$ \hat{\pi}_k = \frac{\sum_{s=1}^n q_{sk}}{n} \,.$$

For $\theta_{kij}$, $i=1,\ldots,N$, $j=1,\ldots,M$ and $k=1,\ldots,K$ the FOC is $$ \sum_{s=1}^n q_{sk} \frac{x_{sij}}{\theta_{kij}} - \sum_{s=1}^n q_{sk} \frac{1-x_{sij}}{1-\theta_{kij}} = 0 \,,$$ whence $$ \hat{\theta}_{kij} = \frac{\sum_{s=1}^n q_{sk} x_{sij}}{ \sum_{s=1}^n q_{sk} } = \frac{\sum_{s=1}^n q_{sk} x_{sij}}{ n \hat{\pi}_k } \,.$$

This M-step procedure is implemented below.


In [ ]:
## The M-step is simple: just compute the optimal parameters under
##  the current conditional distribution of the latent variables.
def __learn_clusters( x, z ) :
## The prior class probabilities
    pi = z.sum( axis = ( 0, ) )
## Pixel probabilities conditional on the class
    theta = np.tensordot( z, x, ( 0, 0 ) ) / pi.reshape( ( pi.shape[ 0 ], 1 ) )
## Return: regularization should be done at **E**-step!
#     return np.clip( theta, 1.0/784, 1.0 - 1.0/784 ), pi / x.shape[ 0 ]
    return theta, pi / x.shape[ 0 ]

A wrapper to match the assignment specifications.


In [ ]:
## A wrapper for the above function
def learn_clusters( x, z ) :
    theta, pi = __learn_clusters( x, z )
## Just return theta: in the conditional model the pi are fixed.
    return theta

As mentioned earlier, the EM algorithm alternates between the E and M steps until convergence.


In [ ]:
## A wrapper for the core em algorithm below
def em_algorithm( x, K, maxiter, verbose = True, rel_eps = 1e-4, full = False ) :
## Initialize the model parameters with uniform [0.25,0.75] random numbers
    theta_1 = rand.uniform( size = ( K, x.shape[ 1 ] ) ) * 0.5 + 0.25
    pi_1 = None if not full else np.ones( K, dtype = np.float ) / K
## Run the em algorithm
    tick = tm.time( )
    ll, theta, pi, status = __em_algorithm( x, theta_1 = theta_1,
        pi_1 = pi_1, niter = maxiter, rel_eps = rel_eps, verbose = verbose )
    tock = tm.time( )
    print( "total %.3f, %.3f/iter" % ( ( tock - tick ), ( tock - tick ) / len( ll ), ) )
## Return the history of theta and the log-likelihood history
    if verbose :
        if status[ 'status' ] != 0 :
            print "Convergence not achieved. %d" % ( status[ 'status' ], )
    if full :
        return ( theta, pi ), ll
    return theta, ll

The procedure above actually invokes the true EM core, defined below.


In [ ]:
## The core of the EM algorithm
def __em_algorithm( x, theta_1, pi_1 = None, niter = 1000, rel_eps = 1e-4, verbose = True ) :
## If we were supplied with an initial estimate of the prior distribution,
##  then assume the full model is needed.
    full_model = pi_1 is not None
## If the prior cluster probabilities are not supplied, assume uniform distribution.
    pi_1 = pi_1 if full_model else np.ones( theta_1.shape[ 0 ], dtype = np.float ) / theta_1.shape[ 0 ]
## Allocate the necessary space for the history of model estimates
    theta_hist, pi_hist = theta_1[ np.newaxis ].copy( ), pi_1[ np.newaxis ].copy( )
    ll_hist = np.asarray( [ -np.inf ], dtype = np.float )
## Set "old" estimated to zero. At this line the current estimates are in fact
##  the initially provided ones.
    theta_0, pi_0 = np.zeros_like( theta_1 ), np.zeros_like( pi_1 )
## Initialize the loop
    status, kiter, rel_theta, rel_pi, ll = -1, 0, np.nan, np.nan, -np.inf
    while kiter < niter :
## Dump the current estimators and other information.
        if verbose :
            print( "Iteration %d: avg. log-lik: %.3f, $\\Theta$ div. %.3f, $\\Pi$ div. %.3f" % (
                kiter, ll / x.shape[ 0 ], rel_theta, rel_pi ) )
            show_data( theta_1 - theta_0 if True else theta_1, n = theta_0.shape[ 0 ],
                n_col = min( 10, theta_0.shape[ 0 ] ), cmap = plt.cm.hot, interpolation = 'nearest' )
## The convergence criterion is the L^∞ norm of relative L^1 errors
        if max( rel_pi, rel_theta ) < rel_eps :
            status = 0
            break ;
## Overwrite the initial estimates
        theta_0, pi_0 = theta_1, pi_1
## E-step: call the core posterior function to get both the log-likelihood
##  and the estimate of the conditional distribution.
        z_1, ll_s = __posterior( x, theta_0, pi_0 )
## Sum the individual log-likelihoods of observations
        ll = ll_s.sum()
## M-step: compute the optimal parameters under the current estimate of the posterior
        theta_1, pi_1 = __learn_clusters( x, z_1 )
## Discard the computed estimate of pi if the model is discriminative (conditional likelihood).
        if not full_model :
            pi_1 = pi_0
## Record the current estimates to the history
        theta_hist = np.vstack( ( theta_hist, theta_1[np.newaxis] ) )
        pi_hist = np.vstack( ( pi_hist, pi_1[np.newaxis] ) )
        ll_hist = np.append( ll_hist, ll )
## Check for bad float numbers
        if not ( np.all( np.isfinite( theta_1 ) ) and np.all( np.isfinite( pi_1 ) ) ) :
            status= -2
            break ;
## Check convergence: L^1 relative error. If the relative margin is exactly
##  zero, then return NaNs. This makes the loop exhaust all iterations, since
##  any comparison against a NaN returns False.
        rel_theta = np.sum( np.abs( theta_1 - theta_0 ) / ( np.abs( theta_0 ) + rel_eps ) ) if rel_eps > 0 else np.nan
        rel_pi = np.sum( np.abs( pi_1 - pi_0 ) / ( np.abs( pi_0 ) + rel_eps ) ) if rel_eps > 0 else np.nan
## Next iteration
        kiter += 1
    return ll_hist, theta_hist, pi_hist, { 'status': status, 'iter': kiter }

Define a convenient procedure for running experiments. Setting the relative tolerance to zero forces the algorithm to exhaust all the allocated iterations.


In [ ]:
def experiment( data, K, maxiter, verbose = True, until_convergence = False, full = False ) :
## Run the EM
    return em_algorithm( data, K, maxiter, rel_eps = 1.0e-4 if until_convergence else 0.0, verbose = verbose, full = full )

Miscellanea: visualization

In order to be able to plot more flexibly, define another arranger.


In [ ]:
## A more flexible image arrangement
def arrange_flex( images, n_row = 10, n_col = 10, N = 28, M = 28, fill_value = 0 ) :
## Create the final grid of images row-by-row
    im_grid = np.full( ( n_row * N, n_col * M ), fill_value, dtype = images.dtype )
    for k in range( min( images.shape[ 0 ], n_col * n_row ) ) :
## Get the grid cell at which to place the image
        i, j = ( k // n_col ) * N, ( k % n_col ) * M
## Just put the image in the cell
        im_grid[ i:i+N, j:j+M ] = np.reshape( images[ k ], ( N, M, ) )
    return im_grid

The following pair of procedures is used to plot the digits in a clear manner. The first one just creates a canvas for the image: it sets up both axes properly and adds labels to them.


In [ ]:
def setup_canvas( axis, n_row, n_col, N = 28, M = 28 ) :
## Setup major tick marks to the seam between images and disable their labels
    axis.set_yticks( np.arange( 1, n_row + 1 ) * N, minor = False )
    axis.set_xticks( np.arange( 1, n_col + 1 ) * M, minor = False )
    axis.set_yticklabels( [ ], minor = False ) ; axis.set_xticklabels( [ ], minor = False )
## Set minor ticks so that they are exactly between the major ones
    axis.set_yticks( ( np.arange( n_row + 1 ) + 0.5 ) * N, minor = True )
    axis.set_xticks( ( np.arange( n_col + 1 ) + 0.5 ) * M, minor = True )
## Make their labels into cell x-y coordinates
    axis.set_yticklabels( [ "%d" % (i,) for i in 1+np.arange( n_row + 1 ) ], minor = True )
    axis.set_xticklabels( [ "%d" % (i,) for i in 1+np.arange( n_col + 1 ) ], minor = True )
## Tick marks should be oriented outward
    axis.tick_params( axis = 'both', which = 'both', direction = 'out' )
## Return nothing!
    axis.grid( color = 'white', linestyle = '--' )

This procedure displays the images on a nice plot. Used for one-line visualization.


In [ ]:
def show_data( data, n, n_col = 10, transpose = False, **kwargs ) :
## Get the number of rows necessary to plot the needed number of images
    n_row = ( n + n_col - 1 ) // n_col
## Transpose if necessary
    if transpose :
        n_col, n_row = n_row, n_col
## Set the dimensions of the figure
    fig = plt.figure( figsize = ( n_col, n_row ) )
    axis = fig.add_subplot( 111 )
## Plot!
    setup_canvas( axis, n_row, n_col )
    axis.imshow( arrange_flex( data[:n], n_col = n_col, n_row = n_row ), **kwargs )
## Plot
    plt.show( )

In [ ]:
def visualize( data, clusters, ll, n_col = 2, plot_ll = True ) :
## Display the result
    print "Final conditional log-likelihood value per observation achieved %f in %d iteration(s)" % (
        ll[-1] / data.shape[ 0 ], len( ll ) )
## Plot the first difference of average log-likelihood
    if plot_ll :
        plt.figure( figsize = ( 12, 7 ) )
        ax = plt.subplot(111)
        ax.set_title( r"avg. log-likelihood change between successive iterations (log scale)" )
        ax.plot( np.diff( ll / data.shape[ 0 ]  ) )
#         ax.set_ylabel( r"$\Delta_i \frac{1}{n} \sum_{s=1}^n \mathbb{E}_{z_s\sim q_i} \log p(x_s,z_s|\Theta_i)$" )
        ax.set_ylabel( r"$\Delta_i \frac{1}{n} \sum_{s=1}^n \log p(x_s|\Theta_i)$" )
        ax.set_yscale( 'log' )
## Plot the final estimates
    if n_col > 0 :
        show_data( clusters[-1], n = clusters.shape[1], n_col = n_col, cmap = plt.cm.spectral, interpolation = 'nearest' )

Miscellanea: animating the EM

This function creates an animation of successive iterations of a run of the EM.


In [ ]:
def animate( theta, ll, pi = None, n_col = 10, n_row = 10, interval = 1, **kwargs ) :
## Create a background
    bg = arrange_flex( np.zeros_like( theta[ 0 ] ), n_col = n_col, n_row = n_row )
## Compute log-likelihood differences and sanitize them.
    ll_diff = np.maximum( np.diff( ll ), np.finfo(np.float).eps )
    ll_diff[ ~np.isfinite( ll_diff ) ] = np.nan
## Set up the figure, the axis, and the plot elements we want to animate
    fig = plt.figure( figsize = ( 12, 12 ) )
## Create the subplots and position them specifically
    if pi is None :
        ax1, ax3, ax2 = fig.add_subplot( 311 ), fig.add_subplot( 312 ), fig.add_subplot( 313 )
    else :
        ax1, ax4 = fig.add_subplot( 411 ), fig.add_subplot( 412 )
        ax3, ax2 = fig.add_subplot( 413 ), fig.add_subplot( 414 )
## Initialize different ranges for the image artists
    setup_canvas( ax1, n_row = n_row, n_col = n_col )
    ax1.set_title( r"Current estimate of the mixture components" )
    setup_canvas( ax2, n_row = n_row, n_col = n_col )
    ax2.set_title( r"Change between successive iterations" )
## Initialize geometry for the delta log-likelihood plot.
    ax3.set_xlim( -0.1, ll.shape[ 0 ] + 0.1 )
    ax3.set_yscale( 'log' ) #; ax3.set_yticklabels( [ ] )
    ax3.set_title( r"Change between successive iterations of EM (log scale)" )
    ax3.set_ylabel( r"$\Delta_i \sum_{s=1}^n \log p(x_s|\Theta_i)$" )
    ax3.set_ylim( np.nanmin( ll_diff ) * 0.9, np.nanmax( ll_diff ) * 1.1 )
    ax3.grid( )
## Setup a plot for prior probabilities
    if pi is not None :
        classes = 1 + np.arange( len( pi[ 0 ] ) )
        ax4.set_xticks( classes )
        ax4.set_ylim( 0.0, 1.0 )
        ax4.set_title( r"Current estimate of the mixture weights" )
        ba1 = ax4.bar( classes, pi[ 0 ], align = "center" )
## Setup the artists
    im1 = ax1.imshow( bg, vmin = +0.0, vmax = +1.0, **kwargs )
    im2 = ax2.imshow( bg, vmin = -1.0, vmax = +1.0, **kwargs )
    line1, = ax3.plot( [ ], linestyle = "-", color = 'blue' )
## Animation function.  This is called sequentially
    def update( i ) :
## Compute the frame; copy theta[ 0 ] so that the in-place normalization
##  below does not overwrite the stored history.
        frame = theta[ i ] - theta[ i-1 ] if i > 0 else theta[ 0 ].copy( )
        frame /= np.max( np.abs( frame ) )
## Draw frames on the image artists
        im1.set_data( arrange_flex( theta[ i ], n_col = n_col, n_row = n_row ) )
        im2.set_data( arrange_flex( frame,      n_col = n_col, n_row = n_row ) )
        if i > 0 :
## Show history on the line artist
            line1.set_data( np.arange( i ), ll_diff[ :i ] )
        if pi is not None :
            [ b.set_height( h ) for b, h in zip( ba1, pi[ i ] ) ]
            if i > 0 :
                [ b.set_color( 'green' if h > p else 'red' ) for b, h, p in zip( ba1, pi[ i ], pi[ i-1 ] ) ]
## Return an iterator of artists in this frame
            return ( im1, im2, line1, ) + ba1
        return im1, im2, line1,
## Call the animator.
    return animation.FuncAnimation( fig, update, frames = theta.shape[ 0 ], interval = interval, blit = True )

Define a function that produces a video (using ffmpeg) and embeds it as HTML into IPython.


In [ ]:
## Make simple animations of the EM estimators
## http://jakevdp.github.io/blog/2013/05/12/embedding-matplotlib-animations/
## http://jakevdp.github.io/blog/2012/08/18/matplotlib-animation-tutorial/
from matplotlib import animation
from IPython.display import HTML
from tempfile import NamedTemporaryFile

def embed_video( anim ) :
    VIDEO_TAG = """<video controls autoplay muted loop><source src="data:video/x-m4v;base64,{0}"
        type="video/mp4">Your browser does not support the video tag.</video>"""
    plt.close( anim._fig )
    if not hasattr( anim, '_encoded_video' ) :
        ffmpeg_writer = animation.FFMpegWriter( )
        with NamedTemporaryFile( suffix = '.mp4' ) as f:
            anim.save( 'myanim.mp4', fps = 12, extra_args = [ '-vcodec', 'libx264' ] )# , writer = ffmpeg_writer )
            video = open( 'myanim.mp4', "rb" ).read( )
        anim._encoded_video = video.encode( "base64" )
    return HTML( VIDEO_TAG.format( anim._encoded_video ) )

Miscellanea: obtaining the data

Try to download the MNIST data from the SciKit repository.


In [ ]:
if False :
## Fetch MNIST dataset from SciKit and create a local copy.
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata( "MNIST original", data_home = './data/' )
    np.savez_compressed('./data/mnist/mnist_scikit.npz', data = mnist.data, labels = mnist.target )

Or obtain the data from the provided CSV files.


In [ ]:
if False :
## The procedure below loads the MNIST data from a comma-separated text file.
    def load_mnist_from_csv( filename ) :
## Read the CSV file
        data = np.loadtxt( open( filename, "rb" ), dtype = np.short, delimiter = ",", skiprows = 0 )
## Peel off the labels
        return data[:,1:], data[:,0]
## Fetch the data from the provided CSV (!) files and save as a compressed data blob
    data, labels = load_mnist_from_csv( "./data/mnist/mnist_train.csv" )
    np.savez_compressed( './data/mnist/mnist_train.npz', labels = labels, data = data )
    data, labels = load_mnist_from_csv( "./data/mnist/mnist_test.csv" )
    np.savez_compressed( './data/mnist/mnist_test.npz', labels = labels, data = data )

Study

First of all load and binarize the training data using the value 127 as the threshold.


In [ ]:
assert( os.path.exists( './data/mnist/mnist_train.npz' ) )
with np.load( './data/mnist/mnist_train.npz', 'r' ) as npz :
    mnist_labels, mnist_data = npz[ 'labels' ], np.array( npz[ 'data' ] > 127, np.int )
    
assert( os.path.exists( './data/mnist/mnist_test.npz' ) )
with np.load( './data/mnist/mnist_test.npz', 'r' ) as npz :
    test_labels, test_data = npz[ 'labels' ], np.array( npz[ 'data' ] > 127, np.int )

Case : $K=2$

Let's have a look at some 6s and 9s.


In [ ]:
## Mask
inx_sixes, inx_nines = np.where( mnist_labels == 6 )[ 0 ], np.where( mnist_labels == 9 )[ 0 ]
## Extract
sixes = mnist_data[ rand.choice( inx_sixes, 90, replace = False ) ]
nines = mnist_data[ rand.choice( inx_nines, 90, replace = False ) ]
## Show
show_data( sixes, n = 45, n_col = 15, cmap = plt.cm.gray, interpolation = 'nearest' )
show_data( nines, n = 45, n_col = 15, cmap = plt.cm.gray, interpolation = 'nearest' )

They do indeed look quite distinct. Now collect them into a single dataset and estimate the model.


In [ ]:
data = mnist_data[ np.append( inx_sixes, inx_nines ) ]
clusters, ll = experiment( data, 2, 30 )

The estimate deltas show that the EM algorithm's E-step actually transfers the unlikely observations between classes, as is expected by construction of the algorithm.

Judging by the plot below, 30 iterations are more than enough for the EM to get meaningful estimates of the class "ideals", represented by the probability product-measure on $\Omega = \{0,1\}^{28\times 28}$.


In [ ]:
visualize( data, clusters, ll )

Now let's see how well the EM algorithm performs on a model with more classes. But before that let's have a look at a random sample of the handwritten digits.


In [ ]:
indices = np.arange( mnist_data.shape[ 0 ] )
rand.shuffle( indices )
show_data( mnist_data[ indices[:100] ] , n = 100, n_col = 10, cmap = plt.cm.gray, interpolation = 'nearest' )

Case : $K=10, 15, 20, 30, 60$ and $90$

The original size of the training sample is too large to fit in these RAM banks :) That is why I had to limit the sample to a random subset of 2000 observations.


In [ ]:
sub_sample = np.concatenate( tuple( [ rand.choice( np.where( mnist_labels == i )[ 0 ], size = 200 ) for i in range( 10 ) ] ) )

In [ ]:
train_data, train_labels = mnist_data[ sub_sample ], mnist_labels[ sub_sample ]
# train_data, train_labels = mnist_data, mnist_labels

Run the procedure that performs the EM algorithm and returns the history of the parameter estimates as well as the dynamics of the log-likelihood lower bound.


In [ ]:
clusters_10, ll_10 = experiment( train_data, 10, 50 )

One can clearly see that $50$ iterations were not enough for the algorithm to converge: though the changes are tiny, even on the log scale, they are still unstable.


In [ ]:
visualize( train_data, clusters_10, ll_10, n_col = 10, plot_ll = True )

Let's see if changing $K$ does the trick.


In [ ]:
clusters_15, ll_15 = experiment( train_data, 15, 50, verbose = False, until_convergence = False )
clusters_20, ll_20 = experiment( train_data, 20, 50, verbose = False, until_convergence = False )
clusters_30, ll_30 = experiment( train_data, 30, 50, verbose = False, until_convergence = False )

For what values of $K$ was it possible to infer the templates of all digits?


In [ ]:
visualize( train_data, clusters_15, ll_15, n_col = 10, plot_ll = False )
visualize( train_data, clusters_20, ll_20, n_col = 10, plot_ll = False )
visualize( train_data, clusters_30, ll_30, n_col = 10, plot_ll = False )

Obviously, the model with more mixture components is more likely to produce "templates" for all digits. For larger $K$ this is indeed the case.

Having run this algorithm many times, we can say that the digits $3$ and $8$, $4$ and $9$, and sometimes $5$ tend to be poorly separated. Furthermore, since there are many different handwritten variations of the same digit, one should estimate a model with more classes.

The returned templates of the mixture components are clearly suboptimal: the procedure seems to get stuck on individual examples. This may happen for any $K$, and allowing more iterations does not remedy it. Some remedies do exist: for instance, adding regularizers to the E-step that tie the distributions of neighbouring pixels together.


In [ ]:
clusters_60, ll_60 = experiment( train_data, 60, 500, verbose = False, until_convergence = True )

As one can see, increasing the number of iterations does not necessarily improve the results.


In [ ]:
visualize( train_data, clusters_60, ll_60, n_col = 15 )

Judging by the plot of the log-likelihood, and given that EM is guaranteed to converge to a local maximum and does so extremely fast, there was no need for more than 120-130 iterations. The changes in the log-likelihood around that number of iterations are of the order $10^{-4}$. Since we are working in finite-precision (double) arithmetic, the attainable precision here is of the order $10^{-14}$.

Let's see the dynamics of the estimates over the EM iterations. You will have to ensure that ffmpeg is installed (on Windows, also that it is on the PATH environment variable).


In [ ]:
## If you want to see this animation, ensure that ffmpeg is installed.
anim_60 = animate( clusters_60, ll_60, n_col = 15, n_row = 4,
    interval = 1, cmap = plt.cm.hot, interpolation = 'nearest' )
embed_video( anim_60 )

The parameter estimates of the EM stabilize pretty quickly. In fact most templates stabilize by iterations 100-120.

Choosing $K$

Among the many methods of model selection, let's use a simple training-sample fitness score given by the value of the log-likelihood. Because the models are nested with respect to the number of mixture components, one should expect the likelihood to be a non-decreasing function of $K$ (on average, due to the randomization of the initial parameter values). For large enough $K$ this method may lead to overfitting.


In [ ]:
## Test models with K from 12 up to 42 with a step of 3
classes = 12 + np.arange( 11, dtype = np.int ) * 3
ll_hist = np.full( len( classes ), -np.inf, dtype = np.float )
## Store parameters
parameter_hist = list( )
for i, K in enumerate( classes ) :
## Run the experiment
    c, l = experiment( train_data, K, 50, verbose = False, until_convergence = False )
    ll_hist[ i ] = l[ -1 ]
    parameter_hist.append( c[ -1 ] )
## Visualize the final parameters
    show_data( c[-1], n = K, n_col = 13, cmap = plt.cm.hot, interpolation = 'nearest' )

Indeed the log-likelihood does not decrease with $K$ on average. Nevertheless, the model with the highest likelihood turns out to have this many mixture components:


In [ ]:
print classes[ np.argmax( ll_hist ) ]

A nice, yet expected coincidence :)

Classification: label assignment

Select a model ...


In [ ]:
# clusters = parameter_hist[ np.argmax( ll_hist ) ] * 0.999 + 0.0005
clusters = clusters_60[-1]

... and get the posterior mixture component probabilities.


In [ ]:
## Compute the posterior component probabilities, and use max-aposteriori
##  for the best class selection. 
c_s, q_sk, ll_s = classify( train_data, clusters )

Use a simple majority rule to automatically assign labels to templates.


In [ ]:
template_x_label_maj_60 = np.full( clusters.shape[ 0 ], -1, np.int )
for t in range( clusters.shape[ 0 ] ) :
    l, f = np.unique( train_labels[ c_s == t ], return_counts = True)
    if len( l ) > 0 :
## This is too blunt an approach: it does not guarantee surjectivity of the mapping.
        template_x_label_maj_60[ t ] = l[ np.argmax( f ) ]

Assign a label $l$ to each template $t$ according to its score, based on the average of the top-$5$ log-likelihoods of observations with label $l$ that were classified with template $t$.


In [ ]:
## there are 10 labels and K templates
label_cluster_score = np.full( ( clusters.shape[ 0 ], 10 ), -np.inf, np.float )
## Loop over each template
for t in range( clusters.shape[ 0 ] ) :
## The selected templates are chosen according to max-aposteriori rule.
    inx = np.where( c_s == t )[ 0 ]
## Get the assigned labels and their frequencies
    actual_labels = train_labels[ inx ]
    l, f = np.unique( actual_labels, return_counts = True )
## For each template and each associated label in the training set the
##  score is the average of the top-5 highest log-likelihoods.
    label_cluster_score[ t, l ] = [ np.average( sorted(
        ll_s[ inx[ actual_labels == a ] ].flatten( ), reverse = True )[ : 5 ] ) for a in l ]

## For each template choose the label with the highest likelihood.
template_x_label_lik_60 = np.argmax( label_cluster_score, axis = 1 )

Compare the label assignments. Here are the templates.


In [ ]:
show_data( clusters, clusters.shape[ 0 ], 10,
          cmap = plt.cm.spectral, interpolation = 'nearest' )

These are the templates which were assigned different labels by the majority and likelihood ("trust") methods.


In [ ]:
mask = np.asarray( template_x_label_maj_60 != template_x_label_lik_60, dtype = np.float ).reshape( (-1,1) )
show_data( clusters * mask, clusters.shape[ 0 ], 10,
          cmap = plt.cm.spectral, interpolation = 'nearest' )

print "\nLikelihood based: ", template_x_label_lik_60[ mask[:,0] > 0 ]
print "Majority bassed: ", template_x_label_maj_60[ mask[:,0] > 0 ]

Here are the pictures of templates ordered according to their label.


In [ ]:
show_data( clusters[ np.argsort( template_x_label_lik_60 ) ], clusters.shape[ 0 ],
          10, cmap = plt.cm.spectral, interpolation = 'nearest' ) 
show_data( clusters[ np.argsort( template_x_label_maj_60 ) ], clusters.shape[ 0 ],
          10, cmap = plt.cm.spectral, interpolation = 'nearest' )

Classification: test sample

Shall we try running the classifier on the test data?


In [ ]:
## Run the classifier on the test data
c_s_60, q_sk, ll_s = classify( test_data, clusters )

Let's see the best template for each test observation in some sub-sample.


In [ ]:
## Show a sample of images and their templates
sample = np.random.permutation( test_data.shape[ 0 ] )[:64]
## Stack each image and its best template atop one another
display_stack = np.empty( ( 2 * len( sample ), test_data.shape[ 1 ] ), dtype = np.float )
display_stack[0::2] = test_data[ sample ] * q_sk[ sample, c_s_60[ sample ], np.newaxis ]
display_stack[1::2] = clusters[ c_s_60[ sample ] ]
## Display
show_data( display_stack, n = display_stack.shape[ 0 ], n_col = 16,
    transpose = False, cmap = plt.cm.spectral, interpolation = 'nearest' )

The digits are shown in pairs: each first digit is a test observation (its colour is determined by the confidence of the classifier -- the whiter the higher), and each second one is the best matching template.

Let's see how accurate the classification was. Recall that the component assignment was done using $$ \hat{t}_s = \mathop{\text{argmax}}_k p\bigl( C_s = k \, \big\vert\, X = x_s\bigr) \,, $$ i.e. maximum aposteriori rule.

Then, with the classes assigned, labels were deduced based on either :

  • simple majority;
  • observations with top largest likelihoods in each class.

By accuracy I understand the following score: $$ \alpha = 1 - \frac{1}{|\text{TEST}|} \sum_{s\,\in\,\text{TEST}} \mathbf{1}\{ l_s \neq L(\hat{t}_s)\}\,, $$ where $l_s$ is the actual label of an observation $s$, $\hat{t}_s$ is the inferred mixture component (class) of that observation, and $k\mapsto L(k)$ is the component-to-label mapping.


In [ ]:
print "Accuracy of likelihood based labelling: %.2f" % (
    100 * np.average( template_x_label_lik_60[ c_s_60 ] == test_labels ), )
print "Accuracy of simple majority labelling: %.2f" % (
    100 * np.average( template_x_label_maj_60[ c_s_60 ] == test_labels ), )

Not surprisingly, majority- and likelihood-based classification accuracies are close.

Let's see which test observations the model considers artefacts and for which it cannot reliably assign a template: i.e. the posterior class probability for these cases coincides with the prior. This happens when the likelihood of an observation is identical across all classes.


In [ ]:
## Now display the test observations which the model could not classify at all.
bad_tests = np.where( np.isinf( ll_s ) )[ 0 ]
show_data( test_data[ bad_tests ], n = max( len( bad_tests ), 10 ), n_col = 15, cmap = plt.cm.gray, interpolation = 'nearest' )
# print q_sk[ bad_tests ]


Let's see how more parsimonious models fare with respect to accuracy on the test sample.

Accuracy of $K=30$


In [ ]:
clusters = clusters_30[-1]
c_s, q_sk, ll_s = classify( train_data, clusters )

template_x_label_maj_30 = np.full( clusters.shape[ 0 ], -1, np.int )
for t in range( clusters.shape[ 0 ] ) :
    l, f = np.unique( train_labels[ c_s == t ], return_counts = True)
    if len( l ) > 0 :
        template_x_label_maj_30[ t ] = l[ np.argmax( f ) ]

label_cluster_score_30 = np.full( ( clusters.shape[ 0 ], 10 ), -np.inf, np.float )
for t in range( clusters.shape[ 0 ] ) :
    inx = np.where( c_s == t )[ 0 ]
    actual_labels = train_labels[ inx ]
    l, f = np.unique( actual_labels, return_counts = True )
    label_cluster_score_30[ t, l ] = [ np.average( sorted(
        ll_s[ inx[ actual_labels == a ] ].flatten( ), reverse = True )[ : 5 ] ) for a in l ]

template_x_label_lik_30 = np.argmax( label_cluster_score_30, axis = 1 )
show_data( clusters[ np.argsort( template_x_label_lik_30 ) ], clusters.shape[ 0 ],
          10, cmap = plt.cm.spectral, interpolation = 'nearest' )
print template_x_label_lik_30[ np.argsort( template_x_label_lik_30 ) ].reshape((3,-1))

c_s_30, q_sk, ll_s = classify( test_data, clusters )
print "Accuracy of likelihood based labelling: %.2f" % (
    100 * np.average( template_x_label_lik_30[ c_s_30 ] == test_labels ), )
print "Accuracy of simple majority labelling: %.2f" % (
    100 * np.average( template_x_label_maj_30[ c_s_30 ] == test_labels ), )

Accuracy of $K=20$


In [ ]:
clusters = clusters_20[-1]
c_s, q_sk, ll_s = classify( train_data, clusters )

template_x_label_maj_20 = np.full( clusters.shape[ 0 ], -1, np.int )
for t in range( clusters.shape[ 0 ] ) :
    l, f = np.unique( train_labels[ c_s == t ], return_counts = True)
    if len( l ) > 0 :
        template_x_label_maj_20[ t ] = l[ np.argmax( f ) ]

label_cluster_score_20 = np.full( ( clusters.shape[ 0 ], 10 ), -np.inf, np.float )
for t in range( clusters.shape[ 0 ] ) :
    inx = np.where( c_s == t )[ 0 ]
    actual_labels = train_labels[ inx ]
    l, f = np.unique( actual_labels, return_counts = True )
    label_cluster_score_20[ t, l ] = [ np.average( sorted(
        ll_s[ inx[ actual_labels == a ] ].flatten( ), reverse = True )[ : 5 ] ) for a in l ]

template_x_label_lik_20 = np.argmax( label_cluster_score_20, axis = 1 )
show_data( clusters[ np.argsort( template_x_label_lik_20 ) ], clusters.shape[ 0 ],
          10, cmap = plt.cm.spectral, interpolation = 'nearest' )
print template_x_label_lik_20[ np.argsort( template_x_label_lik_20 ) ].reshape((2,-1))

c_s_20, q_sk, ll_s = classify( test_data, clusters )
print "Accuracy of likelihood based labelling: %.2f" % (
    100 * np.average( template_x_label_lik_20[ c_s_20 ] == test_labels ), )
print "Accuracy of simple majority labelling: %.2f" % (
    100 * np.average( template_x_label_maj_20[ c_s_20 ] == test_labels ), )

Accuracy of $K=15$


In [ ]:
clusters = clusters_15[-1]
c_s, q_sk, ll_s = classify( train_data, clusters )

template_x_label_maj_15 = np.full( clusters.shape[ 0 ], -1, np.int )
for t in range( clusters.shape[ 0 ] ) :
    l, f = np.unique( train_labels[ c_s == t ], return_counts = True)
    if len( l ) > 0 :
        template_x_label_maj_15[ t ] = l[ np.argmax( f ) ]

label_cluster_score_15 = np.full( ( clusters.shape[ 0 ], 10 ), -np.inf, np.float )
for t in range( clusters.shape[ 0 ] ) :
    inx = np.where( c_s == t )[ 0 ]
    actual_labels = train_labels[ inx ]
    l, f = np.unique( actual_labels, return_counts = True )
    label_cluster_score_15[ t, l ] = [ np.average( sorted(
        ll_s[ inx[ actual_labels == a ] ].flatten( ), reverse = True )[ : 5 ] ) for a in l ]

template_x_label_lik_15 = np.argmax( label_cluster_score_15, axis = 1 )
show_data( clusters[ np.argsort( template_x_label_lik_15 ) ], clusters.shape[ 0 ],
          15, cmap = plt.cm.spectral, interpolation = 'nearest' )
print template_x_label_lik_15[ np.argsort( template_x_label_lik_15 ) ].reshape((1,-1))

c_s_15, q_sk, ll_s = classify( test_data, clusters )
print "Accuracy of likelihood based labelling: %.2f" % (
    100 * np.average( template_x_label_lik_15[ c_s_15 ] == test_labels ), )
print "Accuracy of simple majority labelling: %.2f" % (
    100 * np.average( template_x_label_maj_15[ c_s_15 ] == test_labels ), )

Accuracy of $K=10$


In [ ]:
clusters = clusters_10[-1]
c_s, q_sk, ll_s = classify( train_data, clusters )

template_x_label_maj_10 = np.full( clusters.shape[ 0 ], -1, np.int )
for t in range( clusters.shape[ 0 ] ) :
    l, f = np.unique( train_labels[ c_s == t ], return_counts = True)
    if len( l ) > 0 :
        template_x_label_maj_10[ t ] = l[ np.argmax( f ) ]

label_cluster_score_10 = np.full( ( clusters.shape[ 0 ], 10 ), -np.inf, np.float )
for t in range( clusters.shape[ 0 ] ) :
    inx = np.where( c_s == t )[ 0 ]
    actual_labels = train_labels[ inx ]
    l, f = np.unique( actual_labels, return_counts = True )
    label_cluster_score_10[ t, l ] = [ np.average( sorted(
        ll_s[ inx[ actual_labels == a ] ].flatten( ), reverse = True )[ : 5 ] ) for a in l ]

template_x_label_lik_10 = np.argmax( label_cluster_score_10, axis = 1 )
show_data( clusters[ np.argsort( template_x_label_lik_10 ) ], clusters.shape[ 0 ],
          10, cmap = plt.cm.spectral, interpolation = 'nearest' )
print template_x_label_lik_10[ np.argsort( template_x_label_lik_10 ) ].reshape((1,-1))

c_s_10, q_sk, ll_s = classify( test_data, clusters )
print "Accuracy of likelihood based labelling: %.2f" % (
    100 * np.average( template_x_label_lik_10[ c_s_10 ] == test_labels ), )
print "Accuracy of simple majority labelling: %.2f" % (
    100 * np.average( template_x_label_maj_10[ c_s_10 ] == test_labels ), )

As one can see, the test-sample accuracy of the model falls dramatically for a smaller number of mixture components. This was to be expected: for various reasons, one being that the data is handwritten, it is highly unlikely that a single digit would have only one template.


In [ ]:
print "Model with K = 10: %.2f" % ( 100 * np.average( template_x_label_lik_10[ c_s_10 ] == test_labels ), )
print "Model with K = 15: %.2f" % ( 100 * np.average( template_x_label_lik_15[ c_s_15 ] == test_labels ), )
print "Model with K = 20: %.2f" % ( 100 * np.average( template_x_label_lik_20[ c_s_20 ] == test_labels ), )
print "Model with K = 30: %.2f" % ( 100 * np.average( template_x_label_lik_30[ c_s_30 ] == test_labels ), )
print "Model with K = 60: %.2f" % ( 100 * np.average( template_x_label_lik_60[ c_s_60 ] == test_labels ), )


Ignore everything below



Let's digress for a moment, consider the full model, and create a video to see how the estimates are refined.


In [ ]:
( clusters_full, pi_full ), ll_full = experiment( data, 30, 1000, False, True, True )

In [ ]:
anim_full = animate( clusters_full, ll_full, pi = pi_full, n_col = 15, n_row = 2, interval = 1, cmap = plt.cm.hot, interpolation = 'nearest' )
embed_video( anim_full )

A random variable $X\sim \text{Beta}(\alpha,\beta)$ if the law of $X$ has density $$p(u) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} u^{\alpha-1}(1-u)^{\beta-1} $$

$$ \log p(X,Z|\Theta) = \sum_{s=1}^n \log \prod_{k=1}^K \Bigl[ \pi_k \prod_{i=1}^N \prod_{j=1}^M \frac{\Gamma(\alpha_{kij}+\beta_{kij})}{\Gamma(\alpha_{kij})\Gamma(\beta_{kij})} x_{sij}^{\alpha_{kij}-1}(1-x_{sij})^{\beta_{kij}-1} \Bigr]^{1_{z_s = k}}$$
\begin{align*} \mathbb{E}_q \log p(X,Z|\Theta) &= \sum_{k=1}^K \sum_{s=1}^n q_{sk} \log \pi_k \\ &+ \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M \sum_{s=1}^n q_{sk} \bigl( \log \Gamma(\alpha_{kij}+\beta_{kij}) - \log \Gamma(\alpha_{kij}) - \log \Gamma(\beta_{kij}) \bigr) \\ &+ \sum_{k=1}^K \sum_{i=1}^N \sum_{j=1}^M \sum_{s=1}^n q_{sk} \bigl( (\alpha_{kij}-1) \log x_{sij} + (\beta_{kij}-1) \log(1-x_{sij}) \bigr) \\ \end{align*}

The derivative of the Gamma function does not seem to yield analytically tractable solutions.


Non-parametric approach


In [ ]:
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.grid_search import GridSearchCV

In [ ]:
pca = PCA( n_components = 50 )
X_train_pca = pca.fit_transform( X_train )

In [ ]:
params = { 'bandwidth' : np.logspace( -1, 1, 20 ) }
grid = GridSearchCV( KernelDensity( ), params )
grid.fit( X_train_pca )
print("best bandwidth: {0}".format( grid.best_estimator_.bandwidth ) )

In [ ]:
params
kde = grid.best_estimator_

In [ ]:
new_data = kde.sample( 100 )
new_data = pca.inverse_transform( new_data )
print new_data.shape

plt.figure( figsize = ( 9, 9 ) )
plt.imshow( arrange_flex( new_data ), cmap = plt.cm.gray, interpolation = 'nearest' )
plt.show( )