In [1]:
from __future__ import division
# scientific
%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib as mpl
import numpy as np
import sklearn as skl
import sklearn.datasets
import sklearn.cluster
import sklearn.mixture
# ipython
import IPython
# python
import os
import random
#####################################################
# image processing
import PIL
import PIL.Image
import PIL.ImageChops
# trim and scale images
def trim(im, percent=100):
    print("trim:", percent)
    bg = PIL.Image.new(im.mode, im.size, im.getpixel((0,0)))
    diff = PIL.ImageChops.difference(im, bg)
    diff = PIL.ImageChops.add(diff, diff, 2.0, -100)
    bbox = diff.getbbox()
    if bbox:
        x = im.crop(bbox)
        return x.resize(((x.size[0]*percent)//100, (x.size[1]*percent)//100), PIL.Image.ANTIALIAS)
#####################################################
# daft (rendering PGMs)
import daft
# set to False to load PGMs from static images
RENDER_PGMS = True
# decorator for pgm rendering
def pgm_render(pgm_func):
    def render_func(path, percent=100, render=None, *args, **kwargs):
        print("render_func:", percent)
        # render only when requested (defaults to RENDER_PGMS)
        render = render if (render is not None) else RENDER_PGMS
        if render:
            print("rendering")
            # render the PGM and save it to disk
            pgm = pgm_func(*args, **kwargs)
            pgm.render()
            pgm.figure.savefig(path, dpi=300)
            # trim whitespace and rescale
            img = trim(PIL.Image.open(path), percent)
            img.save(path, 'PNG')
        else:
            print("not rendering")
            # error if the pre-rendered image is missing
            if not os.path.isfile(path):
                raise Exception("Error: Graphical model image %s not found. You may need to set RENDER_PGMS=True." % path)
        # display
        return IPython.display.Image(filename=path)
    return render_func
######################################################
They can be found here. Written by our very own Benjamin Bray!
In [2]:
@pgm_render
def pgm_latent():
    pgm = daft.PGM([4,4], origin=[-2,-1], node_unit=0.8, grid_unit=2.0)
    # nodes
    pgm.add_node(daft.Node("z", r"$Z_n$", 0.7, 1))
    pgm.add_node(daft.Node("x", r"$X_n$", 1.3, 1, observed=True))
    pgm.add_node(daft.Node("theta", r"$\theta$", 1.3, 0.3))
    # edges
    pgm.add_edge("z", "x", head_length=0.08)
    pgm.add_edge("theta", "x", head_length=0.08)
    pgm.add_edge("theta", "z", head_length=0.08)
    # plate over the N observations
    pgm.add_plate(daft.Plate([0.4,0.8,1.3,0.5], label=r"$\qquad\qquad\qquad\;\; N$",
                             shift=-0.1))
    return pgm
In [3]:
%%capture
pgm_latent("images/pgm/latent.png")
Out[3]:
Goal: Partition data $\X = \{ x_1, \dots, x_n \} \subset \R^d$ into $K$ disjoint clusters.
Usually, we fix $K$ beforehand! Use model selection to overcome this limitation.
First, pick random cluster centers $\mu_k$. Then, repeat until convergence: assign each point to its nearest center, and recompute each center as the mean of the points assigned to it.
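Below is a minimal numpy sketch of this procedure (Lloyd's algorithm); the function name `kmeans`, the initialization, and the convergence test are illustrative choices, not part of the original notes.

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.RandomState(seed)
    # pick K random data points as the initial cluster centers
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # assignment step: send each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_mu = np.array([X[z == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z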
Clustering problems motivate the definition of a mixture model.
In clustering, we represent mixtures via latent cluster indicators, $$ \begin{align} z_n &\sim \Cat[\pi] & \forall\, n = 1,\dots,N \\ x_n \mid z_n &\sim p_{z_n}(\cdot \mid \theta_{z_n}) & \forall\, n = 1,\dots,N \end{align} $$
In this form, mixture models are clearly applicable to clustering problems.
For a deeper understanding of mixture models, see De Finetti's Theorem.
In a Gaussian Mixture Model, the base distributions are Gaussian.
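To make the latent-indicator form concrete, here is a small numpy sketch that draws samples from a two-component Gaussian mixture exactly as written above; the particular values of $\pi$, $\mu_k$, and $\Sigma_k$ are made up for illustration.

def sample_gmm(N, pi, mus, Sigmas, seed=0):
    rng = np.random.RandomState(seed)
    # z_n ~ Cat[pi]  (latent cluster indicators)
    z = rng.choice(len(pi), size=N, p=pi)
    # x_n | z_n ~ N(mu_{z_n}, Sigma_{z_n})  (Gaussian base distributions)
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

pi = [0.3, 0.7]
mus = [np.zeros(2), np.array([5., 5.])]
Sigmas = [np.eye(2), np.array([[2., 0.5], [0.5, 1.]])]
X, z = sample_gmm(500, pi, mus, Sigmas)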
In [4]:
# example stolen from scikit-learn docs
def plot_gmm():
    n_samples = 300
    # generate random sample, two components
    np.random.seed(0)
    # generate spherical data centered on (20, 20)
    shifted_gaussian = np.random.randn(n_samples, 2) + np.array([20, 20])
    # generate zero centered stretched Gaussian data
    C = np.array([[0., -0.7], [3.5, .7]])
    stretched_gaussian = np.dot(np.random.randn(n_samples, 2), C)
    # concatenate the two datasets into the final training set
    X_train = np.vstack([shifted_gaussian, stretched_gaussian])
    # fit a Gaussian Mixture Model with two components
    clf = skl.mixture.GaussianMixture(n_components=2, covariance_type='full')
    clf.fit(X_train)
    # display predicted scores by the model as a contour plot
    x = np.linspace(-20.0, 30.0)
    y = np.linspace(-20.0, 40.0)
    X, Y = np.meshgrid(x, y)
    XX = np.array([X.ravel(), Y.ravel()]).T
    Z = -clf.score_samples(XX)
    Z = Z.reshape(X.shape)
    plt.figure(figsize=(10,8))
    CS = plt.contour(X, Y, Z, norm=mpl.colors.LogNorm(vmin=1.0, vmax=1000.0),
                     levels=np.logspace(0, 3, 10))
    CB = plt.colorbar(CS, shrink=0.8, extend='both')
    plt.scatter(X_train[:, 0], X_train[:, 1], .8)
    plt.title('Negative log-likelihood predicted by a GMM')
    plt.axis('tight')
In [5]:
plot_gmm()
Goal: Since only the data points $x_n$ are observed, we wish to estimate the model parameters $\theta$ by maximum likelihood, marginalizing over the unobserved indicators $z_n$.
We will use the Expectation Maximization algorithm.
The cross entropy measures the average number of bits needed to encode messages drawn from $p$ when we use a code optimal for $q$: $$ H(p,q) = -\sum_{x \in \X} p(x) \log q(x) = - E_p[\log q(x)] $$
Intuitively, $H(p,q) \geq H(p)$. The relative entropy is the difference $H(p,q) - H(p)$.
The relative entropy or Kullback-Leibler divergence of $q$ from $p$ is
$$ \begin{align} D_{KL}(p \,||\, q) &= \sum_{x \in \X} p(x) \log \frac{p(x)}{q(x)} \\ &= H(p,q) - H(p) \end{align} $$ It measures the number of extra bits needed to encode messages from $p$ if we use a code optimal for $q$.
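As a quick sanity check, the following numpy snippet computes $H(p)$, $H(p,q)$, and $D_{KL}(p\,||\,q)$ for two made-up distributions over three outcomes (using natural logs, so the units are nats rather than bits):

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.8, 0.1, 0.1])
H_p  = -np.sum(p * np.log(p))        # entropy H(p)
H_pq = -np.sum(p * np.log(q))        # cross entropy H(p,q)
D_kl =  np.sum(p * np.log(p / q))    # relative entropy D_KL(p||q)
print(H_pq - H_p, D_kl)              # identical, and nonnegative: H(p,q) >= H(p)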
Most textbooks (including [MLAPP] and [PRML]) introduce EM without much justification.
Instead, we will tackle the theory head-on, using ideas from [Neal & Hinton 1998].
Question: How can we perform maximum likelihood estimation in latent variable models?
In [6]:
%%capture
pgm_latent("images/pgm/latent.png")
Out[6]:
In reality, we only observe $\X=(x_1,\dots,x_N)$. The observed data log-likelihood is given by $$ \begin{align} \ell(\theta | \X) &= \log p(\X | \theta) = \sum_{n=1}^N \log p(x_n | \theta) \\ &= \sum_{n=1}^N \left[ \log \sum_{z_n} p(x_n, z_n | \theta) \right] \end{align} $$
[MLAPP]: This likelihood function is non-convex even assuming exponential family distributions. Multiple modes make inference (NP) hard!
Our general approach will be to reason about the hidden variables through a proxy distribution $q(Z)$.
The $q$ distribution will disappear later, but this perspective is useful for extending EM to new settings.
Through Jensen's inequality, we obtain the evidence lower bound (ELBO) on the log-likelihood: $$ \begin{align} \ell(\theta | \X) &= \log \sum_z p(\X, z | \theta) \\ &= \log \sum_z q(z) \frac{p(\X, z | \theta)}{q(z)} \\ &\geq \sum_z q(z) \log \frac{p(\X, z | \theta)}{q(z)} \equiv \L(q, \theta) \end{align} $$ Important: the choice of distribution $q(z)$ is arbitrary; the bound above holds for every $q$. More on this soon.
We can rewrite the lower bound $\L(q, \theta)$ as follows: $$ \begin{align} \ell(\theta | \X) \geq \L(q,\theta) &= \sum_z q(z) \log \frac{p(\X,z|\theta)}{q(z)} \\ &= \sum_z q(z) \log p(\X,z|\theta) - \sum_z q(z) \log q(z) \\ &= E_q[\log p(\X,Z|\theta)] - E_q[\log q(Z)] \\ &= E_q[\log p(\X,Z|\theta)] + H[q] \end{align} $$
We have just shown that $$ \ell(\theta|\X) \geq \L(q,\theta) = E_q[\log p(\X,Z|\theta)] - E_q[\log q(z)] $$
This very closely resembles the formula for relative entropy $$ \begin{align} D_{KL}(q || p) &= E_q[-\log p(Z)] + E_q[\log q(Z)] \\ &= H[q,p] - H[q] \end{align} $$
Except that the variables $X$ are clamped to our observations $X=\X$.
It is easy to show that $\L(q,\theta)$ differs from a relative entropy only by a term constant with respect to $q$: $$ \begin{align*} D_{KL}(q || p(Z|\X,\theta)) &= H(q, p(Z|\X,\theta)) - H(q) \\ &= E_q[ -\log p(Z|\X,\theta) ] - H(q) \\ &= E_q[ -\log p(Z,\X | \theta) ] - E_q[ -\log p(\X|\theta) ] - H(q) \\ &= E_q[ -\log p(Z,\X | \theta) ] + \log p(\X|\theta) - H(q) \\ &= -\mathcal{L}(q,\theta) + \log p(\X|\theta) \end{align*} $$
This yields a second proof of the ELBO, since $D_{KL} \geq 0$, $$ \boxed{ \log p(\X | \theta) = D_{KL}(q || p(Z | \X, \theta)) + \mathcal{L}(q,\theta) \geq \mathcal{L}(q,\theta) } $$
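This decomposition is easy to verify numerically. The sketch below builds an arbitrary toy joint $p(X = x_{\text{obs}}, Z)$ over three latent values and an arbitrary $q(Z)$ (both made up for illustration), then checks that the ELBO plus the KL term recovers the evidence:

# toy joint p(x_obs, z) for a single clamped observation and 3 latent values
p_xz = np.array([0.10, 0.05, 0.15])   # p(X = x_obs, Z = z); need not sum to 1
q    = np.array([0.2, 0.3, 0.5])      # arbitrary proxy distribution q(z)

log_evidence = np.log(p_xz.sum())           # log p(x_obs | theta)
post = p_xz / p_xz.sum()                    # hidden posterior p(z | x_obs, theta)
elbo = np.sum(q * np.log(p_xz / q))         # L(q, theta)
kl   = np.sum(q * np.log(q / post))         # D_KL(q || p(Z | x_obs, theta))
print(log_evidence, elbo + kl)              # equal, and elbo <= log_evidence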
The quality of our bound $\L(q,\theta)$ depends heavily on the choice of proxy distribution $q(Z)$.
Remark: Maximizing $\L(q, \theta)$ with respect to $q$ is equivalent to minimizing the relative entropy between $q$ and the hidden posterior $P(Z|\X,\theta)$. Hence, the optimal $q$ is exactly the hidden posterior, for which $\log p(\X|\theta) = \L(q,\theta)$.
The bound is tight--we can make it touch the log-likelihood.
Finding a global maximum of the likelihood is difficult in the presence of latent variables. $$ \hat\theta_{ML} = \arg\max_\theta \ell(\theta | \X) = \arg\max_\theta \log p(\X|\theta) $$
Instead, the Expectation Maximization algorithm gives an iterative procedure for finding a local maximum of the likelihood.
Consider only proxy distributions of the form $$ q_\vartheta(Z) = p(Z | \X, \vartheta) $$
Write $\L(\vartheta, \theta) \equiv \L(q_\vartheta, \theta)$. The value of $\vartheta$ that maximizes $\L(\vartheta, \theta)$ depends on $\theta$.
This suggests an iterative scheme.
Suppose we have an estimate $\theta_t$ of the parameters. To improve our estimate, we first compute a new lower bound on the observed log-likelihood, $$ \vartheta_{t+1} = \arg\max_\vartheta \L(\vartheta, \theta_t) = \theta_t $$
Since $q_\vartheta(Z) = p(Z|\X,\vartheta)$, we know from before that $\vartheta=\theta_t$ maximizes $\L(\vartheta, \theta)$.
Next, we estimate new parameters by optimizing over the lower bound. $$ \begin{align} \theta_{t+1} &= \arg\max_\theta \L(\vartheta_{t+1}, \theta) \\ &= \arg\max_\theta E_q[\log p(\X,Z|\theta)] - E_q[\log q_\vartheta(z)] \\ &= \arg\max_\theta E_q[ \log p(\X,Z | \theta) ] \end{align} $$
The last line maximizes the expected complete data log-likelihood.
Expectation maximization performs coordinate ascent on $\L(\vartheta,\theta)$.
Equivalently, expectation maximization performs updates on the auxiliary function $Q(\theta_t,\theta) = E_{Z \sim p(Z \mid \X, \theta_t)}\big[\log p(\X, Z \mid \theta)\big]$.
Theorem: After a single iteration of Expectation Maximization, the observed data likelihood of the parameters has not decreased, that is, $$ \ell(\theta_t | \X) \leq \ell(\theta_{t+1} | \X) $$
Proof: This result is a simple consequence of all our hard work so far! $$ \begin{align} \ell(\theta_t | \X) &= \mathcal{L}(q_{\vartheta_{t+1}}, \theta_t) & \text{(Tightness)} \\ &\leq \mathcal{L}(q_{\vartheta_{t+1}}, \theta_{t+1}) & \text{(M-Step)} \\ &\leq \ell(\theta_{t+1} | \X) & \text{(ELBO)} \end{align} $$
Example: we have two coins $A$ and $B$ with unknown biases. On each of $N$ trials, one of the coins is chosen uniformly at random (we do not observe which) and flipped $M$ times; we record the number of heads $x_n$. The corresponding probabilistic model is a mixture of binomials: $$ \begin{align} \theta &= (\theta_A, \theta_B) & &&\text{fixed coin biases} \\ Z_n &\sim \mathrm{Uniform}\{A, B \} & \forall\, n=1,\dots,N &&\text{coin indicators} \\ X_n | Z_n, \theta &\sim \Bin[\theta_{Z_n}, M] & \forall\, n=1,\dots,N &&\text{head count} \end{align} $$
In [7]:
@pgm_render
def pgm_coinflip():
    pgm = daft.PGM([4,4], origin=[-2,-1], node_unit=0.8, grid_unit=2.0)
    # nodes
    pgm.add_node(daft.Node("z", r"$Z_n$", 0.7, 1))
    pgm.add_node(daft.Node("x", r"$X_n$", 1.3, 1, observed=True))
    pgm.add_node(daft.Node("theta", r"$\theta$", 1.3, 0.3))
    # edges
    pgm.add_edge("z", "x", head_length=0.08)
    pgm.add_edge("theta", "x", head_length=0.08)
    # plate over the N trials
    pgm.add_plate(daft.Plate([0.4,0.8,1.3,0.5],
                             label=r"$\qquad\qquad\qquad\;\; N$",
                             shift=-0.1))
    return pgm
In [8]:
%%capture
pgm_coinflip("images/pgm/coinflip.png")
Out[8]:
The complete data log-likelihood for a single trial $(x_n,z_n)$ is $$ \log p(x_n, z_n | \theta) = \log p(z_n) + \log p(x_n | z_n, \theta) $$ Here, $P(z_n)=\frac{1}{2}$. The remaining term is $$ \begin{align} \log p(x_n | z_n, \theta) &= \log \binom{M}{x_n} \theta_{z_n}^{x_n} (1-\theta_{z_n})^{M-x_n} \\ &= \log \binom{M}{x_n} + x_n \log\theta_{z_n} + (M-x_n)\log(1-\theta_{z_n}) \end{align} $$
In [9]:
def coin_likelihood(roll, bias):
    # P(X | Z, theta): binomial likelihood of a single roll (up to the binomial coefficient)
    numHeads = roll.count("H")
    flips = len(roll)
    return pow(bias, numHeads) * pow(1-bias, flips-numHeads)

def coin_marginal_likelihood(rolls, biasA, biasB):
    # log P(X | theta): each trial is a 50/50 mixture of the two coins
    trials = []
    for roll in rolls:
        h = roll.count("H")
        t = roll.count("T")
        likelihoodA = coin_likelihood(roll, biasA)
        likelihoodB = coin_likelihood(roll, biasB)
        trials.append(np.log(0.5 * (likelihoodA + likelihoodB)))
    return sum(trials)

def plot_coin_likelihood(rolls, thetas=None):
    # grid
    xvals = np.linspace(0.01,0.99,100)
    yvals = np.linspace(0.01,0.99,100)
    X,Y = np.meshgrid(xvals, yvals)
    # compute likelihood
    Z = []
    for i,r in enumerate(X):
        z = []
        for j,c in enumerate(r):
            z.append(coin_marginal_likelihood(rolls,c,Y[i][j]))
        Z.append(z)
    # plot
    plt.figure(figsize=(10,8))
    C = plt.contour(X,Y,Z,150)
    cbar = plt.colorbar(C)
    plt.title(r"Likelihood $\log p(\mathcal{X}|\theta_A,\theta_B)$", fontsize=20)
    plt.xlabel(r"$\theta_A$", fontsize=20)
    plt.ylabel(r"$\theta_B$", fontsize=20)
    # plot thetas
    if thetas is not None:
        thetas = np.array(thetas)
        plt.plot(thetas[:,0], thetas[:,1], '-k', lw=2.0)
        plt.plot(thetas[:,0], thetas[:,1], 'ok', ms=5.0)
In [10]:
plot_coin_likelihood([ "HTTTHHTHTH", "HHHHTHHHHH",
"HTHHHHHTHH", "HTHTTTHHTT", "THHHTHHHTH"]);
The E-Step involves writing down an expression for $$ \begin{align} E_q[\log p(\X, Z | \theta )] &= E_q[\log p(\X | Z, \theta) p(Z)] \\ &= E_q[\log p(\X | Z, \theta)] + E_q[\log p(Z)] \end{align} $$
The $E_q[\log p(Z)]$ term is constant with respect to $\theta$, so we ignore it.
Recall $q \equiv q_{\theta_t} = p(Z | \X,\theta_t)$
Let $a_n = q(z_n = A)$ and $b_n = q(z_n = B)$. Taking derivatives with respect to $\theta_A$ and $\theta_B$, we obtain the following update rules: $$ \theta_A = \frac{\sum_{n=1}^N a_n x_n}{\sum_{n=1}^N a_n M} \qquad \theta_B = \frac{\sum_{n=1}^N b_n x_n}{\sum_{n=1}^N b_n M} $$
Interpretation: For each coin, examples are weighted according to the probability that they belong to that coin. Observing $M$ flips is equivalent to observing $a_n M$ effective flips.
In [11]:
def coin_em(rolls, theta_A=None, theta_B=None, maxiter=10):
    # initial guess
    theta_A = theta_A or random.random()
    theta_B = theta_B or random.random()
    thetas = [(theta_A, theta_B)]
    # iterate
    for c in range(maxiter):
        print("#%d:\t%0.2f %0.2f" % (c, theta_A, theta_B))
        # E-step: posterior probability that each trial came from coin A or B,
        # accumulated as expected (fractional) head/tail counts for each coin
        heads_A, tails_A = 0, 0
        heads_B, tails_B = 0, 0
        for trial in rolls:
            likelihood_A = coin_likelihood(trial, theta_A)
            likelihood_B = coin_likelihood(trial, theta_B)
            p_A = likelihood_A / (likelihood_A + likelihood_B)
            p_B = likelihood_B / (likelihood_A + likelihood_B)
            heads_A += p_A * trial.count("H")
            tails_A += p_A * trial.count("T")
            heads_B += p_B * trial.count("H")
            tails_B += p_B * trial.count("T")
        # M-step: recompute thetas from the expected counts
        theta_A = heads_A / (heads_A + tails_A)
        theta_B = heads_B / (heads_B + tails_B)
        thetas.append((theta_A, theta_B))
    return thetas, (theta_A, theta_B)
In [12]:
rolls = [ "HTTTHHTHTH", "HHHHTHHHHH", "HTHHHHHTHH",
"HTHTTTHHTT", "THHHTHHHTH" ];
thetas, _ = coin_em(rolls, 0.1, 0.3, maxiter=6);
In [13]:
plot_coin_likelihood(rolls, thetas)
Recall from earlier the Gaussian Mixture Model: $$ \begin{align} \theta &= (\vec\pi, \vec\mu, \vec\Sigma) && \text{model parameters} \\ z_n &\sim \Cat[\pi] && \text{cluster indicators} \\ x_n | z_n, \theta &\sim \mathcal{N}(\mu_{z_n}, \Sigma_{z_n}) && \text{base distribution} \end{align} $$
The complete data log-likelihood for a single datapoint $(x_n, z_n)$ is $$ \begin{align} \log p(x_n, z_n | \theta) &= \log \prod_{k=1}^K \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{\mathbb{I}(z_n = k)} \\ &= \sum_{k=1}^K \mathbb{I}(z_n = k) \log \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \end{align} $$
The hidden posterior for a single point $(x_n, z_n)$ can be found using Bayes' rule: $$ \begin{align} p(z_n = k | x_n, \theta) &= \frac{p(z_n=k | \theta)\, p(x_n | z_n=k, \theta)}{p(x_n | \theta)} \\ &= \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{k'=1}^K \pi_{k'} \mathcal{N}(x_n | \mu_{k'}, \Sigma_{k'})} \end{align} $$
The expected complete data log-likelihood is $$ \begin{align} E_q[ \log p(\X,Z|\theta)] &= \sum_{n=1}^N \sum_{k=1}^K E_q\big[ \mathbb{I}(z_n = k) \big] \log \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \\ &= \sum_{n=1}^N \sum_{k=1}^K r_{nk} \log \pi_k + \sum_{n=1}^N \sum_{k=1}^K r_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \end{align} $$ where $r_{nk} \equiv p(z_n = k \mid x_n,\theta_t)$ is the responsibility that cluster $k$ takes for datapoint $x_n$ after step $t$.
During the M-Step, we optimize the lower bound with respect to $\theta=(\pi,\mu,\Sigma)$. Verify that the correct updates are $$ \begin{gather} \pi_k = \frac{1}{N} \sum_{n=1}^N r_{nk} = \frac{r_k}{N} \quad \mu_k = \frac{\sum_{n=1}^N r_{nk} x_n}{r_k} \\ \Sigma_k = \frac{\sum_{n=1}^N r_{nk} x_n x_n^T}{r_k} - \mu_k \mu_k^T \end{gather} $$
where $r_k = \sum_{n=1}^N r_{nk}$ is the effective sample size for cluster $k$.
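Putting the E-step (responsibilities) and these M-step updates together, a bare-bones numpy sketch of EM for a Gaussian mixture might look like the following. It is a direct transcription of the formulas above (no covariance regularization, no log-space tricks), intended as illustration rather than a replacement for `sklearn.mixture`; the function names `gaussian_pdf` and `gmm_em` are our own.

def gaussian_pdf(X, mu, Sigma):
    # multivariate normal density N(x | mu, Sigma), evaluated row-wise
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2*np.pi)**d * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def gmm_em(X, K, iters=50, seed=0):
    N, d = X.shape
    rng = np.random.RandomState(seed)
    # initialize: uniform weights, random data points as means, identity covariances
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    for _ in range(iters):
        # E-step: responsibilities r_nk = p(z_n = k | x_n, theta)
        r = np.array([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)]).T
        r /= r.sum(axis=1, keepdims=True)
        # M-step: the updates derived above
        rk = r.sum(axis=0)                      # effective sample sizes r_k
        pi = rk / N
        mu = (r.T @ X) / rk[:, None]
        for k in range(K):
            Sigma[k] = (r[:, k, None] * X).T @ X / rk[k] - np.outer(mu[k], mu[k])
    return pi, mu, Sigma, r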