In [1]:
from __future__ import division
# scientific
%matplotlib inline
from matplotlib import pyplot as plt;
import matplotlib as mpl;
import numpy as np;
import sklearn as skl;
import sklearn.datasets;
import sklearn.cluster;
import sklearn.mixture;
# ipython
import IPython;
# python
import os;
import random;
#####################################################
# image processing
import PIL.Image;
import PIL.ImageChops;
# trim and scale images
def trim(im, percent=100):
    print("trim:", percent);
    bg = PIL.Image.new(im.mode, im.size, im.getpixel((0,0)))
    diff = PIL.ImageChops.difference(im, bg)
    diff = PIL.ImageChops.add(diff, diff, 2.0, -100)
    bbox = diff.getbbox()
    if bbox:
        x = im.crop(bbox)
        return x.resize(((x.size[0]*percent)//100, (x.size[1]*percent)//100), PIL.Image.ANTIALIAS);
#####################################################
# daft (rendering PGMs)
import daft;
# set to FALSE to load PGMs from static images
RENDER_PGMS = True;
# decorator for pgm rendering
def pgm_render(pgm_func):
    def render_func(path, percent=100, render=None, *args, **kwargs):
        print("render_func:", percent);
        # render
        render = render if (render is not None) else RENDER_PGMS;
        if render:
            print("rendering");
            # render
            pgm = pgm_func(*args, **kwargs);
            pgm.render();
            pgm.figure.savefig(path, dpi=300);
            # trim
            img = trim(PIL.Image.open(path), percent);
            img.save(path, 'PNG');
        else:
            print("not rendering");
            # error
            if not os.path.isfile(path):
                raise Exception("Error: Graphical model image %s not found. You may need to set RENDER_PGMS=True." % path);
        # display
        return IPython.display.Image(filename=path);
    return render_func;
######################################################
In [31]:
@pgm_render
def pgm_hmm():
    pgm = daft.PGM([7, 7], origin=[0, 0])
    # Nodes
    pgm.add_node(daft.Node("Z1", r"$Z_1$", 1, 3.5))
    pgm.add_node(daft.Node("Z2", r"$Z_2$", 2, 3.5))
    pgm.add_node(daft.Node("Z3", r"$\dots$", 3, 3.5, plot_params={'ec':'none'}))
    pgm.add_node(daft.Node("Z4", r"$Z_T$", 4, 3.5))
    pgm.add_node(daft.Node("x1", r"$X_1$", 1, 2.5, observed=True))
    pgm.add_node(daft.Node("x2", r"$X_2$", 2, 2.5, observed=True))
    pgm.add_node(daft.Node("x3", r"$\dots$", 3, 2.5, plot_params={'ec':'none'}))
    pgm.add_node(daft.Node("x4", r"$X_T$", 4, 2.5, observed=True))
    # Add in the edges.
    pgm.add_edge("Z1", "Z2", head_length=0.08)
    pgm.add_edge("Z2", "Z3", head_length=0.08)
    pgm.add_edge("Z3", "Z4", head_length=0.08)
    pgm.add_edge("Z1", "x1", head_length=0.08)
    pgm.add_edge("Z2", "x2", head_length=0.08)
    pgm.add_edge("Z4", "x4", head_length=0.08)
    return pgm;
In [32]:
%%capture
pgm_hmm("images/pgm/hmm.png")
Out[32]:
For a Hidden Markov Model with $N$ hidden states and $M$ possible observations, there are three row-stochastic parameters $\theta=(A,B,\pi)$: the $N \times N$ transition matrix $A$, the $N \times M$ emission matrix $B$, and the initial state distribution $\pi$.
Filtering means computing the current belief state $p(z_t | x_1, \dots, x_t,\theta)$: $$ p(z_t | x_1,\dots,x_t) = \frac{p(x_1,\dots,x_t,z_t)}{p(x_1,\dots,x_t)} $$
This is solved by the forward algorithm.
The forward recursion starts from the front of the chain. (Suggestion: work out this recursion on your own; it's a really important exercise.)
The backward recursion starts from the back of the chain. (The same suggestion applies.)
Compute $p(z_t | \X)$ offline, given all observations.
We can break the chain into two parts, the past and future: $$ \begin{align} p(z_t | \X) &= p(x_{1:t}, z_t, x_{t+1:T}) \frac{1}{p(\X)} \\ &= p(x_{1:t}, z_t) p(x_{t+1:T} | x_{1:t}, z_t)\frac{1}{p(\X)} \\ &= p(x_{1:t}, z_t) p(x_{t+1:T} | z_t)\frac{1}{p(\X)} \end{align} $$ Exercise: How did we get that last line?
Overall, the smoothing problem is to compute $$ \begin{align*} \gamma_t(j) \equiv p(z_t = j \mid \X) & = \frac{p(\X \mid z_t = j)p(z_t = j)}{p(\X)} \\ & = \frac{p(\X_{1:t} \mid z_t = j) p(\X_{t+1:T} | z_t=j) p(z_t = j)}{p(\X)} \\ & = \frac{\alpha_t(j) \beta_t(j)}{p(\X)} = \frac{\alpha_t(j) \beta_t(j)}{\sum_k \alpha_t(k)\beta_t(k)} \end{align*} $$ It is an easy exercise to check that, for any $t$, $p(\X) = \sum_k \alpha_t(k)\beta_t(k)$.
We solve this via the forward-backward algorithm, where the forward pass computes $\alpha_t(j) \equiv p(x_{1:t}, z_t = j)$ and the backward pass computes $\beta_t(j) \equiv p(x_{t+1:T} \mid z_t = j)$.
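A minimal numpy sketch of these recursions (not taken from the references; the function name is illustrative, the per-step rescaling of $\alpha$ is a standard trick to avoid numerical underflow, and `np` is numpy from the imports at the top of the notebook):

def forward_backward(x, A, B, pi):
    # x: observations (length T, integer symbols); A: (N, N); B: (N, M); pi: (N,)
    # returns gamma[t, j] = p(z_t = j | x_{1:T})
    T, N = len(x), len(pi)
    alpha = np.zeros((T, N))    # scaled forward messages
    beta  = np.zeros((T, N))    # scaled backward messages
    scale = np.zeros(T)
    # forward pass: alpha_t(j) is proportional to p(x_{1:t}, z_t = j)
    alpha[0] = pi * B[:, x[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = np.dot(alpha[t-1], A) * B[:, x[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    # backward pass: beta_t(i) is proportional to p(x_{t+1:T} | z_t = i)
    beta[T-1] = 1.0
    for t in range(T-2, -1, -1):
        beta[t] = np.dot(A, B[:, x[t+1]] * beta[t+1]) / scale[t+1]
    # smoothed marginals gamma_t(j), normalized over states at each t
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

The forward pass alone (the `alpha` loop) solves the filtering problem; the smoothed marginals `gamma` use both passes.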
Decoding computes the most probable state sequence, given observations. $$ \vec{z}^* = \arg\max_{z_1,\dots,z_T} p(z_1,\dots,z_T | x_1, \dots, x_T, \theta) $$
The decoding problem is solved by the Viterbi algorithm, which uses dynamic programming. See [PRML] or [MLAPP] for more details.
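For concreteness, here is a minimal log-space sketch of the Viterbi recursion, with the same (illustrative) parameter conventions as above:

def viterbi(x, A, B, pi):
    # most probable hidden state path, computed by dynamic programming in log-space
    T, N = len(x), len(pi)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))              # delta[t, j] = best log-prob of a path ending in state j at time t
    back  = np.zeros((T, N), dtype=int)   # back-pointers
    delta[0] = logpi + logB[:, x[0]]
    for t in range(1, T):
        scores  = delta[t-1][:, None] + logA     # scores[i, j]: come from state i, move to state j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + logB[:, x[t]]
    # trace back the best path from the best final state
    z = np.zeros(T, dtype=int)
    z[T-1] = np.argmax(delta[T-1])
    for t in range(T-2, -1, -1):
        z[t] = back[t+1, z[t+1]]
    return z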
In English, some words can take on multiple parts of speech depending on how they are used, which makes part-of-speech tagging a natural decoding problem.
(Example taken from [here](http://english.stackexchange.com/questions/46277/what-word-can-fulfill-the-most-parts-of-speech).)
It is usually necessary to learn the model parameters $\theta = (A,B,\pi)$ from data.
The learning problem is solved by the Baum-Welch algorithm, a special case of expectation maximization.
Recall $q_t(Z) = p(Z|\X,\theta_t)$
The joint likelihood of the hidden and observed states is
$$ \begin{align} \log p(x_{1:T}, z_{1:T} | \theta) &= \log \left[ p(z_1|\pi) \prod_{t=2}^T p(z_t | z_{t-1}, A) \prod_{t=1}^T p(x_t | z_t, B) \right] \\ &= \log p(z_1|\pi) + \sum_{t=2}^T \log p(z_t | z_{t-1}, A) \\ & \qquad \qquad + \sum_{t=1}^T \log p(x_t | z_t, B) \end{align} $$
Each term of the complete-data log-likelihood is:
$$ \begin{align} \log p(z_1 | \pi) &= \sum_{j=1}^N \mathbb{I}(z_1=j) \log \pi_j \\ \log p(z_t | z_{t-1}, A) &= \sum_{i=1}^N \sum_{j=1}^N \mathbb{I}(z_{t-1}=i)\mathbb{I}(z_t=j) \log A_{ij} \\ \log p(x_t | z_t, B) &= \sum_{j=1}^N \mathbb{I}(z_t=j) \log B_{j,x_t} \end{align} $$
The expected complete-data log-likelihood $Q(\theta_t, \theta)$ is
$$ \begin{align} Q(\theta_t,\theta) &= E_q[ \log p(\X,Z|\theta) ] \\ &= E_q[ \log p(z_1|\pi) ] + E_q\left[ \sum_{t=2}^T \log p(z_t | z_{t-1}, A) \right]\\ &\qquad + E_q\left[ \sum_{t=1}^T \log p(x_t | z_t, B) \right] \end{align} $$
Fixing $t > 1$ and taking expectations with respect to $q(Z) = p(Z|\X,\theta_t)$,
$$ \begin{align*} E_q[\log p(z_1 | \pi)] &= \sum_{j=1}^N q(z_1=j) \log \pi_j \\ E_q[\log p(z_t | z_{t-1}, A)] &= \sum_{i=1}^N \sum_{j=1}^N q(z_{t-1}=i, z_t=j) \log A_{ij} \\ E_q[\log p(x_t | z_t, B)] &= \sum_{j=1}^N q(z_t=j) \log B_{j, x_t} \end{align*} $$
The E-Step consists of computing these $q$ terms, all of which are obtained from the forward-backward algorithm!
The M-Step consists of normalizing the expected transition and emission counts.
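Concretely, writing $\gamma_t(j) = q(z_t = j)$ and $\xi_t(i,j) = q(z_{t-1} = i, z_t = j)$ for the expected counts from the E-step, the standard Baum-Welch updates are
$$ \hat{\pi}_j = \gamma_1(j), \qquad \hat{A}_{ij} = \frac{\sum_{t=2}^T \xi_t(i,j)}{\sum_{t=2}^T \sum_{k=1}^N \xi_t(i,k)}, \qquad \hat{B}_{jk} = \frac{\sum_{t=1}^T \mathbb{I}(x_t = k)\,\gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)}, $$
i.e. each expected count is normalized so that the corresponding row sums to one.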
The key quantities for $q(\cdot)$ are computed using the forward-backward algorithm. In the E-step the parameters $\theta = (A, B, \pi)$ are held fixed at their current values, so we drop the explicit dependence on $\theta$.
We've just barely scratched the surface of what probabilistic graphical models have to offer.
This lecture will be a grab-bag of some of the things we missed.
Accordingly, we won't test you on the details.
These courses are being offered next semester:
The following courses have been offered in the past:
So far, we have talked only about directed models.
In some cases, undirected models are more natural.
Consider the problem of modeling pixels in an image.
How can we represent this situation as a Bayesian network?
Directed models come with a natural topological ordering.
Undirected models have no natural ordering!
Let $\mathbf{D}$ be a set of random variables. A factor is a function $\phi : \mathrm{Val}(\mathbf{D}) \to \R$.
A factor assigns a real number to every possible configuration of the variables $\mathbf{D}$
Probability distributions are normalized factors.
Other factors need not be normalized. For example, $\phi(A,B)$ assigns high affinity to configurations where $A$ and $B$ agree:
| $A$ | $B$ | $\phi(A,B)$ |
|---|---|---|
| 0 | 0 | 100 |
| 0 | 1 | -50 |
| 1 | 0 | -50 |
| 1 | 1 | 100 |
Given a set of factors $\Phi=(\phi_1(\mathbf{D}_1), \phi_2(\mathbf{D}_2), \dots, \phi_K(\mathbf{D}_K))$, we can define a Gibbs Distribution, $$ \begin{align} P_\Phi(X_1,\dots,X_n) &= \frac{1}{Z} \phi_1(\mathbf{D}_1) \phi_2(\mathbf{D}_2) \cdots \phi_K(\mathbf{D}_K) \\ &= \frac{1}{Z} \exp\left[ \sum_{k=1}^K \log \phi_k(\mathbf{D}_k) \right] \end{align} $$ where $Z$ is a normalization constant. Sometimes log-potentials are referred to as energies associated with each configuration of variables.
This model was stolen from physicists :)
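As a tiny illustration of the definition, here is a brute-force computation of a Gibbs distribution over three binary variables from two hypothetical, non-negative factors (the variable names and factor values are made up):

import itertools

# hypothetical non-negative factors: phi1 favors A == B, phi2 mildly favors B == C
phi1 = {(a, b): 10.0 if a == b else 1.0 for a in (0, 1) for b in (0, 1)}
phi2 = {(b, c): 5.0 if b == c else 2.0 for b in (0, 1) for c in (0, 1)}

# unnormalized measure: product of the factors over each full configuration
unnorm = {(a, b, c): phi1[(a, b)] * phi2[(b, c)]
          for a, b, c in itertools.product((0, 1), repeat=3)}
Z = sum(unnorm.values())                             # normalization constant
P = {cfg: val / Z for cfg, val in unnorm.items()}    # the Gibbs distribution P_Phi(A, B, C)
print("Z =", Z, " P(A=1,B=1,C=1) =", P[(1, 1, 1)])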
Suppose we have a pairwise Markov Random Field over binary pixels.
Define the energy function, summing over all edges $(i,j)$: $$ U(\X) = -\beta \sum_{(i,j)} x_i x_j \qquad (\beta > 0) $$
Then, the probability of an image is $$ P(\X) = \frac{1}{Z} \exp[-U(\X)] = \frac{1}{Z} \exp\left[\beta \sum_{(i,j)} x_i x_j \right] $$
The affinity parameter $\beta > 0$ determines how much we want neighboring pixels to agree with each other.
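A minimal sketch of this energy for pixels taking values in $\{-1,+1\}$ on a 4-neighbor grid (the function name and the $\pm 1$ encoding are illustrative assumptions):

def ising_energy(X, beta=1.0):
    # U(X) = -beta * sum over neighboring pairs of x_i * x_j, for X with entries in {-1, +1}
    horizontal = np.sum(X[:, :-1] * X[:, 1:])   # right-neighbor pairs
    vertical   = np.sum(X[:-1, :] * X[1:, :])   # down-neighbor pairs
    return -beta * (horizontal + vertical)

# a perfectly smooth image has lower energy (hence higher probability) than a noisy one
smooth = np.ones((8, 8))
noisy  = np.random.choice([-1, 1], size=(8, 8))
print(ising_energy(smooth), ising_energy(noisy))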
We can also model more complex interactions between pixels.
Undirected models are incredibly useful for probabilistic modeling.
Inference: Estimate hidden variables $Z$ from observed variables $X$. $$ P(Z | X,\theta) = \frac{P(X,Z | \theta)}{P(X|\theta)} $$
We can recover $P(Z | X,\theta)$ exactly when the hidden variables $Z$ are discrete.
Many exact inference algorithms can be seen as generalizations of the forward-backward algorithm for HMMs.
Unfortunately, they have undesirable time and space complexity properties.
In [2]:
@pgm_render
def pgm_hmm():
    pgm = daft.PGM([7, 7], origin=[0, 0])
    # Nodes
    pgm.add_node(daft.Node("Z1", r"$Z_1$", 1, 3.5))
    pgm.add_node(daft.Node("Z2", r"$Z_2$", 2, 3.5))
    pgm.add_node(daft.Node("Z3", r"$\dots$", 3, 3.5, plot_params={'ec':'none'}))
    pgm.add_node(daft.Node("Z4", r"$Z_T$", 4, 3.5))
    pgm.add_node(daft.Node("x1", r"$X_1$", 1, 2.5, observed=True))
    pgm.add_node(daft.Node("x2", r"$X_2$", 2, 2.5, observed=True))
    pgm.add_node(daft.Node("x3", r"$\dots$", 3, 2.5, plot_params={'ec':'none'}))
    pgm.add_node(daft.Node("x4", r"$X_T$", 4, 2.5, observed=True))
    # Add in the edges.
    pgm.add_edge("Z1", "Z2", head_length=0.08)
    pgm.add_edge("Z2", "Z3", head_length=0.08)
    pgm.add_edge("Z3", "Z4", head_length=0.08)
    pgm.add_edge("Z1", "x1", head_length=0.08)
    pgm.add_edge("Z2", "x2", head_length=0.08)
    pgm.add_edge("Z4", "x4", head_length=0.08)
    return pgm;
In [3]:
%%capture
pgm_hmm("images/pgm/hmm.png")
Out[3]:
Omitting some slides from lecture -- but they are still available in the notes.
Exact inference is possible in discrete graphical models, but it quickly becomes intractable for large, densely connected graphs.
Fortunately, approximate inference works well in practice.
Uses material from [MLAPP] Chapter 21 and MLSS 2009
Exact inference is too difficult, so we turn to approximate inference.
The two most common approaches are Monte Carlo (sampling) methods and variational methods.
In general, we want to infer unobserved variables $y$ from observed variables $x$. Using Bayes' Rule, $$p(y|x)= \frac{p(x|y)p(y)}{p(x)}= \frac{p(x|y)p(y)}{\int_Y p(x|y)p(y)\, dy}$$ Usually, the hard part is evaluating the normalization integral, $$ \int_Y p(x|y)p(y)\, dy $$
Key Point: Many inference problems can be reduced to the problem of evaluating an integral.
We can express $\pi$ as an integral over the unit circle $C$, which has area $\pi r^2 = \pi$.
$$ \pi = \int_{\R^2} \mathbb{I}_C(x) \, dx = \int_{[-1,1]\times[-1,1]} \mathbb{I}_C(x) \,dx $$Let $p$ be the uniform distribution on the square of size $2$. $$ p(x)=\begin{cases} \frac{1}{4} & x \in [-1,1] \times [-1,1] \\ 0 & \text{ otherwise.} \end{cases} $$
Then, $$ \pi = 4\; \int_{[-1,1]\times[-1,1]} \mathbb{I}_C(x) p(x) \,dx = E_p[4\cdot \mathbb{I}_C(X) ], $$ which we can approximate by a sample average.
Therefore, we can estimate $\pi$ by drawing samples uniformly from the square $[-1,1] \times [-1,1]$ and computing the fraction of samples that fall within the unit circle $C$!
$$ \frac{\text{Area of Circle}}{\text{Area of Square}} = \frac{\pi r^2}{4r^2} = \frac{\pi}{4} $$
This method is not ideal: it requires a large number of samples to get an accurate estimate.
In [4]:
def estimate_pi(n):
    # draw samples
    samples = np.random.random((2,n)) * 2 - 1;
    # separate samples inside/outside circle
    r = np.sum(samples**2, axis=0);
    inside = samples[:,r < 1];
    outside = samples[:,r >= 1];
    # estimate pi
    return inside.shape[1] / n * 4;

def estimate_pi_plot(n):
    # draw samples
    samples = np.random.random((2,n)) * 2 - 1;
    # separate samples inside/outside circle
    r = np.sum(samples**2, axis=0);
    inside = samples[:,r < 1];
    outside = samples[:,r >= 1];
    # estimate pi
    pi_estimate = inside.shape[1] / n * 4;
    print("pi estimated to be %.5f" % pi_estimate)
    # plot
    plt.figure(figsize=(6,6))
    plt.plot(inside[0,:], inside[1,:], "rx");
    plt.plot(outside[0,:], outside[1,:], "bx");
    plt.axis("equal");
    plt.title(r"$\pi \approx %0.2f$ from %d samples" % (pi_estimate,n));
In [5]:
estimate_pi_plot(5000)
We can compute any integral $\int_X f(x)p(x) \, dx$ provided that we know how to sample from $p$. Several methods have been developed to sample from arbitrary distributions, e.g. rejection sampling, importance sampling, and Markov chain Monte Carlo.
The most popular is Markov Chain Monte Carlo, which beats out the other methods in high-dimensional spaces.
Goal: Draw samples from $P(X_1,\dots,X_N)$.
Idea: Construct a Markov Chain on the state space $\X$ whose stationary distribution is the target density.
We will perform a (biased) random walk on the state space, such that the fraction of time we spend at each state $(x_1,\dots,x_N)$ is proportional to $p(x_1,\dots,x_N)$.
Sample each variable in turn, conditioned on all the others.
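As a concrete sketch, here is a Gibbs sampler for the pairwise Ising prior from earlier: each pixel's full conditional depends only on its four grid neighbors, which is what makes the per-variable updates cheap. The function name, grid size, and default $\beta$ are illustrative choices.

def gibbs_ising(shape=(32, 32), beta=0.8, n_sweeps=100):
    # Gibbs sampling from the pairwise Ising prior P(X) proportional to exp(beta * sum_{(i,j)} x_i x_j)
    X = np.random.choice([-1, 1], size=shape)      # random initial image
    H, W = shape
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                # sum over the 4 grid neighbors (the Markov blanket of pixel (i, j))
                m = 0
                if i > 0:     m += X[i-1, j]
                if i < H-1:   m += X[i+1, j]
                if j > 0:     m += X[i, j-1]
                if j < W-1:   m += X[i, j+1]
                # full conditional: P(x_ij = +1 | all other pixels) = sigmoid(2 * beta * m)
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * m))
                X[i, j] = 1 if np.random.random() < p_plus else -1
    return X

With a larger $\beta$ the samples develop large smooth regions; with $\beta$ near zero they look like independent pixel noise.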
This was originally a homework problem, but you lucked out :) See [Blei, Ng, and Jordan 2003] for details.
Given a set $\beta=(\beta_1,\dots,\beta_K)$ of $K$ topics and a fixed vocabulary, generate each document in the following way:
1. Draw topic proportions $\theta \sim \mathrm{Dirichlet}(\alpha)$ for the document.
2. For each word position: draw a topic assignment $z \sim \mathrm{Categorical}(\theta)$, then draw the word $w \sim \mathrm{Categorical}(\beta_z)$.
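A minimal simulation sketch of this generative process (the function name and the fixed document length are simplifying assumptions):

def generate_lda_corpus(beta, alpha, n_docs=100, doc_len=50):
    # beta : (K, V) topic matrix, each row a distribution over the vocabulary
    # alpha: (K,) Dirichlet parameter for the per-document topic proportions
    K, V = beta.shape
    corpus = []
    for _ in range(n_docs):
        theta = np.random.dirichlet(alpha)         # per-document topic proportions
        words = []
        for _ in range(doc_len):
            z = np.random.choice(K, p=theta)       # topic assignment for this word
            w = np.random.choice(V, p=beta[z])     # word drawn from the chosen topic
            words.append(w)
        corpus.append(words)
    return corpus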
Latent Dirichlet Allocation is arguably the simplest model that requires approximate inference. It is often used as the "hello world" example when new inference techniques are introduced.
Latent Dirichlet Allocation has many, many applications to areas other than text modeling.