In [ ]:
import numpy as np
import jhu2016_labs.plotting as plotting
import jhu2016_labs.models as models
%matplotlib inline
In a real scenario, we observe a set of values, often called features, and we try to learn a model that can "explain" the observed features. Generally, we do not know the true distribution of the features. For this lab we will simulate a set of one-dimensional features sampled from a GMM. We will call this GMM the "true model", and in this notebook we will explore different strategies to approximate the true model from the simulated training data.
In [ ]:
true_model = models.GMM([-4, 0, 2], [1, 2, .7], [0.2, 0.2, 0.6])
X = true_model.sampleData(50)
Y = np.zeros_like(X) # y-values for plotting only: the random variable is one-dimensional.
x_min = -8
x_max = 7
y_min = -0.01
y_max = .4
fig, ax = plotting.plotGMM(true_model, x_min=x_min, x_max=x_max, y_min=y_min, y_max=y_max,
label='True model')
ax.plot(X, Y, '+', color='b', label='data')
ax.legend(loc='upper left')
We can first try to model the observed data with a Gaussian density function. The Gaussian density function is defined as: $$ \DeclareMathOperator{\Norm}{\mathcal{N}} \DeclareMathOperator{\Gam}{Gam} \DeclareMathOperator{\e}{exp} p(x \mid \mu, \sigma^2) = \Norm(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \e\{ \frac{-(x - \mu)^2}{2\sigma^2} \} $$
Try playing with the mean and the variance of the Gaussian. How does each parameter change the density?
In [ ]:
gaussian = models.Gaussian(0., 1.) # <- 1st parameter is the mean, 2nd is the variance
fig, ax = plotting.plotGaussian(gaussian)
Alternatively, the Gaussian density function can be parameterized with a precision $\lambda$ which is the inverse of the variance: $$ \DeclareMathOperator{\e}{exp} p(x | \mu, \lambda) = \Norm(x \mid \mu, \lambda^{-1}) = \frac{\sqrt{\lambda}}{\sqrt{2\pi}} \e\{ \frac{-\lambda(x - \mu)^2}{2} \} $$
This alternative parameterization will prove to be useful when dealing with Bayesian inference for the Gaussian/GMM density.
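To make the two parameterizations concrete, here is a minimal NumPy sketch (independent of the `models` helper classes; the function names are ours) that evaluates the density with a (mean, variance) pair and with the equivalent (mean, precision) pair and checks that they agree:
In [ ]:
import numpy as np

def gaussian_pdf_var(x, mean, var):
    # Gaussian density parameterized by mean and variance.
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def gaussian_pdf_prec(x, mean, precision):
    # Gaussian density parameterized by mean and precision (1 / variance).
    return np.sqrt(precision / (2 * np.pi)) * np.exp(-precision * (x - mean) ** 2 / 2)

xs = np.linspace(-5, 5, 200)
assert np.allclose(gaussian_pdf_var(xs, 0., 2.), gaussian_pdf_prec(xs, 0., 1. / 2.))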
We will first try to model our data with a simple Gaussian. We need to find the parameters (mean and variance) that best fit the data.
$$ \begin{align} \mu &= \frac{1}{N} \sum_i x_i \\ \sigma^2 &= \frac{1}{N} \sum_i (x_i - \mu)^2 \end{align} $$From the above equations, implement the maximum-likelihood solution for the Gaussian. Does the Gaussian seem to be a good model for the true density?
In [ ]:
##########################
# Write your code here.
mean_ml = np.mean(X)                # ML estimate of the mean
var_ml = np.mean((X - mean_ml)**2)  # ML (biased) estimate of the variance
##########################
# Check your solution by running the following code.
test_gaussian_ml = models.Gaussian.maximumLikelihood(X)
assert np.isclose(test_gaussian_ml.mean, mean_ml), 'incorrect mean'
assert np.isclose(test_gaussian_ml.var, var_ml), 'incorrect variance'
print('Correct solution!')
gaussian_ml = models.Gaussian(mean_ml, var_ml)
llh = gaussian_ml.logLikelihood(X)
print('log-likelihood:', llh)
y_min = -.01
y_max = .4
fig, ax = plotting.plotGMM(true_model, y_min=y_min, y_max=y_max, label='True model')
plotting.plotGaussian(gaussian_ml, fig=fig, ax=ax, color='red', label='Gaussian ML')
ax.plot(X, Y, '+', color='b', label='data')
ax.legend(loc='upper left')
We can work within the Bayesian framework with the Gaussian density by putting a prior over the mean and the precision. A common and convenient choice of prior for the Gaussian is the Normal-Gamma prior:
$$ p(\mu, \lambda \mid m_0, \kappa_0, a_0, b_0) = \Norm(\mu \mid m_0, (\kappa_0 \lambda)^{-1}) \Gam(\lambda \mid a_0, b_0) $$where:
$$ \Gam(\lambda \mid a_0, b_0) = \frac{1}{\Gamma(a_0)} b_0^{a_0} \lambda^{a_0 - 1} \e \{ -b_0 \lambda\} $$$m_0$, $\kappa_0$, $a_0$ and $b_0$ are called hyper-parameters. They are the parameters of the prior distribution.
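As a quick sanity check on these formulas, here is a small sketch (our own helper, not part of the lab code) that evaluates the Normal-Gamma density at a point $(\mu, \lambda)$ with `scipy.stats`:
In [ ]:
from scipy.stats import norm, gamma

def normal_gamma_pdf(mu, lam, m0, kappa0, a0, b0):
    # N(mu | m0, 1/(kappa0 * lam)) * Gam(lam | a0, b0), with b0 acting as a rate parameter.
    return (norm.pdf(mu, loc=m0, scale=np.sqrt(1. / (kappa0 * lam)))
            * gamma.pdf(lam, a=a0, scale=1. / b0))

print(normal_gamma_pdf(0.1, 1.2, m0=0., kappa0=2., a0=5., b0=6.))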
Play with the parameters of the Normal-Gamma. How does each parameter affect the density?
In [ ]:
ng_prior = models.NormalGamma(0, 2, 5, 6)
x_min = -2.
x_max = 2.
y_min = 0.
y_max = 2
plotting.plotNormalGamma(ng_prior, x_min=x_min, x_max=x_max, y_min=y_min, y_max=y_max)
Because the Normal-Gamma is the conjugate prior of the Normal density, the posterior distribution $p(\mu, \lambda \mid \mathbf{x})$ has a closed form solution:
$$ p(\mu, \lambda \mid \mathbf{x}) = \Norm(\mu \mid m_n, (\kappa_n \lambda)^{-1}) \Gam(\lambda \mid a_n, b_n) $$where:
$$ \begin{align} m_n &= \frac{\kappa_0 m_0 + N \bar{x}} {\kappa_0 + N} \\ \kappa_n &= \kappa_0 + N \\ a_n &= a_0 + \frac{N}{2} \\ b_n &= b_0 + \frac{N}{2} \Big( s + \frac{\kappa_0 (\bar{x} - m_0)^2}{\kappa_0 + N} \Big) \\ \bar{x} &= \frac{1}{N} \sum_i x_i \\ s &= \frac{1}{N} \sum_i (x_i - \bar{x})^2 \end{align} $$$N$ is the total number of points in the training data, and $m_n$, $\kappa_n$, $a_n$ and $b_n$ are the parameters of the posterior. Note that they are different from the hyper-parameters!
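A direct NumPy transcription of these update equations might look like the sketch below (`normal_gamma_posterior` is a hypothetical helper shown only to make the updates explicit; the `posterior` method used in the next cell is expected to compute the same quantities):
In [ ]:
def normal_gamma_posterior(x, m0, kappa0, a0, b0):
    # Return (m_n, kappa_n, a_n, b_n) of the Normal-Gamma posterior given data x.
    N = len(x)
    x_bar = np.mean(x)
    s = np.mean((x - x_bar) ** 2)
    kappa_n = kappa0 + N
    m_n = (kappa0 * m0 + N * x_bar) / kappa_n
    a_n = a0 + N / 2.
    b_n = b0 + 0.5 * N * (s + kappa0 * (x_bar - m0) ** 2 / kappa_n)
    return m_n, kappa_n, a_n, b_n

print(normal_gamma_posterior(X[:5], m0=0., kappa0=2., a0=5., b0=6.))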
Compute the posterior distribution with 1, 5, 10, 20 and 50 data points from the training data. What do you observe?
In [ ]:
ng_prior = models.NormalGamma(0, 2, 5, 6)
ng_posterior = ng_prior.posterior(X[:1]) # <- Change '1' to the number of samples you want to use
x_min = -2
x_max = 2
y_min = 0.
y_max = 2
plotting.plotNormalGamma(ng_posterior, x_min=x_min, x_max=x_max, y_min=y_min, y_max=y_max)
Now that we have our posterior distribution, we can predict the probability of a new data point given the training data:
$$ p(x' \mid \mathbf{x}) = \int p(x' \mid \theta) p(\theta \mid \mathbf{x}) d \theta $$For the Gaussian with Normal-Gamma prior the marginal predictive distribution is the Student's t distribution:
$$ \newcommand{\diff}{\mathop{}\!d} \DeclareMathOperator{\St}{\mathcal{St}} p(x' \mid \mathbf{x}) = \int_{-\infty}^{\infty} p(x' \mid \mu, \lambda) p(\mu, \lambda \mid \mathbf{x}) \diff \mu \diff \lambda = \St(x' \mid m_n, \nu, \gamma) $$where:
$$ \begin{align} \nu &= 2a_n \\ \gamma &= \frac{a_n \kappa_n}{b_n(\kappa_n + 1)} \\ \St(x' \mid m_n, \nu, \gamma) &= \frac{\Gamma(\frac{\nu}{2} + \frac{1}{2})}{\Gamma(\frac{\nu}{2})} \Big( \frac{\gamma}{\pi \nu} \Big)^{\frac{1}{2}} \Big[ 1 + \frac{\gamma (x' - m_n)^2}{\nu} \Big]^{-\frac{\nu}{2} - \frac{1}{2}} \end{align} $$Compute the log-likelihood of the predictive distribution derived from a posterior trained with 1, 5, 10, 20 and 50 data points. How does it compare to the log-likelihood of the Gaussian trained with maximum likelihood?
In [ ]:
ng_posterior = ng_prior.posterior(X[:1]) # <- Change '1' to the number of samples you want to use.
predict_pdf = ng_posterior.predictiveDensity()
llh = predict_pdf.logLikelihood(X)
print('log-likelihood:', llh)
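To relate the predictive density to a standard library distribution, the following sketch (our assumption about the correspondence, reusing the hypothetical `normal_gamma_posterior` helper from the earlier sketch) builds the same Student's t with `scipy.stats.t`; the parameter $\gamma$ corresponds to a scale of $1/\sqrt{\gamma}$:
In [ ]:
from scipy.stats import t as student_t

def predictive_logpdf(x, m_n, kappa_n, a_n, b_n):
    # Log-density of the Student's t predictive derived from the Normal-Gamma posterior parameters.
    nu = 2. * a_n
    gamma_prec = (a_n * kappa_n) / (b_n * (kappa_n + 1.))
    return student_t.logpdf(x, df=nu, loc=m_n, scale=1. / np.sqrt(gamma_prec))

m_n, kappa_n, a_n, b_n = normal_gamma_posterior(X[:5], m0=0., kappa0=2., a0=5., b0=6.)
print('log-likelihood (scipy):', np.sum(predictive_logpdf(X, m_n, kappa_n, a_n, b_n)))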
The Gaussian density is a very simple function; however, in most cases of interest the density we try to model has a complex shape that cannot be expressed with a simple formula. A solution is to assume that our complex density is a mixture of $K$ Gaussian densities. This is called a Gaussian Mixture Model (GMM) and it is defined as: $$ p(x \mid \boldsymbol{\mu}, \boldsymbol{\lambda}, \boldsymbol{\pi}) = \sum_{k=1}^{K} \pi_k \Norm(x \mid \mu_k, \lambda_k^{-1}) $$
Observe the influence of each parameter on the density and add or remove some components. Does this model suffer from the same drawbacks as the Gaussian density?
In [ ]:
# Change the initial values of the parameters.
# You can add (or remove) a component by adding (or removing) a mean,
# a variance and a weight. All three arrays must always have the same
# number of elements! Also make sure that the weights always sum up to one.
means = [-4.0, 0.0, 4.0, 5]
variances = [1.0, 2.0, 1.4, 1]
weights = [0.1, 0.4, 0.2, 0.3]
gmm = models.GMM(means, variances, weights)
y_min = 0.
y_max = .3
plotting.plotGMM(gmm, show_components=True, y_min=y_min, y_max=y_max)
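As a quick check of the mixture formula, the following sketch evaluates a GMM density directly with NumPy (independent of `models.GMM`; the parameter values are arbitrary):
In [ ]:
def gmm_pdf(x, means, variances, weights):
    # Evaluate sum_k pi_k * N(x | mu_k, sigma_k^2) at the points x.
    x = np.asarray(x)[:, None]                       # shape (N, 1)
    means = np.asarray(means)[None, :]               # shape (1, K)
    variances = np.asarray(variances)[None, :]
    weights = np.asarray(weights)[None, :]
    comps = np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    return np.sum(weights * comps, axis=1)

print(gmm_pdf([-4., 0., 2.], means=[-4., 0., 4., 5.],
              variances=[1., 2., 1.4, 1.], weights=[0.1, 0.4, 0.2, 0.3]))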
The GMM parameters can be estimated with the Expectation-Maximization (EM) algorithm, an iterative procedure that converges toward a (local) maximum of the log-likelihood of the data given the model. The EM training is as follows:
In [ ]:
# Initialization of the model
means = [-4.0, 0.0, 4.0]
variances = [1.0, 1.0, 1.]
weights = [0.3, 0.4, 0.3]
gmm = models.GMM(means, variances, weights)
## EM algorithm
print('initial log-likelihood:', gmm.logLikelihood(X))
############################################
# Change this loop to check the convergence of the EM algorithm.
for i in range(10):
    # E-step
    Z = gmm.EStep(X)
    # M-step
    gmm.MStep(X, Z)
    llh = gmm.logLikelihood(X)
    print('log-likelihood:', llh)
##############################################
y_min = -0.01
y_max = .4
fig, ax = plotting.plotGMM(true_model, y_min=y_min, y_max=y_max, label='True model')
plotting.plotGMM(gmm, fig=fig, ax=ax, color='r', label='GMM EM')
ax.plot(X, Y, '+', color='b', label='data')
ax.legend(loc='upper left')
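For reference, here is a minimal NumPy sketch of what the E- and M-steps compute for a one-dimensional GMM, assuming the standard EM update equations (the `EStep`/`MStep` methods above are expected to do the same, although their exact interface may differ):
In [ ]:
def em_step(x, means, variances, weights):
    # One EM iteration for a 1-D GMM; returns updated (means, variances, weights).
    x = np.asarray(x)[:, None]                                         # shape (N, 1)
    means, variances, weights = map(np.asarray, (means, variances, weights))

    # E-step: responsibilities resp[i, k] = p(z_i = k | x_i, current parameters).
    comps = weights * np.exp(-(x - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    resp = comps / np.sum(comps, axis=1, keepdims=True)                # shape (N, K)

    # M-step: re-estimate the parameters from the weighted data.
    Nk = np.sum(resp, axis=0)                                          # effective counts
    new_means = np.sum(resp * x, axis=0) / Nk
    new_variances = np.sum(resp * (x - new_means) ** 2, axis=0) / Nk
    new_weights = Nk / len(x)
    return new_means, new_variances, new_weights

m, v, w = [-4., 0., 4.], [1., 1., 1.], [0.3, 0.4, 0.3]
for _ in range(10):
    m, v, w = em_step(X, m, v, w)
print(m, v, w)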
We can also apply Bayesian inference to the GMM by putting a prior over the GMM parameters. Let $\Theta$ be the set of parameters of the GMM: $$ \Theta = \{ \boldsymbol{\mu}, \boldsymbol{\lambda}, \boldsymbol{\pi} \} $$
The prior over the weights $\boldsymbol{\pi}$ will be a Dirichlet distribution:
$$ \DeclareMathOperator{\Dir}{Dir} p(\boldsymbol{\pi}) = \Dir(\boldsymbol{\pi} \mid \boldsymbol{\alpha}_0) $$and the prior of the mean and precision of the component $k$ of the mixture will be a Normal-Gamma distribution:
$$ p(\mu_k, \lambda_k) = \Norm(\mu_k \mid m_0, (\kappa_0 \lambda_k)^{-1}) \Gam(\lambda_k \mid a_0, b_0) $$The joint distribution of the data, the latent variables and the parameters can be written as:
$$ \begin{align} p(\mathbf{x}, \mathbf{z}, \Theta) &= p(\mathbf{x}, \mathbf{z} \mid \Theta)p(\Theta) \\ &= \Bigg[ \prod_{i=1}^{N} p(x_i, z_i \mid \boldsymbol{\mu}, \boldsymbol{\lambda}, \boldsymbol{\pi}) \Bigg] \Bigg[ \Dir(\boldsymbol{\pi} \mid \boldsymbol{\alpha}_0) \prod_{k=1}^{K} \Norm(\mu_k \mid m_0, (\kappa_0 \lambda_k)^{-1}) \Gam(\lambda_k \mid a_0, b_0) \Bigg] \end{align} $$The EM algorithm cannot be applied directly to train the Bayesian GMM. Instead, we will use Gibbs Sampling (GS) to learn the posterior distribution of the parameters. Gibbs Sampling is a simple way to sample values from a complex distribution. Let's say that we want to sample values for the random variables $A$ and $B$ conditioned on $C$. Unfortunately, $p(A, B \mid C)$ is too complex to be sampled from directly. Alternatively, we can sample in turn $a$ from $p(A \mid B = b, C)$ and $b$ from $p(B \mid A = a, C)$. It can be proven that if we keep sampling long enough, the collected values of $a$ and $b$ will be distributed according to $p(A, B \mid C)$. This is the Gibbs Sampling algorithm.
NOTE: Gibbs Sampling is not always applicable as we cannot always sample from the conditional distributions (it may be intractable).
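To illustrate the idea on a toy case that is not part of the lab, the following sketch runs Gibbs Sampling on a bivariate Gaussian with zero means, unit variances and correlation $\rho$, where both conditionals $p(A \mid B = b)$ and $p(B \mid A = a)$ are one-dimensional Gaussians we can sample from directly:
In [ ]:
rho = 0.8            # correlation of the target bivariate Gaussian
a, b = 0., 0.        # arbitrary starting point
samples = []
for _ in range(5000):
    # p(A | B = b) = N(rho * b, 1 - rho^2), and symmetrically for B.
    a = np.random.normal(rho * b, np.sqrt(1 - rho ** 2))
    b = np.random.normal(rho * a, np.sqrt(1 - rho ** 2))
    samples.append((a, b))
samples = np.array(samples)
print('empirical correlation:', np.corrcoef(samples.T)[0, 1])   # should be close to rho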
In the following, we show the three conditional distributions we need for the Gibbs Sampling of the Bayesian GMM.
We define the set of all data points $x_i$ that are assigned to the component $k$ of the mixture as follows: $$ \mathbf{x}_{(k)} = \{ x_i : z_i = k, \forall i \in \{1,... , N \} \} $$ and similarly for the latent variables $\mathbf{z}$: $$ \mathbf{z}_{(k)} = \{ z_i : z_i = k, \forall i \in \{1,... , N \} \} $$
$$ \begin{align} p(\mu_k, \lambda_k \mid \mathbf{x}, \mathbf{z}, \Theta_{\smallsetminus \{ \mu_k, \lambda_k \} } ) &= p(\mu_k, \lambda_k \mid \mathbf{x}_{(k)}, \mathbf{z}_{(k)}, \Theta_{\smallsetminus \{ \mu_k, \lambda_k \} } ) \\ &= \Norm(\mu_k \mid m_{n,k}, (\kappa_{n,k} \lambda_k)^{-1}) \Gam(\lambda_k \mid a_{n,k}, b_{n,k}) \end{align} $$where:
$$ \begin{align} m_{n,k} &= \frac{\kappa_0 m_0 + N_k \bar{x}_k} {\kappa_0 + N_k} \\ \kappa_{n,k} &= \kappa_0 + N_k \\ a_{n,k} &= a_0 + \frac{N_k}{2} \\ b_{n,k} &= b_0 + \frac{N_k}{2} \Big( s_k + \frac{\kappa_0 (\bar{x}_k - m_0)^2}{\kappa_0 + N_k} \Big) \\ N_k &= \left\vert \mathbf{x}_{(k)} \right\vert \\ \bar{x}_k &= \frac{1}{N_k} \sum_{x_i \in \mathbf{x}_{(k)}} x_i \\ s_k &= \frac{1}{N_k} \sum_{x_i \in \mathbf{x}_{(k)}} (x_i - \bar{x}_k)^2 \end{align} $$NOTE: these equations are very similar to the ones for the Bayesian Gaussian. The difference is that the statistics are now computed only from the data points assigned to component $k$.
The conditional distribution of the weights given the latent variables is again a Dirichlet: $$ p(\boldsymbol{\pi} \mid \mathbf{z}, \Theta_{\smallsetminus \{ \boldsymbol{\pi} \}}) = \Dir(\boldsymbol{\pi} \mid \boldsymbol{\alpha}_n) $$where: $$ \alpha_{n,k} = \alpha_{0,k} + N_k \; ; \; \forall \, k = 1\dots K $$Finally, the conditional distribution of the latent variable $z_i$ is a categorical distribution: $$ p(z_i = k \mid x_i, \Theta) = \frac{\pi_k \Norm(x_i \mid \mu_k, \lambda_k^{-1})}{\sum_{j=1}^{K} \pi_j \Norm(x_i \mid \mu_j, \lambda_j^{-1})} $$
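The sketch below shows how one Gibbs sweep could be implemented directly with NumPy from the three conditionals above (a hypothetical standalone implementation for illustration only; the `models.BayesianGMM` class used in the next cell wraps the same steps):
In [ ]:
def gibbs_sweep(x, z, alpha0, m0, kappa0, a0, b0, K):
    # One Gibbs sweep: sample (means, precisions), then weights, then latent assignments.
    means, precisions = np.zeros(K), np.zeros(K)
    for k in range(K):
        # 1) Sample (mu_k, lambda_k) from its Normal-Gamma conditional.
        xk = x[z == k]
        Nk = len(xk)
        x_bar = np.mean(xk) if Nk > 0 else 0.
        s_k = np.mean((xk - x_bar) ** 2) if Nk > 0 else 0.
        kappa_n = kappa0 + Nk
        m_n = (kappa0 * m0 + Nk * x_bar) / kappa_n
        a_n = a0 + Nk / 2.
        b_n = b0 + 0.5 * Nk * (s_k + kappa0 * (x_bar - m0) ** 2 / kappa_n)
        precisions[k] = np.random.gamma(a_n, 1. / b_n)
        means[k] = np.random.normal(m_n, 1. / np.sqrt(kappa_n * precisions[k]))

    # 2) Sample the weights from their Dirichlet conditional.
    weights = np.random.dirichlet(alpha0 + np.bincount(z, minlength=K))

    # 3) Sample the latent assignments from their categorical conditionals.
    lik = (weights * np.sqrt(precisions / (2 * np.pi))
           * np.exp(-0.5 * precisions * (x[:, None] - means) ** 2))
    probs = lik / lik.sum(axis=1, keepdims=True)
    z = np.array([np.random.choice(K, p=p) for p in probs])
    return means, precisions, weights, z

z = np.random.randint(0, 3, size=len(X))          # random initial assignments
for _ in range(5):
    means, precisions, weights, z = gibbs_sweep(X, z, alpha0=np.ones(3),
                                                m0=0., kappa0=1., a0=2., b0=1., K=3)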
In [ ]:
# Hyper-parameters: [alpha_1, alpha_2, ...], m0, kappa0, a0, b0.
# You may try to change them.
bgmm = models.BayesianGMM([1, 1, 1], 0, 1, 2, 1)
y_min = -0.01
y_max = .4
fig, ax = plotting.plotGMM(true_model, y_min=y_min, y_max=y_max, label='True model')
for i in range(5):
    # Sample the latent variables
    Z = bgmm.sampleLatentVariables(X)
    # Update the parameters
    bgmm.sampleMeansVariances(X, Z)
    bgmm.sampleWeights(Z)
    # Just for plotting, this is not part of the Gibbs Sampling algorithm.
    plotting.plotGMM(bgmm.gmm, fig=fig, ax=ax, color='b', lw=.5, label='sampled GMM')
gmm_avg = bgmm.averageGMM()
plotting.plotGMM(gmm_avg, fig=fig, ax=ax, color='r', lw=3, label='avg GMM')
ax.plot(X, Y, '+', color='b', label='data')
handles, labels = ax.get_legend_handles_labels()
labels, ids = np.unique(labels, return_index=True)
handles = [handles[i] for i in ids]
ax.legend(handles, labels, loc='upper left')