Section 4.4: ML applied to Gaussian Mixtures: The Expectation Maximization Algorithm

The likelihood can quickly become complex enough that maximizing it has no analytic solution. However, one complex model that does still admit a relatively simple treatment is a mixture of Gaussians.

Assume that the likelihood of a single datum $x_i$ is given by:
$p(x_i|\theta) = \sum_{j=1}^M \alpha_j \mathcal{N}(x_i;\mu_j, \sigma_j)$
In other words, the probability is a weighted sum of Gaussians with different means and widths. To be a proper pdf, the weights must satisfy
$\sum_{j=1}^M \alpha_j = 1.$

How do we compute the log-likelihood? Go back to the definition of the likelihood:
$L=p(\{x_i\}| \theta)$
where $\theta$ is the vector of parameters. So
$L=\prod_{i=1}^N \left(\sum_{j=1}^M \frac{\alpha_j}{\sqrt{2\pi}\,\sigma_j} \exp\left[\frac{-(x_i-\mu_j)^2}{2\sigma_j^2}\right] \right)$

Since $\log(a\times b) = \log(a) +\log(b)$,
$\ln L = \sum_{i=1}^N \ln\left[\sum_{j=1}^M \frac{\alpha_j}{\sqrt{2\pi}\,\sigma_j} e^{-(x_i-\mu_j)^2/(2\sigma_j^2)}\right]$
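As a quick sanity check, here is a minimal sketch of evaluating this log-likelihood numerically (my own illustration, not code from the book; the parameter values alphas, mus, sigmas below are made up):

import numpy as np
from scipy.stats import norm

def gmm_log_likelihood(x, alphas, mus, sigmas):
    # ln L = sum_i ln[ sum_j alpha_j N(x_i; mu_j, sigma_j) ]
    x = np.asarray(x)
    dens = alphas * norm.pdf(x[:, None], loc=mus, scale=sigmas)  # shape (N, M)
    return np.sum(np.log(dens.sum(axis=1)))

# toy parameters, made up for illustration; the weights must sum to 1
alphas = np.array([0.3, 0.5, 0.2])
mus = np.array([-1.0, 0.0, 3.0])
sigmas = np.array([1.5, 1.0, 0.5])
print(gmm_log_likelihood(np.array([-0.5, 0.2, 2.8]), alphas, mus, sigmas))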

How do we maximize this function? An analytical approach would result in $3M-1$ coupled nonlinear equations ($M$ means, $M$ widths, and $M-1$ independent weights, since the $\alpha_j$ must sum to one). A numerical approach could be followed with the Levenberg-Marquardt algorithm or MCMC, both of which will be discussed later on in the book. However, here they suggest that an approach based on "hidden variables", called the Expectation Maximization (EM) algorithm, will provide a quicker solution.


They begin by introducing the concept of hidden variables and class labels. I think they're just saying: suppose you could associate each data point $x_i$ with the Gaussian component $\mathcal{N}(\mu_j,\sigma_j)$ that actually generated it. How would you find the correct Gaussian? Well, you can use Bayes' rule to find the probability that a given point $x_i$ belongs to Gaussian $j$.

Reminder of what Bayes' theorem says:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
Applied here:
$p(j|x_i) = \frac{p(x_i|j) p(j)}{p(x_i)} = \frac{\alpha_j \mathcal{N}(x_i;\mu_j, \sigma_j)}{\sum_{k=1}^M \alpha_k \mathcal{N}(x_i;\mu_k, \sigma_k)}$
(Where did $p(j)$ go? It's still there: the prior probability of class $j$ is just the mixture weight, $p(j)=\alpha_j$, and $p(x_i|j)=\mathcal{N}(x_i;\mu_j,\sigma_j)$.)

After that calculation seems to go nowhere, they transition back to the log-likelihood and generalize the equation, replacing the normal distribution with a generic $p_j(x_i|\theta)$, where it is assumed (but not notated) that component $j$ depends only on its own subset of parameters $\theta_j$.

$\ln L = \sum_{i=1}^N \ln\left[\sum_{j=1}^M \alpha_j\, p_j(x_i|\theta)\right]$

Then they do a bunch of math that I didn't really feel like reproducing myself.
And then they switch back to the specific case of Gaussians, which makes the math simplify.

Anyway, the main point is that you end up with expressions for $\alpha_j$, $\mu_j$, and $\sigma_j$ as functions of $p(j|x_i)$, and you use the Bayes theorem expression from above to compute $p(j|x_i)$ as a function of $\mathcal{N}(\mu_j,\sigma_j)$. If you begin with guesses for the parameters and then iterate between these two sets of expressions (the "E" and "M" steps), you'll usually converge to values of $\alpha_j$, $\mu_j$, and $\sigma_j$ that maximize the likelihood function. Good choices of initial guesses are to set all $\sigma_j$ to the sample standard deviation, all $\alpha_j$ to $1/M$, and the $\mu_j$ by randomly sampling from $\{x_i\}$.
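To make that iteration concrete, here is a minimal NumPy sketch of the EM loop just described (my own illustration, not the book's code and not sklearn): the E-step evaluates the responsibilities $p(j|x_i)$ with the Bayes expression above, and the M-step re-estimates $\alpha_j$, $\mu_j$, and $\sigma_j$ as responsibility-weighted averages.

import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, M, n_iter=100, rng=np.random):
    # initial guesses as suggested above: all sigma_j = sample std dev,
    # all alpha_j = 1/M, mu_j drawn at random from the data
    x = np.asarray(x, dtype=float)
    N = len(x)
    alphas = np.ones(M) / M
    sigmas = np.full(M, x.std())
    mus = rng.choice(x, M, replace=False)
    for _ in range(n_iter):
        # E-step: responsibilities p(j|x_i), shape (N, M)
        w = alphas * norm.pdf(x[:, None], loc=mus, scale=sigmas)
        resp = w / w.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates (no safeguards against empty components)
        Nj = resp.sum(axis=0)
        alphas = Nj / N
        mus = (resp * x[:, None]).sum(axis=0) / Nj
        sigmas = np.sqrt((resp * (x[:, None] - mus) ** 2).sum(axis=0) / Nj)
    return alphas, mus, sigmas

Running this on a sample like the X generated below (flattened to 1-D, with M=3) should land close to the true parameters, although EM only finds a local maximum of the likelihood, so different random initializations can give somewhat different answers.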

Here is an example of the EM algorithm, figure 4.2 in the book.


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
from sklearn.mixture import GMM #Gaussian Mixture Model

In [3]:
from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)

In [4]:
np.random.seed(1)
gmm=GMM(3,n_iter=1)  #gmm is now an (unfitted) 3-component Gaussian mixture model object, limited to one EM iteration

In [5]:
gmm #see?


Out[5]:
GMM(covariance_type='diag', init_params='wmc', min_covar=0.001,
  n_components=3, n_init=1, n_iter=1, params='wmc', random_state=None,
  thresh=0.01)

In [7]:
gmm.means_ = np.array([[-1],[0],[3]]) #Trailing underscores mark attributes normally set by fitting (sklearn convention); here we set them by hand
gmm.covars_ = np.array([[1.5],[1],[0.5]])**2
gmm.weights_ = np.array([[0.3,0.5,0.2]])

X=gmm.sample(1000) #Now sample data from the distribution

print gmm.means_
print gmm.weights_
#means_ has shape (n_components, n_features) = (3, 1), i.e. a column of 1-D means;
#weights_ should just be a length-3 array, but the extra brackets above made it (1, 3)


[[-1]
 [ 0]
 [ 3]]
[[ 0.3  0.5  0.2]]

In [8]:
#Now fit the data with a GMM model 
N=np.arange(1,11) #allow for up to 10 gaussian components
models=[None for i in range(len(N))] #initialize set of models

#Now compute best-fit models with 1,2,3,4...etc. components
for i in range(len(N)):
    models[i]=GMM(N[i]).fit(X)
#Note that GMM(x).fit is performing the EM algorithm

#Then compute the AIC and BIC of the models
AIC = [m.aic(X) for m in models] #Akaike information criterion
BIC = [m.bic(X) for m in models] #Bayesian information criterion

In [9]:
#Now for some plots
#Setup
fig = plt.figure(figsize=(5,1.7))
fig.subplots_adjust(left=0.12,right=0.97, 
                    bottom=0.21,top=0.9, wspace=0.5)

#First plot: data + best-fit mixture
ax = fig.add_subplot(131)
M_best = models[np.argmin(AIC)] #Model with lowest AIC is best

x=np.linspace(-6,6,1000)
#Compute the log probability of X under the model and
#return the posterior distribution (responsibilities) of each
#mixture component for each element of X.
logprob, responsibilities = M_best.score_samples(x)
pdf = np.exp(logprob)
pdf_individual = responsibilities*pdf[:,np.newaxis]

ax.hist(X,30,normed=True, histtype='stepfilled', alpha=0.4)#Data
ax.plot(x,pdf,'-k') #Combined pdf of model
ax.plot(x,pdf_individual, '--k') #pdf of each gaussian comp.
ax.text(0.04,0.96,"Best-fit Mixture",
        ha='left', va='top', transform=ax.transAxes)
ax.set_xlabel('$x$')
ax.set_ylabel('$p(x)$')

#2nd plot: AIC and BIC
ax=fig.add_subplot(132)
ax.plot(N,AIC, '-k', label='AIC')
ax.plot(N,BIC, '--k', label='BIC')
ax.set_xlabel('\# of components')
ax.set_ylabel('Information criterion')
ax.legend(loc=2)

#posterior probabilities for each component
ax = fig.add_subplot(133)

p = M_best.predict_proba(x)
p = p[:, (1,0,2)] #Rearrange order so plot looks better
p = p.cumsum(1).T

ax.fill_between(x,0,p[0], color='gray', alpha=0.3)
ax.fill_between(x,p[0],p[1], color='gray', alpha=0.5)
ax.fill_between(x,p[1],1,color='gray',alpha=0.7)
ax.set_xlim(-6,6)
ax.set_ylim(0,1)
ax.set_xlabel('$x$')
ax.set_ylabel(r'$p({\rm class}|x)$')

ax.text(-5,0.3,'class 1', rotation='vertical')
ax.text(0,0.5,'class 2', rotation='vertical')
ax.text(3,0.3,'class 3', rotation='vertical')


Out[9]:
<matplotlib.text.Text at 0x1075fdd50>

How well did the model do?


In [11]:
print gmm.means_
print M_best.means_
#It got the number of components right, 
#and is in the ballpark of the means


[[-1]
 [ 0]
 [ 3]]
[[ 0.17827918]
 [ 2.85873886]
 [-1.08315069]]

In [13]:
#And here's a comparison of the weights
print gmm.weights_
print M_best.weights_


[[ 0.3  0.5  0.2]]
[ 0.43139915  0.19999963  0.36860122]

This section ends with a discussion of how to change the algorithm if you have a measurement error $e$ on each $x_i$. It goes through some math following the inclusion of this error, and then seems to say that EM algorithms fail, but then says they're still useful. I didn't really get it.

Oh, and if you want to deal with mixtures that are not Gaussian, just consult the "abundant and easily accessible literature on the various forms of the EM algorithm." Obviously.

Section 4.5: The bootstrap and jackknife

The bootstrap and jackknife are non-parametric methods for determining confidence limits on estimated parameters. In other words, they can be used when the shape of the distribution from which $\{x_i\}$ is drawn is unknown.

The basic idea of the bootstrap is that you take your dataset of size $N$ and create $B$ new datasets, each of size $N$, by sampling from $\{x_i\}$ with replacement. You then use the $B$ new datasets to estimate uncertainties in the parameters (a minimal sketch follows after the list below). There is also a parametric version of the bootstrap, where you sample data from the best-fit model.

The book mentions that astronomers screw this up a lot, and references a paper by Tom Loredo, which I looked into to try to understand this better. I didn't come out with a great understanding, but here's what I gathered:
- The Numerical Recipes prescription for the bootstrap is wrong.
- One should think carefully about whether to use a parametric or non-parametric bootstrap.
- The variability of derived parameters is not the same as the uncertainty in the derived parameters.
- The bootstrapped distribution over parameter space is not actually a pdf.
- For linear models with Gaussian-distributed errors, the (flawed) approach will yield the correct result, but by accident.
- People who study exoplanet transits are the worst.
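As a concrete (if simplistic) illustration of the non-parametric bootstrap just described, here is a sketch of estimating the uncertainty of a statistic by resampling with replacement (my own code, not astroML's bootstrap, which is used in the cells below):

import numpy as np

def bootstrap_std_error(data, statistic, B=1000, rng=np.random):
    # draw B resamples of size N with replacement and look at the
    # spread of the statistic across them
    data = np.asarray(data)
    N = len(data)
    stats = np.empty(B)
    for b in range(B):
        stats[b] = statistic(data[rng.randint(0, N, N)])
    return stats.std()

# e.g. the uncertainty of the mean of 100 standard-normal points;
# this should come out near the analytic 1/sqrt(100) = 0.1
print(bootstrap_std_error(np.random.normal(0, 1, 100), np.mean))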

Figure 4.3 shows a simple example of the bootstrap, demonstrating that when you sample from a Gaussian distribution, the bootstrap estimates of the uncertainties are in line with the analytic expectations for a Gaussian. In particular, bootstrap distributions are computed for the width estimators $\sigma$ (the sample standard deviation) and $\sigma_G$ (the quartile-based width).

Equation 3.35: $\sigma_s$, the uncertainty of the scale parameter (i.e. the uncertainty in the $\sigma$ of the Gaussian from which the data were drawn).
Equation 3.36: the quartile-based width estimate, $\sigma_G = 0.7413\,(q_{75}-q_{25})$ (a quick numerical check of this follows below).
Equation 3.37: the uncertainty of any quantile-based estimate (including $\sigma_G$).
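As a reminder of what $\sigma_G$ is, here is a quick sketch (mine, not from the book) of computing it from the quartiles and comparing it with the ordinary sample standard deviation; for Gaussian data both should be close to the true $\sigma$:

import numpy as np

data = np.random.normal(0, 1, 1000)
q25, q75 = np.percentile(data, [25, 75])
sigma_G = 0.7413 * (q75 - q25)   # quartile-based width estimate (Eq. 3.36)
s = data.std(ddof=1)             # ordinary sample standard deviation
print(s, sigma_G)                # both should be near 1 for this sample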

I personally found this very frustrating because the discussion seemed to imply that it's difficult to estimate uncertainties for non-simple cases, and then they just show us a simple case where we already have analytical solutions for what we're looking for.


In [18]:
#Setup
from scipy.stats import norm
from matplotlib import pyplot as plt

from astroML.resample import bootstrap
from astroML.stats import sigmaG

from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)

m = 1000  # number of points
n = 10000  # number of bootstraps

#------------------------------------------------------------
# sample values from a normal distribution
np.random.seed(123)
data = norm(0, 1).rvs(m) #Returns m random variables from normal distribution with mean 0, sigma 1

#------------------------------------------------------------
# Compute bootstrap resamplings of data

# Compute a set of n bootstrap samples from the dataset "data", and return the array of computed standard deviations 
mu1_bootstrap = bootstrap(data, n,  np.std, kwargs=dict(axis=1, ddof=1))
# Compute a set of n bootstrap samples from the dataset "data", and return the array of computed sigmaG's
mu2_bootstrap = bootstrap(data, n, sigmaG, kwargs=dict(axis=1))

#------------------------------------------------------------
# Compute the theoretical expectations for the two distributions
x = np.linspace(0.8, 1.2, 1000)

sigma1 = 1. / np.sqrt(2 * (m - 1))
pdf1 = norm(1, sigma1).pdf(x)

sigma2 = 1.06 / np.sqrt(m)
pdf2 = norm(1, sigma2).pdf(x)

#------------------------------------------------------------
# Plot the results
fig, ax = plt.subplots(figsize=(5, 3.75))

ax.hist(mu1_bootstrap, bins=50, normed=True, histtype='step',
        color='blue', ls='dashed', label=r'$\sigma\ {\rm (std. dev.)}$')
ax.plot(x, pdf1, color='gray')

ax.hist(mu2_bootstrap, bins=50, normed=True, histtype='step',
        color='red', label=r'$\sigma_G\ {\rm (quartile)}$')
ax.plot(x, pdf2, color='gray')

ax.set_xlim(0.82, 1.18)

ax.set_xlabel(r'$\sigma$')
ax.set_ylabel(r'$p(\sigma|x,I)$')

ax.legend()

plt.show()


Okay, now the jackknife

In the jackknife, you form $N$ samples of size $N-1$ by removing one point at a time from the dataset, and you compute the desired statistic $\alpha$ on each. There are then formulae for the best jackknife estimate of $\alpha$, called $\alpha^J$, and for its standard error $\sigma_\alpha$. These formulae are functions of $N$, of $\alpha_N$ (the $\alpha$ computed from the whole sample), and of the full set of leave-one-out values $\alpha_i$ computed in the jackknife process (a sketch of these formulae is given just below).
Figure 4.4 shows a jackknife example, which demonstrates an important point: the jackknife is good at estimating uncertainties for "smooth differential" statistics, but not for "rank-based" statistics such as the median or quantiles.
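Since the formulae aren't written out above, here is a sketch using the standard jackknife expressions (the book's notation may differ slightly): the bias-corrected estimate is $\alpha^J = N\alpha_N - (N-1)\bar{\alpha}$, and the standard error is $\sigma_\alpha = \sqrt{\tfrac{N-1}{N}\sum_i(\alpha_i-\bar{\alpha})^2}$, where $\bar{\alpha}$ is the mean of the leave-one-out values.

import numpy as np

def jackknife_estimate(data, statistic):
    # leave one point out at a time, recompute the statistic, then form the
    # bias-corrected estimate and its standard error (standard textbook formulae)
    data = np.asarray(data)
    N = len(data)
    alpha_N = statistic(data)
    alpha_i = np.array([statistic(np.delete(data, i)) for i in range(N)])
    alpha_bar = alpha_i.mean()
    alpha_J = N * alpha_N - (N - 1) * alpha_bar
    sigma_alpha = np.sqrt((N - 1.0) / N * np.sum((alpha_i - alpha_bar) ** 2))
    return alpha_J, sigma_alpha

print(jackknife_estimate(np.random.normal(0, 1, 100), np.std))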


In [14]:
import numpy as np
from scipy.stats import norm
from matplotlib import pyplot as plt

from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)

#------------------------------------------------------------
# sample values from a normal distribution
np.random.seed(123)
m = 1000  # number of points
data = norm(0, 1).rvs(m) #The dataset comes from a normal distribution with mean 0, std 1

#------------------------------------------------------------
# Compute jackknife resamplings of data
from astroML.resample import jackknife
from astroML.stats import sigmaG

# mu1 is the mean of the standard-deviation-based width
#jackknife(your dataset, the statistic you want to calculate, keyword arguments to pass to np.std)
#return_raw_distribution=True means it will also return the actual jackknife distribution.
#jackknife returns the mean of the jk distribution, the sigma of the distribution, and (if set) 
#the distribution itself.

#This is a computation of the best-fit standard deviation (np.std)
mu1, sigma_mu1, mu1_raw = jackknife(data, np.std,
                                    kwargs=dict(axis=1, ddof=1),
                                    return_raw_distribution=True)

pdf1_theory = norm(1, 1. / np.sqrt(2 * (m - 1)))
pdf1_jackknife = norm(mu1, sigma_mu1)

# mu2 is the mean of the interquartile-based width (sigmaG)
#  WARNING: do not use the following in practice.  This example
#           shows that jackknife fails for rank-based statistics.
mu2, sigma_mu2, mu2_raw = jackknife(data, sigmaG,
                                    kwargs=dict(axis=1),
                                    return_raw_distribution=True)
pdf2_theory = norm(data.std(), 1.06 / np.sqrt(m))
pdf2_jackknife = norm(mu2, sigma_mu2)
print mu2, sigma_mu2

#------------------------------------------------------------
# plot the results
print "mu_1 mean: %.2f +- %.2f" % (mu1, sigma_mu1)
print "mu_2 mean: %.2f +- %.2f" % (mu2, sigma_mu2)

fig = plt.figure(figsize=(5, 2))
fig.subplots_adjust(left=0.11, right=0.95, bottom=0.2, top=0.9,
                    wspace=0.25)

ax = fig.add_subplot(121)
ax.hist(mu1_raw, np.linspace(0.996, 1.008, 100),
        label=r'$\sigma^*\ {\rm (std.\ dev.)}$',
        histtype='stepfilled', fc='white', normed=False)
ax.hist(mu2_raw, np.linspace(0.996, 1.008, 100),
        label=r'$\sigma_G^*\ {\rm (quartile)}$',
        histtype='stepfilled', fc='gray', normed=False)
ax.legend(loc='upper left', handlelength=2)

ax.xaxis.set_major_locator(plt.MultipleLocator(0.004))
ax.set_xlabel(r'$\sigma^*$')
ax.set_ylabel(r'$N(\sigma^*)$')
ax.set_xlim(0.998, 1.008)
ax.set_ylim(0, 550)

ax = fig.add_subplot(122)
x = np.linspace(0.45, 1.15, 1000)
ax.plot(x, pdf1_jackknife.pdf(x),
        color='blue', ls='dashed', label=r'$\sigma\ {\rm (std.\ dev.)}$',
        zorder=2)
ax.plot(x, pdf1_theory.pdf(x), color='gray', zorder=1)
ax.plot(x, pdf2_jackknife.pdf(x),
        color='red', label=r'$\sigma_G\ {\rm (quartile)}$', zorder=2)
ax.plot(x, pdf2_theory.pdf(x), color='gray', zorder=1)
plt.legend(loc='upper left', handlelength=2)

ax.set_xlabel(r'$\sigma$')
ax.set_ylabel(r'$p(\sigma|x,I)$')
ax.set_xlim(0.45, 1.15)
ax.set_ylim(0, 24)

plt.show()


0.597747861969 0.0313531079465
mu_1 mean: 1.00 +- 0.02
mu_2 mean: 0.60 +- 0.03

Note above how the jackknife distribution of the standard deviation is smooth, while that of the quartile-based width is not. This is because removing any single point below $q_{25}$, for example, always returns the same value of the statistic, so the leave-one-out values pile up at only a few discrete levels. The consequence of this strange distribution is shown on the right. The standard-deviation value and its uncertainty are similar to the analytic solutions. However, while the uncertainty in $\sigma_G$ is similar to the analytic uncertainty, its actual value is way off: it should be 1, but it comes out at 0.6.

Some of the takeaways in this section seem to be:
- Jackknife and bootstrap both have subtleties that you might want to better understand before using them.
- Smooth statistics will behave better than rank-based statistics.
- You might want to try both the jackknife and the bootstrap to get a sense for whether things are behaving.

Section 4.6: Hypothesis testing

Hypothesis testing is deciding whether some measured data are consistent with a hypothesis. Often we propose a null hypothesis and then see whether we can reject it given our dataset. As an example, they mention: suppose you want to decide whether a bright spot in an image is a source or background. The null hypothesis would be "this data point comes from the background distribution." If the actual data point is sufficiently unlikely under this background distribution, we reject the null hypothesis.

Some terminology:
- The p value is the probability of obtaining a value at least as large as $x_i$ under the null distribution.
- Type I errors are false positives: the null hypothesis is rejected, but it is actually true. In our example, we think we've detected a source, but it is just noise.
- Type II errors are false negatives: the null hypothesis is false, but it is not rejected. In our example, there is a source there, but we can't distinguish it from the noise.
- The significance level $\alpha$ is the probability of getting $x_i \geq x_c$ assuming the null hypothesis. In other words, $\alpha=\int_{x_c}^\infty h_B(x)\, dx$, where $h_B(x)$ is the distribution under the null hypothesis (the background distribution in our example).

An important point they make is that with multiple hypothesis testing, the fraction of false positives can exceed $\alpha$.


In [15]:
from scipy.stats import norm
from matplotlib import pyplot as plt
from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)

x=np.linspace(0,200)

a=0.1
xc=120
mu_s=150
mu_b=100
sigma_s=12
sigma_b=10
hs = a*norm(mu_s,sigma_s).pdf(x) #source contribution to the total pdf: a*h_S(x)
hb = (1-a)*norm(mu_b,sigma_b).pdf(x) #background contribution: (1-a)*h_B(x)
h=hs+hb

fig = plt.figure(figsize=(10, 4))
fig.subplots_adjust(left=0.11, right=0.95, bottom=0.2, top=0.9,
                    wspace=0.25)

ax = fig.add_subplot(131)
ax.set_xlim(60,180)
ax.set_ylim(0,0.05)
ax.plot(x,hs, linewidth=3)
ax.plot(x,hb, linewidth=3)
ax.plot(x,h, '--', linewidth=3)
ax.set_xlabel('x')
ax.plot([xc,xc], [0,1])

ax = fig.add_subplot(132)
ax.set_xlim(60,180)
ax.set_ylim(0,0.01)
ax.plot(x,hs, linewidth=3)
ax.plot(x,hb, linewidth=3)
ax.plot(x,h, '--', linewidth=3)
ax.set_xlabel('x')
ax.set_title('False positives')
ax.plot([xc,xc], [0,1], linewidth=3)
ax.fill_between(x[x>=120],0,hb[x>=120], color='gray', alpha=1)

ax = fig.add_subplot(133)
ax.set_xlim(60,180)
ax.set_ylim(0,0.002)
ax.plot(x,hs, linewidth=3)
ax.plot(x,hb, linewidth=3)
ax.plot(x,h, '--', linewidth=3)
ax.set_xlabel('x')
ax.set_title('False negatives')
ax.plot([xc,xc], [0,1], linewidth=3)
ax.fill_between(x[x<=120],0,hs[x<=120], color='gray', alpha=1)


Out[15]:
<matplotlib.collections.PolyCollection at 0x1075c1490>

In [96]:
a=0.1
N=1e6
alpha=(1-norm.cdf(xc,loc=mu_b, scale=sigma_b))

In [97]:
alpha #My number seems to be slightly lower than theirs (0.024)


Out[97]:
0.022750131948179209

In [98]:
nspurious=N*(1-a)*alpha  #number of false positives
nspurious


Out[98]:
20475.118753361287

In [99]:
nmissed=N*a*norm.cdf(xc, loc=mu_s, scale=sigma_s) #number of false negatives
nmissed


Out[99]:
620.96653257761318

In [100]:
nsource = N*a - nmissed + nspurious #Number classified as a source
nsource


Out[100]:
119854.15222078367

In [107]:
#completeness = (N*a-nmissed)/(N*a) #Na cancels so this equals (1-norm.cdf(xc, loc=mu_s, scale=sigma_s))
#completeness
completeness=1-norm.cdf(xc, loc=mu_s, scale=sigma_s)
completeness


Out[107]:
0.99379033467422384

In [102]:
contamination = nspurious / nsource #Depends on a! If a is very small, nearly all of the detections will be spurious
contamination


Out[102]:
0.17083362047936407

Notice that the contamination rate (17%) is much higher than $\alpha$ (0.02)!

We might ask - what is the optimal way to define $x_c$? Below, I show what happens to completeness and contamination as a function of $x_c$.


In [16]:
xc_list=np.arange(50,200)
contamination_list=[]
completeness_list=[]
a=0.1
N=1e6
for xc in xc_list:
    alpha=(1-norm.cdf(xc,loc=mu_b, scale=sigma_b))
    nspurious=N*(1-a)*alpha  #number of false positives
    nmissed=N*a*norm.cdf(xc, loc=mu_s, scale=sigma_s) #number of false negatives
    nsource = N*a - nmissed + nspurious #Number classified as a source
    completeness_list.append(1-norm.cdf(xc, loc=mu_s, scale=sigma_s))
    contamination_list.append(nspurious / nsource) #Depends on a! The smaller a is, the more of the detections are spurious

completeness=np.array(completeness_list)
contamination=np.array(contamination_list)

Here is a plot of completeness and contamination vs. xc. Note that you want to maximize completeness, but minimize contamination!


In [180]:
from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)

fig = plt.figure(figsize=(10, 4))
fig.subplots_adjust(left=0.11, right=0.95, bottom=0.2, top=0.9,
                    wspace=0.25)
    
ax = fig.add_subplot(111)
ax.plot(xc_list, completeness, label='completeness', linewidth=3)
ax.plot(xc_list, contamination, label='contamination', linewidth=3)
ax.set_xlabel('$x_c$', fontsize=20)
for axis in ['top','bottom','left','right']:
  ax.spines[axis].set_linewidth(4)
ax.set_xlim(80,180)
ax.legend(fontsize=20)


Out[180]:
<matplotlib.legend.Legend at 0x11e83f690>

But this whole exercise assumes that we know $a$. In practice, we usually don't. A technique for dealing with this and finding a sensible value of $x_c$ is called the Benjamini and Hochberg method. The requirements are that the data can be described as $h(x) = (1-a)h_B(x) + a\,h_S(x)$, and that $h_B$ is known.
Define $p_i = 1-H_B(x_i)$, where $H_B$ is the cumulative distribution of the background,
and sort the sample so that the $p_i$ are increasing. As far as I can tell, the cutoff is then placed at the largest $i$ for which $p_i \le \varepsilon_0\, i/N$, where $\varepsilon_0$ is the desired false-discovery rate, and everything up to that point is classified as a source (see the sketch below).
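Here is a minimal sketch of that procedure as I understand it, using the standard Benjamini-Hochberg criterion (my own code; the book's formulation and notation may differ in detail). The toy example reuses the background/source model from above (mu_b=100, sigma_b=10, mu_s=150, sigma_s=12, a=0.1).

import numpy as np
from scipy.stats import norm

def benjamini_hochberg(x, H_B, eps0=0.05):
    # x: measured values; H_B: cdf of the known background distribution;
    # eps0: target false-discovery rate.
    # Returns a boolean mask of the points classified as sources.
    x = np.asarray(x)
    N = len(x)
    p = 1 - H_B(x)                          # p_i = 1 - H_B(x_i)
    order = np.argsort(p)                   # sort so the p_i increase
    below = np.nonzero(p[order] <= eps0 * np.arange(1, N + 1) / N)[0]
    mask = np.zeros(N, dtype=bool)
    if len(below) > 0:
        i_c = below.max()                   # largest index passing the criterion
        mask[order[:i_c + 1]] = True
    return mask

# toy example: 90% background, 10% sources
np.random.seed(0)
x = np.concatenate([np.random.normal(100, 10, 900),   # background
                    np.random.normal(150, 12, 100)])  # sources
print(benjamini_hochberg(x, norm(100, 10).cdf, eps0=0.05).sum())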


In [ ]: