Lecture 0 - Introduction

The goal of this lecture is to give a basic introduction to the numpy library, to sampling from probability distributions, and to collapsed Gibbs sampling for a Gaussian mixture model.


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

Gentle introduction to the numpy library

Let's import the numpy and pyplot packages first


In [2]:
import numpy as np
import matplotlib.pyplot as plt

Drawing Samples from a Distribution

The Dirichlet distribution is a distribution over multinomial distributions: a random sample $\vec\theta$ drawn from a $K$-dimensional Dirichlet can be used as the parameter of a multinomial distribution (i.e. $\sum_{i=1}^K \theta_i = 1$ and $\theta_i > 0$ for all $i$).

PDF of an asymmetric $K$-dimensional Dirichlet distribution (a different parameter for each dimension):

$$Dir(\vec\theta | \vec\alpha) = \frac{\Gamma(\sum_{i=1}^K \alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i-1}$$

PDF of a symmetric Dirichlet distribution (the same parameter for every dimension):

$$Dir(\vec\theta | \alpha) = \frac{\Gamma(K \alpha)}{\Gamma(\alpha)^K} \prod_{i=1}^K \theta_i^{\alpha-1}$$
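
As a quick sanity check (a sketch, not one of the original notebook cells), the symmetric PDF above can be evaluated directly and compared against scipy.stats.dirichlet, assuming a scipy version recent enough to provide it:

from scipy.stats import dirichlet
from scipy.special import gammaln

a, K_demo = 2.0, 3                          # symmetric parameter and dimension (demo values)
theta_demo = np.array([0.2, 0.3, 0.5])      # a point on the probability simplex

# log-density computed from the symmetric formula above
logp = gammaln(K_demo * a) - K_demo * gammaln(a) + np.sum((a - 1) * np.log(theta_demo))
print(np.allclose(dirichlet.pdf(theta_demo, [a] * K_demo), np.exp(logp)))  # True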

Now, let's draw samples from a Dirichlet distribution with the numpy library


In [3]:
# number of dimensions of Dirichlet distribution
ndim = 5
# number of samples drawn
nsample = 10
# parameter of the Dirichlet distribution
alpha = [0.1] * ndim

# drawing samples using numpy
theta = np.random.dirichlet(alpha, size=nsample)
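
theta is an array of shape (nsample, ndim); each row is one draw and lies on the probability simplex. A quick check (a sketch, not one of the original cells):

print(theta.shape)                         # (10, 5)
print(np.allclose(theta.sum(axis=1), 1.))  # every sample sums to 1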

In [4]:
# plot samples using bar charts
ndraws = 4
plt.subplots(figsize=(12,3*ndraws/2))
for i in xrange(ndraws):
    plt.subplot(ndraws/2, 2, i+1)  # subplot indices are 1-based
    plt.bar(xrange(ndim), theta[i])



Let's compute the mean of one million samples from Dir(0.1)


In [5]:
# increase the sample size
nsample = 1000000
theta = np.random.dirichlet(alpha, size=nsample)
np.mean(theta, 0)


Out[5]:
array([ 0.20006261,  0.19972409,  0.20022272,  0.20010768,  0.1998829 ])
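
This agrees with the theoretical mean of a symmetric Dirichlet, which does not depend on $\alpha$:

$$E[\theta_i] = \frac{\alpha}{K\alpha} = \frac{1}{K} = 0.2 \qquad \text{for } K = 5$$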

Now, let's compute the mean of one million samples from Dir(1)


In [6]:
alpha = [1]*ndim
nsample = 1000000
theta = np.random.dirichlet(alpha, size=nsample)
np.mean(theta, 0)


Out[6]:
array([ 0.20029273,  0.20002407,  0.19991972,  0.19992145,  0.19984203])

In [7]:
ndraws = 4
plt.subplots(figsize=(12,3*ndraws/2))
for i in xrange(ndraws):
    plt.subplot(ndraws/2, 2, i+1)  # subplot indices are 1-based
    plt.bar(xrange(ndim), theta[i])


Let's compute the mean of one million samples from Dir(100)


In [8]:
alpha = [100]*ndim
nsample = 1000000
theta = np.random.dirichlet(alpha, size=nsample)
np.mean(theta, 0)


Out[8]:
array([ 0.20001437,  0.19997349,  0.20001017,  0.200016  ,  0.19998597])

In [9]:
ndraws = 4
plt.subplots(figsize=(12,3*ndraws/2))
for i in xrange(ndraws):
    plt.subplot(ndraws/2, 2, i+1)  # subplot indices are 1-based
    plt.bar(xrange(ndim), theta[i])


The means of the samples from the different Dirichlet distributions are almost identical; however, as the parameter value decreases, the individual samples become sparser (most of the probability mass concentrates on a few dimensions).
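
To make the sparsity claim concrete, here is a small sketch (not one of the original cells) that measures, for several values of $\alpha$, the fraction of components that are close to zero:

for a in [0.1, 1., 100.]:
    samples = np.random.dirichlet([a] * ndim, size=10000)
    print('alpha = %g, fraction of components < 0.01: %.3f' % (a, np.mean(samples < 0.01)))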

Gaussian Mixture Model

In this section, I will describe the collapsed Gibbs sampling algorithm for a Gaussian mixture model (GMM) with known variance (i.e. we assume the variance of each cluster is known).

Generative process of GMM

Priors: $$ \mu_k \sim N(\mu_0, \sigma_0^2) \\ \pi \sim \text{Dir} (\alpha) $$

For each data point $i$: $$ z_i \sim \text{Mult}(\pi)\\ x_i \sim N(\mu_{z_i}, \sigma^2) $$

Let's start by generating a data set of 100 samples from 5 clusters


In [10]:
pi = np.array([0.2]*5)
mu = [-5,-2.5,0,2.5,5]
sigma = 1.
K = 5


num_data = 100
x = np.zeros(num_data)
z = np.zeros(num_data, dtype=int)
for i in xrange(num_data):
    z[i] = np.random.multinomial(1, pi).argmax()
    x[i] = np.random.normal(mu[z[i]], sigma)
    
for k in xrange(K):
    print 'cluster', k, ' = ', np.mean(x[z == k]), ' num cases = ', np.sum(z==k)


cluster 0  =  -5.09849587101  num cases =  21
cluster 1  =  -2.36740367617  num cases =  17
cluster 2  =  0.214391568419  num cases =  26
cluster 3  =  3.05373272972  num cases =  22
cluster 4  =  4.39649752891  num cases =  14

Collapsed Gibbs sampling for GMM

$$ p(z_i|z_{-i}, x, \mu_0, \alpha) \propto p(x_i | z_i, x_{-i}, \mu_0) p(z_i|z_{-i}, \alpha) $$

The first term on the RHS is the posterior predictive distribution arising from normal-normal conjugacy, and the second term is the posterior predictive distribution arising from Dirichlet-multinomial conjugacy.
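
Concretely (standard conjugacy results, added here for reference): with known cluster variance $\sigma^2$, a normal prior $N(\mu_0, \sigma_0^2)$ on each cluster mean, $n_k$ points currently assigned to cluster $k$ (excluding $x_i$), and $N$ data points in total, the two terms are

$$ p(x_i | z_i = k, x_{-i}, \mu_0) = N(x_i | \mu_n, \sigma_n^2 + \sigma^2), \qquad \mu_n = \frac{\mu_0/\sigma_0^2 + \sum_{j: z_j = k} x_j/\sigma^2}{1/\sigma_0^2 + n_k/\sigma^2}, \qquad \sigma_n^2 = \frac{1}{1/\sigma_0^2 + n_k/\sigma^2} $$

$$ p(z_i = k | z_{-i}, \alpha) = \frac{n_k + \alpha}{N - 1 + K\alpha} $$

These are the quantities computed inside the sampling loop below, where hyper_mu, hyper_var, and known_var play the roles of $\mu_0$, $\sigma_0^2$, and $\sigma^2$.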


In [11]:
from scipy.stats import norm
from sklearn import metrics

num_iter = 100

# sampled cluster assignments, initialised to -1 (unassigned)
s_z = np.zeros(num_data) - 1
K = 5

hyper_mu = 0.       # prior mean mu_0
hyper_var = 1.      # prior variance sigma_0^2
known_var = 1.      # known cluster variance sigma^2

alpha = 1.          # Dirichlet concentration parameter

rand_index_score = np.zeros(num_iter)
trace = np.zeros([num_iter, K])

for it in xrange(num_iter):
    for i in xrange(num_data):
        # remove x_i from its current cluster before resampling z_i
        s_z[i] = -1
        
        prob = np.zeros(K)

        for k in xrange(K):
            sample_num = np.sum(s_z == k)
            sample_sum = np.sum(x[s_z == k])
            # normal-normal posterior over mu_k given the points currently in cluster k
            posterior_mean = (hyper_mu/hyper_var + sample_sum/known_var)/(1./hyper_var + sample_num/known_var)
            posterior_variance = 1./(1./hyper_var + sample_num/known_var)
            
            # posterior predictive: N(posterior_mean, posterior_variance + known_var);
            # scipy's norm takes the standard deviation as its scale parameter
            prob[k] = norm.logpdf(x[i], posterior_mean, np.sqrt(posterior_variance + known_var))
            prob[k] += np.log((sample_num+alpha)/(num_data-1+alpha*K))
            
        prob -= prob.max()
        prob = np.exp(prob)   
        prob /= np.sum(prob)
                
        new_k = np.random.multinomial(1, prob).argmax()
        
        s_z[i] = new_k
    
    rand_index_score[it] = metrics.adjusted_rand_score(z, s_z)  
    trace[it,:] = [np.sum(s_z == k) for k in xrange(K)]

In [12]:
# inferred mean and size of each cluster (cluster labels are arbitrary)
for k in xrange(K):
    print np.mean(x[s_z == k]), np.sum(s_z == k)


4.39330927614 21
-5.13704960516 21
-0.0275381820348 20
2.26235708061 20
-2.33804670059 18

In [13]:
# plot the rand index scores over iterations
plot(rand_index_score)


Out[13]:
[<matplotlib.lines.Line2D at 0x10e2e1bd0>]

In [14]:
# plot the number of assigned data points for each cluster over iterations
plot(trace)


Out[14]:
[<matplotlib.lines.Line2D at 0x11513b250>,
 <matplotlib.lines.Line2D at 0x11513b4d0>,
 <matplotlib.lines.Line2D at 0x11513b710>,
 <matplotlib.lines.Line2D at 0x11513b8d0>,
 <matplotlib.lines.Line2D at 0x11513ba90>]

Let's change the prior on each cluster from a normal prior on the mean (unknown mean, known variance) to a normal-inverse-gamma prior (unknown mean and unknown variance)
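
As a reference for the code below (standard conjugate-update results): with a normal-inverse-gamma prior $\text{NIG}(\mu_0, \kappa_0, a_0, b_0)$ on $(\mu_k, \sigma_k^2)$, the posterior parameters given the $n_k$ points currently assigned to cluster $k$ (with sample mean $\bar{x}$) are

$$ \kappa_n = \kappa_0 + n_k, \quad \mu_n = \frac{\kappa_0 \mu_0 + n_k \bar{x}}{\kappa_n}, \quad a_n = a_0 + \frac{n_k}{2}, \quad b_n = b_0 + \frac{1}{2}\sum_{j: z_j = k}(x_j - \bar{x})^2 + \frac{\kappa_0 n_k (\bar{x} - \mu_0)^2}{2\kappa_n} $$

and the posterior predictive is a Student-t distribution with $2a_n$ degrees of freedom, location $\mu_n$, and squared scale $b_n(\kappa_n + 1)/(a_n \kappa_n)$:

$$ p(x_i | z_i = k, x_{-i}) = t_{2a_n}\left(x_i \,\middle|\, \mu_n, \frac{b_n(\kappa_n + 1)}{a_n \kappa_n}\right) $$

In the code below, hyper_var plays the role of $\kappa_0$ (despite its name it is a pseudo-count, not a variance), and ig_alpha, ig_beta correspond to $a_0$ and $b_0$.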


In [15]:
# import the Student-t distribution (posterior predictive under the NIG prior)
from scipy.stats import t

num_iter = 100

# sampled cluster assignments, initialised to -1 (unassigned)
s_z = np.zeros(num_data) - 1
K = 5

hyper_mu = 0.       # prior mean mu_0
hyper_var = 1.      # plays the role of kappa_0, a pseudo-count on the mean (not a variance)

# inverse-gamma parameters
ig_alpha = 1.
ig_beta = 1.

alpha = 1.

rand_index_score = np.zeros(num_iter)
trace = np.zeros([num_iter, K])

for it in xrange(num_iter):
    for i in xrange(num_data):
        s_z[i] = -1
        
        prob = np.zeros(K)

        for k in xrange(K):
            sample_num = np.sum(s_z == k)
            sample_sum = np.sum(x[s_z == k])
            if sample_num != 0:
                sample_mean = np.mean(x[s_z == k])
            else:
                sample_mean = 0
            
            # NIG posterior updates: mu_n and kappa_n (posterior_var is a pseudo-count, not a variance)
            posterior_mean = (hyper_mu*hyper_var+sample_num*sample_mean)/(hyper_var+sample_num)
            posterior_var = hyper_var + sample_num
            
            posterior_alpha = ig_alpha + sample_num/2.0 
            posterior_beta = ig_beta + 0.5 * np.sum((x[s_z==k]-sample_mean)**2) + 0.5*(sample_num*hyper_var)/(hyper_var + sample_num)\
                        *((sample_mean-hyper_mu)**2)
                
            # posterior predictive follows a Student-t distribution;
            # scipy's t takes the scale (not the squared scale) as its scale parameter
            pred_scale2 = posterior_beta*(posterior_var+1)/(posterior_var*posterior_alpha)
            prob[k] = t.logpdf(x[i], df=2.0*posterior_alpha, loc=posterior_mean, scale=np.sqrt(pred_scale2))
            prob[k] += np.log((sample_num+alpha)/(num_data-1+alpha*K))
            
        prob -= prob.max()
        prob = np.exp(prob)
        prob /= np.sum(prob)
                
        new_k = np.random.multinomial(1, prob).argmax()
        
        s_z[i] = new_k
    
    rand_index_score[it] = metrics.adjusted_rand_score(z, s_z)  
    trace[it,:] = [np.sum(s_z == k) for k in xrange(K)]

In [16]:
# plot the rand index scores over iterations
plot(rand_index_score)


Out[16]:
[<matplotlib.lines.Line2D at 0x1151c6b90>]

In [17]:
# plot the number of assigned data points for each cluster over iterations
plot(trace)


Out[17]:
[<matplotlib.lines.Line2D at 0x1152b9e10>,
 <matplotlib.lines.Line2D at 0x1152bc0d0>,
 <matplotlib.lines.Line2D at 0x1152bc310>,
 <matplotlib.lines.Line2D at 0x1152bc4d0>,
 <matplotlib.lines.Line2D at 0x1152bc690>]

In [18]:
# inferred mean and size of each cluster under the normal-inverse-gamma prior
for k in xrange(K):
    print np.mean(x[s_z == k]), np.sum(s_z == k)


-0.508555963255 13
0.134044712948 44
0.391721675502 8
-0.253489394291 18
-0.639100806055 17
