The goal of this lecture is to give a basic introduction to the Dirichlet distribution and to collapsed Gibbs sampling for a Gaussian mixture model.
In [1]:
%matplotlib inline
Let's import the numpy and pyplot packages first.
In [2]:
import numpy as np
import matplotlib.pyplot as plt
The Dirichlet distribution is a distribution over multinomial parameters: a random sample $\vec\theta$ drawn from a $K$-dimensional Dirichlet can be used as the parameter of a multinomial distribution (i.e. $\sum_{i=1}^K \theta_i = 1$ and $\theta_i > 0$ for all $i$).
PDF of the asymmetric $K$-dimensional Dirichlet distribution (a different parameter for each dimension):
$$\text{Dir}(\vec\theta \mid \vec\alpha) = \frac{\Gamma(\sum_{i=1}^K \alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i-1}$$
PDF of the symmetric Dirichlet distribution (the same parameter for every dimension):
$$\text{Dir}(\vec\theta \mid \alpha) = \frac{\Gamma(K \alpha)}{\Gamma(\alpha)^K} \prod_{i=1}^K \theta_i^{\alpha-1}$$
Now, let's draw samples from the Dirichlet distribution with the numpy library.
In [3]:
# number of dimensions of Dirichlet distribution
ndim = 5
# number of samples drawn
nsample = 10
# parameter of dirichlet distribution
alpha = [0.1] * ndim
# drawing samples using numpy
theta = np.random.dirichlet(alpha, size=nsample)
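As a quick sanity check on the density formula above, we can evaluate the log-PDF of a symmetric Dirichlet both directly from the formula (via the log-gamma function) and with scipy.stats.dirichlet. This is an optional sketch, not part of the original notebook; the helper name and the test point are mine.

from scipy.stats import dirichlet
from scipy.special import gammaln

def symmetric_dirichlet_logpdf(theta, a):
    # log Dir(theta | a) = log Gamma(K*a) - K*log Gamma(a) + (a - 1) * sum_i log theta_i
    K = len(theta)
    return gammaln(K * a) - K * gammaln(a) + (a - 1) * np.sum(np.log(theta))

theta_test = np.ones(ndim) / ndim  # a point on the probability simplex
print(symmetric_dirichlet_logpdf(theta_test, 0.1))
print(dirichlet.logpdf(theta_test, [0.1] * ndim))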
In [4]:
# plot a few samples as bar charts
ndraws = 4
plt.figure(figsize=(12, 3 * ndraws // 2))
for i in range(ndraws):
    plt.subplot(ndraws // 2, 2, i + 1)
    plt.bar(range(ndim), theta[i])
Let's compute the mean of one million samples from Dir(0.1)
In [5]:
# increase the sample size
nsample = 1000000
theta = np.random.dirichlet(alpha, size=nsample)
np.mean(theta, 0)
Out[5]:
Now, let's compute the mean of one million samples from Dir(1)
In [6]:
alpha = [1]*ndim
nsample = 1000000
theta = np.random.dirichlet(alpha, size=nsample)
np.mean(theta, 0)
Out[6]:
In [7]:
ndraws = 4
plt.figure(figsize=(12, 3 * ndraws // 2))
for i in range(ndraws):
    plt.subplot(ndraws // 2, 2, i + 1)
    plt.bar(range(ndim), theta[i])
Let's compute the mean of one million samples from Dir(100)
In [8]:
alpha = [100]*ndim
nsample = 1000000
theta = np.random.dirichlet(alpha, size=nsample)
np.mean(theta, 0)
Out[8]:
In [9]:
ndraws = 4
plt.figure(figsize=(12, 3 * ndraws // 2))
for i in range(ndraws):
    plt.subplot(ndraws // 2, 2, i + 1)
    plt.bar(range(ndim), theta[i])
The averages of the samples from the different Dirichlet distributions are almost identical (the mean of a symmetric Dirichlet is $1/K$ in every dimension); however, as the parameter decreases, the samples become sparser, with most of the probability mass concentrated in a few dimensions.
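To make the sparsity claim concrete, here is a small sketch (not in the original notebook) that draws samples for several parameter values and reports the average size of the largest component: with a small parameter almost all of the mass sits in one dimension, while with a large parameter it spreads out evenly.

# average size of the largest component of a Dirichlet sample, for several parameter values
for a in [0.1, 1.0, 100.0]:
    samples = np.random.dirichlet([a] * ndim, size=100000)
    print('alpha =', a, ' mean of largest component =', samples.max(axis=1).mean())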
In this section, I will describe the collapsed Gibbs sampling algorithm for a Gaussian mixture model (GMM) with known variance (i.e., we assume the variance of each cluster is known).
Generative process of the GMM:
Priors: $$ \mu_k \sim N(\mu_0, \sigma_0) \\ \pi \sim \text{Dir}(\alpha) $$
For each data point $i$: $$ z_i \sim \text{Mult}(\pi) \\ x_i \sim N(\mu_{z_i}, \sigma) $$
Let's start by generating a data set of 100 samples from 5 clusters.
In [10]:
pi = np.array([0.2] * 5)    # mixing proportions
mu = [-5, -2.5, 0, 2.5, 5]  # true cluster means
sigma = 1.                  # known standard deviation of each cluster
K = 5
num_data = 100
x = np.zeros(num_data)
z = np.zeros(num_data, dtype=int)
for i in range(num_data):
    z[i] = np.random.multinomial(1, pi).argmax()  # sample a cluster assignment
    x[i] = np.random.normal(mu[z[i]], sigma)      # sample a data point from that cluster
for k in range(K):
    print('cluster', k, ' = ', np.mean(x[z == k]), ' num cases = ', np.sum(z == k))
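Before running the sampler it can help to look at the raw data; this quick histogram (an optional addition, not in the original notebook) shows how much the five clusters overlap.

# histogram of the generated mixture data
plt.hist(x, bins=50)
plt.xlabel('x')
plt.ylabel('count')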
In collapsed Gibbs sampling we integrate out the cluster means $\mu_k$ and the mixing proportions $\pi$, and resample each assignment from
$$P(z_i = k \mid \vec z_{-i}, \vec x) \;\propto\; p(x_i \mid \vec x_{-i,k}) \; P(z_i = k \mid \vec z_{-i}).$$
The first term on the RHS is the conjugate posterior predictive distribution arising from normal-normal conjugacy, and the second term is the conjugate posterior predictive distribution arising from Dirichlet-multinomial conjugacy.
In [11]:
from scipy.stats import norm
from sklearn import metrics
num_iter = 100
# sampled cluster assignments, -1 means "not assigned yet"
s_z = np.zeros(num_data, dtype=int) - 1
K = 5
hyper_mu = 0.    # prior mean of the cluster means
hyper_var = 1.   # prior variance of the cluster means
known_var = 1.   # known observation variance of each cluster
alpha = 1.       # symmetric Dirichlet parameter over the mixing proportions
rand_index_score = np.zeros(num_iter)
trace = np.zeros([num_iter, K])
for it in range(num_iter):
    for i in range(num_data):
        # remove data point i from its current cluster
        s_z[i] = -1
        prob = np.zeros(K)
        for k in range(K):
            sample_num = np.sum(s_z == k)
            sample_sum = np.sum(x[s_z == k])
            # posterior over the mean of cluster k given the points currently assigned to it
            posterior_mean = (hyper_mu/hyper_var + sample_sum/known_var)/(1./hyper_var + sample_num/known_var)
            posterior_variance = 1./(1./hyper_var + sample_num/known_var)
            # posterior predictive of x[i] under normal-normal conjugacy is
            # N(posterior_mean, posterior_variance + known_var); scipy's scale is a standard deviation
            # equivalent non-log form: prob[k] = (sample_num+alpha) * norm.pdf(x[i], posterior_mean, np.sqrt(posterior_variance + known_var))
            prob[k] = norm.logpdf(x[i], posterior_mean, np.sqrt(posterior_variance + known_var))
            # Dirichlet-multinomial predictive term for the assignment
            prob[k] += np.log((sample_num + alpha)/(num_data - 1 + alpha*K))
        # normalize in log space, then sample a new assignment
        prob -= prob.max()
        prob = np.exp(prob)
        prob /= np.sum(prob)
        new_k = np.random.multinomial(1, prob).argmax()
        s_z[i] = new_k
    rand_index_score[it] = metrics.adjusted_rand_score(z, s_z)
    trace[it, :] = [np.sum(s_z == k) for k in range(K)]
In [12]:
for k in range(K):
    print(np.mean(x[s_z == k]), np.sum(s_z == k))
In [13]:
# plot the adjusted Rand index over iterations
plt.plot(rand_index_score)
Out[13]:
In [14]:
# plot the number of assigned data points for each cluster over iterations
plt.plot(trace)
Out[14]:
Let's change the prior on the cluster parameters from a normal prior on the mean (unknown mean, known variance) to a normal-inverse-gamma prior (unknown mean and unknown variance).
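For reference, with a normal-inverse-gamma prior $\text{NIG}(\mu_0, \lambda_0, \alpha_0, \beta_0)$ (hyper_mu, hyper_var, ig_alpha and ig_beta in the code below, where hyper_var plays the role of the pseudo-count $\lambda_0$), the standard conjugate updates and posterior predictive that the next cell implements are
$$\mu_n = \frac{\lambda_0\mu_0 + n\bar x}{\lambda_0 + n},\quad \lambda_n = \lambda_0 + n,\quad \alpha_n = \alpha_0 + \frac{n}{2},\quad \beta_n = \beta_0 + \frac12\sum_{i}(x_i-\bar x)^2 + \frac{\lambda_0 n(\bar x - \mu_0)^2}{2(\lambda_0+n)}$$
$$x_{\text{new}} \mid \vec x_k \sim t_{2\alpha_n}\!\left(\mu_n,\ \frac{\beta_n(\lambda_n+1)}{\alpha_n\lambda_n}\right)$$
where the second argument of the Student-t is its squared scale.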
In [15]:
#import student-t distribution
from scipy.stats import t
num_iter = 100
s_z = np.zeros(num_data, dtype=int) - 1
K = 5
hyper_mu = 0.
# despite the name, hyper_var acts as the pseudo-count lambda_0 of the normal-inverse-gamma prior
hyper_var = 1.
# inverse-gamma parameters
ig_alpha = 1.
ig_beta = 1.
alpha = 1.
rand_index_score = np.zeros(num_iter)
trace = np.zeros([num_iter, K])
for it in range(num_iter):
    for i in range(num_data):
        # remove data point i from its current cluster
        s_z[i] = -1
        prob = np.zeros(K)
        for k in range(K):
            sample_num = np.sum(s_z == k)
            sample_sum = np.sum(x[s_z == k])
            if sample_num != 0:
                sample_mean = np.mean(x[s_z == k])
            else:
                sample_mean = 0
            # normal-inverse-gamma posterior parameters for cluster k
            posterior_mean = (hyper_mu*hyper_var + sample_num*sample_mean)/(hyper_var + sample_num)
            posterior_var = hyper_var + sample_num
            posterior_alpha = ig_alpha + sample_num/2.0
            posterior_beta = (ig_beta + 0.5*np.sum((x[s_z == k] - sample_mean)**2)
                              + 0.5*(sample_num*hyper_var)/(hyper_var + sample_num)*((sample_mean - hyper_mu)**2))
            # the posterior predictive follows a student-t distribution;
            # scipy's scale is the (non-squared) scale parameter, hence the square root
            predictive_scale2 = posterior_beta*(posterior_var + 1)/(posterior_var*posterior_alpha)
            prob[k] = t.logpdf(x[i], df=2.0*posterior_alpha, loc=posterior_mean, scale=np.sqrt(predictive_scale2))
            # Dirichlet-multinomial predictive term for the assignment
            prob[k] += np.log((sample_num + alpha)/(num_data - 1 + alpha*K))
        # normalize in log space, then sample a new assignment
        prob -= prob.max()
        prob = np.exp(prob)
        prob /= np.sum(prob)
        new_k = np.random.multinomial(1, prob).argmax()
        s_z[i] = new_k
    rand_index_score[it] = metrics.adjusted_rand_score(z, s_z)
    trace[it, :] = [np.sum(s_z == k) for k in range(K)]
In [16]:
# plot the adjusted Rand index over iterations
plt.plot(rand_index_score)
Out[16]:
In [17]:
# plot the number of assigned data points for each cluster over iterations
plt.plot(trace)
Out[17]:
In [18]:
for k in range(K):
    print(np.mean(x[s_z == k]), np.sum(s_z == k))