First things first

Click File -> Save a copy in Drive and click Open in new tab in the pop-up window to save your progress in Google Drive.

Expectation-maximization algorithm

In this assignment, we will derive and implement formulas for Gaussian Mixture Model — one of the most commonly used methods for performing soft clustering of the data.

Setup

Loading auxiliary files and importing the necessary libraries.


In [1]:
try:
    import google.colab
    IN_COLAB = True
except:
    IN_COLAB = False
if IN_COLAB:
    print("Downloading Colab files")
    ! shred -u setup_google_colab.py
    ! wget https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/setup_google_colab.py -O setup_google_colab.py
    import setup_google_colab
    setup_google_colab.load_data_week2()


Downloading Colab files
--2020-05-31 09:07:06--  https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/setup_google_colab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1232 (1.2K) [text/plain]
Saving to: ‘setup_google_colab.py’

setup_google_colab. 100%[===================>]   1.20K  --.-KB/s    in 0s      

2020-05-31 09:07:06 (73.8 MB/s) - ‘setup_google_colab.py’ saved [1232/1232]

https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/week2/w2_grader.py w2_grader.py
https://raw.githubusercontent.com/hse-aml/bayesian-methods-for-ml/master/week2/samples.npz samples.npz

In [0]:
import numpy as np
from numpy.linalg import slogdet, det, solve
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt
import time
from sklearn.datasets import load_digits
from w2_grader import EMGrader
%matplotlib inline

Grading

We will create a grader instance below and use it to collect your answers. Note that these outputs will be stored locally inside grader and will be uploaded to the platform only after running submitting function in the last part of this assignment. If you want to make a partial submission, you can run that cell anytime you want.


In [0]:
grader = EMGrader()

Implementing EM for GMM

Indented block

For debugging, we will use samples from a Gaussian mixture model with unknown mean, variance, and priors. We also added initial values of parameters for grading purposes.


In [4]:
samples = np.load('samples.npz')
X = samples['data']
pi0 = samples['pi0']
mu0 = samples['mu0']
sigma0 = samples['sigma0']
plt.scatter(X[:, 0], X[:, 1], c='grey', s=30)
plt.axis('equal')
plt.show()


Reminder

Remember, that EM algorithm is a coordinate descent optimization of variational lower bound $\mathcal{L}(\theta, q) = \int q(T) \log\frac{p(X, T|\theta)}{q(T)}dT\to \max$.

E-step:
$\mathcal{L}(\theta, q) \to \max\limits_{q} \Leftrightarrow \mathcal{KL} [q(T) \,\|\, p(T|X, \theta)] \to \min \limits_{q\in Q} \Rightarrow q(T) = p(T|X, \theta)$
M-step:
$\mathcal{L}(\theta, q) \to \max\limits_{\theta} \Leftrightarrow \mathbb{E}_{q(T)}\log p(X,T | \theta) \to \max\limits_{\theta}$

For GMM, $\theta$ is a set of parameters that consists of mean vectors $\mu_c$, covariance matrices $\Sigma_c$ and priors $\pi_c$ for each component.

Latent variables $T$ are indices of components to which each data point is assigned, i.e. $t_i$ is the cluster index for object $x_i$.

The joint distribution can be written as follows: $\log p(T, X \mid \theta) = \sum\limits_{i=1}^N \log p(t_i, x_i \mid \theta) = \sum\limits_{i=1}^N \sum\limits_{c=1}^C q(t_i = c) \log \left (\pi_c \, f_{\!\mathcal{N}}(x_i \mid \mu_c, \Sigma_c)\right)$, where $f_{\!\mathcal{N}}(x \mid \mu_c, \Sigma_c) = \frac{1}{\sqrt{(2\pi)^n|\boldsymbol\Sigma_c|}} \exp\left(-\frac{1}{2}({x}-{\mu_c})^T{\boldsymbol\Sigma_c}^{-1}({x}-{\mu_c}) \right)$ is the probability density function (pdf) of the normal distribution $\mathcal{N}(x_i \mid \mu_c, \Sigma_c)$.

E-step

In this step we need to estimate the posterior distribution over the latent variables with fixed values of parameters: $q_i(t_i) = p(t_i \mid x_i, \theta)$. We assume that $t_i$ equals to the cluster index of the true component of the $x_i$ object. To do so we need to compute $\gamma_{ic} = p(t_i = c \mid x_i, \theta)$. Note that $\sum\limits_{c=1}^C\gamma_{ic}=1$.

Important trick 1: It is important to avoid numerical errors. At some point you will have to compute the formula of the following form: $\frac{e^{y_i}}{\sum_j e^{y_j}}$, which is called softmax. When you compute exponents of large numbers, some numbers may become infinity. You can avoid this by dividing numerator and denominator by $e^{\max(y)}$: $\frac{e^{y_i-\max(y)}}{\sum_j e^{y_j - \max(y)}}$. After this transformation maximum value in the denominator will be equal to one. All other terms will contribute smaller values. So, to compute desired formula you first subtract maximum value from each component in vector $\mathbf{y}$ and then compute everything else as before.

Important trick 2: You will probably need to compute formula of the form $A^{-1}x$ at some point. You would normally inverse $A$ and then multiply it by $x$. A bit faster and more numerically accurate way to do this is to directly solve equation $Ay = x$ by using a special function. Its solution is $y=A^{-1}x$, but the equation $Ay = x$ can be solved by methods which do not explicitely invert the matrix. You can use np.linalg.solve for this.

Other usefull functions: ```slogdet``` and ```det```

Task 1: Implement E-step for GMM using template below.


In [0]:
def E_step(X, pi, mu, sigma):
    """
    Performs E-step on GMM model
    Each input is numpy array:
    X: (N x d), data points
    pi: (C), mixture component weights 
    mu: (C x d), mixture component means
    sigma: (C x d x d), mixture component covariance matrices
    
    Returns:
    gamma: (N x C), probabilities of clusters for objects
    """
    N = X.shape[0] # number of objects
    C = pi.shape[0] # number of clusters
    d = mu.shape[1] # dimension of each object
    gamma = np.zeros((N, C)) # distribution q(T)


    ### YOUR CODE HERE
    for c in range(C):
      gamma[:, c] = 0.5*np.diag(np.dot((X-mu[c,:]), np.linalg.solve(sigma[c,:,:], (X-mu[c,:]).T)))

    gamma_max = np.amax(gamma, axis=1, keepdims=True)
    gamma = gamma-gamma_max
    sigma_det = np.linalg.det(sigma)
    gamma = pi * np.exp(-gamma) * np.power(np.power(2*np.pi, N)*sigma_det, -0.5)
    gamma = gamma/np.sum(gamma, axis=1, keepdims=True)
    return gamma

In [6]:
gamma = E_step(X, pi0, mu0, sigma0)
grader.submit_e_step(gamma)


Current answer for task Task 1 (E-step) is: 0.5337178741081262

M-step

In M-step we need to maximize $\mathbb{E}_{q(T)}\log p(X,T | \theta)$ with respect to $\theta$. In our model this means that we need to find optimal values of $\pi$, $\mu$, $\Sigma$. To do so, you need to compute the derivatives and set them to zero. You should start by deriving formulas for $\mu$ as it is the easiest part. Then move on to $\Sigma$. Here it is crucial to optimize function w.r.t. to $\Lambda = \Sigma^{-1}$ and then inverse obtained result. Finaly, to compute $\pi$, you will need Lagrange Multipliers technique to satisfy constraint $\sum\limits_{i=1}^{n}\pi_i = 1$.


Important note: You will need to compute derivatives of scalars with respect to matrices. To refresh this technique from previous courses, see wiki article about it . Main formulas of matrix derivatives can be found in Chapter 2 of The Matrix Cookbook. For example, there you may find that $\frac{\partial}{\partial A}\log |A| = A^{-T}$.

Task 2: Implement M-step for GMM using template below.


In [0]:
def M_step(X, gamma):
    """
    Performs M-step on GMM model
    Each input is numpy array:
    X: (N x d), data points
    gamma: (N x C), distribution q(T)  
    
    Returns:
    pi: (C)
    mu: (C x d)
    sigma: (C x d x d)
    """
    N = X.shape[0] # number of objects
    C = gamma.shape[1] # number of clusters
    d = X.shape[1] # dimension of each object

    pi = np.zeros(C)
    mu = np.zeros((C,d))
    sigma = np.zeros((C,d,d))


    pi = gamma.mean(axis=0)
    mu = np.dot(gamma.T,X)/gamma.sum(axis=0)[:, np.newaxis]

    for c in range(C):
      sigma[c,:,:] = np.dot((X-mu[c]).T, gamma[:,c].reshape(N,1) * (X-mu[c]))/gamma.sum(axis=0)[c]

    return pi, mu, sigma

In [8]:
gamma = E_step(X, pi0, mu0, sigma0)
pi, mu, sigma = M_step(X, gamma)
grader.submit_m_step(pi, mu, sigma)


Current answer for task Task 2 (M-step: mu) is: 2.899391882050383
Current answer for task Task 2 (M-step: sigma) is: 5.9771052168975265
Current answer for task Task 2 (M-step: pi) is: 0.5507624459218775

Loss function

Finally, we need some function to track convergence. We will use variational lower bound $\mathcal{L}$ for this purpose. We will stop our EM iterations when $\mathcal{L}$ will saturate. Usually, you will need only about 10-20 iterations to converge. It is also useful to check that this function never decreases during training. If it does, you have a bug in your code.

Task 3: Implement a function that will compute $\mathcal{L}$ using template below.

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{c=1}^{C} q(t_i =c) (\log \pi_c + \log f_{\!\mathcal{N}}(x_i \mid \mu_c, \Sigma_c)) - \sum_{i=1}^{N} \sum_{c=1}^{K} q(t_i =c) \log q(t_i =c)$$

In [0]:
def compute_vlb(X, pi, mu, sigma, gamma):
    """
    Each input is numpy array:
    X: (N x d), data points
    gamma: (N x C), distribution q(T)  
    pi: (C)
    mu: (C x d)
    sigma: (C x d x d)
    
    Returns value of variational lower bound
    """
    N = X.shape[0] # number of objects
    C = gamma.shape[1] # number of clusters
    d = X.shape[1] # dimension of each object

    ### YOUR CODE HERE
    loss = np.zeros(N)
    for c in range(C):
      mvn = multivariate_normal(mu[c, :], sigma[c, :, :], allow_singular=True)
      loss += ((np.log(pi[c]) + mvn.logpdf(X)) -  np.log(gamma[:,c]))*gamma[:,c]
      
    loss = np.sum(loss)
    return loss

In [10]:
pi, mu, sigma = pi0, mu0, sigma0
gamma = E_step(X, pi, mu, sigma)
pi, mu, sigma = M_step(X, gamma)
loss = compute_vlb(X, pi, mu, sigma, gamma)
grader.submit_VLB(loss)


Current answer for task Task 3 (VLB) is: -1213.9734643060183

Bringing it all together

Now that we have E step, M step and VLB, we can implement the training loop. We will initialize values of $\pi$, $\mu$ and $\Sigma$ to some random numbers, train until $\mathcal{L}$ stops changing, and return the resulting points. We also know that the EM algorithm converges to local optima. To find a better local optima, we will restart the algorithm multiple times from different (random) starting positions. Each training trial should stop either when maximum number of iterations is reached or when relative improvement is smaller than given tolerance ($|\frac{\mathcal{L}_i-\mathcal{L}_{i-1}}{\mathcal{L}_{i-1}}| \le \text{rtol}$).

Remember, that initial (random) values of $\pi$ that you generate must be non-negative and sum up to 1. Also, $\Sigma$ matrices must be symmetric and positive semi-definite. If you don't know how to generate those matrices, you can use $\Sigma=I$ as initialization.

You will also sometimes get numerical errors because of component collapsing. The easiest way to deal with this problems is to restart the procedure.

Task 4: Implement training procedure


In [0]:
def train_EM(X, C, rtol=1e-3, max_iter=100, restarts=10):
    '''
    Starts with random initialization *restarts* times
    Runs optimization until saturation with *rtol* reached
    or *max_iter* iterations were made.
    
    X: (N, d), data points
    C: int, number of clusters
    '''
    N = X.shape[0] # number of objects
    d = X.shape[1] # dimension of each object
    best_loss = None
    best_pi = None
    best_mu = None
    best_sigma = None

    for _ in range(restarts):
        try:
            ### YOUR CODE HERE
            # X: (N x d), data points
            # gamma: (N x C), distribution q(T)  
            # pi: (C)
            #  mu: (C x d)
            #    sigma: (C x d x d)
            pi = np.array([1/C]*C)
            mu = np.random.randn(C,d)
            #sigmas = np.eye(d)
            #sigma = np.array([sigmas]*C)
            sigma = np.zeros((C, d, d))
            for c in range(C):
                sigma[c] = np.eye(d) * np.random.uniform(1, C)
            for i in range(max_iter):
              gamma = E_step(X, pi, mu, sigma)
              pi, mu, sigma = M_step(X, gamma)
              loss = compute_vlb(X, pi, mu, sigma, gamma)
              if best_loss is None or loss<best_loss:
                best_loss = loss
                best_pi = pi
                best_mu = mu
                best_sigma = sigma
            
        except np.linalg.LinAlgError:
            print("Singular matrix: components collapsed")
            pass

    return best_loss, best_pi, best_mu, best_sigma

In [12]:
best_loss, best_pi, best_mu, best_sigma = train_EM(X, 3)
grader.submit_EM(best_loss)


Current answer for task Task 4 (EM) is: -1232.0527804127278

If you implemented all the steps correctly, your algorithm should converge in about 20 iterations. Let's plot the clusters to see it. We will assign a cluster label as the most probable cluster index. This can be found using a matrix $\gamma$ computed on last E-step.


In [13]:
gamma = E_step(X, best_pi, best_mu, best_sigma)
labels = gamma.argmax(axis=1)
colors = np.array([(31, 119, 180), (255, 127, 14), (44, 160, 44)]) / 255.
plt.scatter(X[:, 0], X[:, 1], c=colors[labels], s=30)
plt.axis('equal')
plt.show()


Authorization & Submission

To submit assignment parts to Cousera platform, please, enter your e-mail and token into variables below. You can generate a token on this programming assignment's page. Note: The token expires 30 minutes after generation.


In [14]:
STUDENT_EMAIL = ""
STUDENT_TOKEN = ""
grader.status()


You want to submit these numbers:
Task Task 1 (E-step): 0.5337178741081262
Task Task 2 (M-step: mu): 2.899391882050383
Task Task 2 (M-step: sigma): 5.9771052168975265
Task Task 2 (M-step: pi): 0.5507624459218775
Task Task 3 (VLB): -1213.9734643060183
Task Task 4 (EM): -1232.0527804127278

If you want to submit these answers, run cell below


In [15]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)


You used an invalid email or your token may have expired. Please make sure you have entered all fields correctly. Try generating a new token if the issue still persists.