**Note to Amazon EC2 users**: To conserve memory, make sure to stop all the other notebooks before running this notebook.

In [1]:

```
import graphlab as gl
import numpy as np
import matplotlib.pyplot as plt
import copy
from scipy.stats import multivariate_normal
%matplotlib inline

# Check GraphLab Create version
from distutils.version import StrictVersion
assert (StrictVersion(gl.version) >= StrictVersion('1.8.5')), 'GraphLab Create must be version 1.8.5 or later.'
```

In this section, you will implement the EM algorithm. We will take the following steps:

- Provide a log likelihood function for this model.
- Implement the EM algorithm.
- Create some synthetic data.
- Visualize the progress of the parameters during the course of running EM.
- Visualize the convergence of the model.

We provide a function to calculate the log likelihood for a mixture of Gaussians. The log likelihood quantifies the probability of observing a given set of data under a particular setting of the parameters in our model. We will use this to assess convergence of our EM algorithm; specifically, we will keep looping through EM update steps until the log likelihood ceases to increase by more than a small threshold.

In [2]:

```
def log_sum_exp(Z):
    """ Compute log(\sum_i exp(Z_i)) for some array Z."""
    return np.max(Z) + np.log(np.sum(np.exp(Z - np.max(Z))))

def loglikelihood(data, weights, means, covs):
    """ Compute the log likelihood of the data for a Gaussian mixture model with the given parameters. """
    num_clusters = len(means)
    num_dim = len(data[0])
    ll = 0
    for d in data:
        Z = np.zeros(num_clusters)
        for k in range(num_clusters):
            # Compute (x-mu)^T * Sigma^{-1} * (x-mu)
            delta = np.array(d) - means[k]
            exponent_term = np.dot(delta.T, np.dot(np.linalg.inv(covs[k]), delta))

            # Compute loglikelihood contribution for this data point and this cluster
            Z[k] += np.log(weights[k])
            Z[k] -= 1/2. * (num_dim * np.log(2*np.pi) + np.log(np.linalg.det(covs[k])) + exponent_term)

        # Increment loglikelihood contribution of this data point across all clusters
        ll += log_sum_exp(Z)
    return ll
```
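The `log_sum_exp` helper is what keeps this computation numerically stable: subtracting the maximum before exponentiating prevents underflow when the per-cluster terms $Z_k$ are very negative. A minimal self-contained sketch of the trick (independent of the notebook's data):

```python
import numpy as np

def log_sum_exp(Z):
    """Compute log(sum_i exp(Z_i)) stably by factoring out the max."""
    return np.max(Z) + np.log(np.sum(np.exp(Z - np.max(Z))))

Z = np.array([-1000., -1001.])
with np.errstate(divide='ignore'):
    naive = np.log(np.sum(np.exp(Z)))  # exp(-1000) underflows to 0, so this is -inf
stable = log_sum_exp(Z)                # approx -999.687, the correct value
```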

The first step in the EM algorithm is to compute cluster responsibilities. Let $r_{ik}$ denote the responsibility of cluster $k$ for data point $i$. Note that cluster responsibilities are soft assignments: the responsibilities for a single data point $i$ must sum to 1. $$ r_{i1} + r_{i2} + \ldots + r_{iK} = 1 $$

To figure out how much a cluster is responsible for a given data point, we compute the likelihood of the data point under the particular cluster assignment, multiplied by the weight of the cluster. For data point $i$ and cluster $k$, this quantity is $$ r_{ik} \propto \pi_k N(x_i | \mu_k, \Sigma_k) $$ where $N(x_i | \mu_k, \Sigma_k)$ is the Gaussian density for cluster $k$ (with mean $\mu_k$ and covariance $\Sigma_k$).

We used $\propto$ because the quantity $\pi_k N(x_i | \mu_k, \Sigma_k)$ is not yet the responsibility we want. To ensure that the responsibilities for each data point add up to 1, we divide by the normalization constant in the denominator: $$ r_{ik} = \frac{\pi_k N(x_i | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_i | \mu_j, \Sigma_j)}. $$

Complete the following function that computes $r_{ik}$ for all data points $i$ and clusters $k$.

**Evaluating a Gaussian density.** SciPy provides a convenient function, `multivariate_normal.pdf`, that computes the density of a multivariate Gaussian distribution at a given data point. The usage is

`multivariate_normal.pdf([data point], mean=[mean vector], cov=[covariance matrix])`
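As a quick sanity check on this usage: the density of a 2-D standard normal evaluated at the origin is $\frac{1}{2\pi} \approx 0.159$.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Density of a 2-D standard normal evaluated at the origin equals 1/(2*pi)
p = multivariate_normal.pdf([0., 0.], mean=[0., 0.], cov=[[1., 0.], [0., 1.]])
# p is approximately 0.15915
```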

In [3]:

```
def compute_responsibilities(data, weights, means, covariances):
    '''E-step: compute responsibilities, given the current parameters'''
    num_data = len(data)
    num_clusters = len(means)
    resp = np.zeros((num_data, num_clusters))

    # Update resp matrix so that resp[i,k] is the responsibility of cluster k for data point i.
    # Hint: To compute likelihood of seeing data point i given cluster k, use multivariate_normal.pdf.
    for i in range(num_data):
        for k in range(num_clusters):
            dist = multivariate_normal.pdf(data[i], means[k], covariances[k])
            resp[i, k] = weights[k] * dist

    # Add up responsibilities over each data point and normalize
    row_sums = resp.sum(axis=1)[:, np.newaxis]
    resp = resp / row_sums
    return resp
```

**Checkpoint**.

In [4]:

```
resp = compute_responsibilities(data=np.array([[1.,2.],[-1.,-2.]]), weights=np.array([0.3, 0.7]),
                                means=[np.array([0.,0.]), np.array([1.,1.])],
                                covariances=[np.array([[1.5, 0.],[0.,2.5]]), np.array([[1.,1.],[1.,2.]])])

if resp.shape==(2,2) and np.allclose(resp, np.array([[0.10512733, 0.89487267], [0.46468164, 0.53531836]])):
    print 'Checkpoint passed!'
else:
    print 'Check your code again.'
```

Once the cluster responsibilities are computed, we update the parameters (weights, means, and covariances) associated with the clusters.

**Computing soft counts**. Before updating the parameters, we first compute what is known as "soft counts". The soft count of a cluster is the sum of all cluster responsibilities for that cluster:
$$
N^{\text{soft}}_k = r_{1k} + r_{2k} + \ldots + r_{Nk} = \sum_{i=1}^{N} r_{ik}
$$

where we loop over data points. Note that, unlike k-means, we must loop over every single data point in the dataset. This is because all clusters are represented in all data points, to a varying degree.

We provide the function for computing the soft counts:

In [5]:

```
def compute_soft_counts(resp):
    # Compute the total responsibility assigned to each cluster, which will be useful when
    # implementing M-steps below. In the lectures this is called N^{soft}.
    counts = np.sum(resp, axis=0)
    return counts
```

**Updating weights.** The cluster weights show us how much each cluster is represented over all data points. The weight of cluster $k$ is given by the ratio of the soft count $N^{\text{soft}}_{k}$ to the total number of data points $N$:
$$
\hat{\pi}_k = \frac{N^{\text{soft}}_{k}}{N}
$$
Notice that $N$ is equal to the sum over the soft counts $N^{\text{soft}}_{k}$ of all clusters.

Complete the following function:

In [6]:

```
def compute_weights(counts):
    num_clusters = len(counts)
    weights = [0.] * num_clusters
    num_data = counts.sum()
    for k in range(num_clusters):
        # Update the weight for cluster k using the M-step update rule for the cluster weight, \hat{\pi}_k.
        # HINT: compute # of data points by summing soft counts.
        weights[k] = counts[k] / num_data
    return weights
```

**Checkpoint**.

In [7]:

```
resp = compute_responsibilities(data=np.array([[1.,2.],[-1.,-2.],[0,0]]), weights=np.array([0.3, 0.7]),
                                means=[np.array([0.,0.]), np.array([1.,1.])],
                                covariances=[np.array([[1.5, 0.],[0.,2.5]]), np.array([[1.,1.],[1.,2.]])])
counts = compute_soft_counts(resp)
weights = compute_weights(counts)

print counts
print weights

if np.allclose(weights, [0.27904865942515705, 0.720951340574843]):
    print 'Checkpoint passed!'
else:
    print 'Check your code again.'
```

**Updating means**. The mean of each cluster is set to the weighted average of all data points, weighted by the cluster responsibilities:
$$
\hat{\mu}_k = \frac{1}{N_k^{\text{soft}}} \sum_{i=1}^N r_{ik}x_i
$$

Complete the following function:

In [8]:

```
def compute_means(data, resp, counts):
    num_clusters = len(counts)
    num_data = len(data)
    means = [np.zeros(len(data[0]))] * num_clusters
    for k in range(num_clusters):
        # Update means for cluster k using the M-step update rule for the mean variables.
        # This will assign the variable means[k] to be our estimate for \hat{\mu}_k.
        weighted_sum = 0.
        for i in range(num_data):
            weighted_sum += resp[i, k] * data[i]
        means[k] = weighted_sum / counts[k]
    return means
```

**Checkpoint**.

In [9]:

```
data_tmp = np.array([[1.,2.],[-1.,-2.]])
resp = compute_responsibilities(data=data_tmp, weights=np.array([0.3, 0.7]),
                                means=[np.array([0.,0.]), np.array([1.,1.])],
                                covariances=[np.array([[1.5, 0.],[0.,2.5]]), np.array([[1.,1.],[1.,2.]])])
counts = compute_soft_counts(resp)
means = compute_means(data_tmp, resp, counts)

if np.allclose(means, np.array([[-0.6310085, -1.262017], [0.25140299, 0.50280599]])):
    print 'Checkpoint passed!'
else:
    print 'Check your code again.'
```

**Updating covariances**. The covariance of each cluster is set to the weighted average of all outer products, weighted by the cluster responsibilities:
$$
\hat{\Sigma}_k = \frac{1}{N^{\text{soft}}_k}\sum_{i=1}^N r_{ik} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T
$$

The "outer product" in this context refers to the matrix product $$ (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T. $$ Letting $(x_i - \hat{\mu}_k)$ be a $d \times 1$ column vector, this product is a $d \times d$ matrix. Taking the weighted average of all outer products gives us the covariance matrix, which is also $d \times d$.
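As a quick illustration of how `np.outer` computes this product (a generic sketch, not part of the assignment code):

```python
import numpy as np

v = np.array([1., 2.])   # stands in for the deviation (x_i - mu_k), with d = 2
P = np.outer(v, v)       # the d x d matrix v v^T
# P == [[1., 2.],
#       [2., 4.]]
```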

Complete the following function:

In [10]:

```
def compute_covariances(data, resp, counts, means):
    num_clusters = len(counts)
    num_dim = len(data[0])
    num_data = len(data)
    covariances = [np.zeros((num_dim,num_dim))] * num_clusters
    for k in range(num_clusters):
        # Update covariances for cluster k using the M-step update rule for covariance variables.
        # This will assign the variable covariances[k] to be the estimate for \hat{\Sigma}_k.
        weighted_sum = np.zeros((num_dim, num_dim))
        for i in range(num_data):
            # Hint: Use np.outer on the deviation of data[i] from this cluster's mean.
            diff = data[i] - means[k]
            weighted_sum += resp[i, k] * np.outer(diff, diff)
        covariances[k] = weighted_sum / counts[k]
    return covariances
```

**Checkpoint**.

In [11]:

```
data_tmp = np.array([[1.,2.],[-1.,-2.]])
resp = compute_responsibilities(data=data_tmp, weights=np.array([0.3, 0.7]),
                                means=[np.array([0.,0.]), np.array([1.,1.])],
                                covariances=[np.array([[1.5, 0.],[0.,2.5]]), np.array([[1.,1.],[1.,2.]])])
counts = compute_soft_counts(resp)
means = compute_means(data_tmp, resp, counts)
covariances = compute_covariances(data_tmp, resp, counts, means)

if np.allclose(covariances[0], np.array([[0.60182827, 1.20365655], [1.20365655, 2.4073131]])) and \
   np.allclose(covariances[1], np.array([[ 0.93679654, 1.87359307], [1.87359307, 3.74718614]])):
    print 'Checkpoint passed!'
else:
    print 'Check your code again.'
```
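The loops in the covariance update mirror the equation directly. As an optional aside, the same update can be vectorized over data points with NumPy matrix products; this is a sketch of an equivalent alternative, not required by the assignment:

```python
import numpy as np

def compute_covariances_vectorized(data, resp, counts, means):
    """For each cluster k: Sigma_k = (1/N_k^soft) * sum_i r_ik * outer(x_i - mu_k, x_i - mu_k)."""
    data = np.asarray(data)
    covariances = []
    for k in range(len(counts)):
        diff = data - means[k]                       # (N, d) deviations from mu_k
        weighted = resp[:, k][:, np.newaxis] * diff  # scale each row by r_ik
        covariances.append(weighted.T.dot(diff) / counts[k])
    return covariances
```

With a single cluster and all responsibilities equal to 1, this reduces to the (biased) sample covariance.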

In [12]:

```
# SOLUTION
def EM(data, init_means, init_covariances, init_weights, maxiter=1000, thresh=1e-4):
    # Make copies of initial parameters, which we will update during each iteration
    means = init_means[:]
    covariances = init_covariances[:]
    weights = init_weights[:]

    # Infer dimensions of dataset and the number of clusters
    num_data = len(data)
    num_dim = len(data[0])
    num_clusters = len(means)

    # Initialize some useful variables
    resp = np.zeros((num_data, num_clusters))
    ll = loglikelihood(data, weights, means, covariances)
    ll_trace = [ll]

    for it in range(maxiter):
        if it % 5 == 0:
            print("Iteration %s" % it)

        # E-step: compute responsibilities
        resp = compute_responsibilities(data, weights, means, covariances)

        # M-step
        # Compute the total responsibility assigned to each cluster (N^{soft} in the lectures)
        counts = compute_soft_counts(resp)

        # Update the weight for each cluster, \hat{\pi}_k
        weights = compute_weights(counts)

        # Update the mean for each cluster, \hat{\mu}_k
        means = compute_means(data, resp, counts)

        # Update the covariance for each cluster, \hat{\Sigma}_k
        covariances = compute_covariances(data, resp, counts, means)

        # Compute the loglikelihood at this iteration
        ll_latest = loglikelihood(data, weights, means, covariances)
        ll_trace.append(ll_latest)

        # Check for convergence in log likelihood and store
        if (ll_latest - ll) < thresh and ll_latest > -np.inf:
            break
        ll = ll_latest

    if it % 5 != 0:
        print("Iteration %s" % it)

    out = {'weights': weights, 'means': means, 'covs': covariances, 'loglik': ll_trace, 'resp': resp}
    return out
```

To help us develop and test our implementation, we will generate some observations from a mixture of Gaussians and then run our EM algorithm to discover the mixture components. We'll begin with a function to generate the data, followed by a quick plot to visualize its output for a 2-dimensional mixture of three Gaussians.

In [13]:

```
def generate_MoG_data(num_data, means, covariances, weights):
    """ Creates a list of data points """
    num_clusters = len(weights)
    data = []
    for i in range(num_data):
        # Use np.random.choice and weights to pick a cluster id greater than or equal to 0 and less than num_clusters.
        k = np.random.choice(len(weights), 1, p=weights)[0]

        # Use np.random.multivariate_normal to create data from this cluster
        x = np.random.multivariate_normal(means[k], covariances[k])
        data.append(x)
    return data
```

In [14]:

```
# Model parameters
init_means = [
    [5, 0],  # mean of cluster 1
    [1, 1],  # mean of cluster 2
    [0, 5]   # mean of cluster 3
]
init_covariances = [
    [[.5, 0.], [0, .5]],       # covariance of cluster 1
    [[.92, .38], [.38, .91]],  # covariance of cluster 2
    [[.5, 0.], [0, .5]]        # covariance of cluster 3
]
init_weights = [1/4., 1/2., 1/4.]  # weights of each cluster

# Generate data
np.random.seed(4)
data = generate_MoG_data(100, init_means, init_covariances, init_weights)
```

In [15]:

```
plt.figure()
d = np.vstack(data)
plt.plot(d[:,0], d[:,1], 'ko')
plt.rcParams.update({'font.size':16})
plt.tight_layout()
```

In [16]:

```
np.random.seed(4)

# Initialization of parameters
chosen = np.random.choice(len(data), 3, replace=False)
initial_means = [data[x] for x in chosen]
initial_covs = [np.cov(data, rowvar=0)] * 3
initial_weights = [1/3.] * 3

# Run EM
results = EM(data, initial_means, initial_covs, initial_weights)
```

**Note**. Like k-means, EM is prone to converging to a local optimum. In practice, you may want to run EM multiple times with different random initializations. We have omitted multiple restarts to keep the assignment reasonably short. For the purposes of this assignment, we fix a particular random seed (`seed=4`) to ensure consistent results among students.

**Checkpoint**. For this particular example, the EM algorithm is expected to terminate in 23 iterations. That is, the last line of the log should say "Iteration 22". If your function stopped too early or too late, you should revisit your code.

Our algorithm returns a dictionary with five elements:

- 'loglik': a record of the log likelihood at each iteration
- 'resp': the final responsibility matrix
- 'means': a list of K means
- 'covs': a list of K covariance matrices
- 'weights': the weights corresponding to each model component

**Quiz Question**: What is the weight that EM assigns to the first component after running the above codeblock?

In [18]:

```
round(results['weights'][0], 3)
```

**Quiz Question**: Using the same set of results, obtain the mean that EM assigns the second component. What is the mean in the first dimension?

In [20]:

```
# Your code here
round(results['means'][1][0], 3)
```

**Quiz Question**: Using the same set of results, obtain the covariance that EM assigns the third component. What is the variance in the first dimension?

In [27]:

```
results['covs'][2]
```

One useful feature of testing our implementation on low-dimensional simulated data is that we can easily visualize the results.

We will use the following `plot_contours` function to visualize the Gaussian components over the data at three different points in the algorithm's execution:

- At initialization (using `initial_means`, `initial_covs`, and `initial_weights`)
- After running the algorithm to completion
- After just 12 iterations (using the parameter estimates returned when setting `maxiter=12`)

In [28]:

```
import matplotlib.mlab as mlab

def plot_contours(data, means, covs, title):
    plt.figure()
    plt.plot([x[0] for x in data], [y[1] for y in data], 'ko')  # data
    delta = 0.025
    k = len(means)
    x = np.arange(-2.0, 7.0, delta)
    y = np.arange(-2.0, 7.0, delta)
    X, Y = np.meshgrid(x, y)
    col = ['green', 'red', 'indigo']
    for i in range(k):
        mean = means[i]
        cov = covs[i]
        sigmax = np.sqrt(cov[0][0])
        sigmay = np.sqrt(cov[1][1])
        sigmaxy = cov[0][1]/(sigmax*sigmay)
        Z = mlab.bivariate_normal(X, Y, sigmax, sigmay, mean[0], mean[1], sigmaxy)
        plt.contour(X, Y, Z, colors=col[i])
    plt.title(title)
    plt.rcParams.update({'font.size':16})
    plt.tight_layout()
```
In [29]:

```
# Parameters after initialization
plot_contours(data, initial_means, initial_covs, 'Initial clusters')
```

In [30]:

```
# Parameters after running EM to convergence
results = EM(data, initial_means, initial_covs, initial_weights)
plot_contours(data, results['means'], results['covs'], 'Final clusters')
```

In [31]:

```
# YOUR CODE HERE
results = EM(data, initial_means, initial_covs, initial_weights, maxiter=12)
plot_contours(data, results['means'], results['covs'], 'Clusters after 12 iterations')
```

**Quiz Question**: Plot the log likelihood that is observed at each iteration. Is the log likelihood plot monotonically increasing, monotonically decreasing, or neither [multiple choice]?

In [32]:

```
results = EM(data, initial_means, initial_covs, initial_weights)

# YOUR CODE HERE
loglikelihoods = results['loglik']
```

In [33]:

```
plt.plot(range(len(loglikelihoods)), loglikelihoods, linewidth=4)
plt.xlabel('Iteration')
plt.ylabel('Log-likelihood')
plt.rcParams.update({'font.size':16})
plt.tight_layout()
```

Now that we're confident in our implementation of the EM algorithm, we'll apply it to cluster some more interesting data. In particular, we have a set of images that come from four categories: sunsets, rivers, trees and forests, and cloudy skies. For each image we are given the average intensity of its red, green, and blue pixels, so we have a 3-dimensional representation of our data. Our goal is to find a good clustering of these images using our EM implementation; ideally our algorithm would find clusters that roughly correspond to the four image categories.

To begin with, we'll take a look at the data and get it in a form suitable for input to our algorithm. The data are provided in SFrame format:

In [34]:

```
images = gl.SFrame('images.sf')
gl.canvas.set_target('ipynb')
import array
images['rgb'] = images.pack_columns(['red', 'green', 'blue'])['X4']
images.show()
```

We need to come up with initial estimates for the mixture weights and component parameters. Let's take four images to be our initial cluster centers, and let's initialize the covariance matrix of each cluster to be diagonal with each element equal to the sample variance from the full data. As in our test on simulated data, we'll start by assuming each mixture component has equal weight.

This may take a few minutes to run.

In [35]:

```
np.random.seed(1)

# Initialize parameters
init_means = [images['rgb'][x] for x in np.random.choice(len(images), 4, replace=False)]
cov = np.diag([images['red'].var(), images['green'].var(), images['blue'].var()])
init_covariances = [cov, cov, cov, cov]
init_weights = [1/4., 1/4., 1/4., 1/4.]

# Convert rgb data to numpy arrays
img_data = [np.array(i) for i in
```