Week 4 Assessment: Principal Component Analysis (PCA)

Learning Objective

In this notebook, we will implement PCA. We will implement the two versions of PCA described in the lectures: one for the case when the dataset size exceeds the data dimensionality, and one for the case when the dimensionality exceeds the dataset size.

We will break down the task of implementing PCA into small components and combine them in the end to produce the final algorithm. We will apply PCA to the MNIST dataset and observe how the reconstruction changes as we change the number of principal components used.

  • If you are having issues with the grader, be sure to check out the Q&A.

  • If you are stuck with the programming assignments, you can visit the discussion forum and discuss with your peers.


In [1]:
# PACKAGE: DO NOT EDIT
import numpy as np
import timeit

In [2]:
# PACKAGE: DO NOT EDIT
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

Now, let's plot a digit from the dataset:


In [3]:
from ipywidgets import interact

from sklearn.datasets import fetch_mldata
MNIST = fetch_mldata('MNIST original', data_home='./MNIST')
%matplotlib inline

plt.figure(figsize=(4,4))
plt.imshow(MNIST.data[0].reshape(28,28), cmap='gray');


Before we implement PCA, we will need to do some data preprocessing. In this assessment, some of the preprocessing steps will be implemented by you, and others we will take care of. However, when you work on real-world problems, you will need to do all of these steps yourself!

The preprocessing steps we will do are

  1. Convert the unsigned integer 8 (uint8) encoding of the pixels to floating point numbers between 0 and 1.
  2. Subtract from each image the mean $\mu$.
  3. Scale each dimension of each image by $\frac{1}{\sigma}$ where $\sigma$ is the standard deviation of this dimension across the whole dataset.

The steps above ensure that our images have zero mean and unit variance. These preprocessing steps are also known as Data Normalization or Feature Scaling.
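
To make these three steps concrete, here is a toy sketch (not the graded code; `toy_images` is a made-up two-image "dataset") of what they amount to in NumPy:


In [ ]:
# Toy illustration of the preprocessing steps (sketch only).
toy_images = np.array([[0, 128, 255],
                       [64, 128, 191]], dtype=np.uint8)
toy_float = toy_images / 255.                        # step 1: uint8 -> float in [0, 1]
toy_mean = toy_float.mean(axis=0)                    # per-pixel mean over the dataset
toy_std = toy_float.std(axis=0)                      # per-pixel standard deviation
toy_std[toy_std == 0] = 1.                           # guard against constant pixels
toy_normalized = (toy_float - toy_mean) / toy_std    # steps 2 and 3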

1. PCA

Now we will implement PCA. Before we do that, let's pause for a moment and think about the steps for performing PCA. Assume that we are performing PCA on some dataset $\boldsymbol X$ for $M$ principal components. We then need to perform the following steps, which we break into parts:

  1. Data normalization (normalize).
  2. Find the eigenvalues and corresponding eigenvectors of the covariance matrix $\boldsymbol S$, sorted in descending order of the eigenvalues (eig).

After these steps, we can then compute the projection and reconstruction of the data onto the space spanned by the top $M$ eigenvectors.
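
Concretely, if we collect the top $M$ eigenvectors as the columns of $\boldsymbol B = [\boldsymbol b_1, \ldots, \boldsymbol b_M]$, the projection matrix onto their span and the reconstruction of a data point $\boldsymbol x$ are $$\boldsymbol P = \boldsymbol B(\boldsymbol B^T \boldsymbol B)^{-1}\boldsymbol B^T, \qquad \tilde{\boldsymbol x} = \boldsymbol P \boldsymbol x,$$ which is exactly what the functions below compute.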


In [4]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE

# ===YOU SHOULD EDIT THIS FUNCTION===
def normalize(X):
    """Normalize the given dataset X
    Args:
        X: ndarray, dataset
    
    Returns:
        (Xbar, mean, std): ndarray, Xbar is the normalized dataset
        with mean 0 and standard deviation 1; mean and std are the 
        mean and standard deviation respectively.
    
    Note:
        You will encounter dimensions where the standard deviation is
        zero; for those dimensions, the normalized data would be NaN.
        Handle this by using `std = 1` for those dimensions when
        doing normalization.
    """
    mu = np.mean(X, axis=0) # EDIT THIS
    std = np.std(X, axis=0)
    std_filled = std.copy()
    std_filled[std==0] = 1.
    Xbar = (X - mu) / std_filled                # EDIT THIS
    return Xbar, mu, std_filled
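
A quick way to convince yourself that normalize behaves as intended is to run it on a small made-up array with one constant column (a sketch):


In [ ]:
# Sanity check (sketch): non-constant columns should end up with mean ~0 and std ~1;
# the constant column stays at 0 because we divide by the filled-in std of 1.
toy = np.array([[1., 2., 5.],
                [3., 2., 7.],
                [5., 2., 9.]])
toy_bar, toy_mu, toy_std = normalize(toy)
print(toy_bar.mean(axis=0))  # approximately [0, 0, 0]
print(toy_bar.std(axis=0))   # approximately [1, 0, 1]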

In [5]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE

# ===YOU SHOULD EDIT THIS FUNCTION===
def eig(S):
    """Compute the eigenvalues and corresponding eigenvectors 
        for the covariance matrix S.
    Args:
        S: ndarray, covariance matrix
    
    Returns:
        (eigvals, eigvecs): ndarray, the eigenvalues and eigenvectors

    Note:
        the eigenvals and eigenvecs SHOULD BE sorted in descending
        order of the eigen values
        
        Hint: take a look at np.argsort for how to sort in numpy.
    """
    eVals, eVecs = np.linalg.eig(S)
    order = np.absolute(eVals).argsort()[::-1]
    eVals = eVals[order]
    eVecs = eVecs[:,order]  
    return (eVals, eVecs) # EDIT THIS
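
We can also spot-check eig on a small symmetric matrix whose eigenpairs we know by hand (a sketch):


In [ ]:
# Sanity check (sketch): the eigenvalues of [[3, 1], [1, 3]] are 4 and 2,
# so they should come back in descending order.
S_test = np.array([[3., 1.],
                   [1., 3.]])
vals, vecs = eig(S_test)
print(vals)        # expected: [4., 2.]
print(vecs[:, 0])  # expected: a multiple of [1, 1]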

In [30]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE

# ===YOU SHOULD EDIT THIS FUNCTION===
def projection_matrix(B):
    """Compute the projection matrix onto the space spanned by `B`
    Args:
        B: ndarray of dimension (D, M), the basis for the subspace
    
    Returns:
        P: the projection matrix
    """
    P = B @ np.linalg.inv(B.T @ B) @ B.T # EDIT THIS
    return P
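
And a quick check of projection_matrix on a subspace where the answer is obvious (a sketch):


In [ ]:
# Sanity check (sketch): projecting onto the x-axis of R^2 should zero out
# the second coordinate.
B_test = np.array([[1.],
                   [0.]])
print(projection_matrix(B_test))                       # expected: [[1, 0], [0, 0]]
print(projection_matrix(B_test) @ np.array([2., 3.]))  # expected: [2., 0.]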

Now, with the help of the functions implemented above, let's put PCA together! When you implement PCA, do take advantage of those functions.


In [7]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE

# ===YOU SHOULD EDIT THIS FUNCTION===
def PCA(X, num_components):
    """
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of datapoints
        num_components: the number of principal components to use.
    Returns:
        X_reconstruct: ndarray of size (N, D) the reconstruction
        of X from the first `num_components` principal components.
    """
    # Compute the data covariance matrix S
    S = np.cov(X, rowvar=False, bias=True)

    # Next find eigenvalues and corresponding eigenvectors for S by implementing eig().
    eig_vals, eig_vecs = eig(S)
    
    # Reconstruct the images from the lower-dimensional representation.
    # To do this, we first need to find the projection_matrix (which you implemented earlier)
    # which projects our input data onto the vector space spanned by the eigenvectors
    P = projection_matrix(eig_vecs[:,:num_components]) # projection matrix
    
    # Then for each data point x_i in the dataset X 
    #   we can project the original x_i onto the eigenbasis.
    X_reconstruct = (P @ X.T).T
    return X_reconstruct
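
One quick sanity check of the full pipeline (a sketch, on a small randomly generated `X_test`): if we keep as many components as there are dimensions, the projection is onto the whole space, so the reconstruction should match the normalized input almost exactly.


In [ ]:
# Sanity check (sketch): with num_components == D, PCA should reconstruct
# the (normalized) data essentially perfectly.
rng = np.random.RandomState(0)
X_test, _, _ = normalize(rng.randn(20, 5))
np.testing.assert_allclose(np.real(PCA(X_test, 5)), X_test, atol=1e-6)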

In [8]:
## Some preprocessing of the data
NUM_DATAPOINTS = 1000
X = (MNIST.data.reshape(-1, 28 * 28)[:NUM_DATAPOINTS]) / 255.
Xbar, mu, std = normalize(X)

The more principal components we use, the smaller our reconstruction error will be. Now, let's answer the following question:

How many principal components do we need in order to reach a Mean Squared Error (MSE) of less than $100$ for our dataset?


In [9]:
def mse(predict, actual):
    return np.square(predict - actual).sum(axis=1).mean()
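
In symbols, for $N$ images this computes $$\text{MSE} = \frac{1}{N}\sum_{n=1}^{N} \lVert \tilde{\boldsymbol x}_n - \boldsymbol x_n \rVert^2,$$ i.e. the squared reconstruction error of each image, averaged over the dataset.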

In [17]:
loss = []
reconstructions = []
for num_component in range(1, 100):
    reconst = PCA(Xbar, num_component)
    reconst = np.real(reconst)
    error = mse(reconst, Xbar)
    reconstructions.append(reconst)
    # print('n = {:d}, reconstruction_error = {:f}'.format(num_component, error))
    loss.append((num_component, error))

reconstructions = np.asarray(reconstructions)
reconstructions = reconstructions * std + mu # "unnormalize" the reconstructed image
loss = np.asarray(loss)

In [18]:
loss


Out[18]:
array([[   1.        ,  445.17648843],
       [   2.        ,  403.94197923],
       [   3.        ,  377.59243065],
       [   4.        ,  353.30012434],
       [   5.        ,  335.44289452],
       [   6.        ,  320.51589546],
       [   7.        ,  307.28420723],
       [   8.        ,  294.95357442],
       [   9.        ,  283.97661502],
       [  10.        ,  274.01335119],
       [  11.        ,  264.53517783],
       [  12.        ,  255.4115194 ],
       [  13.        ,  246.97777313],
       [  14.        ,  238.61248858],
       [  15.        ,  230.82394538],
       [  16.        ,  223.74508667],
       [  17.        ,  217.25288549],
       [  18.        ,  211.10980526],
       [  19.        ,  205.19126619],
       [  20.        ,  199.45937602],
       [  21.        ,  193.83064587],
       [  22.        ,  188.72992977],
       [  23.        ,  183.68020736],
       [  24.        ,  178.7164468 ],
       [  25.        ,  173.83889867],
       [  26.        ,  169.24008665],
       [  27.        ,  164.70992918],
       [  28.        ,  160.39904031],
       [  29.        ,  156.24217637],
       [  30.        ,  152.31470003],
       [  31.        ,  148.58551212],
       [  32.        ,  144.96751302],
       [  33.        ,  141.4636097 ],
       [  34.        ,  138.04905222],
       [  35.        ,  134.80419402],
       [  36.        ,  131.63769824],
       [  37.        ,  128.51486788],
       [  38.        ,  125.44911449],
       [  39.        ,  122.44568498],
       [  40.        ,  119.61780193],
       [  41.        ,  116.861161  ],
       [  42.        ,  114.12098626],
       [  43.        ,  111.49498167],
       [  44.        ,  109.04360723],
       [  45.        ,  106.67769811],
       [  46.        ,  104.41128462],
       [  47.        ,  102.28918562],
       [  48.        ,  100.20962071],
       [  49.        ,   98.18639674],
       [  50.        ,   96.20259239],
       [  51.        ,   94.23047999],
       [  52.        ,   92.32157216],
       [  53.        ,   90.45233407],
       [  54.        ,   88.71988007],
       [  55.        ,   87.01017231],
       [  56.        ,   85.35714647],
       [  57.        ,   83.73306986],
       [  58.        ,   82.1346028 ],
       [  59.        ,   80.55715323],
       [  60.        ,   79.04351123],
       [  61.        ,   77.59111949],
       [  62.        ,   76.17928152],
       [  63.        ,   74.83974007],
       [  64.        ,   73.53709719],
       [  65.        ,   72.26303721],
       [  66.        ,   71.01439426],
       [  67.        ,   69.78934379],
       [  68.        ,   68.60084356],
       [  69.        ,   67.43033735],
       [  70.        ,   66.29343157],
       [  71.        ,   65.18158647],
       [  72.        ,   64.09569727],
       [  73.        ,   63.02477536],
       [  74.        ,   61.96452806],
       [  75.        ,   60.92847182],
       [  76.        ,   59.92795243],
       [  77.        ,   58.93630438],
       [  78.        ,   57.95961501],
       [  79.        ,   56.9970415 ],
       [  80.        ,   56.06279858],
       [  81.        ,   55.15272708],
       [  82.        ,   54.2574517 ],
       [  83.        ,   53.37740707],
       [  84.        ,   52.5191994 ],
       [  85.        ,   51.66320389],
       [  86.        ,   50.85142887],
       [  87.        ,   50.04982152],
       [  88.        ,   49.25596342],
       [  89.        ,   48.48553832],
       [  90.        ,   47.72661966],
       [  91.        ,   46.98670333],
       [  92.        ,   46.26800157],
       [  93.        ,   45.55661019],
       [  94.        ,   44.85918259],
       [  95.        ,   44.18483827],
       [  96.        ,   43.52591415],
       [  97.        ,   42.89422674],
       [  98.        ,   42.27062771],
       [  99.        ,   41.65512189]])
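
Reading off the table, the reconstruction error first drops below $100$ at num_components $= 49$. We can also extract this programmatically (a sketch, using the `loss` array computed above):


In [ ]:
# Smallest number of components whose reconstruction MSE is below 100.
int(loss[loss[:, 1] < 100][0, 0])   # expected: 49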

We can also put these numbers into perspective by plotting them.


In [19]:
fig, ax = plt.subplots()
ax.plot(loss[:,0], loss[:,1]);
ax.axhline(100, linestyle='--', color='r', linewidth=2)
ax.xaxis.set_ticks(np.arange(1, 100, 5));
ax.set(xlabel='num_components', ylabel='MSE', title='MSE vs number of principal components');


But numbers don't tell us everything! Just what does it mean qualitatively for the loss to decrease from around $450.0$ to less than $100.0$?

Let's find out! In the next cell, we draw the original digit as the leftmost image. To its right, we show reconstructions of the same image using an increasing number of principal components (1 through 9).


In [20]:
@interact(image_idx=(0, 1000))
def show_num_components_reconst(image_idx):
    fig, ax = plt.subplots(figsize=(20., 20.))
    actual = X[image_idx]
    x = np.concatenate([actual[np.newaxis, :], reconstructions[:, image_idx]])
    ax.imshow(np.hstack(x.reshape(-1, 28, 28)[np.arange(10)]),
              cmap='gray');
    ax.axvline(28, color='orange', linewidth=2)


We can also browse through the reconstructions of other digits. Once again, interact comes in handy.


In [21]:
@interact(i=(0, 10))
def show_pca_digits(i=1):
    plt.figure(figsize=(4,4))
    actual_sample = X[i].reshape(28,28)
    # `reconst` still holds the last reconstruction from the loop above (99 components)
    reconst_sample = (reconst[i, :] * std + mu).reshape(28, 28)
    plt.imshow(np.hstack([actual_sample, reconst_sample]), cmap='gray')
    plt.show()


2. PCA for high-dimensional datasets

Sometimes, the dimensionality of our dataset may be larger than the number of data points we have. Then it might be inefficient to perform PCA with the implementation above. Instead, as mentioned in the lectures, we can implement PCA in a more efficient manner, which we call PCA for high-dimensional data (PCA_high_dim).

Consider the normalized data matrix $\boldsymbol{\bar{X}}$ of size $N \times D$ where $D > N$. To do PCA we perform the following steps:

  • We solve the following eigenvalue/eigenvector equation for the matrix $\frac{1}{N} \boldsymbol{\bar{X}} \boldsymbol{\bar{X}}^T$, i.e. we solve for $\lambda_i$, $\boldsymbol c_i$ in $$\frac{1}{N} \boldsymbol{\bar{X}} \boldsymbol{\bar{X}}^T \boldsymbol c_i = \lambda_i \boldsymbol c_i.$$

  • We want to recover the original eigenvectors $\boldsymbol b_i$ of the data covariance matrix $\boldsymbol S = \frac{1}{N} \boldsymbol{\bar{X}}^T \boldsymbol{\bar{X}}$.

  • Left-multiplying both sides by $\boldsymbol{\bar{X}}^T$ yields $$\frac{1}{N} \boldsymbol{\bar{X}}^T \boldsymbol{\bar{X}} \boldsymbol{\bar{X}}^T \boldsymbol c_i = \lambda_i \boldsymbol{\bar{X}}^T \boldsymbol c_i,$$ so we recover $\boldsymbol b_i = \boldsymbol{\bar{X}}^T \boldsymbol c_i$ as an eigenvector of $\boldsymbol S$ with eigenvalue $\lambda_i$.
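
Note that the recovered vector $\boldsymbol b_i = \boldsymbol{\bar{X}}^T \boldsymbol c_i$ is in general not unit length: for a unit eigenvector $\boldsymbol c_i$ we have $$\lVert \boldsymbol{\bar{X}}^T \boldsymbol c_i \rVert^2 = \boldsymbol c_i^T \boldsymbol{\bar{X}} \boldsymbol{\bar{X}}^T \boldsymbol c_i = N \lambda_i.$$ This does not matter for the reconstruction below, because the projection matrix $\boldsymbol B (\boldsymbol B^T \boldsymbol B)^{-1} \boldsymbol B^T$ is unchanged when the columns of $\boldsymbol B$ are rescaled.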


In [29]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE

def PCA_high_dim(X, num_components):
    """Compute PCA for small sample size. 
    Args:
        X: ndarray of size (N, D), where D is the dimension of the data,
           and N is the number of data points in the training set. You may assume the input 
           has been normalized.
        num_components: the number of principal components to use.
    Returns:
        X_reconstruct: (N, D) ndarray. the reconstruction
        of X from the first `num_components` principal components.
    """
    N, D = X.shape
    M = (1/N)*(X @ X.T) # EDIT THIS, compute the matrix \frac{1}{N}XX^T.
    eig_vals, eig_vecs = eig(M) # EDIT THIS, compute the eigenvalues. 
    U = X.T @ eig_vecs # EDIT THIS. Compute the eigenvectors for the original PCA problem.
    # Similar to what you would do in PCA, compute the projection matrix,
    # then perform the projection.
    P = projection_matrix(U[:, 0:num_components]) # projection matrix
    X_reconstruct = (P @ X.T).T # EDIT THIS.
    return X_reconstruct

Given the same dataset, PCA and PCA_high_dim should give the same output. Assuming we have implemented PCA correctly, we can use this invariant to test the correctness of PCA_high_dim.


In [28]:
np.testing.assert_almost_equal(PCA(Xbar, 2), PCA_high_dim(Xbar, 2))
# In fact, you can generate random input dataset to verify your implementation.
print('correct')


correct

Now let's compare the running time between PCA and PCA_high_dim.

Tips for running benchmarks or computationally expensive code:

When you have a computation that takes a non-negligible amount of time, try to separate the code that produces the results from the code that analyzes them (e.g. plotting the results, computing statistics of the results). That way you don't have to recompute everything when you only want to produce more analysis; the time helper and the cells below follow this pattern.


In [32]:
def time(f, repeat=10):
    times = []
    for _ in range(repeat):
        start = timeit.default_timer()
        f()
        stop = timeit.default_timer()
        times.append(stop-start)
    return np.mean(times), np.std(times)

In [33]:
times_mm0 = []
times_mm1 = []

for datasetsize in np.arange(4, 784, step=20):
    XX = Xbar[:datasetsize]
    mu, sigma = time(lambda : XX.T @ XX)
    times_mm0.append((datasetsize, mu, sigma))
    
    mu, sigma = time(lambda : XX @ XX.T)
    times_mm1.append((datasetsize, mu, sigma))
    
times_mm0 = np.asarray(times_mm0)
times_mm1 = np.asarray(times_mm1)

In [34]:
fig, ax = plt.subplots()
ax.set(xlabel='size of dataset', ylabel='running time')
bar = ax.errorbar(times_mm0[:, 0], times_mm0[:, 1], times_mm0[:, 2], label="$X^T X$ (PCA)", linewidth=2)
ax.errorbar(times_mm1[:, 0], times_mm1[:, 1], times_mm1[:, 2], label="$X X^T$ (PCA_high_dim)", linewidth=2)
ax.legend();


Above, we benchmarked the time taken to compute $\boldsymbol X^T\boldsymbol X$ and $\boldsymbol X\boldsymbol X^T$; Jupyter's %time magic command is also quite handy for this, as shown further below.

Next, we benchmark PCA and PCA_high_dim.


In [42]:
times0 = []
times1 = []

for datasetsize in np.arange(4, 784, step=100):
    XX = Xbar[:datasetsize]
    npc = 2
    mu, sigma = time(lambda : PCA( XX, 2))
    times0.append((datasetsize, mu, sigma))
    
    mu, sigma = time(lambda : PCA_high_dim(XX, 2))
    times1.append((datasetsize, mu, sigma))
    
times0 = np.asarray(times0)
times1 = np.asarray(times1)

Alternatively, we can use the %time magic command.


In [43]:
%time Xbar.T @ Xbar
%time Xbar @ Xbar.T
pass # Put this here, so that our output does not show the result of computing `Xbar @ Xbar.T`


CPU times: user 40 ms, sys: 64 ms, total: 104 ms
Wall time: 287 ms
CPU times: user 108 ms, sys: 72 ms, total: 180 ms
Wall time: 398 ms

We can also compare the running time of PCA and PCA_high_dim directly. Spend some time thinking about what this plot means. We mentioned in the lectures that PCA_high_dim is advantageous when the dataset size $N$ is smaller than the data dimension $D$. Although the two running-time curves do not intersect exactly at $N = D$, the plot does show the trend.
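
Roughly speaking, this is what an operation count suggests: PCA eigendecomposes the $D \times D$ covariance matrix at a cost of about $O(D^3)$, whereas PCA_high_dim eigendecomposes the $N \times N$ matrix $\frac{1}{N}\boldsymbol{\bar{X}}\boldsymbol{\bar{X}}^T$ at about $O(N^3)$ (plus the cost of forming each matrix), so the high-dimensional variant should be faster when $N \ll D = 784$ and slower once $N$ exceeds $D$.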


In [44]:
fig, ax = plt.subplots()
ax.set(xlabel='number of datapoints', ylabel='run time')
ax.errorbar(times0[:, 0], times0[:, 1], times0[:, 2], label="PCA", linewidth=2)
ax.errorbar(times1[:, 0], times1[:, 1], times1[:, 2], label="PCA_high_dim", linewidth=2)
ax.legend();


Again, with the %time magic command:


In [45]:
%time PCA(Xbar, 2)
%time PCA_high_dim(Xbar, 2)
pass


CPU times: user 1.9 s, sys: 1.74 s, total: 3.63 s
Wall time: 7.28 s
CPU times: user 5.44 s, sys: 4.41 s, total: 9.85 s
Wall time: 19.7 s

In [ ]: