CSE 6040, Fall 2015 [30]: PCA in practice

For our course's final class, let's take the conceptual idea of a principal components analysis (PCA) and put it into practice.

For this notebook, you will need the following:

You'll also need a bunch of modules; might as well preload those now:


In [ ]:
import os
import sys
import re

In [ ]:
import numpy as np
import pandas as pd

In [ ]:
from IPython.display import display
import matplotlib.pyplot as plt

%matplotlib inline

from PIL import Image
import seaborn as sns

In [ ]:
import plotly.plotly as py
from plotly.graph_objs import *

# @YOUSE: Fill in your credentials (user ID, API key) for Plotly here
# You can get them or reset them at: https://plot.ly/settings/api
py.sign_in ('USERNAME', 'API-KEY')

In [ ]:
%reload_ext autoreload
%autoreload 2

import cse6040utils

Recap: Solving the PCA problem

Recall the basic algorithm to compute a PCA, the theory of which is explained in the notes of Lab 29 and an interactive visual demo of which appears at http://setosa.io/ev/principal-component-analysis/.

You are given a set of $m-1$ data points, $X \equiv (x_0, x_1, \cdots, x_{m-1})^T$. Each data point $x_i \in \mathbb{R}^d$ is $d$-dimensional. You wish to find a $k$-dimensional representation of these points, where $k \leq d$. To do so, you run the PCA procedure, which identifies a $k$-dimensional subspace in terms of $k$ orthogonal vectors ("axes"); these vectors are the principal components.

  1. If the data are not centered, transform them accordingly. In particular, ensure that their mean is 0, i.e., $\displaystyle \frac{1}{m} \sum_{i=0}^{m-1} x_i = 0$.
  2. Compute the $k$-truncated SVD, $X \approx U_k \Sigma_k V_k^T$. The truncated SVD is just the subset of singular vectors corresponding to the largest $k$ singular values.
  3. Choose $v_0, v_1, \ldots, v_{k-1}$ as the principal components.

A "Surprise" Dataset

Today's first application of PCA involves a surprise data set. If you haven't done so already, download it and unpack it into your notebook's working directory.

The data set is a bunch of goofy images. Let's look at one, selected at random. I swear I picked it randomly.


In [ ]:
goofy = Image.open ('peeps/rodd.tiff', 'r') # Load an image
goofy

Let's convert this image into a Numpy array, and then also to grayscale.


In [ ]:
def rgb2gray (rgb):
    return np.dot (rgb[...,:3], [0.299, 0.587, 0.144])

def imshow_gray (im):
    plt.imshow (im, interpolation='nearest', cmap=plt.get_cmap ('gray'))

In [ ]:
goofy_np_gray = rgb2gray (np.asarray (goofy))
imshow_gray (goofy_np_gray)
print "What a ham!"

Next, let's load all the images as grayscale, into a list, original_images, along with an array image_names to hold a name for each image. (The names are extracted from the image filename.)

You may need to adjust the filepath below if this code does not work for you.


In [ ]:
original_images = []
image_names = []
for base, dirs, files in os.walk ('peeps/.'):
    for filename in files:
        name_tiff = re.match (r'^(.*)\.tiff$', filename)
        if name_tiff:
            filepath = os.path.join (base, filename)
            im = rgb2gray (np.asarray (Image.open (filepath, 'r')))
            key = name_tiff.groups (0)[0]
            original_images.append (im)
            image_names.append (key)
        
print "Found", len (original_images), "goofy images.\n"

Preprocessing

To apply PCA, we'll want to preprocess the images in various ways.

To begin with, observe that the images come in all shapes and sizes.


In [ ]:
min_rows, min_cols = sys.maxsize, sys.maxsize
max_rows, max_cols = 0, 0
for (i, image) in enumerate (original_images):
    r, c = image.shape[0], image.shape[1]
    print '%d:' % i, image_names[i], "--", r, "x", c, "pixels"
    
    min_rows = min (min_rows, r)
    max_rows = max (max_rows, r)
    min_cols = min (min_cols, c)
    max_cols = max (max_cols, c)
    
print "\n==> Least common image size:", min_rows, "x", min_cols, "pixels"

Exercise. Recenter the images so that they are all the same size. Store them in a 3-dimensional Numpy array called images[:, :, :], where images[k, :, :] is the k-th image.


In [ ]:
# Re-center images to a common size
images = np.zeros ((len (original_images), min_rows, min_cols))
for (i, image) in enumerate (original_images):
    # @YOUSE: Implement a recentering
    pass
    
imshow_gray (images[29, :, :]) # Test: Who am I?

Exercise. Compute an "average" image, taken as the elementwise (pixelwise) mean over all images. Store the result in a min_rows $\times$ min_cols Numpy array called, mean_image.


In [ ]:
# @YOUSE: Compute the mean image
mean_image = ...

# Inspect your solution by viewing the "average" image. How would you describe it?
imshow_gray (mean_image)

Exercise. Recall that PCA requires centered points. Let's do that by subtracting the mean image from every image, overwriting the result (images array).


In [ ]:
# @YOUSE: Re-center the images around `mean_image`
pass

In [ ]:
imshow_gray (images[29, :, :]) # Compare this to the original. It's "spooky!"

For PCA, we need a data matrix. Here is some code to convert our 3-D array of images into a 2-D data matrix, where we "flatten" each image into a 1-D vector by a simple reshape operation.


In [ ]:
# Create m x d data matrix
m = len (images)
d = min_rows * min_cols
X = np.reshape (images, (m, d))

In [ ]:
# To get back to an image, just reshape it again
imshow_gray (np.reshape (X[29, :], (min_rows, min_cols)))

Applying PCA

Exercise. Compute the SVD of X. Store the result in three arrays, U, Sigma, and VT, where U holds $U$, Sigma holds just the diagonal entries of $\Sigma$, and VT holds $V^T$.


In [ ]:
# @YOUSE: Compute the SVD of X here
pass

# Sanity check on dimensions
print "X:", X.shape
print "U:", U.shape
print "Sigma:", Sigma.shape
print "V^T:", VT.shape

The following code looks at Sigma. The collection of $\sigma_i$ values is also referred to as the spectrum of the matrix.


In [ ]:
def peek_Sigma (Sigma, ret_df=False):
    k = len (Sigma)
    df_Sigma = pd.DataFrame (np.arange (len (Sigma)), columns=['i'])
    df_Sigma['sigma_i'] = Sigma
    Sigma_sq = np.power (Sigma, 2)
    Err_sq = np.sum (Sigma_sq) - np.cumsum (Sigma_sq)
    Err_sq[Err_sq < 0] = 0
    Err = np.sqrt (Err_sq)
    Relerr = Err / (Sigma[0] + Err[0])
    df_Sigma['sigma_i^2'] = Sigma_sq
    df_Sigma['err_i^2'] = Err_sq
    df_Sigma['err_i'] = Err
    df_Sigma['relerr_i'] = Relerr
    print "Singular values:"
    display (df_Sigma.head ())
    print "  ..."
    display (df_Sigma.tail ())
    
    f, ax = plt.subplots (figsize=(7, 7))
    #ax.set (yscale="log")
    sns.regplot ("i", "sigma_i", df_Sigma, ax=ax, fit_reg=False)
    if ret_df:
        return df_Sigma

In [ ]:
peek_Sigma (Sigma)

Exercise (question only). Does the spectrum of these data decay quickly or slowly? How should that affect your choice of $k$, if you are considering a $k$-truncated SVD?

(@YOUSE: Enter a response here)

Exercise. Run the code cell below to look at the first ("0"-th) principal components. Modify the cell and re-run it to look at a few more. What do they appear to capture?


In [ ]:
imshow_gray (np.reshape (VT[0, :], (min_rows, min_cols)))

Exercise. Write some code to compute a new matrix Y, which is the original data matrix projected onto the first two principal components.

You can use the second and third code cells below to draw your projection.


In [ ]:
# @YOUSE: Project onto the first k principal axes.
Y = ...

In [ ]:
def plotly_scatter_2d_labeled (X, x_1=0, x_2=1, labels=""):
    m, d = X.shape
    traces = []
    if d > 1:
        traces.append (Scatter (x=Y[:, x_1:x_1+1], y=Y[:, x_2:x_2+1],
                                mode='markers',
                                text=labels))
    else:
        traces.append (Scatter (x=Y, y=[0.0] * m,
                                mode='markers',
                                text=labels))
    fig = Figure (data=traces) #, layout=layout)
    display (py.iplot (fig))

In [ ]:
plotly_scatter_2d_labeled (Y, labels=image_names)

Another fun thing to do is to run k-means on the projected points.


In [ ]:
from scipy.cluster.vq import kmeans, vq

In [ ]:
def points2df (X):
    return pd.DataFrame (X, columns=['x_%d' % (i+1) for i in range (X.shape[1])])

# From [Lab 28]
def plot_clustering_k2 (X, centers, clustering):
    df = points2df (X)
    df['clustering'] = clustering
    sns.lmplot(data=df, x="x_1", y="x_2", hue="clustering", fit_reg=False,)
    if df['clustering'][0] == 0:
        colors = ['b', 'g']
    else:
        colors = ['g', 'b']
    plt.scatter(centers[:,0], centers[:,1], s=500, c=colors, marker=u'*' )

In [ ]:
def split_clusters (clustering):
    """
    Given a list of cluster label assignments, 'clustering',
    returns a list of lists, 'J[0:k]', 'clustering[J[i]]'
    is an array of all points having the same label.
    """
    id_label_pairs = list (enumerate (set (clustering.flatten ())))
    labels_map = dict ([(v, i) for (i, v) in id_label_pairs])
    
    # Count how many points belong to each cluster
    counts = [0] * len (labels_map)
    for l in clustering.flatten ():
        counts[labels_map[l]] += 1
        
    # Allocate space for each cluster
    clusters = [np.zeros (k, dtype=int) for k in counts]
    
    # Separate the points by cluster
    counts = [0] * len (labels_map)
    for (id_x, l) in enumerate (clustering.flatten ()):
        l_id = labels_map[l]
        k = counts[l_id]
        clusters[l_id][k] = id_x
        counts[l_id] += 1

    return clusters

def make_clustering_traces_k2 (X, clustering=None, x_1=0, x_2=1, labels=""):
    """
    Returns a list Plotly-compatible marker traces.
    """
    traces = []
    if clustering is None:
        traces.append (Scatter (x=X[:, x_1:(x_1+1)],
                                y=X[:, x_2:(x_2+1)],
                                mode='markers',
                                text=labels))
    else:
        clusters = split_clusters (clustering)
        for J in clusters:
            s_J = Scatter (x=X[J, x_1:(x_1+1)],
                           y=X[J, x_2:(x_2+1)],
                           mode='markers',
                           name="%s" % str (clustering[J[0]]),
                           text=[labels[j] for j in J])
            traces.append (s_J)
    return traces

In [ ]:
num_clusters = 3
centers, distortion = kmeans (X, num_clusters)
clustering, _ = vq (X, centers)
traces_X = make_clustering_traces_k2 (Y, clustering, labels=image_names)

In [ ]:
fig = Figure (data=traces_X)
py.iplot (fig)

In [ ]:
centers, distortion = kmeans (Y, num_clusters)
clustering, _ = vq (Y, centers)
traces_Y = make_clustering_traces_k2 (Y, clustering, labels=image_names)

In [ ]:
fig = Figure (data=traces_Y) #, layout=layout)
py.iplot (fig)

In [ ]:
df_kcurve = pd.DataFrame (columns=['k', 'distortion']) 
for i in range(1,10):
    _, distortion = kmeans (Y, i)
    df_kcurve.loc[i] = [i, distortion]
df_kcurve.plot(x="k", y="distortion")

(optional) Exploring handwritten digits

If time permits, we'll use the code fragments below to explore a different data set, based on images of handwritten digits available from: http://yann.lecun.com/exdb/mnist/


In [ ]:
# Download and unpack MNIST digits database
(mnist_images_gz, mnist_labels_gz) = cse6040utils.download_mnist ('training')

print "Images:", mnist_images_gz
print "Labels:", mnist_labels_gz

In [ ]:
images, labels, inds = cse6040utils.load_mnist (mnist_images_gz, mnist_labels_gz,
                                                digits=[1, 8],
                                                return_indices=True)

In [ ]:
print images.shape
print labels.shape

In [ ]:
imshow_gray (images[2, :, :])

In [ ]:
# Compute mean image
mean_image = np.mean (images, 0)
imshow_gray (mean_image)

In [ ]:
X = np.reshape (images - mean_image, (images.shape[0], images.shape[1]*images.shape[2]))
print X.shape

In [ ]:
(U, Sigma, VT) = np.linalg.svd (X, full_matrices=False)

In [ ]:
peek_Sigma (Sigma)

In [ ]:
#imshow_gray (mean_image)
imshow_gray (np.reshape (VT[2, :], (28, 28)))

In [ ]:
# Project onto the first k principal axes
k = 2
Y = X.dot (VT[0:k, :].T)

In [ ]:
annotations = ["[%d] %d" % (i, d) for (i, d) in zip (np.arange (len (Y)), labels)]

print annotations[:5]

In [ ]:
plotly_scatter_2d_labeled (Y, labels=annotations)

In [ ]:
traces_Y = make_clustering_traces_k2 (Y, clustering=labels, labels=annotations)
fig = Figure (data=traces_Y)
py.iplot (fig)

In [ ]:
imshow_gray (images[2705, :, :])

In [ ]:
imshow_gray (images[1407, :, :])

In [ ]:
num_clusters = 2
centers, distortion = kmeans (X, num_clusters)
clustering, _ = vq (X, centers)

In [ ]:
traces_Y = make_clustering_traces_k2 (Y, clustering, labels=annotations)
fig = Figure (data=traces_Y)
py.iplot (fig)

In [ ]:
imshow_gray (images[93, :, :])

References

Today's notebook uses a bunch of library modules and coding tricks; if you want to learn more, see these references.

Image manipulation

PCA in Python