For our course's final class, let's take the conceptual idea of a principal components analysis (PCA) and put it into practice.
For this notebook, you will need the following:
cse6040utils.py
You'll also need a bunch of modules; might as well preload those now:
In [ ]:
import os
import sys
import re
In [ ]:
import numpy as np
import pandas as pd
In [ ]:
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline
from PIL import Image
import seaborn as sns
In [ ]:
import plotly.plotly as py
from plotly.graph_objs import *
# @YOUSE: Fill in your credentials (user ID, API key) for Plotly here
# You can get them or reset them at: https://plot.ly/settings/api
py.sign_in ('USERNAME', 'API-KEY')
In [ ]:
%reload_ext autoreload
%autoreload 2
import cse6040utils
Recall the basic algorithm for computing a PCA; the theory is explained in the Lab 29 notes, and an interactive visual demo appears at http://setosa.io/ev/principal-component-analysis/.
You are given a set of $m$ data points, $X \equiv (x_0, x_1, \cdots, x_{m-1})^T$. Each data point $x_i \in \mathbb{R}^d$ is $d$-dimensional. You wish to find a $k$-dimensional representation of these points, where $k \leq d$. To do so, you run the PCA procedure, which identifies a $k$-dimensional subspace in terms of $k$ orthogonal vectors ("axes"); these vectors are the principal components.
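As a preview, here is a minimal sketch of that procedure using Numpy's SVD on a hypothetical data matrix X_demo of shape $m \times d$ (the "_demo" variable names are illustrative only); the rest of this notebook builds up exactly these steps on real images.
In [ ]:
# A sketch of PCA via the SVD, on hypothetical data (m=100 points in d=5 dimensions)
X_demo = np.random.randn (100, 5)
X_centered = X_demo - np.mean (X_demo, axis=0)     # 1. Center the points
U_demo, Sigma_demo, VT_demo = np.linalg.svd (X_centered, full_matrices=False)   # 2. Compute the (thin) SVD
k = 2
Y_demo = X_centered.dot (VT_demo[:k, :].T)         # 3. Project onto the top-k principal axes
Let's now put this into practice on a small collection of face images, starting with a single example.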
In [ ]:
goofy = Image.open ('peeps/rodd.tiff', 'r') # Load an image
goofy
Let's convert this image into a Numpy array, and then also to grayscale.
In [ ]:
def rgb2gray (rgb):
    # Standard luminance weights for grayscale: 0.299 R + 0.587 G + 0.114 B
    return np.dot (rgb[..., :3], [0.299, 0.587, 0.114])

def imshow_gray (im):
    plt.imshow (im, interpolation='nearest', cmap=plt.get_cmap ('gray'))
In [ ]:
goofy_np_gray = rgb2gray (np.asarray (goofy))
imshow_gray (goofy_np_gray)
print "What a ham!"
Next, let's load all the images as grayscale into a list, original_images, along with a list, image_names, that holds a name for each image. (The names are extracted from the image filenames.)
You may need to adjust the filepath below if this code does not work for you.
In [ ]:
original_images = []
image_names = []
for base, dirs, files in os.walk ('peeps/.'):
for filename in files:
name_tiff = re.match (r'^(.*)\.tiff$', filename)
if name_tiff:
filepath = os.path.join (base, filename)
im = rgb2gray (np.asarray (Image.open (filepath, 'r')))
key = name_tiff.groups (0)[0]
original_images.append (im)
image_names.append (key)
print "Found", len (original_images), "goofy images.\n"
To begin with, observe that the images come in all shapes and sizes.
In [ ]:
min_rows, min_cols = sys.maxsize, sys.maxsize
max_rows, max_cols = 0, 0
for (i, image) in enumerate (original_images):
r, c = image.shape[0], image.shape[1]
print '%d:' % i, image_names[i], "--", r, "x", c, "pixels"
min_rows = min (min_rows, r)
max_rows = max (max_rows, r)
min_cols = min (min_cols, c)
max_cols = max (max_cols, c)
print "\n==> Least common image size:", min_rows, "x", min_cols, "pixels"
Exercise. Recenter the images so that they are all the same size. Store them in a 3-dimensional Numpy array called images[:, :, :], where images[k, :, :] is the k-th image.
In [ ]:
# Re-center images to a common size
images = np.zeros ((len (original_images), min_rows, min_cols))
for (i, image) in enumerate (original_images):
# @YOUSE: Implement a recentering
pass
imshow_gray (images[29, :, :]) # Test: Who am I?
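If you get stuck, here is one possible approach (a sketch, not the only valid answer): crop a centered min_rows-by-min_cols window out of each image.
In [ ]:
# One possible recentering (a sketch): crop a centered window of size min_rows x min_cols
for (i, image) in enumerate (original_images):
    r, c = image.shape[0], image.shape[1]
    r0 = (r - min_rows) // 2    # top offset of the centered crop
    c0 = (c - min_cols) // 2    # left offset of the centered crop
    images[i, :, :] = image[r0:(r0 + min_rows), c0:(c0 + min_cols)]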
Exercise. Compute an "average" image, taken as the elementwise (pixelwise) mean over all images. Store the result in a min_rows $\times$ min_cols Numpy array called mean_image.
In [ ]:
# @YOUSE: Compute the mean image
mean_image = ...
# Inspect your solution by viewing the "average" image. How would you describe it?
imshow_gray (mean_image)
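One possible solution (a sketch): since images has shape (m, min_rows, min_cols), averaging over its first axis gives the pixelwise mean.
In [ ]:
# One possible solution (sketch): average over the first (image) axis
mean_image = np.mean (images, axis=0)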
Exercise. Recall that PCA requires centered points. Let's do that by subtracting the mean image from every image, overwriting the images array with the result.
In [ ]:
# @YOUSE: Re-center the images around `mean_image`
pass
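A minimal sketch of one way to do this, relying on Numpy broadcasting to subtract mean_image from each slice images[i, :, :]:
In [ ]:
# One possible solution (sketch): broadcasting subtracts `mean_image` from every image
images -= mean_image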
In [ ]:
imshow_gray (images[29, :, :]) # Compare this to the original. It's "spooky!"
For PCA, we need a data matrix. Here is some code to convert our 3-D array of images into a 2-D data matrix, where we "flatten" each image into a 1-D vector by a simple reshape operation.
In [ ]:
# Create m x d data matrix
m = len (images)
d = min_rows * min_cols
X = np.reshape (images, (m, d))
In [ ]:
# To get back to an image, just reshape it again
imshow_gray (np.reshape (X[29, :], (min_rows, min_cols)))
Exercise. Compute the SVD of X. Store the result in three arrays, U, Sigma, and VT, where U holds $U$, Sigma holds just the diagonal entries of $\Sigma$, and VT holds $V^T$.
In [ ]:
# @YOUSE: Compute the SVD of X here
pass
# Sanity check on dimensions
print "X:", X.shape
print "U:", U.shape
print "Sigma:", Sigma.shape
print "V^T:", VT.shape
The following code looks at Sigma. The collection of $\sigma_i$ values is also referred to as the spectrum of the matrix.
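For reference, the err_i column computed below is $\mathrm{err}_i = \sqrt{\sum_{j > i} \sigma_j^2}$, which (by the Eckart-Young theorem) is the Frobenius-norm error of the best approximation to $X$ that keeps only the first $i+1$ singular values; relerr_i scales this by a rough estimate of $\|X\|_F$. A rapidly decaying spectrum therefore means a small $k$ already captures most of the data.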
In [ ]:
def peek_Sigma (Sigma, ret_df=False):
k = len (Sigma)
df_Sigma = pd.DataFrame (np.arange (len (Sigma)), columns=['i'])
df_Sigma['sigma_i'] = Sigma
Sigma_sq = np.power (Sigma, 2)
Err_sq = np.sum (Sigma_sq) - np.cumsum (Sigma_sq)
Err_sq[Err_sq < 0] = 0
Err = np.sqrt (Err_sq)
Relerr = Err / (Sigma[0] + Err[0])
df_Sigma['sigma_i^2'] = Sigma_sq
df_Sigma['err_i^2'] = Err_sq
df_Sigma['err_i'] = Err
df_Sigma['relerr_i'] = Relerr
print "Singular values:"
display (df_Sigma.head ())
print " ..."
display (df_Sigma.tail ())
f, ax = plt.subplots (figsize=(7, 7))
#ax.set (yscale="log")
sns.regplot ("i", "sigma_i", df_Sigma, ax=ax, fit_reg=False)
if ret_df:
return df_Sigma
In [ ]:
peek_Sigma (Sigma)
Exercise (question only). Does the spectrum of these data decay quickly or slowly? How should that affect your choice of $k$, if you are considering a $k$-truncated SVD?
(@YOUSE: Enter a response here)
Exercise. Run the code cell below to look at the first ("0"-th) principal component. Modify the cell and re-run it to look at a few more. What do they appear to capture?
In [ ]:
imshow_gray (np.reshape (VT[0, :], (min_rows, min_cols)))
Exercise. Write some code to compute a new matrix Y, which is the original data matrix projected onto the first two principal components. You can use the second and third code cells below to draw your projection.
In [ ]:
# @YOUSE: Project onto the first k principal axes.
Y = ...
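One possible solution (a sketch), following the same pattern used for the MNIST data later in this notebook: keep the first $k = 2$ rows of VT as the principal axes and project onto them.
In [ ]:
# One possible solution (sketch): project onto the first k = 2 principal axes
k = 2
Y = X.dot (VT[0:k, :].T)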
In [ ]:
def plotly_scatter_2d_labeled (X, x_1=0, x_2=1, labels=""):
    m, d = X.shape
    traces = []
    if d > 1:
        # Scatter the points using coordinates `x_1` and `x_2` of the input `X`
        traces.append (Scatter (x=X[:, x_1], y=X[:, x_2],
                                mode='markers',
                                text=labels))
    else:
        # 1-D data: spread the points along a horizontal line
        traces.append (Scatter (x=X[:, 0], y=[0.0] * m,
                                mode='markers',
                                text=labels))
    fig = Figure (data=traces) #, layout=layout)
    display (py.iplot (fig))
In [ ]:
plotly_scatter_2d_labeled (Y, labels=image_names)
Another fun thing to do is to run k-means on the projected points.
In [ ]:
from scipy.cluster.vq import kmeans, vq
In [ ]:
def points2df (X):
return pd.DataFrame (X, columns=['x_%d' % (i+1) for i in range (X.shape[1])])
# From [Lab 28]
def plot_clustering_k2 (X, centers, clustering):
df = points2df (X)
df['clustering'] = clustering
sns.lmplot(data=df, x="x_1", y="x_2", hue="clustering", fit_reg=False,)
if df['clustering'][0] == 0:
colors = ['b', 'g']
else:
colors = ['g', 'b']
plt.scatter(centers[:,0], centers[:,1], s=500, c=colors, marker=u'*' )
In [ ]:
def split_clusters (clustering):
"""
Given a list of cluster label assignments, 'clustering',
returns a list of lists, 'J[0:k]', 'clustering[J[i]]'
is an array of all points having the same label.
"""
id_label_pairs = list (enumerate (set (clustering.flatten ())))
labels_map = dict ([(v, i) for (i, v) in id_label_pairs])
# Count how many points belong to each cluster
counts = [0] * len (labels_map)
for l in clustering.flatten ():
counts[labels_map[l]] += 1
# Allocate space for each cluster
clusters = [np.zeros (k, dtype=int) for k in counts]
# Separate the points by cluster
counts = [0] * len (labels_map)
for (id_x, l) in enumerate (clustering.flatten ()):
l_id = labels_map[l]
k = counts[l_id]
clusters[l_id][k] = id_x
counts[l_id] += 1
return clusters
def make_clustering_traces_k2 (X, clustering=None, x_1=0, x_2=1, labels=""):
"""
    Returns a list of Plotly-compatible marker traces.
"""
traces = []
if clustering is None:
traces.append (Scatter (x=X[:, x_1:(x_1+1)],
y=X[:, x_2:(x_2+1)],
mode='markers',
text=labels))
else:
clusters = split_clusters (clustering)
for J in clusters:
s_J = Scatter (x=X[J, x_1:(x_1+1)],
y=X[J, x_2:(x_2+1)],
mode='markers',
name="%s" % str (clustering[J[0]]),
text=[labels[j] for j in J])
traces.append (s_J)
return traces
In [ ]:
num_clusters = 3
centers, distortion = kmeans (X, num_clusters)
clustering, _ = vq (X, centers)
traces_X = make_clustering_traces_k2 (Y, clustering, labels=image_names)
In [ ]:
fig = Figure (data=traces_X)
py.iplot (fig)
In [ ]:
centers, distortion = kmeans (Y, num_clusters)
clustering, _ = vq (Y, centers)
traces_Y = make_clustering_traces_k2 (Y, clustering, labels=image_names)
In [ ]:
fig = Figure (data=traces_Y) #, layout=layout)
py.iplot (fig)
In [ ]:
df_kcurve = pd.DataFrame (columns=['k', 'distortion'])
for i in range(1,10):
_, distortion = kmeans (Y, i)
df_kcurve.loc[i] = [i, distortion]
df_kcurve.plot(x="k", y="distortion")
If time permits, we'll use the code fragments below to explore a different data set, based on images of handwritten digits available from: http://yann.lecun.com/exdb/mnist/
In [ ]:
# Download and unpack MNIST digits database
(mnist_images_gz, mnist_labels_gz) = cse6040utils.download_mnist ('training')
print "Images:", mnist_images_gz
print "Labels:", mnist_labels_gz
In [ ]:
images, labels, inds = cse6040utils.load_mnist (mnist_images_gz, mnist_labels_gz,
digits=[1, 8],
return_indices=True)
In [ ]:
print images.shape
print labels.shape
In [ ]:
imshow_gray (images[2, :, :])
In [ ]:
# Compute mean image
mean_image = np.mean (images, 0)
imshow_gray (mean_image)
In [ ]:
X = np.reshape (images - mean_image, (images.shape[0], images.shape[1]*images.shape[2]))
print X.shape
In [ ]:
(U, Sigma, VT) = np.linalg.svd (X, full_matrices=False)
In [ ]:
peek_Sigma (Sigma)
In [ ]:
#imshow_gray (mean_image)
imshow_gray (np.reshape (VT[2, :], (28, 28)))
In [ ]:
# Project onto the first k principal axes
k = 2
Y = X.dot (VT[0:k, :].T)
In [ ]:
annotations = ["[%d] %d" % (i, d) for (i, d) in zip (np.arange (len (Y)), labels)]
print annotations[:5]
In [ ]:
plotly_scatter_2d_labeled (Y, labels=annotations)
In [ ]:
traces_Y = make_clustering_traces_k2 (Y, clustering=labels, labels=annotations)
fig = Figure (data=traces_Y)
py.iplot (fig)
In [ ]:
imshow_gray (images[2705, :, :])
In [ ]:
imshow_gray (images[1407, :, :])
In [ ]:
num_clusters = 2
centers, distortion = kmeans (X, num_clusters)
clustering, _ = vq (X, centers)
In [ ]:
traces_Y = make_clustering_traces_k2 (Y, clustering, labels=annotations)
fig = Figure (data=traces_Y)
py.iplot (fig)
In [ ]:
imshow_gray (images[93, :, :])
Today's notebook uses a bunch of library modules and coding tricks; if you want to learn more, see these references.
Image manipulation
PCA in Python