In [ ]:
from os import path
import glob
from astropy.table import Table
import astropy.units as u
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('notebook.mplstyle')
%matplotlib inline
import sklearn
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_lfw_people
from PIL import Image
from IPython.display import display
import ipywidgets as widgets
Astronomical data is often "big data" in two ways: many objects and many features. Take a spectroscopic survey, for example -- you can think of the flux measured at each wavelength as an individual feature! In that context, each object or data point could have thousands of features. With data that is this high-dimensional, how do you visualize it? How do you know where the information is?
We know from other work that stellar spectra can typically be characterized by just a few parameters, things like $T_{\rm eff}$, $\log g$, $[{\rm Fe}/{\rm H}]$, etc. This suggests that there is a mapping from "fluxes" to some much lower-dimensional space that preserves most of the information. You might call this "learning the parameters of a model," "data compression," or "feature extraction" (or all of the above). The goal of dimensionality reduction is to approximate this transformation while preserving most of the information. PCA is a particular dimensionality reduction method that uses a linear transformation for this step: it projects the data onto a set of vectors aligned with the directions of maximum variance in the data.
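To make that concrete (a brief sketch, with notation introduced only for this note): if $X$ is an $N \times D$ matrix of mean-subtracted data (one row per object, one column per feature), PCA builds a $D \times k$ matrix $W$ whose columns are the unit eigenvectors of the covariance of $X$ with the $k$ largest eigenvalues, and the reduced representation is
$$Y = X\,W$$
with $k \ll D$, so each object is described by $k$ numbers instead of $D$.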
Why does more variance correspond to more information? Imagine a 2D distribution of points aligned with the coordinate axes, with a large variance in one direction and a small variance in the other:
In [ ]:
x = np.random.normal(0, 1., size=1024)
y = np.random.normal(0, 0.1, size=x.size)
plt.figure(figsize=(5,5))
plt.scatter(x, y, alpha=0.25, marker='.')
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
The feature $x_2$ is not very informative, i.e. the data have a small variance in this dimension. We would therefore preserve most of the information about the data by ignoring this dimension and just using $x_1$ in whatever classification problem or analysis we do with the data. For small numbers of dimensions, reducing the dimensionality doesn't help much in analysis, visualization, storage, etc. But for large numbers of dimensions, reducing the dimensionality can be crucial for interpretability or making problems tractable.
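As a quick numerical check (using the `x` and `y` arrays from the cell above), the sample variances make this asymmetry explicit:
In [ ]:
# variance along each dimension -- x_1 carries almost all of the spread
np.var(x), np.var(y)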
Disclaimer: There are a number of ways to interpret or present the concepts behind PCA -- this is just the way I think of it!
The "principal components" of a dataset are the eigenvectors of the empirical covariance matrix of the data ordered by the magnitude of the corresponding eigenvalues. As a procedure, PCA is just a projection of the original data onto a subspace defined by a subset of the eigenvectors. Let's unpack that a bit...
Let's imagine we observe a few quantities (features) that end up being highly correlated -- say, the size and the absolute magnitude of a bunch of galaxies.
In [ ]:
X = np.loadtxt('demo-data/correlated_data.csv', delimiter=',')
X.shape
In [ ]:
fig,axes = plt.subplots(2, 2, figsize=(6,6), sharex=True, sharey=True)
axes[0,0].plot(X[:,0], X[:,1], marker='.', alpha=0.5, linestyle='none')
axes[1,0].plot(X[:,0], X[:,2], marker='.', alpha=0.5, linestyle='none')
axes[1,1].plot(X[:,1], X[:,2], marker='.', alpha=0.5, linestyle='none')
axes[0,0].set_ylabel('x2')
axes[1,0].set_ylabel('x3')
axes[1,0].set_xlabel('x1')
axes[1,1].set_xlabel('x2')
axes[0,0].set_xlim(-10, 10)
axes[0,0].set_ylim(-10, 10)
fig.tight_layout()
axes[0,1].set_visible(False)
Most of the variance in these data is along one particular direction; we might therefore be able to do some analysis of these data by considering just that direction. But in general, that direction might be some combination of the features that you measure. With PCA, we assume that the eigenvectors computed from the empirical covariance matrix describe these directions. If we rank the eigenvectors by their eigenvalues, the directions will be ordered by decreasing importance:
In [ ]:
Cov = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eig(Cov)
We sort the eigenvectors by decreasing eigenvalue and project the data onto this eigenbasis. The 0th feature of the projected data is now the most informative, and so on.
In [ ]:
idx = eig_vals.argsort()[::-1]  # sort eigenvalues in decreasing order
Y = X @ eig_vecs[:, idx]  # project the data onto the sorted eigenbasis
In [ ]:
fig,axes = plt.subplots(1, 2, figsize=(10,5), sharex=True, sharey=True)
axes[0].plot(Y[:,0], Y[:,1], marker='.', alpha=0.5, linestyle='none')
axes[1].plot(Y[:,0], Y[:,2], marker='.', alpha=0.5, linestyle='none')
axes[0].set_xlim(-10, 10)
axes[0].set_ylim(-10, 10)
axes[0].set_xlabel('$x_1$')
axes[0].set_ylabel('$x_2$')
axes[1].set_xlabel('$x_1$')
axes[1].set_ylabel('$x_3$')
fig.tight_layout()
One way to quantify how much of the variance (or information) is contained along each eigenvector is to compute the cumulative explained variance (CEV). At a given eigenvector, this tells you what fraction of the total variance is explained by that eigenvector and all eigenvectors preceding it (those with larger eigenvalues): ${\rm CEV}_k = \sum_{i \le k} \lambda_i \big/ \sum_i \lambda_i$, with the eigenvalues $\lambda_i$ sorted in decreasing order. Here's the CEV for the example above:
In [ ]:
np.cumsum(eig_vals[idx]) / np.sum(eig_vals)  # cumulative explained variance, eigenvalues in decreasing order
This tells us that the 0th eigenvector alone contains ~98% of the variance of the distribution, and 99.9% is contained within the first two. You have to decide what fraction of the explained variance is acceptable to you, and truncate the eigenvectors there. Here, if we arbitrarily say we want >95% of the explained variance, we would keep just the first eigenvector.
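If you'd rather not read the cutoff off by eye, here is one way to pick it programmatically (a sketch using the sorted eigenvalues from above; the 95% threshold is an arbitrary choice):
In [ ]:
cev = np.cumsum(eig_vals[idx]) / np.sum(eig_vals)
n_keep = np.searchsorted(cev, 0.95) + 1  # smallest number of eigenvectors with CEV >= 95%
n_keep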
Here we're going to do an example of PCA with some more interesting, high-dimensional data: pictures of faces. Each image is 62 x 47 pixels (why would they do that to us!), so if we treat each pixel as a feature, that's 2914 features per image, i.e. per object. We'll do a PCA on that large feature space and see how many "eigenfaces" we need to explain 95% of the variance (i.e., to do a pretty good job of representing faces).
Note: this is heavily inspired by Jake Vanderplas' notebook.
In [ ]:
# fetch the face image data
faces = fetch_lfw_people(min_faces_per_person=32)
In [ ]:
faces.images.shape, faces.data.shape
We're using a sample dataset from scikit-learn -- here's what some of the images look like:
In [ ]:
fig, axes = plt.subplots(4, 8, figsize=(9, 6),
                         subplot_kw={'xticks': [], 'yticks': []})
for i, ax in enumerate(axes.flat):
    ax.imshow(faces.images[i], cmap='Greys_r')
fig.tight_layout()
We can also plot the pixel "flux" values as if they are features / dimensions of the data. Here are a few projections of the data:
In [ ]:
fig,axes = plt.subplots(2, 2, figsize=(8,8), sharex=True, sharey=True)
axes[0,0].scatter(faces.images[:,0,0], faces.images[:,32,16], marker='.')
axes[0,1].scatter(faces.images[:,11,21], faces.images[:,25,13], marker='.')
axes[1,0].scatter(faces.images[:,13,41], faces.images[:,59,12], marker='.')
axes[1,1].scatter(faces.images[:,2,20], faces.images[:,4,11], marker='.')
axes[0,0].set_xlim(-5, 260)
axes[0,0].set_ylim(-5, 260)
fig.tight_layout()
Let's now perform a PCA on the image data and keep the largest 256 eigenvectors (which are themselves "images"). Here, we'll use the PCA implementation in scikit-learn, which doesn't actually compute eigenvectors directly -- it does an SVD of the input data, which is a bit more numerically stable:
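To make the SVD connection concrete, here is a rough numpy sketch of the equivalent computation (just to illustrate the relationship, not a line-for-line description of scikit-learn's internals):
In [ ]:
# PCA via SVD of the mean-subtracted data matrix (illustrative sketch)
Xc = faces.data - faces.data.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# the rows of Vt are the principal directions (the eigenfaces), and the
# singular values give the explained variances:
explained_variance = S**2 / (Xc.shape[0] - 1)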
In [ ]:
pca = PCA(256)
pca.fit(faces.data)
Let's visualize the top eigenfaces:
In [ ]:
fig, axes = plt.subplots(4, 8, figsize=(9, 6),
                         subplot_kw={'xticks': [], 'yticks': []})
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(62, 47), cmap='Greys_r')
fig.tight_layout()
Apart from being slightly terrifying, it's cool! You can see components needed for shading faces, making eyes darker, making noses wider, etc. Let's look at the cumulative explained variance:
In [ ]:
plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='')
plt.xlabel('number of components')
plt.ylabel('cumulative variance');
plt.ylim(0, 1)
plt.xscale('log', base=2)
plt.axvline(faces.images[0].size, color='r')
The red line is at 2914, the number of input features. How many eigenfaces do we need to use to preserve 95% of the variance?
In [ ]:
# index at which the cumulative explained variance is closest to 0.95
np.argmin(np.abs(np.cumsum(pca.explained_variance_ratio_) - 0.95))
Only 170! That's a big data compression.
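To put a rough number on that (a back-of-the-envelope check that ignores the one-time cost of storing the 170 eigenfaces themselves):
In [ ]:
# per-image compression factor: 2914 pixel values -> ~170 PCA coefficients
2914 / 170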
Let's now use the eigenvectors defined from the scikit-learn face dataset to represent a few more images of faces that aren't in the training set. We'll load those and store them in the array `other_faces`:
In [ ]:
other_faces = []
for filename in glob.glob('demo-data/face*.jpg'):
    image_file = Image.open(filename)
    image = image_file.convert('L')  # convert image to monochrome
    image = np.array(image).astype(float)
    other_faces.append(image)

n_other = len(other_faces)
other_faces = np.array(other_faces)
other_faces = other_faces.reshape(n_other, -1)
print("{0} other faces".format(len(other_faces)))
In [ ]:
image_shape = faces.images[0].shape
We'll now write a function that will show what these new images look like compressed to a given number of eigenvectors of the scikit-learn training set.
In [ ]:
def do_pca(n_components):
    fig, ax = plt.subplots(1, 4, figsize=(7.5, 2.75),
                           subplot_kw={'xticks': [], 'yticks': []})

    # re-fit the PCA keeping only n_components eigenfaces
    pca = PCA(n_components)
    pca.fit(faces.data)

    for i in range(len(other_faces)):
        # project each new face onto the eigenfaces, then reconstruct it
        components = pca.transform(other_faces[i].reshape(1, -1))
        projected = pca.inverse_transform(components)
        ax.flat[i].imshow(projected[0].reshape(image_shape), cmap='binary_r')

    fig.suptitle('{}-dim reconstruction'.format(pca.n_components), fontsize=22, y=0.98)
    fig.tight_layout()
    fig.subplots_adjust(top=0.9)
In [ ]:
do_pca(16)
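Since `ipywidgets` is imported at the top, one way to explore the reconstruction dimension interactively is with a slider (a sketch; the slider range is an arbitrary choice, and note that `do_pca` refits the PCA on every update, so large values can be slow):
In [ ]:
widgets.interact(do_pca, n_components=widgets.IntSlider(min=1, max=256, step=1, value=16))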
In [ ]: