```
In [ ]:
```from os import path
import glob
from astropy.table import Table
import astropy.units as u
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('notebook.mplstyle')
%matplotlib inline
import sklearn
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_lfw_people
from PIL import Image
from IPython.display import display
import ipywidgets as widgets

Astronomical data is often "big data" in two ways: many objects *and* many features. For example, a spectroscopic survey -- you can think of the fluxes measured at each wavelength as being individual features! In that context, each object or data point could have ~1000's of features. With data that is this high-dimensional, how do you visualize it? How do you know where the information is?

We know from other work that stellar spectra can be characterized typically by just a few parameters, things like $T_{\rm eff}$, $\log g$, $[{\rm Fe}/{\rm H}]$, etc. This suggests that there is a mapping from "fluxes" to some much lower-dimensional space that preserves most of the information. You might call this "learning the parameters of a model," "data compression," or "feature extraction" (or all). The goal of dimensionality reduction is to approximate this transformation while preserving most of the *information*. PCA is a particular dimensionality reduction method that uses a linear transformation for this step. PCA therefore corresponds to projecting the data onto a set of vectors aligned with the directions of maximum *variance* in the data.

Why does more variance correspond to more information? Imagine 2D distribution of points aligned with axes with large variance in one direction, small variance in the other:

```
In [ ]:
```x = np.random.normal(0, 1., size=1024)
y = np.random.normal(0, 0.1, size=x.size)
plt.figure(figsize=(5,5))
plt.scatter(x, y, alpha=0.25, marker='.')
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')

*Disclaimer: There are a number of ways to interpret or present the concepts behind PCA -- this is just the way I think of it!*

The "principal components" of a dataset are the eigenvectors of the empirical covariance matrix of the data ordered by the magnitude of the corresponding eigenvalues. As a procedure, PCA is just a projection of the original data onto a subspace defined by a subset of the eigenvectors. Let's unpack that a bit...

Let's imagine we observe a few quantities (features) that end up being highly correlated, say, the size and the absolute magnitude of a bunch of galaxies

```
In [ ]:
```X = np.loadtxt('demo-data/correlated_data.csv', delimiter=',')
X.shape

```
In [ ]:
```fig,axes = plt.subplots(2, 2, figsize=(6,6), sharex=True, sharey=True)
axes[0,0].plot(X[:,0], X[:,1], marker='.', alpha=0.5, linestyle='none')
axes[1,0].plot(X[:,0], X[:,2], marker='.', alpha=0.5, linestyle='none')
axes[1,1].plot(X[:,1], X[:,2], marker='.', alpha=0.5, linestyle='none')
axes[0,0].set_ylabel('x2')
axes[1,0].set_ylabel('x3')
axes[1,0].set_xlabel('x1')
axes[1,1].set_xlabel('x2')
axes[0,0].set_xlim(-10, 10)
axes[0,0].set_ylim(-10, 10)
fig.tight_layout()
axes[0,1].set_visible(False)

```
In [ ]:
```Cov = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eig(Cov)

```
In [ ]:
```idx = eig_vals.argsort()
Y = X[:,idx] @ eig_vecs[idx] # projection

```
In [ ]:
```fig,axes = plt.subplots(1, 2, figsize=(10,5), sharex=True, sharey=True)
axes[0].plot(Y[:,0], Y[:,1], marker='.', alpha=0.5, linestyle='none')
axes[1].plot(Y[:,0], Y[:,2], marker='.', alpha=0.5, linestyle='none')
axes[0].set_xlim(-10, 10)
axes[0].set_ylim(-10, 10)
axes[0].set_xlabel('$x_1$')
axes[0].set_ylabel('$x_2$')
axes[1].set_xlabel('$x_1$')
axes[1].set_ylabel('$x_3$')
fig.tight_layout()

*cumulative explained variance*. This tells you, at a given eigenvector, what fraction the total variance is explained by the given eigenvector and all eigenvectors preceding it (with larger eigenvalues). Heres the CEV for the example above:

```
In [ ]:
```np.cumsum(eig_vals) / np.sum(eig_vals)

*alone* contains ~98% of the variance of the distribution, and 99.9% is contained with the first two. You have to decide what fraction of the explained variance is OK with you, and truncate the eigenvectors there. Here, if we arbitrarily say we want >95% of the explained variance, we would just keep the first eigenvector.

Here we're going to do an example of PCA with some more interesting, high-dimensional data: pictures of faces. Each image is 62 x 47 (why would they do that to us!), so if we treat each pixel as a feature, that's 2914 features *per image* or *per object*. We'll do a PCA on that large feature-space and see how many "eigenfaces" we need to explain 95% of the variance (i.e. do a pretty good job at representing faces).

*Note: this is heavily inspired by Jake Vanderplas' notebook*.

```
In [ ]:
```# fetch the face image data
faces = fetch_lfw_people(min_faces_per_person=32)

```
In [ ]:
```faces.images.shape, faces.data.shape

```
In [ ]:
```fig, axes = plt.subplots(4, 8, figsize=(9, 6),
subplot_kw={'xticks':[], 'yticks':[]})
for i, ax in enumerate(axes.flat):
ax.imshow(faces.images[i], cmap='Greys_r')
fig.tight_layout()

```
In [ ]:
```fig,axes = plt.subplots(2, 2, figsize=(8,8), sharex=True, sharey=True)
axes[0,0].scatter(faces.images[:,0,0], faces.images[:,32,16], marker='.')
axes[0,1].scatter(faces.images[:,11,21], faces.images[:,25,13], marker='.')
axes[1,0].scatter(faces.images[:,13,41], faces.images[:,59,12], marker='.')
axes[1,1].scatter(faces.images[:,2,20], faces.images[:,4,11], marker='.')
axes[0,0].set_xlim(-5, 260)
axes[0,0].set_ylim(-5, 260)
fig.tight_layout()

`scikit-learn`

which actually doesn't compute eigenvectors (it does an SVD on the input data) and is a bit more numerically stable:

```
In [ ]:
```pca = PCA(256)
pca.fit(faces.data)

Let's visualize the top eigenfaces:

```
In [ ]:
```fig, axes = plt.subplots(4, 8, figsize=(9, 6),
subplot_kw={'xticks':[], 'yticks':[]})
for i, ax in enumerate(axes.flat):
ax.imshow(pca.components_[i].reshape(62, 47), cmap='Greys_r')
fig.tight_layout()

```
In [ ]:
```plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='')
plt.xlabel('number of components')
plt.ylabel('cumulative variance');
plt.ylim(0, 1)
plt.xscale('log', basex=2)
plt.axvline(faces.images[0].size, color='r')

```
In [ ]:
```np.argmin(np.abs(np.cumsum(pca.explained_variance_ratio_) - 0.95))

Only 170! That's a big data compression.

*aren't* in the training set. We'll load those and save them as the array "other faces":

```
In [ ]:
```other_faces = []
for filename in glob.glob('demo-data/face*.jpg'):
image_file = Image.open(filename)
image = image_file.convert('L') # convert image to monochrome
image = np.array(image).astype(float)
other_faces.append(image)
n_other = len(other_faces)
other_faces = np.array(other_faces)
other_faces = other_faces.reshape(n_other, -1)
print("{0} other faces".format(len(other_faces)))

```
In [ ]:
```image_shape = faces.images[0].shape

```
In [ ]:
```def do_pca(n_components):
fig, ax = plt.subplots(1, 4, figsize=(7.5,2.75),
subplot_kw={'xticks':[], 'yticks':[]})
pca = PCA(n_components)
pca.fit(faces.data)
for i in range(len(other_faces)):
components = pca.transform(other_faces[i].reshape(1, -1))
projected = pca.inverse_transform(components)
ax.flat[i].imshow(projected[0].reshape(image_shape), cmap='binary_r')
fig.suptitle('{}-dim reconstruction'.format(pca.n_components), fontsize=22, y=0.98)
fig.tight_layout()
fig.subplots_adjust(top=0.9)

```
In [ ]:
```do_pca(16)

```
In [ ]:
```