The PCA section of this notebook was put together by [Jake Vanderplas](http://www.vanderplas.com). Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_tutorial/).
In [ ]:
from __future__ import print_function, division
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting style defaults
import seaborn as sns; sns.set()
In [ ]:
np.random.seed(1)
X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T
plt.plot(X[:, 0], X[:, 1], 'o')
plt.axis('equal');
We can see that there is a definite trend in the data. What PCA seeks to do is to find the Principal Axes in the data, and explain how important those axes are in describing the data distribution:
In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_)
print(pca.components_)
To see what these numbers mean, let's view them as vectors plotted on top of the data:
In [ ]:
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)
plt.axis('equal');
Notice that one vector is longer than the other. In a sense, this tells us that the corresponding direction in the data is somehow more "important" than the other direction. The explained variance quantifies this notion of "importance" for each direction.
Another way to think of it is that the second principal component could be completely ignored without much loss of information! Let's see what our data look like if we only keep 95% of the variance:
In [ ]:
clf = PCA(0.95) # keep 95% of variance
X_trans = clf.fit_transform(X)
print(X.shape)
print(X_trans.shape)
By specifying that we want to throw away 5% of the variance, the data are compressed by 50%: from two dimensions down to one. Let's see what the data look like after this compression:
In [ ]:
X_new = clf.inverse_transform(X_trans)
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.2)
plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)
plt.axis('equal');
The light points are the original data, while the dark points are the projected version. We see that after truncating 5% of the variance of this dataset and then reprojecting it, the "most important" features of the data are maintained, and we've compressed the data by 50%!
This is the sense in which "dimensionality reduction" works: if you can approximate a data set in a lower dimension, you can often have an easier time visualizing it or fitting complicated models to the data.
Dimensionality reduction might seem a bit abstract in two dimensions, but this sort of projection can be extremely useful when visualizing high-dimensional data. Let's take a quick look at the application of PCA to the digits data we looked at before:
In [ ]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
In [ ]:
pca = PCA(2) # project from 64 to 2 dimensions
Xproj = pca.fit_transform(X)
print(X.shape)
print(Xproj.shape)
In [ ]:
plt.scatter(Xproj[:, 0], Xproj[:, 1], c=y, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('tab10', 10))
plt.colorbar();
We could also make the same plot using Altair and pandas:
In [ ]:
import pandas as pd

digits_smushed = pd.DataFrame(Xproj)
digits_smushed['target'] = digits.target
digits_smushed.head()
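Here is a hedged sketch of the Altair half of that plot (assuming the altair package is installed; the columns are renamed only because Altair requires string column names):
In [ ]:
import altair as alt

# Altair needs string column names; rename the integer columns from above
smushed_named = digits_smushed.rename(columns={0: 'pc1', 1: 'pc2'})
alt.Chart(smushed_named).mark_circle(opacity=0.5).encode(
    x='pc1', y='pc2', color='target:N')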
This gives us an idea of the relationship between the digits. Essentially, we have found the optimal stretch and rotation in 64-dimensional space that allows us to see the layout of the digits, without reference to the labels.
PCA is a very useful dimensionality reduction algorithm, because it has a very intuitive interpretation via eigenvectors. The input data is represented as a vector: in the case of the digits, our data is
$$ x = [x_1, x_2, x_3 \cdots] $$
but what this really means is
$$ image(x) = x_1 \cdot{\rm (pixel~1)} + x_2 \cdot{\rm (pixel~2)} + x_3 \cdot{\rm (pixel~3)} \cdots $$
If we reduce the dimensionality in the pixel space to (say) 6, we recover only a partial image:
In [ ]:
from decompositionplots import plot_image_components
sns.set_style('white')
plot_image_components(digits.data[0])
But the pixel-wise representation is not the only choice. We can also use other basis functions, and write something like
$$ image(x) = {\rm mean} + x_1 \cdot{\rm (basis~1)} + x_2 \cdot{\rm (basis~2)} + x_3 \cdot{\rm (basis~3)} \cdots $$
What PCA does is to choose optimal basis functions so that only a few are needed to get a reasonable approximation. The low-dimensional representation of our data is the coefficients of this series, and the approximate reconstruction is the result of the sum:
In [ ]:
from decompositionplots import plot_pca_interactive
plot_pca_interactive(digits.data)
Here we see that with only six PCA components, we recover a reasonable approximation of the input!
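To make that reconstruction concrete, here is a minimal sketch of the sum above, written out with scikit-learn's PCA attributes (mean_ and components_); this reproduces what inverse_transform does for six components:
In [ ]:
# Reconstruct the first digit by hand: mean + sum of coefficient * basis
pca6 = PCA(n_components=6).fit(digits.data)
coeffs = pca6.transform(digits.data[:1])                  # shape (1, 6)
approx = pca6.mean_ + np.dot(coeffs, pca6.components_)    # the truncated series
print(np.abs(digits.data[0] - approx[0]).mean())          # mean absolute pixel error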
Thus we see that PCA can be viewed from two angles: as dimensionality reduction, or as a form of lossy data compression in which the discarded components are dominated by noise. In this way, PCA can also be used as a filtering process.
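As a quick sketch of that filtering idea (the noise level here is an arbitrary choice, not from the original notebook): add pixel noise to the digits, keep only the high-variance components, and reconstruct:
In [ ]:
# Sketch: PCA as a noise filter on the digits pixels
noisy = X + np.random.normal(0, 4, X.shape)   # add Gaussian pixel noise
pca_filter = PCA(0.50).fit(noisy)             # keep 50% of the variance
filtered = pca_filter.inverse_transform(pca_filter.transform(noisy))
print(noisy.shape, filtered.shape)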
In [ ]:
sns.set()
pca = PCA().fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
Here we see that our two-dimensional projection loses a lot of information (as measured by the explained variance) and that we'd need about 20 components to retain 90% of the variance. Looking at this plot for a high-dimensional dataset can help you understand the level of redundancy present in multiple observations.
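If you want that number of components programmatically rather than reading it off the plot, a one-liner along these lines works (the 0.90 threshold matches the discussion above):
In [ ]:
cumvar = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cumvar >= 0.90) + 1)   # smallest number of components reaching 90%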
Note that scikit-learn contains many other unsupervised dimensionality reduction routines. Each of these has its own strengths & weaknesses, and areas of application; you can read about them on the scikit-learn website. Two other techniques which are useful to know about, and which we explore below, are ICA and NMF.
ICA was originally created for the "cocktail party problem" in audio processing. Imagine you are at a cocktail party with multiple sources of sound (people talking, the band playing, glasses clinking) and multiple microphones stationed throughout the room, each hearing a different mixture of these sounds. How do you separate out the independent sources? It's an incredible feat that our brains are able to filter out all these different sources of audio, automatically!
Example adapted from the excellent scikit-learn documentation.
In [ ]:
import fig_code
fig_code.cocktail_party()
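A minimal FastICA sketch of the same idea, on toy signals rather than the figure's audio (the signal shapes and mixing matrix are arbitrary choices here):
In [ ]:
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two independent sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # mixing matrix
X_mix = np.dot(S, A.T)                             # what the "microphones" hear
S_est = FastICA(n_components=2, random_state=0).fit_transform(X_mix)
print(S_est.shape)                                 # recovered sources, up to scale/order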
NMF is like ICA in that it tries to learn the parts of the data that make up the whole, by factoring the data matrix into two non-negative matrices whose product reconstructs it. This was originally published by Lee and Seung, "Learning the parts of objects by non-negative matrix factorization", where it was applied to image data.
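As a minimal sketch of NMF in scikit-learn (the parameters here are illustrative choices): the digits pixels are non-negative, so the data matrix factors into coefficients W and non-negative "parts" H:
In [ ]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=6, init='nndsvda', random_state=0, max_iter=500)
W = nmf.fit_transform(digits.data)   # per-image coefficients
H = nmf.components_                  # six non-negative "parts" (basis images)
print(W.shape, H.shape)              # data is approximately W @ H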
Enough images and signal processing ... where is the RNA!??!? Let's apply these algorithms to some biological datasets.
We'll use the 300-cell dataset (6 clusters, 50 cells each) from the Macosko2015 paper.
Rather than plotting each cell in each component, we'll look at the mean (or median) contribution of each component to the cell types.
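The helper below does this interactively; as a rough sketch of the underlying idea, using the digits data as a stand-in for the Macosko2015 cells (the real helper loads the cell data itself):
In [ ]:
# Sketch: mean of each component within each group (digits stand in for cells)
smushed = pd.DataFrame(PCA(n_components=5).fit_transform(digits.data))
smushed['target'] = digits.target
smushed.groupby('target').mean()   # or .median()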
In [ ]:
from decompositionplots import explore_smushers
explore_smushers()
Discuss the questions below while you play with the sliders.
1. How does using the mean or median affect your interpretation?
2. How does the number of components influence the decomposition by PCA? (indicate all that apply)