Manifold learning

What is a manifold?

Manifolds are multi-dimensional surfaces that look flat when you are zoomed in very close, but are curved when viewed from far away. The classic example of a manifold is a torus, or "donut": you could reshape a coffee mug into a donut by melting it while preserving its essential feature - the handle or "hole."

The idea behind manifold embedding algorithms is to preserve as much of the manifold's high-dimensional structure as possible while plotting the data in two dimensions.

The math behind these algorithms is actually quite simple. We want to convert each point in high dimensions to a point in two dimensions:

  • High dimensional data: $M$ samples $\vec{x}_i$ in $N$-dimensional gene space - $\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_M$
  • Low dimensional data: $\vec{y}_i = (y_{i, 1}, y_{i, 2})$ (2-dimensional Cartesian plane) - $\vec{y}_1, \vec{y}_2, \ldots, \vec{y}_M$

Visually, you can think of converting each sample $i$'s $N$-dimensional gene expression vector $\vec{x}_i$ to a length-2 vector $\vec{y}_i$:

$ \begin{bmatrix} x_{i, 1} \\ x_{i, 2} \\ \vdots \\ x_{i, N} \end{bmatrix} \rightarrow \begin{bmatrix} y_{i, 1} \\ y_{i, 2} \end{bmatrix} $

We'll compare MDS and t-SNE side by side after a brief introduction to each.

Multidimensional scaling (MDS)

Multidimensional scaling is an algorithm that tries to preserve all pairwise distances between the points in the dataset as faithfully as possible when placing them in the low-dimensional space.
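To make the distance-preserving idea concrete, here is a minimal sketch of classical (Torgerson) MDS written with NumPy only. This is one simple variant of MDS, not necessarily the one used by the interactive widget later in this notebook. The toy data, the helper names `pairwise_distances` and `classical_mds`, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(X, n_components=2):
    """Classical MDS: embed the rows of X so that Euclidean distances
    are preserved as well as possible in n_components dimensions."""
    D = pairwise_distances(X)
    n = D.shape[0]
    # Double-center the squared distances to obtain a Gram matrix
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # The largest eigenvalues and their eigenvectors give the coordinates
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

rng = np.random.default_rng(0)
# Made-up data: 50 samples that really live on a 2D plane,
# embedded in a 20-dimensional "gene space" with a little noise
latent = rng.normal(size=(50, 2))
X = latent @ rng.normal(size=(2, 20)) + 0.01 * rng.normal(size=(50, 20))

Y = classical_mds(X)
print(X.shape, "->", Y.shape)  # (50, 20) -> (50, 2)

# Compare each pairwise distance before and after the embedding
upper = np.triu_indices(50, k=1)
D_high = pairwise_distances(X)[upper]
D_low = pairwise_distances(Y)[upper]
corr = np.corrcoef(D_high, D_low)[0, 1]
print(f"correlation of high- vs. low-dimensional distances: {corr:.3f}")
```

Because this toy data is essentially two-dimensional, the 2D distances track the original ones very closely. Real expression data rarely has that property, so some distortion is expected.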

Whiteboard explanation and discussion

t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is related to MDS, but instead of trying to preserve every pairwise distance, it focuses on neighborhoods: things that are close together in the high-dimensional data should stay close together in 2D. Things that were far apart should generally end up apart as well, although the sizes of those larger distances are not faithfully preserved.
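As a rough illustration, the sketch below runs t-SNE using scikit-learn's `TSNE` class on made-up data. The two sample groups, the perplexity value, and the choice of PCA initialization with a Euclidean metric are assumptions picked for the example. They are not the settings used by the interactive widget later in this notebook.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Made-up data: two groups of 30 samples in a 20-dimensional "gene space"
group_a = rng.normal(loc=0.0, size=(30, 20))
group_b = rng.normal(loc=3.0, size=(30, 20))
X = np.vstack([group_a, group_b])

tsne = TSNE(
    n_components=2,      # embed into the 2D plane
    perplexity=10,       # must be smaller than the number of samples
    init="pca",          # alternative: "random"
    metric="euclidean",  # alternative: "cityblock"
    random_state=0,      # fix the seed so the layout is reproducible
)
Y = tsne.fit_transform(X)
print(X.shape, "->", Y.shape)  # (60, 20) -> (60, 2)
```

The `init` and `metric` arguments correspond to the initialization and distance-metric choices raised in the discussion questions, so they are worth varying on your own data.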

Warnings

  • Do NOT use t-SNE for clustering!! We will explore how t-SNE can arbitrarily push some cells farther away and create what look like clusters but really aren't.
  • t-SNE is merely a visualization technique - it's a way that you can view ONE perspective of your data, but is NOT the final view
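One way to see why a t-SNE plot is only one perspective is to run it twice with different random seeds. The sketch below assumes scikit-learn's `TSNE` and uses made-up data with no real cluster structure. The helper name `run_tsne` and the parameter values are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Made-up data: 60 samples drawn from a single Gaussian blob
X = rng.normal(size=(60, 20))

def run_tsne(seed):
    """Embed X with a random initialization controlled by `seed`."""
    model = TSNE(n_components=2, perplexity=10, init="random", random_state=seed)
    return model.fit_transform(X)

Y_seed0 = run_tsne(0)
Y_seed1 = run_tsne(1)

# The same data gives a different picture for each seed
same_layout = np.allclose(Y_seed0, Y_seed1)
print("identical layouts across seeds:", same_layout)
```

Plotting `Y_seed0` and `Y_seed1` side by side is a quick check on whether an apparent grouping is stable or just a product of one particular run.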

Whiteboard explanation and discussion


In [ ]:
import random

In [ ]:
random.random()

In [ ]:
%load_ext autoreload
%autoreload 2

In [ ]:
%matplotlib inline

# Launch the interactive widget (with sliders) for exploring MDS and t-SNE
from decompositionplots import explore_manifold
explore_manifold()

Discussion

While you're playing with the sliders above, discuss the questions below.

  1. How can you interpret the x- and y- axes of MDS and t-SNE?
    • The x- and y-axes are purely for visualization purposes
    • The x- and y-axes represent the highly varying features across samples
    • The x- and y-axes represent independent signals from the samples
  2. For this data, which algorithm projects the data into more independent clusters?
    • MDS
    • t-SNE
  3. For this data, which algorithm is better at presenting the continuum of data?
    • MDS
    • t-SNE
  4. For this data, which distance metric produces the most distinct clusters in t-SNE?
    • Euclidean
    • Cityblock
  5. For this data, does t-SNE preserve the original structure better with a PCA or random initialization?
    • PCA
    • Random
  6. How does adding noise affect the MDS and t-SNE embeddings?