A first date with your data: Exploratory analyses through dimensionality reduction

Flowchart of machine learning from Scikit-learn.

We're staying in the bottom half of the flowchart today: data exploration.

Dimensionality reduction algorithms like PCA, ICA, MDS, and t-SNE (we'll get into the acronyms in a second) are all trying to accomplish the same thing: smush your high-dimensional data into a palatable number of dimensions (often <10).

Matrix decomposition methods: PCA and ICA

Matrix decomposition methods try to factor a matrix $X$ into constituent parts, $W$ and $Y$:

$Y = WX$

These matrix equations can be kind of intimidating, so one way to think about them is as adding up the signal from genes:

$ \text{Component } 1 = 10\,\text{gene}_1 - 50\,\text{gene}_2 + 2\,\text{gene}_3 + \ldots $

Depending on the algorithm, the coefficients will have different constraints (they might have to sum to one, or be independent, or something annoying like that), but the idea is the same: summarize the gene expression (features) into fewer components, each of which is a linear combination of the original genes (features).
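
If it helps to see the bookkeeping, here's a minimal NumPy sketch of that idea. The numbers are made up; the first row of $W$ uses the coefficients from the Component 1 equation above.

In [ ]:
    import numpy as np

    # X holds the expression of 3 genes measured in 5 cells (made-up numbers).
    X = np.random.rand(3, 5)               # genes x cells

    # W holds one row of coefficients per component.
    W = np.array([[10.0, -50.0, 2.0],      # component 1 (from the equation above)
                  [0.5, 3.0, -1.0]])       # component 2 (arbitrary)

    # Y = WX: each row of Y is one component's "signal" in every cell.
    Y = W @ X
    print(Y.shape)                         # (2, 5): 3 genes summarized as 2 components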

PCA
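
A minimal sketch of what PCA looks like in scikit-learn, using a simulated cells × genes matrix rather than the course data:

In [ ]:
    import numpy as np
    from sklearn.decomposition import PCA

    expression = np.random.rand(100, 500)    # simulated: 100 cells x 500 genes

    pca = PCA(n_components=2)
    smushed = pca.fit_transform(expression)  # 100 cells x 2 components

    print(smushed.shape)                     # (100, 2)
    print(pca.explained_variance_ratio_)     # variance captured by each component
    print(pca.components_.shape)             # (2, 500): each component is a linear
                                             # combination of all 500 genes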

ICA
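
A similarly hedged sketch for ICA, using scikit-learn's FastICA on the same kind of simulated matrix; the main difference from PCA is that the components are pushed to be statistically independent rather than orthogonal:

In [ ]:
    import numpy as np
    from sklearn.decomposition import FastICA

    expression = np.random.rand(100, 500)    # simulated: 100 cells x 500 genes

    ica = FastICA(n_components=2, random_state=0)
    smushed = ica.fit_transform(expression)  # 100 cells x 2 components

    print(smushed.shape)          # (100, 2)
    print(ica.components_.shape)  # (2, 500): still linear combinations of genes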

Manifold learning

  • MDS
  • t-SNE
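
Both have the same fit_transform interface in scikit-learn. A rough sketch on simulated data (note that unlike PCA and ICA, these methods only give you the embedded cells, not gene-level coefficients):

In [ ]:
    import numpy as np
    from sklearn.manifold import MDS, TSNE

    expression = np.random.rand(100, 500)    # simulated: 100 cells x 500 genes

    # MDS tries to preserve the pairwise distances between cells.
    mds = MDS(n_components=2, random_state=0)
    mds_smushed = mds.fit_transform(expression)

    # t-SNE tries to preserve local neighborhoods; perplexity is the main knob.
    tsne = TSNE(n_components=2, perplexity=30, random_state=0)
    tsne_smushed = tsne.fit_transform(expression)

    print(mds_smushed.shape, tsne_smushed.shape)   # (100, 2) (100, 2)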

Comparison of methods

We'll use a commonly used machine learning dataset of handwritten digits, plus a couple of fake biological datasets, to explore the differences between all the algorithms we've looked at so far.
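
A hedged sketch of what that comparison could look like on scikit-learn's built-in digits dataset (the plotting in the actual notebook may differ; we subsample to keep MDS and t-SNE fast):

In [ ]:
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA, FastICA
    from sklearn.manifold import MDS, TSNE

    digits = load_digits()
    X, y = digits.data[:500], digits.target[:500]   # 500 images x 64 pixels

    algorithms = {'PCA': PCA(n_components=2),
                  'ICA': FastICA(n_components=2, random_state=0),
                  'MDS': MDS(n_components=2, random_state=0),
                  't-SNE': TSNE(n_components=2, random_state=0)}

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))
    for ax, (name, algorithm) in zip(axes, algorithms.items()):
        smushed = algorithm.fit_transform(X)        # 500 samples x 2 dimensions
        ax.scatter(smushed[:, 0], smushed[:, 1], c=y, s=5, cmap='tab10')
        ax.set_title(name)
    plt.show()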

Application to Shalek2013 and Macaulay2016

We'll apply what we've learned so far to the Shalek2013 and Macaulay2016 datasets.
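
The loading code for those datasets lives in the course materials, so the sketch below just uses a random stand-in DataFrame with the same shape conventions (cells as rows, genes as columns); the pattern would be the same for the real expression matrices:

In [ ]:
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Stand-in for a real (cells x genes) expression DataFrame; in the course
    # notebooks this would be the Shalek2013 or Macaulay2016 expression data.
    expression = pd.DataFrame(np.random.rand(50, 200),
                              index=['cell_%d' % i for i in range(50)],
                              columns=['gene_%d' % j for j in range(200)])

    pca = PCA(n_components=2)
    smushed = pd.DataFrame(pca.fit_transform(expression),
                           index=expression.index, columns=['PC1', 'PC2'])

    smushed.plot.scatter(x='PC1', y='PC2')
    plt.show()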

