Flowchart of machine learning algorithms from scikit-learn.
We're staying in the bottom half of the flowchart today - data exploration.
Dimensionality reduction algorithms like PCA, ICA, MDS, t-SNE (we'll get into the acronyms in a second) are all trying to accomplish the same thing: smush your high dimensional data into a palatable number of dimensions (often <10).
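Under the hood the math differs, but in scikit-learn they all share the same `fit_transform` interface (ICA is implemented as `FastICA`, t-SNE as `TSNE`). Here's a minimal sketch on made-up random data, just to show the shapes going in and out:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import MDS, TSNE

# Fake high-dimensional data: 100 samples x 50 features
X = np.random.rand(100, 50)

# Each algorithm smushes the 50 features down to n_components=2
for algorithm in (PCA(n_components=2), FastICA(n_components=2),
                  MDS(n_components=2), TSNE(n_components=2)):
    smushed = algorithm.fit_transform(X)
    print(type(algorithm).__name__, X.shape, '->', smushed.shape)
```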
Matrix decomposition methods break a data matrix $X$ into constituent parts: a weight matrix $W$ that combines the original features into a smaller matrix of components, $Y$:
$Y = WX$
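For example, scikit-learn's PCA stores the weights $W$ in the fitted model's `components_` attribute, so you can check this relationship yourself. A minimal sketch on fake data (one wrinkle: scikit-learn centers $X$ before applying $W$):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)        # fake data: samples x genes
pca = PCA(n_components=3)
Y = pca.fit_transform(X)           # samples x components

W = pca.components_                # components x genes: the weights
X_centered = X - pca.mean_         # PCA subtracts the mean first
print(np.allclose(Y, X_centered @ W.T))  # True: Y = WX, up to centering
```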
These matrix equations may be kind of intimidating, so one way to think about them is as adding up the signal from genes:
$\text{Component } 1 = 10 \cdot \text{gene}_1 - 50 \cdot \text{gene}_2 + 2 \cdot \text{gene}_3 + \ldots$
Depending on the algorithm, the coefficients will have different constraints (they may have to sum to one, or be statistically independent, or something annoying like that), but the idea is the same: summarize the gene expression (features) into fewer components, each of which is a linear combination of the original genes (features).
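As a concrete illustration of those constraints, here's a sketch on fake non-negative data comparing PCA, whose coefficients can take any sign, with NMF, which forces every coefficient to be non-negative:

```python
import numpy as np
from sklearn.decomposition import PCA, NMF

# Fake non-negative expression data: 100 cells x 5 genes
X = np.abs(np.random.randn(100, 5))

# PCA coefficients are unconstrained in sign (the components are
# orthogonal unit vectors)
pca = PCA(n_components=2).fit(X)
print('PCA component 1:', pca.components_[0])

# NMF requires every coefficient (and the data) to be non-negative
nmf = NMF(n_components=2, max_iter=500).fit(X)
print('NMF component 1:', nmf.components_[0])
```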
We'll use a commonly used machine learning dataset of handwritten digits, plus a couple of fake biological datasets, to explore the differences between the algorithms we've looked at so far.
We'll apply what we've learned so far to the Shalek2013 and Macaulay2016 datasets.
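To get started, here's a sketch of loading the digits dataset that ships with scikit-learn and peeking at its shape:

```python
from sklearn import datasets

digits = datasets.load_digits()
print(digits.data.shape)    # (1797, 64): 1797 images x 64 pixel features
print(digits.target[:10])   # the digit (0-9) each image represents
```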