"Pseudotime" ordering

What is pseudotime ordering? Pseudotime is a concept introduced to biology by the Monocle paper, Trapnell et al (2014), where, given a population of single cells sampled across different time points, you want to order them not by the order they were harvested, but by the biological time. For simplicity, we will use differentiation as the model system here.

The idea behind this is that cells expressing similar genes are near each other in biological time along differentiation.

Graph-based pseudotime reconstruction

What does "graph-based" psuedotime reconstruction mean? Usually, this means that every cell in the data is a "node" and the algorithm attemps to make the most reasonable lines ("edges") between them. Some algorithms use groups of cells rather then cells, but the idea is the same.

What matters to you is that every cell becomes a data point, and it becomes important that you have densely sampled the space of biological features. Meaning, that your cells are generally very close together on a biological time scale, and each cell is a gradual step along the changing process. Thus, if you have groups of cells that are harvested across very different time scales (e.g. weeks or months), then psuedotime is probably not as applicable since the differences in gene expression between harvestings are probably the same as the pseudotime ordering.

Since these are graph-based, there's potentially $n^2$ (one between each pair of cells, technically $\left(\frac{n-1}{2}\right)^2$ but $n^2$ is close enough) edges and thus $\binom{n-1}{n^2} = ??$ potential paths between cells. As we expect to have hundreds, if not thousands of cells, and important first step is to reduce this enourmous graph into something more manageable.

To summarize, these algorithms can largely be described in two steps:

  1. Start with a subset of the complete cell-cell graph.
  2. Use some criteria to refine the initial graph and come up with the psuedotime ordering.

Monocle

Monocle is an algorithm which takes every cell in your dataset and fits a minimum spanning tree to the data, meaning it creates edges between cells such that the total distance between all connected cells is minimized.

Minimum spanning tree (MST)

A minimum spanning tree is a subset of the graph that touches every point with minimum cost - i.e. minimum weight.

Since data is noisy and not EVERY cell exactly follows the differentiation trajectory, Monocle then uses a very clever (and subtle) data structure called a $PQ$-tree to find the backbone of the tree, which then defines the ordering.

Wanderlust

Wanderlust fits a $k$-nearest neighbors ($k$-NN) graph to the data, meaning, all cells within distance $k$ are connected, and then cuts down the graph by finding the shortest path through the graph subset.

Wishbone

Wishbone, from the same lab as the Wanderlust paper, also uses a $k$-nearest neighbors graph but it is the only graph based algorithm so far that allows for branches in the path.