How does gene dropout affect my results?

A key issue in single-cell anything-seq is that you can't capture every molecule present in each cell. As a result, your data contain many more zeros than the underlying biology truly has. We will discuss:

  • How does gene dropout affect your interpretation of the results?
  • What computational tricks can you use to avoid making conclusions that are a result of lack of data?

To be able to talk about this, we need to introduce some computational concepts. Here, we will talk about:

  • Correlation metrics
  • Distance metrics
  • Clustering linkage method
  • Hierarchical clustering - agglomerative vs divisive

Let's get started! In the first code cell, we import the modules we'll use.


In [ ]:
# Alphabetical order for nonstandard python modules is conventional
# We're doing "import superlongname as abbrev" for our laziness --
# this way we don't have to type out the whole thing each time.

# Python plotting library
import matplotlib.pyplot as plt

# Dataframes in Python
import pandas as pd

# Statistical plotting library we'll use
import seaborn as sns
# Use the visual style of plots that I prefer and use the
# "notebook" context, which sets the default font and figure sizes
sns.set(style='whitegrid', context='notebook')

# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline

# Import figure code for interactive widgets
import fig_code

Correlation metrics

Spearman correlation

Spearman correlation answers the simple question: every time $x$ increases, does $y$ also increase? If yes, then Spearman correlation = 1.

Mathematically speaking, Spearman tells you whether $x$ and $y$ increase monotonically together (but not necessarily linearly!!)

Pearson correlation

Pearson correlation answers the question: every time my $x$ decreases by some amount $a$, does $y$ change by an amount proportional to that, say $10a$ or $0.5a$, and is this proportion constant?

$\rho_{x,y} = \frac{\mathrm{cov}(\vec{x}, \vec{y})}{\sigma_x \sigma_y}$

Mathematically speaking, Pearson tells you whether $x$ and $y$ are linearly related.

Spearman vs Pearson

Spearman's correlation is related to Pearson's in a simple way:

Spearman correlation = Pearson correlation on the ranks of the data.
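
To make this concrete, here's a minimal sketch using scipy.stats (not one of the modules we imported above) on toy data that is monotonic but not linear:


In [ ]:
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Toy data: y increases every time x increases, but not linearly
x = np.array([1, 2, 4, 8, 16])
y = x ** 2

# Spearman is exactly 1 (monotonic), Pearson is less than 1 (not linear)
print(spearmanr(x, y).correlation)
print(pearsonr(x, y)[0])

# Spearman correlation = Pearson correlation on the ranks of the data
print(pearsonr(rankdata(x), rankdata(y))[0])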

Anscombe's quartet

Anscombe's quartet is a group of four datasets that have nearly identical summary statistics (means, variances, correlations) yet look very different when plotted. We'll use it for exploring distance and correlation metrics.


In [ ]:
# Read the file - notice it is a URL. pandas can read either URLs or files on your computer
anscombe = pd.read_csv("https://github.com/mwaskom/seaborn-data/raw/master/anscombe.csv")

# Write the variable name alone on the last line of the cell to display the data
anscombe

Let's use FacetGrid from seaborn to plot the data onto four axes, and plot the regression line for each dataset.


In [ ]:
# Make a "grid" of plots based on the column name "dataset"
g = sns.FacetGrid(anscombe, col='dataset')

# Make a regression plot (regplot) using 'x' for the x-axis and 'y' for the y-axis
g.map(sns.regplot, 'x', 'y')

Below is a widget that calculates different summary statistics or distance metrics using Anscombe's quartet. It shows both a table and a barplot of the values. Play around with the different settings and discuss the questions below with your partner.


In [ ]:
fig_code.interact_anscombe()

Discussion

Discuss the questions below while you play with the widgets.

  1. Which metrics were nearly the same between all four datasets of Anscombe's quartet? Why?
  2. Which metrics were different between all four datasets of Anscombe's quartet? Why?
  3. Why do we use different summary statistics?

Distance metrics: Euclidean vs Manhattan

An important part of deciding whether two points are "near" each other is which distance metric you use.

  • Euclidean distance is what you learned in algebra class: $d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$, extended all the way to $N$-dimensional vectors ($\vec{x}, \vec{y}$ represent $N$-dimensional vectors): $d(\vec{x}, \vec{y}) = \sqrt{\sum_i^N \left(x_i - y_i\right)^2}$
  • Manhattan distance (also called "taxicab geometry") is similar but no squares or square roots: $d(\vec{x}, \vec{y}) = \sum_i^N |x_i - y_i|$
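
Here's a minimal sketch comparing the two metrics on a single pair of points, using scipy.spatial.distance (not imported above):


In [ ]:
import numpy as np
from scipy.spatial.distance import cityblock, euclidean

x = np.array([0, 0])
y = np.array([3, 4])

# Straight-line distance: sqrt(3^2 + 4^2) = 5.0
print(euclidean(x, y))

# "Taxicab" distance: |3 - 0| + |4 - 0| = 7
print(cityblock(x, y))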

Clustering linkage methods: Ward, average, single, complete

  • Single: Compares shortest distance between clusters
  • Complete: Compares largest distance between clusters
  • Average: Compares average distance between clusters
  • Ward: Compares how much merging two clusters would increase the within-cluster variance
  • Centroid: Compares centroid points of clusters

source: http://www.slideshare.net/neerajkaushik/cluster-analysis
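
As a minimal sketch of how you could try these out yourself, scipy.cluster.hierarchy (not imported above) lets you swap linkage methods on the same made-up toy data:


In [ ]:
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: 6 points in 2 dimensions
points = np.random.RandomState(0).rand(6, 2)

# The same data can give different trees under different linkage methods.
# Each row of Z records one merge: the two clusters joined and their distance.
for method in ['single', 'complete', 'average', 'ward', 'centroid']:
    Z = linkage(points, method=method, metric='euclidean')
    print(method)
    print(Z)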

Hierarchical clustering: Agglomerative vs Divisive

Hierarchical clustering creates an ordered grouping of the cells (or genes, but today we're focusing on cells) based on how close they are. To create this grouping, you can either be "top-down" (divisive) or "bottom-up" (agglomerative).

  • "Top-down" means you start with one BIG cluster of all the cells, you remove one cell at a time, and the way you choose that cell is by using the one that is least similar to everyone else. You're basically kicking out the outcasts every time :(
  • "Bottom-up" is much more inclusive - everyone starts as their own solo cluster, and then the two closest cells get merged together, then the next closest cell gets added to the growing cluster. This way. you're growing a big happy family.

Below is a diagram showing the steps of "bottom-up" (agglomerative) clustering on a small dataset. Notice that as you group points together, you add "leaves" to your "tree" -- yes these are the real words that are used! The diagram of lines on top of the ordered letters showing the clustering is called a "dendrogram" ("tree diagram").

source: https://www.researchgate.net/figure/273456906_fig3_Figure-4-Example-of-hierarchical-clustering-clusters-are-consecutively-merged-with-the
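
If you'd like to draw a dendrogram like this yourself, here's a minimal sketch using scipy (the points and labels are made up for illustration; plt comes from our imports at the top):


In [ ]:
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: 5 points, labeled A through E
points = np.random.RandomState(42).rand(5, 2)

# "Bottom-up" (agglomerative) clustering, then draw the tree diagram
Z = linkage(points, method='average', metric='euclidean')
dendrogram(Z, labels=list('ABCDE'))
plt.show()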

Dropout using Macosko2015 data

To explore the concepts of correlation, linkage, and distance metrics, we will be using a 300-cell, 259-gene subset of the Macosko 2015 data that contains 50 cells each from the following clusters:


In [ ]:
for cluster_id, name in fig_code.cluster_id_to_name.items():
    # The 'f' before the string means it's a "format string,"
    # which substitutes in the values of variables that exist
    # in your workspace. This is a very helpful and convenient feature
    # that was just released in Python 3.6! (not available in Python 3.5)
    print('---')
    print(f'{cluster_id}: {name}')

Here is a plot of the colors associated with each group.


In [ ]:
fig_code.plot_color_legend()

Below is another widget for you to play with. It lets you set different gene dropout thresholds (starting at 0% dropout), which will randomly remove (additional!) genes from the data.

We will evaluate dropout by looking at cell-cell correlations, computed with the chosen correlation metric (starting with Pearson). The correlation matrix is then clustered with hierarchical clustering, using the distance metric and linkage method you specify.
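
To build intuition for what the widget is doing, here's one simple way you could simulate dropout yourself (a sketch only -- the widget's exact procedure may differ, and `expression` here is a hypothetical made-up dataframe):


In [ ]:
import numpy as np
import pandas as pd

# Hypothetical (cells x genes) dataframe of made-up expression counts
rng = np.random.RandomState(0)
expression = pd.DataFrame(rng.poisson(5, size=(10, 20)))

dropout_fraction = 0.5  # fraction of measurements to randomly zero out

# Zero out each value with probability dropout_fraction
mask = rng.rand(*expression.shape) < dropout_fraction
simulated = expression.where(~mask, 0)
simulated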


In [ ]:
fig_code.plot_dropout_interactive()
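
Under the hood, the widget is doing something like the following (a minimal sketch, assuming the hypothetical `expression` dataframe from the sketch above; the next notebook breaks this down step by step):


In [ ]:
# Cell-cell correlation matrix; method can be 'pearson' or 'spearman'
correlations = expression.T.corr(method='pearson')

# Hierarchical clustering plus heatmap in one step
sns.clustermap(correlations, method='ward', metric='euclidean')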

Notes:

  • The ends of the branches are called "leaves", and the branch lengths indicate how closely pairs (or groups) of samples are related
  • By "best clustering", I mean where do you see the most biologically relevant information?

Discussion

Discuss the questions below while you play with the widgets.

Correlation methods

  1. Can you tell the difference between the correlations at 0% dropout? What about at 50%?
  2. Which correlation method maintains the clustering from the paper even at higher dropouts?
  3. Which correlation method made the best clusters, in your opinion?
  4. Which one would you use for single-cell RNA seq data with lots of dropout?

Distance metrics

  1. Which distance metric produced the longest branch lengths? The shortest?
  2. Which distance metric was most consistent with the paper's clustering? Least consistent?
  3. Which distance metric made the best clusters, in your opinion?
  4. Which one would you use for single-cell RNA seq data with lots of dropout?

Linkage methods

  1. Which linkage method produced the longest branch lengths? The shortest?
  2. Which linkage method was most consistent with the paper's clustering? Least consistent?
  3. Which linkage method made the best clusters, in your opinion?
  4. Which one would you use for single-cell RNA seq data with lots of dropout?

General

  1. Did the cell type clustering from the paper always agree with the different methods/metrics? Why do you think that is?
  2. Do you think their clustering could have been improved? How?
  3. What influenced the clustering the most?
    • Gene dropout
    • Correlation method
    • Distance metric
    • Linkage method

How to make the clustered heatmap above

Next, we'll break down how to make the clustered heatmap we explored above.

Let's move on to 1.4_make_clustered_heatmap.