A key issue in single-cell anything-seq is that you don't capture every molecule you want, so your data contains many more zeros than the cells truly have. We will discuss:
To be able to talk about this, we need to introduce some computational concepts. Here, we will talk about:
Let's get started! In the first code cell, we import the modules we'll use.
In [ ]:
# Alphabetical order for nonstandard python modules is conventional
# We're doing "import superlongname as abbrev" for our laziness --
# -- this way we don't have to type out the whole thing each time.
# Python plotting library
import matplotlib.pyplot as plt
# Dataframes in Python
import pandas as pd
# Statistical plotting library we'll use
import seaborn as sns
# Use the "whitegrid" visual style and the "notebook" context,
# which sets the default font and figure sizes
sns.set(style='whitegrid', context='notebook')
# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline
# Import figure code for interactive widgets
import fig_code
Spearman correlation answers the simple question: every time $x$ increases, does $y$ also increase? If yes, then the Spearman correlation = 1.
Mathematically speaking, Spearman tells you whether $x$ and $y$ increase monotonically together (but not necessarily linearly!).
Pearson correlation answers the question: every time $x$ changes by some amount $a$, does $y$ change by a proportional amount, say $10a$ or $0.5a$, and is that proportion constant?
$\rho_{x,y} = \frac{\mathrm{cov}(\vec{x}, \vec{y})}{\sigma_x \sigma_y}$
Mathematically speaking, Pearson tells you whether $x$ and $y$ are linearly related.
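To see the formula in action, here is a minimal numerical check (with made-up example vectors) that computing $\mathrm{cov}(x, y) / (\sigma_x \sigma_y)$ by hand matches NumPy's built-in Pearson correlation:

```python
import numpy as np

# Hypothetical small vectors, just to check the formula numerically
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

# cov(x, y) / (sigma_x * sigma_y), using population (ddof=0) statistics
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov_xy / (x.std() * y.std())

# Should match NumPy's built-in Pearson correlation
print(rho, np.corrcoef(x, y)[0, 1])
```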
Spearman's correlation is related to Pearson's because:
Spearman correlation = Pearson correlation on the ranks of the data.
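We can verify that identity directly with `scipy.stats`, using hypothetical data that is monotonic but deliberately nonlinear:

```python
import numpy as np
from scipy import stats

# Hypothetical monotonic-but-nonlinear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # increases with x, but not linearly

# Spearman correlation, computed directly
spearman = stats.spearmanr(x, y).correlation

# Pearson correlation on the ranks gives the same value
pearson_on_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]

print(spearman, pearson_on_ranks)  # both 1.0: perfectly monotonic
```

Note that the plain Pearson correlation of `x` and `y` here would be less than 1, because the relationship is not linear.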
In [ ]:
# Read the file - notice it is a URL. pandas can read either URLs or files on your computer
anscombe = pd.read_csv("https://github.com/mwaskom/seaborn-data/raw/master/anscombe.csv")
# Evaluate the variable name by itself to look at the data
anscombe
Let's use `FacetGrid` from seaborn to plot the data onto four axes, and plot the regression line for each dataset.
In [ ]:
# Make a "grid" of plots based on the column name "dataset"
g = sns.FacetGrid(anscombe, col='dataset')
# Make a regression plot (regplot) using 'x' for the x-axis and 'y' for the y-axis
g.map(sns.regplot, 'x', 'y')
Below is a widget that calculates different summary statistics or distance metrics using Anscombe's quartet. It shows both a table and a barplot of the values. Play around with the different settings and discuss the questions below with your partner.
In [ ]:
fig_code.interact_anscombe()
An important part of deciding whether two points are "near" each other is which distance metric you use.
source: http://www.slideshare.net/neerajkaushik/cluster-analysis
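As a quick illustration of how the choice of metric changes what "near" means, here is a sketch using `scipy.spatial.distance` on two hypothetical "cells" described by three gene-expression values:

```python
from scipy.spatial import distance

# Two hypothetical "cells" described by three gene-expression values
a = [0.0, 0.0, 0.0]
b = [3.0, 4.0, 0.0]

# Different metrics give different notions of "near"
print(distance.euclidean(a, b))   # straight-line distance: 5.0
print(distance.cityblock(a, b))   # Manhattan distance: 3 + 4 + 0 = 7.0
print(distance.chebyshev(a, b))   # largest single-gene difference: 4.0
```

The same pair of points is "5 apart," "7 apart," or "4 apart" depending on the metric, which is why the metric setting matters in the widgets below.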
Hierarchical clustering creates an ordered grouping of the cells (or genes, but today we're focusing on cells) based on how close they are. To create this grouping, you can work either "top-down" (divisive) or "bottom-up" (agglomerative).
Below is a diagram showing the steps of "bottom-up" (agglomerative) clustering on a small dataset. Notice that as you group points together, you add "leaves" to your "tree" -- yes these are the real words that are used! The diagram of lines on top of the ordered letters showing the clustering is called a "dendrogram" ("tree diagram").
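The agglomerative steps in the diagram can be sketched with `scipy.cluster.hierarchy`, using a small made-up dataset of six one-dimensional points that form two obvious groups:

```python
import numpy as np
from scipy.cluster import hierarchy

# Six hypothetical points on a line; two obvious groups
points = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [11.0]])

# "Bottom-up" (agglomerative) clustering: repeatedly merge the
# two closest clusters until only one remains. Z records each merge,
# i.e. the structure of the dendrogram.
Z = hierarchy.linkage(points, method='average')

# Cutting the tree into two clusters recovers the two groups
labels = hierarchy.fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Passing `Z` to `hierarchy.dendrogram(Z)` would draw the "tree diagram" described above.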
In [ ]:
for cluster_id, name in fig_code.cluster_id_to_name.items():
    # The 'f' before the string marks a "format string," which
    # substitutes the variables named inside the curly braces.
    # This very convenient feature was added in Python 3.6
    # (not available in Python 3.5)
    print('---')
    print(f'{cluster_id}: {name}')
Here is a plot of the colors associated with each group.
In [ ]:
fig_code.plot_color_legend()
Below is another widget for you to play with. It lets you set different gene dropout thresholds (starts at 0 dropout), which will randomly remove (additional!) genes from the data.
We will evaluate dropout by looking at the cell-cell correlations, measured with the chosen correlation metric (starting at Pearson); the cells will then be clustered hierarchically using the specified distance metric and linkage method.
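That correlation-then-cluster pipeline can be sketched in a few lines (with a hypothetical random expression matrix standing in for real data):

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# Hypothetical expression matrix: 4 cells x 5 genes
rng = np.random.RandomState(0)
cells = rng.rand(4, 5)

# Cell-cell Pearson correlation matrix (4 x 4)
corr = np.corrcoef(cells)

# Convert correlation to a distance (1 - correlation), then cluster
dist = 1 - corr
np.fill_diagonal(dist, 0)  # remove floating-point noise on the diagonal
Z = hierarchy.linkage(squareform(dist, checks=False), method='average')
print(Z.shape)  # (n_cells - 1, 4): one row per merge step
```

The widget below does the analogous computation, with the correlation metric, distance metric, and linkage method you select.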
In [ ]:
fig_code.plot_dropout_interactive()
Discuss the questions below while you play with the widgets.
Now we'll break down how to read the clustered heatmap we made above.
Let's move on to 1.4_make_clustered_heatmap.