Need to simultaneously pick the samples you'll use and the genes you'll use. But the relative quality of the cells also depends on which genes you use to assess that quality, so what's a biologist to do??
Most of the time, the bad cells will be obvious with any metric you use. Here are some examples of metrics you can use for each cell:
This is a paper that we'll investigate in depth on the last day for the case study. The cliff notes version is that they harvested GFP+ cells from a transgenic Zebrafish, where the GFP labeled a particular cell population, split them on high and low GFP and categorized the subpopulation by clustering and pseudotime.
Here's Figure 1 of their paper for an overview:
For now, we'll look at different thresholds for gene expression and how that affects the results. They used Salmon to quantify gene expression to transcripts per million (TPM), so all thresholds will be relative to the TPM. We'll see how this CV works for TPM, and how changing the minimum gene expression and minimum number of cells to use.
In [145]:
import ipywidgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(context='notebook', style='white')
%matplotlib inline
macaulay2016 = pd.read_csv('../4._Case_Study/supplementary-data-1-sample-info/original_experiment_sample_info.csv',
index_col=0)
# Sort rows and columns
macaulay2016 = macaulay2016.sort_index(0).sort_index(1)
# Add "percentage reads mapped"
macaulay2016['% Assigned tags (proxy for mapped reads)'] = 100 * macaulay2016['Total Assigned Tags']/macaulay2016['Total Tags']
macaulay2016.head()
Out[145]:
We're gonna take a look at different quality control metrics
In [146]:
def explore_quality_control(x, y, logy):
x = x.lower()
if x == 'batch':
fig, axes = plt.subplots(figsize=(8, 3), ncols=2)
ax = axes[0]
if logy:
ax.set_yscale('log')
sns.boxplot(y=y, x=x, data=macaulay2016, ax=ax, palette='Paired')
ax.set(title=y)
ax = axes[1]
sns.countplot(x, data=macaulay2016)
ax.set(ylabel="Number of cells", title='Cells per batch')
else:
fig, ax = plt.subplots(figsize=(30, 3))
macaulay2016[y].plot(kind='bar', logy=logy, ax=ax)
ax.set(ylim=(0, macaulay2016[y].max()*1.05), title=y)
sns.despine()
fig.tight_layout()
ipywidgets.interact(explore_quality_control,
x=ipywidgets.Dropdown(value='Cells', options=['Cells', 'Batch'], description='x-axis'),
y=ipywidgets.Dropdown(value='MT Content', options=['MT Content', 'ERCC Content',
'% Assigned tags (proxy for mapped reads)',
'detected_genes', 'total_molecules'],
description='y-axis'),
logy=ipywidgets.Checkbox(value=False, description='Set the y-axis as log10-scale?'));
Papers have looked into the capture efficiency of single-cell data and have estimated that the Fluidigm C1 capture 5-10% of the total mRNA content of the cell [citation needed], and the genes whose expression is measured in some cells but not others are called "drop outs."
School Spirit (Skit 2)
You keep it going man
You keep those books rolling
You pick up all those books that you're gonna read and not remember
And you roll, man
You get that associate's degree, okay?
Then you get your bachelor's
Then you get your masters
Then you get your masters' masters
Then you get your doctorate
You go man!
And then when everyone says quit
You show them those degrees, man
When everyone says "Hey, you're not working, you're not making any money"
You say "You look at my degrees, and you look at my life
Yeah, I'm 52! So what?
Hate all you want, but I'm smart, I'm so smart
And I'm in school
All these guys out here making money all these ways
And I'm spending mine to be smart!
You know why?
Cause when I die buddy
You know what's gonna keep me warm?
That's right, those degrees."
The way that Brennecke et al 2013 deal with this is by looking at the dispersion of genes with a metric called "coefficient of variation squared" or "$CV^2$" with $\mu$ indicating the mean gene expression across cells, and $\sigma$ indicating the standard deviation of gene expression across cells, $CV^2$ is defined by,
$CV^2 = \left(\frac{\sigma}{\mu}\right)^2$
They then plot the $CV^2$ (y-axis) as a function of the mean gene expression (x-axis):
The brown dots are all genes, the blue indicate the spike-ins, and the pink indicates the genes reliably variant more than the ERCCs. They then argue that these are the genes whose biological variance exceeds the technical variance. This is really only appropriate when you have exact counts for ERCCs and genes using UMIs because otherwise you're not fully confident of the number of counts.
In [147]:
from sklearn.decomposition import FastICA
from sklearn.manifold import TSNE
ica_columns = ['difference_component', 'within_small_component', 'outlier_component', 'within_large_component']
import numpy as np
macaulay2016_expression = pd.read_csv('../4._Case_Study/macaulay2016/gene_expression_s.csv', index_col=0)
# Sort the rows and columns
macaulay2016_expression = macaulay2016_expression.sort_index(0).sort_index(1)
# Use only cells that pass QC
macaulay2016_expression = macaulay2016_expression[macaulay2016.index[macaulay2016['Pass QC']]]
ercc_names = [x for x in macaulay2016_expression.index if 'ERCC' in x]
ercc_original = macaulay2016_expression.loc[ercc_names]
macaulay2016_expression_genes = macaulay2016_expression.loc[macaulay2016_expression.index.difference(ercc_names)]
# Rescale just the genes to TPM since now we removed the ERCCs by rescaling each sample (column)
macaulay2016_expression_genes = (macaulay2016_expression_genes / macaulay2016_expression_genes.sum()) * 1e6
def explore_thresholds(min_log_tpm, min_n_cells, log_base, logy=True):
if log_base == 'e':
expression = np.log(macaulay2016_expression_genes + 1)
ercc = np.log(ercc_original + 1)
elif log_base == '2':
expression = np.log2(macaulay2016_expression_genes + 1)
ercc = np.log2(ercc_original + 1)
elif log_base == '10':
expression = np.log10(macaulay2016_expression_genes + 1)
ercc = np.log10(ercc_original + 1)
else:
raise ValueError("Not a valid log base")
genes_detected_per_cell = (expression > min_log_tpm).sum(axis=0)
cells_detected_per_gene = (expression > min_log_tpm).sum(axis=1)
mask = (expression > min_log_tpm).sum(axis=1) >= min_n_cells
expression = expression.ix[mask]
# Use only cells that have an assigned cluster color
cell_color = macaulay2016.loc[expression.columns, 'cluster_color'].dropna()
cell_color = cell_color.map(eval)
expression = expression[cell_color.index]
# Calculate coefficient of variation for expression
mean = expression.mean(axis=1)
std = expression.std(axis=1)
# Coefficient of variation squared
cv2 = np.square(std/mean)
# CV for ERCCs
mean_ercc = ercc.mean(axis=1)
std_ercc = ercc.std(axis=1)
cv2_ercc = np.square(std_ercc/mean_ercc)
fig, axes = plt.subplots(figsize=(12, 3), ncols=4)
ax = axes[0]
ax.semilogy(mean, cv2, 'o', alpha=0.25, markeredgewidth=0.5)
ax.semilogy(mean_ercc, cv2_ercc, 'o', color='Crimson', markeredgewidth=0.5)
ax.set(xlabel='Mean expression ($\log_{{{}}}(TPM+1)$)'.format(log_base), ylabel='$CV^2$', ylim=(1e-3, 1e2))
sns.despine()
ax = axes[1]
sns.distplot(genes_detected_per_cell, ax=ax, kde=False)
ax.set(title='Number of genes detected per cell', ylabel='Number of detected genes', xlabel='Number of cells')
ax.locator_params(nbins=5) # reduce the number of ticks to max 6 (5 bins between 6 ticks)
sns.despine()
ax = axes[2]
sns.distplot(cells_detected_per_gene, kde=False, ax=ax, hist_kws=dict(zorder=-1), color='darkgreY')
if logy:
ax.set_yscale('log')
ax.grid(axis='y', color='white', zorder=1)
ymin, ymax = ax.get_ylim()
ax.vlines(min_n_cells, ymin, ymax, linestyle='--')
ax.set(xlim=(0, 400), xlabel='Number of cells', ylabel='Number of detected genes',
title='Number of cells detected per gene')
ica = FastICA(n_components=4, random_state=3984)
decomposed = pd.DataFrame(ica.fit_transform(expression.T.copy()), index=expression.columns, columns=ica_columns)
embedder = TSNE(n_components=2, perplexity=75, random_state=254)
embedded = pd.DataFrame(embedder.fit_transform(decomposed))
ax = axes[3]
ax.scatter(embedded[0], embedded[1], c=cell_color, s=40)
# Empty the tick labels
ax.set(xticks=[], yticks=[])
sns.despine()
fig.tight_layout()
ipywidgets.interact(explore_thresholds,
min_log_tpm=ipywidgets.IntSlider(min=0, max=10, value=1,
description='Expression Threshold: Minimum log(TPM+1)'),
min_n_cells=ipywidgets.IntSlider(min=0, max=50, value=3,
description='Minimum number of cells'),
log_base=ipywidgets.Dropdown(options=['e', '2', '10'], value='10', description="log_{X} ?"),
logy=ipywidgets.Checkbox(True, description='Log-scale the y-axis of the histogram in the second plot?'));
The default values are what they used in the paper.
Here's an explanation of the panels: