Technical noise and artefacts

Single cell quality control

Need to simultaneously pick the samples you'll use and the genes you'll use. But the relative quality of the cells also depends on which genes you use to assess that quality, so what's a biologist to do??

Kolodziejczyk et al, Mol Cell (2015)

Samples first

Most of the time, the bad cells will be obvious with any metric you use. Here are some examples of metrics you can use for each cell:

  • Number of cells
    • Should be one!
    • Can observe this from visual inspection of the microfluidic chip or droplet
  • GC content of library
    • Should be similar to generic transcriptome
  • Numbers of reads mapped
    • Should be in your defined range. For some experiments that's <50,000 reads, for others, that's <1 million reads.
  • Percentage of reads mapped
    • Usually at least 80% mapping to the genome
  • Proportion of reads mapped to mitochondria vs genome
    • High mt/genome ratio suggests apoptotic cell
  • Proportion of reads mapped to spike-ins vs genome
    • High spike-in/genome ratio suggests low mRNA content of cell (which may be biologically true or a technical artefact - you decide)
  • Proportion of reads mapped to mitochondria vs genome
    • High mitochondrial/genome ratio suggests the cell was apoptosing as it was captured

Data from Macaulay & Svensson 2016

This is a paper that we'll investigate in depth on the last day for the case study. The cliff notes version is that they harvested GFP+ cells from a transgenic Zebrafish, where the GFP labeled a particular cell population, split them on high and low GFP and categorized the subpopulation by clustering and pseudotime.

Here's Figure 1 of their paper for an overview:

For now, we'll look at different thresholds for gene expression and how that affects the results. They used Salmon to quantify gene expression to transcripts per million (TPM), so all thresholds will be relative to the TPM. We'll see how this CV works for TPM, and how changing the minimum gene expression and minimum number of cells to use.


In [145]:
import ipywidgets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set(context='notebook', style='white')
%matplotlib inline

macaulay2016 = pd.read_csv('../4._Case_Study/supplementary-data-1-sample-info/original_experiment_sample_info.csv', 
                              index_col=0)
# Sort rows and columns
macaulay2016 = macaulay2016.sort_index(0).sort_index(1)

# Add "percentage reads mapped"
macaulay2016['% Assigned tags (proxy for mapped reads)'] = 100 * macaulay2016['Total Assigned Tags']/macaulay2016['Total Tags']
macaulay2016.head()


Out[145]:
% Parent % Total 3'UTR_Exons 488 5'UTR_Exons 561 Average mapped length CDS_Exons Cells ERCC Content ... log_total_molecules lower_tpm_limit molecule_r2 outlier_component total_molecules tsne_0 tsne_1 within_large_component within_small_component % Assigned tags (proxy for mapped reads)
Well
DIM_1_A1 NaN 100.00% 615178 80.0 118167 6.0 229.53 1273306 1.0 116339.14620 ... 6.460057 38.688781 0.849279 -0.119395 2884413.0 -0.357368 12.650468 0.070390 -0.104144 80.714032
DIM_1_A10 NaN 100.00% 625664 39.0 99645 1.0 230.31 1032911 1.0 279412.04190 ... 5.900172 286.543062 0.853730 0.017293 794642.0 -0.001426 6.763604 0.073285 0.020246 87.040182
DIM_1_A11 NaN 100.00% 733798 53.0 245374 1.0 230.64 3031565 1.0 24324.44224 ... 6.978323 19.049522 0.778746 -0.036702 9513110.0 -8.698019 8.706809 -0.016871 -0.179175 87.303328
DIM_1_A12 NaN 100.00% 827658 337.0 148061 9.0 230.22 1597099 1.0 186802.23200 ... 6.094479 190.479102 0.901161 0.034679 1243023.0 -4.495216 11.184408 0.045065 -0.065027 86.116606
DIM_1_A2 NaN 100.00% 187244 83.0 38552 5.0 229.60 322876 1.0 251667.90200 ... 5.955462 388.363168 0.817718 0.072677 902531.0 -1.362267 3.037966 0.074522 0.049654 84.112176

5 rows × 50 columns

We're gonna take a look at different quality control metrics


In [146]:
def explore_quality_control(x, y, logy):
    x = x.lower()
    if x == 'batch':
        fig, axes = plt.subplots(figsize=(8, 3), ncols=2)
        ax = axes[0]
        if logy:
            ax.set_yscale('log')
        sns.boxplot(y=y, x=x, data=macaulay2016, ax=ax, palette='Paired')
        ax.set(title=y)
        
        ax = axes[1]
        sns.countplot(x, data=macaulay2016)
        ax.set(ylabel="Number of cells", title='Cells per batch')
    else:
        fig, ax = plt.subplots(figsize=(30, 3))
        macaulay2016[y].plot(kind='bar', logy=logy, ax=ax)
        ax.set(ylim=(0, macaulay2016[y].max()*1.05), title=y)
        
    sns.despine()
    fig.tight_layout()

ipywidgets.interact(explore_quality_control,
                    x=ipywidgets.Dropdown(value='Cells', options=['Cells', 'Batch'], description='x-axis'),
                    y=ipywidgets.Dropdown(value='MT Content', options=['MT Content', 'ERCC Content', 
                                                                       '% Assigned tags (proxy for mapped reads)',
                                                                       'detected_genes', 'total_molecules'],
                                          description='y-axis'),
                    logy=ipywidgets.Checkbox(value=False, description='Set the y-axis as log10-scale?'));


Dealing with zeros and "drop-outs"

Papers have looked into the capture efficiency of single-cell data and have estimated that the Fluidigm C1 capture 5-10% of the total mRNA content of the cell [citation needed], and the genes whose expression is measured in some cells but not others are called "drop outs."

Kanye West's "The College Dropout" (2004)

School Spirit (Skit 2)

You keep it going man
You keep those books rolling
You pick up all those books that you're gonna read and not remember
And you roll, man
You get that associate's degree, okay?
Then you get your bachelor's
Then you get your masters
Then you get your masters' masters
Then you get your doctorate
You go man!
And then when everyone says quit
You show them those degrees, man
When everyone says "Hey, you're not working, you're not making any money"
You say "You look at my degrees, and you look at my life
Yeah, I'm 52! So what?
Hate all you want, but I'm smart, I'm so smart
And I'm in school
All these guys out here making money all these ways
And I'm spending mine to be smart!
You know why?
Cause when I die buddy
You know what's gonna keep me warm?
That's right, those degrees."

Coefficient of Variation

The way that Brennecke et al 2013 deal with this is by looking at the dispersion of genes with a metric called "coefficient of variation squared" or "$CV^2$" with $\mu$ indicating the mean gene expression across cells, and $\sigma$ indicating the standard deviation of gene expression across cells, $CV^2$ is defined by,

$CV^2 = \left(\frac{\sigma}{\mu}\right)^2$

They then plot the $CV^2$ (y-axis) as a function of the mean gene expression (x-axis):

The brown dots are all genes, the blue indicate the spike-ins, and the pink indicates the genes reliably variant more than the ERCCs. They then argue that these are the genes whose biological variance exceeds the technical variance. This is really only appropriate when you have exact counts for ERCCs and genes using UMIs because otherwise you're not fully confident of the number of counts.


In [147]:
from sklearn.decomposition import FastICA
from sklearn.manifold import TSNE


ica_columns = ['difference_component', 'within_small_component', 'outlier_component', 'within_large_component']


import numpy as np

macaulay2016_expression = pd.read_csv('../4._Case_Study/macaulay2016/gene_expression_s.csv', index_col=0)
# Sort the rows and columns
macaulay2016_expression = macaulay2016_expression.sort_index(0).sort_index(1)

# Use only cells that pass QC
macaulay2016_expression = macaulay2016_expression[macaulay2016.index[macaulay2016['Pass QC']]]

ercc_names = [x for x in macaulay2016_expression.index if 'ERCC' in x]

ercc_original = macaulay2016_expression.loc[ercc_names]

macaulay2016_expression_genes = macaulay2016_expression.loc[macaulay2016_expression.index.difference(ercc_names)]

# Rescale just the genes to TPM since now we removed the ERCCs by rescaling each sample (column)
macaulay2016_expression_genes = (macaulay2016_expression_genes / macaulay2016_expression_genes.sum()) * 1e6


def explore_thresholds(min_log_tpm, min_n_cells, log_base, logy=True):
    if log_base == 'e':
        expression = np.log(macaulay2016_expression_genes + 1)
        ercc = np.log(ercc_original + 1)
    elif log_base == '2':
        expression = np.log2(macaulay2016_expression_genes + 1)
        ercc = np.log2(ercc_original + 1)
    elif log_base == '10':
        expression = np.log10(macaulay2016_expression_genes + 1)
        ercc = np.log10(ercc_original + 1)
    else:
        raise ValueError("Not a valid log base")
        
    genes_detected_per_cell = (expression > min_log_tpm).sum(axis=0)
    cells_detected_per_gene = (expression > min_log_tpm).sum(axis=1)
    
    mask = (expression > min_log_tpm).sum(axis=1) >= min_n_cells
    expression = expression.ix[mask]
    
    # Use only cells that have an assigned cluster color
    cell_color = macaulay2016.loc[expression.columns, 'cluster_color'].dropna()
    cell_color = cell_color.map(eval)
    expression = expression[cell_color.index]

    # Calculate coefficient of variation for expression
    mean = expression.mean(axis=1)
    std = expression.std(axis=1)
    
    # Coefficient  of variation squared
    cv2 = np.square(std/mean)
    
    # CV for ERCCs
    mean_ercc = ercc.mean(axis=1)
    std_ercc = ercc.std(axis=1)
    cv2_ercc = np.square(std_ercc/mean_ercc)
    
    fig, axes = plt.subplots(figsize=(12, 3), ncols=4)
    ax = axes[0]
    ax.semilogy(mean, cv2, 'o', alpha=0.25, markeredgewidth=0.5)
    ax.semilogy(mean_ercc, cv2_ercc, 'o', color='Crimson', markeredgewidth=0.5) 
    ax.set(xlabel='Mean expression ($\log_{{{}}}(TPM+1)$)'.format(log_base), ylabel='$CV^2$', ylim=(1e-3, 1e2))
    sns.despine()

    ax = axes[1]
    sns.distplot(genes_detected_per_cell, ax=ax, kde=False)
    ax.set(title='Number of genes detected per cell', ylabel='Number of detected genes', xlabel='Number of cells')
    ax.locator_params(nbins=5)  # reduce the number of ticks to max 6 (5 bins between 6 ticks)
    sns.despine()

    ax = axes[2]
    sns.distplot(cells_detected_per_gene, kde=False, ax=ax, hist_kws=dict(zorder=-1), color='darkgreY')
    if logy:
        ax.set_yscale('log')
    ax.grid(axis='y', color='white', zorder=1)
    ymin, ymax = ax.get_ylim()
    ax.vlines(min_n_cells, ymin, ymax, linestyle='--')
    ax.set(xlim=(0, 400), xlabel='Number of cells', ylabel='Number of detected genes', 
           title='Number of cells detected per gene')

    
    ica = FastICA(n_components=4, random_state=3984)
    decomposed = pd.DataFrame(ica.fit_transform(expression.T.copy()), index=expression.columns,  columns=ica_columns)

    embedder = TSNE(n_components=2, perplexity=75, random_state=254)
    embedded = pd.DataFrame(embedder.fit_transform(decomposed))
    
    ax = axes[3]
    ax.scatter(embedded[0], embedded[1], c=cell_color, s=40)
    # Empty the tick labels
    ax.set(xticks=[], yticks=[])
    sns.despine()
    fig.tight_layout()

ipywidgets.interact(explore_thresholds, 
                    min_log_tpm=ipywidgets.IntSlider(min=0, max=10, value=1, 
                                                     description='Expression Threshold: Minimum log(TPM+1)'), 
                    min_n_cells=ipywidgets.IntSlider(min=0, max=50, value=3,
                                                    description='Minimum number of cells'), 
                    log_base=ipywidgets.Dropdown(options=['e', '2', '10'], value='10', description="log_{X} ?"),
                    logy=ipywidgets.Checkbox(True, description='Log-scale the y-axis of the histogram in the second plot?'));


/Users/olga/anaconda3/envs/single-cell-bioinformatics-v3/lib/python3.5/site-packages/matplotlib/figure.py:1744: UserWarning: This figure includes Axes that are not compatible with tight_layout, so its results might be incorrect.
  warnings.warn("This figure includes Axes that are not "

The default values are what they used in the paper.

Here's an explanation of the panels:

  • Mean vs CV for expressed genes
  • The distribution of for each cell, how many genes were detected
  • The distribution of for each gene, how many cells detected that gene
  • A dimensionality reduction plot of the data. We'll talk about these exact algorithms tomorrow, for now, pay attention to how it looks relative to Figure 1c from the Macaulay paper.

The conclusion here is... thresholds matter!