Welcome to the world of single-cell RNA-seq! In order to harness the promise of this powerful assay, we must be able to overcome the enormous levels of technical noise arising from experimental protocols. For the next two classes, we'll focus on computational methods for processing and analyzing scRNA-seq data.
Take a look at the review from Bacher and Kendziorski 2016. Though the computational tools used for scRNA-seq analysis are constantly changing, the underlying concepts remain the same. Compared to bulk RNA-seq, what are some unique features of scRNA-seq data that require novel computational approaches (see figure 1)?
The primary goal of normalization is to remove the influence of technical effects on scRNA-seq read counts while preserving true biological variation. What are some systemic sources of technical noise that can affect scRNA-seq data?
We're going to get set up for class by loading in our single-cell dataset and taking a preliminary look at some QC metrics of interest. In class, we will be learning how to do QC and preprocessing on our dataset with Seurat.
Run the code below and try to understand the figures it generates.
The dataset we'll be using is generated from human primary glioblastoma tumor cells with the Smart-seq2 protocol (https://www.ncbi.nlm.nih.gov/pubmed/29091775). We're using a subset of this dataset (n=1854 cells). The raw count matrix provided has been quantified using Salmon.
If you look at the file, you'll notice the first column contains Ensembl gene ids. Since we want a matrix of only numeric counts, we'll store the gene ids column in rownames with row.names=1 as an argument.
In [0]:
# Load in raw counts from Darmanis et al.
sc_data <- read.table("data/unfiltered_darmanis_counts.tsv", header=T, sep="\t", row.names=1)
Let's get a better idea of what kind of scRNA-seq data features we want to look at before analysis. One QC metric we're interested in is the number of reads or molecules detected per cell. Cells with few reads are likely low-quality cells or empty droplets. Cells with an abnormally high number of reads might be doublets or multiplets. These cells will confuse our analysis, so we need to filter them out.
Let's check what our read count per cell is in this experiment.
In [0]:
total_counts <- apply(sc_data, 2, sum)
quantile(total_counts)
What is the range of total counts per cell? Are there any outliers?
Notice that one sample has only 48 reads. It's likely that this cell is an empty well or low-quality cell - we want to filter these cells out so they don't confuse our downstream analysis!
We can visualize a density plot of total read counts per cell, while plotting a vertical red line to indicate a potential QC cutoff. All cells left of the line would be filtered out.
In [0]:
library(ggplot2)
# Density plot of read counts per cell
qplot(total_counts, xlab="Total read count per cell", geom="density") +
geom_vline(xintercept=25000, color="red") + # Add line for our cutoff
theme_classic() # For aesthetics
Next we can check the number of genes detected per cell. This value should correlate strongly with total read count. However, we can imagine that sometimes a single gene could account for all the counts in a cell, which would indicate a poor library and should not be included for analysis. Thus, it is a good idea to check the number of genes detected per cell, in addition to the number of reads.
One thing we'll have to decide is: how many counts have to map to a gene for it to be considered "detected"? There is no right answer for this - here we'll use a lenient cutoff of 0.
In [0]:
# Set cutoff > 0 - if at least 1 read maps to the gene, it is considered "detected"
genecount_cutoff <- 0
# Generate TRUE/FALSE matrix for detection - sum to get number of genes detected in each cell
detection_mat <- as.matrix(sc_data) > genecount_cutoff
num_genes_exp <- apply(detection_mat, 2, sum)
# Density plot of genes expressed per cell
qplot(num_genes_exp, geom="density", xlab="Number of genes expressed by each cell") +
geom_density() +
geom_vline(xintercept=500, color="red") +
theme_classic() # Add this for aesthetics
You may have noticed our metadata file has cell type annotations for each cell. We can incorporate this information to tap into the power of single-cell data.
Run the code below to plot each cell type separately and see how the number of genes detected differs across cell types. What do you observe?
In [0]:
sc_metadata <- read.table("data/unfiltered_darmanis_metadata.tsv", sep="\t", header=T)
head(sc_metadata)
# Create a data frame with cell type information for plotting
num_genes_exp_CellType_df <- data.frame(num_genes_exp, cell_type=sc_metadata$cell.type.ch1)
# Plot each cell type separately to see how the gene filter would affect each type differently
ggplot(num_genes_exp_CellType_df, aes(x=num_genes_exp)) +
xlab("Number of genes expressed by each sample") +
geom_density() +
geom_vline(xintercept=500, color="red") +
facet_wrap(~cell_type) + # facet_wrap() uses labels to make individual graphs
theme_classic()
Now that we have a better understanding of some scRNA-seq QC metrics, we can make our lives easier by using Seurat to visualize QC stats and filter out cells and features that don't pass QC. This is what we'll do tomorrow!