Thank you to Alex's Lemonade Stand and the Seurat team for their teaching resources, from which much of the material below is adapted.

scRNA-seq preprocessing and normalization

It's always important to do rigorous preprocessing before analyzing our datasets, and even more so for noisy scRNA-seq data. These steps are essential for producing interpretable results in downstream analyses. In this class, we'll perform quality control and normalization of scRNA-seq count data with Seurat.

Setup

Many tools have been developed for the analysis of scRNA-seq data; we'll focus on one of the most widely used and well-maintained packages. Seurat is an R package that can perform QC, clustering, exploration, and visualization of scRNA-seq data, and its developers regularly add new features. It's a good place for us to start!

If you're interested, they have many more useful vignettes on their website: https://satijalab.org/seurat/get_started.html

We installed Seurat for you, but to access it you need to go to Project settings --> Project control --> switch to the "Experimental" software environment --> Save and Restart.


In [0]:
library(Seurat)
packageVersion("Seurat") # Make sure this is v3.0.0
library(ggplot2)

Load in Darmanis et al. dataset

We're going to continue using the dataset we started working on in the prelab.

This dataset was generated from human primary glioblastoma tumor cells with the Smart-seq2 protocol (https://www.ncbi.nlm.nih.gov/pubmed/29091775). We're using a subset of this dataset (n=1854 cells). The provided raw count matrix was quantified using Salmon.


In [0]:
# Load in raw counts from Darmanis et al.
sc_data <- read.table("data/unfiltered_darmanis_counts.tsv", header=T, sep="\t", row.names=1)

Take a look at the data. How many genes and how many cells are there in the raw dataset? What is the prefix of the cell names?

What percentage of the data matrix is 0? Is this surprising?
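If it helps, here is one quick way to check these numbers (a sketch; sc_data is the data frame we just loaded):

dim(sc_data)                   # number of genes (rows) and cells (columns)
head(colnames(sc_data))        # inspect the cell name prefix
mean(as.matrix(sc_data) == 0)  # fraction of entries that are exactly zero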

Seurat can do our QC and normalization

Recall that in the prelab we looked at two QC metrics of interest: (1) the number of reads detected per cell and (2) the number of genes detected per cell. Now that we have a better understanding of these QC metrics, we can make our lives easier by using Seurat to visualize these stats (and more!) and to filter out cells and features that don't pass our QC thresholds.

If you're interested, you can get a lot more information about the Seurat package and its capabilities here: https://satijalab.org/seurat/get_started.html

Before we start using Seurat, we first want to convert the Ensembl gene IDs to HGNC symbols for easier interpretation. Run the code below to replace the Ensembl IDs with HGNC symbols in the rownames of our data matrix.


In [0]:
library(biomaRt)

# Connect to the Ensembl BioMart for human genes
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")

# Map Ensembl gene IDs to HGNC symbols
bm <- getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"),
            filters="ensembl_gene_id", values=rownames(sc_data), mart=ensembl)

# Match symbols back to the order of our count matrix; unmatched IDs become NA
hgnc.symbols <- bm$hgnc_symbol[match(rownames(sc_data), bm$ensembl_gene_id)]
sc_data <- as.matrix(sc_data)
rownames(sc_data) <- hgnc.symbols
1) We'll start by reading our count data into a Seurat object. This object will hold our count matrix, as well as data from all downstream processing steps and analyses (normalization, scaling, PCA, clustering results, etc.). The min.cells and min.features parameters let us keep only features detected in at least min.cells cells and only cells with at least min.features detected features.

In [0]:
sc <- CreateSeuratObject(counts=sc_data, project="Darmanis", min.cells=5, min.features=200)
sc

The original count matrix is stored in the sc[["RNA"]]@counts slot.

Print the raw counts for the first 6 genes of the first 6 cells below:


In [0]:

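For reference, one way to do this (a sketch using the counts slot described above):

as.matrix(sc[["RNA"]]@counts[1:6, 1:6])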
2) Seurat automatically computes the number of reads (nCount_RNA) and the number of genes (nFeature_RNA) detected per cell. We can access these values in the sc@meta.data slot.

Generate a density plot of nCount_RNA - it should exactly match the density plot we produced in the prelab for total read count!


In [0]:
head(sc@meta.data)

qplot(sc@meta.data$nCount_RNA, xlab="Total read counts", geom="density") +
    geom_vline(xintercept=25000, color="red") +
    theme_classic()
3) One final QC metric we're often interested in is the percentage of reads mapping to the mitochondrial genome; a high percentage often indicates low-quality or dying cells. Seurat lets us search for mitochondrial gene names and calculate the percentage of reads mapping to those genes. We can then stash these stats in our Seurat object's metadata by assigning to sc[["percent.mito"]].

In [0]:
sc[["percent.mito"]] <- PercentageFeatureSet(object=sc, pattern="^MT-")

What would you have to change in the code above if we were working with mouse data?
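Hint: mouse mitochondrial gene symbols are lowercase (e.g. mt-Co1), so for a mouse dataset the pattern would change, roughly:

# Hypothetical mouse version: mitochondrial gene symbols start with "mt-"
sc[["percent.mito"]] <- PercentageFeatureSet(object=sc, pattern="^mt-")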

4) What does the quality of this experiment look like, and how do you know? We can visualize some QC metrics of interest in a violin plot. We can also check that the number of detected genes correlates with total read count.

In [0]:
VlnPlot(object=sc, features=c("nCount_RNA", "nFeature_RNA", "percent.mito"), ncol=3, pt.size=0.5)
FeatureScatter(object=sc, feature1="nCount_RNA", feature2="nFeature_RNA")
5) Remove low-quality cells (high mitochondrial content), empty or failed libraries (very few genes detected), and likely doublets/multiplets (abnormally high number of genes detected).

Seurat lets us easily apply QC filters based on custom criteria. In this case, we keep cells with more than 250 but fewer than 2500 detected genes, and less than 10% mitochondrial content.


In [0]:
sc <- subset(sc, subset = nFeature_RNA > 250 & nFeature_RNA < 2500 & percent.mito < 10)
sc

How many genes and cells remain after QC?

6) Recover the cell type annotations and stash them in sc@meta.data

Run the following code to add cell type annotations to our Seurat object metadata. This will be useful later when we're visualizing the different cell populations.


In [0]:
sc_metadata <- read.table("data/unfiltered_darmanis_metadata.tsv", sep="\t", header=T)

celltypes <- sc_metadata$cell.type.ch1[match(rownames(sc@meta.data), sc_metadata$geo_accession)]
sc@meta.data$celltype <- celltypes

Print the metadata for the first 6 cells below:


In [0]:

7) Normalization

Seurat v3 implements a more sophisticated normalization method, SCTransform, but it is still a work in progress and doesn't yet work well on large datasets. We will therefore use the original LogNormalize method, which normalizes each cell by its total sequencing depth, multiplies by a scale factor, and log-transforms the result. Note that normalization of scRNA-seq data is an active area of research with many distinct approaches and relatively little benchmarking.


In [0]:
sc <- NormalizeData(object=sc, normalization.method = "LogNormalize", scale.factor=10000)
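For intuition, LogNormalize essentially divides each cell's counts by that cell's total counts, multiplies by the scale factor, and applies log(1 + x). A rough manual check for the first cell (a sketch; the normalized values are stored in sc[["RNA"]]@data):

# Manual version of LogNormalize for the first cell (for intuition only)
raw_cell1  <- sc[["RNA"]]@counts[, 1]
norm_cell1 <- log1p(raw_cell1 / sum(raw_cell1) * 10000)

# Should closely match the stored normalized values
head(norm_cell1)
head(sc[["RNA"]]@data[, 1])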
8) Identify highly variable features

We'll specify that we want the top 2000 highly variable genes with the nfeatures parameter. These will be used as the input genes for downstream dimensionality reduction (PCA).


In [0]:
sc <- FindVariableFeatures(object=sc, selection.method="vst", nfeatures=2000)

# Plot variable features and label the 10 most variable genes
top10 <- head(x=VariableFeatures(object=sc), 10)
plot1 <- VariableFeaturePlot(object=sc)
plot2 <- LabelPoints(plot=plot1, points=top10, repel=TRUE)
plot2
9) Scale data

Scaling is an essential pre-processing step for dimensionality reduction methods like PCA. Here, we not only scale the data but also regress out unwanted sources of variation, such as mitochondrial contamination (we could also regress out cell cycle stage!).

We scale only the top 2000 highly variable genes for speed. This won't affect our PCA result, since PCA uses only the variable genes as input anyway. However, if you later need to plot heatmaps, you'll want to run ScaleData() on all genes (see the optional snippet after the next cell).


In [0]:
sc <- ScaleData(object=sc, vars.to.regress="percent.mito")

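As mentioned above, if you later want scaled values for genes outside the top 2000 variable features (e.g. for heatmaps), one option is to rerun ScaleData on all genes; a sketch (slower, and not needed for this class):

# Optional: scale every gene, not just the variable features (noticeably slower)
# sc <- ScaleData(object=sc, features=rownames(sc), vars.to.regress="percent.mito")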
Print the scaled counts for the first 6 genes of the first 6 cells below:

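One way to do this (a sketch; the scaled values live in sc[["RNA"]]@scale.data, which only contains the genes we scaled, i.e. the 2000 variable features):

sc[["RNA"]]@scale.data[1:6, 1:6]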
10) Compare normalized data to raw count data

Let's look at how proper data processing impacts our ability to draw interpretable conclusions about the biology. We'll generate PCA plots for both the raw count data and the normalized count data. What do you notice?


In [0]:
# Run PCA on raw count data and normalized count data
raw_pca <- prcomp(t(as.matrix(sc[["RNA"]]@counts)))
norm_pca <- prcomp(t((sc[["RNA"]]@scale.data)))

# Retrieve PC scores (cell coordinates along PC1 and PC2)
raw_pca_scores <- data.frame(raw_pca$x[,1:2], cell_type=sc@meta.data$celltype)
norm_pca_scores <- data.frame(norm_pca$x[,1:2], cell_type=sc@meta.data$celltype)

# Plot PCA with cell labels
ggplot(raw_pca_scores, aes(x=PC1, y=PC2, color=cell_type)) +
    geom_point() +
    ggtitle("Raw counts PCA") +
    theme_classic()
#     colorblindr::scale_color_OkabeIto()

# Plot PCA with cell labels
ggplot(norm_pca_scores, aes(x=PC1, y=PC2, color=cell_type)) +
    geom_point() +
    ggtitle("Normalized counts PCA") + 
    theme_classic()
#     colorblindr::scale_color_OkabeIto()
11) Finally, save your processed Seurat object for future downstream analysis.

In [0]:
saveRDS(sc, file="sc_Darmanis_normalized.rds")

Homework (10 pts)

We're going to be analyzing the Zheng et al. dataset in the next homework (https://www.nature.com/articles/ncomms14049). This dataset consists of 68k immune cells profiled from peripheral blood. We'll use a small downsampled subset to save time and memory.
Using what you've learned above, process the Zheng et al. dataset. You DON'T need to rerun every step, only the ones essential for producing the final processed Seurat object.
These steps include:
  1. Visualize QC metrics
  2. Filter out low-quality cells and lowly expressed genes. Criteria: nFeature_RNA > 500, nFeature_RNA < 2500, percent.mito < 10
  3. Normalize
  4. Scale, regress out mitochondrial content variable
  5. Save the filtered, processed Seurat object
Make sure to include all essential code!

In [0]:
# We loaded in the data for you
Zheng_data <- read.table("data/Zheng_pbmc_downsample300_C9_filt.txt", sep="\t", header=T, check.names=F)

# The last column of the table holds gene symbols; use them as rownames and drop the column
geneids <- as.character(Zheng_data[,ncol(Zheng_data)])
Zheng_data$gene_symbols <- NULL
Zheng_data <- as.matrix(Zheng_data)
rownames(Zheng_data) <- geneids

# Create Seurat object
sc <- CreateSeuratObject(counts=Zheng_data, project="Zheng2017")

# Store cell type information (parsed from the cell names) in the metadata
celltype <- sapply(strsplit(rownames(sc@meta.data), split="_"), function(x) x[[2]])
sc@meta.data$celltype <- celltype
Q1 (4 pts). Visualize QC metrics of interest and filter out poor-quality cells.

In [0]:

Q2 (4 pts). Normalize the data, identify the top 2000 highly variable features, and scale the data based on those features. Make sure to regress out mitochondrial contamination in the scaling step. Finally, use saveRDS() to save the Seurat object as an .rds file named sc_Zheng_normalized.rds.

In [0]:

Q3 (2 pts). What are the primary sources of technical variability and biases you would be worried about in this experiment? See Zheng et al. for information about the experiment and Ziegenhain et al. for an overview of scRNA-seq technical biases (both papers are in your directory).