Thank you to Alex's Lemonade Stand and the Seurat team for their teaching resources, from which much of the material below is adapted.
It's always important to do rigorous preprocessing before analyzing our datasets, and more so for noisy scRNA-seq data. These steps are essential for producing interpretable results in the downstream analysis. In this class, we’ll perform quality control and normalization of scRNA-seq count data with Seurat.
Many tools have been developed for the analysis of scRNA-seq data - we'll focus on one of the most widely used and well-maintained packages. Seurat is an R package that can perform QC, clustering, exploration, and visualization of scRNA-seq data, and they are regularly adding more features. It's a good place to start for us!
If you're interested, they have many more useful vigenettes on their website: https://satijalab.org/seurat/get_started.html
We installed Seurat for you, but you need to access it by going to Project settings --> Project control --> switch to "Experimental" software environment --> Save and Restart
In [0]:
library(Seurat)
packageVersion("Seurat") # Make sure this is v3.0.0
library(ggplot2)
We're going to continue using the dataset we started working on in the prelab.
This dataset is generated from human primary glioblastoma tumor cells with the Smart-seq2 protocol (https://www.ncbi.nlm.nih.gov/pubmed/29091775). We're using a subset of this dataset (n=1854 cells). The raw count matrix provided has been quantified using Salmon.
In [0]:
# Load in raw counts from Darmanis et al.
sc_data <- read.table("data/unfiltered_darmanis_counts.tsv", header=T, sep="\t", row.names=1)
Take a look at the data. How many genes and how many cells are there in the raw dataset? What is the prefix of cell names?
What percentage of the data matrix is 0? Is this surprising?
Recall from the prelab that we looked at two QC metrics of interest: 1. number of reads detected per cell and 2. number of genes detected per cell. Now that we have a better understanding of these QC metrics, we can make our lives easier by using Seurat to visualize these QC stats (and more!) and filter out cells and features that don't pass QC metrics.
If you're interested, you can get a lot more information about the Seurat package and its capabilities here: https://satijalab.org/seurat/get_started.html
Before we start Seurat, we first want to convert Ensembl gene ids to HGNC symbols for easier interpretation. Run the code below to replace Ensembl ids with hgnc symbols in rownames of our data matrix.
In [0]:
library(biomaRt)
ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
bm <- getBM(attributes=c("ensembl_gene_id", "hgnc_symbol"), values=rownames(sc_data), mart=ensembl)
hgnc.symbols <- bm$hgnc_symbol[match(rownames(sc_data), bm$ensembl_gene_id)]
sc_data <- as.matrix(sc_data)
rownames(sc_data) <- hgnc.symbols
In [0]:
sc <- CreateSeuratObject(counts=sc_data, project="Darmanis", min.cells=5, min.features=200)
sc
The original count matrix is stored in sc[["RNA"]]@counts
Print the raw counts for the first 6 genes of the first 6 cells below:
In [0]:
In [0]:
head(sc@meta.data)
qplot(sc@meta.data$nCount_RNA, xlab="Total read counts", geom="density") +
geom_vline(xintercept=25000, color="red") +
theme_classic()
In [0]:
sc[["percent.mito"]] <- PercentageFeatureSet(object=sc, pattern="^MT-")
What would you have to change in the code above if we were working with mouse data?
In [0]:
VlnPlot(object=sc, features=c("nCount_RNA", "nFeature_RNA", "percent.mito"), ncol=3, pt.size=0.5)
FeatureScatter(object=sc, feature1="nCount_RNA", feature2="nFeature_RNA")
Seurat lets us easily apply QC filters based on our custom criteria. In this case, we want cells with >250 genes, but less than 2500, and <10% mitochondrial content.
In [0]:
sc <- subset(sc, subset = nFeature_RNA > 250 & nFeature_RNA < 2500 & percent.mito < 10)
sc
How many genes and cells remain after QC?
In [0]:
sc_metadata <- read.table("data/unfiltered_darmanis_metadata.tsv", sep="\t", header=T)
celltypes <- sc_metadata$cell.type.ch1[match(rownames(sc@meta.data), sc_metadata$geo_accession)]
sc@meta.data$celltype <- celltypes
Print the metadata for the first 6 cells below:
In [0]:
Seurat v3 implements a more sophisticated normalization method, SCTransform, but it is still a work in progress and doesn't work well on large datasets yet. Thus, we will use their original method that normalizes by sequencing depth and log transforms the result. We should note that normalization of scRNA-seq data is an active field of research with many distinct approaches and little benchmarking.
In [0]:
sc <- NormalizeData(object=sc, normalization.method = "LogNormalize", scale.factor=10000)
In [0]:
sc <- FindVariableFeatures(object=sc, selection.method="vst", nfeatures=2000)
# Plot top 10 most variable genes
top10 <- head(x=VariableFeatures(object=sc), 10)
plot1 <- VariableFeaturePlot(object=sc)
plot2 <- LabelPoints(plot=plot1, points=top10, repel=TRUE)
plot2
Scaling is an essential pre-processing step for dimensionality reduction methods like PCA. Here, we not only scale the data, but also regress out unwanted sources of variation such as mitochondrial contamination (we could also do cell cycle stage!).
We scale only the top 2000 highly variable genes for speed. This won't affect our PCA result, as PCA will use only the top variable genes as input anyway. However, if you need to plot heatmaps, you'll want to run ScaleData() on all the genes.
In [0]:
sc <- ScaleData(object=sc, vars.to.regress="percent.mito")
Print the scaled counts for the first 6 genes of the first 6 cells below:
In [0]:
# Run PCA on raw count data and normalized count data
raw_pca <- prcomp(t(as.matrix(sc[["RNA"]]@counts)))
norm_pca <- prcomp(t((sc[["RNA"]]@scale.data)))
# Retrieve PCA loading scores
raw_pca_scores <- data.frame(raw_pca$x[,1:2], cell_type=sc@meta.data$celltype)
norm_pca_scores <- data.frame(norm_pca$x[,1:2], cell_type=sc@meta.data$celltype)
# Plot PCA with cell labels
ggplot(raw_pca_scores, aes(x=PC1, y=PC2, color=cell_type)) +
geom_point() +
ggtitle("Raw counts PCA") +
theme_classic()
# colorblindr::scale_color_OkabeIto()
# Plot PCA with cell labels
ggplot(norm_pca_scores, aes(x=PC1, y=PC2, color=cell_type)) +
geom_point() +
ggtitle("Normalized counts PCA") +
theme_classic()
# colorblindr::scale_color_OkabeIto()
In [0]:
saveRDS(sc, file="sc_Darmanis_normalized.rds")
In [0]:
# We loaded in the data for you
Zheng_data <- read.table("data/Zheng_pbmc_downsample300_C9_filt.txt", sep="\t", header=T, check.names=F)
geneids <- as.character(Zheng_data[,ncol(Zheng_data)])
Zheng_data$gene_symbols <- NULL
Zheng_data <- as.matrix(Zheng_data)
rownames(Zheng_data) <- geneids
# Create Seurat object
sc <- CreateSeuratObject(counts=Zheng_data, project="Zheng2017")
# Store type information in meta data object
celltype <- sapply(strsplit(rownames(sc@meta.data), split="_"), function(x) x[[2]])
sc@meta.data$celltype <- celltype
In [0]:
In [0]: