Now that we've rigourously QC'd and normalized our data to remove confounders, we can move on to the interesting part! Of course, analysis steps will vary depending on the biological question, but there a few things we can do that are useful in a wide range of contexts.
Use readRDS() to load in the Darmanis Seurat object we pre-processed in the last homework:
In [0]:
One of the first things we can do is run PCA on our data. Doing a linear dimensionality reduction can not only help us visualize the data, but also identify "metafeatures" of correlated gene sets. These metafeatures are more robust to noise and will be used as input features for more sophisticated dimensionality reduction methods that we'll use later.
Seurat provides a function RunPCA() that will perform PCA on the scaled data for us. We can specify the input features to use using the features argument. In this case, we want to use the highly variable genes we previously identified.
In [0]:
library(Seurat)
sc <- RunPCA(object=sc, features=VariableFeatures(object=sc), verbose=F)
We can visualize our PCA using DimPlot(), which will plot PC1 and PC2. We set the reduction argument to "pca", and color the dots by the cell type annotation stored in sc@meta.data.
In [0]:
DimPlot(object=sc, reduction="pca", pt.size=0.3, group.by="celltype")
We can look at the top genes that define each principal component:
In [0]:
print(x = sc[["pca"]], dims=1:3, nfeatures=6)
What do you notice about the PCA plot?
One question we must answer is how many principal components are important? There are several independent ways to estimate this! Seurat implements the JackStraw procedure used in Macosko et al (https://www.cell.com/fulltext/S0092-8674(15)00549-8).
Seurat can also visualize an Elbow plot, which tells us the percentage of variation explained by each principal component. These plots are often used to determine the number of PCs needed to capture the majority of the signal. Since the JackStraw procedure takes a long time on large datasets, we'll roughly estimate the dimensionality of the data with an Elbow plot.
In [0]:
ElbowPlot(object=sc)
# If you wanted to implement JackStraw procedure...
# sc <- JackStraw(object=sc, num.replicate=100)
# sc <- ScoreJackStraw(object=sc, dims=1:20)
# JackStrawPlot(object=sc, dims=1:16)
Based on the elbow plot, how many PCs do you think we should use?
Let's decide on 15 PCs to use for downstream clustering and visualization. You can change this number based on your interpretation of the plots!
In [0]:
num_dims <- 15
There are many, many single-cell clustering algorithms available these days, and each have their strengths and weaknesses. We will use Seurat's graph-based clustering algorithm to identify distinct clusters in our data.
Clustering happens in two parts. First, FindNeighbors() will construct a K-nearest neighbor graph with the number of specified PCs. Second, FindClusters() will apply the Louvain algorithm to partition the graph into highly inter-connected "communities".
In [0]:
sc <- FindNeighbors(object=sc, dims=1:num_dims)
sc <- FindClusters(object=sc, resolution=0.4)
You can access the cluster IDs in the @active.ident slot. How many clusters did Seurat find in our data?
Note that there is a resolution parameter that sets the "granularity" of the downsteam clustering. The higher this value, the more clusters we will obtain. For a dataset with 3000 cells, Seurat suggests setting this value to 0.4-1.2. As the number of cells increases, the resolution value should increase as well.
Let's see how this resolution parameter might affect our clustering.
How many clusters do we get with resolution values of 0.1? 1.2? 5? Let's try clustering with each resolution value:
In [0]:
res <- c(0.1, 1.2, 5)
for(i in 1:length(res)) {
sc <- FindNeighbors(object=sc, dims=1:num_dims)
sc <- FindClusters(object=sc, resolution=res[i])
print(paste("res=", res[i], "; number of communities=", length(levels(sc@active.ident)), sep=""))
}
Fill in the number of clusters at each resolution setting:
Let's set the resolution parameter to 0.4 for our downstream analysis.
In [0]:
sc <- FindNeighbors(object=sc, dims=1:num_dims)
sc <- FindClusters(object=sc, resolution=0.4)
We'll often want to use non-linear dimensionality reduction techniques to visualize and explore scRNA-seq datasets. The goal of both UMAP and tSNE is to place similar cells together in a low-dimensional space. Cells of the same type or in the same cluster should appear together on the 2D plots.
Again, we'll use the same PCs as input for dimensionality reductions.
Let's run UMAP twice. In the first, we'll color the cells by their cluster assignments from the previous step. In the second, we'll color them by their cell type annotation. Is the clustering biologically relevant? Does the UMAP visualization correspond to our clustering?
In [0]:
# Color by cluster ID
sc <- RunUMAP(object=sc, dims=1:num_dims)
p1 <- DimPlot(object=sc, reduction="umap", pt.size=0.2)
# Color by cell type annotation
sc <- RunUMAP(object=sc, dims=1:num_dims)
p2 <- DimPlot(object=sc, reduction="umap", pt.size=0.2, group.by="celltype")
CombinePlots(list(p1, p2), ncol=2)
Now let's try another dimensionality reduction method - tSNE. We'll color them the same way as for UMAP.
Is the clustering biologically relevant? Does the tSNE visualization correspond to our clustering?
In [0]:
# Color by cluster ID
sc <- RunTSNE(object=sc, dims=1:num_dims)
p1 <- DimPlot(object=sc, reduction="tsne", pt.size=0.2)
# Color by cell type annotation
sc <- RunTSNE(object=sc, dims=1:num_dims)
p2 <- DimPlot(object=sc, reduction="tsne", pt.size=0.2, group.by="celltype")
CombinePlots(list(p1, p2), ncol=2)
How do the UMAP and tSNE dimensionality reductions compare? Plot the two visualizations side by side. Replace
In [0]:
# Compare UMAP and TSNE plots
p1 <- DimPlot(object=sc, reduction=<DIM_PLOT>, pt.size=0.2)
p2 <- DimPlot(object=sc, reduction=<DIM_PLOT>, pt.size=0.2)
CombinePlots(list(p1, p2), ncol=2)
Seurat also has a function for identifying biomarkers that define a cluster. These genes are differntially expressed in a cluster compared to all the other cells. See the Seurat documentation for many more parameter options for this function.
In [0]:
sc.markers <- FindAllMarkers(object=sc, only.pos=TRUE, min.pct=0.25, logfc.threshold=0.25)
In [0]:
# List biomarkers for each cluster
genes.markers <- lapply(0:(length(unique(sc.markers$cluster))-1), function(x) {
print(paste("cluster ", x, sep=""))
genes <- sc.markers$gene[which(sc.markers$cluster == x)]
return(genes)
})
genes.markers
Let's plot the expression of a few genes that define clusters. Each of these genes were identified with FindAllMarkers() as one of the top differentially expressed genes for a cluster.
In [0]:
FeaturePlot(object = sc, features = c("IFI30", "CCL3", "CDR1", "CRYAB", "WIF1", "GJB1"), pt.size=0.05)
In [0]:
In [0]:
In [0]:
# We have to scale *all* genes beacuse need to match to gene sets, not just use PCs for dimensionality reduction.
sc <- ScaleData(object=sc, features=rownames(sc))
celltypes <- c("Bcell", "CD4_naive", "cytotoxic_CD8", "Treg", "macrophage", "NK")
# Load in cell type markers
celltype.markers <- lapply(celltypes, function(x) {
markers <- read.table(paste("immune_markers_JerbyArnon/", x, "_markers.txt", sep=""), header= F)
markers <- as.character(markers$V1)
return(markers)
})
names(celltype.markers) <- celltypes
# Get scaled data matrix
sc.mat_scaled <- sc[["RNA"]]@scale.data
# For each immune signature, take the mean of the z-scored expression values for each cell.
meta_celltype_exps <- lapply(1:length(celltype.markers), function(x) {
mat <- sc.mat_scaled[na.omit(match(celltype.markers[[x]], rownames(sc))), ]
meta_celltype_exp <- as.numeric(colMeans(t(scale(t(mat)))))
return(meta_celltype_exp)
})
names(meta_celltype_exps) <- celltypes
mat.meta_celltype_exps <- do.call(rbind, meta_celltype_exps)
# Add composite features to the data matrix of a duplicate seurat object, sc1
sc1 <- sc
sc1[["RNA"]]@data <- rbind(sc[["RNA"]]@scale.data, mat.meta_celltype_exps)
In [0]:
# Plot composite features of immune signatures
FeaturePlot(object = sc1, features = c("Bcell", "CD4_naive", "cytotoxic_CD8", "Treg", "macrophage", "NK"), pt.size=0.1)