scRNA-seq analysis (dimensionality reduction, clustering, identifying DE genes)

Now that we've rigourously QC'd and normalized our data to remove confounders, we can move on to the interesting part! Of course, analysis steps will vary depending on the biological question, but there a few things we can do that are useful in a wide range of contexts.

Use readRDS() to load in the Darmanis Seurat object we pre-processed in the last homework:



In [0]:

PCA

One of the first things we can do is run PCA on our data. Doing a linear dimensionality reduction can not only help us visualize the data, but also identify "metafeatures" of correlated gene sets. These metafeatures are more robust to noise and will be used as input features for more sophisticated dimensionality reduction methods that we'll use later.

Seurat provides a function RunPCA() that will perform PCA on the scaled data for us. We can specify the input features to use using the features argument. In this case, we want to use the highly variable genes we previously identified.



In [0]:

    
library(Seurat)

sc <- RunPCA(object=sc, features=VariableFeatures(object=sc), verbose=F)

We can visualize our PCA using DimPlot(), which will plot PC1 and PC2. We set the reduction argument to "pca", and color the dots by the cell type annotation stored in sc@meta.data.



In [0]:

    
DimPlot(object=sc, reduction="pca", pt.size=0.3, group.by="celltype")

We can look at the top genes that define each principal component:



In [0]:

    
print(x = sc[["pca"]], dims=1:3, nfeatures=6)

What do you notice about the PCA plot?

One question we must answer is how many principal components are important? There are several independent ways to estimate this! Seurat implements the JackStraw procedure used in Macosko et al (https://www.cell.com/fulltext/S0092-8674(15)00549-8).

Seurat can also visualize an Elbow plot, which tells us the percentage of variation explained by each principal component. These plots are often used to determine the number of PCs needed to capture the majority of the signal. Since the JackStraw procedure takes a long time on large datasets, we'll roughly estimate the dimensionality of the data with an Elbow plot.



In [0]:

    
ElbowPlot(object=sc)

# If you wanted to implement JackStraw procedure...
# sc <- JackStraw(object=sc, num.replicate=100)
# sc <- ScoreJackStraw(object=sc, dims=1:20)
# JackStrawPlot(object=sc, dims=1:16)

Based on the elbow plot, how many PCs do you think we should use?

Let's decide on 15 PCs to use for downstream clustering and visualization. You can change this number based on your interpretation of the plots!



In [0]:

    
num_dims <- 15

Clustering

There are many, many single-cell clustering algorithms available these days, and each have their strengths and weaknesses. We will use Seurat's graph-based clustering algorithm to identify distinct clusters in our data.

Clustering happens in two parts. First, FindNeighbors() will construct a K-nearest neighbor graph with the number of specified PCs. Second, FindClusters() will apply the Louvain algorithm to partition the graph into highly inter-connected "communities".



In [0]:

    
sc <- FindNeighbors(object=sc, dims=1:num_dims)
sc <- FindClusters(object=sc, resolution=0.4)

You can access the cluster IDs in the @active.ident slot. How many clusters did Seurat find in our data?

Note that there is a resolution parameter that sets the "granularity" of the downsteam clustering. The higher this value, the more clusters we will obtain. For a dataset with 3000 cells, Seurat suggests setting this value to 0.4-1.2. As the number of cells increases, the resolution value should increase as well.

Let's see how this resolution parameter might affect our clustering.

How many clusters do we get with resolution values of 0.1? 1.2? 5? Let's try clustering with each resolution value:



In [0]:

    
res <- c(0.1, 1.2, 5)
for(i in 1:length(res)) {
    sc <- FindNeighbors(object=sc, dims=1:num_dims)
    sc <- FindClusters(object=sc, resolution=res[i])
    print(paste("res=", res[i], "; number of communities=", length(levels(sc@active.ident)), sep=""))
}

Fill in the number of clusters at each resolution setting:

resolution=0.1 --> resolution=1.2 --> resolution=5 -->

Let's set the resolution parameter to 0.4 for our downstream analysis.



In [0]:

    
sc <- FindNeighbors(object=sc, dims=1:num_dims)
sc <- FindClusters(object=sc, resolution=0.4)

Dimensionality reduction and visualization - UMAP and tSNE

We'll often want to use non-linear dimensionality reduction techniques to visualize and explore scRNA-seq datasets. The goal of both UMAP and tSNE is to place similar cells together in a low-dimensional space. Cells of the same type or in the same cluster should appear together on the 2D plots.

Again, we'll use the same PCs as input for dimensionality reductions.

Let's run UMAP twice. In the first, we'll color the cells by their cluster assignments from the previous step. In the second, we'll color them by their cell type annotation. Is the clustering biologically relevant? Does the UMAP visualization correspond to our clustering?



In [0]:

    
# Color by cluster ID
sc <- RunUMAP(object=sc, dims=1:num_dims)
p1 <- DimPlot(object=sc, reduction="umap", pt.size=0.2)

# Color by cell type annotation
sc <- RunUMAP(object=sc, dims=1:num_dims)
p2 <- DimPlot(object=sc, reduction="umap", pt.size=0.2, group.by="celltype")

CombinePlots(list(p1, p2), ncol=2)

Now let's try another dimensionality reduction method - tSNE. We'll color them the same way as for UMAP.

Is the clustering biologically relevant? Does the tSNE visualization correspond to our clustering?



In [0]:

    
# Color by cluster ID
sc <- RunTSNE(object=sc, dims=1:num_dims)
p1 <- DimPlot(object=sc, reduction="tsne", pt.size=0.2)

# Color by cell type annotation
sc <- RunTSNE(object=sc, dims=1:num_dims)
p2 <- DimPlot(object=sc, reduction="tsne", pt.size=0.2, group.by="celltype")

CombinePlots(list(p1, p2), ncol=2)

How do the UMAP and tSNE dimensionality reductions compare? Plot the two visualizations side by side. Replace with the correct method.



In [0]:

    
# Compare UMAP and TSNE plots
p1 <- DimPlot(object=sc, reduction=<DIM_PLOT>, pt.size=0.2)
p2 <- DimPlot(object=sc, reduction=<DIM_PLOT>, pt.size=0.2)

CombinePlots(list(p1, p2), ncol=2)

Identify biomarkers for each cluster

Seurat also has a function for identifying biomarkers that define a cluster. These genes are differntially expressed in a cluster compared to all the other cells. See the Seurat documentation for many more parameter options for this function.



In [0]:

    
sc.markers <- FindAllMarkers(object=sc, only.pos=TRUE, min.pct=0.25, logfc.threshold=0.25)



In [0]:

    
# List biomarkers for each cluster
genes.markers <- lapply(0:(length(unique(sc.markers$cluster))-1), function(x) {
    print(paste("cluster ", x, sep=""))
    genes <- sc.markers$gene[which(sc.markers$cluster == x)]
    return(genes)
})

genes.markers

Let's plot the expression of a few genes that define clusters. Each of these genes were identified with FindAllMarkers() as one of the top differentially expressed genes for a cluster.



In [0]:

    
FeaturePlot(object = sc, features = c("IFI30", "CCL3", "CDR1", "CRYAB", "WIF1", "GJB1"), pt.size=0.05)

Homework (10 pts)

We're now going to return to the Zheng dataset we processed in the previous homework. Recall that this dataset consists of purified immune cell populations. These populations include B cells, naive CD4+, cytotoxic CD8+, regulatory T cells, NK cells, and monocytes. We've provided lists of marker genes that define a cell type (from Jerby-Arnon et al. 2018 https://www.ncbi.nlm.nih.gov/pubmed/30388455).

Using what you learned above, annotate the immune cells based on marker expression. Again, you should NOT run every step - only run the essential steps for our goal of annotating cells.

These steps include:

Load in the processed Zheng seurat object
Run PCA and determine dimensionality of dataset
Using PCs as input, cluster the cells
Visualize clusters by UMAP - check that cells within clusters group together
Identify markers that define each cluster by differential expression
Make a composite feature of each immune cell type signature by averaging the z-scored expression values of the genes in each signature.
Plot the composite features to see in which clusters they are overexpressed.
Given the differentially expressed markers and the expression of immune cell signatures, manually annotate the clusters.

Make sure to include all essential code!

Q1 (3 pts). Run PCA, cluster cells with resolution=0.8, and visualize clusters with UMAP. Be sure to determine the dimensionality of the dataset by your method of choice, and use that number of PCs for downstream clustering and visualization.

How many PCs did you use for clustering? How many clusters did you end up with? Do the clusters group together with UMAP dimensionality reduction?



In [0]:

Q2 (2 pts). Identify markers that define each cluster by differential expression.



In [0]:

Q3 (3 pts). Use the code below to make composite features of each immune cell type with their given signature genes. Briefly describe what the code is doing in your own words.



In [0]:

    
# We have to scale *all* genes beacuse need to match to gene sets, not just use PCs for dimensionality reduction.
sc <- ScaleData(object=sc, features=rownames(sc))

celltypes <- c("Bcell", "CD4_naive", "cytotoxic_CD8", "Treg", "macrophage", "NK")

# Load in cell type markers
celltype.markers <- lapply(celltypes, function(x) {
    markers <- read.table(paste("immune_markers_JerbyArnon/", x, "_markers.txt", sep=""), header= F)
    markers <- as.character(markers$V1)
    return(markers)
})
names(celltype.markers) <- celltypes

# Get scaled data matrix
sc.mat_scaled <- sc[["RNA"]]@scale.data

# For each immune signature, take the mean of the z-scored expression values for each cell.
meta_celltype_exps <- lapply(1:length(celltype.markers), function(x) {
    mat <- sc.mat_scaled[na.omit(match(celltype.markers[[x]], rownames(sc))), ]
    meta_celltype_exp <- as.numeric(colMeans(t(scale(t(mat)))))
    return(meta_celltype_exp)
})
names(meta_celltype_exps) <- celltypes
mat.meta_celltype_exps <- do.call(rbind, meta_celltype_exps)

# Add composite features to the data matrix of a duplicate seurat object, sc1
sc1 <- sc
sc1[["RNA"]]@data <- rbind(sc[["RNA"]]@scale.data, mat.meta_celltype_exps)



In [0]:

    
# Plot composite features of immune signatures
FeaturePlot(object = sc1, features = c("Bcell", "CD4_naive", "cytotoxic_CD8", "Treg", "macrophage", "NK"), pt.size=0.1)

Q4 (2 pts). Based on the differentially expressed markers and the expression of immune signatures, manually annotate the clusters. Some clusters will be very difficult to annotate, which is often the case in real single cell data. Don't worry too much about those, just try your best! We will only grade on the clearest clusters