In this practical session we will go through basics of preprocessing and visualizing molecular data (gene expression data). The data we use is single-cell RNA-seq data from healthy human brain samples. link The same dataset is used in the deconvolution project work and because of that, the main point of this practical session is for you to get familiar with handling the data, applying minimal preprocessing, visualizing the data, and finding differentially expressed genes.
In [ ]:
# Install required packages
install.packages(c("ggplot2", "Rtsne", "fields", "ROCR"))
source("https://bioconductor.org/biocLite.R")
biocLite("M3Drop")
In [ ]:
# Load libraries and some some helper functions
library(class)
library(fields)
library(ROCR)
library(ggplot2)
library(Rtsne)
library(M3Drop)
source("dimRedEvaluation.R")
Once gene expression has been quantified it is summarized as an expression matrix where each row corresponds to a gene (or transcript) and each column corresponds to a single cell. This matrix should be examined to remove poor quality cells which were not detected in either read QC or mapping QC steps. Failure to remove low quality cells at this stage may add technical noise which has the potential to obscure the biological signals of interest in the downstream analysis. Since there is currently no standard method for performing scRNASeq the expected values for the various QC measures that will be presented here can vary substantially from experiment to experiment. Thus, to perform QC we will be looking for cells and genes which are outliers with respect to the rest of the dataset rather than comparing to independent quality standards. Consequently, care should be taken when comparing quality metrics across datasets collected using different protocols.
We load the downloaded data, which is separated to csv-files per cell. The individual files are combined to form a data matrix where rows correspond to genes and columns correspond to individual cells. For convinience we modify the column names to match how the sample names are written in the annotation file. Also at this point we remove last three rows which correspond to some basic stats from the alignment step (not actual genes).
In [ ]:
files = list.files(path = "../data/molecular-data", pattern = "csv$", full.names = T)
data <- do.call(cbind, sapply(files, read.delim, header = F, row.names = 1, stringsAsFactors = F))
rownames(data) <- rownames(read.delim(files[1], header = F, row.names = 1))
rownames(data) <- gsub(" ", "", rownames(data))
colnames(data) <- gsub(".*/|[.]csv.V2$", "", colnames(data))
colnames(data) <- gsub("_.*", "", colnames(data))
data <- data[1:22085,]
dim(data)
Annotations (metadata) corresponding the said data is loaded and ordered to match the ordering of samples in the data matrix.
In [ ]:
anno <- read.delim("../data/molecular-data/metadata.tsv", stringsAsFactors = F)
anno <- anno[match(colnames(data), anno$geo_accession),]
rownames(anno) <- anno$geo_accession
Total number of RNA molecules detected per sample (library size) is import to check and remove cells with low library size. Wells with few reads/molecules are likely to have been broken or failed to capture a cell, and should thus be removed. Also, cells expressing unusually high number of reads might consist of multiple cells (multiplets) and should be removed. Note: Be careful when excluding cells in this manner, you might remove good cells (rare subtypes?)
In [ ]:
hist(x = colSums(data), breaks = 100)
lib.low.threshold <- 1e5
abline(v = lib.low.threshold, col = "red")
keep <- colSums(data) > lib.low.threshold
data <- data[,keep]
anno <- anno[keep,]
dim(data)
lib.high.threshold <- 1.8e6
abline(v = lib.high.threshold, col = "red")
keep <- colSums(data) < lib.high.threshold
data <- data[,keep]
anno <- anno[keep,]
dim(data)
In addition to ensuring sufficient sequencing depth for each sample, we also want to make sure that the reads are distributed across the transcriptome. Thus, we count the total number of unique genes detected in each sample and remove genes given a threshold.
In [ ]:
hist(x = rowSums(data > 0), breaks = 100)
gene.threshold <- 5
abline(v = gene.threshold, col = "red")
data <- data[rowSums(data>0) > gene.threshold,]
dim(data)
In some experiments, control RNAs, e.g. ERCC spike-in RNAs, may be used. Spike-in RNAs are good for (at least) two things:
In [ ]:
ercc <- grep("^ERCC-", rownames(data), value = T)
ercc.ratio <- apply(data[ercc,], 2, sum) / apply(data, 2, sum)
hist(x = ercc.ratio, breaks = 100)
Another useful QC metric is the ratio of reads aligned onto mitochrondrial genes per cell. High percentages may indicate broken cells and thus should be removed. However, no mitochondrial genes are expressed are found in the data and this step is skipped.
In [ ]:
mito <- grep("^MT-", rownames(data), value = T)
mito.ratio <- apply(data[mito,], 2, sum) / apply(data, 2, sum)
hist(x = mito.ratio, breaks = 100)
In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles may provide insights about how the experimental procedures could be improved.
It is typically a good idea to remove genes whose expression level is considered “undetectable”. We define a gene as detectable if at least two cells contain more than 5 transcripts from the gene. However, the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells.
In [ ]:
data <- data[rowSums(data > 5) > 2,]
dim(data)
Save non-normalized data for later usage.
In [ ]:
data.reads <- data
Library sizes vary because scRNA-seq data is often sequenced on highly multiplexed platforms the total reads which are derived from each cell may differ substantially. Some quantification methods (eg. Cufflinks, RSEM) incorporated library size when determining gene expression estimates thus do not require this normalization. However, if another quantification method was used then library size must be corrected for by multiplying or dividing each column of the expression matrix by a "normalization factor" which is an estimate of the library size relative to the other cells. Many methods to correct for library size have been developped for bulk RNA-seq and can be equally applied to scRNA-seq (eg. UQ, SF, CPM, RPKM, FPKM, TPM).
The simplest way to normalize this data is to convert it to counts per million (CPM) by dividing each column by its total then multiplying by 1,000,000. Note that spike-ins, if used, should be excluded from the calculation of total expression in order to correct for total cell RNA content. Also note that we are adding 1 to the data before normalization because we will log-transform the data eventually!
In [ ]:
data <- apply(data+1, 2, function(x) x / sum(x) * 1e6)
One potential drawback of CPM is if your sample contains genes that are both very highly expressed and differentially expressed across the cells. In this case, the total molecules in the cell may depend of whether such genes are on/off in the cell and normalizing by total molecules may hide the differential expression of those genes and/or falsely create differential expression for the remaining genes.
Log-transformation is pretty much a standard procedure in molecular data. Molecule count distribution is often skewed because cells not expressing a gene have low read counts but cells expressing the same genes have exponentially higher read counts. E.g. take a look at the distribution of FBXW7 before and after log-transformation. Note that the bimodal distribution seen in the histogram suggests that there are subpopulations of cells expressing FBXW7.
In [ ]:
hist(x = data["FBXW7",], breaks = 25)
hist(x = log(data["FBXW7",]), breaks = 25)
data <- log(data)
After preprocessing we usually want to see an overview of the data by visualizing it. But before visualization we want to find differentially expressing genes that help us discriminate different cell types from the data. One simple way is to order genes by their dispersion and use n top genes. High dispersion indicates the gene is expressed in different magnitudes among the cell population and could indicate existence of different cell types. Downside using this approach is that generally highly expressed genes are favored over lowly expressed. Quick fix to this is divide the dispersion by mean expression, i.e. calculating coefficient of variation.
In [ ]:
vargenes <- names(sort(apply(data, 1, var), decreasing = T)) # variance
vargenes <- names(sort(apply(data, 1, var) / apply(data, 1, mean), decreasing = T)) # coefficient of variation
The easiest way to overview the data is by transforming it using the principal component analysis and then visualize the first two principal components. Principal component analysis (PCA) is a statistical procedure that uses a transformation to convert a set of observations into a set of values of linearly uncorrelated variables called principal components (PCs). The number of principal components is less than or equal to the number of original variables.
In [ ]:
data.pca <- prcomp(x = data[vargenes[1:1000],], center = T, scale. = T)
data.pca <- cbind.data.frame(data.pca$rotation[,1:2], anno$characteristics_ch1.1)
colnames(data.pca) <- c("X1", "X2", "cell_type")
ggplot(data.pca, aes(x = X1, y = X2, color = cell_type)) + geom_point(alpha = 0.6, size = 2) + ggtitle("PCA")
An alternative to PCA for visualizing scRNASeq data is a tSNE plot. tSNE (t-Distributed Stochastic Neighbor Embedding) combines dimensionality reduction (e.g. PCA) with random walks on the nearest-neighbour network to map high dimensional data to a 2- or 3-dimensional space while preserving local distances between cells. In contrast with PCA, tSNE is a stochastic algorithm which means running the method multiple times on the same dataset will result in different plots. Due to the non-linear and stochastic nature of the algorithm, it is more difficult to intuitively interpret a tSNE visualization To ensure reproducibility, we fix the “seed” of the random-number generator in the code below so that we always get the same plot.
In [ ]:
set.seed(8)
data.tsne <- Rtsne(X = t(data[vargenes[1:1000],]), check_duplicates = F, pca = T)
data.tsne <- cbind.data.frame(data.tsne$Y, anno$characteristics_ch1.1)
colnames(data.tsne) <- c("X1", "X2", "cell_type")
ggplot(data.tsne, aes(x = X1, y = X2, color = cell_type)) + geom_point(alpha = 0.6, size = 2) + ggtitle("t-SNE")
While visual inspection is useful, evaluation using quality metrics is also important. Here we assess the quality of the embedding with trustworthiness and continuity (unsupervised metrics) and with nearest neighbor error (supervised metric).
In [ ]:
trustworthiness(X = t(data), Y = data.pca[,1:2], k = 12)
trustworthiness(X = t(data), Y = data.tsne[,1:2], k = 12)
continuity(X = t(data), Y = data.pca[,1:2], k = 12)
continuity(X = t(data), Y = data.tsne[,1:2], k = 12)
1 - sum(knn.cv(train = data.pca[,1:2], cl = data.pca$cell_type, k = 1) == data.pca$cell_type) / nrow(data.pca)
1 - sum(knn.cv(train = data.tsne[,1:2], cl = data.tsne$cell_type, k = 1) == data.tsne$cell_type) / nrow(data.tsne)
Single-cell RNA sequencing is able to quantify the whole transcriptome from the small amount of RNA present in individual cells. However, a consequence of reverse-transcribing and amplifying small quantities of RNA is a large number of dropouts, genes with zero expression in particular cells. The frequency of dropout events is strongly non-linearly related to the measured expression levels of the respective genes. M3Drop posits that these dropouts are due to failures of reverse transcription, a simple enzyme reaction, thus should be modelled using the Michaelis-Menten equation as follows: P_i = 1 - S_i / (S_i + K) where P_i is the proportion of cells where gene i dropouts out, S_i is the mean expression of gene i and K is the Michaelis constant. The first step is to clean the data by remove cells with too few detected genes, genes that with very low expression, and to normalize the data. This can be done using any method but M3Drop includes a simple function that will clean the expression matrix and convert raw counts to counts per million (CPM). If alternative normalization methods are used the input expression matrix must not be log-transformed, nor contain negative values. If normalization adjusts zero values then M3Drop will use the minimal expression value in the entire matrix as the value for dropouts.
In [ ]:
data.drop <- M3DropCleanData(data.reads, labels = anno$characteristics_ch1.1, is.counts = T)
Next, we can compare the fits of different possible functions relating the proportion of dropouts to the average expression for each gene.
In [ ]:
fits <- M3DropDropoutModels(data.drop$data)
Visual inspection of the resulting plot shows that the Michaelis-Menten equation is the best fit to the data. However, we can also examine some statistics to confirm this:
In [ ]:
data.frame(MM = fits$MMFit$SAr, Logistic = fits$LogiFit$SAr, DoubleExpo = fits$ExpoFit$SAr) # Sum absolute residuals
data.frame(MM = fits$MMFit$SSr, Logistic = fits$LogiFit$SSr, DoubleExpo = fits$ExpoFit$SSr) # Sum squared residuals
Here we actually see that statistics favor the logistic curve (smaller value better)
Since the Michaelis-Menten equation is concave, averaging across a mixed population forces differentially expressed genes to be shifted to the right of the Michaelis-Menten curve. DE genes are identified by comparing the local K calculated for a specific gene to the global K fitted across all genes using a Z-test followed by multiple-testing correction
In [ ]:
DE_genes <- M3DropDifferentialExpression(data.drop$data, mt_method = "fdr", mt_threshold = 0.01)
Note that this function runs directly from the expression matrix, hence one could skip straight to identifying DE genes without comparing models and any external normalisation method can be applied to the raw counts prior to DE analysis.
To check that the identified genes are truly differentially expressed we can plot the normalised expression of the genes across cells.
In [ ]:
heat_out <- M3DropExpressionHeatmap(DE_genes$Gene, data.drop$data, cell_labels = data.drop$labels)
To get certain clustering out from the dendrogram of the heatmap:
In [ ]:
clustering <- M3DropGetHeatmapCellClusters(heat_out, k = 4)
We can then test all genes as marker genes for our cell type labels (or the provided clusters).
In [ ]:
marker_genes <- M3DropGetMarkers(data.drop$data, data.drop$labels)
Marker genes are ranked by the area-under the ROC curve (AUC) for predicting the population with the highest expression of the gene from the other groups. Significance is calculated using a Wilcox-rank-sum test. Now we can examine the marker genes of different cell types more closely.
In [ ]:
head(marker_genes[marker_genes$Group=="endothelial",], 20)