AMIA 2016 Annual Symposium Workshop (WG13)

Mining Large-scale Cancer Genomics Data Using Cloud-based Bioinformatics Approaches (RNAseq)

* Part 02 Hands-on practice on differential gene expression identification

Riyue Bao, Ph.D. Center for Research Informatics, The University of Chicago. November 13, 2016


The workshop materials are accessible on Github licensed via LGPLv3.

Table of Contents:


Objective [[top](#top)]

  • Learn the downstream analysis of RNAseq data
    • Detect genes differentially expressed between conditions
    • Identify pathways / network enriched in genes of interest
    • Generate high-quality figures for publication (PCA, heatmap, sample/gene cluster, GO/pathways, etc.)
  • Practice the workflow interactively

Dataset [[top](#top)]

The test datasets used in this workshop are from Fog. et al. 2015. Loss of PRDM11 promotes MYC-driven lymphomagenesis. Blood, 125(8):1272-81 http://www.bloodjournal.org/content/125/8/1272.long?sso-checked=true


Identify DEGs: DESeq2 [[top](#top)]

Commonly used tool / package for detection of signficant DEGs include DESeq2, limma/limmavoom, edgeR, etc.

For this workshop, We will demo how to use DESeq2 to detect DEGs between WT and KO groups.


1. Clean the environment [[top](#top)]


In [1]:
rm(list=ls())

2. Start the clock [[top](#top)]


In [2]:
ptm <- proc.time()

3. Load libraries / packages [[top](#top)]


In [3]:
##-- List packages required in this analysis
cpan.pkg.list <- c('ggplot2', 'scales', 'ape', 'RColorBrewer', 
                  'reshape','VennDiagram')
bioc.pkg.list <- c('ctc',  'limma', 'edgeR', 'DESeq2', 'vsn', 
                  'genefilter', 'pheatmap', 
                  'clusterProfiler', 'pathview',
                  'AnnotationHub')

##-- Set up CPAN repo (required if running IRkernel in Jupyter)
# cpan.repos <- 'http://cran.us.r-project.org'

##-- Install CPAN packages
# install.packages('ggplot2', repos=cpan.repos) 
# lapply(cpan.pkg.list, suppressMessages(install.packages), 
#        repos=cpan.repos, character.only = TRUE)

##-- Set up Bioconductor repo
# source("http://bioconductor.org/biocLite.R")

##-- Install Bioc packages
# biocLite('DESeq2')
# lapply(bioc.pkg.list, suppressMessages(biocLite), 
#        character.only = TRUE)

##-- Load libraries
for(pkg in c(cpan.pkg.list, bioc.pkg.list)) {
    print(pkg)
    suppressMessages(require(pkg, character.only = TRUE))
}

##-- replace Hs with other (e.g. Mm) if you work on other species
##-- annotation database used by clusterProfiler
# suppressMessages(library(org.Hs.eg.db))


[1] "ggplot2"
[1] "scales"
[1] "ape"
[1] "RColorBrewer"
[1] "reshape"
[1] "VennDiagram"
[1] "ctc"
[1] "limma"
[1] "edgeR"
[1] "DESeq2"
[1] "vsn"
[1] "genefilter"
[1] "pheatmap"
[1] "clusterProfiler"
[1] "pathview"
[1] "AnnotationHub"

4. Set up global parameters, input/output directories and files [[top](#top)]


In [4]:
##-- Parameters
cancer <- 'DLBC' 
fdr <- 0.05
fc <- 1.5
gene.type <- 'coding'
caller <- 'deseq2'
group1 <- 'KO'
group2 <- 'WT'
colors <- c('#CC0000', '#00CC00')

##-- Set up working directory
work.dir <- '.'
setwd(work.dir)

##-- Input/Output directories
in.dir <- 'ipynb_data/input'
out.dir <- 'ipynb_data/output'

##-- Input/Output files
expr.file <- paste0(cancer, '.raw_counts.tsv')
sample.file <- paste0(cancer, '.sample_group.tsv')
geneinfo.file <- 'gencode.v24.primary_assembly.annotation.gtf.geneinfo'

5. Print analysis info [[top](#top)]


In [5]:
print(paste0('Cancer = ', cancer))
print(paste0('gene type = ', gene.type))
print(paste0('DEG fdr cutoff = ', fdr))
print(paste0('DEG fc cutoff = ', fc))
print(paste0('Expression file = ', expr.file))
print(paste0('Sample group file  = ', sample.file))
print(paste0('Gene info file  = ', geneinfo.file))


[1] "Cancer = DLBC"
[1] "gene type = coding"
[1] "DEG fdr cutoff = 0.05"
[1] "DEG fc cutoff = 1.5"
[1] "Expression file = DLBC.raw_counts.tsv"
[1] "Sample group file  = DLBC.sample_group.tsv"
[1] "Gene info file  = gencode.v24.primary_assembly.annotation.gtf.geneinfo"

6. Import data files [[top](#top)]


In [6]:
##-- Read data files 
data.expr <- read.delim(paste0(in.dir, '/', expr.file),
                       header = TRUE, stringsAsFactors=FALSE)
data.sample <- read.delim(paste0(in.dir, '/', sample.file),
                         header = TRUE, stringsAsFactors=FALSE)
data.geneinfo <- read.delim(paste0(in.dir, '/', geneinfo.file), 
                           header = TRUE, stringsAsFactors = FALSE)

7. Peek into imported data [[top](#top)]


In [7]:
##-- Expression matrix: raw read counts
print(paste0('Expression matrix = ', 
             nrow(data.expr), ' genes, ', 
             ncol(data.expr), ' fields'))
data.expr[1:3,]

##-- Sample table: experimental design & groups
print(paste0('Sample table = ', 
             nrow(data.sample), ' samples, ', 
             length(table(data.sample$Group)), ' groups'))
data.sample

##-- Gene annotation 
print(paste0('Gene annotation = ', 
             nrow(data.geneinfo), ' genes, ', 
             ncol(data.geneinfo), ' fields'))
head(data.geneinfo)


[1] "Expression matrix = 60725 genes, 12 fields"
GeneidChrStartEndStrandLengthKO01KO02KO03WT01WT02WT03
ENSG00000223972.5 chr1;chr1;chr1;chr1 11869;12613;12975;13221 12227;12721;13052;14409 +;+;+;+ 1735 1 1 0 1 0 0
ENSG00000227232.5 chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;2953414501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570-;-;-;-;-;-;-;-;-;-;- 1351 2 3 0 2 6 3
ENSG00000278267.1 chr1 17369 17436 - 68 0 0 0 0 0 0
[1] "Sample table = 6 samples, 2 groups"
SampleGroup
KO01KO
KO02KO
KO03KO
WT01WT
WT02WT
WT03WT
[1] "Gene annotation = 60725 genes, 6 fields"
gene_idgene_typegene_statusgene_namelevelhavana_gene
ENSG00000000003.14 protein_coding KNOWN TSPAN6 2 OTTHUMG00000022002.1
ENSG00000000005.5 protein_coding KNOWN TNMD 2 OTTHUMG00000022001.1
ENSG00000000419.12 protein_coding KNOWN DPM1 2 OTTHUMG00000032742.2
ENSG00000000457.13 protein_coding KNOWN SCYL3 2 OTTHUMG00000035941.5
ENSG00000000460.16 protein_coding KNOWN C1orf112 2 OTTHUMG00000035821.7
ENSG00000000938.12 protein_coding KNOWN FGR 2 OTTHUMG00000003516.2

8. Preprocess data: Expression matrix, Gene annotation [[top](#top)]


In [8]:
data.expr.proc <- data.expr
data.sample.proc <- data.sample
data.geneinfo.proc <- data.geneinfo

##-- Set up row names of each data frame
row.names(data.expr.proc) <- data.expr.proc[,1]
row.names(data.sample.proc) <- data.sample.proc[,1]
row.names(data.geneinfo.proc) <- data.geneinfo.proc[,1]

##-- Expression matrix: remove extra columns 
# colnames(data.expr.proc)
data.expr.proc <- data.expr.proc[,c(1,7:12)]
colnames(data.expr.proc)[1] <- 'Gene'

##-- Expression matrix: add gene symbol to Ensembl geneid 
##-- from annotation
print(sum(data.expr.proc$Gene %in% data.geneinfo.proc$gene_id) 
      == nrow(data.expr.proc))
data.expr.proc <- merge(data.geneinfo.proc[,c('gene_id', 
                                             'gene_name')], 
                       data.expr.proc, by = 'row.names')

##-- Expression matrix: concat gene symbol & Ensembl id 
##-- as unique key for each gene row
data.expr.proc$Gene <- paste0(data.expr.proc$gene_name, 
                             '!', data.expr.proc$gene_id)
data.expr.proc <- data.expr.proc[,-c(1:3)]
row.names(data.expr.proc) <- data.expr.proc[,1]
data.expr.proc <- data.expr.proc[,-1]

##-- Expression matrix: peek into preprocessed data
data.expr.proc[1:3,]

##-- Gene annotation: show gene types
data.frame(table(data.geneinfo.proc$gene_type))

##-- Gene annotation: retrieve Ensembl id only
data.geneinfo.proc.gene <- data.geneinfo.proc[,c(1,2,4)]
data.geneinfo.proc.gene <- unique(data.geneinfo.proc.gene)
print(paste0('Total genes = ', nrow(data.geneinfo.proc.gene)))

##-- Gene annotation: concat gene symbol & Ensembl id
##-- as unique key for each gene row
data.geneinfo.proc.gene <- data.frame(Key = 
                    paste0(data.geneinfo.proc.gene$gene_name,'!',
                           data.geneinfo.proc.gene$gene_id), 
                                     data.geneinfo.proc.gene)
row.names(data.geneinfo.proc.gene) <- data.geneinfo.proc.gene[,1]
data.geneinfo.proc <- data.geneinfo.proc[,-1]

##-- Gene annotation: keep two types of genes, 
##-- coding and lincRNA, for further analysis
gene.coding <- data.geneinfo.proc.gene[
    data.geneinfo.proc.gene$gene_type == 'protein_coding',]
gene.lincrna <- data.geneinfo.proc[
    data.geneinfo.proc.gene$gene_type == 'lincRNA',]

##-- Gene annotation: show how many genes are in each list
print(paste0('Coding genes = ', nrow(gene.coding)))
print(paste0('lincRNA genes = ', nrow(gene.lincrna)))


[1] TRUE
KO01KO02KO03WT01WT02WT03
TSPAN6!ENSG00000000003.14 3 1 4 0 0 0
TNMD!ENSG00000000005.5 0 0 0 0 0 0
DPM1!ENSG00000000419.12319723181356263027352721
Var1Freq
3prime_overlapping_ncrna 29
antisense 5564
bidirectional_promoter_lncrna 3
IG_C_gene 14
IG_C_pseudogene 9
IG_D_gene 37
IG_J_gene 18
IG_J_pseudogene 3
IG_V_gene 145
IG_V_pseudogene 183
lincRNA 7674
macro_lncRNA 1
miRNA 4205
misc_RNA 2307
Mt_rRNA 2
Mt_tRNA 22
non_coding 3
polymorphic_pseudogene 58
processed_pseudogene 10285
processed_transcript 502
protein_coding 19844
pseudogene 25
ribozyme 8
rRNA 550
scaRNA 49
sense_intronic 919
sense_overlapping 193
snoRNA 949
snRNA 1905
sRNA 20
TEC 1053
transcribed_processed_pseudogene 445
transcribed_unitary_pseudogene 3
transcribed_unprocessed_pseudogene 682
translated_unprocessed_pseudogene 1
TR_C_gene 6
TR_D_gene 4
TR_J_gene 79
TR_J_pseudogene 4
TR_V_gene 108
TR_V_pseudogene 30
unitary_pseudogene 171
unprocessed_pseudogene 2612
vaultRNA 1
[1] "Total genes = 60725"
[1] "Coding genes = 19844"
[1] "lincRNA genes = 7674"

9. Proior to DEG identification: Subset expression matrix based on gene type [[top](#top)]


In [9]:
data.sample.proc.sub <- data.sample.proc

##-- Expression matrix: select gene rows 
##-- based on specified gene type (coding or lincrna)
gene.list <- ''
if(gene.type == 'coding') { gene.list <- gene.coding  }
if(gene.type == 'lincrna') { gene.list <- gene.lincrna }
data.expr.proc.sub <- data.expr.proc[
    row.names(data.expr.proc) %in% row.names(gene.list),] 

##-- Expression matrix: sort matrix sample column 
##-- to be consistent with sample table 
##--(required for DEG analysis)
data.expr.proc.sub <- data.expr.proc.sub[,data.sample.proc.sub$Sample] 

# data.expr = data.expr.proc.sub
# save(data.expr, file = paste0(paste0(out.dir, '/', caller, '/', expr.file, '.coding.Data')))
write.table(data.expr.proc.sub,
           file = paste0(out.dir, '/', caller, '/', expr.file, '.coding.tsv'),
           col.names = TRUE, row.names = TRUE, quote = FALSE, sep = '\t')

print(paste0('Gene list = ', 
             nrow(gene.list), ' (', gene.type, ')'))
print(paste0('Expression matrix before subsetting: ', 
             nrow(data.expr.proc), ' genes'))
print(paste0('Expression matrix after subsetting: ', 
             nrow(data.expr.proc.sub), ' genes'))


[1] "Gene list = 19844 (coding)"
[1] "Expression matrix before subsetting: 60725 genes"
[1] "Expression matrix after subsetting: 19844 genes"

10. Identify DEGs: DESeq2 [[top](#top)]


In [10]:
out.prefix <- paste0(out.dir, '/', caller, '/',cancer,'.',
                    gene.type, '.', caller)

##-- DESeq2: note data.sample.proc.sub must 
##-- have the same sample column
##-- order as data.expr.proc.sub
Group <- as.factor(data.sample.proc.sub$Group)

##-- DESeq2: covert to matrix data format (previously was data frame)
cds <- as.matrix(data.expr.proc.sub)

##-- DESeq2: build DESeqDataSet object, prepare design matrix
dds <- DESeqDataSetFromMatrix(countData = cds,
                           colData = data.sample.proc.sub, 
                           design = ~ Group)

##-- DESeq2: note this is just simple filter to reduce mem
##-- no affect on DEG results
# dim(dds)
# dds <- dds[ rowSums(counts(dds)) > 0, ]
# dim(dds)

##-- DESeq2: plot estimated dispersions
# pdf(paste0(out.prefix, '.dispersion.pdf'),width = 7, height = 7)
# plotDispEsts(dds, xlim = c(1,10000), ylim = c(1E-10, 1))
# dev.off()

print(dds)


Warning message in DESeqDataSet(se, design = design, ignoreRank):
“some variables in design formula are characters, converting to factors”
class: DESeqDataSet 
dim: 19844 6 
metadata(1): version
assays(1): counts
rownames(19844): TSPAN6!ENSG00000000003.14 TNMD!ENSG00000000005.5 ...
  CRLF2!ENSGR0000205755.11 ZBED1!ENSGR0000214717.10
rowData names(0):
colnames(6): KO01 KO02 ... WT02 WT03
colData names(2): Sample Group

11. DESeq2: Normalize and log2-transform count matrix for heatmap, sample clustering, etc. (NOT for DEG identification) [[top](#top)]


In [11]:
##-- DESeq2: three normalization algrithms (rld, vsd and vsd.fast)
rld <- rlog(dds, blind=FALSE)
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
# vsd.fast <- vst(dds, blind=FALSE)

##-- DESeq2: peek into normalized expression matrix
head(assay(rld), 3)
head(assay(vsd), 3)

##-- DESeq2: print normalized expression matrix to local files
write.table(data.frame(Gene = row.names(assay(rld)), assay(rld)),
          file = paste0(out.prefix, '.rld.txt'),
          sep = '\t', col.names = TRUE, row.names = FALSE, quote = FALSE)

write.table(data.frame(Gene = row.names(assay(vsd)), assay(vsd)),
          file = paste0(out.prefix, '.vsd.txt'),
          sep = '\t', col.names = TRUE, row.names = FALSE, quote = FALSE)


KO01KO02KO03WT01WT02WT03
TSPAN6!ENSG00000000003.14 0.474222 0.460208 0.4941183 0.4517336 0.4512781 0.4511058
TNMD!ENSG00000000005.5 0.000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
DPM1!ENSG00000000419.1211.325678 11.178722 10.954651411.366567911.306315411.2640777
KO01KO02KO03WT01WT02WT03
TSPAN6!ENSG00000000003.14 6.680465 6.589379 6.789483 6.43599 6.43599 6.43599
TNMD!ENSG00000000005.5 6.435990 6.435990 6.435990 6.43599 6.43599 6.43599
DPM1!ENSG00000000419.1211.46474311.25322910.92204111.52370 11.43723 11.37653

12. DESeq2: plot mean to var to confirm variance shrink works [[top](#top)]


In [12]:
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: remove genes not expressed in any samples
##-- for plottig purposes
notAllZero <- (rowSums(counts(dds))>0)

##-- DESeq2: mean to var plots 
# pdf(paste0(out.prefix, '.meanvar.log2.pdf'),width = 7, height = 7)
meanSdPlot(log2(counts(estimateSizeFactors(dds),
                       normalized=TRUE)[notAllZero,] + 1))
# dev.off()

# pdf(paste0(out.prefix, '.meanvar.rld.pdf'),width = 7, height = 7)
meanSdPlot(assay(rld[notAllZero,]))
# dev.off()

# pdf(paste0(out.prefix, '.meanvar.vsd.pdf'),width = 7, height = 7)
meanSdPlot(assay(vsd[notAllZero,]))
# dev.off()


13. DESeq2: Quality assessment of sample-to-sample relationships (correlation heatmap & PCA) [[top](#top)]


In [13]:
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: sample correlation heatmap
##-- calculate sample distance 
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$Group, rld$Libtype, sep="-")
colnames(sampleDistMatrix) <- rownames(sampleDistMatrix)

##-- DESeq2: plot heatmap
heatmap.colors <- rev(cm.colors(32))[1:16]
# pdf(paste0(out.prefix, '.sm_cor.pdf'),width = 7, height = 7)
pheatmap(sampleDistMatrix,
       clustering_distance_rows=sampleDists,
       clustering_distance_cols=sampleDists,
       col=heatmap.colors)
# dev.off()

##-- DESeq2: Principal component analysis (PCA) plot of the samples
##-- use ggplot2 to customize the PCA plot
# pdf(paste0(out.prefix, '.pca.pdf'),width = 7, height = 7)
data.pca <- plotPCA(rld, intgroup=c('Group'), returnData=TRUE)
percent.var <- round(100 * attr(data.pca, "percentVar"))
pca.colors <- c(KO = colors[1], WT = colors[2])
p1 <- ggplot(data.pca, aes(PC1, PC2, color = Group)) +
            geom_point(size = 5, shape = 17) +
            scale_colour_manual(values = pca.colors) + 
            xlab(paste0("PC1: ",percent.var[1],"% variance")) +
            ylab(paste0("PC2: ",percent.var[2],"% variance"))
plot(p1)


14. DESeq2: identify DEG between groups [[top](#top)]


In [14]:
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

print(paste0('Group 1 = ', group1))
print(paste0('Group 2 = ', group2))

comp <- paste0(group1, 'vs', group2, '.')
out.prefix <- paste0(out.dir, '/', caller,'/',cancer,'.',
                    gene.type, '.',comp,caller,'.txt')

##-- DESeq2: fit the model and identify DEGs
##-- same as running those three steps: 
##-- (1) dds = estimateSizeFactors(dds)
##-- (2) dds = estimateDispersions(dds)
##-- (3) dds = nbinomWaldTest(dds)
dds <- DESeq(dds, test="Wald", betaPrior=TRUE)
res <- results(dds, 
              contrast=c("Group",group1,group2), 
              pAdjustMethod ="fdr", 
              alpha=fdr)

##-- DESeq2: save the result as an R object for reloading
##-- in future analysis
# save(res, file = paste0(out.prefix,'.RData'))

##-- DESeq2: Plot MA to display differential expression 
##-- versus expression strength
# pdf(paste0(out.prefix,'.MAplot.pdf'), width=7, height=7)
plotMA(res, main="DESeq2", ylim=c(-2,2))
# dev.off()

##-- DESeq2: peek into DEG data object
summary(res)
mcols(res)$description
res <- as.data.frame(res)

##-- DESeq2: add fold change (via anti-log log2FC)
res$foldChange <- NA
row.pos <- which(! is.na(res$log2FoldChange) & 
                res$log2FoldChange >= 0)
row.neg <- which(! is.na(res$log2FoldChange) & 
                res$log2FoldChange < 0)
res$foldChange[row.pos] <- 2^res$log2FoldChange[row.pos]
res$foldChange[row.neg] <- -2^((-1) * res$log2FoldChange[row.neg])
res <- data.frame(id = row.names(res), res)
# print(sum(res$foldChange == 0) == 0)

##-- DESeq2: peek into gene list
res[1:3,]

##-- DESeq2: save DEG output into local file
res.ensembl <- gsub('\\S+[!]', '', res[,1], perl = TRUE)
res.symbol <- gsub('[!]\\S+', '', res[,1], perl = TRUE)
res.print <- data.frame(ENSEMBL = res.ensembl,
                       SYMBOL = res.symbol,
                        res[,-c(1:2)])
write.table(res.print,
          file = out.prefix,
          sep = '\t', col.names = TRUE, row.names = FALSE, quote = FALSE)


[1] "Group 1 = KO"
[1] "Group 2 = WT"
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
out of 16790 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 1010, 6% 
LFC < 0 (down)   : 1038, 6.2% 
outliers [1]     : 1, 0.006% 
low counts [2]   : 4190, 25% 
(mean count < 10)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

  1. 'mean of normalized counts for all samples'
  2. 'log2 fold change (MAP): Group KO vs WT'
  3. 'standard error: Group KO vs WT'
  4. 'Wald statistic: Group KO vs WT'
  5. 'Wald test p-value: Group KO vs WT'
  6. 'fdr adjusted p-values'
idbaseMeanlog2FoldChangelfcSEstatpvaluepadjfoldChange
TSPAN6!ENSG00000000003.14TSPAN6!ENSG00000000003.14 1.449331 0.1551665 0.08359604 1.856147 0.06343258 NA 1.113550
TNMD!ENSG00000000005.5TNMD!ENSG00000000005.5 0.000000 NA NA NA NA NA NA
DPM1!ENSG00000000419.12DPM1!ENSG00000000419.12 2427.048731 -0.2017824 0.12436288 -1.622529 0.10469004 0.2767594 -1.150118

15. DESeq2: Filter for significant DEGs by fold change and fdr [[top](#top)]


In [15]:
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: remove nan values for the foldchange == NAN
before <- nrow(res)
res <- res[!is.na(res$foldChange) & ! is.na(res$padj),];
after <- nrow(res)
print(paste0('Genes removed = ', (before - after), 
             ' (fold change is NA)'))
print(paste0('Genes kept = ', after))

print(paste0('Filter DEGs by: fc, ', fc, ', fdr ', fdr))

##-- DESeq2: plot histogram of unadj p-value for sanity check
# pdf(paste0(out.prefix,'.pvalue.hist.pdf'), width=7, height=7)
hist(res$pval, breaks=100, 
     col="skyblue", border="slateblue", 
     main="Histogram of unadjusted p-value", 
     xlab="unajusted p-value") 
# dev.off()   

##-- DESeq2: filter DEGs
res.flt <- res[(res$foldChange >= fc | res$foldChange <= -fc) & 
              res$padj < fdr,]
print(paste0('Genes filtered = ', (after - nrow(res.flt)), 
             ' (fc, ', fc, ', fdr ', fdr, ')'))
print(paste0('Genes kept = ', nrow(res.flt)))

##-- DESeq2: peek into filtered gene list
res.flt[1:3,]

##-- DESeq2: save filtered DEG output into local file
if(nrow(res.flt) > 0) {
    res.flt.ensembl <- gsub('\\S+[!]', '', res.flt[,1], perl = TRUE)
    res.flt.print <- res.print[res.print$ENSEMBL %in% res.flt.ensembl,]
    write.table(res.flt.print, paste0(out.prefix, 
                                '.flt.fdr', fdr, '_fc', fc), 
                col.names=TRUE, row.names=FALSE, sep='\t', quote=FALSE)
}


[1] "Genes removed = 7245 (fold change is NA)"
[1] "Genes kept = 12599"
[1] "Filter DEGs by: fc, 1.5, fdr 0.05"
[1] "Genes filtered = 12303 (fc, 1.5, fdr 0.05)"
[1] "Genes kept = 296"
idbaseMeanlog2FoldChangelfcSEstatpvaluepadjfoldChange
SPATA20!ENSG00000006282.20SPATA20!ENSG00000006282.20339.1868 -0.6129525 0.1144608 -5.355130 8.549488e-08 4.766150e-06 -1.529386
MYLIP!ENSG00000007944.14MYLIP!ENSG00000007944.14 498.9798 -0.8249904 0.1084094 -7.609956 2.741897e-14 7.050033e-12 -1.771523
SNAI2!ENSG00000019549.8SNAI2!ENSG00000019549.8 143.4191 0.6002812 0.1529813 3.923886 8.713192e-05 1.596681e-03 1.516012

16. DESeq2: Plot expression heatmap of significant DEGs [[top](#top)]


In [16]:
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: continue with significant DEGs from previous step
print(paste0('Genes significant = ', nrow(res.flt), 
             ' (fc, ', fc, ', fdr ', fdr, ')'))

##-- DESeq2: select those sig genes from normalized expression matrix
gene.select <- res.flt$id
data.plot <- assay(rld)
data.plot <- data.plot[row.names(data.plot) %in% gene.select,,
                      drop = FALSE]

##-- DESeq2: prepare annotation label and colors
data.plot.anno <- as.data.frame(
    data.sample.proc.sub[,colnames(data.sample.proc.sub) %in%
                         c('Group'),
                         drop = FALSE])
data.plot.anno.colors <- list(Group = c(KO = colors[1], WT = colors[2]))

##-- DESeq2: plot sig DEG heatmap
print(paste0('Heatmap = ', nrow(data.plot), ' genes on the row, ', 
             ncol(data.plot), ' samples on the column'))
pheatmap(data.plot,
         scale = 'row',
         cluster_rows = TRUE, cluster_cols = TRUE,
         show_rownames = FALSE, show_colnames = TRUE,
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean", 
         clustering_method = "complete",
         annotation_row = NA, 
         annotation_col = data.plot.anno,
         annotation_colors = data.plot.anno.colors
        )


[1] "Genes significant = 296 (fc, 1.5, fdr 0.05)"
[1] "Heatmap = 296 genes on the row, 6 samples on the column"

17. clusterProfiler: Prepare input for GO/KEGG enrichment analysis [[top](#top)]


In [17]:
##-- clusterProfiler: NOTE those commands were tested for clusterProfiler 
##-- version 3.0.5 or above. Older versions may not work as they have 
##-- options revised/removed in the new versions.

##-- clusterProfiler: prepare input for GO/KEGG enrichment analysis
genes.all <- res.print
genes.sig <- res.flt.print
genes.all$ENSEMBL <- gsub('[.]\\d+', '', genes.all$ENSEMBL, perl = TRUE)
genes.sig$ENSEMBL <- gsub('[.]\\d+', '', genes.sig$ENSEMBL, perl = TRUE)

##-- clusterProfiler: remove genes with fc / pvalue as NA
genes.all <- na.omit(genes.all)

##-- clusterProfiler: add EntrezID
##-- show which keytype is available
keytypes(org.Hs.eg.db) 
genes.all.anno <- bitr(geneID   =  genes.all$ENSEMBL, 
                      fromType = 'ENSEMBL', 
                      toType   = c('ENTREZID', 'SYMBOL'), 
                      OrgDb    = 'org.Hs.eg.db', 
                      drop     = TRUE)

##-- clusterProfiler: save annotated output to local files
write.table(genes.all.anno,
            file = paste0(out.prefix,'.anno'),
            col.names = TRUE, row.names = FALSE, sep = '\t', 
            quote = FALSE)

##-- clusterProfiler: remove genes with duplicate EntrezIDs
genes.all.anno <- genes.all.anno[
    which(! duplicated(genes.all.anno$ENTREZID)), ]
row.names(genes.all.anno) <- genes.all.anno$ENTREZID

##-- clusterProfiler: add fc and p-value to genes.all.anno
genes.all.anno <- merge(genes.all.anno, genes.all, by = 'ENSEMBL')
row.names(genes.all.anno) <- genes.all.anno$ENTREZID

##-- clusterProfiler: prepare significant gene list
genes.sig.anno <- genes.all.anno[genes.all.anno$ENSEMBL %in% 
                                genes.sig$ENSEMBL,]

print(paste0('Genes significant = ', nrow(genes.sig.anno)))
genes.sig.anno[1:3,1:7]

##-- clusterProfiler: prepare input for GSEA 
##-- ranked gene list with decreasing fold change
gene.list <- genes.all.anno$foldChange
names(gene.list) <- genes.all.anno$ENTREZID
gene.list <- sort(gene.list, decreasing = TRUE)
gene.list[1:3]


  1. 'ACCNUM'
  2. 'ALIAS'
  3. 'ENSEMBL'
  4. 'ENSEMBLPROT'
  5. 'ENSEMBLTRANS'
  6. 'ENTREZID'
  7. 'ENZYME'
  8. 'EVIDENCE'
  9. 'EVIDENCEALL'
  10. 'GENENAME'
  11. 'GO'
  12. 'GOALL'
  13. 'IPI'
  14. 'MAP'
  15. 'OMIM'
  16. 'ONTOLOGY'
  17. 'ONTOLOGYALL'
  18. 'PATH'
  19. 'PFAM'
  20. 'PMID'
  21. 'PROSITE'
  22. 'REFSEQ'
  23. 'SYMBOL'
  24. 'UCSCKG'
  25. 'UNIGENE'
  26. 'UNIPROT'
'select()' returned 1:many mapping between keys and columns
Warning message in bitr(geneID = genes.all$ENSEMBL, fromType = "ENSEMBL", toType = c("ENTREZID", :
“0.45% of input gene IDs are fail to map...”
[1] "Genes significant = 301"
ENSEMBLENTREZIDSYMBOL.xSYMBOL.ylog2FoldChangelfcSEstat
64847ENSG0000000628264847 SPATA20 SPATA20 -0.6129525 0.1144608 -5.355130
29116ENSG0000000794429116 MYLIP MYLIP -0.8249904 0.1084094 -7.609956
6591ENSG000000195496591 SNAI2 SNAI2 0.6002812 0.1529813 3.923886
3310
5.45723141695043
1958
3.49436114376338
2353
3.43324047473579

18. clusterProfiler: GO enrichment analysis [[top](#top)]


In [18]:
##-- clusterProfiler: GO over-representation test
ego <- enrichGO( gene          = genes.sig.anno$ENTREZID,
                universe      = genes.all.anno$ENTREZID,
                OrgDb         = 'org.Hs.eg.db',
                ont           = "BP",
                pAdjustMethod = "fdr",
                pvalueCutoff  = 0.05,
                readable      = TRUE)
summary(ego)[1:3,1:7]

##-- clusterProfiler: save GO enrichment output to an R object (faster data import next time!)
save(ego, file = paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc, '.enrichGO.RData'))


IDDescriptionGeneRatioBgRatiopvaluep.adjustqvalue
GO:1902105GO:1902105 regulation of leukocyte differentiation 17/269 153/11041 1.793987e-07 0.0003200531 0.0002676516
GO:1903706GO:1903706 regulation of hemopoiesis 20/269 211/11041 2.062198e-07 0.0003200531 0.0002676516
GO:0002761GO:0002761 regulation of myeloid leukocyte differentiation11/269 69/11041 7.631931e-07 0.0007896505 0.0006603629

19. clusterProfiler: KEGG pathway enrichment analysis [[top](#top)]


In [19]:
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 10, repr.plot.height = 5)

##-- clusterProfiler: if running this cell separately, 
##-- load a fresh new copy of ego
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc, '.enrichGO.RData'))

##-- clusterProfiler: enrichGO - drop 5 level of GO terms (too general!)
ego.alllevel <- ego
for(i in 1:5) { 
  ego <- dropGO(ego, level = i)
}

##-- clusterProfiler: visualization of GO enrichment results
p1 <- barplot(ego, showCategory=20)
p2 <- dotplot(ego, showCategory=20)
# p3 = plotGOgraph(ego, firstSigNodes = 10)

##-- clusterProfiler: show the plots in notebook
plot(p1)
plot(p2)



In [20]:
##-- clusterProfiler: KEGG over-representation test
kk <- enrichKEGG(gene          = genes.sig.anno$ENTREZID,
                 universe      = genes.all.anno$ENTREZID,
                 organism      = "human",
                 pAdjustMethod = "fdr",
                 pvalueCutoff  = 0.05)
summary(kk)[1:3,]

##-- clusterProfiler: save KEGG pathway enrichment output to an R object (faster data import next time!)
save(kk, file = paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc,'.enrichKEGG.RData'))


IDDescriptionGeneRatioBgRatiopvaluep.adjustqvaluegeneIDCount
hsa05322hsa05322 Systemic lupus erythematosus 12/145 70/4637 1.239854e-06 0.0002653287 0.0002531912 8970/8365/8347/8349/8356/440689/7124/8339/8358/8367/8343/835712
hsa04620hsa04620 Toll-like receptor signaling pathway 10/145 80/4637 1.654245e-04 0.0177004237 0.0168907142 6696/2353/3725/7124/6351/9560/388372/6349/414062/6348 10
hsa05132hsa05132 Salmonella infection 9/145 69/4637 2.550959e-04 0.0181968390 0.0173644209 3606/2353/3725/6351/9560/388372/6349/414062/6348 9

In [21]:
##-- clusterProfiler: if running this cell separately, load a fresh new copy of kk
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc,'.enrichKEGG.RData'))

##-- clusterProfiler: visualize pathway with all genes
pathway <- 'hsa04010'
plot.title <- paste0(pathway, '.gene_all.pathview.png')
p4 <- pathview(gene.data  = gene.list,
              pathway.id = pathway,
              species    = "human",
              limit      = list(gene=4, cpd=1),
              kegg.dir   = paste0(out.dir,'/',caller),
              out.suffix = paste0('gene_all.pathview'))


'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /home/ubuntu/dev/rnaseq/CRI-Workshop-Nov2016-RNAseq/notebook_ext
Info: Writing image file hsa04010.gene_all.pathview.png

<img src='ipynb_data/assets/hsa04010.gene_all.pathview.png', title = 'hsa04010', width = 1000, height = 1000>

Identify DEGs: DESeq2 - End

20. Stop the clock! [[top](#top)]


In [22]:
proc.time() - ptm


   user  system elapsed 
 78.096   5.675  88.237 

In [23]:
print('Program run finished!')

##-- Print analysis environment (for reproducible research)
sessionInfo()


[1] "Program run finished!"
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] hexbin_1.27.1              AnnotationHub_2.4.2       
 [3] pathview_1.12.0            org.Hs.eg.db_3.3.0        
 [5] AnnotationDbi_1.34.4       clusterProfiler_3.0.5     
 [7] DOSE_2.10.7                pheatmap_1.0.8            
 [9] genefilter_1.54.2          vsn_3.40.0                
[11] DESeq2_1.12.4              SummarizedExperiment_1.2.3
[13] Biobase_2.32.0             GenomicRanges_1.24.3      
[15] GenomeInfoDb_1.8.7         IRanges_2.6.1             
[17] S4Vectors_0.10.3           BiocGenerics_0.18.0       
[19] edgeR_3.14.0               limma_3.28.21             
[21] ctc_1.46.0                 amap_0.8-14               
[23] VennDiagram_1.6.17         futile.logger_1.4.3       
[25] reshape_0.8.6              RColorBrewer_1.1-2        
[27] ape_3.5                    scales_0.4.0              
[29] ggplot2_2.1.0             

loaded via a namespace (and not attached):
 [1] nlme_3.1-128                  bitops_1.0-6                 
 [3] matrixStats_0.51.0            httr_1.2.1                   
 [5] Rgraphviz_2.16.0              repr_0.9.9000                
 [7] tools_3.3.1                   R6_2.2.0                     
 [9] affyio_1.42.0                 rpart_4.1-10                 
[11] Hmisc_3.17-4                  DBI_0.5-1                    
[13] colorspace_1.2-7              nnet_7.3-12                  
[15] gridExtra_2.2.1               preprocessCore_1.34.0        
[17] chron_2.3-47                  graph_1.50.0                 
[19] SparseM_1.72                  labeling_0.3                 
[21] topGO_2.24.0                  KEGGgraph_1.30.0             
[23] affy_1.50.0                   pbdZMQ_0.2-4                 
[25] stringr_1.1.0                 digest_0.6.10                
[27] foreign_0.8-67                XVector_0.12.1               
[29] htmltools_0.3.5               RSQLite_1.0.0                
[31] shiny_0.14.1                  BiocInstaller_1.22.3         
[33] jsonlite_1.1                  BiocParallel_1.6.6           
[35] acepack_1.4.0                 GOSemSim_1.30.3              
[37] RCurl_1.95-4.8                magrittr_1.5                 
[39] GO.db_3.3.0                   Formula_1.2-1                
[41] Matrix_1.2-7.1                Rcpp_0.12.7                  
[43] IRkernel_0.7                  munsell_0.4.3                
[45] stringi_1.1.2                 zlibbioc_1.18.0              
[47] plyr_1.8.4                    qvalue_2.4.2                 
[49] DO.db_2.9                     crayon_1.3.2                 
[51] lattice_0.20-34               Biostrings_2.40.2            
[53] IRdisplay_0.4.9000            splines_3.3.1                
[55] annotate_1.50.1               KEGGREST_1.12.3              
[57] locfit_1.5-9.1                igraph_1.0.1                 
[59] uuid_0.1-2                    geneplotter_1.50.0           
[61] reshape2_1.4.2                futile.options_1.0.0         
[63] XML_3.98-1.1                  evaluate_0.10                
[65] latticeExtra_0.6-28           lambda.r_1.1.9               
[67] data.table_1.9.6              httpuv_1.3.3                 
[69] png_0.1-7                     gtable_0.2.0                 
[71] tidyr_0.6.0                   assertthat_0.1               
[73] mime_0.5                      xtable_1.8-2                 
[75] survival_2.39-5               tibble_1.2                   
[77] cluster_2.0.5                 interactiveDisplayBase_1.10.3
[79] GSEABase_1.34.1              

Now ... let's move on to Notebook 03 (hands-on)