AMIA 2016 Annual Symposium Workshop (WG13)

Mining Large-scale Cancer Genomics Data Using Cloud-based Bioinformatics Approaches (RNAseq)

* Part 02 Hands-on practice on differential gene expression identification

Riyue Bao, Ph.D. Center for Research Informatics, The University of Chicago. November 13, 2016

The workshop materials are accessible on Github licensed via LGPLv3.

Objective [[top](#top)]

Learn the downstream analysis of RNAseq data
- Detect genes differentially expressed between conditions
- Identify pathways / network enriched in genes of interest
- Generate high-quality figures for publication (PCA, heatmap, sample/gene cluster, GO/pathways, etc.)
Practice the workflow interactively

Dataset [[top](#top)]

The test datasets used in this workshop are from Fog. et al. 2015. Loss of PRDM11 promotes MYC-driven lymphomagenesis. Blood, 125(8):1272-81 http://www.bloodjournal.org/content/125/8/1272.long?sso-checked=true

Identify DEGs: DESeq2 [[top](#top)]

Commonly used tool / package for detection of signficant DEGs include DESeq2, limma/limmavoom, edgeR, etc.

For this workshop, We will demo how to use DESeq2 to detect DEGs between WT and KO groups.

1. Clean the environment [[top](#top)]



In [1]:

    
rm(list=ls())

2. Start the clock [[top](#top)]



In [2]:

    
ptm <- proc.time()

3. Load libraries / packages [[top](#top)]



In [3]:

    
##-- List packages required in this analysis
cpan.pkg.list <- c('ggplot2', 'scales', 'ape', 'RColorBrewer', 
                  'reshape','VennDiagram')
bioc.pkg.list <- c('ctc',  'limma', 'edgeR', 'DESeq2', 'vsn', 
                  'genefilter', 'pheatmap', 
                  'clusterProfiler', 'pathview',
                  'AnnotationHub')

##-- Set up CPAN repo (required if running IRkernel in Jupyter)
# cpan.repos <- 'http://cran.us.r-project.org'

##-- Install CPAN packages
# install.packages('ggplot2', repos=cpan.repos) 
# lapply(cpan.pkg.list, suppressMessages(install.packages), 
#        repos=cpan.repos, character.only = TRUE)

##-- Set up Bioconductor repo
# source("http://bioconductor.org/biocLite.R")

##-- Install Bioc packages
# biocLite('DESeq2')
# lapply(bioc.pkg.list, suppressMessages(biocLite), 
#        character.only = TRUE)

##-- Load libraries
for(pkg in c(cpan.pkg.list, bioc.pkg.list)) {
    print(pkg)
    suppressMessages(require(pkg, character.only = TRUE))
}

##-- replace Hs with other (e.g. Mm) if you work on other species
##-- annotation database used by clusterProfiler
# suppressMessages(library(org.Hs.eg.db))









    



[1] "ggplot2"
[1] "scales"
[1] "ape"
[1] "RColorBrewer"
[1] "reshape"
[1] "VennDiagram"
[1] "ctc"
[1] "limma"
[1] "edgeR"
[1] "DESeq2"
[1] "vsn"
[1] "genefilter"
[1] "pheatmap"
[1] "clusterProfiler"
[1] "pathview"
[1] "AnnotationHub"

4. Set up global parameters, input/output directories and files [[top](#top)]



In [4]:

    
##-- Parameters
cancer <- 'DLBC' 
fdr <- 0.05
fc <- 1.5
gene.type <- 'coding'
caller <- 'deseq2'
group1 <- 'KO'
group2 <- 'WT'
colors <- c('#CC0000', '#00CC00')

##-- Set up working directory
work.dir <- '.'
setwd(work.dir)

##-- Input/Output directories
in.dir <- 'ipynb_data/input'
out.dir <- 'ipynb_data/output'

##-- Input/Output files
expr.file <- paste0(cancer, '.raw_counts.tsv')
sample.file <- paste0(cancer, '.sample_group.tsv')
geneinfo.file <- 'gencode.v24.primary_assembly.annotation.gtf.geneinfo'

5. Print analysis info [[top](#top)]



In [5]:

    
print(paste0('Cancer = ', cancer))
print(paste0('gene type = ', gene.type))
print(paste0('DEG fdr cutoff = ', fdr))
print(paste0('DEG fc cutoff = ', fc))
print(paste0('Expression file = ', expr.file))
print(paste0('Sample group file  = ', sample.file))
print(paste0('Gene info file  = ', geneinfo.file))









    



[1] "Cancer = DLBC"
[1] "gene type = coding"
[1] "DEG fdr cutoff = 0.05"
[1] "DEG fc cutoff = 1.5"
[1] "Expression file = DLBC.raw_counts.tsv"
[1] "Sample group file  = DLBC.sample_group.tsv"
[1] "Gene info file  = gencode.v24.primary_assembly.annotation.gtf.geneinfo"

6. Import data files [[top](#top)]



In [6]:

    
##-- Read data files 
data.expr <- read.delim(paste0(in.dir, '/', expr.file),
                       header = TRUE, stringsAsFactors=FALSE)
data.sample <- read.delim(paste0(in.dir, '/', sample.file),
                         header = TRUE, stringsAsFactors=FALSE)
data.geneinfo <- read.delim(paste0(in.dir, '/', geneinfo.file), 
                           header = TRUE, stringsAsFactors = FALSE)

7. Peek into imported data [[top](#top)]



In [7]:

    
##-- Expression matrix: raw read counts
print(paste0('Expression matrix = ', 
             nrow(data.expr), ' genes, ', 
             ncol(data.expr), ' fields'))
data.expr[1:3,]

##-- Sample table: experimental design & groups
print(paste0('Sample table = ', 
             nrow(data.sample), ' samples, ', 
             length(table(data.sample$Group)), ' groups'))
data.sample

##-- Gene annotation 
print(paste0('Gene annotation = ', 
             nrow(data.geneinfo), ' genes, ', 
             ncol(data.geneinfo), ' fields'))
head(data.geneinfo)









    



[1] "Expression matrix = 60725 genes, 12 fields"






    





Geneid Chr Start End Strand Length KO01 KO02 KO03 WT01 WT02 WT03

	ENSG00000223972.5                                                chr1;chr1;chr1;chr1                                              11869;12613;12975;13221                                          12227;12721;13052;14409                                          +;+;+;+                                                          1735                                                             1                                                                1                                                                0                                                                1                                                                0                                                                0                                                                
	ENSG00000227232.5                                                chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1           14404;15005;15796;16607;16858;17233;17606;17915;18268;24738;29534 14501;15038;15947;16765;17055;17368;17742;18061;18366;24891;29570 -;-;-;-;-;-;-;-;-;-;-                                            1351                                                             2                                                                3                                                                0                                                                2                                                                6                                                                3                                                                
	ENSG00000278267.1                                                chr1                                                             17369                                                            17436                                                            -                                                                  68                                                             0                                                                0                                                                0                                                                0                                                                0                                                                0                                                                









    



[1] "Sample table = 6 samples, 2 groups"






    





Sample Group

	KO01 KO  
	KO02 KO  
	KO03 KO  
	WT01 WT  
	WT02 WT  
	WT03 WT  









    



[1] "Gene annotation = 60725 genes, 6 fields"






    





gene_id gene_type gene_status gene_name level havana_gene

	ENSG00000000003.14  protein_coding      KNOWN               TSPAN6              2                   OTTHUMG00000022002.1
	ENSG00000000005.5   protein_coding      KNOWN               TNMD                2                   OTTHUMG00000022001.1
	ENSG00000000419.12  protein_coding      KNOWN               DPM1                2                   OTTHUMG00000032742.2
	ENSG00000000457.13  protein_coding      KNOWN               SCYL3               2                   OTTHUMG00000035941.5
	ENSG00000000460.16  protein_coding      KNOWN               C1orf112            2                   OTTHUMG00000035821.7
	ENSG00000000938.12  protein_coding      KNOWN               FGR                 2                   OTTHUMG00000003516.2

8. Preprocess data: Expression matrix, Gene annotation [[top](#top)]



In [8]:

    
data.expr.proc <- data.expr
data.sample.proc <- data.sample
data.geneinfo.proc <- data.geneinfo

##-- Set up row names of each data frame
row.names(data.expr.proc) <- data.expr.proc[,1]
row.names(data.sample.proc) <- data.sample.proc[,1]
row.names(data.geneinfo.proc) <- data.geneinfo.proc[,1]

##-- Expression matrix: remove extra columns 
# colnames(data.expr.proc)
data.expr.proc <- data.expr.proc[,c(1,7:12)]
colnames(data.expr.proc)[1] <- 'Gene'

##-- Expression matrix: add gene symbol to Ensembl geneid 
##-- from annotation
print(sum(data.expr.proc$Gene %in% data.geneinfo.proc$gene_id) 
      == nrow(data.expr.proc))
data.expr.proc <- merge(data.geneinfo.proc[,c('gene_id', 
                                             'gene_name')], 
                       data.expr.proc, by = 'row.names')

##-- Expression matrix: concat gene symbol & Ensembl id 
##-- as unique key for each gene row
data.expr.proc$Gene <- paste0(data.expr.proc$gene_name, 
                             '!', data.expr.proc$gene_id)
data.expr.proc <- data.expr.proc[,-c(1:3)]
row.names(data.expr.proc) <- data.expr.proc[,1]
data.expr.proc <- data.expr.proc[,-1]

##-- Expression matrix: peek into preprocessed data
data.expr.proc[1:3,]

##-- Gene annotation: show gene types
data.frame(table(data.geneinfo.proc$gene_type))

##-- Gene annotation: retrieve Ensembl id only
data.geneinfo.proc.gene <- data.geneinfo.proc[,c(1,2,4)]
data.geneinfo.proc.gene <- unique(data.geneinfo.proc.gene)
print(paste0('Total genes = ', nrow(data.geneinfo.proc.gene)))

##-- Gene annotation: concat gene symbol & Ensembl id
##-- as unique key for each gene row
data.geneinfo.proc.gene <- data.frame(Key = 
                    paste0(data.geneinfo.proc.gene$gene_name,'!',
                           data.geneinfo.proc.gene$gene_id), 
                                     data.geneinfo.proc.gene)
row.names(data.geneinfo.proc.gene) <- data.geneinfo.proc.gene[,1]
data.geneinfo.proc <- data.geneinfo.proc[,-1]

##-- Gene annotation: keep two types of genes, 
##-- coding and lincRNA, for further analysis
gene.coding <- data.geneinfo.proc.gene[
    data.geneinfo.proc.gene$gene_type == 'protein_coding',]
gene.lincrna <- data.geneinfo.proc[
    data.geneinfo.proc.gene$gene_type == 'lincRNA',]

##-- Gene annotation: show how many genes are in each list
print(paste0('Coding genes = ', nrow(gene.coding)))
print(paste0('lincRNA genes = ', nrow(gene.lincrna)))









    



[1] TRUE






    





KO01 KO02 KO03 WT01 WT02 WT03

	TSPAN6!ENSG00000000003.14    3    1    4    0    0    0
	TNMD!ENSG00000000005.5    0    0    0    0    0    0
	DPM1!ENSG00000000419.12 3197 2318 1356 2630 2735 2721









    





Var1 Freq

	3prime_overlapping_ncrna             29                             
	antisense                          5564                             
	bidirectional_promoter_lncrna         3                             
	IG_C_gene                            14                             
	IG_C_pseudogene                       9                             
	IG_D_gene                            37                             
	IG_J_gene                            18                             
	IG_J_pseudogene                       3                             
	IG_V_gene                           145                             
	IG_V_pseudogene                     183                             
	lincRNA                            7674                             
	macro_lncRNA                          1                             
	miRNA                              4205                             
	misc_RNA                           2307                             
	Mt_rRNA                               2                             
	Mt_tRNA                              22                             
	non_coding                            3                             
	polymorphic_pseudogene               58                             
	processed_pseudogene              10285                             
	processed_transcript                502                             
	protein_coding                    19844                             
	pseudogene                           25                             
	ribozyme                              8                             
	rRNA                                550                             
	scaRNA                               49                             
	sense_intronic                      919                             
	sense_overlapping                   193                             
	snoRNA                              949                             
	snRNA                              1905                             
	sRNA                                 20                             
	TEC                                1053                             
	transcribed_processed_pseudogene    445                             
	transcribed_unitary_pseudogene        3                             
	transcribed_unprocessed_pseudogene   682                             
	translated_unprocessed_pseudogene     1                             
	TR_C_gene                             6                             
	TR_D_gene                             4                             
	TR_J_gene                            79                             
	TR_J_pseudogene                       4                             
	TR_V_gene                           108                             
	TR_V_pseudogene                      30                             
	unitary_pseudogene                  171                             
	unprocessed_pseudogene             2612                             
	vaultRNA                              1                             









    



[1] "Total genes = 60725"
[1] "Coding genes = 19844"
[1] "lincRNA genes = 7674"

9. Proior to DEG identification: Subset expression matrix based on gene type [[top](#top)]



In [9]:

    
data.sample.proc.sub <- data.sample.proc

##-- Expression matrix: select gene rows 
##-- based on specified gene type (coding or lincrna)
gene.list <- ''
if(gene.type == 'coding') { gene.list <- gene.coding  }
if(gene.type == 'lincrna') { gene.list <- gene.lincrna }
data.expr.proc.sub <- data.expr.proc[
    row.names(data.expr.proc) %in% row.names(gene.list),] 

##-- Expression matrix: sort matrix sample column 
##-- to be consistent with sample table 
##--(required for DEG analysis)
data.expr.proc.sub <- data.expr.proc.sub[,data.sample.proc.sub$Sample] 

# data.expr = data.expr.proc.sub
# save(data.expr, file = paste0(paste0(out.dir, '/', caller, '/', expr.file, '.coding.Data')))
write.table(data.expr.proc.sub,
           file = paste0(out.dir, '/', caller, '/', expr.file, '.coding.tsv'),
           col.names = TRUE, row.names = TRUE, quote = FALSE, sep = '\t')

print(paste0('Gene list = ', 
             nrow(gene.list), ' (', gene.type, ')'))
print(paste0('Expression matrix before subsetting: ', 
             nrow(data.expr.proc), ' genes'))
print(paste0('Expression matrix after subsetting: ', 
             nrow(data.expr.proc.sub), ' genes'))









    



[1] "Gene list = 19844 (coding)"
[1] "Expression matrix before subsetting: 60725 genes"
[1] "Expression matrix after subsetting: 19844 genes"

10. Identify DEGs: DESeq2 [[top](#top)]



In [10]:

    
out.prefix <- paste0(out.dir, '/', caller, '/',cancer,'.',
                    gene.type, '.', caller)

##-- DESeq2: note data.sample.proc.sub must 
##-- have the same sample column
##-- order as data.expr.proc.sub
Group <- as.factor(data.sample.proc.sub$Group)

##-- DESeq2: covert to matrix data format (previously was data frame)
cds <- as.matrix(data.expr.proc.sub)

##-- DESeq2: build DESeqDataSet object, prepare design matrix
dds <- DESeqDataSetFromMatrix(countData = cds,
                           colData = data.sample.proc.sub, 
                           design = ~ Group)

##-- DESeq2: note this is just simple filter to reduce mem
##-- no affect on DEG results
# dim(dds)
# dds <- dds[ rowSums(counts(dds)) > 0, ]
# dim(dds)

##-- DESeq2: plot estimated dispersions
# pdf(paste0(out.prefix, '.dispersion.pdf'),width = 7, height = 7)
# plotDispEsts(dds, xlim = c(1,10000), ylim = c(1E-10, 1))
# dev.off()

print(dds)









    



Warning message in DESeqDataSet(se, design = design, ignoreRank):
“some variables in design formula are characters, converting to factors”





    



class: DESeqDataSet 
dim: 19844 6 
metadata(1): version
assays(1): counts
rownames(19844): TSPAN6!ENSG00000000003.14 TNMD!ENSG00000000005.5 ...
  CRLF2!ENSGR0000205755.11 ZBED1!ENSGR0000214717.10
rowData names(0):
colnames(6): KO01 KO02 ... WT02 WT03
colData names(2): Sample Group

11. DESeq2: Normalize and log2-transform count matrix for heatmap, sample clustering, etc. (NOT for DEG identification) [[top](#top)]



In [11]:

    
##-- DESeq2: three normalization algrithms (rld, vsd and vsd.fast)
rld <- rlog(dds, blind=FALSE)
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
# vsd.fast <- vst(dds, blind=FALSE)

##-- DESeq2: peek into normalized expression matrix
head(assay(rld), 3)
head(assay(vsd), 3)

##-- DESeq2: print normalized expression matrix to local files
write.table(data.frame(Gene = row.names(assay(rld)), assay(rld)),
          file = paste0(out.prefix, '.rld.txt'),
          sep = '\t', col.names = TRUE, row.names = FALSE, quote = FALSE)

write.table(data.frame(Gene = row.names(assay(vsd)), assay(vsd)),
          file = paste0(out.prefix, '.vsd.txt'),
          sep = '\t', col.names = TRUE, row.names = FALSE, quote = FALSE)









    





KO01 KO02 KO03 WT01 WT02 WT03

	TSPAN6!ENSG00000000003.14  0.474222  0.460208  0.4941183  0.4517336  0.4512781  0.4511058
	TNMD!ENSG00000000005.5  0.000000  0.000000  0.0000000  0.0000000  0.0000000  0.0000000
	DPM1!ENSG00000000419.12 11.325678 11.178722 10.9546514 11.3665679 11.3063154 11.2640777









    





KO01 KO02 KO03 WT01 WT02 WT03

	TSPAN6!ENSG00000000003.14  6.680465  6.589379  6.789483  6.43599  6.43599  6.43599 
	TNMD!ENSG00000000005.5  6.435990  6.435990  6.435990  6.43599  6.43599  6.43599 
	DPM1!ENSG00000000419.12 11.464743 11.253229 10.922041 11.52370 11.43723 11.37653

12. DESeq2: plot mean to var to confirm variance shrink works [[top](#top)]



In [12]:

    
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: remove genes not expressed in any samples
##-- for plottig purposes
notAllZero <- (rowSums(counts(dds))>0)

##-- DESeq2: mean to var plots 
# pdf(paste0(out.prefix, '.meanvar.log2.pdf'),width = 7, height = 7)
meanSdPlot(log2(counts(estimateSizeFactors(dds),
                       normalized=TRUE)[notAllZero,] + 1))
# dev.off()

# pdf(paste0(out.prefix, '.meanvar.rld.pdf'),width = 7, height = 7)
meanSdPlot(assay(rld[notAllZero,]))
# dev.off()

# pdf(paste0(out.prefix, '.meanvar.vsd.pdf'),width = 7, height = 7)
meanSdPlot(assay(vsd[notAllZero,]))
# dev.off()

13. DESeq2: Quality assessment of sample-to-sample relationships (correlation heatmap & PCA) [[top](#top)]



In [13]:

    
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: sample correlation heatmap
##-- calculate sample distance 
sampleDists <- dist(t(assay(rld)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$Group, rld$Libtype, sep="-")
colnames(sampleDistMatrix) <- rownames(sampleDistMatrix)

##-- DESeq2: plot heatmap
heatmap.colors <- rev(cm.colors(32))[1:16]
# pdf(paste0(out.prefix, '.sm_cor.pdf'),width = 7, height = 7)
pheatmap(sampleDistMatrix,
       clustering_distance_rows=sampleDists,
       clustering_distance_cols=sampleDists,
       col=heatmap.colors)
# dev.off()

##-- DESeq2: Principal component analysis (PCA) plot of the samples
##-- use ggplot2 to customize the PCA plot
# pdf(paste0(out.prefix, '.pca.pdf'),width = 7, height = 7)
data.pca <- plotPCA(rld, intgroup=c('Group'), returnData=TRUE)
percent.var <- round(100 * attr(data.pca, "percentVar"))
pca.colors <- c(KO = colors[1], WT = colors[2])
p1 <- ggplot(data.pca, aes(PC1, PC2, color = Group)) +
            geom_point(size = 5, shape = 17) +
            scale_colour_manual(values = pca.colors) + 
            xlab(paste0("PC1: ",percent.var[1],"% variance")) +
            ylab(paste0("PC2: ",percent.var[2],"% variance"))
plot(p1)

14. DESeq2: identify DEG between groups [[top](#top)]



In [14]:

    
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

print(paste0('Group 1 = ', group1))
print(paste0('Group 2 = ', group2))

comp <- paste0(group1, 'vs', group2, '.')
out.prefix <- paste0(out.dir, '/', caller,'/',cancer,'.',
                    gene.type, '.',comp,caller,'.txt')

##-- DESeq2: fit the model and identify DEGs
##-- same as running those three steps: 
##-- (1) dds = estimateSizeFactors(dds)
##-- (2) dds = estimateDispersions(dds)
##-- (3) dds = nbinomWaldTest(dds)
dds <- DESeq(dds, test="Wald", betaPrior=TRUE)
res <- results(dds, 
              contrast=c("Group",group1,group2), 
              pAdjustMethod ="fdr", 
              alpha=fdr)

##-- DESeq2: save the result as an R object for reloading
##-- in future analysis
# save(res, file = paste0(out.prefix,'.RData'))

##-- DESeq2: Plot MA to display differential expression 
##-- versus expression strength
# pdf(paste0(out.prefix,'.MAplot.pdf'), width=7, height=7)
plotMA(res, main="DESeq2", ylim=c(-2,2))
# dev.off()

##-- DESeq2: peek into DEG data object
summary(res)
mcols(res)$description
res <- as.data.frame(res)

##-- DESeq2: add fold change (via anti-log log2FC)
res$foldChange <- NA
row.pos <- which(! is.na(res$log2FoldChange) & 
                res$log2FoldChange >= 0)
row.neg <- which(! is.na(res$log2FoldChange) & 
                res$log2FoldChange < 0)
res$foldChange[row.pos] <- 2^res$log2FoldChange[row.pos]
res$foldChange[row.neg] <- -2^((-1) * res$log2FoldChange[row.neg])
res <- data.frame(id = row.names(res), res)
# print(sum(res$foldChange == 0) == 0)

##-- DESeq2: peek into gene list
res[1:3,]

##-- DESeq2: save DEG output into local file
res.ensembl <- gsub('\\S+[!]', '', res[,1], perl = TRUE)
res.symbol <- gsub('[!]\\S+', '', res[,1], perl = TRUE)
res.print <- data.frame(ENSEMBL = res.ensembl,
                       SYMBOL = res.symbol,
                        res[,-c(1:2)])
write.table(res.print,
          file = out.prefix,
          sep = '\t', col.names = TRUE, row.names = FALSE, quote = FALSE)









    



[1] "Group 1 = KO"
[1] "Group 2 = WT"






    



estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing






    



out of 16790 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)     : 1010, 6% 
LFC < 0 (down)   : 1038, 6.2% 
outliers [1]     : 1, 0.006% 
low counts [2]   : 4190, 25% 
(mean count < 10)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results







    





	'mean of normalized counts for all samples'
	'log2 fold change (MAP): Group KO vs WT'
	'standard error: Group KO vs WT'
	'Wald statistic: Group KO vs WT'
	'Wald test p-value: Group KO vs WT'
	'fdr adjusted p-values'








    





id baseMean log2FoldChange lfcSE stat pvalue padj foldChange

	TSPAN6!ENSG00000000003.14 TSPAN6!ENSG00000000003.14    1.449331               0.1551665               0.08359604                1.856147                0.06343258                      NA                 1.113550                
	TNMD!ENSG00000000005.5 TNMD!ENSG00000000005.5      0.000000                      NA                       NA                      NA                        NA                      NA                       NA                
	DPM1!ENSG00000000419.12 DPM1!ENSG00000000419.12  2427.048731              -0.2017824               0.12436288               -1.622529                0.10469004               0.2767594                -1.150118

15. DESeq2: Filter for significant DEGs by fold change and fdr [[top](#top)]



In [15]:

    
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: remove nan values for the foldchange == NAN
before <- nrow(res)
res <- res[!is.na(res$foldChange) & ! is.na(res$padj),];
after <- nrow(res)
print(paste0('Genes removed = ', (before - after), 
             ' (fold change is NA)'))
print(paste0('Genes kept = ', after))

print(paste0('Filter DEGs by: fc, ', fc, ', fdr ', fdr))

##-- DESeq2: plot histogram of unadj p-value for sanity check
# pdf(paste0(out.prefix,'.pvalue.hist.pdf'), width=7, height=7)
hist(res$pval, breaks=100, 
     col="skyblue", border="slateblue", 
     main="Histogram of unadjusted p-value", 
     xlab="unajusted p-value") 
# dev.off()   

##-- DESeq2: filter DEGs
res.flt <- res[(res$foldChange >= fc | res$foldChange <= -fc) & 
              res$padj < fdr,]
print(paste0('Genes filtered = ', (after - nrow(res.flt)), 
             ' (fc, ', fc, ', fdr ', fdr, ')'))
print(paste0('Genes kept = ', nrow(res.flt)))

##-- DESeq2: peek into filtered gene list
res.flt[1:3,]

##-- DESeq2: save filtered DEG output into local file
if(nrow(res.flt) > 0) {
    res.flt.ensembl <- gsub('\\S+[!]', '', res.flt[,1], perl = TRUE)
    res.flt.print <- res.print[res.print$ENSEMBL %in% res.flt.ensembl,]
    write.table(res.flt.print, paste0(out.prefix, 
                                '.flt.fdr', fdr, '_fc', fc), 
                col.names=TRUE, row.names=FALSE, sep='\t', quote=FALSE)
}









    



[1] "Genes removed = 7245 (fold change is NA)"
[1] "Genes kept = 12599"
[1] "Filter DEGs by: fc, 1.5, fdr 0.05"
[1] "Genes filtered = 12303 (fc, 1.5, fdr 0.05)"
[1] "Genes kept = 296"






    





id baseMean log2FoldChange lfcSE stat pvalue padj foldChange

	SPATA20!ENSG00000006282.20 SPATA20!ENSG00000006282.20 339.1868                  -0.6129525                0.1144608                 -5.355130                 8.549488e-08              4.766150e-06              -1.529386                 
	MYLIP!ENSG00000007944.14 MYLIP!ENSG00000007944.14  498.9798                  -0.8249904                0.1084094                 -7.609956                 2.741897e-14              7.050033e-12              -1.771523                 
	SNAI2!ENSG00000019549.8 SNAI2!ENSG00000019549.8   143.4191                   0.6002812                0.1529813                  3.923886                 8.713192e-05              1.596681e-03               1.516012

16. DESeq2: Plot expression heatmap of significant DEGs [[top](#top)]



In [16]:

    
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 6, repr.plot.height = 5)

##-- DESeq2: continue with significant DEGs from previous step
print(paste0('Genes significant = ', nrow(res.flt), 
             ' (fc, ', fc, ', fdr ', fdr, ')'))

##-- DESeq2: select those sig genes from normalized expression matrix
gene.select <- res.flt$id
data.plot <- assay(rld)
data.plot <- data.plot[row.names(data.plot) %in% gene.select,,
                      drop = FALSE]

##-- DESeq2: prepare annotation label and colors
data.plot.anno <- as.data.frame(
    data.sample.proc.sub[,colnames(data.sample.proc.sub) %in%
                         c('Group'),
                         drop = FALSE])
data.plot.anno.colors <- list(Group = c(KO = colors[1], WT = colors[2]))

##-- DESeq2: plot sig DEG heatmap
print(paste0('Heatmap = ', nrow(data.plot), ' genes on the row, ', 
             ncol(data.plot), ' samples on the column'))
pheatmap(data.plot,
         scale = 'row',
         cluster_rows = TRUE, cluster_cols = TRUE,
         show_rownames = FALSE, show_colnames = TRUE,
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean", 
         clustering_method = "complete",
         annotation_row = NA, 
         annotation_col = data.plot.anno,
         annotation_colors = data.plot.anno.colors
        )









    



[1] "Genes significant = 296 (fc, 1.5, fdr 0.05)"
[1] "Heatmap = 296 genes on the row, 6 samples on the column"

17. clusterProfiler: Prepare input for GO/KEGG enrichment analysis [[top](#top)]



In [17]:

    
##-- clusterProfiler: NOTE those commands were tested for clusterProfiler 
##-- version 3.0.5 or above. Older versions may not work as they have 
##-- options revised/removed in the new versions.

##-- clusterProfiler: prepare input for GO/KEGG enrichment analysis
genes.all <- res.print
genes.sig <- res.flt.print
genes.all$ENSEMBL <- gsub('[.]\\d+', '', genes.all$ENSEMBL, perl = TRUE)
genes.sig$ENSEMBL <- gsub('[.]\\d+', '', genes.sig$ENSEMBL, perl = TRUE)

##-- clusterProfiler: remove genes with fc / pvalue as NA
genes.all <- na.omit(genes.all)

##-- clusterProfiler: add EntrezID
##-- show which keytype is available
keytypes(org.Hs.eg.db) 
genes.all.anno <- bitr(geneID   =  genes.all$ENSEMBL, 
                      fromType = 'ENSEMBL', 
                      toType   = c('ENTREZID', 'SYMBOL'), 
                      OrgDb    = 'org.Hs.eg.db', 
                      drop     = TRUE)

##-- clusterProfiler: save annotated output to local files
write.table(genes.all.anno,
            file = paste0(out.prefix,'.anno'),
            col.names = TRUE, row.names = FALSE, sep = '\t', 
            quote = FALSE)

##-- clusterProfiler: remove genes with duplicate EntrezIDs
genes.all.anno <- genes.all.anno[
    which(! duplicated(genes.all.anno$ENTREZID)), ]
row.names(genes.all.anno) <- genes.all.anno$ENTREZID

##-- clusterProfiler: add fc and p-value to genes.all.anno
genes.all.anno <- merge(genes.all.anno, genes.all, by = 'ENSEMBL')
row.names(genes.all.anno) <- genes.all.anno$ENTREZID

##-- clusterProfiler: prepare significant gene list
genes.sig.anno <- genes.all.anno[genes.all.anno$ENSEMBL %in% 
                                genes.sig$ENSEMBL,]

print(paste0('Genes significant = ', nrow(genes.sig.anno)))
genes.sig.anno[1:3,1:7]

##-- clusterProfiler: prepare input for GSEA 
##-- ranked gene list with decreasing fold change
gene.list <- genes.all.anno$foldChange
names(gene.list) <- genes.all.anno$ENTREZID
gene.list <- sort(gene.list, decreasing = TRUE)
gene.list[1:3]









    





	'ACCNUM'
	'ALIAS'
	'ENSEMBL'
	'ENSEMBLPROT'
	'ENSEMBLTRANS'
	'ENTREZID'
	'ENZYME'
	'EVIDENCE'
	'EVIDENCEALL'
	'GENENAME'
	'GO'
	'GOALL'
	'IPI'
	'MAP'
	'OMIM'
	'ONTOLOGY'
	'ONTOLOGYALL'
	'PATH'
	'PFAM'
	'PMID'
	'PROSITE'
	'REFSEQ'
	'SYMBOL'
	'UCSCKG'
	'UNIGENE'
	'UNIPROT'








    



'select()' returned 1:many mapping between keys and columns
Warning message in bitr(geneID = genes.all$ENSEMBL, fromType = "ENSEMBL", toType = c("ENTREZID", :
“0.45% of input gene IDs are fail to map...”





    



[1] "Genes significant = 301"






    





ENSEMBL ENTREZID SYMBOL.x SYMBOL.y log2FoldChange lfcSE stat

	64847 ENSG00000006282 64847          SPATA20        SPATA20        -0.6129525     0.1144608      -5.355130      
	29116 ENSG00000007944 29116          MYLIP          MYLIP          -0.8249904     0.1084094      -7.609956      
	6591 ENSG00000019549 6591           SNAI2          SNAI2           0.6002812     0.1529813       3.923886      









    





	3310
		5.45723141695043
	1958
		3.49436114376338
	2353
		3.43324047473579

18. clusterProfiler: GO enrichment analysis [[top](#top)]



In [18]:

    
##-- clusterProfiler: GO over-representation test
ego <- enrichGO( gene          = genes.sig.anno$ENTREZID,
                universe      = genes.all.anno$ENTREZID,
                OrgDb         = 'org.Hs.eg.db',
                ont           = "BP",
                pAdjustMethod = "fdr",
                pvalueCutoff  = 0.05,
                readable      = TRUE)
summary(ego)[1:3,1:7]

##-- clusterProfiler: save GO enrichment output to an R object (faster data import next time!)
save(ego, file = paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc, '.enrichGO.RData'))









    





ID Description GeneRatio BgRatio pvalue p.adjust qvalue

	GO:1902105 GO:1902105                                     regulation of leukocyte differentiation        17/269                                         153/11041                                      1.793987e-07                                   0.0003200531                                   0.0002676516                                   
	GO:1903706 GO:1903706                                     regulation of hemopoiesis                      20/269                                         211/11041                                      2.062198e-07                                   0.0003200531                                   0.0002676516                                   
	GO:0002761 GO:0002761                                     regulation of myeloid leukocyte differentiation 11/269                                         69/11041                                       7.631931e-07                                   0.0007896505                                   0.0006603629

19. clusterProfiler: KEGG pathway enrichment analysis [[top](#top)]



In [19]:

    
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 10, repr.plot.height = 5)

##-- clusterProfiler: if running this cell separately, 
##-- load a fresh new copy of ego
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc, '.enrichGO.RData'))

##-- clusterProfiler: enrichGO - drop 5 level of GO terms (too general!)
ego.alllevel <- ego
for(i in 1:5) { 
  ego <- dropGO(ego, level = i)
}

##-- clusterProfiler: visualization of GO enrichment results
p1 <- barplot(ego, showCategory=20)
p2 <- dotplot(ego, showCategory=20)
# p3 = plotGOgraph(ego, firstSigNodes = 10)

##-- clusterProfiler: show the plots in notebook
plot(p1)
plot(p2)



In [20]:

    
##-- clusterProfiler: KEGG over-representation test
kk <- enrichKEGG(gene          = genes.sig.anno$ENTREZID,
                 universe      = genes.all.anno$ENTREZID,
                 organism      = "human",
                 pAdjustMethod = "fdr",
                 pvalueCutoff  = 0.05)
summary(kk)[1:3,]

##-- clusterProfiler: save KEGG pathway enrichment output to an R object (faster data import next time!)
save(kk, file = paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc,'.enrichKEGG.RData'))









    





ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count

	hsa05322 hsa05322                                                     Systemic lupus erythematosus                                 12/145                                                       70/4637                                                      1.239854e-06                                                 0.0002653287                                                 0.0002531912                                                 8970/8365/8347/8349/8356/440689/7124/8339/8358/8367/8343/8357 12                                                           
	hsa04620 hsa04620                                                     Toll-like receptor signaling pathway                         10/145                                                       80/4637                                                      1.654245e-04                                                 0.0177004237                                                 0.0168907142                                                 6696/2353/3725/7124/6351/9560/388372/6349/414062/6348        10                                                           
	hsa05132 hsa05132                                                     Salmonella infection                                         9/145                                                        69/4637                                                      2.550959e-04                                                 0.0181968390                                                 0.0173644209                                                 3606/2353/3725/6351/9560/388372/6349/414062/6348              9



In [21]:

    
##-- clusterProfiler: if running this cell separately, load a fresh new copy of kk
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc,'.enrichKEGG.RData'))

##-- clusterProfiler: visualize pathway with all genes
pathway <- 'hsa04010'
plot.title <- paste0(pathway, '.gene_all.pathview.png')
p4 <- pathview(gene.data  = gene.list,
              pathway.id = pathway,
              species    = "human",
              limit      = list(gene=4, cpd=1),
              kegg.dir   = paste0(out.dir,'/',caller),
              out.suffix = paste0('gene_all.pathview'))









    



'select()' returned 1:1 mapping between keys and columns
Info: Working in directory /home/ubuntu/dev/rnaseq/CRI-Workshop-Nov2016-RNAseq/notebook_ext
Info: Writing image file hsa04010.gene_all.pathview.png

Identify DEGs: DESeq2 - End

20. Stop the clock! [[top](#top)]



In [22]:

    
proc.time() - ptm









    





   user  system elapsed 
 78.096   5.675  88.237



In [23]:

    
print('Program run finished!')

##-- Print analysis environment (for reproducible research)
sessionInfo()









    



[1] "Program run finished!"






    





R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] parallel  stats4    grid      stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] hexbin_1.27.1              AnnotationHub_2.4.2       
 [3] pathview_1.12.0            org.Hs.eg.db_3.3.0        
 [5] AnnotationDbi_1.34.4       clusterProfiler_3.0.5     
 [7] DOSE_2.10.7                pheatmap_1.0.8            
 [9] genefilter_1.54.2          vsn_3.40.0                
[11] DESeq2_1.12.4              SummarizedExperiment_1.2.3
[13] Biobase_2.32.0             GenomicRanges_1.24.3      
[15] GenomeInfoDb_1.8.7         IRanges_2.6.1             
[17] S4Vectors_0.10.3           BiocGenerics_0.18.0       
[19] edgeR_3.14.0               limma_3.28.21             
[21] ctc_1.46.0                 amap_0.8-14               
[23] VennDiagram_1.6.17         futile.logger_1.4.3       
[25] reshape_0.8.6              RColorBrewer_1.1-2        
[27] ape_3.5                    scales_0.4.0              
[29] ggplot2_2.1.0             

loaded via a namespace (and not attached):
 [1] nlme_3.1-128                  bitops_1.0-6                 
 [3] matrixStats_0.51.0            httr_1.2.1                   
 [5] Rgraphviz_2.16.0              repr_0.9.9000                
 [7] tools_3.3.1                   R6_2.2.0                     
 [9] affyio_1.42.0                 rpart_4.1-10                 
[11] Hmisc_3.17-4                  DBI_0.5-1                    
[13] colorspace_1.2-7              nnet_7.3-12                  
[15] gridExtra_2.2.1               preprocessCore_1.34.0        
[17] chron_2.3-47                  graph_1.50.0                 
[19] SparseM_1.72                  labeling_0.3                 
[21] topGO_2.24.0                  KEGGgraph_1.30.0             
[23] affy_1.50.0                   pbdZMQ_0.2-4                 
[25] stringr_1.1.0                 digest_0.6.10                
[27] foreign_0.8-67                XVector_0.12.1               
[29] htmltools_0.3.5               RSQLite_1.0.0                
[31] shiny_0.14.1                  BiocInstaller_1.22.3         
[33] jsonlite_1.1                  BiocParallel_1.6.6           
[35] acepack_1.4.0                 GOSemSim_1.30.3              
[37] RCurl_1.95-4.8                magrittr_1.5                 
[39] GO.db_3.3.0                   Formula_1.2-1                
[41] Matrix_1.2-7.1                Rcpp_0.12.7                  
[43] IRkernel_0.7                  munsell_0.4.3                
[45] stringi_1.1.2                 zlibbioc_1.18.0              
[47] plyr_1.8.4                    qvalue_2.4.2                 
[49] DO.db_2.9                     crayon_1.3.2                 
[51] lattice_0.20-34               Biostrings_2.40.2            
[53] IRdisplay_0.4.9000            splines_3.3.1                
[55] annotate_1.50.1               KEGGREST_1.12.3              
[57] locfit_1.5-9.1                igraph_1.0.1                 
[59] uuid_0.1-2                    geneplotter_1.50.0           
[61] reshape2_1.4.2                futile.options_1.0.0         
[63] XML_3.98-1.1                  evaluate_0.10                
[65] latticeExtra_0.6-28           lambda.r_1.1.9               
[67] data.table_1.9.6              httpuv_1.3.3                 
[69] png_0.1-7                     gtable_0.2.0                 
[71] tidyr_0.6.0                   assertthat_0.1               
[73] mime_0.5                      xtable_1.8-2                 
[75] survival_2.39-5               tibble_1.2                   
[77] cluster_2.0.5                 interactiveDisplayBase_1.10.3
[79] GSEABase_1.34.1