Riyue Bao, Ph.D. Center for Research Informatics, The University of Chicago. 9:00 AM - 10:15 AM, November 13, 2016
The workshop materials are accessible on Github.
<img src='notebook_ext/ipynb_data/assets/Figure24.png', title = 'RNAseq objective', width = 600, height = 600>
Sample | Group | Sequencing File | Sequencing Data |
---|---|---|---|
KO01 | KO | KO01.fastq.gz | 74,126,025 reads |
KO02 | KO | KO02.fastq.gz | 64,695,948 reads |
KO03 | KO | KO03.fastq.gz | 52,972,573 reads |
WT01 | WT | WT01.fastq.gz | 55,005,729 reads |
WT01 | WT | WT02.fastq.gz | 61,079,377 reads |
WT01 | WT | WT03.fastq.gz | 66,517,156 reads |
... and more!
<img src='notebook_ext/ipynb_data/assets/Figure14.png', title = 'A typical RNAseq experiment', width = 800, height = 400>
<img src='notebook_ext/ipynb_data/assets/Figure20.2.png', title = 'RNAseq analysis', width = 800, height = 400>
The best-practice analysis protocol takes 8 major steps. For a more detailed description and program commands used in each step, refer to the extended version of notebooks (notebook_ext/01.Run_RNAseq.tutorial.ipynb
).
<img src='notebook_ext/ipynb_data/assets/Figure25.png', title = 'RNAseq workflow', width = 900, height = 900>
Due to time limit, we will not run the BDS pipeline in this workshop. Since it is automated, participants are encouraged to explore it after the workshop following the instructions at Github. The launching of the pipeline only requires submission of one script Build_RNAseq.sh
, which was already prepared for you. The pipeline was designed to take care of all the dependencies between tasks/jobs with robust checkpoints and high reproducibility.
KO01.fastq.gz
), where each read is presented by 4 lines
<img src='notebook_ext/ipynb_data/assets/Figure10.png', title = 'Sequencing reads in FastQ format', width = 600, height = 90>notebook_ext/01.Run_RNAseq.tutorial.ipynb
, MultiQC)notebook_ext/01.Run_RNAseq.tutorial.ipynb
, MultiQC)For detailed answers to those questions, refer to the extended version of notebooks (notebook_ext/01.Run_RNAseq.tutorial.ipynb
)
Q1: Which sample (S1-4) has the most severe contamination from genomic DNA? Hint: higher percentage of intergenic reads indicates more severe DNA contamination in RNA samples
Q2: Which sample (S1-4) has the least efficient depletion of ribosome RNAs (rRNAs)? Hint: rRNAs account for > 80% of the whole transcriptome. If not removed, the majority of the sequencing reads will be derived from rRNA and not the coding or lincRNAs.<img src='notebook_ext/ipynb_data/assets/Figure5.png', title = 'Figure5', width = 600, height = 600>
<img src='notebook_ext/ipynb_data/assets/Figure16.png', title = 'KO vs WT read coverage FOS', width = 600, height = 600>
notebook_ext/ipynb_data/Rscripts
, e.g. DESeq2.plot_PCA.R
R
script notebook_ext/02.Run_RNAseq.tutorial.ipynb
Involves five sub-steps in R (i to v).
To systematically identify significant DEGs across the transcriptome, we will use DESeq2
edgeR
, limmavoom
) may generate different results! Seyednasrollah F et al. 2015Good coding practice
;
but not recommended
In [ ]:
##-- Clean the environment: always run this before starting new analysis!
rm(list=ls())
##-- Load libraries required in this analysis
##-- Those packages were pre-installed in the AWS machine
pkg.list <- c('ggplot2', 'RColorBrewer', 'reshape', 'broom',
'DESeq2', 'vsn', 'genefilter', 'pheatmap',
'clusterProfiler', 'pathview', 'IRdisplay')
for(pkg in pkg.list) {
suppressMessages(library(pkg, character.only = TRUE))
}
##-- Parameters
cancer <- 'DLBC' ## full name
fdr <- 0.05 ## false-discovery rate adjusted p-value
fc <- 1.5 ## expression fold of change
gene.type <- 'coding' ## protein-coding genes
caller <- 'deseq2'
group1 <- 'KO' ## knockout
group2 <- 'WT' ## wild-type
colors <- c('#CC0000', '#00CC00')
##-- Set up working directory
work.dir <- '.'
setwd(work.dir)
##-- Input/Output directories
in.dir <- 'notebook_ext/ipynb_data/input'
out.dir <- 'notebook_ext/ipynb_data/output'
##-- Input/Output files
expr.file <- paste0(cancer, '.raw_counts.coding.tsv')
sample.file <- paste0(cancer, '.sample_group.tsv')
##-- Print analysis info
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'DESeq2.print_info.R')
source(script)
In [ ]:
##-- Read data files
data.expr <- read.delim(file.path(in.dir, expr.file), header=TRUE,
stringsAsFactors=FALSE)
data.sample <- read.delim(file.path(in.dir, sample.file), header=TRUE,
stringsAsFactors=FALSE)
print('Show the first three lines of gene expression file:')
data.expr[1:3,]
print('Show the sample group file:')
data.sample
In [ ]:
##-- Format data matrix
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'DESeq2.format_data.R')
source(script)
##-- DESeq2: build DESeqDataSet object, prepare design matrix
dds <- DESeqDataSetFromMatrix(countData = as.matrix(data.expr.proc),
colData = data.sample,
design = ~ Group)
# print(dds)
##-- DESeq2: normalize matrix for clustering/heatmap generation
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
##-- DESeq2: correlation between bio replicates (skip in workshop)
# script <- file.path('notebook_ext','ipynb_data','Rscripts',
# 'DESeq2.plot_SMcor_heatmap.R')
#source(script)
##-- DESeq2: Principal component analysis (PCA) plot
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'DESeq2.plot_PCA.R')
source(script)
KO | WT |
---|---|
1 | 0 |
1 | 0 |
1 | 0 |
0 | 1 |
0 | 1 |
0 | 1 |
In [ ]:
##-- DESeq2: set up variables
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'DESeq2.set_variable.R')
source(script)
##-- DESeq2: fit the model and identify DEGs
print('Running DESeq2 ... ')
dds <- DESeq(dds, test="Wald", betaPrior=TRUE)
res <- results(dds,
contrast=c("Group",group1,group2),
pAdjustMethod ="fdr",
alpha=fdr)
print('Done!')
##-- DESeq2: peek into DEG data object
summary(res)
res <- as.data.frame(res)
##-- DESeq2: add anti-log2 fold change and gene symbol to output
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'DESeq2.add_fc_symbol.R')
source(script)
print('Show the first three lines of DESeq2 output:')
res.print[1:3,]
In [ ]:
##-- DESeq2: filter significant DEGs by fc 1.5 and adjusted p < 0.05
print('Show the first three lines of DESeq2 output:')
res.print[1:3,]
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'DESeq2.flt_sigDEGs.R')
source(script)
##-- DESeq2: plot sig DEG heatmap
print(paste0('Heatmap = ', nrow(data.plot), ' genes on the row, ',
ncol(data.plot), ' samples on the column'))
pheatmap(data.plot,
scale = 'row',
cluster_rows = TRUE, cluster_cols = TRUE,
show_rownames = FALSE, show_colnames = TRUE,
clustering_distance_rows = "euclidean",
clustering_distance_cols = "euclidean",
clustering_method = "complete",
)
In [ ]:
##-- clusterProfiler: NOTE those commands were tested for clusterProfiler
##-- version 3.0.5 or above. Older versions may not work as they have
##-- options revised/removed in the new versions.
##-- clusterProfiler: prepare input for GO/KEGG enrichment analysis
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'clusterProfilter.prepare_input.R')
source(script)
##-- clusterProfiler: GO over-representation test
##-- takes ~2 minutes .. skip in workshop!
# ego <- enrichGO( gene = genes.sig.anno$ENTREZID,
# universe = genes.all.anno$ENTREZID,
# OrgDb = 'org.Hs.eg.db',
# ont = "BP",
# pAdjustMethod = "fdr",
# pvalueCutoff = 0.05,
# readable = TRUE)
##-- instead, load already generated output
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc, '.enrichGO.RData'))
summary(ego)[1:3,1:7]
##-- clusterProfiler: KEGG over-representation test
##-- takes ~2 minutes .. skip in workshop!
# kk <- enrichKEGG(gene = genes.sig.anno$ENTREZID,
# universe = genes.all.anno$ENTREZID,
# organism = "human",
# pAdjustMethod = "fdr",
# pvalueCutoff = 0.05)
##-- instead, load already generated output
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc,'.enrichKEGG.RData'))
summary(kk)[1:3,]
##-- clusterProfiler: visualize pathway with all genes
gene.list <- genes.all.anno$foldChange
names(gene.list) <- genes.all.anno$ENTREZID
pathway <- 'hsa04010'
plot.title <- paste0(pathway, '.gene_all.pathview.png')
p4 <- pathview(gene.data = gene.list,
pathway.id = pathway,
species = "human",
limit = list(gene=4, cpd=1),
kegg.dir = paste0(out.dir,'/',caller),
out.suffix = paste0('gene_all.pathview'))
##-- clusterProfiler: visualization of GO and KEGG enrichment results
barplot(ego, showCategory=20)
dotplot(ego, showCategory=20)
In [ ]:
##-- KEGG pathwayview written into local files. Display the plot in R
display_png(file='notebook_ext/ipynb_data/assets/hsa04010.gene_all.pathview.png')
The analysis takes 3 major steps. Full-length description & R code is provided in notebook_ext/03.Run_RNAseq.tutorial.ipynb
.
<img src='notebook_ext/ipynb_data/assets/Figure26.png', title = 'TCGA & Clinical workflow', width = 900, height = 900>
In [ ]:
##-- Clean up environment
rm(list=ls())
set.seed(13)
##-- Load libraries required in this analysis
##-- Those packages were pre-installed in the AWS machine
pkg.list <- c('ggplot2', 'RColorBrewer', 'reshape', 'broom',
'gplots', 'survival', 'NMF', 'IRdisplay')
for(pkg in pkg.list) {
suppressMessages(library(pkg, character.only = TRUE))
}
##-- Parameters
cancer <- 'OV'
gene.top.count <- 150
##-- Import data
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'TCGA_OV.import_data.R')
source(script)
##-- run NMF to cluster samples & genes
##-- (use 4 core 'p4', and print more info 'v')
##-- takes 15 minutes to run... skip in workshop
print('NMF clustering start...')
# expr.sub.nmf <- nmf(expr.sub,
# rank = 3,
# method = 'brunet',
# seed = 'random',
# nrun = 100,
# .opt = 'vp4')
##-- workshop only: load already generated result
load(paste0(out.dir,'/',expr.file,'.nmf.RData'))
print('Done!')
##-- plot expression heatmap
script <- file.path('notebook_ext','ipynb_data','Rscripts',
'TCGA_OV.plot_heatmap.R')
source(script)
# print(table(expr.sub.nmf.smclr$cluster))
# print(table(expr.sub.nmf.geneclr$cluster))
Involves two sub-steps in R (i to ii).
In [ ]:
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml")
options(repr.plot.width = 5, repr.plot.height = 5)
##-- add right censoring
table(clinical$vital.status)
clinical$censor <- NA
clinical$censor[which(clinical$vital.status == 'LIVING')] <- 0
clinical$censor[which(clinical$vital.status == 'DECEASED')] <- 1
clinical$censor <- as.numeric(clinical$censor)
##-- KM plot
surv <- Surv(clinical$overall.survival.day, clinical$censor)
surv.fit <- survfit(surv ~ clinical$cluster)
plot(surv.fit, mark=4, col=c('#00CC00','#0000CC','#CC0000'),
lty=1, lwd=1.5,cex=0.8,cex.lab=1.5, cex.axis=1.5, cex.main=1,
main='Kaplan-Meier survival curves for TCGA OV dataset',
xlab='Overall Survival (days)',
ylab='Probability of Survival')
text(1500,0.08, labels=paste0('cluster1 (n=',sample.counts[1,2],')'),
cex=1.2, col='#00CC00')
text(650,0.40, labels=paste0('cluster2\n(n=',sample.counts[2,2],')'),
cex=1.2, col='#0000CC')
text(2500,0.68, labels=paste0('cluster3 (n=',sample.counts[3,2],')'),
cex=1.2, col='#CC0000')
separator
##-- log-rank test
print(paste0('Running log-rank test for survival'))
surv <- Surv(clinical$overall.survival.day, clinical$censor)
surv.diff <- survdiff(surv ~ clinical$cluster)
surv.diff
separator
The KM plot is similar to TCGA OV paper Figure 2e. Note that patient cluster C1 in the TCGA OV nature paper corresponds to C3 in our analysis!
<img src='notebook_ext/ipynb_data/assets/Figure27.png', title = 'TCGA OV Nature paper Fig 2e', width = 300, height = 300>
In [ ]:
##-- cox pp haz model (comparing cluster 2 and 3 as an example!)
print(paste0('Running Cox proportional hazards model for survival'))
clinical.sub <- clinical[clinical$cluster %in% c(1,2,3),]
surv <- Surv(clinical.sub$overall.survival.day,
clinical.sub$censor)
print('Cluster 1 will be set as baseline ... ')
separator
##-- univariate cox model
print(paste0('(a) Univariate cox model for survival'))
m1 <- coxph(surv ~ (clinical.sub$cluster))
tidy(m1)
separator
##-- full cox model
print(paste0('(b) Full multivariate cox model for survival'))
m2 <- coxph(surv ~ (clinical.sub$cluster +
clinical.sub$age.at.diagnosis.year +
clinical.sub$tumor.grade))
tidy(m2)
separator
##-- reduced cox model
##-- Always use the simplest model with the least necessary amount
##-- of covariates to minimize overfitting
print(paste0('(c) Reduced multivariate cox model for survival'))
m3 <- coxph(surv ~ (clinical.sub$cluster +
clinical.sub$age.at.diagnosis.year))
tidy(m3)
In [ ]:
print('Program finished!')
sessionInfo()
In this workshop, we demonstrated how to run RNAseq analysis to identify DEGs & pathways, and how to use expression data to identify patient groups with different clincial outcomes. We introduced the commonly used bioinformatics tools and good-practice approaches. All the analysis were performed on a pre-built machine in AWS EC2 cloud.
All the class materials will stay open on GitHub if you want to practice or use the pipelines and scripts for your own research projects.
Questions? Post on Github or contact Riyue (Sunny) at rbao AT bsd dot uchicago dot edu
.
<img src='notebook_ext/ipynb_data/assets/Figure21.png', title = 'RNAseq summary', width = 600, height = 600>
Go to your Jupyter Notebook homepage. Click [New] button at top right corner of the homepage. In the dropdown menu, click [Terminal].
All workshop materials are openly accessible via GitHub. Documentation and tutorials of the pipelines can be found in README and wiki.
BigDataScript
& Perl
##-- commands
pwd
cd ~/CRI-Workshop-AMIA-2016-RNAseq
ll pipeline
ll notebook_ext
##-- commands
pwd
cd ~/CRI-Workshop-AMIA-2016-RNAseq/pipeline
cd test
./Build_RNAseq.DLBC.sh
##-- START Thu Oct 27 15:57:38 UTC 2016 Running ../Build_RNAseq.pl
##-- START Thu Oct 27 15:57:38 UTC 2016 Running Submit_RNAseq.DLBC.sh
##-- running ... 3 ~ 4 minutes
##-- END Thu Oct 27 16:01:25 UTC 2016
[1] Fog. et al. 2015. Loss of PRDM11 promotes MYC-driven lymphomagenesis. Blood 125(8):1272-81
[2] The Cancer Genome Atlas Research Network. 2011. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615)
[3] Sims et al. 2014. Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics 15, 121–132