AMIA 2016 Annual Symposium Workshop (WG13)

Mining Large-scale Cancer Genomics Data Using Cloud-based Bioinformatics Approaches (RNAseq)

Riyue Bao, Ph.D. Center for Research Informatics, The University of Chicago. 9:00 AM - 10:15 AM, November 13, 2016

The workshop materials are available on Github licensed via LGPLv3.

Objective [[top](#top)]

Introduction to RNAseq technology and clinical application (10 minutes)
- Learn the background and when to use RNAseq for your projects
How to perform RNAseq analysis (50 minutes) hands-on
- Learn the good-practice analysis protocol (8 steps)
  - From raw sequencing data to quantitative measures of transcripts (automated pipeline)
  - Identify differentially expressed genes (DEGs) between conditions and affected pathways (R & Bioconductor)
  - Generate high-quality figures for publication
Associate gene expression data with clinical outcome (15 minutes) hands-on
- Discover tumor subtypes and detect survival difference (3 steps) (R & Bioconductor)

Dataset [[top](#top)]

Data for RNAseq analysis
- Two groups (PRDM11 KO vs WT, human U2932 cells), 6 samples [1]
- Aim to identify DEGs / pathways altered between KO and WT groups

Sample	Group	Sequencing File	Sequencing Data
KO01	KO	KO01.fastq.gz	74,126,025 reads
KO02	KO	KO02.fastq.gz	64,695,948 reads
KO03	KO	KO03.fastq.gz	52,972,573 reads
WT01	WT	WT01.fastq.gz	55,005,729 reads
WT01	WT	WT02.fastq.gz	61,079,377 reads
WT01	WT	WT03.fastq.gz	66,517,156 reads

Data for clinical association
- The Cancer Genome Atlas (TCGA) ovarian cancer, 379 primary tumors Genomics Data Commons [2]
- Aim to use gene expression to group patients into subtypes and detect survival differences
- TCGA datasets (35 cancer types, > 20,000 tumor/normal samples, > 9,000 patients) <img src='notebook_ext/ipynb_data/assets/Figure23.2.png', title = 'GDC', width = 600, height = 600>

1. RNAseq technology, clinical application and data analysis [[top](#top)]

Biological and clinical questions

I am interested in studying transcriptional landscape shift before and after drug treatment in cell lines
I want to identify which pathways are affected after knocking down my favorite gene in mice
I have expression data of clinical isolates collected at various time points, when patient's response changed. Why?
I have a cohort of patients and want to discover which gene signature predicts patient's response to treatment
I want to detect gene fusions, expressed mutations, and disrupted isoforms in tumors that may be related to disease

... and more!

1.1. What is RNAseq and when to use it [[top](#top)]

High-throughput sequencing of RNA: Profile, identify or assemble transcripts
- Detect gene expression changes between conditions (hands-on)
- Identify novel splice sites / exons, mutations, fusion genes, etc.
- Broad detection range (lowly to highly expressed genes), high sensitivity, low requirement of RNA amount
- Available for all species (reference genome is optional): reference genome-guided alignment or de novo assembly
- Transcriptome-wide approach for quantitative measurements and gene discovery without prior knowledge
Experimental Design (considerations and limitations)
- Include biological replicates for increased discovery power <img src='notebook_ext/ipynb_data/assets/Figure15.2.png', title = 'Biological replicates', width = 400, height = 400>

1.2 Challenges and limitations [[top](#top)]

Relatively poor RNA quality for tumor FFPE samples
Contamination from adjacent normal tissue
Still more expensive than targeted-panel sequencing such as NanoString
40 million mapped reads are usually sufficient for gene profiling, but > 80 million are required to detect bottom 1% lowly expressed genes [3]

2. How to perform RNAseq analysis [[top](#top)]

Workflow

The good-practice analysis protocol takes 8 major steps. For a more detailed description and program commands used in each step, refer to the extended version of notebooks (notebook_ext/01.Run_RNAseq.tutorial.ipynb).

Steps 1-4: From raw sequencing to transcript quantification is automated through BigDataScript (BDS) pipeline.
Steps 5-8: DEG and pathway analysis will be practiced interactively to better interpret the results.

Due to time limit, we will not run the BDS pipeline in this workshop. Since it is automated, participants are encouraged to explore it after the workshop following the instructions at GitHub. The launching of the pipeline only requires submission of one script Build_RNAseq.sh, which was already prepared for you. The pipeline was designed to take care of all the dependencies between tasks/jobs with robust checkpoints and high reproducibility.

Steps 1-4

2.3.1 - 2.3.2 Quality assessment and preprocessing of raw sequencing reads [[top](#top)]

Raw sequencing reads are stored in FastQ format (e.g. KO01.fastq.gz), where each read is presented by 4 lines <img src='notebook_ext/ipynb_data/assets/Figure10.png', title = 'Sequencing reads in FastQ format', width = 600, height = 90>
QC produces reports that help you evaluate if a sequencing run is successful and if reads are of high quality (example in notebook_ext/01.Run_RNAseq.tutorial.ipynb, MultiQC)
Optional Preprocess reads to improve mapping rate and accuracy (in the next step, 1.3.3)
- Trim low-quality bases, clip adapters, etc.
- Avoid overtrimming in RNAseq! (Williams et al. 2016)

2.3.3 - 2.3.4 Map reads to reference genome and quantification of transcript abundance [[top](#top)]

Read mapping identifies the location in the genome where a sequencing read comes from
Accurate mapping result is the key for quantification and DEG identification
splice-aware aligner (e.g. STAR) <img src='notebook_ext/ipynb_data/assets/Figure12.png', title = 'RNAseq mapping result', width = 600, height = 400>
Different aligners may generate very different results (Engström et al. 2013)

RNAseq metrics

Collect metrics to evaluate RNA sample quality and identify potential problems (example in notebook_ext/01.Run_RNAseq.tutorial.ipynb, MultiQC)
- Is there high-level genomic DNA contamination?
- Was ribosome RNA successfully depleted during library prep?
- Is the RNA highly degraded?
- How do reads distribute on the genome? (exons, introns, intergenic, etc.)
- Does the target gene knocked down in KO samples have reduced expression as expected?

For detailed answers to those questions, refer to the extended version of notebooks (notebook_ext/01.Run_RNAseq.tutorial.ipynb)

Q1: Which sample (S1-4) has the most severe contamination from genomic DNA? Hint: higher percentage of intergenic reads indicates more severe DNA contamination in RNA samples

Q2: Which sample (S1-4) has the least efficient depletion of ribosome RNAs (rRNAs)? Hint: rRNAs account for > 80% of the whole transcriptome. If not removed, the majority of the sequencing reads will be derived from rRNA and not the coding or lincRNAs.

Steps 5-8

From the pipeline run, we generated read alignment and counts for every annotated gene on the genome
The next step is to utilize the read counts data to detect DEGs
For example, if we visualize FOS gene across 6 samples in genome browser (e.g. IGV)

Hands-on practice START

Due to time constraints, and to facilitate workshop demo, small R script snippets were prepared to simplify steps described in hands-on
Those scripts can be found in directory notebook_ext/ipynb_data/Rscripts, e.g. DESeq2.plot_PCA.R
Contents of those snippets will not be covered in this workshop but detailed comments were added to every R script
Full length description & R code is provided in notebook_ext/02.Run_RNAseq.tutorial.ipynb

2.3.5 - 2.3.7 Normalize gene expression and identify DEGs: DESeq2 [[top](#top)]

Involves five sub-steps in R (i to v).

To systematically identify significant DEGs across the transcriptome, we will use DESeq2

Different tools (edgeR, limmavoom) may generate different results! Seyednasrollah F et al. 2015
We will also practice how to generate publication-quality figures such as PCA, heatmap, clustering, and pathways from our data

Good coding practice

One statement each line for clarity. It is possible to concatenate statements with ; but not recommended
Break long statements into multiple lines with less than 80 columns per line for better readability!

i. Prepare working environment and set up parameters

Set up global variables at the beginning, avoid hard-coding of file names, directories, etc.
Can easily apply to other projects with similar experimental design



In [ ]:

    
##-- Clean the environment: always run this before starting new analysis!
rm(list=ls())

##-- Load libraries required in this analysis
##-- Those packages were pre-installed in the AWS machine
pkg.list <- c('ggplot2', 'RColorBrewer', 'reshape', 'broom',
             'DESeq2', 'vsn', 'genefilter', 'pheatmap',
             'clusterProfiler', 'pathview', 'IRdisplay')
for(pkg in pkg.list) {
    suppressMessages(library(pkg, character.only = TRUE))
}

##-- Parameters
cancer <- 'DLBC' ## full name
fdr <- 0.05 ## false-discovery rate adjusted p-value
fc <- 1.5 ## expression fold of change
gene.type <- 'coding' ## protein-coding genes
caller <- 'deseq2'
group1 <- 'KO' ## knockout
group2 <- 'WT' ## wild-type
colors <- c('#CC0000', '#00CC00')

##-- Set up working directory
work.dir <- '.'
setwd(work.dir)

##-- Input/Output directories
in.dir <- 'notebook_ext/ipynb_data/input'
out.dir <- 'notebook_ext/ipynb_data/output'

##-- Input/Output files
expr.file <- paste0(cancer, '.raw_counts.coding.tsv')
sample.file <- paste0(cancer, '.sample_group.tsv')

##-- Print analysis info
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'DESeq2.print_info.R')
source(script)

ii. Import data

Read counts on coding genes
Sample groups



In [ ]:

    
##-- Read data files
data.expr <- read.delim(file.path(in.dir, expr.file), header=TRUE, 
                       stringsAsFactors=FALSE)
data.sample <- read.delim(file.path(in.dir, sample.file), header=TRUE, 
                         stringsAsFactors=FALSE)

print('Show the first three lines of gene expression file:')
data.expr[1:3,]
print('Show the sample group file:')
data.sample

iii. Principle Component Analysis (PCA)

Both are good assessment to identify mislabeled samples and outliers
All genes are used as input!
PCA - do samples group as expected?



In [ ]:

    
##-- Format data matrix
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'DESeq2.format_data.R')
source(script)

##-- DESeq2: build DESeqDataSet object, prepare design matrix
dds <- DESeqDataSetFromMatrix(countData = as.matrix(data.expr.proc),
                             colData = data.sample, 
                             design = ~ Group)
# print(dds)

##-- DESeq2: normalize matrix for clustering/heatmap generation
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)

##-- DESeq2: correlation between bio replicates (skip in workshop)
# script <- file.path('notebook_ext','ipynb_data','Rscripts',
# 'DESeq2.plot_SMcor_heatmap.R')
#source(script)

##-- DESeq2: Principal component analysis (PCA) plot
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'DESeq2.plot_PCA.R')
source(script)

iv. Normalize and detect DEGs

Normalize read counts across samples
Build design matrices for linear modeling

KO	WT
1	0
1	0
1	0
0	1
0	1
0	1

Fit data to a linear model
Lowly expressed genes are automatically filtered out (default as mean read counts < 10)
Estimate the KO vs WT fold-change and p-values for each gene



In [ ]:

    
##-- DESeq2: set up variables
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'DESeq2.set_variable.R')
source(script)

##-- DESeq2: fit the model and identify DEGs
print('Running DESeq2 ... ')
dds <- DESeq(dds, test="Wald", betaPrior=TRUE)
res <- results(dds, 
              contrast=c("Group",group1,group2), 
              pAdjustMethod ="fdr", 
              alpha=fdr)
print('Done!')

##-- DESeq2: peek into DEG data object
summary(res)
res <- as.data.frame(res)

##-- DESeq2: add anti-log2 fold change and gene symbol to output
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'DESeq2.add_fc_symbol.R')
source(script)

print('Show the first three lines of DESeq2 output:')
res.print[1:3,]

v. Filter for significant DEGs and plot heatmap

Filter for significant DEGs with at least 1.5 fold-change and FDR-adjusted p-value < 0.05
Generate expression heatmap on significant DEGs (different from PCA and correlation heatmap described in iii)



In [ ]:

    
##-- DESeq2: filter significant DEGs by fc 1.5 and adjusted p < 0.05
print('Show the first three lines of DESeq2 output:')
res.print[1:3,]
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'DESeq2.flt_sigDEGs.R')
source(script)

##-- DESeq2: plot sig DEG heatmap
print(paste0('Heatmap = ', nrow(data.plot), ' genes on the row, ', 
             ncol(data.plot), ' samples on the column'))
pheatmap(data.plot,
         scale = 'row',
         cluster_rows = TRUE, cluster_cols = TRUE,
         show_rownames = FALSE, show_colnames = TRUE,
         clustering_distance_rows = "euclidean",
         clustering_distance_cols = "euclidean", 
         clustering_method = "complete",
        )

2.3.8 Identify enriched GO and pathways: clusterProfiler [[top](#top)]

Gene Ontology bological processes and KEGG pathways
Enrichment in significant DEGs



In [ ]:

    
##-- clusterProfiler: NOTE those commands were tested for clusterProfiler 
##-- version 3.0.5 or above. Older versions may not work as they have 
##-- options revised/removed in the new versions.

##-- clusterProfiler: prepare input for GO/KEGG enrichment analysis
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'clusterProfilter.prepare_input.R')
source(script)

##-- clusterProfiler: GO over-representation test
##-- takes ~2 minutes .. skip in workshop!
# ego <- enrichGO( gene          = genes.sig.anno$ENTREZID,
#                 universe      = genes.all.anno$ENTREZID,
#                 OrgDb         = 'org.Hs.eg.db',
#                 ont           = "BP",
#                 pAdjustMethod = "fdr",
#                 pvalueCutoff  = 0.05,
#                 readable      = TRUE)
##-- instead, load already generated output
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc, '.enrichGO.RData'))
summary(ego)[1:3,1:7]

##-- clusterProfiler: KEGG over-representation test
##-- takes ~2 minutes .. skip in workshop!
# kk <- enrichKEGG(gene          = genes.sig.anno$ENTREZID,
#                  universe      = genes.all.anno$ENTREZID,
#                  organism      = "human",
#                  pAdjustMethod = "fdr",
#                  pvalueCutoff  = 0.05)
##-- instead, load already generated output
load(paste0(out.prefix, '.flt.fdr', fdr,'_fc',fc,'.enrichKEGG.RData'))
summary(kk)[1:3,]

##-- clusterProfiler: visualize pathway with all genes
gene.list <- genes.all.anno$foldChange
names(gene.list) <- genes.all.anno$ENTREZID
pathway <- 'hsa04010'
plot.title <- paste0(pathway, '.gene_all.pathview.png')
p4 <- pathview(gene.data  = gene.list,
              pathway.id = pathway,
              species    = "human",
              limit      = list(gene=4, cpd=1),
              kegg.dir   = paste0(out.dir,'/',caller),
              out.suffix = paste0('gene_all.pathview'))

##-- clusterProfiler: visualization of GO and KEGG enrichment results
barplot(ego, showCategory=20)
dotplot(ego, showCategory=20)



In [ ]:

    
##-- KEGG pathwayview written into local files. Display the plot in R
display_png(file='notebook_ext/ipynb_data/assets/hsa04010.gene_all.pathview.png')

3. Practice how to associate gene expression with clinical data [[top](#top)]

TCGA hosted on GDC (https://gdc-portal.nci.nih.gov) <img src='notebook_ext/ipynb_data/assets/Figure19.png', title = 'GDC website', width = 800, height = 200>
Our data
- ~600 Ovarian Cancer patients (expression and clinical) <img src='notebook_ext/ipynb_data/assets/Figure18.png', title = 'TCGA OV Nature paper', width = 800, height = 200>

Workflow [[top](#top)]

The analysis takes 3 major steps. Full-length description & R code is provided in notebook_ext/03.Run_RNAseq.tutorial.ipynb.

3.1 - 3.2. Import data, normalize expression and cluster patients based on expression profile [[top](#top)]

Select 150 most variable genes for sample clustering
- Without patient groups known, using all genes for sample clustering is computational intensive and may mask the true differences between potential sample groups
- The most variable genes likely capture the majority of the sample variation



In [ ]:

    
##-- Clean up environment
rm(list=ls())
set.seed(13)

##-- Load libraries required in this analysis
##-- Those packages were pre-installed in the AWS machine
pkg.list <- c('ggplot2', 'RColorBrewer', 'reshape', 'broom',
             'gplots', 'survival', 'NMF', 'IRdisplay')
for(pkg in pkg.list) {
    suppressMessages(library(pkg, character.only = TRUE))
}

##-- Parameters
cancer <- 'OV' 
gene.top.count <- 150

##-- Import data
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'TCGA_OV.import_data.R')
source(script)

##-- run NMF to cluster samples & genes 
##-- (use 4 core 'p4', and print more info 'v')
##-- takes 15 minutes to run... skip in workshop
print('NMF clustering start...')
# expr.sub.nmf <- nmf(expr.sub, 
#                    rank = 3, 
#                    method = 'brunet', 
#                    seed = 'random', 
#                    nrun = 100, 
#                    .opt = 'vp4') 
##-- workshop only: load already generated result
load(paste0(out.dir,'/',expr.file,'.nmf.RData'))
print('Done!')

##-- plot expression heatmap
script <- file.path('notebook_ext','ipynb_data','Rscripts',
                   'TCGA_OV.plot_heatmap.R')
source(script)

# print(table(expr.sub.nmf.smclr$cluster))
# print(table(expr.sub.nmf.geneclr$cluster))

3.2 Survival analysis [[top](#top)]

Involves two sub-steps in R (i to ii).

Identify overall survival (OS) difference between the three tumor subtypes / patient groups
Kaplan–Meier (KM) plot, log-rank test, and cox proportional hazards model (univariate and multivariate)

i. KM plot and log-rank test



In [ ]:

    
##-- Set up R plot display options in notebook
options(jupyter.plot_mimetypes = "image/svg+xml") 
options(repr.plot.width = 5, repr.plot.height = 5)

##-- add right censoring
table(clinical$vital.status)
clinical$censor <- NA
clinical$censor[which(clinical$vital.status == 'LIVING')] <- 0
clinical$censor[which(clinical$vital.status == 'DECEASED')] <- 1
clinical$censor <- as.numeric(clinical$censor)

##-- KM plot 
surv <- Surv(clinical$overall.survival.day, clinical$censor)
surv.fit <- survfit(surv ~ clinical$cluster)
plot(surv.fit, mark=4, col=c('#00CC00','#0000CC','#CC0000'), 
     lty=1, lwd=1.5,cex=0.8,cex.lab=1.5, cex.axis=1.5, cex.main=1,
     main='Kaplan-Meier survival curves for TCGA OV dataset',
     xlab='Overall Survival (days)', 
     ylab='Probability of Survival')
text(1500,0.08,  labels=paste0('cluster1 (n=',sample.counts[1,2],')'), 
     cex=1.2, col='#00CC00')
text(650,0.40, labels=paste0('cluster2\n(n=',sample.counts[2,2],')'), 
     cex=1.2, col='#0000CC')
text(2500,0.68, labels=paste0('cluster3 (n=',sample.counts[3,2],')'), 
     cex=1.2, col='#CC0000')

separator

##-- log-rank test
print(paste0('Running log-rank test for survival'))
surv <- Surv(clinical$overall.survival.day, clinical$censor)
surv.diff <- survdiff(surv ~ clinical$cluster)
surv.diff

separator

The KM plot is similar to TCGA OV paper Figure 2e. Note that patient cluster C1 in the TCGA OV nature paper corresponds to C3 in our analysis!

ii. Cox multivariate and univariate model

Include one factor of interest (e.g. cluster 1/2/3): univariate
Include multiple factors such as cluster, age, tumor grade, and tumor stage: multivariate



In [ ]:

    
##-- cox pp haz model (comparing cluster 2 and 3 as an example!)
print(paste0('Running Cox proportional hazards model for survival'))
clinical.sub <- clinical[clinical$cluster %in% c(1,2,3),]
surv <- Surv(clinical.sub$overall.survival.day, 
            clinical.sub$censor)

print('Cluster 1 will be set as baseline ... ')

separator

##-- univariate cox model 
print(paste0('(a) Univariate cox model for survival'))
m1 <- coxph(surv ~ (clinical.sub$cluster))
tidy(m1)
m1
separator

##-- full cox model
print(paste0('(b) Full multivariate cox model for survival'))
m2 <- coxph(surv ~ (clinical.sub$cluster + 
                    clinical.sub$age.at.diagnosis.year +
                    clinical.sub$tumor.grade))
tidy(m2)
m2
separator

##-- reduced cox model
##-- Always use the simplest model with the least necessary amount 
##-- of covariates to minimize overfitting
print(paste0('(c) Reduced multivariate cox model for survival'))
m3 <- coxph(surv ~ (clinical.sub$cluster +
                    clinical.sub$age.at.diagnosis.year))
tidy(m3)
m3

Hands-on practice END



In [ ]:

    
print('Program finished!')
sessionInfo()

Summary [[top](#top)]

In this workshop, we demonstrated how to run RNAseq analysis to identify DEGs & pathways, and how to use expression data to identify patient groups with different clinical outcomes. We introduced the commonly used bioinformatics tools and good-practice approaches. All the analysis was performed on a pre-built machine in AWS EC2 cloud.

All the class materials will stay open on GitHub if you want to practice or use the pipelines and scripts for your own research projects.

Questions? Post on Github or contact Riyue (Sunny) at rbao AT bsd dot uchicago dot edu.

Run pipeline (after the workshop) [[top](#top)]

1. Open terminal from Jupyter Notebook

Go to your Jupyter Notebook homepage. Click [New] button at top right corner of the homepage. In the dropdown menu, click [Terminal].

2. Get familiar with the file structure

All workshop materials are openly accessible via GitHub. Documentation and tutorials of the pipelines can be found in README and wiki.

Run_RNAseq.tutorial.ipynb: The main lecture / hands-on notebook (covered in workshop)
Run_RNAseq.tutorial.rendered.ipynb: Same as above, but with all outputs & figures rendered
pipeline directory: the automated, scalable and modularized pipeline
- From raw sequencing reads to quantified transcript abundance through one click
- BigDataScript & Perl
notebook_ext directory: Extended notebooks 01-03 (not covered in workshop)
- lecture notes
- R codes and comments

##-- commands
pwd
cd ~/CRI-Workshop-AMIA-2016-RNAseq
ll pipeline
ll notebook_ext

2. Launch pipeline (takes ~ 5 minutes)

##-- commands
pwd
cd ~/CRI-Workshop-AMIA-2016-RNAseq/pipeline
cd test
./Build_RNAseq.DLBC.sh
##-- START Thu Oct 27 15:57:38 UTC 2016 Running ../Build_RNAseq.pl
##-- START Thu Oct 27 15:57:38 UTC 2016 Running Submit_RNAseq.DLBC.sh
##-- running ... 3 ~ 4 minutes
##-- END Thu Oct 27 16:01:25 UTC 2016

Next Session: ChIPseq analysis and Integration of RNAseq & ChIPseq (Kyle Hernandez PhD)

References

[1] Fog. et al. 2015. Loss of PRDM11 promotes MYC-driven lymphomagenesis. Blood 125(8):1272-81
[2] The Cancer Genome Atlas Research Network. 2011. Integrated genomic analyses of ovarian carcinoma. Nature 474, 609–615)
[3] Sims et al. 2014. Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics 15, 121–132