AMIA 2016 Annual Symposium Workshop (WG13)

Mining Large-scale Cancer Genomics Data Using Cloud-based Bioinformatics Approaches (RNAseq)

* Part 01 RNAseq technology, clinical application and data analysis

Riyue Bao, Ph.D. Center for Research Informatics, The University of Chicago. November 13, 2016

The workshop materials are accessible on Github licensed via LGPLv3.

Objective [[top](#top)]

Learn the good-practice RNAseq analysis pipeline
Learn commonly used bioinformatics tools
Practice the automated, scalable pipeline
Explore the quality metrics and input/output of the RNAseq pipeline
Visualize result files and quality plots

Workflow [[top](#top)]

Dataset [[top](#top)]

The test datasets used in this workshop are from Fog. et al. 2015. Loss of PRDM11 promotes MYC-driven lymphomagenesis. Blood, 125(8):1272-81 http://www.bloodjournal.org/content/125/8/1272.long?sso-checked=true

Pipeline [[top](#top)]

For this workshop, the machine you are using has everything pre-compiled and pre-installed. It is ready for analysis.

In the future, if you'd like to use the pipleine on your own machines, download analysis pipeline from Github and follow the instructions to install.

git clonehttps://github.com/cribioinfo/CRI-Workshop-AMIA-2016-RNAseq.git

Detailed documentation of the pipeline can be found on Github README and wiki.

Run pipeline [[top](#top)]

1. Open terminal from Jupyter Notebook

At Jupyter homepage, go to [New] button at the top-right corner of the notebook. In the dropdown menu, click [Terminal].

2. Get familiar with the file structure

This repository contains the following items:

Run_RNAseq.tutorial.ipynb - the main notebook for lecture and hands-on
Run_RNAseq.tutorial.rendered.ipynb - same as above, but with all outputs & figures already rendered for browsing
notebook_ext/ directory: the extended version of contents covered in the main notebook
pipeline/ directory: the automated, scalable and modularized pipeline
- From raw sequencing reads to quantified transcript abundance through one click
- BigDataScript & Perl

##-- commands
pwd
cd ~/CRI-Workshop-AMIA-2016-RNAseq
ls -al
ls -al notebook_ext
ls -al pipeline

3. Launch pipeline (takes ~ 5 minutes)

##-- commands
pwd
cd ~/CRI-Workshop-AMIA-2016-RNAseq/pipeline/test
./Build_RNAseq.DLBC.sh
##-- START Thu Oct 27 15:57:38 UTC 2016 Running ../Build_RNAseq.pl
##-- START Thu Oct 27 15:57:38 UTC 2016 Running Submit_RNAseq.DLBC.sh
##-- running ... 3 ~ 4 minutes
##-- END Thu Oct 27 16:01:25 UTC 2016

How to perform RNAseq analysis [[top](#top)]

Steps 1 - 4 : Automated pipeline (Run_RNAseq.bds)
- Pipeline results running on the complete dataset (for visualization and DEG analysis): /home/ubuntu/data/rnaseq/fullset
- Pipeline results running on the test dataset (for testing only): /home/ubuntu/dev/rnaseq/subset
Steps 5 - 8 : Interactive R & Bioconductor (Notebook 02)

Unless pointed out otherwise, all commands shown apply to the test datasets only. Do not use them directly on other datasets. Refer to the pipeline documentation for instructions on how to set up the pipeline for your own projects.

Steps 1 - 4 are run by automated pipeline

1. Quality assessment of raw sequencing reads: FastQC [[top](#top)]

Goal
- Check if the reads are of high quality
- Check if any preprocessing step is required (e.g. base trimming, adapter clipping, read filtering)
Method
- FastQC, version 0.11.5
- Scan raw sequencing reads and produce QC reports for evaluation
Our data
- Read quality pretty good (baseQ >= 30 in all base positions)
- Preprocessing is optional
- In this workshop, we will still run the step for demonstration

``` fastqc --extract -o $out.dir -t 2 --nogroup $r1.fastq.gz ```

2. (optional) Preprocessing: Trimmomatic [[top](#top)]

Goal
- Clean up reads for improved alignment rate and accuracy (for the next step)
Method
- Trimmomatic, version 0.36
- Clip adapters
- Trim leading/trailing low quality or N bases
- Trim reads to a specific length
- Filter out reads of low average quality / of specific length
- Convert base quality scores between Phred33 and Phred64 FastQ format
Avoid over trimming in RNAseq!
- Williams et al. 2016. Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinformatics. DOI: 10.1186/s12859-016-0956-2
Our data (KO01 as an example)
- 74,126,025 reads total. Survived: 73,636,793 (99.34%) Dropped: 489,232 (0.66%)

``` java -Xmx4G -jar trimmomatic-0.36.jar SE -threads 4 -phred33 $r1.fastq.gz $r1.trim.fastq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:5 TRAILING:5 MINLEN:36 SLIDINGWINDOW:4:15 ```

3. Map reads to reference genome (GRCh38): STAR [[top](#top)]

Goal
- Identify the location in the genome where a sequencing read comes from
- A read may map to one, or multiple genomic locations
- For each read, alignment of more than one target regions is ranked by alignment score
- Accurate mapping result is the key for DEG identification
Method
- STAR, version 2.5.2b
- splice-aware aligner
- Ultrafast (~15 minutes for 50m reads), require lots of memory (~36GB for human genome)
- Flexible options to allow canonical/non-canonical junctions, with/wo known gene annotations, etc.
Different aligners may generate very different results
- Engström et al. 2013. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. DOI: 10.1038/nmeth.2722
Our data
- PRDM11 knockdown U2932 cells in triplicates (KO01-03 vs WT01-03)
- NOT the full PRDM11 gene is knockdown!
- siRNAs target exons 4,5,6 of PRDM11, thus only those three exons show a reduction of expression in the KO samples; other exons are not affected
- PRDM11 knockdown leads to upregulation of FOS expression

``` STAR --runMode alignReads --genomeLoad NoSharedMemory --outFileNamePrefix $out.prefix --readFilesCommand zcat --genomeDir $refgenome.dir --readFilesIn $r1.trim.fastq.gz --runThreadN 2 --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonicalUnannotated --outSAMtype BAM SortedByCoordinate ```

Collect RNAseq quality metrics & coverage: Picardtools, bedtools, RSeQC [[top](#top)]

Goal
- Evaluate the quality of reads and alignments
- Identify potential problems regarding (RNA) sample quality
  - Is the RNA highly degraded?
  - Is there high-level genomic DNA contamination?
  - Was ribosome RNA sucessfully depleted?
  - How do reads distribute on the genome? (exons, introns, intergenic, etc.)
  - Is the strandness of read alignment consistent with library type?
    - non-stranded or forward/reverse strand-specific libs
  - Does the target gene that was knocked down in KO samples have reduced expression, compared to WT?
Method
- Picardtools, version 2.6.0
  - Utilities CollectMultipleMetrics CollectRnaSeqMetrics
  - Read mapping rate, number of correctly mapped reads, mismatch%
  - Distribution of reads mapped to genomic features (e.g. exon, intron, intergenic, CD, UTR, etc.)
  - Percentage of reads mapped to ribosome RNAs
  - Gene body coverage, read clipping profile, etc.
- RSeQC, version 2.6.4
  - Strandness of mapped reads, etc.
- bedtools, version 2.26.0
  - Utilities genomecov
  - Calculate read coverage on the genome at per-base resolution
  - Scaling factor can be supplied to normalize coverage across samples (library size)
    
``` java -Xmx4G -jar picard.jar CollectRnaSeqMetrics I=$sample.bam O=$out.rnaseq_metrics REF_FLAT=$refgeneanno.refflat.txt RIBOSOMAL_INTERVALS=$refgeneanno.rRNA.interval_list STRAND=$strandness CHART=$out.rnaseq_metrics.pdf METRIC_ACCUMULATION_LEVEL=SAMPLE VALIDATION_STRINGENCY=LENIENT java -Xmx4G -jar picard.jar CollectMultipleMetrics I=$sample.bam O=$out R=$refgenome.fa PROGRAM=CollectAlignmentSummaryMetrics PROGRAM=CollectInsertSizeMetrics PROGRAM=QualityScoreDistribution PROGRAM=MeanQualityByCycle VALIDATION_STRINGENCY=LENIENT ```
``` infer_experiment.py -i $sample.bam -r $refgeneanno.bed12 -s 200000 > $sample.infer_experiment ```
``` bedtools genomecov -bga -split -ibam $sample.bam -scale $scale.factor > $sample.cov.bedgraph bedGraphToBigWig $sample.cov.bedgraph $refgenome.chrom.size $sample.cov.bigwig ```
Combine all metrics into one report!
- MultiQC, version 0.8
- 26 supported tools



In [1]:

    
from IPython.display import IFrame
IFrame('ipynb_data/assets/multiqc_report.html', width=1000, height=700)









    Out[1]:

How to use those metrics to decide if a sample has potential quality problem? [[top](#top)]

Our data (left panel); Other (right panel)

4.1 RNA degradation

RNA quality: Fresh samples (e.g. cell line) > frozen samples (e.g. mouse tissues) >> FFPE samples (e.g. human tumors)
During lib prep, RNA quality is inferred by The RNA integrity number (RIN) evaluated using the 28S to 18S rRNA ratio
In most cases, RIN > 4.5 is recommended
However, studies have shown that RIN can be quite inaccurate for FFPE samples
A better accessment is Gene body coverage plot (Picardtools). FFPE samples often has 5' degradation

4.2 Genomic DNA contamination

In lib prep, DNase digestion removes genomic DNA
While fresh samples often has good-quality RNAs, highly-degraded samples (e.g. FFPE tumors) often has a higher degree of genomic DNA contamination
Sometimes DNA contamination could occupy 70% of sequencing reads, greately reducing the discovery power of DEG analysis
Good accessment to identify and estimate genomic DNA contamination includes
- Read distribution in genomic features: high fraction of intergenic reads indicates DNA contamination
- Visualize integernic region in genome browser (e.g. IGV)

Which sample (S1-4) has the most severe contamination from genomic DNA?

4.3 Ribosome RNA fraction

rRNA accounts for > 80% of the transcriptome
Mmost RNAseq lib prep protocol includes an ribosomal RNA depletion step
However, if RNA quality is relatively poor (e.g. FFPE tumors), rRNA depletion often is efficient
Accessing if the depletion step is successful through read distribution in genomic features
- How many reads map to ribosome RNA regions on the genome?

Which sample (S1-4) has the most insufficient rRNA depletion in lib prep? (refer to previous figure)

4.4 Confirmation of reduced/increased expression for knockdown/overexpressed genes in KO/OE samples

Is there a gene that is expected/known to be repressed or overexpressed?
Does the RNAseq result reflect expression change of this gene?
In our data, PRDM11 expression is expected to be down since this is the gene knocked down in U2932 cells
NOT all exons are affected!
Only those targeted by siRNA (small RNA interference) will be affected

4. Quantify transcript abundance: featureCounts [[top](#top)]

Goal
- Estimate number of reads mapped to gene features (e.g. gene, exon, etc.)
Method
- featureCounts, version 2.5.2b
- Ultrafast (~10 minutes for 50m reads), require low amount of memory (~4GB for human genome)
- Flexible options to count the reads based on specific mapping criteria or study purposes
  - Gene-level or exon-level (for isoforms)
  - Uniquely mapped reads
  - Primary alignment
  - Properly paired reads (if reads are paired-end)
  - Mapping quality thresholds
- Choose the option accordingly based on your experimental design!
Our data
- Assigned 55,910,832
- Unassigned_Ambiguity 3,804,649
- Unassigned_NoFeatures 4,192,188
- Unassigned_MappingQuality 29,279,029

``` featureCounts -s 0 -F GTF -t exon -g gene_id -Q 255 -J --primary -a $refgeneanno.gtf -T 2 -G $refgenome.fa -o $sample.star.featurecounts.raw_counts.txt $sample.star.merged.bam ```

BDS pipeline run report [[top](#top)]



In [2]:

    
from IPython.display import IFrame
IFrame('ipynb_data/assets/Run_RNAseq.bds.20161021_204005_581.report.html', width=1000, height=700)









    Out[2]:

Step 5-8 are run by interactive R / Bioconductor practice (Notebook 02)

5-7. Identify differentially expressed genes (DEGs) between conditions: DESeq2 [[top](#top)]

Goal
- Identify DEGs between groups of interest
Method
- DESeq2, version 1.12.4
Our data

8. Identify biological processes and pathways enriched in genes of interest: clusterProfilter [[top](#top)]

Goal
- Identify GO (gene ontology) and pathways significantly altered between groups of interest
Method
- clusterProfiler, version 3.0.5
Our data