Overview of RNA-Seq analysis

0. Preliminaries

Notation

A line beginning with a dollar sign ($) indicates the command line. It is meant to orient you that the command is to be performed in the terminal.

$ 

This is not to be confused with something like $HOME, which means a variable named "HOME" that has the value (on my computer), shown by echo (which prints the output of commands). The dollar sign here means to access that variable. The line after the command line means the output of the command.

$ echo $HOME 
/Users/olga

Anatomy of a Unix command

We'll be using unix commands here and it's worth pausing and talking about what these commands are really doing. Let's pretend we have a program called libraryprep. We can call the program using its name:

$ libraryprep

Let's say it has the option flag (aka "flag" aka "argument") --stranded. You'd tell the program this flag by putting it after the command:

$ libraryprep --stranded

A lot of times there's a shortened version of the flag. For --stranded let's say it's -s. Then you could use the shorter command,

$ libraryprep -s  # This is equal to "libraryprep --stranded"

Notice the hash "#". In all the languages we'll be using (bash, Python), this is a "comment" meaning its only for human reading and the computer can stop reading as soon as it sees the hash.

If a flag requires an input, you would put it immediately after the flag. For example, say we have a flag called --input and it requires the cell lysis file:

$ libraryprep -s --input lysis

Now Let's say it has the options -a, -s, -d, -f We can put all of them sequentially:

$ libraryprep -a -s -d -f

Or put them all in a row after a single dash. So the above command is equivalent to:

$ libraryprep -asdf

But if you have a multi-character flag like --stranded then that has to be separate:

$ libraryprep -adf --stranded  # correct
$ libraryprep -adfstranded     # incorrect

Or if the flag requires an argument, then you have to keep it separate, unless the shortened version is at the end. For the purposes of this, let's say -i is short for --input.

$ libraryprep -asdf --input lysis  # correct
$ libraryprep -asdfi lysis         # correct
$ libraryprep -asdfinput lysis     # incorrect
$ libraryprep -asdif lysis         # incorrect

If you have a long command, you can separate it out onto multiple lines for readability (meaning just for you the human, not for the computer) by using the backslash "\" for each line that's supposed to continue on to the next one. The indentation is purely for readability and isn't required - it just looks nicer.

Fun fact: The way I remember that "\" backslash because the top is falling backwards, and "/" is frontslash because the top is falling forwards.

$ # Correct
$ libraryprep -adf \
    --input lysis \
    --stranded

It is very important that there are no spaces or other characters after the slashes! Otherwise your program will throw a cryptic syntax error and you will be sad.

$ # Incorrect
$ libraryprep -adf \ # comments aren't allowed here
    --input lysis \ --flags-arent-okay-either
    --stranded

Getting help

For most command-line programs, you can get help on how to actually use the program one of four ways. This is called finding the "usage" message. I've written them in priority order so, try the first one first, then the second, etc.

  1. $ programname
    1. Without any arguments, this will call the program. Sometimes this will give you usage information, sometimes it will result in an error.
  2. $ programname -h
    1. The -h is (generally) a universal short version flag for "help me!!! I don't know what's going on!!" Though sometimes that doesn't work either ...
  3. $ programname --help
    1. The expanded version of -h
  4. $ man programname
    1. man is for "manual" and will bring up the manual pages for that command

Subcommands

Some programs we'll use will have subcommands. This means that we're calling a specific subprogram of a single program. For example, let's pretend we have a program called microscopy, and it has the subcommands brightfield and fluorescence. You wouldn't be able to call just microscopy by itself, because you need to specify the exact subprogram (in this analogy, the type of light inputs) that you want to use:

$ microscopy
Error: Must specify either "brightfield" or "fluorescence"

There will be flags like --zoom that would make sense for both subcommands:

$ microscopy brightfield --zoom 10x

But only for the fluorescence subcommand does the flag --wavelength make any sense, since brightfield would show all the light.

$ microscopy fluorescence --wavelength 488 --zoom 10x

"Standard in" and "standard out"

Some of these commands use the greater than (>), which means that the output of the program should be saved a file. I like to think of it as the big input (the left side of the >) being squeezed into the small corner of the arrow (right side of the >).

For example, let's go back to our libraryprep example. We tell the libraryprep program we want a stranded library with our lysis input, and to save it as the file called cdna.

$ libraryprep --stranded --input lysis > cdna

1. Library quality control

.fasta file format

This is the simple, very commonly used file format in bioinformatics for storing sequences. An example of the p53 protein sequence is below.

>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens GN=TP53 PE=1 SV=4
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

.fastq file format

The current standard of sequencing files is fastq, which is derived from the fasta format. Check out an example fastq format below which shows 3 entries:

$ head -n 12 projects/shalek2013/raw_data/S10.chr11.R1.fastq
@SRR578577.1 C0HLAACXX120407:2:1107:16553:42649 length=101
TCTCCTTAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGACCAAA
+SRR578577.1 C0HLAACXX120407:2:1107:16553:42649 length=101
@@@FFDDDHHFHGIG?EAFDHHGHHHFHIEG;FDHIHIIIGIIDGHGHHIGIGGEHGGGGIFHIIGGGGGIGHHHECEDFFFFFFDEDEEE@BCBBBCB##
@SRR578577.2 C0HLAACXX120407:2:1203:19118:72461 length=101
CCTCTCCTTAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGACCA
+SRR578577.2 C0HLAACXX120407:2:1203:19118:72461 length=101
?@@DFFFDHGFDHEGHFHII<EHHAHHIIIIIIIIIIGEHEIIIIGGG<B<3BFGIEIGIIIGGC.7=CDCHCE??ACEBEE@>CACEC>6;;==BBBBBB
@SRR578577.3 C0HLAACXX120407:2:1110:7566:40229 length=101
ACCCTCTCCTTAAATTTTAAGTAAATGTTTAAGGGATTTTACACCGGTCTATGGAGGTTTGCATGTGTAATTTTACCTCTAATTAATTATAAGGCCAGGAC
+SRR578577.3 C0HLAACXX120407:2:1110:7566:40229 length=101
???;:A;;C:D?AG<EFBDG<3CF:F><CCDHI>GFFEGEFFEGE:D0?6?D*/.8-;>AC=FDD=@CAEE=>?CAE@B@@BD>ABB;;>5>;;>?#####

Notice that each read takes up 4 lines, where

@unique_identifier length=101
[genomic sequence]
+unique_identifier length=101
[ASCII-encoded PHRED quality score]

Let's break the quality down a bit.

Sequencing quality scores

The quality of base calling for each position in the read is encoded as a score $Q$ from 1 to 40. Defining $p$ as the probability of incorrect base calling, then

$ Q = -10 \log_{10}(p) $

Given any score, we can back-calculate the probability of an incorrect base by using our understanding of logarithms.

$$ \begin{align} 40 &= -10 \log_{10}(p)\\ \frac{40}{-10} &= \frac{-10}{-10} \log_{10}(p)\\ -4 &= \log_{10}(p)\\ 10^{-4} &= 10^{\log_{10}(p)}\\ 10^{-4} &= p \end{align} $$

Thus the best score of 40 represents the lowest probability of incorrect base calling, $p=10^{-4}$.

Exercise 1

What is the probabilty of incorrect base calling for a score of 39? For 31?

PHRED64 quality scores

Because "40" takes up two spaces, instead of using numbers the score is encoded using letters in a system called "ASCII." You already know that computers store information in zeros and ones, which can be turned into numbers, well, the letter "!" (exclamation point) represents the number 64 and acts as the lowest value for the quality scores, while

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI

  • Input: Raw .fastq files
  • Program: FASTQC, SeQC
  • Output: Report (.html) and plots (.png) of library quality

Example:

$ fastqc $HOME/projects/differentiation/raw_data/sample01.R1.fastq -o $HOME/projects/differentiation/processed_data


2. Adapter trimming

  • Input: Raw .fastq files
  • Program: cutadapt, trimgalore
  • Output: Adapter trimmed .fastq files

Example:

cutadapt \
    # format of the input files \
    -f fastq  \
    # Maximum number of times an adapter sequence can be removed \
    --times 2 \
    # Maximum error rate in adapter \
    -e 0.0  \
    # Minimum overlap length \
    -O 5  \
    # Minimum sequencing quality \
    --quality-cutoff 6 \
    # Minimum read length \
    -m 18 \
    # Adapters that could appear anywhere in the sequencing read (5' end or 3' end) \
    -b TCGTATGCCGTCTTCTGCTTG \
    -b ATCTCGTATGCCGTCTTCTGCTTG \
    -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
    -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
    -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
    # Read 1 output \
    -o $HOME/projects/shalek2013/processed_data/S10_R1.polyATrim.adapterTrim.fastq \
    # Read 2 output \
    -p $HOME/projects/shalek2013/processed_data/S10_R2.polyATrim.adapterTrim.fastq \
    # Read 1 input \
    $HOME/projects/shalek2013/raw_data/S10_R1.fastq.gz \
    # Read 2 input \
    $HOME/projects/shalek2013/raw_data/S10_R2.fastq.gz \
    # Statistics about how many adapters were removed
    > $HOME/projects/shalek2013/processed_data/S10_R2.polyATrim.adapterTrim.metrics

3. Map to repetitive elements

.sam file format

"Sequence alignment map"

  • Input: Trimmed .fastq files and repetitive element "genome"
  • Program: STAR
  • Output: Alignment .sam and unaligned .fastq

Example:

STAR --genomeDir $HOME/genomes/RepBase18.05.fasta/STAR/ --readFilesIn \
    $HOME/projects/differentiation/raw_data/sample01.R1.fastq \
    $HOME/projects/differentiation/raw_data/sample01.R2.fastq \
    --outFileNamePrefix $HOME/projects/differentiation/processed_data/sample01.repetitive. \
    # Save unmapped files as fastq
    --outReadsUnmapped Fastx

4. Map to genome

  • Input: .fastq of trimmed reads which didn't align to repetitive elements
  • Program: STAR
  • Output: Alignment .sam

Example:

STAR --genomeDir $HOME/genomes/hg19/gencode/v20/star \
    --readFilesIn \
    $HOME/projects/differentiation/processed_data/sample01.repetitive.Unmapped.out.mate1 \
    $HOME/projects/differentiation/processed_data/sample01.repetitive.Unmapped.out.mate2 \
    --outFileNamePrefix \
    $HOME/projects/differentiation/processed_data/sample01.rmRepetitive.

Exercise 2

Run STAR with the following parameters:

  • genome directory: $HOME/genomes/mm10/gencode/m8/star_chr11
  • Fastq files: $HOME/projects/shalek2013/raw_data/S10.chr11.R1.fastq and $HOME/projects/shalek2013/raw_data/S10.chr11.R2.fastq
  • Output prefix: $HOME/projects/shalek2013/processed_data/S10.chr11.

Questions

  1. What files were created?
  2. In what file can you find the percentage of mapped reads?
  3. In what file can you find the debugging information for the program, like the full command, the files used, and the values of all the parameters?

5. Create sorted, indexed .bam file

.bam file format

The .bam file format is a compressed (aka "binary" - meaning not human readable) version of the .sam format.

All programs downstream of mapping require a sorted, indexed .bam file. You can also do these steps with your repetitive element-aligned .sam file if

Think of creating a sorted, indexed bam file like making a dictionary.

A raw aligned .sam file like a jumble of words in random order:

single
cell
bioinformatics
experiment
beyonce
sequencing
library
prep

Sorting puts the words into alphabet order:

beyonce
bioinformatics
cell
experiment
library
prep
sequencing
single

And indexing is like adding the alphabet tabs in a dictionary:

B:
    beyonce
    bioinformatics
C:
    cell
E:
    experiment
L: 
    library
P:
    prep
S:
    sequencing
    single

Instead of letters, indexing adds "chromosome tabs" so each program will know where, say chromosome 7 (chr7) starts and can jump to it right away.

5.0 Convert .sam to .bam

  • Input: .sam of aligned reads
  • Program: samtools view -b
    • -b flag forces output to be .bam
  • Output: Compressed .bam file

Example (samtools v1.3.1):

samtools view -b sample01.sam > sample01.sorted.bam

Exercise 3

Convert your .sam file to .bam

5.1 Sort .sam

For deeply sequenced samples (>50 million reads), this will take quite some time

  • Input: .sam of aligned reads
  • Program: samtools sort
  • Output: Sorted .bam

Example (samtools v1.3.1):

samtools sort sample01.bam > sample01.sorted.bam

Exercise 4

Sort your .bam file

5.2 Index .sam

Indexing only works with sorted bam files

  • Input: .bam of sorted, aligned reads
  • Program: samtools sort
  • Output: Sorted .bam

Example (samtools v1.3.1):

samtools index sample01.sorted.bam

This creates the file sample01.sorted.bam.bai

Exercise 5

Index your .bam file

6. Quantify gene expression

Units

TPM, FPKM, what? Check out this blog post on What the FPKM? to learn about RNA-seq units in depth.

File types

Gene annotation files are either .gtf or .gff format. We'll use the GENCODE GTF format for our purposes. A subset of the gtf format is shown below. Notice the third column specifies the type of feature.

chr11   HAVANA  gene    3125904 3131004 .       +       .       gene_id "ENSMUSG00000082286.10"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; level 2; havana_gene "OTTMUSG00000000759.2";
chr11   HAVANA  transcript      3125904 3129396 .       +       .       gene_id "ENSMUSG00000082286.10"; transcript_id "ENSMUST00000145164.7"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Pisd-ps1-004"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTMUSG00000000759.2"; havana_transcript "OTTMUST00000086908.1";
chr11   HAVANA  exon    3125904 3126058 .       +       .       gene_id "ENSMUSG00000082286.10"; transcript_id "ENSMUST00000145164.7"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Pisd-ps1-004"; exon_number 1; exon_id "ENSMUSE00000728833.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTMUSG00000000759.2"; havana_transcript "OTTMUST00000086908.1";
chr11   HAVANA  exon    3127479 3127644 .       +       .       gene_id "ENSMUSG00000082286.10"; transcript_id "ENSMUST00000145164.7"; gene_type "transcribed_unprocessed_pseudogene"; gene_status "KNOWN"; gene_name "Pisd-ps1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "Pisd-ps1-004"; exon_number 2; exon_id "ENSMUSE00000708691.1"; level 2; tag "basic"; transcript_support_level "1"; havana_gene "OTTMUSG00000000759.2"; havana_transcript "OTTMUST00000086908.1";

How to run the programs

  • Input: .bam of sorted, aligned reads
  • Program: featureCounts, Salmon, HTSeq, ...
  • Output: some kind of table (comma- or tab-delimited file)

Example:

featureCounts -a annotation.gtf -o sample01.featurecounts.txt sample1.sorted.bam

Exercise 5

Run featureCounts on your sorted, indexed .bam file

  • Annotation file: $HOME/genomes/mm10/gencode/m8/gencode.vM8.basic.annotation.chr11.gtf

7. Create gene expression matrix

  • Input: Individual cells' gene expression counts
  • Program: R, Python, awk, cut, paste
  • Output: A matrix of size (n_samples $\times$ n_features)

8. Cell-level QC

  • Input: Gene expression matrix
  • Program: R, Python, MATLAB
  • Possible criteria:
    • Low number of reads mapped (indicates lowly sequenced)
    • Low percentage of reads mapped (indicates poor library)
    • High mitochondrial:genome ratio (indicates dying cell)
    • High spike-in:genome ratio (indicates low cellular mRNA content)
    • Low numbers of expressed genes (indicates low detection)
  • Output: Gene expression matrix with "bad" cells removed

9. Gene-level QC

  • Input: Gene expression matrix with "bad" cells removed
  • Program: R, Python, MATLAB
  • Possible criteria:
    • Low expression (e.g. $< 1$ TPM in $> 5$ cells)
  • Output: Gene expression matrix with "bad" genes removed

10. Exploratory analysis

  • Input: Gene expression matrix with "bad" genes removed
  • Program: R, Python, MATLAB
  • Possible algorithms:
    • Dimensionality reduction
      • PCA
      • ICA
      • MDS
      • t-SNE
  • Output: plots

11. Subpopulation identification

  • Input: Gene expression matrix with "bad" genes removed
  • Program: R, Python, MATLAB
  • Possible algorithms:
    • Clustering
      • Hierarchical clustering
      • $K$-means
    • Pseudo-time
      • Monocle
      • Waterfall
      • Wanderlust
      • Wishbone
  • Output: plots, cluster identification

12. Subpopulation characterization

  • Input: Gene expression matrix with "bad" genes removed, cluster or psuedotime ordering
  • Program: R, Python, MATLAB
  • Possible algorithms:
    • Differential expression
      • DESeq
      • scDE ("single-cell differential expression")
    • Classifiers
      • Support Vector Machines (SVM)
      • DecisionTree classifiers
    • Clustering
      • Hierarchical clustering on ordered gene expression
  • Output: Lists of genes

13. Interpretation

  • Input: Lists of genes
  • Program: Biologist, Gene Ontology (e.g. DAVID) or other gene set enrichment analysis, PubMed
  • Output: Biological description of populations