After visually inspecting the genome alignment, the next step in a typical RNA-Seq analysis is to estimate transcript abundance. To do this, reads are assigned to the transcripts they came from. These assignments are then used to quantify gene or transcript abundance (expression level).
For this tutorial, we are using Kallisto to assign reads to a set of transcript sequences and quantify transcript abundance. Kallisto does not assemble transcripts and cannot identify novel isoforms. So, when a reference transcriptome isn't available, the transcripts will need to be assembled de novo from the reads. However, for this tutorial, we already have a reference transcriptome available.
The objectives of this part of the tutorial are:
Many of the existing methods used for estimating transcript abundance are alignment-based. This means they rely on mapping reads onto the reference genome. The gene expression levels are then calculated by counting the number of reads overlapping the transcripts. However, read alignment is a computationally and time intensive process. So, in this tutorial, we will be running Kallisto which uses a fast, alignment-free method for transcript quantification.
Near-optimal probabilistic RNA-seq quantification
Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter
Nat Biotechnol. 2016 May;34(5):525-7. doi: 10.1038/nbt.3519
Kallisto uses a process called pseudoalignment to make it efficient. Rather than looking at where the reads map, Kallisto uses the compatibility between the reads and transcripts to estimate transcript abundance. Thus, most transcript quantification with Kallisto can be done on a simple laptop (Figure 3).
Figure 3. Performance of kallisto and other methods
(a) Accuracy of kallisto, Cufflinks, Sailfish, EMSAR, eXpress and RSEM on 20 RSEM simulations of 30 million 75-bp paired-end reads. (b) Total running time in minutes for processing the 20 simulated data sets of 30 million paired-end reads described in a. Please see the Kallisto publication for original figure and more information.
As with alignment-based methods, Kallisto needs an index. To generate the index, Kallisto first builds a transcriptome de Bruijn Graph (T-BDG) from all of the k-mers (short sequences of k nucleotides) that it finds in the transcriptome. Each node in the graph corresponds to a k-mer and each transcript is represented by its path through the graph. Using these paths, each k-mer is assigned a k-compatibility class. Some k-mers will be redundant i.e. shared by the same transcripts. These are skipped to make the index compact and quicker to search. A great worked example of this process can be found here.
The command kallisto index
can be used to build a Kallisto index from transcript sequences.
In [ ]:
kallisto index
Here we can see the version of Kallisto that we're using (useful for publication methods) and the information that we'll need to give kallisto index
. The only information we need to give kallisto index
is the location of our transcript sequences (PccAS_v3_transcripts.fa). However, it's useful to have a meaningful filename for the resulting index. We can add this by using the option -i
which expects a value, our index prefix (PccAS_v3_kallisto).
In [ ]:
kallisto quant
We can see that kallisto quant
needs us to tell it where our sample read are. Although we don't have to, it's usually a good idea to keep the results of each quantification in a different directory. This is because the output filename are always the same (e.g. abundances.tsv). If we ran a second analysis, these could get overwritten. To use a different output directory, we can use the -o
option. We will also be using the -b
option for bootstrapping.
Not all reads will be assigned unambiguously i.e. to a single transcript. This means that there will be "noise" in our abundance estimates where reads can be assigned to multiple transcripts. Because Kallisto is so quick, it has time to quantify the uncertainty in its abundance estimates using random resampling and replacement. This process is called bootstrapping and indicates how reliable the expression estimates are from the observed pseudoalignment. The bootstrap values can be used downstream to distinguish the technical variability from the biological variability in your experiment.
Make sure you are in the data
directory with the tutorial files.
In [ ]:
cd data
Build an index called PccAS_v3_kallisto from transcript sequences in PccAS_v3_transcripts.fa.
In [ ]:
kallisto index -i PccAS_v3_kallisto PccAS_v3_transcripts.fa
Quantify the transcript expression levels for the MT1 sample with 100 bootstrap samples and calling the output directory MT1.
In [ ]:
kallisto quant -i PccAS_v3_kallisto -o MT1 -b 100 MT1_1.fastq MT1_2.fastq
You'll find your Kallisto results in a new output directory which we called MT1. Let's take a look.
In [ ]:
ls MT1
Running kallisto quant
generated three output files in our MT1
folder:
Note: when the number of bootstrap values (-b
) is very high, Kallisto will generate a large amount of data. To help, it outputs bootstrap results in HDF5 format (abundance.h5). This file can be read directly by sleuth.
In the MT1/abundance.tsv file we have the abundance estimates for each gene for the MT1 sample. Let's take a quick look.
In [ ]:
head MT1/abundance.tsv
In MT1/abundance.tsv there are five columns which give us information about the transcript abundances for our MT1 sample.
In the last column we have our normalised abundance value for each gene. These are our transcripts per million or TPM. If you have time at the end of this tutorial, see our normalisation guide which covers common normalisation methods and has a bonus exercise.
To get the result for a specific gene, we can use grep
.
In [ ]:
grep PCHAS_0100100 MT1/abundance.tsv
If we wanted to get the TPM value for a particular gene, we can use awk
.
In [ ]:
awk -F"\t" '$1=="PCHAS_0100100" {print $5}' MT1/abundance.tsv
Use kallisto quant
four more times, for the MT2 sample and the three SBP samples.
Hint: look at the terminal output from kallisto index
Hint: you can use grep
or look at the terminal output from kallisto quant
or in the run_info.json files
Hint: use grep
to look at the abundance.tsv files
You can head back to visualising transcriptomes with IGV or continue on to identifying differentially expressed genes with sleuth.