Transcript quantification with Kallisto

Introduction

After visually inspecting the genome alignment, the next step in a typical RNA-Seq analysis is to estimate transcript abundance. To do this, reads are assigned to the transcripts they came from. These assignments are then used to quantify gene or transcript abundance (expression level).

For this tutorial, we are using Kallisto to assign reads to a set of transcript sequences and quantify transcript abundance. Kallisto does not assemble transcripts and cannot identify novel isoforms. So, when a reference transcriptome isn't available, the transcripts will need to be assembled de novo from the reads. However, for this tutorial, we already have a reference transcriptome available.

The objectives of this part of the tutorial are:

  • use Kallisto to index a transcriptome
  • use Kallisto to estimate transcript abundance

Quantifying transcripts with Kallisto

Many of the existing methods used for estimating transcript abundance are alignment-based. This means they rely on mapping reads onto the reference genome. The gene expression levels are then calculated by counting the number of reads overlapping the transcripts. However, read alignment is a computationally and time intensive process. So, in this tutorial, we will be running Kallisto which uses a fast, alignment-free method for transcript quantification.

Near-optimal probabilistic RNA-seq quantification
Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter
Nat Biotechnol. 2016 May;34(5):525-7. doi: 10.1038/nbt.3519

Kallisto uses a process called pseudoalignment to make it efficient. Rather than looking at where the reads map, Kallisto uses the compatibility between the reads and transcripts to estimate transcript abundance. Thus, most transcript quantification with Kallisto can be done on a simple laptop (Figure 3).

Figure 3. Performance of kallisto and other methods
(a) Accuracy of kallisto, Cufflinks, Sailfish, EMSAR, eXpress and RSEM on 20 RSEM simulations of 30 million 75-bp paired-end reads. (b) Total running time in minutes for processing the 20 simulated data sets of 30 million paired-end reads described in a. Please see the Kallisto publication for original figure and more information.

Step 1: building a Kallisto index

As with alignment-based methods, Kallisto needs an index. To generate the index, Kallisto first builds a transcriptome de Bruijn Graph (T-BDG) from all of the k-mers (short sequences of k nucleotides) that it finds in the transcriptome. Each node in the graph corresponds to a k-mer and each transcript is represented by its path through the graph. Using these paths, each k-mer is assigned a k-compatibility class. Some k-mers will be redundant i.e. shared by the same transcripts. These are skipped to make the index compact and quicker to search. A great worked example of this process can be found here.

The command kallisto index can be used to build a Kallisto index from transcript sequences.


In [ ]:
kallisto index

Here we can see the version of Kallisto that we're using (useful for publication methods) and the information that we'll need to give kallisto index. The only information we need to give kallisto index is the location of our transcript sequences (PccAS_v3_transcripts.fa). However, it's useful to have a meaningful filename for the resulting index. We can add this by using the option -i which expects a value, our index prefix (PccAS_v3_kallisto).

Step 2: estimating transcript abundance

With this Kallisto index, you can use kallisto quant to estimate transcript abundances. You will need to run this command separately for each sample.


In [ ]:
kallisto quant

We can see that kallisto quant needs us to tell it where our sample read are. Although we don't have to, it's usually a good idea to keep the results of each quantification in a different directory. This is because the output filename are always the same (e.g. abundances.tsv). If we ran a second analysis, these could get overwritten. To use a different output directory, we can use the -o option. We will also be using the -b option for bootstrapping.

Bootstrapping

Not all reads will be assigned unambiguously i.e. to a single transcript. This means that there will be "noise" in our abundance estimates where reads can be assigned to multiple transcripts. Because Kallisto is so quick, it has time to quantify the uncertainty in its abundance estimates using random resampling and replacement. This process is called bootstrapping and indicates how reliable the expression estimates are from the observed pseudoalignment. The bootstrap values can be used downstream to distinguish the technical variability from the biological variability in your experiment.


Exercise 4

Make sure you are in the data directory with the tutorial files.


In [ ]:
cd data

Build an index called PccAS_v3_kallisto from transcript sequences in PccAS_v3_transcripts.fa.


In [ ]:
kallisto index -i PccAS_v3_kallisto PccAS_v3_transcripts.fa

Quantify the transcript expression levels for the MT1 sample with 100 bootstrap samples and calling the output directory MT1.


In [ ]:
kallisto quant -i PccAS_v3_kallisto -o MT1 -b 100 MT1_1.fastq MT1_2.fastq

You'll find your Kallisto results in a new output directory which we called MT1. Let's take a look.


In [ ]:
ls MT1

Running kallisto quant generated three output files in our MT1 folder:

  • abundance.h5
    HDF5 binary file containing run info, abundance esimates, bootstrap estimates, and transcript length information length.
  • abundance.tsv
    Plain text file containing abundance estimates (doesn't contain bootstrap estimates).
  • run_info.json
    JSON file containing information about the run.

Note: when the number of bootstrap values (-b) is very high, Kallisto will generate a large amount of data. To help, it outputs bootstrap results in HDF5 format (abundance.h5). This file can be read directly by sleuth.

In the MT1/abundance.tsv file we have the abundance estimates for each gene for the MT1 sample. Let's take a quick look.


In [ ]:
head MT1/abundance.tsv

In MT1/abundance.tsv there are five columns which give us information about the transcript abundances for our MT1 sample.

  • target_id
    Unique transcript identifier.
  • length
    Number of bases found in exons.
  • eff_length
    Effective length. Uses fragment length distribution to determine the effective number of positions that can be sampled on each transcript.
  • est_counts
    Estimated counts*. This may not always be an integer as reads which map to multiple transcripts are fractionally assigned to each of the corresponding transcripts.
  • tpm
    Transcripts per million. Normalised value accounting for length and sequence depth bias.

In the last column we have our normalised abundance value for each gene. These are our transcripts per million or TPM. If you have time at the end of this tutorial, see our normalisation guide which covers common normalisation methods and has a bonus exercise.

To get the result for a specific gene, we can use grep.


In [ ]:
grep PCHAS_0100100 MT1/abundance.tsv

If we wanted to get the TPM value for a particular gene, we can use awk.


In [ ]:
awk -F"\t" '$1=="PCHAS_0100100" {print $5}' MT1/abundance.tsv

Use kallisto quant four more times, for the MT2 sample and the three SBP samples.


Questions

Q1: What k-mer length was used to build the Kallisto index?

Hint: look at the terminal output from kallisto index

Q2: How many transcript sequences are there in PccAS_v3_transcripts.fa?

Hint: you can use grep or look at the terminal output from kallisto quant or in the run_info.json files

Q3: What is the transcripts per million (TPM) value for PCHAS_1402500 in each of the samples?

Hint: use grep to look at the abundance.tsv files

Q4: Do you think PCHAS_1402500 is differentially expressed?