Mapping RNA-Seq reads to the genome using HISAT2

Introduction

For this exercise, we have reduced the number of reads in each sample to around 2.5 million to reduce the mapping time. However, this will be sufficient to detect most differentially expressed genes.

The objectives of this part of the tutorial are:

  • use HISAT2 to build an index from the reference genome
  • use HISAT2 to map RNA-Seq reads to the reference genome

Mapping RNA-Seq reads to a genome

By this stage, you should have already performed a standard NGS quality control check on your reads to see whether there were any issues with the sample preparation or sequencing. For more information, see our NGS Data formats and QC tutorial.

Next, we map our RNA-Seq reads to a reference genome to get context. This allows you to visually inspect your RNA-Seq data, identify contamination, novel exons and splice sites as well as giving you an overall feel for your transcriptome.

HISAT2

To map the RNA-Seq reads from our five samples to the reference genome, we will be using HISAT2, a fast and sensitive splice-aware aligner. HISAT2 compresses the genome using an indexing scheme based on the Burrows-Wheeler transform (BWT) and Ferragina-Manzini (FM) index to reduce the amount of space needed to store the genome. This also makes the genome quick to search, using a whole-genome FM index to anchor each alignment and then tens of thousands local FM indexes for very rapid extensions of these alignments.

For more information, and to find the original version of Figure 2, please see the HISAT paper:

HISAT: a fast spliced aligner with low memory requirements
Daehwan Kim, Ben Langmead and Steven L Salzberg
Nat Methods. 2015 Apr;12(4):357-60. doi:10.1038/nmeth.3317

HISAT2 is a splice-aware aligner which means it takes into account that when a read is mapped it may be split across multiple exons with (sometimes large) intronic gaps between aligned regions. As you can see in Figure 2, HISAT2 splits read alignments into five classes based on the number of exons the read alignment is split across and the length of the anchor (longest continuously mapped portion of a split read):

  • Aligns to a single exon (M)
  • Alignment split across 2 exons with long anchors over 15bp (2M_gt_15)
  • Alignment split across 2 exons with intermediate anchors between 8bp and 15bp (2M_8_15)
  • Alignment split across 2 exons with short anchors less than 7bp (2M_1_7)
  • Alignment split across more than 2 exons (gt_2M)

HISAT2 used the global index to place the longest continuously mapped portion of a read (anchor). This information is then used to identify the relevant local index. In most cases, HISAT2 will only need to use a single local index to place the remaining portion of the read without having to search the rest of the genome.

For the human genome, HISAT2 will build a single global index and 48,000 local FM indexes. Each of the local indexes represents a 64kb genomic region. The majority of human introns are significantly shorter than 64kb, so >90% of human introns fall into a single local index. Moreover, each of the local indexes overlaps its neighbour by ~1kb which means that it also has the ability to detect reads spanning multiple indexes.

There are five HISAT2 RNA-seq read mapping categories: (i) M, exonic read; (ii) 2M_gt_15, junction reads with long, >15-bp anchors in both exons; (iii) 2M_8_15, junction reads with intermediate, 8- to 15-bp anchors; (iv) 2M_1_7, junction reads with short, 1- to 7-bp, anchors; and (v) gt_2M, junction reads spanning more than two exons (Figure 2A). Exoninc reads span only a single exon and represent over 60% of the read mappings in the 20 million 100-bp simulated read dataset.


Exercise 2

Be patient, each of the following steps will take a couple of minutes!

Make sure you are in the data directory with the tutorial files.


In [ ]:
cd data

Look at the usage instructions for hisat2-build.


In [ ]:
hisat2-build -h

This not only tells us the version of HISAT2 we're using (essential for publication methods):

HISAT2 version 2.0.4 by Daehwan Kim

But, that we also need to give histat2-build two pieces of information:

Usage: hisat2-build [options]* <reference_in> <ht2_index_base>

These are:

  • <reference_in>
    location of our reference sequence file (PccAS_v3_genome.fa)
  • <ht2_index_base>
    what we want to call our HISAT2 index files (PccAS_v3_hisat2.idx)

Build a HISAT2 index for our Plasmodium chabaudi chabaudi AS (P. chabaudi) reference genome using hisat2-build.


In [ ]:
hisat2-build PccAS_v3_genome.fa PccAS_v3_hisat2.idx

You can see the generated index files using:


In [ ]:
ls PccAS_v3_hisat2.idx*

Look at the usage for hisat2.


In [ ]:
hisat2 -h

Here we can see that hisat2 needs several bits of information so that it can do the mapping:

hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
  • -x <ht2-idx>
    the prefix that we chose for our index files with hisat2-build (PccAS_v3_hisat2.idx)
  • {-1 <m1> -2 <m2> | -U <r>}
    the left (-1) and right (-2) read files for the sample (MT1_1.fastq and MT1_2.fastq respectively
  • [-S <sam>]
    the name of the file we want to write the output alignment to (MT1.sam) as, by default, hisat2 will print the results to the terminal (stdout)

We will also be adding one more piece of information, the maximum intron length (default 500,000 bases). For this analysis, we want to set the maximum intron length to 10,000. We can do this by adding the option --max-intronlen 10000.

Map the reads for the MT1 sample using HISAT2.


In [ ]:
hisat2 --max-intronlen 10000 -x PccAS_v3_hisat2.idx \
-1 MT1_1.fastq -2 MT1_2.fastq -S MT1.sam

HISAT2 has written the alignment in SAM format. This is a format which allows humans to look at our alignments. However, we need to convert the SAM file to its binary version, a BAM file. We do this for several reasons. Mainly we do it because most downstream programs require our alignments to be in BAM format and not SAM format. However, we also do it because the BAM file is smaller and so takes up less (very precious!) storage space. For more information, see the format guide: http://samtools.github.io/hts-specs/SAMv1.pdf.

Convert the SAM file to a BAM file.


In [ ]:
samtools view -b -o MT1.bam MT1.sam

We now need to sort the BAM file ready for indexing. When we aligned our reads with HISAT2, the alignments were produced in the same order as the sequences in our FASTQ files. To index the BAM file we need the alignments to be ordered by their respective positions in the reference genome. We can do this using samtools sort which will sort the alignments by their co-ordinates for each chromosome.

Sort the BAM file.


In [ ]:
samtools sort -o MT1_sorted.bam MT1.bam

Next, we need to index our BAM file. This makes searching the alignments much more efficient. It allows programs like IGV (which we will be using to visualise the alignment) to quickly get the alignments that overlap the genomic regions you're looking at. We can do this with samtools index which will generate an index file with the extension .bai.

Index the BAM file so that it can be read efficiently by IGV.


In [ ]:
samtools index MT1_sorted.bam

Now repeat this process of mapping, converting (SAM to BAM), sorting and indexing with the reads from the MT2 sample.

Hopefully, the sorted and indexed BAM files have already been generated for you. Let's check.


In [ ]:
ls SBP*bam*

If this doesn't return .bam and .bai files for your three SBP samples, run these commands.


In [ ]:
chmod +x map_SBP_samples.sh
./map_SBP_samples.sh

These commands run a bash script which will do the mapping, converting, sorting and indexing for all of the SBP samples. There's a great introduction to bash scripting and loops as part of our Unix tutorial.

If you have time at the end of the tutorial, feel free to take a look at the script and a breakdown of what it does in Running commands on multiple samples. Bash scripts and loops are a useful way of automating an analysis and running the same commands for multiple samples. Imagine if you had 50 samples and not 5!


Questions

Q1: How many index files were generated when you ran hisat2-build?

Hint: look for the files with the .ht2 extension

Q2: What was the overall alignment rate for each of the MT samples (MT1 and MT2) to the reference genome?

Hint: look at the the output from the hisat2 commands

Q3: How many MT1 and MT2 reads were not aligned to the reference genome?

Hint: look at the the output from the hisat2 commands, you're looking for reads (not read pairs) which have aligned 0 times (remember that one read from a pair may map even if the other doesn't)


What's next?

For a quick recap of what the tutorial covers head back to the introduction.

If you want a reintroduction to the tutorial dataset, head back to introducing the tutorial dataset.

Otherwise, let's continue on to visualising transcriptomes with IGV.