Aligning the PAX5 sample to the genome

There are a number of competing tools for short read alignment, each with its own set of strengths, weaknesses, and caveats. Here we will use Bowtie2, a widely used ultrafast, memory efficient short read aligner.

Bowtie2 has a number of parameters in order to perform the alignment. To view them all type:


In [ ]:
bowtie2 --help

Bowtie2 uses indexed genome for the alignment in order to keep its memory footprint small. Because of time constraints we will build the index only for one chromosome of the human genome. For this we need the chromosome sequence in fasta format. This is stored in a file named HS19.fa, under the subdirectory genome.

If you are not in there already, change into the data directory.


In [ ]:
cd data

We will be storing our indexed genome in a folder called bowtie_index.

Check if the bowtie_index folder already exists.


In [ ]:
ls bowtie_index

If it doesn't exist already, create the folder bowtie_index.


In [ ]:
mkdir bowtie_index

Then, index the chromosome using the command:


In [ ]:
bowtie2-build genome/HS19.fa bowtie_index/hs19

Be patient, building the index may take 5-10 minutes!

This command will output 6 files that constitute the index. These files that have the prefix hs19 and are stored in the bowtie_index directory.

To check the files have been successfully created type:


In [ ]:
ls -l bowtie_index

Now that the genome is indexed we can move on to the actual alignment. In the following command the first argument (-k) instructs Bowtie2 to report only uniquely mapped reads. The following argument (-x) specifies the basename of the index for the genome to be searched; in our case is hs19. Then there is the name of the FASTQ file and the last argument (-S) that ensures that the output is in SAM format.

Align the PAX5 reads using Bowtie2:


In [ ]:
bowtie2 -k 1 -x bowtie_index/hs19 PAX5.fastq -S PAX5.sam

The above command outputs the alignments in SAM format and stores them in the file PAX5.sam.

In general before you run Bowtie2, you have to know which FASTQ format you have. The available FASTQ formats in Bowtie2 are:

--phred33 input quals are Phred+33 (default)
--phred64 input quals are Phred+64
--int-quals input quals are specified as space-delimited integers

See http://en.wikipedia.org/wiki/FASTQ_format to find more detailed information about the different quality encodings.

The PAX5.fastq file we are working on uses encoding Phred+33 (the default). Bowtie2 will take 2-3 minutes to align the file. This is fast compared to other aligners that sacrifice some speed to obtain higher sensitivity.

Look at the file in the SAM format by typing:


In [ ]:
head -n 10 PAX5.sam

You can find more information on the SAM format by looking at https://samtools.github.io/hts-specs/SAMv1.pdf.


Questions

Q1. How can you distinguish between the header of the SAM format and the actual alignments?
Hint: look at section 1.3 in the documentation (https://samtools.github.io/hts-specs/SAMv1.pdf).

Q2. What information does the header provide you with?
Hint: use the documentation to work out what the header tags mean

Q3. Which chromosome are the reads mapped to?


What's next?

For a quick recap of what the tutorial covers head back to the introduction.

If you want a reintroduction to the tutorial dataset, head back to introducing the tutorial dataset.

Otherwise, let's continue on to manipulating SAM output.