QC assessment of NGS data

As mentioned previously, QC is an important part of any analysis. In this section we are going to look at some of the metrics and graphs that can be used to assess the QC of NGS data.

Base quality

Illumina sequencing technology relies on sequencing by synthesis. One of the most common problems with this is dephasing. For each sequencing cycle, there is a possibility that the replication machinery slips and either incorporates more than one nucleotide or perhaps misses to incorporate one at all. The more cycles that are run (i.e. the longer the read length gets), the greater the accumulation of these types of errors gets. This leads to a heterogeneous population in the cluster, and a decreased signal purity, which in turn reduces the precision of the base calling. The figure below shows an example of this.

Because of dephasing, it is possible to have high-quality data at the beginning of the read but really low-quality data towards the end of the read. In those cases you can decide to trim off the low-quality reads, for example using a tool called Trimmomatic.

The figures below shows an example of a high-quality read data (top) and a poor quality read data (bottom).

Other base calling errors

There are several different reasons for a base to be called incorrectly, as shown in the figure below. Phasing noise and signal decay is a result of the dephasing issue described above. During library preparation, mixed clusters can occur if multiple templates get co-located. These clusters should be removed from the downstream analysis. Boundary effects occur due to optical effects when the intensity is uneven across each tile, resulting in lower intensity toward the edges. Cross-talk occurs because the emission frequency spectra for each of the four dyes partly overlap. Finally, for early chemistries T fluorophore accumulation was an issue, where incomplete cleavage of the dye coupled to thymine lead to an accumulation the nucleotide.

Base-calling for next-generation sequencing platforms, doi: 10.1093/bib/bbq077

Mismatches per cycle

Aligning reads to a high-quality reference genome can provide insight to the quality of a sequencing run by showing you the mismatches to the reference sequence. This can help you detect cycle-specific errors. Mismatches can occur due to two main causes, sequencing errors and differences between your sample and the reference genome, which is important to bear in mind when interpreting mismatch graphs. The figures below show an example of a good run (top) and a bad one (bottom). In the first figure, the distribution of the number of mismatches is even between the cycles, which is what we would expect from a good run. However, in the second figure, two cycles stand out with a lot of mismatches compared to the other cycles.

GC bias

It is a good idea to compare the GC content of the reads against the expected distribution in a reference sequence. The GC content varies between species, so a shift in GC content like the one seen below could be an indication of sample contamination. In the left image below, we can see that the GC content of the sample is about the same as for the reference, at ~38%. However, in the right figure, the GC content of the sample is closer to 55%, indicating that there is an issue with this sample.

GC content by cycle

Looking at the GC content per cycle can help detect if the adapter sequence was trimmed. For a random library, it is expected to be little to no difference between the different bases of a sequence run, so the lines in this plot should be parallel with each other like in the first of the two figures below. In the second of the figures, the initial spikes are likely due to adapter sequences that have not been removed.

Insert size

For paired-end sequencing the size of DNA fragments also matters. In the first of the examples below, the insert size peaks around 440 bp. In the second however, there is also a peak at around 200 bp. This indicates that there was an issue with the fragment size selection during library prep.

Exercises

Q1: The figure below is from a 100bp paired-end sequencing. Can you spot any problems?

Insertions/Deletions per cycle

Sometimes, air bubbles occur in the flow cell, which can manifest as false indels. The spike in the second image provides an example of how this can look.

Generating QC stats

Now let's try this out! We will generate QC stats for two lanes of Illumina paired-end sequencing data from yeast. We will use the bwa mapper to align the data to the Saccromyces cerevisiae genome, followed by samtools stats to generate the stats.

Read pairs are usually stored in two separate FASTQ files so that n-th read in the first file and the n-th read in the second file constitute a read pair. Can you devise a quick sanity check that reads in these two files indeed form pairs? The files must have the same number of lines and the naming of the reads usually suggests if they form a pair. The location of the files is:

data/lane1/s_7_1.fastq   
data/lane1/s_7_2.fastq

In [ ]:

Let's have a look at the script we are going to run to create the mappings:


In [ ]:
cat create_mapping.sh

The script contains several commands, some are combined together using pipes. (UNIX pipes is a very powerful and elegant concept which allows us to feed the output of one command into the next command and avoid writing intermediate files. If you are not comfortable with UNIX, consider having a go at the UNIX tutorial).

Now run the script to create the mappings and stats:


In [ ]:
./create_mapping.sh

The script will produce the BAM file lane1.sorted.bam and a matching index file:


In [ ]:
ls -alrt data

Now we will use samtools stats to generate the stats for the primary alignments. The option -f can be used to filter reads with specific tags, while -F can be used to filter out reads with specific tags. The following command will include only primary alignments:


In [ ]:
samtools stats -F SECONDARY data/lane1.sorted.bam \
    > data/lane1.sorted.bam.bchk

Have a look at the first 41 lines of the statistics file that was generated:


In [ ]:
head -n 41 data/lane1.sorted.bam.bchk

This file contains a number of useful stats that we can use to get a better picture of our data, and it can even be plotted with plot-bamstats, as you will see soon. First let's have a closer look at some of the different stats. Each part of the file starts with a # followed by a description of the section and how to extract it from the file. Let's have a look at all the sections in the file:


In [ ]:
grep ^'#' data/lane1.sorted.bam.bchk | grep 'Use'

Summary Numbers (SN)

This initial section contains a summary of the alignment and includes some general statistics. In particular, you can see how many bases mapped, and how much of the genome that was covered.

Now look at the output and try to answer the questions below.

Q2: What is the total number of reads?


In [ ]:

Q3: What proportion of the reads were mapped?


In [ ]:

Q4: How many pairs were mapped to a different chromosome?


In [ ]:

Q5: What is the insert size mean and standard deviation?


In [ ]:

Q6: How many reads were paired properly?


In [ ]:

Finally, we will create some QC plots from the output of the stats command using the command plot-bamstats which is included in the samtools package:


In [ ]:
plot-bamstats -p data/lane1-plots/ data/lane1.sorted.bam.bchk

Now in your web browser open the file lane1-plots/index.html to view the QC information.

Q7: How many reads have zero mapping quality?


In [ ]:

Q8: Which of the first fragments or second fragments are higher base quality on average?


In [ ]:

The answers to the questions on this page can be found here.

Now continue to the next section of the tutorial: Identifying contamination
Alternatively, you can return to the previous section or the index page.