NGS Data formats and QC

Introduction

There are several file formats for storing Next Generation Sequencing (NGS) data. In this tutorial we will look at some of the most common formats for storing NGS reads and variant data. We will cover the following formats:

  • FASTQ - stores unaligned read sequences together with their base qualities
  • SAM/BAM - stores unaligned or aligned reads (text and binary formats, respectively)
  • CRAM - similar to BAM, but with better compression
  • VCF/BCF - flexible variant call format for storing SNPs, indels and structural variations (text and binary formats, respectively)

Following this, we will work through some examples of converting between the different formats.

Beyond understanding the different file formats, it is important to remember that all sequencing platforms have technical limitations that can introduce biases into your sequencing data. It is therefore very important to check the quality of the data before starting any analysis, whether you plan to use data you have sequenced yourself or publicly available data. In the latter part of this tutorial we will describe how to perform a QC assessment of your NGS data and suggest how to identify possible contamination.

Learning outcomes

On completion of the tutorial, you can expect to be able to:

  • Describe the different NGS data formats available (FASTQ, SAM/BAM, CRAM, VCF/BCF)
  • Perform conversions between the different data formats
  • Perform a QC assessment of high throughput sequence data
  • Identify possible contamination in high throughput sequence data

Tutorial sections

This tutorial comprises the following sections:

  1. Data formats
  2. File conversion
  3. QC assessment
  4. Identifying contamination

Authors

This tutorial was written by Sara Sjunnebo based on material from Petr Danecek and Thomas Keane.

Running the commands from this tutorial

You can run the commands in this tutorial either directly from the Jupyter notebook (if using Jupyter), or by typing the commands in your terminal window.

Running commands on Jupyter

If you are using Jupyter, command cells (like the ones below) can be run by selecting the cell and either clicking Cell -> Run in the menu above or pressing Ctrl+Enter. Let's give this a try by printing our working directory with the pwd command and then listing the files within it with ls. Run the commands in the two cells below.


In [ ]:
pwd

In [ ]:
ls -l

Running commands in the terminal

You can also follow this tutorial by typing all the commands you see into a terminal window. This is similar to the "Command Prompt" window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, select the cell below with the mouse and then either press Ctrl+Enter or choose Cell -> Run in the menu at the top of the page.


In [ ]:
echo cd $PWD

Now open a new terminal on your computer, type the command that was output by the previous cell and press Enter. The command will look similar to this:

cd /home/manager/pathogen-informatics-training/Notebooks/QC/

Now you can follow the instructions in the tutorial from here.

Let’s get started!

This tutorial assumes that you have samtools, bcftools, bwa, Picard tools and Kraken installed on your computer. For download and installation instructions, please see:

To check that you have installed these correctly, you can run the following commands:


In [ ]:
samtools --help

In [ ]:
bcftools --help

In [ ]:
bwa

In [ ]:
java -jar $PICARD -h

In [ ]:
kraken --help

Here, $PICARD is an environment variable that points to picard.jar.

These commands should print the help messages for samtools, bcftools, bwa, Picard tools and Kraken, respectively.
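If you would rather check everything in one go, the individual checks can be collected into a single loop. The following is only a minimal sketch, assuming a POSIX-compatible shell such as bash: check_tools is a hypothetical helper name introduced here for illustration, not part of any of the tools. It uses command -v, which only confirms that a program can be found on your PATH, not that it runs correctly. Since Picard is run through java, the sketch looks for java and then tests whether $PICARD points at an existing file.

```shell
# Hypothetical helper (not part of any tool): report whether each
# named program can be found on the PATH. Prints one line per
# program and returns non-zero if any of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Programs used in this tutorial; java is needed to run Picard.
check_tools samtools bcftools bwa java kraken \
    || echo "Install the missing tools before continuing."

# Picard is distributed as a jar file, so check the file that
# $PICARD points at rather than looking for a program on the PATH.
if [ -f "${PICARD:-}" ]; then
    echo "found:   picard.jar at $PICARD"
else
    echo "missing: \$PICARD is unset or does not point at a file"
fi
```

A "found" line here does not replace the help-message checks above; it just gives a quick overview of what still needs to be installed.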

To get started with the tutorial, head to the first section: Data formats