Identifying contamination

It is always a good idea to check that your data is from the species you expect it to be. A very useful tool for this is Kraken. In this tutorial we will go through how you can use Kraken to check your samples for contamination.

Note if using the Sanger cluster: Kraken is run as part of the automatic qc pipeline and you can retreive the results using the pf qc script. For more information, run pf man qc.

Setting up a database

To run Kraken you need to either build a database or download an existing one. The standard database is very large (33 GB), but thankfully there are some smaller, pre-built databased available. To download the smallest of them, the 4 GB MiniKraken. If you don't already have a kraken database set up, run:



In [ ]:

    
wget https://ccb.jhu.edu/software/kra\
    ken/dl/minikraken_20171019_4GB.tgz

Then all you need to do is un-tar it:



In [ ]:

    
tar -zxvf minikraken_20171019_4GB.tgz

This version of the database is constructed from complete bacterial, archaeal, and viral genomes in RefSeq, however it contains only around 3 percent of the kmers from the original kraken database (more information here). If the pre-packaged databases are not quite what you are looking for, you can create your own customized database instead. Details about this can be found here.

Note if using the Sanger cluster: There are several pre-built databases available centrally on the Sanger cluster. For more information, please contact the Pathogen Informatics team.

Running Kraken

To run Kraken, you need to provide the path to the database you just created. By default, the input files are assumed to be in FASTA format, so in this case we also need to tell Kraken that our input files are in FASTQ format, gzipped, and that they are paired end reads:



In [ ]:

    
kraken --db ./minikraken_20171013_4GB --output kraken_results \
    --fastq-input --gzip-compressed --paired \
    data/13681_1#18_1.fastq.gz data/13681_1#18_2.fastq.gz

The five columns in the file that's generated are:

"C"/"U": one letter code indicating that the sequence was either classified or unclassified.
The sequence ID, obtained from the FASTA/FASTQ header.
The taxonomy ID Kraken used to label the sequence; this is 0 if the sequence is unclassified.
The length of the sequence in bp.
A space-delimited list indicating the LCA mapping of each k-mer in the sequence.

To get a better overview you can create a kraken report:



In [ ]:

    
kraken-report --db ./minikraken_20171013_4GB --print_header \
    kraken_results > kraken-report

Looking at the results

Let's have a closer look at the kraken_report for the sample. If for some reason your kraken-run failed there is a prebaked kraken-report at data/kraken-report



In [ ]:

    
head -n 20 kraken-report

The six columns in this file are:

Percentage of reads covered by the clade rooted at this taxon
Number of reads covered by the clade rooted at this taxon
Number of reads assigned directly to this taxon
A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
NCBI taxonomy ID
Scientific name

Exercises

Q1: What is the most prevalent species in this sample?



In [ ]:

Q2: Are there clear signs of contamination?



In [ ]:

Q3: What percentage of reads could not be classified?



In [ ]:

Heterozygous SNPs

For bacteria, another thing that you can look at to detect contamination is if there are heterozygous SNPs in your samples. Simply put, if you align your reads to a reference, you would expect any snps to be homozygous, i.e. if one read differs from the reference genome, then the rest of the reads that map to that same location will also do so:

Homozygous SNP
Ref:       CTTGAGACGAAATCACTAAAAAACGTGACGACTTG
Read1:  CTTGAGtCG
Read2:  CTTGAGtCGAAA
Read3:         GAGtCGAAATCACTAAAA
Read4:               GtCGAAATCA

But if there is contamination, this may not be the case. In the example below, half of the mapped reads have the T allele and half have the A.

Heterozygous SNP
Ref:       CTTGAGACGAAATCACTAAAAAACGTGACGACTTG
Read1:  CTTGAGtCG
Read2:  CTTGAGaCGAAA
Read3:         GAGaCGAAATCACTAAAA
Read4:               GtCGAAATCA

Note if using the Sanger cluster: Heterozygous SNPs are calculated as part of the automated QC pipeline. The result for each lane is available in the file heterozygous_snps_report.txt.

Congratulations! You have reached the end of this tutorial. You can find the answers to all the questions of the tutorial here.

To revisit the previous section click here. Alternatively you can head back to the index page