It is always a good idea to check that your data is from the species you expect it to be. A very useful tool for this is Kraken. In this tutorial we will go through how you can use Kraken to check your samples for contamination.
Note if using the Sanger cluster: Kraken is run as part of the automatic qc pipeline and you can retreive the results using the pf qc
script. For more information, run pf man qc
.
To run Kraken you need to either build a database or download an existing one. The standard database is very large (33 GB), but thankfully there are some smaller, pre-built databased available. To download the smallest of them, the 4 GB MiniKraken. If you don't already have a kraken database set up, run:
In [ ]:
wget https://ccb.jhu.edu/software/kra\
ken/dl/minikraken_20171019_4GB.tgz
Then all you need to do is un-tar it:
In [ ]:
tar -zxvf minikraken_20171019_4GB.tgz
This version of the database is constructed from complete bacterial, archaeal, and viral genomes in RefSeq, however it contains only around 3 percent of the kmers from the original kraken database (more information here). If the pre-packaged databases are not quite what you are looking for, you can create your own customized database instead. Details about this can be found here.
Note if using the Sanger cluster: There are several pre-built databases available centrally on the Sanger cluster. For more information, please contact the Pathogen Informatics team.
In [ ]:
kraken --db ./minikraken_20171013_4GB --output kraken_results \
--fastq-input --gzip-compressed --paired \
data/13681_1#18_1.fastq.gz data/13681_1#18_2.fastq.gz
The five columns in the file that's generated are:
To get a better overview you can create a kraken report:
In [ ]:
kraken-report --db ./minikraken_20171013_4GB --print_header \
kraken_results > kraken-report
In [ ]:
head -n 20 kraken-report
The six columns in this file are:
Q1: What is the most prevalent species in this sample?
In [ ]:
Q2: Are there clear signs of contamination?
In [ ]:
Q3: What percentage of reads could not be classified?
In [ ]:
For bacteria, another thing that you can look at to detect contamination is if there are heterozygous SNPs in your samples. Simply put, if you align your reads to a reference, you would expect any snps to be homozygous, i.e. if one read differs from the reference genome, then the rest of the reads that map to that same location will also do so:
Homozygous SNP
Ref: CTTGAGACGAAATCACTAAAAAACGTGACGACTTG
Read1: CTTGAGtCG
Read2: CTTGAGtCGAAA
Read3: GAGtCGAAATCACTAAAA
Read4: GtCGAAATCA
But if there is contamination, this may not be the case. In the example below, half of the mapped reads have the T allele and half have the A.
Heterozygous SNP
Ref: CTTGAGACGAAATCACTAAAAAACGTGACGACTTG
Read1: CTTGAGtCG
Read2: CTTGAGaCGAAA
Read3: GAGaCGAAATCACTAAAA
Read4: GtCGAAATCA
Note if using the Sanger cluster: Heterozygous SNPs are calculated as part of the automated QC pipeline. The result for each lane is available in the file heterozygous_snps_report.txt.
Congratulations! You have reached the end of this tutorial. You can find the answers to all the questions of the tutorial here.
To revisit the previous section click here. Alternatively you can head back to the index page