This document is a jupyter notebook that illustrates some of the ways to use xenomapper.
If you are viewing this on github it will have the output of each of the commands shown, or alternatively you can download the .ipynb file and run the examples on your own system. All commands are executed on the command line, and the %%bash
should be ingnored if you are following along at the command line (the %%bash
just tells jupyter to execute the cells code with bash when this document is run as a notebook).
Xenomapper is a tool for post processing reads that come from a biological source where two genomes are present and have been aligned to both reference genomes. Xenomapper compares the mapping scores between the two genomes, and if available as a suboptimal alignment within the genomes, to determine the specificity of the read. Unlike compound genome mapping approaches this is capable of resolving species specific multimapping reads.
Before using Xenomapper you must have mapped your reads to both genomes of interest. Xenomapper is most frequently used with the bowtie2 and HISAT2 aligners, and this document will focus on bowtie2. The dataset that we will use is from Rossello et al and is a publicly available cell line data set. Although this data set is older we use it as it does not require any restricted access. (Note the downloads for running this example total 700MB for the data and 7GB for the references)
In [ ]:
%%bash
brew install homebrew/science/bowtie2
brew install homebrew/science/sratoolkit
brew install homebrew/science/samtools
brew install python3
In [2]:
%%bash
brew info bowtie2 samtools sratoolkit
In [30]:
%%bash
pip3 install --upgrade xenomapper
The Short Read Archive uses its own file format so we need to use their tool to access the data. Unfortunately all this is very slow so downloading the full dataset can take a long time. To speed up the process we will only download the first 1 Million reads. (Alternatively you could use Aspera connect) You need to preserve the original readnames (--origfmt), and want singleton reads in a separate file (--split-3).
In [39]:
%%bash
fastq-dump -X 1000000 --origfmt --split-3 SRR879369
We also need to have both the bowtie2 index for both human and mouse so we will download the prebuilt versions from the bowtie2 web page
In [10]:
%%bash
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz
In [13]:
%%bash
curl -O ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/mm10.zip
In [20]:
%%bash
mkdir genomes
tar xzf GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz -C genomes/
unzip mm10.zip -d genomes/
When Xenomapper compares the output of the mapping against the two genomes it needs to be able to locate the same read in both genomes while simultaneously traversing the two files. This is trivial if the reads are maintained in the same order through the mapping process, so choosing our mapping options carefully avoids the need to resort the mapped reads by name. (Sorting works if you need to do this with previously mapped data, just use the same tool (e.g. samtools sort) on both files as lexicial sort order varies between programs).
Most aligners in single thread mode will return the results in the same order that they occur in the fastq input file, and in mutlithreaded mode there is usually an option to preserve the read order.
From the downloads above you should have two fastq files in your working directory and 12 .bt2 bowtie2 index files in the genomes folder.
In [40]:
%%bash
bowtie2 --local --threads 7 --reorder \
-x genomes/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index \
-1 SRR879369_1.fastq \
-2 SRR879369_2.fastq \
-S SRR879369.hs.sam
In [41]:
%%bash
bowtie2 --local --threads 7 --reorder \
-x genomes/mm10 \
-1 SRR879369_1.fastq \
-2 SRR879369_2.fastq \
-S SRR879369.mm.sam
We can see from the bowtie output that the majority of reads in this sample are human. Mapping to the human genome is sufficient to identify the concordant reads with paired end at this length; however, we are much more interested in the discordant reads as these will be important cancer structural variations. Also of note is that the numbers don't add up, there are clearly reads with valid alignments to both genomes that need disambiguating based on the alignment score.
In [42]:
%%bash
xenomapper --primary_sam SRR879369.hs.sam \
--secondary_sam SRR879369.mm.sam \
--primary_specific SRR879369.human_specific.sam \
--paired
The summary table is markdown formated. Also note that the summary for paired end data treats forward and reverse orientations of discord as separate categories. This is because the reverse read is frequently of lower quality that forward read and this information can be useful for tuning quality thresholds.
Category | Count |
---|---|
('primary_multi', 'primary_multi') | 62781 |
('primary_multi', 'primary_specific') | 19871 |
('primary_multi', 'secondary_multi') | 194 |
('primary_multi', 'secondary_specific') | 549 |
('primary_multi', 'unassigned') | 404 |
('primary_multi', 'unresolved') | 102 |
('primary_specific', 'primary_multi') | 21151 |
('primary_specific', 'primary_specific') | 868118 |
('primary_specific', 'secondary_multi') | 554 |
('primary_specific', 'secondary_specific') | 1212 |
('primary_specific', 'unassigned') | 4307 |
('primary_specific', 'unresolved') | 226 |
('secondary_multi', 'primary_multi') | 269 |
('secondary_multi', 'primary_specific') | 523 |
('secondary_multi', 'secondary_multi') | 182 |
('secondary_multi', 'secondary_specific') | 277 |
('secondary_multi', 'unassigned') | 201 |
('secondary_multi', 'unresolved') | 35 |
('secondary_specific', 'primary_multi') | 970 |
('secondary_specific', 'primary_specific') | 1597 |
('secondary_specific', 'secondary_multi') | 333 |
('secondary_specific', 'secondary_specific') | 995 |
('secondary_specific', 'unassigned') | 1230 |
('secondary_specific', 'unresolved') | 113 |
('unassigned', 'unassigned') | 12514 |
('unresolved', 'primary_multi') | 230 |
('unresolved', 'primary_specific') | 470 |
('unresolved', 'secondary_multi') | 46 |
('unresolved', 'secondary_specific') | 265 |
('unresolved', 'unassigned') | 124 |
('unresolved', 'unresolved') | 157 |
You can also specify addition output files for other categories of read. For example if we were interested in oncoviruses we could look in the unassigned category, although the more usual contents is poor quality reads and adaptor sequences.
In [46]:
%%bash
xenomapper --primary_sam SRR879369.hs.sam \
--secondary_sam SRR879369.mm.sam \
--primary_specific SRR879369.human_specific.sam \
--unassigned SRR879369.unassigned.sam \
--paired
In [50]:
%%bash
tail SRR879369.unassigned.sam