How to use xenomapper

This document is a jupyter notebook that illustrates some of the ways to use xenomapper. If you are viewing this on github it will have the output of each of the commands shown, or alternatively you can download the .ipynb file and run the examples on your own system. All commands are executed on the command line, and the %%bash should be ingnored if you are following along at the command line (the %%bash just tells jupyter to execute the cells code with bash when this document is run as a notebook).

Xenomapper is a tool for post processing reads that come from a biological source where two genomes are present and have been aligned to both reference genomes. Xenomapper compares the mapping scores between the two genomes, and if available as a suboptimal alignment within the genomes, to determine the specificity of the read. Unlike compound genome mapping approaches this is capable of resolving species specific multimapping reads.

Before using Xenomapper you must have mapped your reads to both genomes of interest. Xenomapper is most frequently used with the bowtie2 and HISAT2 aligners, and this document will focus on bowtie2. The dataset that we will use is from Rossello et al and is a publicly available cell line data set. Although this data set is older we use it as it does not require any restricted access. (Note the downloads for running this example total 700MB for the data and 7GB for the references)

Installation

This example requires bowtie2, sratoolkit, and samtools. If you do not already have these installed the easiest way to install these requirements on linux or MacOS is to use homebrew or linuxbrew


In [ ]:
%%bash
brew install homebrew/science/bowtie2
brew install homebrew/science/sratoolkit
brew install homebrew/science/samtools
brew install python3

In [2]:
%%bash
brew info bowtie2 samtools sratoolkit


homebrew/science/bowtie2: stable 2.2.6 (bottled), HEAD
Fast and sensitive gapped read aligner
http://bowtie-bio.sf.net/bowtie2
/usr/local/Cellar/bowtie2/2.2.6 (76 files, 20.4M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew-science/blob/master/bowtie2.rb
==> Dependencies
Recommended: tbb
==> Options
--without-tbb
	Build without using Intel Thread Building Blocks (TBB)
--HEAD
	Install HEAD version

homebrew/science/samtools: stable 1.3 (bottled), HEAD
Tools (written in C using htslib) for manipulating next-generation sequencing data
http://www.htslib.org/
/usr/local/Cellar/samtools/1.3 (46 files, 813.3K) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew-science/blob/master/samtools.rb
==> Dependencies
Required: htslib
Optional: dwgsim
==> Options
--with-dwgsim
	Build with Whole Genome Simulation
--without-curses
	Skip use of libcurses, for platforms without it, or different curses naming
--HEAD
	Install HEAD version

homebrew/science/sratoolkit: stable 2.5.4 (bottled), HEAD
Tools for using data from INSDC Sequence Read Archive
https://github.com/ncbi/sra-tools
/usr/local/Cellar/sratoolkit/2.5.4 (48 files, 65.4M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew-science/blob/master/sratoolkit.rb
==> Dependencies
Build: autoconf
Required: libxml2
Recommended: libmagic, hdf5
==> Options
--without-hdf5
	Build without hdf5 support
--without-libmagic
	Build without libmagic support
--HEAD
	Install HEAD version

In [30]:
%%bash
pip3 install --upgrade xenomapper


Collecting xenomapper
  Using cached XenoMapper-1.0.0-py3-none-any.whl
Installing collected packages: xenomapper
  Found existing installation: XenoMapper 1.0.0
    Uninstalling XenoMapper-1.0.0:
      Successfully uninstalled XenoMapper-1.0.0
Successfully installed xenomapper-1.0.0

Download data files

The Short Read Archive uses its own file format so we need to use their tool to access the data. Unfortunately all this is very slow so downloading the full dataset can take a long time. To speed up the process we will only download the first 1 Million reads. (Alternatively you could use Aspera connect) You need to preserve the original readnames (--origfmt), and want singleton reads in a separate file (--split-3).


In [39]:
%%bash
fastq-dump -X 1000000 --origfmt --split-3 SRR879369


Read 1000000 spots for SRR879369
Written 1000000 spots for SRR879369

We also need to have both the bowtie2 index for both human and mouse so we will download the prebuilt versions from the bowtie2 web page


In [10]:
%%bash
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3575M  100 3575M    0     0  3470k      0  0:17:35  0:17:35 --:--:-- 3976k

In [13]:
%%bash
curl -O ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/mm10.zip


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3213M  100 3213M    0     0   390k      0  2:20:30  2:20:30 --:--:--  353k

In [20]:
%%bash
mkdir genomes
tar xzf GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index.tar.gz -C genomes/
unzip mm10.zip -d genomes/


Archive:  mm10.zip
  inflating: genomes/mm10.1.bt2      
  inflating: genomes/mm10.2.bt2      
  inflating: genomes/mm10.3.bt2      
  inflating: genomes/mm10.4.bt2      
  inflating: genomes/mm10.rev.1.bt2  
  inflating: genomes/mm10.rev.2.bt2  
  inflating: genomes/make_mm10.sh    

Mapping the data

When Xenomapper compares the output of the mapping against the two genomes it needs to be able to locate the same read in both genomes while simultaneously traversing the two files. This is trivial if the reads are maintained in the same order through the mapping process, so choosing our mapping options carefully avoids the need to resort the mapped reads by name. (Sorting works if you need to do this with previously mapped data, just use the same tool (e.g. samtools sort) on both files as lexicial sort order varies between programs).

Most aligners in single thread mode will return the results in the same order that they occur in the fastq input file, and in mutlithreaded mode there is usually an option to preserve the read order.

From the downloads above you should have two fastq files in your working directory and 12 .bt2 bowtie2 index files in the genomes folder.


In [40]:
%%bash
bowtie2 --local --threads 7 --reorder \
         -x genomes/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.bowtie_index \
         -1 SRR879369_1.fastq \
         -2 SRR879369_2.fastq \
         -S SRR879369.hs.sam


1000000 reads; of these:
  1000000 (100.00%) were paired; of these:
    147954 (14.80%) aligned concordantly 0 times
    580157 (58.02%) aligned concordantly exactly 1 time
    271889 (27.19%) aligned concordantly >1 times
    ----
    147954 pairs aligned concordantly 0 times; of these:
      61155 (41.33%) aligned discordantly 1 time
    ----
    86799 pairs aligned 0 times concordantly or discordantly; of these:
      173598 mates make up the pairs; of these:
        33871 (19.51%) aligned 0 times
        43432 (25.02%) aligned exactly 1 time
        96295 (55.47%) aligned >1 times
98.31% overall alignment rate

In [41]:
%%bash
bowtie2 --local --threads 7 --reorder \
         -x genomes/mm10 \
         -1 SRR879369_1.fastq \
         -2 SRR879369_2.fastq \
         -S SRR879369.mm.sam


1000000 reads; of these:
  1000000 (100.00%) were paired; of these:
    962331 (96.23%) aligned concordantly 0 times
    12675 (1.27%) aligned concordantly exactly 1 time
    24994 (2.50%) aligned concordantly >1 times
    ----
    962331 pairs aligned concordantly 0 times; of these:
      5041 (0.52%) aligned discordantly 1 time
    ----
    957290 pairs aligned 0 times concordantly or discordantly; of these:
      1914580 mates make up the pairs; of these:
        1675466 (87.51%) aligned 0 times
        74872 (3.91%) aligned exactly 1 time
        164242 (8.58%) aligned >1 times
16.23% overall alignment rate

We can see from the bowtie output that the majority of reads in this sample are human. Mapping to the human genome is sufficient to identify the concordant reads with paired end at this length; however, we are much more interested in the discordant reads as these will be important cancer structural variations. Also of note is that the numbers don't add up, there are clearly reads with valid alignments to both genomes that need disambiguating based on the alignment score.

Catagorizing reads with Xenomapper


In [42]:
%%bash
xenomapper --primary_sam SRR879369.hs.sam \
           --secondary_sam SRR879369.mm.sam \
           --primary_specific SRR879369.human_specific.sam \
           --paired


--------------------------------------------------------------------------------
Read Count Category Summary

|       Category                                     |     Count       |
|:--------------------------------------------------:|:---------------:|
|  ('primary_multi', 'primary_multi')                |          62781  |
|  ('primary_multi', 'primary_specific')             |          19871  |
|  ('primary_multi', 'secondary_multi')              |            194  |
|  ('primary_multi', 'secondary_specific')           |            549  |
|  ('primary_multi', 'unassigned')                   |            404  |
|  ('primary_multi', 'unresolved')                   |            102  |
|  ('primary_specific', 'primary_multi')             |          21151  |
|  ('primary_specific', 'primary_specific')          |         868118  |
|  ('primary_specific', 'secondary_multi')           |            554  |
|  ('primary_specific', 'secondary_specific')        |           1212  |
|  ('primary_specific', 'unassigned')                |           4307  |
|  ('primary_specific', 'unresolved')                |            226  |
|  ('secondary_multi', 'primary_multi')              |            269  |
|  ('secondary_multi', 'primary_specific')           |            523  |
|  ('secondary_multi', 'secondary_multi')            |            182  |
|  ('secondary_multi', 'secondary_specific')         |            277  |
|  ('secondary_multi', 'unassigned')                 |            201  |
|  ('secondary_multi', 'unresolved')                 |             35  |
|  ('secondary_specific', 'primary_multi')           |            970  |
|  ('secondary_specific', 'primary_specific')        |           1597  |
|  ('secondary_specific', 'secondary_multi')         |            333  |
|  ('secondary_specific', 'secondary_specific')      |            995  |
|  ('secondary_specific', 'unassigned')              |           1230  |
|  ('secondary_specific', 'unresolved')              |            113  |
|  ('unassigned', 'unassigned')                      |          12514  |
|  ('unresolved', 'primary_multi')                   |            230  |
|  ('unresolved', 'primary_specific')                |            470  |
|  ('unresolved', 'secondary_multi')                 |             46  |
|  ('unresolved', 'secondary_specific')              |            265  |
|  ('unresolved', 'unassigned')                      |            124  |
|  ('unresolved', 'unresolved')                      |            157  |

The summary table is markdown formated. Also note that the summary for paired end data treats forward and reverse orientations of discord as separate categories. This is because the reverse read is frequently of lower quality that forward read and this information can be useful for tuning quality thresholds.

Read Count Category Summary

Category Count
('primary_multi', 'primary_multi') 62781
('primary_multi', 'primary_specific') 19871
('primary_multi', 'secondary_multi') 194
('primary_multi', 'secondary_specific') 549
('primary_multi', 'unassigned') 404
('primary_multi', 'unresolved') 102
('primary_specific', 'primary_multi') 21151
('primary_specific', 'primary_specific') 868118
('primary_specific', 'secondary_multi') 554
('primary_specific', 'secondary_specific') 1212
('primary_specific', 'unassigned') 4307
('primary_specific', 'unresolved') 226
('secondary_multi', 'primary_multi') 269
('secondary_multi', 'primary_specific') 523
('secondary_multi', 'secondary_multi') 182
('secondary_multi', 'secondary_specific') 277
('secondary_multi', 'unassigned') 201
('secondary_multi', 'unresolved') 35
('secondary_specific', 'primary_multi') 970
('secondary_specific', 'primary_specific') 1597
('secondary_specific', 'secondary_multi') 333
('secondary_specific', 'secondary_specific') 995
('secondary_specific', 'unassigned') 1230
('secondary_specific', 'unresolved') 113
('unassigned', 'unassigned') 12514
('unresolved', 'primary_multi') 230
('unresolved', 'primary_specific') 470
('unresolved', 'secondary_multi') 46
('unresolved', 'secondary_specific') 265
('unresolved', 'unassigned') 124
('unresolved', 'unresolved') 157

You can also specify addition output files for other categories of read. For example if we were interested in oncoviruses we could look in the unassigned category, although the more usual contents is poor quality reads and adaptor sequences.


In [46]:
%%bash
xenomapper --primary_sam SRR879369.hs.sam \
           --secondary_sam SRR879369.mm.sam \
           --primary_specific SRR879369.human_specific.sam \
           --unassigned SRR879369.unassigned.sam \
           --paired


--------------------------------------------------------------------------------
Read Count Category Summary

|       Category                                     |     Count       |
|:--------------------------------------------------:|:---------------:|
|  ('primary_multi', 'primary_multi')                |          62781  |
|  ('primary_multi', 'primary_specific')             |          19871  |
|  ('primary_multi', 'secondary_multi')              |            194  |
|  ('primary_multi', 'secondary_specific')           |            549  |
|  ('primary_multi', 'unassigned')                   |            404  |
|  ('primary_multi', 'unresolved')                   |            102  |
|  ('primary_specific', 'primary_multi')             |          21151  |
|  ('primary_specific', 'primary_specific')          |         868118  |
|  ('primary_specific', 'secondary_multi')           |            554  |
|  ('primary_specific', 'secondary_specific')        |           1212  |
|  ('primary_specific', 'unassigned')                |           4307  |
|  ('primary_specific', 'unresolved')                |            226  |
|  ('secondary_multi', 'primary_multi')              |            269  |
|  ('secondary_multi', 'primary_specific')           |            523  |
|  ('secondary_multi', 'secondary_multi')            |            182  |
|  ('secondary_multi', 'secondary_specific')         |            277  |
|  ('secondary_multi', 'unassigned')                 |            201  |
|  ('secondary_multi', 'unresolved')                 |             35  |
|  ('secondary_specific', 'primary_multi')           |            970  |
|  ('secondary_specific', 'primary_specific')        |           1597  |
|  ('secondary_specific', 'secondary_multi')         |            333  |
|  ('secondary_specific', 'secondary_specific')      |            995  |
|  ('secondary_specific', 'unassigned')              |           1230  |
|  ('secondary_specific', 'unresolved')              |            113  |
|  ('unassigned', 'unassigned')                      |          12514  |
|  ('unresolved', 'primary_multi')                   |            230  |
|  ('unresolved', 'primary_specific')                |            470  |
|  ('unresolved', 'secondary_multi')                 |             46  |
|  ('unresolved', 'secondary_specific')              |            265  |
|  ('unresolved', 'unassigned')                      |            124  |
|  ('unresolved', 'unresolved')                      |            157  |


In [50]:
%%bash
tail SRR879369.unassigned.sam


D81P8DQ1:109:D1CW8ACXX:4:1102:12018:61417	77	*	0	0	*	*	0	0	AGACCTGGGCTTTAGAACCATACTGACCTCCTCAGCGGAAATTCAAGAACACCGGCTGCCAGAGACCGTTCTTGTCACAGGTAGGCTGGCCCTTCGGACCC	@@CFFFFDHHHHHJJIIIIGGHJJJFIIGIJJJJJJJJJIIIIIIJJIGIIJIJGFHFFFDCCDEDDB3<C?CCC@ACDC?>CCBC@<B2<A8998&25>B	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:12018:61417	141	*	0	0	*	*	0	0	ACCTAGACAGCAAAGATGGAAATTTTAGATCTGCATCATCAAGGTCCTGGGGGTCAGCCTACAGGTGTCCCAGAGAGAAACAATTATTCAGAACCGCAGAG	@@BDFFFFHHHHHJJJJJJEGIHIIJIIGIJJJJJJJEIIJJJJ?FIIIIJIJAAFHJJJJHHHHFDFEDEEEDDEDBCCDDBDCCDEEEDDDD8>DD@BB	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:14577:61420	77	*	0	0	*	*	0	0	AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGACCCGACCTCGATTGACGTCTTCTGCTTCTAAAACAAACTAATAAAGCAAAAGACACCCAGGCC	=??>===AA8C3<+2+<832++AE?*:E3?1C*1:?))8.7@A@#########################################################	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:14577:61420	141	*	0	0	*	*	0	0	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGAGTAGATCTCGGTGGTCGCCGTCTCATTAAAAAAAGATCAGAATAAGCAGGAGAAGAAACCAAGTTCTGAA	BB@FFDF=D?DH>G@EHCG?DEFGGG;?:B)0*?9?*/9*0;;;F'.B8AA##################################################	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:15636:61332	77	*	0	0	*	*	0	0	GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCCTCTGCTTTCTAAAAAAAAAGAAAGCAGGTGCATTAGAACTAGAC	BCCFFFDDHHHGHJJJIHJJJJJJIIJGG<FHIIIIEHIJD@@ADCG5;@###################################################	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:15636:61332	141	*	0	0	*	*	0	0	GGGGAAGAGGAGGGGGGGGGGGGGGGAAGGGGTGGGAATGGCGGGGGGGGGGGGCGTTATAAAAAAAAAAAAAAAAAAAAAAAAACACAGACGAACGAAGG	#####################################################################################################	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:16261:61333	77	*	0	0	*	*	0	0	AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTCTTCCGTCGTCTGCTTGTCAAGAAAAACAACCCAATAGATTGCGTGCACTAA	@@@DA?DDDAFDAEHGGF3A<AD??EHGCECCF*?F?0?FHD###########################################################	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:16261:61333	141	*	0	0	*	*	0	0	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAACTAAAAAACACAACAAGGAAAAAGCACAATGA	@B@DB=6DD<8?D:EDG@?19?<FCDGC*?9B?D9DCGIF?FBG6');BHA4@9.(;A;;@>A@#####################################	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:17052:61337	77	*	0	0	*	*	0	0	AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTTTAAAAAAATACAAAACGCAAACAAGAGCCATAAAT	CCCFFFFFHHHHGGGBHIIIJGHHHJJJGIHGHIDIDFHHIBG74CC2557=4@###############################################	YT:Z:UP
D81P8DQ1:109:D1CW8ACXX:4:1102:17052:61337	141	*	0	0	*	*	0	0	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAACAAAAAGTGAAACATGAGAGTAGAGAACACTAA	@@BFFFFFHHHHGJFHIHIHIIJJHIJEFH?D0?BGEDHIBG7CBAHGGIFBBFFEEEACCCCB#####################################	YT:Z:UP