Iterative mapping first proposed by (Imakaev et al., 2012), allows to map usually a high number of reads. However other methodologies, less "brute-force" can be used to take into account the chimeric nature of the Hi-C reads.
A simple alternative is to allow split mapping.
Another way consists in pre-truncating (Ay and Noble, 2015) reads that contain a ligation site and map only the longest part of the read (Wingett et al., 2015).
Finally, an intermediate approach, fragment-based, consists in mapping full length reads first, and than splitting unmapped reads at the ligation sites (Serra et al. 2017).
Note: We use GEM2 (Marco-Sola et al. 2012), performance are very similar to Bowtie2, and in some cases slightly better.
For now TADbit is only compatible with GEM2.
In [1]:
from pytadbit.mapping.full_mapper import full_mapping
The full mapping function can be used to perform either iterative or fragment-based mapping, or a combination of both.
It is important to note that although the default mapping parameters used by TADbit are relatively strict, a non negligible proportion of the reads will be mis-mapped, and this applies at each iteration of the mapping.
Here an example of use as iterative mapping: (Estimated time 15h with 8 cores)
In [2]:
cell = 'mouse_B' # or mouse_PSC
rep = 'rep1' # or rep2
Note: the execution of this notebook should be repeated for each of the 4 replicates
In [3]:
! mkdir -p results/iterativ/$cell\_$rep
! mkdir -p results/iterativ/$cell\_$rep/01_mapping
In [ ]:
# for the first side of the reads
full_mapping(gem_index_path='genome/Mus_musculus-GRCm38.p6/Mus_musculus-GRCm38.p6_contigs.gem',
out_map_dir='results/iterativ/{0}_{1}/01_mapping/mapped_{0}_{1}_r1/'.format(cell, rep),
fastq_path='FASTQs/%s_%s_1.fastq.dsrc' % (cell,rep),
frag_map=False, clean=True, nthreads=8,
windows=((1,25),(1,35),(1,45),(1,55),(1,65),(1,75)),
temp_dir='results/iterativ/{0}_{1}/01_mapping/mapped_{0}_{1}_r1_tmp/'.format(cell, rep))
And for the second side of the read-end:
In [9]:
# for the second side of the reads
full_mapping(gem_index_path='genome/Mus_musculus-GRCm38.p6/Mus_musculus-GRCm38.p6_contigs.gem',
out_map_dir='results/iterativ/{0}_{1}/01_mapping/mapped_{0}_{1}_r2/'.format(cell, rep),
fastq_path='FASTQs/%s_%s_2.fastq.dsrc' % (cell,rep),
frag_map=False, clean=True, nthreads=8,
windows=((1,25),(1,35),(1,45),(1,55),(1,65),(1,75)),
temp_dir='results/iterativ/{0}_{1}/01_mapping/mapped_{0}_{1}_r2_tmp/'.format(cell, rep))
Out[9]:
The fragment-based mapping strategy works in 2 steps:
GATCGATC
and in the case of HindIII to AAGCTAGCTT
). The read-end is split accordingly replacing the ligation site by two RE sites: read-end-part-one---AAGCTAGCTT----read-end-part-two
will be split in:
read-end-part-one---AAGCTT
and:
AAGCTT----read-end-part-two
Note: if no ligation site is found, step two will be repeated using digested RE site as split point (AAGCT
in the case of HindIII). This is done in order to be protected against sequencing errors. When this path is followed the digested RE site is removed, but not replaced. If digested RE sites are not found either, the read will be classified as unmapped.
Note: both mapping strategies can be combined, for example defining the windows as previously (iterative mapping), but also give a RE name r_enz=MboI
and setting frag_map=True
like this if a read has not been mapped in any window, TADbit will also try to apply the fragment-based strategy.
In [10]:
! mkdir -p results/fragment/$cell\_$rep
! mkdir -p results/fragment/$cell\_$rep/01_mapping
In [11]:
# for the first side of the reads
full_mapping(gem_index_path='genome/Mus_musculus-GRCm38.p6/Mus_musculus-GRCm38.p6_contigs.gem',
out_map_dir='results/fragment/{0}_{1}/01_mapping/mapped_{0}_{1}_r1/'.format(cell, rep),
fastq_path='FASTQs/%s_%s_1.fastq.dsrc' % (cell, rep),
r_enz='MboI', frag_map=True, clean=True, nthreads=8,
temp_dir='results/fragment/{0}_{1}/01_mapping/mapped_{0}_{1}_r1_tmp/'.format(cell, rep))
Out[11]:
In [12]:
# for the second side of the reads
full_mapping(gem_index_path='genome/Mus_musculus-GRCm38.p6/Mus_musculus-GRCm38.p6_contigs.gem',
out_map_dir='results/fragment/{0}_{1}/01_mapping/mapped_{0}_{1}_r2/'.format(cell, rep),
fastq_path='FASTQs/%s_%s_2.fastq.dsrc' % (cell, rep),
r_enz='MboI', frag_map=True, clean=True, nthreads=8,
temp_dir='results/fragment/{0}_{1}/01_mapping/mapped_{0}_{1}_r2_tmp/'.format(cell, rep))
Out[12]:
[^](#ref-1) Imakaev, Maxim V and Fudenberg, Geoffrey and McCord, Rachel Patton and Naumova, Natalia and Goloborodko, Anton and Lajoie, Bryan R and Dekker, Job and Mirny, Leonid A. 2012. Iterative correction of Hi-C data reveals hallmarks of chromosome organization.. URL
[^](#ref-2) Ay, Ferhat and Noble, William Stafford. 2015. Analysis methods for studying the 3D architecture of the genome. URL
[^](#ref-3) Wingett, Steven and Ewels, Philip and Furlan-Magaril, Mayra and Nagano, Takashi and Schoenfelder, Stefan and Fraser, Peter and Andrews, Simon. 2015. HiCUP: pipeline for mapping and processing Hi-C data.. URL
[^](#ref-4) François Serra, Davide Baù, Mike Goodstadt, David Castillo, Guillaume Filion and Marc A. Marti-Renom. 2017. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors. URL
[^](#ref-5) Marco-Sola, Santiago and Sammeth, Michael and Guigó, Roderic and Ribeca, Paolo. 2012. The GEM mapper: fast, accurate and versatile alignment by filtration.