Pairs of mapped read-ends are filtered in order to keep only valid pairs. The filters available in TADbit are these one:
In [2]:
cell = 'mouse_B' # or mouse_PSC
rep = 'rep1' # or rep2
In [3]:
from pytadbit.mapping.filter import filter_reads
The max_molecule_length
parameter used to filter-out pseudo-dangling-ends can be extracted from the insert_size
function in previous section.
The min_distance_to_re
, that affects the detection of random breaks, should be large enough in order to contain almost all the fragments.
In [4]:
# this will last ~10 minutes
masked = filter_reads(
'results/fragment/{0}_{1}/03_filtering/reads12_{0}_{1}.tsv'.format(cell, rep),
max_molecule_length=750, over_represented=0.005, max_frag_size=100000,
min_frag_size=50, re_proximity=5, min_dist_to_re=1000)
This generates a dictionary with the different filters and the reads affected by each.
In [5]:
from pytadbit.mapping.filter import apply_filter
apply_filter('results/fragment/{0}_{1}/03_filtering/reads12_{0}_{1}.tsv'.format(cell, rep),
'results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}.tsv'.format(cell, rep), masked,
filters=[1, 2, 3, 4, 6, 7, 9, 10])
Out[5]:
In [6]:
from pytadbit.mapping.analyze import hic_map
hic_map('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}.tsv'.format(cell, rep),
resolution=1000000, show=True, cmap='viridis')
Zoom to a single chromosome or a region:
In [7]:
hic_map('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}.tsv'.format(cell, rep),
resolution=1000000, show=True, focus='chr1', cmap='viridis')
In [8]:
hic_map('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}.tsv'.format(cell, rep),
resolution=1000000, show=True, focus=(500, 1000), cmap='viridis')
Working with TSV (tab-separated-value file format) files is very slow. For the next part of the tutorial we will be using BAM (binary-alignment-map) files, which are compressed and indexed.
Note: The fields we use in TADbit to generate a BAM file are not the conventional ones, we modify them as follows to store only the necessary information for the remaining part of the analysis:
Flag (binary mask for the application of the 10 filters previously described):
For example if we want to keep only pairs of read-ends that are excelusively inter-fragment contacts and that are not duplicated, we would apply filters 1, 2, 3 (self-circle, dangling-ends, errors) and 9 (duplicated) resulting in a binary number like this: 00100000111 which translates in decimal: 263. We could thus obtain these read-pairs with samtools view -F 263
.
In [9]:
from pytadbit.parsers.hic_bam_parser import bed2D_to_BAMhic
In [10]:
bed2D_to_BAMhic('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}.tsv'.format(cell, rep),
valid=True, ncpus=8,
outbam='results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}'.format(cell, rep),
frmt='mid', masked=None)
In [ ]: