In [1]:
from pytadbit.parsers.hic_parser import load_hic_data_from_reads
In [2]:
r_enz_1 = 'HindIII'
r_enz_2 = 'MboI'
reso = 1000000
In [ ]:
hic_data_1 = load_hic_data_from_reads(
'results/{0}/03_filtering/valid_reads12_{0}.tsv'.format(r_enz_1),
reso)
hic_data_2 = load_hic_data_from_reads(
'results/{0}/03_filtering/valid_reads12_{0}.tsv'.format(r_enz_2),
reso)
For this, very sparse, example dataset we are going to ask for very few interactions per bin.
This can be done either by setting min_perc
asking for each bin to contain a minimum percentage of cells with interaction data. Or but setting min_count
asking that the number of cells with interaction data of each bin is above the defined cutoff.
In [ ]:
hic_data_1.filter_columns(draw_hist=True, min_count=10, by_mean=True)
hic_data_2.filter_columns(draw_hist=True, min_count=10, by_mean=True)
In [ ]:
print len(hic_data_1)
print len(hic_data_1.bads)
print len(hic_data_2)
print len(hic_data_2.bads)
As normalization we use either the ICE normalization (Imakaev et al., 2012) with a "full" normalization until the sum of all columns of the matrix are equal, or something more similar to the vanilla normalization used in (Rao et al., 2014) which is exactly like running the ICE normalization without iterations.
Note: if columns with a lot of zeroes are present the ICE normalization will last very long to converge, and these low-coverage columns will present, at the end of the normalization, few cells with very high values of interaction
In [ ]:
hic_data_1.normalize_hic(iterations=10, max_dev=0.1)
hic_data_2.normalize_hic(iterations=10, max_dev=0.1)
At 100 kb itmakes no sense to view the full matrix (your matrix will have more cells than your screen have pixels).
Try plotting a region of the genome only, with the focus
parameter.
In [ ]:
from pytadbit.mapping.analyze import hic_map
hic_map(hic_data_1, normalized=True, focus='chr18', show=True)
hic_map(hic_data_2, normalized=True, focus='chr18', show=True)
ICE normalization is widely used however other, more convaluted, normalizations (Hu et al., 2012) (Yaffe and Tanay, 2011) can be used outside TADbit and then added as normalized matrices.
In [ ]:
from cPickle import dump
In [ ]:
! mkdir -p results/$r_enz_1/04_normalizing
! mkdir -p results/$r_enz_2/04_normalizing
Save biases to separate file.
In [ ]:
out = open('results/{1}/04_normalizing/biases_{0}_{1}.pick'.format(reso, r_enz_1), 'w')
dump(hic_data_1.bias, out)
out.close()
out = open('results/{1}/04_normalizing/biases_{0}_{1}.pick'.format(reso, r_enz_2), 'w')
dump(hic_data_2.bias, out)
out.close()
Save "bad" columns to an other file
In [ ]:
out = open('results/{1}/04_normalizing/bad_columns_{0}_{1}.pick'.format(reso, r_enz_1), 'w')
dump(hic_data_1.bads, out)
out.close()
out = open('results/{1}/04_normalizing/bad_columns_{0}_{1}.pick'.format(reso, r_enz_2), 'w')
dump(hic_data_2.bads, out)
out.close()
In [ ]:
! dryhic3.r results/HindIII/03_filtering/valid_reads12_HindIII.tsv \
results/HindIII/04_normalizing/bad_columns_1000000_HindIII.pick \
HindIII hg38 1000000 results/HindIII/04_normalizing/biases_dryhic_1000000_HindIII.pick
In [ ]:
! dryhic3.r results/MboI/03_filtering/valid_reads12_MboI.tsv \
results/MboI/04_normalizing/bad_columns_1000000_MboI.pick \
MboI hg38 1000000 results/MboI/04_normalizing/biases_dryhic_1000000_MboI.pick
In [ ]:
bias_dry_path = 'results/{1}/04_normalizing/biases_dryhic_{0}_{1}.tsv'
hic_data_1.bias = dict([(int(l.split()[0]), float(l.split()[1])) for l in open(bias_dry_path.format(reso, 'HindIII'))])
hic_data_2.bias = dict([(int(l.split()[0]), float(l.split()[1])) for l in open(bias_dry_path.format(reso, 'MboI'))])
In [ ]:
hic_map(hic_data_1, normalized=True, focus='chr18', show=True)
hic_map(hic_data_2, normalized=True, focus='chr18', show=True)
In [ ]:
hic_map(hic_data_1, by_chrom='intra', normalized=False,
savedata='results/{1}/04_normalizing/{0}_raw'.format(reso, r_enz_1))
hic_map(hic_data_2, by_chrom='intra', normalized=False,
savedata='results/{1}/04_normalizing/{0}_raw'.format(reso, r_enz_2))
In [ ]:
hic_map(hic_data_1, by_chrom='intra', normalized=True,
savedata='results/{1}/04_normalizing/{0}_norm'.format(reso, r_enz_1))
hic_map(hic_data_2, by_chrom='intra', normalized=True,
savedata='results/{1}/04_normalizing/{0}_norm'.format(reso, r_enz_2))
If the resolution is not to low, we could also save genomic matrices:
In [ ]:
if reso >= 300000:
hic_map(hic_data_1, by_chrom=False, normalized=False,
savedata='results/{1}/04_normalizing/{0}_raw.mat'.format(reso, r_enz_1))
hic_map(hic_data_1, by_chrom=False, normalized=True,
savedata='results/{1}/04_normalizing/{0}_norm.mat'.format(reso, r_enz_1))
hic_map(hic_data_2, by_chrom=False, normalized=False,
savedata='results/{1}/04_normalizing/{0}_raw.mat'.format(reso, r_enz_2))
hic_map(hic_data_2, by_chrom=False, normalized=True,
savedata='results/{1}/04_normalizing/{0}_norm.mat'.format(reso, r_enz_2))
[^](#ref-1) Imakaev, Maxim V and Fudenberg, Geoffrey and McCord, Rachel Patton and Naumova, Natalia and Goloborodko, Anton and Lajoie, Bryan R and Dekker, Job and Mirny, Leonid A. 2012. Iterative correction of Hi-C data reveals hallmarks of chromosome organization.. URL
[^](#ref-2) Rao, Suhas S P and Huntley, Miriam H and Durand, Neva C and Stamenova, Elena K and Bochkov, Ivan D. and James T. Robinson and Sanborn, Adrian L. and Machol, Ido and Omer, Arina D. and Lander, Eric S. and Lieberman-Aiden, Erez. 2014. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. URL
[^](#ref-3) Hu, Ming and Deng, Ke and Selvaraj, Siddarth and Qin, Zhaohui and Ren, Bing and Liu, Jun S. 2012. HiCNorm: removing biases in Hi-C data via Poisson regression.
[^](#ref-4) Yaffe, Eitan and Tanay, Amos. 2011. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture.. URL