The HiC_data object
Filter columns with too few interaction count
Normalization
- Other normalizations
Save raw and normalized matrices
- Save biases and bin filtering
- Save normalized chromosome matrices

The HiC_data object



In [1]:

    
from pytadbit.parsers.hic_parser import load_hic_data_from_reads



In [2]:

    
r_enz_1 = 'HindIII'
r_enz_2 = 'MboI'
reso = 1000000



In [ ]:

    
hic_data_1 = load_hic_data_from_reads(
    'results/{0}/03_filtering/valid_reads12_{0}.tsv'.format(r_enz_1),
    reso)
hic_data_2 = load_hic_data_from_reads(
    'results/{0}/03_filtering/valid_reads12_{0}.tsv'.format(r_enz_2),
    reso)

Filter columns with too few interaction count

For this, very sparse, example dataset we are going to ask for very few interactions per bin.

This can be done either by setting min_perc asking for each bin to contain a minimum percentage of cells with interaction data. Or but setting min_count asking that the number of cells with interaction data of each bin is above the defined cutoff.



In [ ]:

    
hic_data_1.filter_columns(draw_hist=True, min_count=10, by_mean=True)
hic_data_2.filter_columns(draw_hist=True, min_count=10, by_mean=True)



In [ ]:

    
print len(hic_data_1)
print len(hic_data_1.bads)
print len(hic_data_2)
print len(hic_data_2.bads)

Normalization

As normalization we use either the ICE normalization (Imakaev et al., 2012) with a "full" normalization until the sum of all columns of the matrix are equal, or something more similar to the vanilla normalization used in (Rao et al., 2014) which is exactly like running the ICE normalization without iterations.

Note: if columns with a lot of zeroes are present the ICE normalization will last very long to converge, and these low-coverage columns will present, at the end of the normalization, few cells with very high values of interaction



In [ ]:

    
hic_data_1.normalize_hic(iterations=10, max_dev=0.1)
hic_data_2.normalize_hic(iterations=10, max_dev=0.1)

At 100 kb itmakes no sense to view the full matrix (your matrix will have more cells than your screen have pixels).

Try plotting a region of the genome only, with the focus parameter.



In [ ]:

    
from pytadbit.mapping.analyze import hic_map

hic_map(hic_data_1, normalized=True, focus='chr18', show=True)
hic_map(hic_data_2, normalized=True, focus='chr18', show=True)

Other normalizations

ICE normalization is widely used however other, more convaluted, normalizations (Hu et al., 2012) (Yaffe and Tanay, 2011) can be used outside TADbit and then added as normalized matrices.

Save raw and normalized matrices

Save biases and bin filtering



In [ ]:

    
from cPickle import dump



In [ ]:

    
! mkdir -p results/$r_enz_1/04_normalizing
! mkdir -p results/$r_enz_2/04_normalizing

Save biases to separate file.



In [ ]:

    
out = open('results/{1}/04_normalizing/biases_{0}_{1}.pick'.format(reso, r_enz_1), 'w')
dump(hic_data_1.bias, out)
out.close()
out = open('results/{1}/04_normalizing/biases_{0}_{1}.pick'.format(reso, r_enz_2), 'w')
dump(hic_data_2.bias, out)
out.close()

Save "bad" columns to an other file



In [ ]:

    
out = open('results/{1}/04_normalizing/bad_columns_{0}_{1}.pick'.format(reso, r_enz_1), 'w')
dump(hic_data_1.bads, out)
out.close()
out = open('results/{1}/04_normalizing/bad_columns_{0}_{1}.pick'.format(reso, r_enz_2), 'w')
dump(hic_data_2.bads, out)
out.close()

Dryhic normalization

We can launch the external script to use dryhic normalization



In [ ]:

    
! dryhic3.r results/HindIII/03_filtering/valid_reads12_HindIII.tsv \
       results/HindIII/04_normalizing/bad_columns_1000000_HindIII.pick \
       HindIII hg38 1000000 results/HindIII/04_normalizing/biases_dryhic_1000000_HindIII.pick



In [ ]:

    
! dryhic3.r results/MboI/03_filtering/valid_reads12_MboI.tsv \
       results/MboI/04_normalizing/bad_columns_1000000_MboI.pick \
       MboI hg38 1000000 results/MboI/04_normalizing/biases_dryhic_1000000_MboI.pick



In [ ]:

    
bias_dry_path = 'results/{1}/04_normalizing/biases_dryhic_{0}_{1}.tsv'

hic_data_1.bias = dict([(int(l.split()[0]), float(l.split()[1])) for l in open(bias_dry_path.format(reso, 'HindIII'))])
hic_data_2.bias = dict([(int(l.split()[0]), float(l.split()[1])) for l in open(bias_dry_path.format(reso, 'MboI'))])



In [ ]:

    
hic_map(hic_data_1, normalized=True, focus='chr18', show=True)
hic_map(hic_data_2, normalized=True, focus='chr18', show=True)

Save normalized chromosome/genome matrices

This time we do not need to save appart the normalization biases and the list of columns with poor signal.



In [ ]:

    
hic_map(hic_data_1, by_chrom='intra', normalized=False,
       savedata='results/{1}/04_normalizing/{0}_raw'.format(reso, r_enz_1))
hic_map(hic_data_2, by_chrom='intra', normalized=False,
       savedata='results/{1}/04_normalizing/{0}_raw'.format(reso, r_enz_2))



In [ ]:

    
hic_map(hic_data_1, by_chrom='intra', normalized=True,
       savedata='results/{1}/04_normalizing/{0}_norm'.format(reso, r_enz_1))
hic_map(hic_data_2, by_chrom='intra', normalized=True,
       savedata='results/{1}/04_normalizing/{0}_norm'.format(reso, r_enz_2))

If the resolution is not to low, we could also save genomic matrices:



In [ ]:

    
if reso >= 300000:
    hic_map(hic_data_1, by_chrom=False, normalized=False,
            savedata='results/{1}/04_normalizing/{0}_raw.mat'.format(reso, r_enz_1))

    hic_map(hic_data_1, by_chrom=False, normalized=True,
            savedata='results/{1}/04_normalizing/{0}_norm.mat'.format(reso, r_enz_1))
    hic_map(hic_data_2, by_chrom=False, normalized=False,
            savedata='results/{1}/04_normalizing/{0}_raw.mat'.format(reso, r_enz_2))

    hic_map(hic_data_2, by_chrom=False, normalized=True,
            savedata='results/{1}/04_normalizing/{0}_norm.mat'.format(reso, r_enz_2))

References

^{[^](#ref-1)}Imakaev, Maxim V and Fudenberg, Geoffrey and McCord, Rachel Patton and Naumova, Natalia and Goloborodko, Anton and Lajoie, Bryan R and Dekker, Job and Mirny, Leonid A. 2012. Iterative correction of Hi-C data reveals hallmarks of chromosome organization.. URL

^{[^](#ref-2)}Rao, Suhas S P and Huntley, Miriam H and Durand, Neva C and Stamenova, Elena K and Bochkov, Ivan D. and James T. Robinson and Sanborn, Adrian L. and Machol, Ido and Omer, Arina D. and Lander, Eric S. and Lieberman-Aiden, Erez. 2014. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. URL

^{[^](#ref-3)}Hu, Ming and Deng, Ke and Selvaraj, Siddarth and Qin, Zhaohui and Ren, Bing and Liu, Jun S. 2012. HiCNorm: removing biases in Hi-C data via Poisson regression.

^{[^](#ref-4)}Yaffe, Eitan and Tanay, Amos. 2011. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture.. URL

Table of Contents