We read now each read in the map files. We want reads that are uniquely mapped and from each of them we store:
This information will be used to filter the reads and, finally, to construct the interaction matrices.
First we need to load the Homo sapiens genomic sequence:
In [52]:
from pytadbit.parsers.genome_parser import parse_fasta
from matplotlib import pyplot as plt
import numpy as np
In [20]:
genome_seq = parse_fasta('/media/storage/db/reference_genome/Homo_sapiens/hg38/hg38.fa')
We have loaded the full human genome:
In [3]:
plt.bar(range(len(genome_seq)), [len(genome_seq[c]) for c in genome_seq],
color='k', alpha=0.7)
_ = plt.xticks([v + 0.5 for v in range(len(genome_seq))], genome_seq.keys(), rotation=90)
plt.grid()
Then, we search for restriction enzyme sites in this sequence.
In [4]:
from pytadbit.mapping.restriction_enzymes import map_re_sites
frags_HindIII = map_re_sites('HindIII', genome_seq, verbose=True)
frags_MboI = map_re_sites('MboI' , genome_seq, verbose=True)
In [5]:
c0 = frags_HindIII.keys()[0]
v0 = 0
dists_HindIII = []
for c1, v1 in [(c, v) for c in frags_HindIII for p in frags_HindIII[c]
for v in frags_HindIII[c][p]]:
if c1 == c0:
diff = abs(v1 - v0)
if diff < 30000: # larger fragments are probably poorly assembled regions
dists_HindIII.append(diff)
v0 = v1
c0 = frags_MboI.keys()[0]
v0 = 0
dists_MboI = []
for c1, v1 in [(c, v) for c in frags_MboI for p in frags_MboI[c]
for v in frags_MboI[c][p]]:
if c1 == c0:
diff = abs(v1 - v0)
if diff < 30000: # larger fragments are probably poorly assembled regions
dists_MboI.append(diff)
v0 = v1
In [12]:
plt.figure(figsize=(12, 6))
_ = plt.hist([dists_HindIII, dists_MboI], bins=50, color=['red', 'green'],
alpha=0.5, label=['HindIII', 'MboI'], normed=True, range=(0, 10000))
y0, y1 = plt.ylim()
plt.vlines(np.median(dists_HindIII), y0, y1 * 0.7, color='red', linestyle='--',
label='median HindIII')
plt.vlines(np.median(dists_MboI) , y0, y1 * 0.75, color='green', linestyle='--',
label='median MboI')
plt.text(np.median(dists_HindIII), y1 * 0.7, '%.0f' % np.median(dists_HindIII), va='bottom',
ha='left')
plt.text(np.median(dists_MboI) , y1 * 0.75,'%.0f' % np.median(dists_MboI) , va='bottom',
ha='left')
plt.title('Restriction fragment size distribution')
plt.legend()
plt.xlim(0, 10000)
plt.grid()
In [96]:
# replicate and enzyme useds
r_enz = 'HindIII'
In case we haven't stored the location of each of the reads we could load them like this:
In [46]:
maps1 = [('results/iterativ/{0}/01_mapping/mapped_{0}_r1/'
'K562_{0}_1_full_1-{1}.map').format(r_enz, i)
for i in range(25, 80, 5)]
maps2 = [('results/iterativ/{0}/01_mapping/mapped_{0}_r2/'
'K562_{0}_2_full_1-{1}.map').format(r_enz, i)
for i in range(25, 80, 5)]
Load all reads, check if they are uniquely mapped. Result is stored in two separate file tab-separated files that will contain the essential information of each read
In [47]:
from pytadbit.parsers.map_parser import parse_map
In [48]:
! mkdir -p results/iterativ/$r_enz/02_parsing
In [49]:
parse_map(maps1, maps2,
'results/iterativ/{0}/02_parsing/reads1_{0}.tsv'.format(r_enz),
'results/iterativ/{0}/02_parsing/reads2_{0}.tsv'.format(r_enz),
genome_seq=genome_seq, re_name=r_enz verbose=True)
Out[49]:
In [50]:
from pytadbit.mapping.analyze import plot_iterative_mapping
In [51]:
total_reads = 10000000
lengths = plot_iterative_mapping(
'results/iterativ/{0}/02_parsing/reads1_{0}.tsv'.format(r_enz),
'results/iterativ/{0}/02_parsing/reads2_{0}.tsv'.format(r_enz),
total_reads)
In [101]:
! mkdir -p results/$r_enz/02_parsing
In [116]:
maps1 = [('results/{0}/01_mapping/map{0}_r1/'
'K562_{0}_1_full_1-end.map').format(r_enz),
('results/{0}/01_mapping/map{0}_r1/'
'K562_{0}_1_frag_1-end.map').format(r_enz)]
maps2 = [('results/{0}/01_mapping/map{0}_r2/'
'K562_{0}_2_full_1-end.map').format(r_enz),
('results/{0}/01_mapping/map{0}_r2/'
'K562_{0}_2_frag_1-end.map').format(r_enz)]
In [117]:
maps1
Out[117]:
In [118]:
parse_map(maps1, maps2,
'results/{0}/02_parsing/reads1_{0}.tsv'.format(r_enz),
'results/{0}/02_parsing/reads2_{0}.tsv'.format(r_enz),
genome_seq=genome_seq, re_name=r_enz, verbose=True)
Out[118]:
In [119]:
reads1 = 'results/{0}/02_parsing/reads1_{0}.tsv'.format(r_enz)
reads2 = 'results/{0}/02_parsing/reads2_{0}.tsv'.format(r_enz)
In [120]:
total_reads = 10000000
lengths = plot_iterative_mapping(reads1, reads2, total_reads)
From now on we are going to focus only on the results of the fragment based mapping
In [121]:
from pytadbit.mapping import get_intersection
In [122]:
! mkdir -p results/$r_enz/03_filtering
In [123]:
reads = 'results/{0}/03_filtering/reads12_{0}.tsv'.format(r_enz)
In [124]:
get_intersection(reads1, reads2, reads, verbose=True)
Out[124]:
In [125]:
from pytadbit.mapping.analyze import plot_distance_vs_interactions
plot_distance_vs_interactions(reads, resolution=100000, max_diff=1000, show=True)
Out[125]:
According to the fractal globule model (Mirny, 2011) the slope between 700 kb and 10 Mb should be around -1 in log scale
In [126]:
from pytadbit.mapping.analyze import plot_genomic_distribution
In [127]:
plot_genomic_distribution(reads, resolution=500000, ylim=(0, 10000), show=True)
We are working with a cell line with a very aberrant karyotype (https://commons.wikimedia.org/wiki/File:Karyotype_of_the_T47D_breast_cancer_cell_line.svg).
In [128]:
from pytadbit.mapping.analyze import hic_map
In [129]:
hic_map(reads, resolution=1000000, show=True)
From the reads that are mapped in a single fragment we can infer the average insert size:
In [130]:
from pytadbit.mapping.analyze import insert_sizes
In [131]:
insert_sizes(reads, show=True, nreads=1000000)
Out[131]:
The median size of the sequenced DNA fragments is thus 172 nt.
[^](#ref-1) Mirny, Leonid a. 2011. The fractal globule as a model of chromatin architecture in the cell.. URL