In [1]:
from pytadbit import load_chromosome
from pytadbit.parsers.hic_parser import load_hic_data_from_bam
In [2]:
crm = load_chromosome('results/fragment/chr3.tdb')
In [3]:
B, PSC = crm.experiments
In [4]:
B, PSC
Out[4]:
Load the raw and normalized data matrices.
In [5]:
base_path = 'results/fragment/{0}_both/03_filtering/valid_reads12_{0}.bam'
bias_path = 'results/fragment/{0}_both/04_normalizing/biases_{0}_both_{1}kb.biases'
reso = 100000
chrname = 'chr3'
cel1 = 'mouse_B'
cel2 = 'mouse_PSC'
In [6]:
hic_data1 = load_hic_data_from_bam(base_path.format(cel1),
resolution=reso,
region='chr3',
biases=bias_path.format(cel1, reso // 1000),
ncpus=8)
hic_data2 = load_hic_data_from_bam(base_path.format(cel2),
resolution=reso,
region='chr3',
biases=bias_path.format(cel2, reso // 1000),
ncpus=8)
In [7]:
B.load_hic_data([hic_data1.get_matrix(focus='chr3')])
B.load_norm_data([hic_data1.get_matrix(focus='chr3', normalized=True)])
PSC.load_hic_data([hic_data2.get_matrix(focus='chr3')])
PSC.load_norm_data([hic_data2.get_matrix(focus='chr3', normalized=True)])
We use the best parameters obtained from the optimization process to produce an ensemble of models.
From the models produced (n_models), TADbit will keep those (n_keep) that best satisfy the imposed restraints.
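Conceptually, the n_models / n_keep selection just ranks the ensemble by objective function and keeps the lowest-scoring models. A minimal, self-contained sketch of that idea (toy random OF values, not TADbit output):

```python
import random

def keep_best(objective_functions, n_keep):
    """Return the indices of the n_keep models with the lowest OF."""
    ranked = sorted(range(len(objective_functions)),
                    key=lambda i: objective_functions[i])
    return ranked[:n_keep]

random.seed(0)
ofs = [random.uniform(1e5, 1e6) for _ in range(400)]  # toy OF values for 400 models
best = keep_best(ofs, 100)                            # keep the 100 best
```

TADbit does this internally during `model_region`; the sketch only illustrates the ranking criterion.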
In [8]:
optimal_params = {'dcutoff': 2.0,
'kbending': 0.0,
'kforce': 5,
'lowfreq': -0.6,
'maxdist': 2000.0,
'reference': 'Stadhouders R, Vidal E, Serra F, Di Stefano B et al. 2018',
'scale': 0.01,
'upfreq': 0.0}
In [9]:
models_B = B.model_region(start=300, end=360, n_models=400, n_keep=100, n_cpus=8,
config=optimal_params)
In [10]:
models_PSC = PSC.model_region(start=300, end=360, n_models=400, n_keep=100, n_cpus=8,
config=optimal_params)
The ensemble of models has inherited the description from the Chromosome object.
In [11]:
print(models_B.description)
We can still access the experiment object from the 3D models:
In [12]:
print(models_B.experiment)
print(models_PSC.experiment)
We can have a look at the data that was used to define restraints:
In [13]:
models_B.zscore_plot()
In [14]:
models_PSC.zscore_plot()
and also visualize how the IMP objective function (OF) of the stored model improves during the Monte Carlo optimization:
In [15]:
model = models_B[0]
model.objective_function(log=True, smooth=False)
The definition of the "best models" can be changed at any time. Only the best models will be used in the analysis.
Select top 10 models
In [16]:
models_B.define_best_models(10)
print("Lowest 10 IMP OF models:")
print(models_B)
Select top 100 models
In [17]:
models_B.define_best_models(100)
print("Lowest 100 IMP OF models:")
print(models_B)
The ensembles "models_B" and "models_PSC" contain the models generated by the Monte Carlo simulation, ordered by their objective function (OF). The first model in the list is the one that best satisfies the imposed restraints.
To get the data for the model with the lowest IMP OF, we retrieve model number 0:
In [18]:
model = models_B[0]
print(model)
We can check the correlation of models_B with the original Hi-C matrix.
In the plot "Real vs modelled data" we should see a positive correlation between the contacts in the models and the interaction frequencies of the pairs of beads in the Hi-C matrix. A high interaction frequency between two loci in the matrix is reflected by a high proportion of models in which the beads representing those two loci are "in contact" (distance lower than the cutoff).
In [19]:
models_B.correlate_with_real_data(plot=True, cutoff=2000)
Out[19]:
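TADbit computes this correlation for us; the sketch below only illustrates the idea. It builds a model-based contact frequency (fraction of models in which two beads are closer than the cutoff) for a toy random-walk ensemble, and correlates it with genomic separation as a hypothetical stand-in for real Hi-C counts. A plain Pearson correlation is used here; TADbit's own statistic may differ.

```python
import math
import random

def contact_frequency(models, i, j, cutoff):
    """Fraction of models in which beads i and j lie closer than `cutoff`."""
    close = sum(1 for xyz in models if math.dist(xyz[i], xyz[j]) < cutoff)
    return close / len(models)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy ensemble: 50 models, each a 10-bead random walk (steps up to 500 nm)
random.seed(1)
def random_walk(n=10, step=500.0):
    pos, walk = (0.0, 0.0, 0.0), []
    for _ in range(n):
        walk.append(pos)
        pos = tuple(c + random.uniform(-step, step) for c in pos)
    return walk

models = [random_walk() for _ in range(50)]
pairs = [(i, j) for i in range(10) for j in range(i + 1, 10)]
freqs = [contact_frequency(models, i, j, cutoff=1000.0) for i, j in pairs]
# In the real analysis `freqs` would be correlated with the Hi-C counts;
# here we use (negative) genomic separation as a stand-in signal.
seps = [-(j - i) for i, j in pairs]
r = pearson(freqs, seps)
```

In a random walk contact frequency decays with genomic separation, which is the same qualitative trend the "Real vs modelled data" plot relies on.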
To plot all the models in the ensemble we use the view_models function. By default the centroid (the model closest to the median conformation) is highlighted.
In [27]:
models_PSC.view_models(tool='plot')
We can also plot individual models.
In [28]:
models_PSC.view_models(models=[0], tool='plot')
And use Chimera (https://www.cgl.ucsf.edu/chimera/) for the visualization of the 3D structure
In [29]:
models_PSC.view_models(models=[0], tool='chimera')
In the Monte Carlo simulation each model is built starting from a random initial conformation; the models are therefore not aligned in any preferred orientation. We can use the function align_models to rotate and translate the coordinates of the models so that they all follow the same orientation as one of the models in the ensemble. By default the first model is used as the reference.
In [31]:
models_PSC.align_models(in_place=True)
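The superposition behind align_models is a rigid-body fit (rotation plus translation). As an illustration of the idea only (TADbit works in 3D; the closed-form solution is easiest to show in 2D) here is a self-contained sketch that recovers a rotation and translation applied to a toy set of bead positions:

```python
import math

def align_2d(mobile, reference):
    """Rigidly superpose `mobile` onto `reference` (2D Procrustes fit)."""
    n = len(mobile)
    cmx = sum(x for x, _ in mobile) / n
    cmy = sum(y for _, y in mobile) / n
    crx = sum(x for x, _ in reference) / n
    cry = sum(y for _, y in reference) / n
    P = [(x - cmx, y - cmy) for x, y in mobile]     # centred mobile
    Q = [(x - crx, y - cry) for x, y in reference]  # centred reference
    # Optimal rotation angle (2D analogue of the Kabsch algorithm)
    num = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(P, Q))
    den = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(P, Q))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    # Rotate the centred mobile points, then translate onto the reference
    return [(c * px - s * py + crx, s * px + c * py + cry) for px, py in P]

# A toy "model": rotate and translate it, then align it back
ref = [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0), (2.0, 2.5), (1.0, 3.0)]
ang = math.radians(40)
mob = [(math.cos(ang) * x - math.sin(ang) * y + 5.0,
        math.sin(ang) * x + math.cos(ang) * y - 2.0) for x, y in ref]
aligned = align_2d(mob, ref)
rmsd = math.sqrt(sum((ax - rx) ** 2 + (ay - ry) ** 2
                     for (ax, ay), (rx, ry) in zip(aligned, ref)) / len(ref))
```

After alignment the RMSD to the reference is essentially zero, which is what makes ensemble-wide comparisons and visualizations meaningful.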
With the function deconvolve we obtain a deconvolution analysis of a given group of models. It first clusters the models based on structural comparison (dRMSD); the resulting graph then shows the differential contact map between each possible pair of clusters. This allows us to detect common sets of contacts in the ensemble.
In [32]:
models_PSC.deconvolve(fact=0.35, dcutoff=2000, represent_models='best', n_best_clusters=5)
The clustering of the models by the Markov Cluster Algorithm (MCL) or Ward can be based on different statistical measures (score, rmsd, drmsd or eqv). By default a score computed as a combination of rmsd, drmsd and eqv is used.
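Of these measures, dRMSD is the simplest to sketch: it compares the internal pairwise distances of two models, so it needs no prior superposition (a rotated or translated copy of a model scores zero). A rough self-contained illustration with toy coordinates, not TADbit's implementation:

```python
import math
from itertools import combinations

def drmsd(model_a, model_b):
    """Distance RMSD: RMS difference of all intra-model pairwise distances."""
    diffs = []
    for i, j in combinations(range(len(model_a)), 2):
        da = math.dist(model_a[i], model_a[j])
        db = math.dist(model_b[i], model_b[j])
        diffs.append((da - db) ** 2)
    return math.sqrt(sum(diffs) / len(diffs))

a = [(0, 0, 0), (1000, 0, 0), (1000, 1000, 0), (0, 1000, 500)]
b = [(0, 0, 0), (0, 1000, 0), (-1000, 1000, 0), (-1000, 0, 500)]  # `a` rotated 90 deg
c = [(0, 0, 0), (1000, 0, 0), (1000, 1000, 0), (0, 1000, 0)]      # last bead moved
d_same = drmsd(a, b)  # ~0: same shape, different orientation
d_diff = drmsd(a, c)  # > 0: genuinely different conformation
```

Because it is invariant to orientation, dRMSD is well suited to clustering models that were built from independent random starting conformations.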
In [37]:
# Cluster models based on structural similarity
models_B.cluster_models(fact=0.95, dcutoff=1000)
print(models_B.clusters)
The cluster analysis dendrogram allows us to view the population of each cluster and its OF values.
In [39]:
# Plot the resulting clusters
cl = models_B.cluster_analysis_dendrogram()
In [40]:
# Cluster models based on structural similarity
models_PSC.cluster_models(fact=0.95, dcutoff=1000)
print(models_PSC.clusters)
In [41]:
# Plot the resulting clusters
cl = models_PSC.cluster_analysis_dendrogram()
Model consistency gives a measure of the variability of the particles across a set of models. A particle is considered consistent between two models if the distance between its positions in the two models is less than the given cutoff.
In [42]:
# Calculate the consistency plot for all models in the first cluster (cluster 1)
models_B.model_consistency(cluster=1, cutoffs=(1000,1500))
In [43]:
# Calculate the consistency plot for all models in the first cluster (cluster 1)
models_PSC.model_consistency(cluster=1, cutoffs=(1000,1500))
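Under the assumption that the models are already superimposed (see align_models), the consistency of a particle can be sketched as the percentage of model pairs in which its positions fall within the cutoff. A toy illustration, not TADbit's implementation:

```python
import math
from itertools import combinations

def consistency(models, cutoff):
    """Per-particle % of model pairs whose positions lie within `cutoff`.
    Assumes the models are already superimposed."""
    n_particles = len(models[0])
    pairs = list(combinations(models, 2))
    result = []
    for p in range(n_particles):
        ok = sum(1 for a, b in pairs if math.dist(a[p], b[p]) <= cutoff)
        result.append(100.0 * ok / len(pairs))
    return result

# Three toy aligned models: particle 0 is stable, particle 1 is variable
m1 = [(0, 0, 0), (5000, 0, 0)]
m2 = [(100, 0, 0), (0, 5000, 0)]
m3 = [(0, 100, 0), (0, 0, 5000)]
cons = consistency([m1, m2, m3], cutoff=1000)  # [100.0, 0.0]
```

A flat, high consistency profile indicates a structurally well-defined region, while dips flag particles whose position varies from model to model.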
In the modelling we have used a scale of 0.01 nm/bp; that means we expect 100 bp of chromatin per nm, both within each bead and between two consecutive beads.
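A quick back-of-the-envelope check of what this scale implies at the resolution used here (plain arithmetic, not TADbit code):

```python
reso = 100_000   # bp per bead (100 kb bins)
scale = 0.01     # nm of 3D distance per bp of chromatin

bp_per_nm = 1 / scale            # 100 bp of chromatin per nm
bead_spacing_nm = reso * scale   # expected ~1000 nm between consecutive beads
```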
In [44]:
# Calculate a DNA density plot
models_B.density_plot(cluster=1)
In [45]:
# Calculate a DNA density plot
models_PSC.density_plot()
In [46]:
# Get a similar plot for the top cluster only, showing the standard deviation for a given running-window size (steps)
models_B.density_plot(cluster=1,error=True, steps=(5))
In [47]:
# Get a similar plot for the top cluster only, showing the standard deviation for a given running-window size (steps)
models_PSC.density_plot(cluster=1,error=True, steps=(5))
In [48]:
models_B.walking_angle(steps=(3, 5, 7), signed=False)
In [49]:
models_B.interactions(cutoff=2000)
In [50]:
models_PSC.interactions(cutoff=2000)
The outer part of the model can be excluded from the estimation of the accessible surface, because contacts from this outer part to particles outside the model are unknown. To exclude the outer contour, a sphere with a larger radius (superradius) is first tested in the mesh before proceeding to the accessibility calculation.
In [51]:
models_B.accessibility(cluster=1, radius=1000, nump=10, superradius=2000)
In [52]:
models_PSC.accessibility(cluster=1, radius=1000, nump=10, superradius=2000)
To inspect the distance between two particles in the models we can use the median_3d_dist function, which gives us not only the median distance over all the models but also a histogram of the values across them.
In [53]:
models_B.median_3d_dist(13, 20, plot=False)
Out[53]:
In [54]:
models_B.median_3d_dist(13, 20, plot=True)
In [55]:
models_PSC.median_3d_dist(13, 20, plot=True)
The median distance can also be calculated over a single cluster or over a specific subset of models.
In [56]:
models_B.median_3d_dist(13, 30, cluster=1)
In [57]:
models_B.median_3d_dist(13, 30, models=[0,1,2,3,4,5])
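The statistic itself is straightforward: the median of the Euclidean distances between the two particles over the chosen models. A self-contained sketch with toy coordinates (not TADbit's implementation):

```python
import math
from statistics import median

def median_3d_dist(models, i, j):
    """Median Euclidean distance between particles i and j across models."""
    return median(math.dist(m[i], m[j]) for m in models)

# Three toy models of two particles each
models_toy = [
    [(0, 0, 0), (300, 0, 0)],
    [(0, 0, 0), (0, 400, 0)],
    [(0, 0, 0), (0, 0, 1200)],
]
d = median_3d_dist(models_toy, 0, 1)  # distances 300, 400, 1200 -> median 400.0
```

Using the median rather than the mean makes the estimate robust to the occasional poorly-converged model in the ensemble.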
The StructuralModels object can be saved and retrieved at a later stage.
In [58]:
# Save your entire analysis and models
models_B.save_models('B.models')
In [59]:
from pytadbit import load_structuralmodels
In [60]:
# Load the models
loaded_models = load_structuralmodels('B.models')
print(loaded_models)
Other information can also be saved independently, such as the contact map as a BED-like file, or the models as XYZ files or as CMM files for visualization with Chimera.
We can also export the models and clusters to a JSON file for later inspection with TADkit (http://sgt.cnag.cat/3dg/tadkit/)
In [61]:
! mkdir -p results/models_B
In [62]:
models_B.experiment
Out[62]:
In [63]:
models_B.contact_map(models=range(5,10), cutoff=2000, savedata="results/models_B/contact.txt")
In [64]:
# Write a CMM file for the top model
models_B.write_cmm(directory="results/models_B", model_num=0)
# Write a CMM file for the centroid model of cluster 1
models_B.write_cmm(directory="results/models_B", model_num=models_B.centroid_model(cluster=1))
# Write a XYZ file for the top model
models_B.write_xyz(directory="results/models_B", model_num=0)
# Write a XYZ file for the top 10 models
models_B.write_xyz(directory="results/models_B", models=range(10))
# Write a XYZ file for the cluster 1 models
models_B.write_xyz(directory="results/models_B", cluster=1)
# Write TADkit JSON http://sgt.cnag.cat/3dg/tadkit/demo.h/
models_B.description['species'] = 'Mus musculus'
models_B.write_json(filename="results/models_B/models_B.json", title="Mouse B")
In [65]:
! mkdir -p results/models_PSC
In [66]:
models_PSC.description['species'] = 'Mus musculus'
models_PSC.write_json(filename="results/models_PSC/models_PSC.json", title="Mouse PSC")
In [ ]: