Three-dimensional modeling of chromatin structure

Recover the data from the previous tutorial by loading the previously saved chromosome.

In [1]:
# Libraries
from pytadbit import load_chromosome # load chromosome
from pytadbit.imp.CONFIG import CONFIG # Pre-defined parameters for modeling
from pytadbit.imp.structuralmodels import load_structuralmodels # to load saved models

# Load the chromosome
my_chrom = load_chromosome('some_path.tdb')
Next, load the Hi-C data for each experiment (Hi-C data are not saved inside chromosome objects because of their size):

In [2]:
# Loop over experiments in chromosome and load Hi-C data.
res = 100000

for exp in my_chrom.experiments:
    try:
        exp.load_hic_data('../../scripts/sample_data/HIC_{0}_{1}_{1}_{2}_obs.txt'.format(
                          exp.name, my_chrom.name, res))
    except IOError:
        print 'file not found for experiment: ' + exp.name
        continue

# Put the summed Hi-C data of the individual experiments into the experiment k562+gm06690
my_chrom.experiments['k562+gm06690'].load_hic_data(
              (my_chrom.experiments['k562'] + my_chrom.experiments['gm06690']).hic_data, 
              'k562+gm06690')
exp = my_chrom.experiments['gm06690']
exp.normalize_hic()
exp.filter_columns()
print my_chrom.experiments


/usr/local/lib/python2.7/dist-packages/pytadbit/parsers/hic_parser.py:93: UserWarning: WARNING: non integer values
  warn('WARNING: non integer values')
WARNING: experiments should be normalized before being summed
file not found for experiment: k562+gm06690
file not found for experiment: batch_gm06690_k562
[Experiment k562 (resolution: 100Kb, TADs: 37, Hi-C rows: 639, normalized: None), Experiment gm06690 (resolution: 100Kb, TADs: 34, Hi-C rows: 639, normalized: visibility_factor:1), Experiment k562+gm06690 (resolution: 100Kb, TADs: None, Hi-C rows: 639, normalized: None), Experiment batch_gm06690_k562 (resolution: 100Kb, TADs: 35, Hi-C rows: 639, normalized: None)]
/usr/local/lib/python2.7/dist-packages/pytadbit/utils/hic_filtering.py:136: ComplexWarning: Casting complex values to real discards the imaginary part
  round(root, 3), ' '.join(

WARNING: removing columns having less than 67.485 counts:
   246   247   248   249   250   251   252   253   254   255   256   257   258   259   260   261   262   263   264   265
   266   267   268   269   270   271   272   273   274   275   276   277   278   279   280   281   282   283   284   285
   286   287   288   289   290   291   292   293   294   295   296   297   298   299   300   301   302   303   304   305
   306   307   308   309   310   311   312   313   314   315   316   317   318   319   320   321   322   323   324   639

Model the 3D structure of a selected TAD

TADbit uses the Integrative Modeling Platform (IMP, http://www.integrativemodeling.org) for modeling the 3D structure of genomes and genomic domains. Here we aim at modeling the 3D structure of the TAD selected in the first tutorial (#1. Detection of TADs) using the optimal parameters from the second tutorial (#2. Parameter optimization).
All models are based on specific sets of experimental data for which the TADbit modeling parameters need to be optimized (see tutorial #2). Since optimizing the parameters takes significant CPU time, we have created a Python dictionary, called CONFIG, with sets of pre-optimized parameters for released datasets.

In [3]:
# See pre-defined sets of parameters
CONFIG


Out[3]:
{'dmel_01': {'kforce': 5,
  'lowfreq': -0.7,
  'maxdist': 600,
  'reference': 'victor corces dataset 2013',
  'scale': 0.01,
  'upfreq': 0.3}}
.. note:: Over time, we will populate the CONFIG dictionary with more optimal sets of parameters for released Hi-C matrices.
To set the parameter values manually, one can create a dictionary and use it for the modeling (see the config parameter of the model_region function).

In [4]:
# Set of optimal parameters from the previous tutorial #2

optpar = {'kforce': 5,
          'lowfreq': 0.0,
          'lowrdist': 100,
          'upfreq': 0.8,
          'maxdist': 2750,
          'scale': 0.01,
          'reference': 'gm cell from Job Dekker 2009'}

# Build 3D models based on the HiC data. This is done by IMP.
models = exp.model_region(100, 200, n_models=500, n_keep=100, n_cpus=8, keep_all=True, config=optpar)
print models


Preparing to model 101 particles
StructuralModels with 100 models of 101 particles
   (objective function range: 73269182 - 76546743)
   (corresponding to the best models out of 500 models).
  IMP modeling used this parameters:
   - maxdist     : 2750
   - upfreq      : 0.8
   - reference   : gm cell from Job Dekker 2009
   - kforce      : 5
   - lowfreq     : 0.0
   - scale       : 0.01
   - lowrdist    : 1000.0
  Models where clustered into 0 clusters

In [5]:
models.experiment


Out[5]:
Experiment gm06690 (resolution: 100Kb, TADs: 34, Hi-C rows: 639, normalized: visibility_factor:1)
Once finished, the IMP-generated models are stored in a dictionary whose keys are numbered from the smallest to the largest value of the IMP Objective Function (that is, how well each model satisfies the input restraints). One can select parts of the models, or single models, to get some information.
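The ranking idea described above can be sketched as follows. This is a toy illustration, not pytadbit's internal code; the objective-function values and random seeds below are invented:

```python
# Hypothetical sketch: models re-keyed from 0 (best) upwards by ascending
# IMP objective function. All values below are made up for illustration.
raw_models = [
    {'rand_init': 199, 'objfun': 73269182.85},
    {'rand_init': 20,  'objfun': 76546743.10},
    {'rand_init': 4,   'objfun': 74622165.33},
]

# Sort by objective function (lower = better) and number from 0
ranked = dict(enumerate(sorted(raw_models, key=lambda m: m['objfun'])))

print(ranked[0]['rand_init'])  # the best model here is the one seeded with 199
```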

In [6]:
# Select top 10 models
models.define_best_models(10)
print "Lowest 10 IMP OF models:" 
print models

# Select top 100 models
models.define_best_models(100)
print "Lowest 100 IMP OF models:" 
print models

# Get the data for the lowest IMP OF model (number 0) in the set of models
model = models[0]
print model


Lowest 10 IMP OF models:
StructuralModels with 10 models of 101 particles
   (objective function range: 73269182 - 74622165)
   (corresponding to the best models out of 500 models).
  IMP modeling used this parameters:
   - maxdist     : 2750
   - upfreq      : 0.8
   - reference   : gm cell from Job Dekker 2009
   - kforce      : 5
   - lowfreq     : 0.0
   - scale       : 0.01
   - lowrdist    : 1000.0
  Models where clustered into 0 clusters
Lowest 100 IMP OF models:
StructuralModels with 100 models of 101 particles
   (objective function range: 73269182 - 76546743)
   (corresponding to the best models out of 500 models).
  IMP modeling used this parameters:
   - maxdist     : 2750
   - upfreq      : 0.8
   - reference   : gm cell from Job Dekker 2009
   - kforce      : 5
   - lowfreq     : 0.0
   - scale       : 0.01
   - lowrdist    : 1000.0
  Models where clustered into 0 clusters
IMP model ranked 1 (101 particles) with: 
 - Final objective function value: 73269182.8524
 - random initial value: 199
 - first coordinates:
        X      Y      Z
   -11571  -6460  -4627
   -12418  -5912  -4701
   -11530  -5648  -5033

One measure to check whether the IMP optimization has reached equilibrium is to plot the value of the IMP Objective Function as a function of the iteration number during the optimization. If the plot reaches a plateau, the modeling run has converged properly.
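The plateau criterion can be sketched numerically: a run is considered converged when the relative spread of the objective function over the last few recorded values is small. The trace, window size, and tolerance below are invented for illustration:

```python
# Toy objective-function trace (one value per recorded iteration)
trace = [9.0e7, 8.0e7, 7.6e7, 7.4e7, 7.33e7, 7.33e7, 7.32e7, 7.32e7]

def has_plateaued(values, window=4, tol=0.01):
    """True if the last `window` values vary by less than `tol` (relative)."""
    tail = values[-window:]
    return (max(tail) - min(tail)) / max(tail) < tol

print(has_plateaued(trace))            # flat tail -> True
print(has_plateaued(trace, window=8))  # includes the early descent -> False
```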

In [7]:
# Get the IMP OF of the stored model in "model"
model.objective_function(log=True, smooth=False)


One important aspect is to identify whether the set of models correlates well with the input Hi-C data. This can be done with a single function call on the models.

In [8]:
# Re-select the top 100 models
models.define_best_models(100)
# Calculate the correlation coefficient between a set of kept models and the original HiC matrix
models.correlate_with_real_data(plot=True, cutoff=1000)


Out[8]:
(0.66973037475999819, 0.0)

In [9]:
models.zscore_plot()



In [10]:
models.align_models(in_place=True)
models.deconvolve(fact=0.6, dcutoff=1000, represent_models='best', n_best_clusters=5)


Total number of clusters: 4
   Cluster #1 has 59 models [top model: 199]
   Cluster #2 has 32 models [top model: 20]
   Cluster #3 has 7 models [top model: 491]
   Cluster #4 has 2 models [top model: 159]

Model analysis

Model clustering

First, we are going to cluster the 3D models based on their structural similarity. Clusters are numbered from the largest (most models) to the smallest (fewest models).
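The grouping principle can be illustrated with a toy single-linkage pass: models whose pairwise structural distance (e.g. RMSD after superposition) falls below a cut-off end up in the same cluster. This is not TADbit's actual clustering algorithm, and the distance matrix below is made up:

```python
# Made-up pairwise model distances (nm), keyed by sorted model-index pairs
dmat = {(0, 1): 800.0, (0, 2): 2500.0, (1, 2): 2600.0}

def single_linkage(n_models, dmat, dcutoff):
    """Merge clusters whenever any cross-cluster pair is within dcutoff."""
    clusters = [{i} for i in range(n_models)]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(dmat[min(a, b), max(a, b)] <= dcutoff
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return sorted(clusters, key=len, reverse=True)  # largest cluster first

print(single_linkage(3, dmat, dcutoff=2000))  # models 0 and 1 merge; 2 is alone
```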

In [11]:
# Cluster models based on structural similarity
models.cluster_models(fact=0.95, dcutoff=2000)
print models.clusters


Number of singletons excluded from clustering: 20 (total singletons: 20)
Total number of clusters: 12
   Cluster #1 has 27 models [top model: 199]
   Cluster #2 has 9 models [top model: 4]
   Cluster #3 has 8 models [top model: 264]
   Cluster #4 has 8 models [top model: 437]
   Cluster #5 has 6 models [top model: 113]
   Cluster #6 has 6 models [top model: 388]
   Cluster #7 has 5 models [top model: 88]
   Cluster #8 has 3 models [top model: 20]
   Cluster #9 has 2 models [top model: 9]
   Cluster #10 has 2 models [top model: 382]
   Cluster #11 has 2 models [top model: 104]
   Cluster #12 has 2 models [top model: 311]

Total number of clusters: 12
   Cluster #1 has 27 models [top model: 199]
   Cluster #2 has 9 models [top model: 4]
   Cluster #3 has 8 models [top model: 264]
   Cluster #4 has 8 models [top model: 437]
   Cluster #5 has 6 models [top model: 113]
   Cluster #6 has 6 models [top model: 388]
   Cluster #7 has 5 models [top model: 88]
   Cluster #8 has 3 models [top model: 20]
   Cluster #9 has 2 models [top model: 9]
   Cluster #10 has 2 models [top model: 382]
   Cluster #11 has 2 models [top model: 104]
   Cluster #12 has 2 models [top model: 311]

The output of this analysis is stored in a Python dictionary that maps each cluster number to the models it contains. Here the models group into 12 clusters, where cluster #1 contains 27 models and cluster #2 contains 9 models.
Once the clusters are generated, one can plot them for easy visualization. The "y" axis of the plot shows the IMP Objective Function, and the width of each branch is proportional to the number of models in the cluster. One would expect the largest cluster (cluster #1, with the widest branch) to also have the lowest IMP Objective Function. This indicates that the optimization most often found the same solution, which corresponds to the lowest IMP Objective Function.

In [12]:
# Plot the resulting clusters
cl = models.cluster_analysis_dendrogram(color=True)


One can also show the similarity between clusters, for a limited number of them (5 in this example).

In [13]:
# Show the dendrogram for only the top 5 clusters, without colors
cl = models.cluster_analysis_dendrogram(n_best_clusters=5)


Models consistency

To assess how "deterministic" a cluster is, one can calculate, for each particle, the percentage of models in the cluster in which that particle superimposes within a given cut-off (by default, cut-offs of 50, 100, 150 and 200 nm). The lower the consistency value (in %), the less deterministic the models within the selected cluster. This measure can be taken as a proxy of the variability across the models.
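The measure just described can be sketched for a single particle: the percentage of model pairs in which its two copies lie within the cut-off, assuming the models have already been superimposed. The coordinates below are toy values, not real models:

```python
from itertools import combinations

def edist(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

aligned = [  # 3 aligned toy models x 2 particles, coordinates in nm
    [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)],
    [(10.0, 0.0, 0.0), (400.0, 0.0, 0.0)],
    [(5.0, 5.0, 0.0), (120.0, 0.0, 0.0)],
]

def consistency(models, particle, cutoff):
    """Percentage of model pairs placing `particle` within `cutoff`."""
    pairs = list(combinations(models, 2))
    close = sum(edist(m1[particle], m2[particle]) <= cutoff for m1, m2 in pairs)
    return 100.0 * close / len(pairs)

print(consistency(aligned, 0, 50))  # particle 0 agrees in all pairs -> 100.0
print(consistency(aligned, 1, 50))  # particle 1 agrees in 1 of 3 pairs
```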

In [14]:
# Calculate the consistency plot for all models in the first cluster (cluster 1)
models.model_consistency(cluster=1, cutoffs=(1000,2000))


Be aware that this measure only makes sense for models within a single cluster, not for models from different clusters.

DNA density plots

From the 3D models, the DNA density (or local compactness) can be calculated as the ratio between the bin size (in base pairs) and the distance between consecutive particles in the models. The higher the density, the more compact the DNA in that region. As this measure varies dramatically from particle to particle, one can calculate it using running averages.
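Under this definition, the calculation is straightforward; here is a sketch with invented consecutive-particle distances (the resolution matches this tutorial, everything else is illustrative):

```python
resolution = 100000  # bp per particle, as in this tutorial

# Hypothetical distances between consecutive particles in one model (nm)
distances = [280.0, 250.0, 400.0, 320.0]

# Density = bin size (bp) / inter-particle distance (nm), i.e. bp per nm
densities = [resolution / d for d in distances]

def running_average(values, steps=3):
    """Smooth a series with a centered window of `steps` values (clipped at edges)."""
    half = steps // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out

print([round(x, 1) for x in densities])
print([round(x, 1) for x in running_average(densities)])
```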

In [15]:
# Calculate a DNA density plot
models.density_plot()



In [16]:
# Get a similar plot for only the top cluster, showing the standard deviation for a specific running-window size (steps)
models.density_plot(cluster=1,error=True, steps=(5))



In [17]:
models.walking_angle(steps=(3, 5, 7), signed=False)



In [18]:
models.interactions(cutoff=2000)


Models contact map

Given a set of selected models (either from a cluster or a list) one can calculate the percentage of pairs of particles within a distance cut-off. This can then be represented as a heat-map which is equivalent to a Hi-C interaction matrix.
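The computation described above can be sketched directly: for each pair of particles, count the fraction of models in which the pair lies within the cut-off. The two toy models below are invented coordinates, not real output:

```python
def edist(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

ensemble = [  # 2 toy models x 3 particles, coordinates in nm
    [(0.0, 0.0, 0.0), (500.0, 0.0, 0.0), (2000.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (900.0, 0.0, 0.0), (1100.0, 0.0, 0.0)],
]
n_particles = 3
cutoff = 1200.0  # nm

# contact[i][j] = fraction of models placing particles i and j within cutoff
contact = [[sum(edist(m[i], m[j]) <= cutoff for m in ensemble)
            / float(len(ensemble))
            for j in range(n_particles)] for i in range(n_particles)]

print(contact[0][1])  # within the cutoff in both models -> 1.0
print(contact[0][2])  # within the cutoff in one of two models -> 0.5
```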

In [19]:
# Get a contact map for models 5 to 9 at a distance cut-off of 1200 nm
models.contact_map(models=range(5,10), cutoff=1200, savedata="contact.txt")
The goal of TADbit is to find a 3D structure (or an ensemble of structures) that best satisfies the original Hi-C matrix. Therefore, we can compare the contact map produced above with the original Hi-C input matrix, for subsets of the models.
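The comparison amounts to correlating the flattened model contact map with the flattened Hi-C matrix; a rank correlation is a natural choice since the two are on different scales. The sketch below is a minimal Spearman implementation (no tie handling) over invented vectors, not pytadbit's internals:

```python
def rankdata(values):
    """Rank values from 0 upward (no tie handling; fine for this toy example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = float(rank)
    return ranks

def spearman(x, y):
    """Pearson correlation of the ranks = Spearman correlation."""
    rx, ry = rankdata(x), rankdata(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

model_contacts = [0.1, 0.5, 0.9, 0.3]   # toy flattened contact map
hic_counts     = [2.0, 10.0, 40.0, 7.0]  # toy flattened Hi-C matrix
print(round(spearman(model_contacts, hic_counts), 3))  # monotone -> 1.0
```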

In [20]:
# Correlate the contact map with the original input HiC matrix for cluster 1
models.correlate_with_real_data(cluster=1, plot=True, cutoff=1500)
# Correlate the contact map with the original input HiC matrix for cluster 2
models.correlate_with_real_data(cluster=2, plot=True, cutoff=1500)
# Correlate the contact map with the original input HiC matrix for cluster 10
models.correlate_with_real_data(cluster=10, plot=True, cutoff=1500)


Out[20]:
(0.60692621531513558, 0.0)

Calculating distances between particles

It is sometimes useful to obtain the distribution of distances between a pair of particles across the models (or a subset of the models). Next, we show several ways of getting such representations.
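The underlying statistic is simply the median of one pairwise distance taken across the ensemble; a toy sketch with invented coordinates:

```python
from statistics import median

def edist(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical positions of one particle pair in five models (nm)
pairs = [((0, 0, 0), (1500, 0, 0)),
         ((0, 0, 0), (1400, 0, 0)),
         ((0, 0, 0), (1600, 0, 0)),
         ((0, 0, 0), (1300, 0, 0)),
         ((0, 0, 0), (1700, 0, 0))]

print(median(edist(a, b) for a, b in pairs))  # -> 1500.0
```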

In [21]:
# Get the median distance between particles 13 and 20 in all kept models
models.median_3d_dist(13, 20, plot=False)


Out[21]:
1507.9106832517446
Let's plot the distribution used to obtain this median value.

In [22]:
# Plot the distance distribution between particles 15 and 20 in all kept models
models.median_3d_dist(15, 20, plot=True)


We may also want to use only the first models (those with the lowest objective function), or the models belonging to a given cluster (for example, cluster 1).

In [23]:
# Plot the distance distribution between particles 13 and 30 in the top 100 models
models.median_3d_dist(13, 30, models=range(100))



In [24]:
# Plot the distance distribution between particles 0 and 54 in the models from cluster 1
models.median_3d_dist(0, 54, plot=True, cluster=1)


Save and load models and analysis

By saving your analysis, you won't need to repeat some of the most expensive calculations.

In [25]:
# Save your entire analysis and models
models.save_models('gm_01.models')
And to load them:

In [26]:
# Load the models
loaded_models = load_structuralmodels('gm_01.models')
print loaded_models


StructuralModels with 100 models of 101 particles
   (objective function range: 73269182 - 76546743)
   (corresponding to the best models out of 500 models).
  IMP modeling used this parameters:
   - maxdist     : 2750
   - upfreq      : 0.8
   - reference   : gm cell from Job Dekker 2009
   - kforce      : 5
   - lowfreq     : 0.0
   - scale       : 0.01
   - lowrdist    : 1000.0
  Models where clustered into 12 clusters
Specific 3D models can be saved in two formats: the CMM format, which can be directly loaded into Chimera for visualization, and the XYZ format, a simple format useful for further analyses that require coordinates.

In [27]:
# Write a CMM file for the top model
models.write_cmm(directory="./", model_num=0)
# Write a CMM file for the centroid model of cluster 1
models.write_cmm(directory="./", model_num=models.centroid_model(cluster=1))
# Write a XYZ file for the top model
models.write_xyz(directory="./", model_num=0)
# Write a XYZ file for the top 10 models
models.write_xyz(directory="./", models=range(10))
# Write a XYZ file for the cluster 1 models
models.write_xyz(directory="./", cluster=1)

Chimera

Our group has been using the visualization tool Chimera, from the Ferrin group at UCSF (http://www.cgl.ucsf.edu/chimera/), to visualize the 3D models. Here we provide a couple of automatic ways of generating static images and videos of selected models. A user can load the models into Chimera using the CMM files generated in the previous step of this tutorial. ** NOTE ** To properly insert the images/videos in this tutorial, we need to import libraries from IPython. However, such libraries are necessary neither for the modeling nor for the analysis of the models. ** NOTE **

In [28]:
models.view_models(stress='centroid', tool='plot', figsize=(10,10), azimuth=-60, elevation=100)


Note that only one model, the centroid, is highlighted with colors. An easy way to retrieve the centroid model is to use the centroid_model function.

As models are linked to experiments, we can paint the TADs in this visualization:


In [29]:
models.view_models(highlight='centroid', show='highlighted', tool='plot', figsize=(10,10), azimuth=-60, elevation=100, color='tad')


We can also color only the TAD borders; in this view, borders are colored according to their score:


In [30]:
models.view_models(highlight='centroid', show='highlighted', tool='plot', figsize=(10,10), azimuth=-60, elevation=100, color='border')



In [31]:
print models[models.centroid_model(cluster=1)]


IMP model ranked 67 (101 particles) with: 
 - Final objective function value: 75957440.5179
 - random initial value: 117
 - first coordinates:
        X      Y      Z
      645   2695  -3409
      106   2687  -4296
     -301   2487  -3435


In [32]:
# Generate the image using Chimera in batch mode. This takes some time; wait a bit before running the next command.
# You can check in your home directory whether it has finished.
models.view_models(models=[0], tool='chimera_nogui', savefig='/tmp/image_model_1.png')

In [33]:
## REMOVE
# THIS IS ONLY TO DISPLAY THE IMAGE IN IPython.
from IPython.display import Image
Image(filename='/tmp/image_model_1.png', width=654, height=454)
## STOP REMOVE


Out[33]:

In [34]:
# Generate the video using Chimera in batch mode. This takes SIGNIFICANT time; wait before running the next command.
# You can check in your home directory whether it has finished.
models.view_models(models=[0], tool='chimera_nogui', savefig='/tmp/image_model_1.webm')

In [35]:
## REMOVE
# THIS IS ONLY TO DISPLAY THE VIDEO IN IPython.
from IPython.display import HTML
video = open("/tmp/image_model_1.webm", "rb").read()
video_encoded = video.encode("base64")
video_tag = '''
<style media="screen" type="text/css">
#video{
    width:654px;
    height:454px;
}
</style>''' + '''
<video loop=1 id="video" controls="controls">
    <source src="data:video/webm;base64,{0}" type="video/webm"/>
    Your browser does not seem to support the video tag
</video>'''.format(video_encoded)
HTML(video_tag)
## STOP REMOVE


Out[35]: