Cluster Analysis Tutorial

This tutorial describes how to run a cluster analysis on the output of a calculation that was run with a ReplicaExchange protocol. The clustering algorithm creates cluster.$index directories, so you may want to remove these directories every time you rerun the calculation. Computing the distance matrix is very time consuming, so the matrix can be saved to a file in case you want to apply different clustering methods later. In addition, the calculation can be run in parallel (this requires the mpi4py Python library).

Paste the Python code below into a file and run it (e.g., imp/setup_environment.sh mpirun -np 8 python this_script.py).



In [ ]:

    
import IMP
import IMP.pmi
import IMP.pmi.macros

Set is_mpi=True if you want to run the script in parallel. The number of processes is specified through the mpirun command. mpi4py is required to run in parallel.



In [ ]:

    
is_mpi=True

model=IMP.Model()
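Since is_mpi=True only works when mpi4py is installed, the flag can also be derived from the environment instead of being hard-coded. The following is a minimal sketch using only the Python standard library; the helper name mpi_available is hypothetical and is not part of IMP. It only checks that the package is importable, so you still have to launch the script through mpirun to get more than one process.

```python
import importlib.util


def mpi_available():
    """Return True if the mpi4py package can be found in this environment."""
    # find_spec returns None when a top-level package is not installed.
    return importlib.util.find_spec("mpi4py") is not None


# Enable the parallel code path only when mpi4py is importable.
is_mpi = mpi_available()
print("mpi4py found:", is_mpi)
```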

Initialize the analysis class.



In [ ]:

    
mc=IMP.pmi.macros.AnalysisReplicaExchange0(model,
                                        stat_file_name_suffix="stat",
                                        global_output_directory="./post-EM",
                                        rmf_dir="rmfs/")
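As mentioned in the introduction, clustering creates cluster.$index directories that you may want to remove before rerunning. The sketch below shows one way to do that with the standard library. The function remove_cluster_dirs is a hypothetical helper, not part of IMP, and where the directories end up depends on your setup, so the base directory is passed in explicitly.

```python
import glob
import os
import shutil


def remove_cluster_dirs(base_dir="."):
    """Delete every cluster.<index> directory directly under base_dir.

    Returns the list of removed paths. Only directories whose suffix is a
    plain integer are touched, so unrelated names such as cluster.old survive.
    """
    removed = []
    for path in sorted(glob.glob(os.path.join(base_dir, "cluster.*"))):
        suffix = os.path.basename(path).split(".", 1)[1]
        if os.path.isdir(path) and suffix.isdigit():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling remove_cluster_dirs(some_directory) before the clustering step clears the results of a previous run; check the returned list if you want to log what was deleted.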

Define a list of features to be extracted from the stat files for each cluster. The keywords can be incomplete: in that case, every stat-file keyword that contains one of the substrings in the list will be extracted. For instance, ISDCrossLinkMS_Distance_intrarb will extract all cross-link distances whose keyword contains that string. The features are stored in each cluster directory as a stat file (they can be read using process_output).



In [ ]:

    
feature_list=["ISDCrossLinkMS_Distance_intrarb",
              "ISDCrossLinkMS_Distance_interrb",
              "ISDCrossLinkMS_Data_Score",
              "GaussianEMRestraint_None",
              "SimplifiedModel_Linker_Score_None",
              "ISDCrossLinkMS_Psi_1.0_",
              "ISDCrossLinkMS_Sigma_1_"]
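The substring rule described above can be illustrated with a few lines of plain Python. This is only a sketch of the matching idea, not the code IMP uses internally; select_features is a hypothetical helper and the stat-file keys in example_keys are invented for illustration.

```python
def select_features(stat_keys, feature_list):
    """Return the keys that contain at least one entry of feature_list."""
    return [key for key in stat_keys
            if any(feature in key for feature in feature_list)]


# Hypothetical stat-file keys, invented for this example only.
example_keys = [
    "ISDCrossLinkMS_Distance_intrarb_ProtA_10_ProtB_25",
    "ISDCrossLinkMS_Distance_interrb_ProtA_40_ProtC_7",
    "GaussianEMRestraint_None",
    "SimplifiedModel_Total_Score_None",
]

selected = select_features(
    example_keys,
    ["ISDCrossLinkMS_Distance_intrarb", "GaussianEMRestraint_None"],
)
print(selected)
```

An incomplete keyword therefore selects a whole family of keys, while a complete keyword such as GaussianEMRestraint_None selects exactly one.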

To run the clustering, you have to specify which keyword in the stat files corresponds to the total score (e.g., "SimplifiedModel_Total_Score_None"). Also specify which keywords correspond to the rmf file and the rmf frame index (here "rmf_file" and "rmf_frame_index").

number_of_best_scoring_models is the number of best-scoring models that are used for the clustering.

alignment_components is the list of component names on which the alignment is calculated.

distance_matrix_file is the name of the file to which the distance matrix is written. It can be loaded later if you want to cluster using other parameters (e.g., a different number of clusters).

feature_keys is the list of feature keywords to extract (the feature_list defined above).

number_of_clusters is the number of clusters used by kmeans.

density_custom_ranges, if defined, makes the analysis calculate the localization density for the cluster ensemble. It is a dictionary where each key is a density file name and each value is a list whose entries are either component names or tuples of the form (firstres, lastres, component_name).
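To make the mixed format of density_custom_ranges concrete, the sketch below converts every entry into a uniform (firstres, lastres, component_name) tuple and rejects inverted residue ranges. It is an illustration of the data structure only. The function normalize_density_ranges is hypothetical, and using None for "whole component" is an assumption of this sketch, not IMP behavior.

```python
def normalize_density_ranges(density_custom_ranges):
    """Map each density name to a list of (first, last, component) tuples.

    A bare component name becomes (None, None, name), which this sketch
    uses to mean "the whole component".
    """
    normalized = {}
    for density_name, selections in density_custom_ranges.items():
        entries = []
        for selection in selections:
            if isinstance(selection, str):
                entries.append((None, None, selection))
            else:
                first, last, component = selection
                if first > last:
                    raise ValueError(
                        f"{density_name}: first residue {first} "
                        f"is after last residue {last}")
                entries.append((first, last, component))
        normalized[density_name] = entries
    return normalized


ranges = {"Density1": ["ProtA", (1, 100, "ProtB"), (200, 350, "ProtB")],
          "Density2": ["ProtC"]}
normalized = normalize_density_ranges(ranges)
print(normalized)
```

Running a check like this before the clustering step catches a mistyped range early, before the expensive distance matrix calculation starts.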



In [ ]:

    
mc.clustering("SimplifiedModel_Total_Score_None",
              "rmf_file",
              "rmf_frame_index",
              number_of_best_scoring_models=100,
              alignment_components=["ProtA","ProtB","ProtC"],
              distance_matrix_file="distance.rawmatrix.pkl",
              feature_keys=feature_list,
              load_distance_matrix_file=False,
              is_mpi=is_mpi,
              number_of_clusters=2,
              density_custom_ranges={"Density1":["ProtA",(1,100,"ProtB"),(200,350,"ProtB")],
                                     "Density2":["ProtC"]})
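Because the distance matrix is the expensive part, a rerun with different clustering parameters can reuse the saved file by setting load_distance_matrix_file=True. The sketch below builds those keyword arguments automatically, depending on whether the file already exists. The helper matrix_options is hypothetical; only the keyword names come from the call above. Reusing the matrix presumably only makes sense if it was computed for the same set of models.

```python
import os


def matrix_options(matrix_path, number_of_clusters):
    """Build keyword arguments that reuse a saved distance matrix if present."""
    return {
        "distance_matrix_file": matrix_path,
        # Load the matrix only when an earlier run has already written it.
        "load_distance_matrix_file": os.path.exists(matrix_path),
        "number_of_clusters": number_of_clusters,
    }


options = matrix_options("distance.rawmatrix.pkl", number_of_clusters=3)
print(options)

# The options could then be forwarded to the clustering call, for example:
# mc.clustering("SimplifiedModel_Total_Score_None", "rmf_file", "rmf_frame_index",
#               number_of_best_scoring_models=100,
#               alignment_components=["ProtA", "ProtB", "ProtC"],
#               feature_keys=feature_list, is_mpi=is_mpi, **options)
```

On the first run the file does not exist, so the matrix is computed and written; on later runs the same code switches to loading it.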