Guillermo Perez-Hernandez guille.perez@fu-berlin.de
In this notebook we will be using the 1 millisecond trajectory of Bovine Pancreatic Trypsin Inhibitor (BPTI) generated by DE Shaw Research on the Anton Supercomputer and kindly made available by their lab. The original work is
The trajectory has been duplicated and shortened to provide a mock-trajectory set and be able to deal with lists of trajectories of different lenghts:
c-alpha_centered.stride.100.xtc
c-alpha_centered.stride.100.reversed.xtc
c-alpha_centered.stride.100.halved.xtc
The typical usecase is having molecular dynamics (MD) simulation data in form of trajectory files with extensions like .xtc, .dcd
etc and the associated molecular topology as a .pdb
or .gro
file.
These files are the most general starting point for any analysis dealing with MD, and molpx
's API has been designed to be able to function without further input:
In [ ]:
top = 'notebooks/data/bpti-c-alpha_centered.pdb'
MD_trajfiles = ['notebooks/data/c-alpha_centered.stride.1000.xtc',
'notebooks/data/c-alpha_centered.stride.1000.reversed.xtc',
'notebooks/data/c-alpha_centered.stride.1000.halved.xtc'
]
dt = 244 #saving interval in the .xtc files, in ns
import molpx
from matplotlib import pylab as plt
%matplotlib ipympl
import pyemma
import numpy as np
# This way the user does not have to care where the data are:
top = molpx._molpxdir(top)
MD_trajfiles = [molpx._molpxdir(ff) for ff in MD_trajfiles]
However, molpx
relies heavily on the awesome mdtraj
module for dealing with molecular structures, and so most of molpx
's functions accept also Trajectory
-type objects (native to mdtraj
) as alternative inputs.
In [ ]:
# Create a memory representation of the trajectories
MD_list = [molpx.generate._md.load(itraj, top=top) for itraj in MD_trajfiles]
The same idea applies to the input of projected trajectories: molpx
can take the filenames as inputs (.npy
, .dat
, .txt
etc) or deal directly with numpy.ndarray
objects.
These alternative, "from-memory" input modes (md.Trajectory
and np.ndarray
objects) avoid forcing the user to read from file everytime an API function is called, saving I/O overhead
The following cell either reads or generates projected trajectory files for this demonstration. In a real usecase this step (done here using TICA) might not be needed, given that the user might have generated the projected trajectory elsewhere:
In [ ]:
# Perform TICA or read from file directly if already .npy-files exist
Y_filenames = [ff.replace('.xtc','.Y.npy') for ff in MD_trajfiles]
try:
Y = [np.load(ff) for ff in Y_filenames]
except:
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()
[np.save(ff, iY) for ff, iY in zip(Y_filenames, Y)]
Execute the following cell and click either on the FES or on the slidebar. Some input parameters have been comented out for you to try out:
In [ ]:
mpx_wdg_box = molpx.visualize.FES(MD_list,
#MD_trajfiles,
top,
Y_filenames,
#Y,
nbins=50,
#proj_idxs=[1,2],
proj_labels='TIC',
#n_overlays=5,
)
mpx_wdg_box
The user can sample structures as they occurr in sequence in the actual trajectory. Depending on the size of the dataset, this can be very time consuming, particularly if data is being read from disk.
In this example, try changing MD_trajfiles
to MD_list
and/or changing Y_filenames
to simply Y
and see if it helps.
Furthermore, the objects in memory can be strided down to fewer frames before being parsed to the method. To stride objects being read from the disk, use the stride
parameter.
visualize.traj
. Uncomment them and see what happens
In [ ]:
mpx_wdg_box = molpx.visualize.traj(MD_trajfiles,
top,
Y,
#Y_filenames,
plot_FES = True,
dt = dt*1e-6, tunits='ms',
#traj_selection = 1,
#sharey_traj=False,
#max_frames=100,
proj_idxs=[0, 1],
panel_height=2,
#proj_labels='TIC'
)
mpx_wdg_box
See the documentation of molpx.generate.sample
to find out about all possible options:
molpx.generate.sample(MD_trajectories, MD_top, projected_trajectories, atom_selection=None, proj_idxs=[0, 1], n_points=100, n_geom_samples=1, keep_all_samples=False, proj_stride=1, verbose=False, return_data=False)
In [ ]:
data_sample, geoms = molpx.generate.sample(#MD_list,
MD_trajfiles,
top,
#Y,
Y_filenames,
n_points=200 ,
n_geom_samples=2,
)
data_sample.shape, geoms
In [ ]:
# Replot the FES
plt.figure(figsize=(7,7))
h, (x,y) = np.histogramdd(np.vstack(Y)[:,:2], bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
# Create the linked widget
linked_ngl_wdg, linked_ax_wdg = molpx.visualize.sample(data_sample,
geoms.superpose(geoms[0]),
plt.gca(),
clear_lines=True,
#plot_path=True
)
plt.plot(data_sample[:,0], data_sample[:,1],' ok', zorder=0)
# Show it
linked_ngl_wdg
In [ ]:
paths_dict, idata = molpx.generate.projection_paths(#MD_list,
MD_trajfiles,
top,
Y_filenames,
#Y, # You can also directly give the data here
n_points=50,
proj_idxs=[0,1],
n_projs=3,
proj_dim = 3,
verbose=False,
)
In [ ]:
# Choose the coordinate and the tyep of path
coord = 1
#path_type = 'min_rmsd'
path_type = 'min_disp'
igeom = paths_dict[coord][path_type]["geom"]
ipath = paths_dict[coord][path_type]["proj"]
# Choose the proj_idxs for the path and the FES
# to be shown
proj_idxs = [0,1]
In [ ]:
plt.figure(figsize=(7,7))
h, (x,y) = np.histogramdd(np.vstack(Y)[:,proj_idxs], bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
linked_ngl_wdg, linked_ax_wdg = molpx.visualize.sample(ipath[:,proj_idxs],
igeom.superpose(igeom[0]),
plt.gca(),
clear_lines=True,
n_smooth = 5,
plot_path=True,
#radius=True,
)
linked_ngl_wdg
PyEMMA
molpx
is using many methods of the coordinates
submodule of PyEMMA
, and thus it also understands some of PyEMMA
's classes as input (like clustering objects or streaming transformers).
If the projected coordinates come from a TICA (or PCA) transformation, and the TICA object is available in memory
molpx.visualize.traj
can make use of correlation information to display not only the projected coordinates (i.e the TICs, in this case), but also the "original" input features behind it
In [ ]:
# Re-do the TICA computation to make sure we have a tica object in memory
feat = pyemma.coordinates.featurizer(top)
pairs = feat.pairs(range(feat.topology.n_atoms)[::2])
feat.add_distances(pairs)
src = pyemma.coordinates.source(MD_trajfiles, features=feat)
tica = pyemma.coordinates.tica(src, lag=10, dim=3)
Y = tica.get_output()
The method molpx.visualize.correlations
tries to provide a visual representation of the projected coordinates by relating them to the input features, which carry more meaning, since they are (usually) familiar parameters such as atom distances, angles, contacts etc.
In [ ]:
# Comment or uncomment the optinal parameters and see how the method reacts
# You can use a pre-instantiated the widget
#iwd = molpx.visualize._nglwidget_wrapper(MD_list[0][0])
# Or instantiate at the moment of calling visualize.correlations
iwd = None
corr, ngl_wdg = molpx.visualize.correlations(tica,
n_feats=3,
proj_idxs=[0,1,2],
geoms=MD_list[0][::100],
#verbose=True,
#proj_color_list=['red', 'blue', 'green'],
widget=iwd
)
ngl_wdg
Also, molpx.visualize.traj
can help in visualizing these correlations by parsing along the tica object itself as projection=tica
. In the next cell, can you spot the differences:
In [ ]:
# Reuse the visualize.traj method with the tica object as input
mpx_wdg_box = molpx.visualize.traj(MD_trajfiles,
top,
Y,
#Y_filenames,
plot_FES = True,
dt = dt*1e-6, tunits='ms',
#traj_selection = 0,
#sharey_traj=False,
#max_frames=100,
proj_idxs=[0,1],
panel_height=1,
projection=tica, ## this is what's new
n_feats=2
)
mpx_wdg_box
In [ ]:
# Do "some" clustering
clkmeans = pyemma.coordinates.cluster_kmeans([iY[:,:2] for iY in Y], 5)
In [ ]:
data_sample, geoms = molpx.generate.sample(MD_trajfiles, top, clkmeans,
n_geom_samples=50,
#keep_all_samples=True # read the doc for this argument
)
In [ ]:
# Plot clusters
plt.figure(figsize=(4,4))
plt.plot(clkmeans.clustercenters[:,0], clkmeans.clustercenters[:,1],' ok')
# FES as background is optional (change the bool to False)
if True:
plt.contourf(x[:-1], y[:-1], -np.log(h.T), alpha=.50)
# Link the clusters positions with the molecular structures
linked_ngl_wdg = molpx.visualize.sample(data_sample,
geoms.superpose(geoms[0]),
plt.gca(),
clear_lines=False,
#plot_path=True
)
linked_ngl_wdg
In [ ]:
MSM = pyemma.msm.estimate_markov_model(clkmeans.dtrajs, 20)
In [ ]:
plt.figure(figsize=(4,4))
ax, pos = pyemma.plots.plot_markov_model(MSM.P,
minflux=5e-4,
arrow_labels=None,
ax=plt.gca(),
arrow_curvature = 2, show_frame=True,
pos=clkmeans.clustercenters)
# Add a background if wanted
h, (x, y) = np.histogramdd(np.vstack(Y)[:,:2], weights=np.hstack(MSM.trajectory_weights()), bins=50)
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
plt.xlim(x[[0,-1]])
plt.xticks(np.unique(x.round()))
plt.yticks(np.unique(y.round()))
plt.ylim(y[[0,-1]])
linked_ngl_wd, linked_ax_wd = molpx.visualize.sample(pos, geoms, plt.gca(), dot_color='blue')
linked_ngl_wd
In [ ]:
# Do an MSM with a realistic number of clustercenters
cl_many = pyemma.coordinates.cluster_regspace([iY[:,:2] for iY in Y], dmin=.25)
M = pyemma.msm.estimate_markov_model(cl_many.dtrajs, 20)
cl_many.n_clusters
In [ ]:
# Use this object to sample geometries
pos, geom = molpx.generate.sample(MD_trajfiles, top, cl_many)
In [ ]:
# Find the most representative microstate of each
# and least populated macrostate
M.pcca(3)
dens_max_i = [distro.argmax() for distro in M.metastable_distributions]
A = np.argmax([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
B = np.argmin([M.stationary_distribution[iset].sum() for iset in M.metastable_sets])
print(cl_many.clustercenters[dens_max_i[A]],
cl_many.clustercenters[dens_max_i[B]])
In [ ]:
# Create a TPT object with most_pop, least_pop as source, sink respectively
tpt = pyemma.msm.tpt(M, [dens_max_i[A]], [dens_max_i[B]])
paths, flux = tpt.pathways(fraction=.5)
In [ ]:
# Get a path with a decent number of intermediates
sample_path = paths[np.argmax([len(ipath) for ipath in paths])]
In [ ]:
plt.figure()
plt.contourf(x[:-1], y[:-1], -np.log(h.T), cmap="jet", alpha=.5, zorder=0)
linked_ngl_wdg, linked_ax_wdg = molpx.visualize.sample(cl_many.clustercenters[sample_path],
geom[sample_path].superpose(geom[sample_path[0]]), plt.gca(),
plot_path=True,
)
plt.scatter(*cl_many.clustercenters.T, alpha=.25)
linked_ngl_wdg
In [ ]: