This example demonstrates one possible way to cluster data sets that are too large to fit into memory, using MDTraj and scipy.cluster. The idea behind the algorithm is that we'll cluster every N-th frame directly, and then, holding those clusters fixed, "assign" each remaining frame to the nearest cluster. It's not the most sophisticated algorithm, but it's a good demonstration of how MDTraj can be integrated with other data science tools.
In [ ]:
import random
from collections import defaultdict
import mdtraj as md
import numpy as np
import scipy.cluster.hierarchy
In [ ]:
stride = 5
subsampled = md.load('ala2.h5', stride=stride)
print(subsampled)
In [ ]:
# Compute the pairwise RMSD between all of the frames. This requires
# N^2 memory, which is why we might need to stride.
distances = np.empty((subsampled.n_frames, subsampled.n_frames))
for i in range(subsampled.n_frames):
    distances[i] = md.rmsd(subsampled, subsampled, i)
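To get a feel for why striding matters, here is a rough sketch of the N^2 memory cost of the square distance matrix, using a hypothetical frame count (not tied to ala2.h5):

```python
import numpy as np

# Hypothetical subsampled frame count -- adjust to your own trajectory.
n_frames = 2000
# The square distance matrix above is allocated as float64 by np.empty.
matrix_bytes = n_frames ** 2 * np.dtype(np.float64).itemsize
print(matrix_bytes / 1e6)  # 32.0 (MB) for 2000 frames
```

Doubling the number of frames quadruples this figure, which is what makes clustering every frame directly impractical for long trajectories.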
In [ ]:
# Now that we have the distances, we can use our favorite clustering
# algorithm. I like ward. (Note that scipy's ward() treats the rows of
# this square distance matrix as observation vectors -- a common shortcut.)
n_clusters = 3
linkage = scipy.cluster.hierarchy.ward(distances)
labels = scipy.cluster.hierarchy.fcluster(linkage, t=n_clusters, criterion='maxclust')
print(labels)
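As a self-contained illustration of the ward/fcluster pattern used above (on toy 2-D points standing in for the RMSD matrix rows, with an assumed blob layout):

```python
import numpy as np
import scipy.cluster.hierarchy

# Three well-separated blobs of 10 points each -- a toy stand-in for
# the rows of the RMSD matrix.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc, 0.1, size=(10, 2))
                    for loc in (0.0, 5.0, 10.0)])

linkage = scipy.cluster.hierarchy.ward(points)
# criterion='maxclust' cuts the tree into at most t flat clusters;
# the returned labels start at 1, not 0.
labels = scipy.cluster.hierarchy.fcluster(linkage, t=3, criterion='maxclust')
print(sorted(np.bincount(labels)[1:]))  # [10, 10, 10]
```

The 1-based labels are worth remembering: they flow through the rest of this notebook as dictionary keys, not as array indices.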
In [ ]:
# Now, we need to extract n_leaders_per_cluster random samples from each
# of the clusters. One way to do this is by building a map from each of
# the cluster labels to the list of the indices of the subsampled confs
# which belong to it.
mapping = defaultdict(list)
for i, label in enumerate(labels):
    mapping[label].append(i)
print(mapping)
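An equivalent way to build this label-to-indices map, sketched here on a toy label array, is a dict comprehension over `np.where`:

```python
import numpy as np

# Toy stand-in for the cluster labels produced by fcluster.
labels = np.array([1, 2, 1, 3, 2, 1])
mapping = {label: np.where(labels == label)[0].tolist()
           for label in np.unique(labels)}
print(mapping)  # {1: [0, 2, 5], 2: [1, 4], 3: [3]}
```

Both forms give the same result; the defaultdict loop above simply avoids scanning the label array once per cluster.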
In [ ]:
# Now we can iterate through the mapping and select n_leaders_per_cluster
# random samples from each list. As we select them, we'll extract the
# conformation and append it to a new trajectory called `leaders`, which
# will start empty.
n_leaders_per_cluster = 2
leaders = md.Trajectory(xyz=np.empty((0, subsampled.n_atoms, 3)),
                        topology=subsampled.topology)
leader_labels = []
for label, indices in mapping.items():
    leaders = leaders.join(subsampled[np.random.choice(indices, n_leaders_per_cluster)])
    leader_labels.extend([label] * n_leaders_per_cluster)
print(leaders)
print(leader_labels)
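One caveat: `np.random.choice` samples *with* replacement by default, so a small cluster could contribute the same frame twice as a leader. If that matters for your data, a minimal sketch of drawing distinct indices (on a toy index list, not the notebook's data) looks like:

```python
import numpy as np

rng = np.random.default_rng(42)
indices = [3, 7, 11, 19]  # toy stand-in for one cluster's frame indices
# replace=False guarantees two *distinct* frames; it raises ValueError
# if the cluster has fewer members than requested leaders.
picks = rng.choice(indices, size=2, replace=False)
print(len(set(picks)))  # 2
```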
In [ ]:
# Now our `leaders` trajectory contains a set of representative conformations
# for each cluster. Here comes the second pass of the two-pass clustering.
# Let's now consider these clusters as fixed objects and iterate through
# *every* frame in our data set, assigning each frame to the cluster
# it's closest to. We take the simple approach here of computing the distance
# from each frame to each leader and assigning the frame to the cluster whose
# leader is closest.
# Note that this whole algorithm never requires having the entire
# dataset in memory at once.
labels = []
for frame in md.iterload('ala2.h5', chunk=1):
    labels.append(leader_labels[np.argmin(md.rmsd(leaders, frame, 0))])
labels = np.array(labels)
print(labels)
print(labels.shape)
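The assignment step above can be isolated into a self-contained toy (with made-up leader labels and distances, standing in for the `md.rmsd` call):

```python
import numpy as np

# Toy stand-in for one iteration of the second pass: the RMSD from a
# single frame to each of six leaders, plus each leader's cluster label.
leader_labels = [2, 2, 1, 1, 3, 3]
rmsd_to_leaders = np.array([0.8, 0.9, 0.1, 0.4, 0.7, 0.6])

# The frame inherits the label of its nearest leader.
assigned = leader_labels[np.argmin(rmsd_to_leaders)]
print(assigned)  # 1
```

Because each chunk only needs its distances to the handful of leaders, the second pass is O(n_frames * n_leaders) in time and O(n_leaders) in memory, which is what lets the full data set stay on disk.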