How can we manually inspect / interpret components of a tICA model?

Here we'll try a couple things:

  • First, we need a way to relate the index within the feature vector to a quantity we care about (e.g. the atom numbers or residue numbers it depends on)
    • For dihedrals, we'll retrofit msmbuilder.featurizer.DihedralFeaturizer to report the atoms involved in each feature
  • Then, to try to interpret each component, we can:
    • Directly inspect the eigenvector. Large elements of the eigenvector should correspond to features that are important, and near-zero elements should correspond to features that are unimportant.
      • This will probably need to be scaled somehow
    • Project the dataset onto the component of interest, then compute correlations between some interpretable input features and that projection.
  • Finally, we can perhaps visualize this by coloring all the atoms in a protein structure or MD movie.

In [1]:
# basic imports
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
%matplotlib inline

# load and process example data
from msmbuilder.example_datasets import FsPeptide
from msmbuilder.featurizer import DihedralFeaturizer
from msmbuilder.decomposition import tICA

fs = FsPeptide().get().trajectories
n_atoms = fs[0].n_atoms

dih_model = DihedralFeaturizer(['phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', 'chi4'])
fs_dih_feat = dih_model.transform(fs)

tica = tICA(lag_time=100)  # lag time of 100 frames
tica.fit(fs_dih_feat)

# X is a list of tica-transformed arrays, one per trajectory
X = tica.transform(fs_dih_feat)
# X_ is a vertically concatenated array of these
X_ = np.vstack(X)

# plot the 2D histogram of the data projected onto tIC 1 and tIC 2
plt.hexbin(X_[:,0],X_[:,1],bins='log',gridsize=150,cmap='Blues')
plt.xlabel('tIC 1')
plt.ylabel('tIC 2')
plt.title('All trajectories pooled')


loading trajectory_1.xtc ... loading trajectory_28.xtc (28 trajectories)
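
Before digging into individual components, it can help to glance at the tICA timescales to see how many components are slow enough to be worth interpreting. A minimal sketch, assuming the fitted model exposes the timescales_ attribute (as recent msmbuilder versions do):

In [ ]:
# sketch: how slow is each component? timescales_ is in units of frames here
plt.plot(tica.timescales_[:10], 'o-')
plt.xlabel('tIC index')
plt.ylabel('Implied timescale (frames)')
plt.title('tICA timescales')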

Let's look at tIC 1.


In [2]:
plt.title('Eigenvector corresponding to tIC 1')
# note: depending on the msmbuilder version, eigenvectors_ may store the
# eigenvectors as columns, in which case tica.eigenvectors_[:, 0] would be
# the eigenvector for tIC 1
plt.plot(tica.eigenvectors_[0])
plt.xlabel('Index within feature vector')
plt.ylabel('Coefficient')



We mostly care about the magnitudes here, so let's just look at the absolute values of the coefficients.


In [3]:
plt.title('Abs. value of eigenvector corresponding to tIC 1')
plt.plot(np.abs(tica.eigenvectors_[0]))
plt.xlabel('Index within feature vector')
plt.ylabel('|Coefficient|')
print(np.argmax(np.abs(tica.eigenvectors_[0])))


47

What are the variances of the input features?


In [4]:
plt.title('Variance of input features')
plt.plot(np.vstack(fs_dih_feat).std(0)**2)
plt.xlabel('Index within feature vector')
plt.ylabel('Variance')



If an input feature has high variance, the model can make a large contribution to the projection even with a small coefficient, so the raw coefficients alone can be misleading. We need to account for that somehow:


In [5]:
plt.plot(np.abs(tica.eigenvectors_[0])/np.vstack(fs_dih_feat).std(0)**2)
plt.title('Variance-normalized eigenvector')
plt.xlabel('Index within feature vector')
plt.ylabel('Variance-normalized coefficient')
print(np.argmax(np.abs(tica.eigenvectors_[0])/np.vstack(fs_dih_feat).std(0)**2))


108

Hmm, this doesn't look quite right: dividing by the variance actually up-weights the low-variance features rather than measuring how much each feature contributes to the projection. One alternative scaling is sketched below; after that, let's try the other approach from the outline and look at how each input feature correlates with the projection onto tIC 1.
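
A plausible alternative (just a sketch, not the approach taken next) is to weight each coefficient by the feature's standard deviation, since a feature's contribution to the projection scales roughly as the coefficient times its standard deviation:

In [ ]:
# sketch: weight |coefficient| by each feature's standard deviation
feat_std = np.vstack(fs_dih_feat).std(0)
contribution = np.abs(tica.eigenvectors_[0]) * feat_std
plt.plot(contribution)
plt.title('Std-weighted eigenvector (sketch)')
plt.xlabel('Index within feature vector')
plt.ylabel('|Coefficient| * std')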


In [6]:
# with equal-length inputs and the default mode, np.correlate returns the
# unnormalized inner product (zero-lag cross-correlation) of the two time series
np.correlate(X_[:,0],np.vstack(fs_dih_feat)[:,0])[0]


Out[6]:
-3074.3853715577807

In [7]:
# inner product of tIC 1 with every column of the feature matrix
stacked_feat = np.vstack(fs_dih_feat)
corr_coefs = np.array([np.correlate(X_[:,0], stacked_feat[:,i])[0] for i in range(stacked_feat.shape[1])])

In [8]:
plt.plot(corr_coefs)
plt.figure()
plt.plot(np.abs(corr_coefs))
print(np.argmax(np.abs(corr_coefs)))


77
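
Note that np.correlate with its default mode returns the zero-lag inner product rather than a normalized correlation coefficient, so the magnitudes depend on the number of frames. If a proper Pearson correlation is preferred, a minimal sketch using np.corrcoef:

In [ ]:
# sketch: Pearson correlation between tIC 1 and each input feature
feats = np.vstack(fs_dih_feat)
pearson = np.array([np.corrcoef(X_[:, 0], feats[:, i])[0, 1]
                    for i in range(feats.shape[1])])
plt.plot(np.abs(pearson))
plt.xlabel('Index within feature vector')
plt.ylabel('|Pearson r| with tIC 1')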

In [9]:
import mdtraj as md

In [10]:
# retrofitting the dihedral featurizer to keep track of which atoms correspond
# to which elements of the feature vector
class LabeledDihedralFeaturizer(DihedralFeaturizer):
    def partial_transform(self, traj):
        x = []
        self.labels = []     # dihedral type for each block of features
        self.atom_list = []  # one row of atom indices per feature-vector element

        for a in self.types:
            func = getattr(md, 'compute_%s' % a)
            atoms, y = func(traj)
            self.labels.append(a)
            if self.sincos:
                # the sin and cos blocks each get their own copy of the atom indices
                self.atom_list.append(np.vstack([atoms]*2))
                x.extend([np.sin(y), np.cos(y)])
            else:
                self.atom_list.append(atoms)
                x.append(y)

        # note: these attributes are overwritten on every call, which is fine here
        # since all trajectories share the same topology
        self.atom_list = np.vstack(self.atom_list)
        return np.hstack(x)

In [11]:
labeled_dih_model = LabeledDihedralFeaturizer(['phi', 'psi', 'omega', 'chi1', 'chi2', 'chi3', 'chi4'])
fs_dih_feat = labeled_dih_model.transform(fs)

In [12]:
labeled_dih_model.atom_list.shape


Out[12]:
(148, 4)
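
As a quick sanity check, each row of atom_list should correspond to one element of the feature vector, so the row count should match the feature-vector length:

In [ ]:
# sanity check: one row of atom indices per feature
assert labeled_dih_model.atom_list.shape[0] == np.vstack(fs_dih_feat).shape[1]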

In [18]:
# now convert from a measure of feature importance to a measure of atom importance
atomwise_scores = np.zeros(n_atoms)
featurewise_scores = np.abs(corr_coefs)

for i,feature in enumerate(labeled_dih_model.atom_list):
    for atom in feature:
        atomwise_scores[atom] += featurewise_scores[i]
# normalize so the maximum atom score is 1
atomwise_scores /= np.max(atomwise_scores)

In [64]:
plt.plot(atomwise_scores)
plt.xlabel('Atom #')
plt.ylabel('Score')
plt.title('Atomwise scores')


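The outline at the top also mentioned residue numbers. A minimal sketch (assuming all trajectories share the topology of fs[0]) that prints the residues of the highest-scoring atoms:

In [ ]:
# map the top-scoring atoms back to residues via the mdtraj topology
top = fs[0].topology
for atom_index in np.argsort(atomwise_scores)[::-1][:10]:
    atom = top.atom(atom_index)
    print('%s (residue %s)  score=%.2f' % (atom.name, atom.residue, atomwise_scores[atom_index]))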

In [60]:
# thin trajectory 0 by a factor of 2, then keep the first 200 frames
traj = fs[0][::2][:200]
traj


Out[60]:
<mdtraj.Trajectory with 200 frames, 264 atoms, 23 residues, without unitcells at 0x130b42050>

In [61]:
# save with the atomwise scores as B-factors, to make it easy to color in PyMOL
traj.save_pdb('fs_peptide.pdb',bfactors=atomwise_scores)

Problem: this is difficult to view since the entire peptide tumbles around a lot. Let's align everything to a common reference...


In [66]:
traj.superpose(traj, len(traj) // 2)  # align every frame to the middle frame of the trajectory
traj.save_pdb('fs_peptide.pdb', bfactors=atomwise_scores)
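
To color by these scores in PyMOL (they are stored in the PDB's B-factor column), something along these lines should work via PyMOL's Python API; this is a sketch and assumes the pymol module is importable in your session:

In [ ]:
# sketch: color the structure by the saved B-factor scores inside PyMOL
from pymol import cmd
cmd.load('fs_peptide.pdb')
cmd.spectrum('b', 'blue_white_red')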
