Tutorial Part 6: Going Deeper On Molecular Featurizations

One of the most important steps of doing machine learning on molecular data is transforming this data into a form amenable to the application of learning algorithms. This process is broadly called "featurization" and involves tutrning a molecule into a vector or tensor of some sort. There are a number of different ways of doing such transformations, and the choice of featurization is often dependent on the problem at hand.

In this tutorial, we explore the different featurization methods available for molecules. These featurization methods include:

  1. ConvMolFeaturizer,
  2. WeaveFeaturizer,
  3. CircularFingerprints
  4. RDKitDescriptors
  5. BPSymmetryFunction
  6. CoulombMatrix
  7. CoulombMatrixEig
  8. AdjacencyFingerprints

Colab

This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.

Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.


In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')


TensorFlow 1.x selected.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3477  100  3477    0     0  21867      0 --:--:-- --:--:-- --:--:-- 21867
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing deepchem
done
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

deepchem-2.3.0 installation finished!
CPU times: user 2.9 s, sys: 650 ms, total: 3.55 s
Wall time: 2min 9s

Let's start with some basic imports


In [0]:
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals

import numpy as np
from rdkit import Chem

from deepchem.feat import ConvMolFeaturizer, WeaveFeaturizer, CircularFingerprint
from deepchem.feat import AdjacencyFingerprint, RDKitDescriptors
from deepchem.feat import BPSymmetryFunctionInput, CoulombMatrix, CoulombMatrixEig
from deepchem.utils import conformers

We use propane( $CH_3 CH_2 CH_3 $ ) as a running example throughout this tutorial. Many of the featurization methods use conformers or the molecules. A conformer can be generated using the ConformerGenerator class in deepchem.utils.conformers.

RDKitDescriptors

RDKitDescriptors featurizes a molecule by computing descriptors values for specified descriptors. Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using RDKitDescriptors.allowedDescriptors.

The featurizer uses the descriptors in rdkit.Chem.Descriptors.descList, checks if they are in the list of allowed descriptors and computes the descriptor value for the molecule.


In [0]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

Let's check the allowed list of descriptors. As you will see shortly, there's a wide range of chemical properties that RDKit computes for us.


In [4]:
for descriptor in RDKitDescriptors.allowedDescriptors:
    print(descriptor)


EState_VSA5
NumRotatableBonds
Kappa1
MinPartialCharge
BalabanJ
VSA_EState1
SMR_VSA4
NumHAcceptors
PEOE_VSA6
Chi1n
SlogP_VSA7
Chi0
SMR_VSA3
NumSaturatedCarbocycles
EState_VSA1
SMR_VSA8
NumAromaticHeterocycles
TPSA
VSA_EState9
VSA_EState4
NumAromaticCarbocycles
MaxAbsEStateIndex
SMR_VSA7
PEOE_VSA14
Kappa3
Chi3n
SMR_VSA2
SlogP_VSA8
SMR_VSA1
PEOE_VSA12
NumAliphaticRings
NumAliphaticCarbocycles
MaxPartialCharge
SlogP_VSA6
EState_VSA4
HallKierAlpha
EState_VSA3
PEOE_VSA13
SlogP_VSA10
HeavyAtomCount
NumAliphaticHeterocycles
Chi0n
Kappa2
SlogP_VSA4
VSA_EState7
NHOHCount
PEOE_VSA3
VSA_EState3
RingCount
SlogP_VSA5
EState_VSA8
Chi4v
Chi4n
PEOE_VSA1
EState_VSA10
VSA_EState8
PEOE_VSA2
SMR_VSA5
PEOE_VSA10
Chi0v
MinEStateIndex
Chi1
NOCount
PEOE_VSA9
VSA_EState10
PEOE_VSA8
SMR_VSA10
ExactMolWt
SlogP_VSA11
PEOE_VSA11
SlogP_VSA12
NumSaturatedRings
VSA_EState5
EState_VSA7
MolMR
BertzCT
PEOE_VSA7
EState_VSA11
EState_VSA6
Chi2v
Chi3v
NumAromaticRings
MaxAbsPartialCharge
MolWt
MinAbsPartialCharge
PEOE_VSA4
VSA_EState6
SlogP_VSA9
NumValenceElectrons
MinAbsEStateIndex
Chi2n
HeavyAtomMolWt
MaxEStateIndex
MolLogP
FractionCSP3
NumHDonors
NumHeteroatoms
Chi1v
LabuteASA
Ipc
SMR_VSA6
SlogP_VSA1
SMR_VSA9
EState_VSA9
SlogP_VSA2
EState_VSA2
NumSaturatedHeterocycles
VSA_EState2
NumRadicalElectrons
SlogP_VSA3
PEOE_VSA5

In [5]:
rdkit_desc = RDKitDescriptors()
features = rdkit_desc._featurize(example_mol)

print('The number of descriptors present are: ', len(features))


The number of descriptors present are:  111

BPSymmetryFunction

Behler-Parinello Symmetry function or BPSymmetryFunction featurizes a molecule by computing the atomic number and coordinates for each atom in the molecule. The features can be used as input for symmetry functions, like RadialSymmetry, DistanceMatrix and DistanceCutoff . More details on these symmetry functions can be found in this paper. These functions can be found in deepchem.feat.coulomb_matrices

The featurizer takes in max_atoms as an argument. As input, it takes in a conformer of the molecule and computes:

  1. coordinates of every atom in the molecule (in Bohr units)
  2. the atomic numbers for all atoms.

These features are concantenated and padded with zeros to account for different number of atoms, across molecules.


In [0]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

Let's now take a look at the actual featurized matrix that comes out.


In [7]:
bp_sym = BPSymmetryFunctionInput(max_atoms=20)
features = bp_sym._featurize(mol=example_mol)
features


Out[7]:
array([[ 6.        ,  2.33166293, -0.52962788, -0.48097309],
       [ 6.        ,  0.0948792 ,  1.07597567, -1.33579553],
       [ 6.        , -2.40436371, -0.29483572, -0.90388318],
       [ 1.        ,  2.18166462, -0.95639011,  1.569049  ],
       [ 1.        ,  4.1178375 ,  0.51816193, -0.81949623],
       [ 1.        ,  2.39319787, -2.32844253, -1.56157176],
       [ 1.        ,  0.29919987,  1.51730566, -3.37889252],
       [ 1.        ,  0.08875543,  2.88229706, -0.26437996],
       [ 1.        , -3.99100651,  0.92016315, -1.54358853],
       [ 1.        , -2.66167993, -0.71627602,  1.136556  ],
       [ 1.        , -2.45014726, -2.08833123, -1.99406318],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]])

A simple check for the featurization would be to count the different atomic numbers present in the features.


In [8]:
atomic_numbers = features[:, 0]
from collections import Counter

unique_numbers = Counter(atomic_numbers)
print(unique_numbers)


Counter({0.0: 9, 1.0: 8, 6.0: 3})

For propane, we have $3$ C-atoms and $8$ H-atoms, and these numbers are in agreement with the results shown above. There's also the additional padding of 9 atoms, to equalize with max_atoms.

CoulombMatrix

CoulombMatrix, featurizes a molecule by computing the coulomb matrices for different conformers of the molecule, and returning it as a list.

A Coulomb matrix tries to encode the energy structure of a molecule. The matrix is symmetric, with the off-diagonal elements capturing the Coulombic repulsion between pairs of atoms and the diagonal elements capturing atomic energies using the atomic numbers. More information on the functional forms used can be found here.

The featurizer takes in max_atoms as an argument and also has options for removing hydrogens from the molecule (remove_hydrogens), generating additional random coulomb matrices(randomize), and getting only the upper triangular matrix (upper_tri).


In [9]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))


Number of available conformers for propane:  1

In [0]:
coulomb_mat = CoulombMatrix(max_atoms=20, randomize=False, remove_hydrogens=False, upper_tri=False)
features = coulomb_mat._featurize(mol=example_mol)

A simple check for the featurization is to see if the feature list has the same length as the number of conformers


In [11]:
print(len(example_mol.GetConformers()) == len(features))


True

CoulombMatrixEig

CoulombMatrix is invariant to molecular rotation and translation, since the interatomic distances or atomic numbers do not change. However the matrix is not invariant to random permutations of the atom's indices. To deal with this, the CoulumbMatrixEig featurizer was introduced, which uses the eigenvalue spectrum of the columb matrix, and is invariant to random permutations of the atom's indices.

CoulombMatrixEig inherits from CoulombMatrix and featurizes a molecule by first computing the coulomb matrices for different conformers of the molecule and then computing the eigenvalues for each coulomb matrix. These eigenvalues are then padded to account for variation in number of atoms across molecules.

The featurizer takes in max_atoms as an argument and also has options for removing hydrogens from the molecule (remove_hydrogens), generating additional random coulomb matrices(randomize).


In [12]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))


Number of available conformers for propane:  1

In [0]:
coulomb_mat_eig = CoulombMatrixEig(max_atoms=20, randomize=False, remove_hydrogens=False)
features = coulomb_mat_eig._featurize(mol=example_mol)

In [14]:
print(len(example_mol.GetConformers()) == len(features))


True

Adjacency Fingerprints

TODO(rbharath): This tutorial still needs to be expanded out with the additional fingerprints.

Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

Star DeepChem on GitHub

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

Join the DeepChem Gitter

The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!