One of the most important steps in doing machine learning on molecular data is transforming that data into a form that learning algorithms can consume. This process is broadly called "featurization" and involves turning a molecule into a vector or tensor of some sort. There are a number of different ways of doing such transformations, and the choice of featurization often depends on the problem at hand.
In this tutorial, we explore the different featurization methods available for molecules. These featurization methods include:
ConvMolFeaturizer
WeaveFeaturizer
CircularFingerprint
RDKitDescriptors
BPSymmetryFunction
CoulombMatrix
CoulombMatrixEig
AdjacencyFingerprint
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.
In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')
Let's start with some basic imports:
In [0]:
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
import numpy as np
from rdkit import Chem
from deepchem.feat import ConvMolFeaturizer, WeaveFeaturizer, CircularFingerprint
from deepchem.feat import AdjacencyFingerprint, RDKitDescriptors
from deepchem.feat import BPSymmetryFunctionInput, CoulombMatrix, CoulombMatrixEig
from deepchem.utils import conformers
We use propane ($CH_3 CH_2 CH_3$) as a running example throughout this tutorial. Many of the featurization methods use conformers of the molecule. A conformer can be generated using the ConformerGenerator class in deepchem.utils.conformers.
RDKitDescriptors featurizes a molecule by computing descriptor values for a specified set of descriptors. Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using RDKitDescriptors.allowedDescriptors.
The featurizer iterates over the descriptors in rdkit.Chem.Descriptors.descList, checks whether each one is in the list of allowed descriptors, and if so computes the descriptor value for the molecule.
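The filtering step described above can be sketched directly against RDKit. This is only an illustration of the idea, not the featurizer's actual implementation: the `allowed` set below is a small hypothetical allow-list chosen for the example, whereas the real list lives in RDKitDescriptors.allowedDescriptors.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical allow-list, chosen only for this illustration.
allowed = {"MolWt", "MolLogP", "TPSA"}

mol = Chem.MolFromSmiles("CCC")  # propane

# Descriptors.descList is a list of (name, function) pairs.
# Keep only the allowed descriptors and evaluate each on the molecule.
values = {name: fn(mol) for name, fn in Descriptors.descList if name in allowed}

for name in sorted(values):
    print(name, values[name])
```

Restricting the computation to an allow-list keeps the feature vector at a fixed, known length regardless of which descriptors a given RDKit version ships.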
In [0]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
Let's check the allowed list of descriptors. As you will see shortly, there's a wide range of chemical properties that RDKit computes for us.
In [4]:
for descriptor in RDKitDescriptors.allowedDescriptors:
    print(descriptor)
In [5]:
rdkit_desc = RDKitDescriptors()
features = rdkit_desc._featurize(example_mol)
print('The number of descriptors present is: ', len(features))
The Behler-Parrinello symmetry function featurizer, BPSymmetryFunction (imported in this tutorial as BPSymmetryFunctionInput), featurizes a molecule by computing the atomic number and coordinates for each atom in the molecule. The features can be used as input for symmetry functions such as RadialSymmetry, DistanceMatrix and DistanceCutoff. More details on these symmetry functions can be found in this paper. These functions can be found in deepchem.feat.coulomb_matrices.
The featurizer takes in max_atoms as an argument. As input, it takes a conformer of the molecule and computes the atomic number and the coordinates of each atom.
These features are concatenated and padded with zeros to account for the different numbers of atoms across molecules.
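As a rough sketch of the concatenate-and-pad step, the NumPy snippet below stacks one row per atom and pads with zero rows up to max_atoms. The helper name `pad_atom_features` and the column layout `[Z, x, y, z]` are assumptions made for illustration (the tutorial only relies on column 0 holding the atomic number), and the coordinates are random placeholders rather than a real conformer.

```python
import numpy as np
from collections import Counter

def pad_atom_features(atomic_numbers, coords, max_atoms):
    """Stack [Z, x, y, z] per atom and zero-pad the rows up to max_atoms.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    z = np.asarray(atomic_numbers, dtype=float).reshape(-1, 1)
    xyz = np.asarray(coords, dtype=float)
    n_atoms = z.shape[0]
    if n_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    features = np.hstack([z, xyz])      # shape (n_atoms, 4)
    padded = np.zeros((max_atoms, 4))   # rows beyond n_atoms stay zero
    padded[:n_atoms] = features
    return padded

# Propane with explicit hydrogens: 3 carbons followed by 8 hydrogens.
z = [6, 6, 6] + [1] * 8
rng = np.random.default_rng(0)
xyz = rng.normal(size=(len(z), 3))      # placeholder coordinates

padded = pad_atom_features(z, xyz, max_atoms=20)
print(padded.shape)
print(Counter(padded[:, 0].tolist()))
```

Padding every molecule to the same number of rows is what lets a batch of molecules of different sizes be stored in a single array.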
In [0]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
Let's now take a look at the actual featurized matrix that comes out.
In [7]:
bp_sym = BPSymmetryFunctionInput(max_atoms=20)
features = bp_sym._featurize(mol=example_mol)
features
Out[7]:
A simple check for the featurization would be to count the different atomic numbers present in the features.
In [8]:
atomic_numbers = features[:, 0]
from collections import Counter
unique_numbers = Counter(atomic_numbers)
print(unique_numbers)
For propane, we have $3$ C-atoms and $8$ H-atoms, and these numbers are in agreement with the results shown above. There's also the additional padding of 9 atoms, to equalize with max_atoms.
CoulombMatrix featurizes a molecule by computing the Coulomb matrices for different conformers of the molecule and returning them as a list.
A Coulomb matrix tries to encode the energy structure of a molecule. The matrix is symmetric, with the off-diagonal elements capturing the Coulombic repulsion between pairs of atoms and the diagonal elements capturing atomic energies using the atomic numbers. More information on the functional forms used can be found here.
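To make the structure of the matrix concrete, here is a minimal NumPy sketch. It assumes the commonly used functional form, with diagonal entries $0.5 Z_i^{2.4}$ and off-diagonal entries $Z_i Z_j / |R_i - R_j|$; whether the featurizer uses exactly this form and which distance units it expects should be checked against the reference above. The function name `coulomb_matrix` and the three-atom, water-like geometry are illustrative only.

```python
import numpy as np

def coulomb_matrix(atomic_numbers, coords):
    """Coulomb matrix under the assumed form:
    M_ii = 0.5 * Z_i**2.4, M_ij = Z_i * Z_j / |R_i - R_j|.
    """
    z = np.asarray(atomic_numbers, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise interatomic distances, shape (n_atoms, n_atoms).
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Avoid dividing by zero on the diagonal; it is overwritten below.
    np.fill_diagonal(dist, 1.0)
    m = np.outer(z, z) / dist
    np.fill_diagonal(m, 0.5 * z ** 2.4)
    return m

# Illustrative geometry: one oxygen and two hydrogens.
z = [8, 1, 1]
xyz = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]

m = coulomb_matrix(z, xyz)
print(m)
print(np.allclose(m, m.T))  # the matrix is symmetric
```

Because the entries depend only on atomic numbers and interatomic distances, rotating or translating the whole molecule leaves the matrix unchanged.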
The featurizer takes in max_atoms as an argument and also has options for removing hydrogens from the molecule (remove_hydrogens), generating additional random Coulomb matrices (randomize), and getting only the upper triangular matrix (upper_tri).
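Since the matrix is symmetric, its upper triangle carries all of the information. The sketch below shows one way to extract it with NumPy and to rebuild the full matrix afterwards. It assumes the diagonal is included in the upper triangle, and it is not necessarily how the upper_tri option is implemented internally.

```python
import numpy as np

# Stand-in symmetric 4x4 matrix (not a real Coulomb matrix).
a = np.arange(16, dtype=float).reshape(4, 4)
m = (a + a.T) / 2

# Flatten the upper triangle, diagonal included: n * (n + 1) / 2 values.
iu = np.triu_indices_from(m)
upper = m[iu]
print(upper.shape)  # (10,)

# Rebuild the full symmetric matrix from the flattened upper triangle.
full = np.zeros_like(m)
full[iu] = upper
full = full + full.T - np.diag(np.diag(full))
print(np.allclose(full, m))
```

For n atoms this reduces the feature length from n * n to n * (n + 1) / 2 without losing information.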
In [9]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
In [0]:
coulomb_mat = CoulombMatrix(max_atoms=20, randomize=False, remove_hydrogens=False, upper_tri=False)
features = coulomb_mat._featurize(mol=example_mol)
A simple check for the featurization is to see if the feature list has the same length as the number of conformers:
In [11]:
print(len(example_mol.GetConformers()) == len(features))
CoulombMatrix is invariant to molecular rotation and translation, since neither the interatomic distances nor the atomic numbers change. However, the matrix is not invariant to permutations of the atom indices. To deal with this, the CoulombMatrixEig featurizer was introduced, which uses the eigenvalue spectrum of the Coulomb matrix and is invariant to permutations of the atom indices.
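The permutation argument can be checked numerically. The sketch below uses a random symmetric matrix as a stand-in for a Coulomb matrix (an assumption made to keep the example self-contained) and reorders its rows and columns with the same permutation, which is what relabeling the atoms does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric matrix standing in for a 5-atom Coulomb matrix.
a = rng.normal(size=(5, 5))
m = (a + a.T) / 2

# Relabel the atoms: apply the same permutation to rows and columns.
perm = [2, 0, 4, 1, 3]
m_perm = m[np.ix_(perm, perm)]

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eig = np.linalg.eigvalsh(m)
eig_perm = np.linalg.eigvalsh(m_perm)

print(np.allclose(m, m_perm))      # the matrices differ
print(np.allclose(eig, eig_perm))  # the sorted spectra agree
```

Relabeling the atoms is a similarity transform by a permutation matrix, so the set of eigenvalues is preserved even though the individual matrix entries move.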
CoulombMatrixEig inherits from CoulombMatrix and featurizes a molecule by first computing the Coulomb matrices for different conformers of the molecule and then computing the eigenvalues of each Coulomb matrix. These eigenvalues are then padded to account for variation in the number of atoms across molecules.
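The eigenvalue-then-pad step might look like the following sketch. The helper name `padded_spectrum` is hypothetical, and ordering the eigenvalues by decreasing absolute value is an assumption made here so that the zero padding always lands at the end; the featurizer's own ordering should be checked in its source.

```python
import numpy as np

def padded_spectrum(matrix, max_atoms):
    """Eigenvalues of a symmetric matrix, zero-padded to length max_atoms.

    Hypothetical helper for illustration; not part of DeepChem.
    """
    eigvals = np.linalg.eigvalsh(matrix)
    # Assumed ordering: largest absolute value first.
    order = np.argsort(np.abs(eigvals))[::-1]
    eigvals = eigvals[order]
    out = np.zeros(max_atoms)
    out[:eigvals.size] = eigvals
    return out

rng = np.random.default_rng(1)

def random_symmetric(n):
    a = rng.normal(size=(n, n))
    return (a + a.T) / 2

# Two stand-in "molecules" with different atom counts.
small = padded_spectrum(random_symmetric(3), max_atoms=20)
large = padded_spectrum(random_symmetric(5), max_atoms=20)

print(small.shape, large.shape)
```

Both molecules end up with a feature vector of the same length, which is what allows them to be batched together.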
The featurizer takes in max_atoms as an argument and also has options for removing hydrogens from the molecule (remove_hydrogens) and generating additional random Coulomb matrices (randomize).
In [12]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
In [0]:
coulomb_mat_eig = CoulombMatrixEig(max_atoms=20, randomize=False, remove_hydrogens=False)
features = coulomb_mat_eig._featurize(mol=example_mol)
In [14]:
print(len(example_mol.GetConformers()) == len(features))
TODO(rbharath): This tutorial still needs to be expanded out with the additional fingerprints.
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!