One of the most important steps in doing machine learning on molecular data is transforming this data into a form amenable to learning algorithms. This process is broadly called "featurization" and involves turning a molecule into a vector or tensor of some sort. There are a number of different ways of doing such transformations, and the choice of featurization often depends on the problem at hand.

In this tutorial, we explore the different featurization methods available for molecules. These featurization methods include:

- `ConvMolFeaturizer`
- `WeaveFeaturizer`
- `CircularFingerprint`
- `RDKitDescriptors`
- `BPSymmetryFunction`
- `CoulombMatrix`
- `CoulombMatrixEig`
- `AdjacencyFingerprint`

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

```
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')
```

Let's start with some basic imports.

```
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
import numpy as np
from rdkit import Chem
from deepchem.feat import ConvMolFeaturizer, WeaveFeaturizer, CircularFingerprint
from deepchem.feat import AdjacencyFingerprint, RDKitDescriptors
from deepchem.feat import BPSymmetryFunctionInput, CoulombMatrix, CoulombMatrixEig
from deepchem.utils import conformers
```

We use `propane` ($CH_3 CH_2 CH_3$) as a running example throughout this tutorial. Many of the featurization methods use conformers of the molecules. A conformer can be generated using the `ConformerGenerator` class in `deepchem.utils.conformers`.

`RDKitDescriptors` featurizes a molecule by computing descriptor values for a specified set of descriptors. Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using `RDKitDescriptors.allowedDescriptors`.

The featurizer goes through the descriptors in `rdkit.Chem.Descriptors.descList`, checks whether each one is in the list of allowed descriptors, and if so computes the descriptor value for the molecule.

```
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
```

```
# Print the name of every descriptor the featurizer is allowed to compute
for descriptor in RDKitDescriptors.allowedDescriptors:
    print(descriptor)
```

```
rdkit_desc = RDKitDescriptors()
features = rdkit_desc._featurize(example_mol)
print('The number of descriptors present are: ', len(features))
```

The `Behler-Parrinello Symmetry function` featurizer, or `BPSymmetryFunction` (imported above as `BPSymmetryFunctionInput`), featurizes a molecule by computing the atomic number and coordinates for each atom in the molecule. The features can be used as input for symmetry functions, like `RadialSymmetry`, `DistanceMatrix` and `DistanceCutoff`. More details on these symmetry functions can be found in this paper. These functions can be found in `deepchem.feat.coulomb_matrices`.

The featurizer takes in `max_atoms` as an argument. As input, it takes in a conformer of the molecule and computes:

- coordinates of every atom in the molecule (in Bohr units)
- the atomic numbers for all atoms.

These features are concatenated and padded with zeros to account for the different number of atoms across molecules.
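The concatenation and zero-padding described above can be sketched in plain NumPy. This is only an illustration, not DeepChem's implementation: the helper name `pad_atom_features` and the toy coordinates are hypothetical. The column layout (atomic number first, then x, y, z) is an assumption, based on the later cell that reads atomic numbers from `features[:, 0]`.

```python
import numpy as np

# Conversion factor from Angstrom to Bohr (atomic units of length)
BOHR_PER_ANGSTROM = 1.8897261


def pad_atom_features(atomic_numbers, coords_angstrom, max_atoms):
    """Hypothetical helper: stack [Z, x, y, z] per atom, then zero-pad to max_atoms rows."""
    z = np.asarray(atomic_numbers, dtype=float).reshape(-1, 1)
    xyz = np.asarray(coords_angstrom, dtype=float) * BOHR_PER_ANGSTROM
    features = np.hstack([z, xyz])  # shape (n_atoms, 4)
    n_atoms = features.shape[0]
    if n_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    # Append all-zero rows so every molecule yields a (max_atoms, 4) array
    return np.pad(features, ((0, max_atoms - n_atoms), (0, 0)), mode="constant")


# Toy three-atom input with made-up coordinates (not a real conformer)
toy_numbers = [6, 1, 1]
toy_coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
padded = pad_atom_features(toy_numbers, toy_coords, max_atoms=5)
print(padded.shape)
```

Padding rows have atomic number 0, which is why a count of the first column over a padded array reports a group of zeros alongside the real elements.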

```
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
```

Let's now take a look at the actual featurized matrix that comes out.

```
bp_sym = BPSymmetryFunctionInput(max_atoms=20)
features = bp_sym._featurize(mol=example_mol)
features
```

```
from collections import Counter

# The first column of the feature matrix holds the atomic numbers
atomic_numbers = features[:, 0]
unique_numbers = Counter(atomic_numbers)
print(unique_numbers)
```

The propane molecule has $3$ `C-atoms` and $8$ `H-atoms`, and these numbers are in agreement with the results shown above. There's also the additional padding of 9 atoms, to equalize with `max_atoms`.

`CoulombMatrix` featurizes a molecule by computing the Coulomb matrices for different conformers of the molecule and returning them as a list.

A Coulomb matrix tries to encode the energy structure of a molecule. The matrix is symmetric, with the off-diagonal elements capturing the Coulombic repulsion between pairs of atoms and the diagonal elements capturing atomic energies using the atomic numbers. More information on the functional forms used can be found here.
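As a rough illustration of that structure, the sketch below builds a Coulomb matrix with the functional form commonly used in the literature (Rupp et al., 2012): diagonal entries $0.5 Z_i^{2.4}$ and off-diagonal entries $Z_i Z_j / |R_i - R_j|$. Whether DeepChem uses exactly this form and these units is an assumption here. The function name `coulomb_matrix_sketch` and the toy geometry are hypothetical.

```python
import numpy as np


def coulomb_matrix_sketch(atomic_numbers, coords):
    """Hypothetical helper: Coulomb matrix in the Rupp et al. (2012) form."""
    z = np.asarray(atomic_numbers, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise interatomic distances, shape (n_atoms, n_atoms)
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Placeholder on the diagonal to avoid dividing by zero; overwritten below
    np.fill_diagonal(dist, 1.0)
    matrix = np.outer(z, z) / dist           # off-diagonal: Z_i * Z_j / |R_i - R_j|
    np.fill_diagonal(matrix, 0.5 * z ** 2.4)  # diagonal: 0.5 * Z_i ** 2.4
    return matrix


# Toy three-atom geometry with made-up coordinates (not a real conformer)
toy_numbers = [6, 1, 1]
toy_coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_cm = coulomb_matrix_sketch(toy_numbers, toy_coords)
print(toy_cm)
```

Because the entries depend only on atomic numbers and interatomic distances, rotating or translating the whole molecule leaves the matrix unchanged.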

The featurizer takes in `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`), generating additional random Coulomb matrices (`randomize`), and getting only the upper triangular matrix (`upper_tri`).
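Since a Coulomb matrix is symmetric, its upper triangle carries all of the information. The sketch below shows one way to flatten the upper triangle of a symmetric matrix with NumPy. It is meant only to illustrate the idea behind `upper_tri`; whether DeepChem includes the diagonal or uses this exact ordering is an assumption. The matrix values are made up.

```python
import numpy as np

# A small symmetric matrix standing in for a Coulomb matrix (made-up values)
sym = np.array([
    [36.9, 6.0, 3.0],
    [6.0, 0.5, 1.0],
    [3.0, 1.0, 0.5],
])

# Row and column indices of the upper triangle, diagonal included
rows, cols = np.triu_indices_from(sym)
upper = sym[rows, cols]

n = sym.shape[0]
# An n x n symmetric matrix has n * (n + 1) / 2 independent entries
print(upper.shape, n * (n + 1) // 2)
```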

```
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
```

```
coulomb_mat = CoulombMatrix(max_atoms=20, randomize=False, remove_hydrogens=False, upper_tri=False)
features = coulomb_mat._featurize(mol=example_mol)
```

```
# One Coulomb matrix is returned per conformer
print(len(example_mol.GetConformers()) == len(features))
```

`CoulombMatrix` is invariant to molecular rotation and translation, since the interatomic distances and atomic numbers do not change. However, the matrix is not invariant to permutations of the atom indices. To deal with this, the `CoulombMatrixEig` featurizer was introduced, which uses the eigenvalue spectrum of the Coulomb matrix and is invariant to permutations of the atom indices.
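The permutation argument can be checked numerically. In the sketch below, relabeling the atoms reorders the rows and columns of a symmetric matrix, which changes the matrix itself but not its eigenvalues. The matrix values and the chosen permutation are made up for illustration, and `np.linalg.eigvalsh` is used here for convenience; it is not necessarily the routine DeepChem calls.

```python
import numpy as np

# A small symmetric matrix standing in for a Coulomb matrix (made-up values)
cm = np.array([
    [36.9, 6.0, 3.0],
    [6.0, 0.5, 1.0],
    [3.0, 1.0, 0.5],
])

# Relabel the atoms: the same permutation is applied to rows and columns
perm = [2, 0, 1]
cm_perm = cm[np.ix_(perm, perm)]

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
eig_original = np.linalg.eigvalsh(cm)
eig_permuted = np.linalg.eigvalsh(cm_perm)

print("matrices equal:   ", np.allclose(cm, cm_perm))
print("eigenvalues equal:", np.allclose(eig_original, eig_permuted))
```

The trade-off is that the spectrum has only one value per atom, so it discards information that the full matrix keeps.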

`CoulombMatrixEig` inherits from `CoulombMatrix` and featurizes a molecule by first computing the Coulomb matrices for different conformers of the molecule and then computing the eigenvalues for each Coulomb matrix. These eigenvalues are then padded to account for variation in the number of atoms across molecules.

The featurizer takes in `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`) and generating additional random Coulomb matrices (`randomize`).

```
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)
print("Number of available conformers for propane: ", len(example_mol.GetConformers()))
```

```
coulomb_mat_eig = CoulombMatrixEig(max_atoms=20, randomize=False, remove_hydrogens=False)
features = coulomb_mat_eig._featurize(mol=example_mol)
```

```
# One eigenvalue vector is returned per conformer
print(len(example_mol.GetConformers()) == len(features))
```

TODO(rbharath): This tutorial still needs to be expanded out with the additional fingerprints.

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!