In this tutorial, we'll walk you through the use of machine learning methods to predict the binding energy of a protein-ligand complex. Recall that a ligand is some small molecule which interacts (usually non-covalently) with a protein. As you work through the tutorial, you'll trace an arc from loading a raw dataset to fitting a random forest model to predict binding affinities. We'll take the following steps to get there: loading a dataset of protein-ligand complexes, visualizing the proteins and ligands, featurizing the complexes, fitting random forest models, and tuning their hyperparameters.
To start the tutorial, we'll use a simple pre-processed dataset that comes as a gzipped CSV file. Each row is a molecular system, and each column represents a different piece of information about that system. For instance, in this example, every row reflects a protein-ligand complex, and the following columns are present: a unique complex identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python list of all lines in a PDB file for the protein alone; and a Python list of all lines in a ligand file for the ligand alone.
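As an aside, since this is just a gzipped CSV, you can also peek at it with plain pandas if you're curious. This is a minimal sketch, and the relative path below is illustrative; it assumes the file has already been downloaded (which we do with DeepChem's own utilities later in this tutorial):
In [0]:
import pandas as pd
# pandas infers gzip compression from the .gz extension
df = pd.read_csv("pdbbind_core_df.csv.gz")
print(df.shape)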
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.
In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(additional_packages=['mdtraj'], version='2.3.0')
In [2]:
import deepchem as dc
from deepchem.utils import download_url
import os
data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")
if not os.path.exists(dataset_file):
    print('File does not exist. Downloading file...')
    download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
    print('File downloaded...')
raw_dataset = dc.utils.save.load_from_disk(dataset_file)
Let's see what raw_dataset looks like:
In [3]:
print("Type of dataset is: %s" % str(type(raw_dataset)))
print(raw_dataset[:5])
print("Shape of dataset is: %s" % str(raw_dataset.shape))
Visualizing what these proteins and ligands look like will help us build intuition and understanding about these systems. Let's write a bit of code to help us view our molecules. We'll use the nglview library to help us do this. You can install this library by calling pip install nglview.
In [4]:
!pip install -q nglview
In [5]:
import nglview
import tempfile
import os
import mdtraj as md
import numpy as np
We'll use the mdtraj library to help us manipulate both ligand and protein objects. We'll use the following convenience function to parse the ligand and protein representations above into mdtraj objects.
In [0]:
def convert_lines_to_mdtraj(molecule_lines):
    # The dataset stores each structure as the string form of a Python list of
    # file lines, so strip the list syntax and split back into individual lines.
    molecule_lines = molecule_lines.strip('[').strip(']').replace("'", "").replace("\\n", "").split(", ")
    # Write the lines out as a temporary PDB file that mdtraj can load.
    tempdir = tempfile.mkdtemp()
    molecule_file = os.path.join(tempdir, "molecule.pdb")
    with open(molecule_file, "w") as f:
        for line in molecule_lines:
            f.write("%s\n" % line)
    molecule_mdtraj = md.load(molecule_file)
    return molecule_mdtraj
Let's take a look at the first protein-ligand pair in our dataset:
In [0]:
first_protein, first_ligand = raw_dataset.iloc[0]["protein_pdb"], raw_dataset.iloc[0]["ligand_pdb"]
protein_mdtraj = convert_lines_to_mdtraj(first_protein)
ligand_mdtraj = convert_lines_to_mdtraj(first_ligand)
We'll use the convenience function nglview.show_mdtraj in order to view our proteins and ligands.
In [8]:
v = nglview.show_mdtraj(ligand_mdtraj)
v
Now that we have an idea of what the ligand looks like, let's take a look at our protein:
In [9]:
view = nglview.show_mdtraj(protein_mdtraj)
view
Can we view the complex with both protein and ligand? Yes, but we'll need the following helper function to join the two mdtraj files for the protein and ligand.
In [0]:
def combine_mdtraj(protein, ligand):
    # Append the ligand's atoms to the protein topology as a new "LIG" residue.
    chain = protein.topology.add_chain()
    residue = protein.topology.add_residue("LIG", chain, resSeq=1)
    for atom in ligand.topology.atoms:
        protein.topology.add_atom(atom.name, atom.element, residue)
    # Concatenate the coordinates and regenerate standard bonds.
    protein.xyz = np.hstack([protein.xyz, ligand.xyz])
    protein.topology.create_standard_bonds()
    return protein
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)
Let's now visualize our complex:
In [11]:
v = nglview.show_mdtraj(complex_mdtraj)
v
We can see that the ligand slots into a groove on the outer edge of the protein. Ok, now that we've got our basic visualization tools up and running, let's see if we can use some machine learning to understand our dataset of protein-ligand systems better.
In order to do this, we'll need a way to transform our protein-ligand complexes into representations which can be used by learning algorithms. Ideally, we'd have neural protein-ligand complex fingerprints, but DeepChem doesn't yet have a good learned fingerprint of this sort. We do however have well tuned manual featurizers that can help us with our challenge here.
We'll make use of two types of fingerprints in the rest of the tutorial: circular fingerprints and grid descriptors. The grid descriptors convert a 3D volume containing an arrangement of atoms into a fingerprint. This is really useful for understanding protein-ligand complexes, since it allows us to transform protein-ligand complexes into vectors that can be passed into simple machine learning algorithms. Let's see how we can create such a fingerprint in DeepChem. We'll make use of the dc.feat.RdkitGridFeaturizer class.
In [0]:
grid_featurizer = dc.feat.RdkitGridFeaturizer(
    # voxel_width sets the grid resolution in angstroms; flatten=True returns a 1D vector
    voxel_width=16.0,
    # Combine several interaction fingerprints: voxelized ECFP and SPLIF fragments
    # plus hydrogen bond, pi-stacking, cation-pi, and salt-bridge interaction features
    feature_types=["ecfp", "splif", "hbond", "pi_stack", "cation_pi", "salt_bridge"],
    ecfp_power=5, splif_power=5, parallel=True, flatten=True, sanitize=True)
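As a quick aside, if you had a single complex on disk you could apply this featurizer directly. This is only a sketch: the file paths below are hypothetical, featurize_complexes takes parallel lists of ligand and protein file paths, and its exact return format varies across DeepChem versions, so we don't unpack the result here:
In [0]:
# Hypothetical paths: a ligand file and the PDB file of its protein partner
result = grid_featurizer.featurize_complexes(["ligand.sdf"], ["protein.pdb"])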
Next we'll create circular fingerprints. These convert small molecules into a vector of fragments. You can create these fingerprints with the dc.feat.CircularFingerprint class.
In [0]:
compound_featurizer = dc.feat.CircularFingerprint(size=128)
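To build some intuition for what this featurizer produces, you can run it on a single molecule. Here's a minimal sketch using ethanol as a toy input; note that in this DeepChem release the featurizer expects RDKit Mol objects rather than raw SMILES strings:
In [0]:
from rdkit import Chem
# Featurize a toy molecule (ethanol) to inspect the fingerprint's shape
mol = Chem.MolFromSmiles("CCO")
features = compound_featurizer.featurize([mol])
print(features.shape)  # one row of 128 bits, matching size=128 above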
The convenience loader dc.molnet.load_pdbbind_grid will take care of featurizing the PDBBind dataset under the hood for us. We'll use this helper method to perform our featurization, working with the "refined" subset of PDBBind (which consists of only a couple thousand protein-ligand complexes) to keep this task manageable.
In [14]:
pdbbind_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid(
    featurizer="ECFP", subset="refined")
Now, we're ready to do some learning!
To fit a DeepChem model, we first instantiate one of the provided (or user-written) model classes. In this case, we have a convenience class that wraps any ML model available in scikit-learn so it can interoperate with DeepChem. To instantiate an SklearnModel, you simply pass it a model instance defining the type of model you would like to fit, in this case a RandomForestRegressor.
In [0]:
from sklearn.ensemble import RandomForestRegressor

seed = 23  # Fix the random seed so results are reproducible
sklearn_model = RandomForestRegressor(
    n_estimators=10, max_features='sqrt', random_state=seed)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)
In [16]:
from deepchem.utils.evaluate import Evaluator
import pandas as pd
metric = dc.metrics.Metric(dc.metrics.r2_score)
evaluator = Evaluator(model, train_dataset, transformers)
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))
evaluator = Evaluator(model, valid_dataset, transformers)
valid_r2score = evaluator.compute_model_performance([metric])
print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))
This is decent performance on a validation set! It's interesting to note that a model trained on just the ligand can do reasonably well on the task of predicting the binding energy.
In [17]:
predictions = model.predict(test_dataset)
print(predictions[:10])
In the previous section, we featurized only the ligand. The signal we observed in R^2 reflects the ability of circular fingerprints and random forests to learn general features that make ligands "drug-like." This time, let's see if we can do something sensible with protein-ligand fingerprints that make use of our structural information. To start, we need to re-featurize the dataset, this time using the "grid" fingerprints.
In [18]:
pdbbind_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid(
    featurizer="grid", subset="refined")
Let's now train a simple random forest model on this dataset.
In [0]:
seed = 23  # Fix the random seed so results are reproducible
sklearn_model = RandomForestRegressor(
    n_estimators=10, max_features='sqrt', random_state=seed)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)
Let's see what our performance looks like!
In [20]:
metric = dc.metrics.Metric(dc.metrics.r2_score)
evaluator = Evaluator(model, train_dataset, transformers)
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))
evaluator = Evaluator(model, valid_dataset, transformers)
valid_r2score = evaluator.compute_model_performance([metric])
print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))
Ok, there's some predictive performance here, but it looks like we have lower accuracy than with the ligand-only dataset. What gives? There might be a few things going on. It's possible that for this particular dataset the ligand-only features are quite predictive. But it's probably still useful to have a protein-ligand model, since it's likely to learn different features than the pure ligand-only model.
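One simple way to exploit that complementarity, purely as a sketch rather than something this tutorial does, would be to average the two models' predictions. This assumes you kept the earlier ECFP-featurized test split around under a hypothetical name like ecfp_test_dataset along with its fitted model ligand_model, and that both loaders used the same split so the rows line up:
In [0]:
import numpy as np
# Hypothetical: ligand_model / ecfp_test_dataset come from the ECFP run above,
# while model / test_dataset come from the grid run; average the predictions
ensemble_preds = np.mean(
    [ligand_model.predict(ecfp_test_dataset), model.predict(test_dataset)], axis=0)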
Ok, now that we've built a few models, let's do some hyperparameter optimization to see if we can get our numbers to be a little better. We'll use the dc.hyper module to do this for us.
In [21]:
def rf_model_builder(model_params, model_dir):
    sklearn_model = RandomForestRegressor(**model_params)
    sklearn_model.random_state = seed
    return dc.models.SklearnModel(sklearn_model, model_dir)

params_dict = {
    "n_estimators": [10, 50, 100],
    "max_features": ["auto", "sqrt", "log2", None],
}
metric = dc.metrics.Metric(dc.metrics.r2_score)
optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)
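Before looking at the score, it's worth inspecting which hyperparameter combination won, using the values the search returned above:
In [0]:
# best_rf_hyperparams is the winning setting; all_rf_results records the
# validation score for every combination tried
print(best_rf_hyperparams)
print(all_rf_results)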
Ok, our best validation score is now 0.53 R^2. Let's make some predictions on the test set and see what they look like.
In [22]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
rf_predicted_test = best_rf.predict(test_dataset)
rf_true_test = test_dataset.y
plt.scatter(rf_predicted_test, rf_true_test)
plt.xlabel('Predicted pIC50')
plt.ylabel('True pIC50')
plt.title(r'RF Predicted vs. True pIC50')
plt.xlim([2, 11])
plt.ylim([2, 11])
plt.plot([2, 11], [2, 11], color='k')
plt.show()
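To put a number on this scatter plot, we can reuse the same Evaluator pattern from earlier to score the tuned model on the held-out test set (a small sketch reusing the names defined above):
In [0]:
evaluator = Evaluator(best_rf, test_dataset, transformers)
test_r2score = evaluator.compute_model_performance([metric])
print("RF Test set R^2 %f" % (test_r2score["r2_score"]))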
This model seems reasonably predictive! It's likely that this model is still a ways from being good enough to use in a production setting, but this isn't bad for a quick and dirty tutorial.
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub. This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter. The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!