Tutorial Part 13: Modeling Protein-Ligand Interactions

In this tutorial, we'll walk you through the use of machine learning methods to predict the binding energy of a protein-ligand complex. Recall that a ligand is some small molecule which interacts (usually non-covalently) with a protein. As you work through the tutorial, you'll trace an arc from loading a raw dataset to fitting a random forest model to predict binding affinities. We'll take the following steps to get there:

  1. Loading a chemical dataset, consisting of a series of protein-ligand complexes.
  2. Featurizing each protein-ligand complex with various featurization schemes.
  3. Fitting a series of models with these featurized protein-ligand complexes.
  4. Visualizing the results.

To start the tutorial, we'll use a simple pre-processed dataset that comes as a gzipped CSV file. Each row is a molecular system, and each column represents a different piece of information about that system. In this example, every row is a protein-ligand complex, and the following columns are present: a unique complex identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python list of all lines in a PDB file for the protein alone; and a Python list of all lines in a ligand file for the ligand alone.

Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.


In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(additional_packages=['mdtraj'], version='2.3.0')


TensorFlow 1.x selected.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3477  100  3477    0     0  22146      0 --:--:-- --:--:-- --:--:-- 22146
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing deepchem
done
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

deepchem-2.3.0 installation finished!
CPU times: user 2.89 s, sys: 634 ms, total: 3.52 s
Wall time: 2min 18s

In [2]:
import deepchem as dc
from deepchem.utils import download_url

import os

data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")

if not os.path.exists(dataset_file):
    print('File does not exist. Downloading file...')
    download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
    print('File downloaded...')

raw_dataset = dc.utils.save.load_from_disk(dataset_file)


File does not exist. Downloading file...
File downloaded...

Let's see what the dataset looks like:


In [3]:
print("Type of dataset is: %s" % str(type(raw_dataset)))
print(raw_dataset[:5])
print("Shape of dataset is: %s" % str(raw_dataset.shape))


Type of dataset is: <class 'pandas.core.frame.DataFrame'>
  pdb_id  ... label
0   2d3u  ...  6.92
1   3cyx  ...  8.00
2   3uo4  ...  6.52
3   1p1q  ...  4.89
4   3ag9  ...  8.05

[5 rows x 7 columns]
Shape of dataset is: (193, 7)
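
The preview above truncates most of the columns. Since raw_dataset is just a pandas DataFrame, one quick way to confirm which columns are present (the identifier, SMILES string, affinity label, and the protein/ligand PDB-line lists described earlier) is to list them directly:


In [ ]:
# Show every column name in the raw DataFrame; the truncated preview hides most of them.
print(list(raw_dataset.columns))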

Visualizing what these proteins and ligands look like will help us build intuition and understanding about these systems. Let's write a bit of code to help us view our molecules. We'll use the nglview library to help us do this. You can install this library by calling pip install nglview.


In [4]:
!pip install -q nglview


     |████████████████████████████████| 5.2MB 2.8MB/s 
  Building wheel for nglview (setup.py) ... done

In [5]:
import nglview
import tempfile
import os
import mdtraj as md
import numpy as np


We'll use the mdtraj library to help us manipulate both ligand and protein objects. The following convenience function parses the ligand and protein representations described above into mdtraj objects.


In [0]:
def convert_lines_to_mdtraj(molecule_lines):
  # The PDB lines are stored in the DataFrame as the string form of a Python
  # list, so strip the brackets and quotes and split back into individual lines.
  molecule_lines = molecule_lines.strip('[').strip(']').replace("'","").replace("\\n", "").split(", ")
  # Write the lines to a temporary PDB file and load it with mdtraj.
  tempdir = tempfile.mkdtemp()
  molecule_file = os.path.join(tempdir, "molecule.pdb")
  with open(molecule_file, "w") as f:
    for line in molecule_lines:
        f.write("%s\n" % line)
  molecule_mdtraj = md.load(molecule_file)
  return molecule_mdtraj

Let's take a look at the first protein-ligand pair in our dataset:


In [0]:
first_protein, first_ligand = raw_dataset.iloc[0]["protein_pdb"], raw_dataset.iloc[0]["ligand_pdb"]
protein_mdtraj = convert_lines_to_mdtraj(first_protein)
ligand_mdtraj = convert_lines_to_mdtraj(first_ligand)

We'll use the convenience function nglview.show_mdtraj in order to view our proteins and ligands.


In [8]:
v = nglview.show_mdtraj(ligand_mdtraj)
v


Now that we have an idea of what the ligand looks like, let's take a look at our protein:


In [9]:
view = nglview.show_mdtraj(protein_mdtraj)
view


Can we view the complex with both protein and ligand? Yes, but we'll need the following helper function to join the protein and ligand mdtraj objects into a single complex.


In [0]:
def combine_mdtraj(protein, ligand):
  # Append the ligand to the protein topology as a new chain with a single "LIG" residue.
  chain = protein.topology.add_chain()
  residue = protein.topology.add_residue("LIG", chain, resSeq=1)
  for atom in ligand.topology.atoms:
      protein.topology.add_atom(atom.name, atom.element, residue)
  # Concatenate the coordinates and rebuild standard bonds for the combined topology.
  protein.xyz = np.hstack([protein.xyz, ligand.xyz])
  protein.topology.create_standard_bonds()
  return protein
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)

Let's now visualize our complex:


In [11]:
v = nglview.show_mdtraj(complex_mdtraj)
v


We can see that the ligand slots into a groove on the outer edge of the protein. Ok, now that we've got our basic visualization tools up and running, let's see if we can use some machine learning to understand our dataset of protein-ligand systems better.

In order to do this, we'll need a way to transform our protein-ligand complexes into representations that learning algorithms can use. Ideally, we'd have learned neural fingerprints of protein-ligand complexes, but DeepChem doesn't yet have a good learned fingerprint of this sort. We do, however, have well-tuned manual featurizers that can help us with our challenge here.

We'll make use of two types of fingerprints in the rest of the tutorial: circular fingerprints and grid descriptors. Grid descriptors convert a 3D volume containing an arrangement of atoms into a fingerprint. This is really useful for understanding protein-ligand complexes, since it allows us to transform each complex into a vector that can be passed to a simple machine learning algorithm. Let's see how we can create such a fingerprint in DeepChem. We'll make use of the dc.feat.RdkitGridFeaturizer class.


In [0]:
grid_featurizer = dc.feat.RdkitGridFeaturizer(
    voxel_width=16.0, feature_types=["ecfp", "splif", "hbond", "pi_stack", "cation_pi", "salt_bridge"], 
    ecfp_power=5, splif_power=5, parallel=True, flatten=True, sanitize=True)

Next we'll create circular fingerprints. These convert small molecules into a vector of fragments. You can create these fingerprints with the dc.feat.CircularFingerprint class.


In [0]:
compound_featurizer = dc.feat.CircularFingerprint(size=128)
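
As a quick sanity check (not part of the original flow), you can featurize a single molecule and confirm that you get back a length-128 vector. This sketch assumes RDKit is available, which it is in a DeepChem install, and that the featurizer accepts RDKit Mol objects, as it does in the DeepChem 2.3 line:


In [ ]:
from rdkit import Chem

# Featurize one example molecule (aspirin) and check the fingerprint shape.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
features = compound_featurizer.featurize([mol])
print(features.shape)  # expected to be (1, 128)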

The convenience loader dc.molnet.load_pdbbind_grid will take care of featurizing the PDBBind dataset under the hood for us. We'll use this helper method to perform our featurization, working with the "refined" subset of PDBBind (which consists of only a couple thousand protein-ligand complexes) to keep this task manageable.


In [14]:
pdbbind_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid(
    featurizer="ECFP", subset="refined")


Loading raw samples now.
shard_size: 8192
About to start loading CSV from /tmp/refined_smiles_labels.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
TIMING: featurizing shard 0 took 9.982 s
TIMING: dataset construction took 10.126 s
Loading dataset from disk.
TIMING: dataset construction took 0.166 s
Loading dataset from disk.
TIMING: dataset construction took 0.085 s
Loading dataset from disk.
TIMING: dataset construction took 0.081 s
Loading dataset from disk.
TIMING: dataset construction took 0.142 s
Loading dataset from disk.
TIMING: dataset construction took 0.022 s
Loading dataset from disk.
TIMING: dataset construction took 0.023 s
Loading dataset from disk.

Now, we're ready to do some learning!

To fit a DeepChem model, we first instantiate one of the provided (or user-written) model classes. In this case, we use SklearnModel, a convenience class that wraps any ML model available in scikit-learn so that it can interoperate with DeepChem. To instantiate an SklearnModel, you simply construct the underlying scikit-learn model you would like to fit, in this case a RandomForestRegressor, and pass it to the wrapper.


In [0]:
from sklearn.ensemble import RandomForestRegressor

seed=23 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=10, max_features='sqrt')
sklearn_model.random_state = seed
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

In [16]:
from deepchem.utils.evaluate import Evaluator
import pandas as pd

metric = dc.metrics.Metric(dc.metrics.r2_score)

evaluator = Evaluator(model, train_dataset, transformers)
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

evaluator = Evaluator(model, valid_dataset, transformers)
valid_r2score = evaluator.compute_model_performance([metric])
print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))


computed_metrics: [0.8433487532863048]
RF Train set R^2 0.843349
computed_metrics: [0.45434010385273105]
RF Valid set R^2 0.454340

This is decent performance on a validation set! It's interesting to note that a prediction based on the ligand alone, with no information about the protein, does reasonably well on the task of predicting binding affinity.
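
For context, an R^2 of 0 corresponds to always predicting the mean of the true labels, so a clearly positive validation R^2 means the ligand-only features carry real signal. If you want to see that baseline explicitly, a quick check (not in the original notebook) might look like:


In [ ]:
from sklearn.metrics import r2_score

# Baseline: predict the training-set mean label for every validation example.
mean_baseline = np.full_like(valid_dataset.y, train_dataset.y.mean())
print("Mean-predictor R^2: %f" % r2_score(valid_dataset.y, mean_baseline))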


In [17]:
predictions = model.predict(test_dataset)
print(predictions[:10])


[-0.64106832 -0.80219175 -1.19084758 -1.11424137 -1.21312906 -0.73018821
 -1.00686205 -0.17348379 -0.98073392 -0.10108712]

The Protein-Ligand Complex View

In the previous section, we featurized only the ligand. The signal we observed in R^2 reflects the ability of circular fingerprints and random forests to learn general features that make ligands "drug-like." This time, let's see if we can do something sensible with protein-ligand fingerprints that make use of our structural information. To start, we need to re-featurize the dataset, this time using the "grid" fingerprints.


In [18]:
pdbbind_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid(
    featurizer="grid", subset="refined")


Loading dataset from disk.
TIMING: dataset construction took 0.236 s
Loading dataset from disk.
TIMING: dataset construction took 0.073 s
Loading dataset from disk.
TIMING: dataset construction took 0.072 s
Loading dataset from disk.

Let's now train a simple random forest model on this dataset.


In [0]:
seed=23 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=10, max_features='sqrt')
sklearn_model.random_state = seed
model = dc.models.SklearnModel(sklearn_model)
model.fit(train_dataset)

Let's see what our accuracies look like!


In [20]:
metric = dc.metrics.Metric(dc.metrics.r2_score)

evaluator = Evaluator(model, train_dataset, transformers)
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

evaluator = Evaluator(model, valid_dataset, transformers)
valid_r2score = evaluator.compute_model_performance([metric])
print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))


computed_metrics: [0.8954267076811548]
RF Train set R^2 0.895427
computed_metrics: [0.4608614366143733]
RF Valid set R^2 0.460861

Ok, there's some predictive performance here, but the validation accuracy looks roughly comparable to that of the ligand-only model. What gives? There might be a few things going on. It's possible that for this particular dataset the pure ligand-only features are already quite predictive. But nonetheless, it's probably still useful to have a protein-ligand model, since it's likely to learn different features than the pure ligand-only model.
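
One way to exploit that complementarity, sketched below purely as an illustration, is to average the two models' predictions. The names ligand_model, grid_model, ligand_test, and grid_test are hypothetical; they stand for the two fitted SklearnModels and their correspondingly featurized test sets, which you would need to keep around from the earlier cells.


In [ ]:
# Hypothetical sketch: blend the ligand-only and grid-featurized models by
# averaging their predictions. Equal weights are used here; the weights could
# instead be tuned on the validation set.
def ensemble_predict(ligand_model, grid_model, ligand_test, grid_test):
  ligand_preds = np.asarray(ligand_model.predict(ligand_test))
  grid_preds = np.asarray(grid_model.predict(grid_test))
  return 0.5 * ligand_preds + 0.5 * grid_preds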

Doing Some Hyperparameter Optimization

Ok, now that we've built a few models, let's do some hyperparameter optimization to see if we can get our numbers to be a little better. We'll use the dc.hyper module to do this for us.


In [21]:
def rf_model_builder(model_params, model_dir):
  sklearn_model = RandomForestRegressor(**model_params)
  sklearn_model.random_state = seed
  return dc.models.SklearnModel(sklearn_model, model_dir)

params_dict = {
    "n_estimators": [10, 50, 100],
    "max_features": ["auto", "sqrt", "log2", None],
}

metric = dc.metrics.Metric(dc.metrics.r2_score)
optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
    metric=metric)


Fitting model 1/12
hyperparameters: {'n_estimators': 10, 'max_features': 'auto'}
computed_metrics: [0.4527347088180085]
Model 1/12, Metric r2_score, Validation set 0: 0.452735
	best_validation_score so far: 0.452735
Fitting model 2/12
hyperparameters: {'n_estimators': 10, 'max_features': 'sqrt'}
computed_metrics: [0.4608614366143733]
Model 2/12, Metric r2_score, Validation set 1: 0.460861
	best_validation_score so far: 0.460861
Fitting model 3/12
hyperparameters: {'n_estimators': 10, 'max_features': 'log2'}
computed_metrics: [0.40215050034606037]
Model 3/12, Metric r2_score, Validation set 2: 0.402151
	best_validation_score so far: 0.460861
Fitting model 4/12
hyperparameters: {'n_estimators': 10, 'max_features': None}
computed_metrics: [0.4527347088180085]
Model 4/12, Metric r2_score, Validation set 3: 0.452735
	best_validation_score so far: 0.460861
Fitting model 5/12
hyperparameters: {'n_estimators': 50, 'max_features': 'auto'}
computed_metrics: [0.49621726686995704]
Model 5/12, Metric r2_score, Validation set 4: 0.496217
	best_validation_score so far: 0.496217
Fitting model 6/12
hyperparameters: {'n_estimators': 50, 'max_features': 'sqrt'}
computed_metrics: [0.4931560486803085]
Model 6/12, Metric r2_score, Validation set 5: 0.493156
	best_validation_score so far: 0.496217
Fitting model 7/12
hyperparameters: {'n_estimators': 50, 'max_features': 'log2'}
computed_metrics: [0.4619425746467314]
Model 7/12, Metric r2_score, Validation set 6: 0.461943
	best_validation_score so far: 0.496217
Fitting model 8/12
hyperparameters: {'n_estimators': 50, 'max_features': None}
computed_metrics: [0.49621726686995704]
Model 8/12, Metric r2_score, Validation set 7: 0.496217
	best_validation_score so far: 0.496217
Fitting model 9/12
hyperparameters: {'n_estimators': 100, 'max_features': 'auto'}
computed_metrics: [0.5019612740243959]
Model 9/12, Metric r2_score, Validation set 8: 0.501961
	best_validation_score so far: 0.501961
Fitting model 10/12
hyperparameters: {'n_estimators': 100, 'max_features': 'sqrt'}
computed_metrics: [0.48994241350618273]
Model 10/12, Metric r2_score, Validation set 9: 0.489942
	best_validation_score so far: 0.501961
Fitting model 11/12
hyperparameters: {'n_estimators': 100, 'max_features': 'log2'}
computed_metrics: [0.47513889029551215]
Model 11/12, Metric r2_score, Validation set 10: 0.475139
	best_validation_score so far: 0.501961
Fitting model 12/12
hyperparameters: {'n_estimators': 100, 'max_features': None}
computed_metrics: [0.5019612740243959]
Model 12/12, Metric r2_score, Validation set 11: 0.501961
	best_validation_score so far: 0.501961
computed_metrics: [0.931461486646433]
Best hyperparameters: (100, None)
train_score: 0.931461
validation_score: 0.501961

Ok, our best validation score is now about 0.50 R^2. Let's make some predictions on the test set and see what they look like.


In [22]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

rf_predicted_test = best_rf.predict(test_dataset)
rf_true_test = test_dataset.y
plt.scatter(rf_predicted_test, rf_true_test)
plt.xlabel('Predicted pIC50')
plt.ylabel('True pIC50')
plt.title(r'RF Predicted pIC50 vs. True pIC50')
plt.xlim([2, 11])
plt.ylim([2, 11])
plt.plot([2, 11], [2, 11], color='k')
plt.show()


This model seems reasonably predictive! It's likely that this model is still a ways from being good enough to use in a production setting, but this isn't bad for a quick and dirty tutorial.
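
If you'd like a number to go with the scatter plot, one option, mirroring the Evaluator usage from earlier in this tutorial, is to compute R^2 for the tuned model on the held-out test set:


In [ ]:
# Evaluate the tuned random forest on the test set, reusing the r2_score metric
# and transformers defined above.
evaluator = Evaluator(best_rf, test_dataset, transformers)
test_r2score = evaluator.compute_model_performance([metric])
print("RF Test set R^2 %f" % (test_r2score["r2_score"]))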

Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

Star DeepChem on GitHub

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

Join the DeepChem Gitter

The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!