deepchem: Machine Learning models for Drug Discovery

Tutorial 1: Basic Protein-Ligand Complex Featurized Models

Written by Evan Feinberg and Bharath Ramsundar

Copyright 2016, Stanford University

Welcome to the deepchem tutorial. In this iPython Notebook, one can follow along with the code below to learn how to fit machine learning models with rich predictive power on chemical datasets.

Overview:

In this tutorial, you will trace an arc from loading a raw dataset to fitting a cutting edge ML technique for predicting binding affinities. This will be accomplished by writing simple commands to access the deepchem Python API, encompassing the following broad steps:

  1. Loading a chemical dataset, consisting of a series of protein-ligand complexes.
  2. Featurizing each protein-ligand complexes with various featurization schemes.
  3. Fitting a series of models with these featurized protein-ligand complexes.
  4. Visualizing the results.

First, let's point to a "dataset" file. This can come in the format of a CSV file or Pandas DataFrame. Regardless of file format, it must be columnar data, where each row is a molecular system, and each column represents a different piece of information about that system. For instance, in this example, every row reflects a protein-ligand complex, and the following columns are present: a unique complex identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python list of all lines in a PDB file for the protein alone; and a Python list of all lines in a ligand file for the ligand alone.

This should become clearer with the example. (Make sure to set DISPLAY = True)


In [1]:
%load_ext autoreload
%autoreload 2
%pdb off
# set DISPLAY = True when running tutorial
DISPLAY = False
# set PARALLELIZE to true if you want to use ipyparallel
PARALLELIZE = False
import warnings
warnings.filterwarnings('ignore')


Automatic pdb calling has been turned OFF

In [2]:
dataset_file= "../datasets/pdbbind_core_df.pkl.gz"
from deepchem.utils.save import load_from_disk
dataset = load_from_disk(dataset_file)

Let's see what dataset looks like:


In [3]:
print("Type of dataset is: %s" % str(type(dataset)))
print(dataset[:5])
print("Shape of dataset is: %s" % str(dataset.shape))


Type of dataset is: <class 'pandas.core.frame.DataFrame'>
  pdb_id                                             smiles  \
0   2d3u        CC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O   
1   3cyx  CC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1CCCCC...   
2   3uo4        OC(O)C1CCC(NC2NCCC(NC3CCCCC3C3CCCCC3)N2)CC1   
3   1p1q                         CC1ONC(O)C1CC([NH3+])C(O)O   
4   3ag9  NC(O)C(CCC[NH2+]C([NH3+])[NH3+])NC(O)C(CCC[NH2...   

                                          complex_id  \
0    2d3uCC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O   
1  3cyxCC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1C...   
2    3uo4OC(O)C1CCC(NC2NCCC(NC3CCCCC3C3CCCCC3)N2)CC1   
3                     1p1qCC1ONC(O)C1CC([NH3+])C(O)O   
4  3ag9NC(O)C(CCC[NH2+]C([NH3+])[NH3+])NC(O)C(CCC...   

                                         protein_pdb  \
0  [HEADER    2D3U PROTEIN\n, COMPND    2D3U PROT...   
1  [HEADER    3CYX PROTEIN\n, COMPND    3CYX PROT...   
2  [HEADER    3UO4 PROTEIN\n, COMPND    3UO4 PROT...   
3  [HEADER    1P1Q PROTEIN\n, COMPND    1P1Q PROT...   
4  [HEADER    3AG9 PROTEIN\n, COMPND    3AG9 PROT...   

                                          ligand_pdb  \
0  [COMPND    2d3u ligand \n, AUTHOR    GENERATED...   
1  [COMPND    3cyx ligand \n, AUTHOR    GENERATED...   
2  [COMPND    3uo4 ligand \n, AUTHOR    GENERATED...   
3  [COMPND    1p1q ligand \n, AUTHOR    GENERATED...   
4  [COMPND    3ag9 ligand \n, AUTHOR    GENERATED...   

                                         ligand_mol2 label  
0  [### \n, ### Created by X-TOOL on Thu Aug 28 2...  6.92  
1  [### \n, ### Created by X-TOOL on Thu Aug 28 2...  8.00  
2  [### \n, ### Created by X-TOOL on Fri Aug 29 0...  6.52  
3  [### \n, ### Created by X-TOOL on Thu Aug 28 2...  4.89  
4  [### \n, ### Created by X-TOOL on Thu Aug 28 2...  8.05  
Shape of dataset is: (193, 7)

One of the missions of deepchem is to form a synapse between the chemical and the algorithmic worlds: to be able to leverage the powerful and diverse array of tools available in Python to analyze molecules. This ethos applies to visual as much as quantitative examination:


In [4]:
import nglview
import tempfile
import os
import mdtraj as md
import numpy as np
import deepchem.utils.visualization
from deepchem.utils.visualization import combine_mdtraj, visualize_complex, convert_lines_to_mdtraj

first_protein, first_ligand = dataset.iloc[0]["protein_pdb"], dataset.iloc[0]["ligand_pdb"]

protein_mdtraj = convert_lines_to_mdtraj(first_protein)
ligand_mdtraj = convert_lines_to_mdtraj(first_ligand)
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)

In [5]:
if DISPLAY:
    ngltraj = visualize_complex(complex_mdtraj)
    ngltraj

Now that we're oriented, let's use ML to do some chemistry.

So, step (2) will entail featurizing the dataset.

The available featurizations that come standard with deepchem are ECFP4 fingerprints, RDKit descriptors, NNScore-style bdescriptors, and hybrid binding pocket descriptors. Details can be found on deepchem.io.


In [6]:
from deepchem.featurizers.fingerprints import CircularFingerprint
from deepchem.featurizers.basic import RDKitDescriptors
from deepchem.featurizers.nnscore import NNScoreComplexFeaturizer
from deepchem.featurizers.grid_featurizer import GridFeaturizer
grid_featurizer = GridFeaturizer(voxel_width=16.0, feature_types="voxel_combined", voxel_feature_types=["ecfp",
                                 "splif", "hbond", "pi_stack", "cation_pi", "salt_bridge"], ecfp_power=5, splif_power=5,
                                 parallel=True, flatten=True)
compound_featurizers = [CircularFingerprint(size=128)]
# TODO(rbharath, enf): The grid featurizer breaks. Need to debug before code release
complex_featurizers = []
#complex_featurizers = [grid_featurizer]

Note how we separate our featurizers into those that featurize individual chemical compounds, compound_featurizers, and those that featurize molecular complexes, complex_featurizers.

Now, let's perform the actual featurization. Calling featurizer.featurize() will return an instance of class FeaturizedSamples. Internally, featurizer.featurize() (a) computes the user-specified features on the data, (b) transforms the inputs into X and y NumPy arrays suitable for ML algorithms, and (c) constructs a FeaturizedSamples() instance that has useful methods, such as an iterator, over the featurized data.


In [7]:
#Make a directory in which to store the featurized complexes.
import tempfile, shutil
base_dir = "./tutorial_output"
if not os.path.exists(base_dir):
    os.makedirs(base_dir)
data_dir = os.path.join(base_dir, "data")
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    
featurized_samples_file = os.path.join(data_dir, "featurized_samples.joblib")

feature_dir = os.path.join(base_dir, "features")
if not os.path.exists(feature_dir):
    os.makedirs(feature_dir)

samples_dir = os.path.join(base_dir, "samples")
if not os.path.exists(samples_dir):
    os.makedirs(samples_dir)

train_dir = os.path.join(base_dir, "train")
if not os.path.exists(train_dir):
    os.makedirs(train_dir)

valid_dir = os.path.join(base_dir, "valid")
if not os.path.exists(valid_dir):
    os.makedirs(valid_dir)

test_dir = os.path.join(base_dir, "test")
if not os.path.exists(test_dir):
    os.makedirs(test_dir)

model_dir = os.path.join(base_dir, "model")
if not os.path.exists(model_dir):
    os.makedirs(model_dir)

In [8]:
import deepchem.featurizers.featurize
from deepchem.featurizers.featurize import DataFeaturizer

In [9]:
featurizers = compound_featurizers + complex_featurizers
featurizer = DataFeaturizer(tasks=["label"],
                            smiles_field="smiles",
                            protein_pdb_field="protein_pdb",
                            ligand_pdb_field="ligand_pdb",
                            compound_featurizers=compound_featurizers,
                            complex_featurizers=complex_featurizers,
                            id_field="complex_id",
                            verbose=False)
if PARALLELIZE:
    from ipyparallel import Client
    c = Client()
    dview = c[:]
else:
    dview = None
featurized_samples = featurizer.featurize(dataset_file, feature_dir, samples_dir,
                                          worker_pool=dview, shard_size=32)

from deepchem.utils.save import save_to_disk, load_from_disk

save_to_disk(featurized_samples, featurized_samples_file)

In [10]:
featurized_samples = load_from_disk(featurized_samples_file)

Now, we conduct a train-test split. If you'd like, you can choose splittype="scaffold" instead to perform a train-test split based on Bemis-Murcko scaffolds.


In [11]:
splittype = "random"

train_samples, test_samples = featurized_samples.train_test_split(
    splittype, train_dir, test_dir, seed=2016)

We generate separate instances of the Dataset() object to hermetically seal the train dataset from the test dataset. This style lends itself easily to validation-set type hyperparameter searches, which we will illustate in a separate section of this tutorial.


In [12]:
from deepchem.utils.dataset import Dataset

In [13]:
train_dataset = Dataset(data_dir=train_dir, samples=train_samples, 
                        featurizers=compound_featurizers, tasks=["label"])
test_dataset = Dataset(data_dir=test_dir, samples=test_samples, 
                       featurizers=compound_featurizers, tasks=["label"])

The performance of many ML algorithms hinges greatly on careful data preprocessing. Deepchem comes standard with a few options for such preprocessing.


In [14]:
from deepchem.transformers import NormalizationTransformer
from deepchem.transformers import ClippingTransformer

input_transformers = [NormalizationTransformer(transform_X=True, dataset=train_dataset),
                      ClippingTransformer(transform_X=True, dataset=train_dataset)]
output_transformers = [NormalizationTransformer(transform_y=True, dataset=train_dataset)]
transformers = input_transformers + output_transformers
for transformer in transformers:
    transformer.transform(train_dataset)
for transformer in transformers:
    transformer.transform(test_dataset)

Now, we're ready to do some learning! To set up a model, we will need: (a) a dictionary task_types that maps a task, in this case label, i.e. the Ki, to the type of the task, in this case regression. For the multitask use case, one will have a series of keys, each of which is a different task (Ki, solubility, renal half-life, etc.) that maps to a different task type (regression or classification).

To fit a deepchem model, first we instantiate one of the provided (or user-written) model classes. In this case, we have a created a convenience class to wrap around any ML model available in Sci-Kit Learn that can in turn be used to interoperate with deepchem. To instantiate an SklearnModel, you will need (a) task_types, (b) model_params, another dict as illustrated below, and (c) a model_instance defining the type of model you would like to fit, in this case a RandomForestRegressor.


In [15]:
from sklearn.ensemble import RandomForestRegressor
from deepchem.models.standard import SklearnModel

In [16]:
task_types = {"label": "regression"}
model_params = {"data_shape": train_dataset.get_data_shape()}

model = SklearnModel(task_types, model_params, model_instance=RandomForestRegressor())
model.fit(train_dataset)
model_dir = tempfile.mkdtemp()
model.save(model_dir)

In [17]:
from deepchem.utils.evaluate import Evaluator
import pandas as pd

In [18]:
evaluator = Evaluator(model, train_dataset, output_transformers, verbose=True)
with tempfile.NamedTemporaryFile() as train_csv_out:
  with tempfile.NamedTemporaryFile() as train_stats_out:
    _, train_r2score = evaluator.compute_model_performance(
        train_csv_out, train_stats_out)

evaluator = Evaluator(model, test_dataset, output_transformers, verbose=True)
test_csv_out = tempfile.NamedTemporaryFile()
with tempfile.NamedTemporaryFile() as test_stats_out:
    _, test_r2score = evaluator.compute_model_performance(
        test_csv_out, test_stats_out)

print test_csv_out.name
train_test_performance = pd.concat([train_r2score, test_r2score])
train_test_performance["split"] = ["train", "test"]
train_test_performance


Saving predictions to <open file '<fdopen>', mode 'w+b' at 0x7f389f4c3c00>
Saving model performance scores to <open file '<fdopen>', mode 'w+b' at 0x7f389f4c3db0>
Saving predictions to <open file '<fdopen>', mode 'w+b' at 0x7f389f4c3d20>
Saving model performance scores to <open file '<fdopen>', mode 'w+b' at 0x7f389f4c3e40>
/tmp/tmpK2BRxB
Out[18]:
task_name r2_score rms_error split
0 label 0.801475 0.959558 train
0 label 0.218699 2.273835 test

In this simple example, in few yet intuitive lines of code, we traced the machine learning arc from featurizing a raw dataset to fitting and evaluating a model.

Here, we featurized only the ligand. The signal we observed in R^2 reflects the ability of circular fingerprints and random forests to learn general features that make ligands "drug-like."


In [19]:
predictions = pd.read_csv(test_csv_out.name)
predictions = predictions.sort(['label'], ascending=[0])

In [20]:
from deepchem.utils.visualization import visualize_ligand

top_ligand = predictions.iloc[0]['ids']
ligand1 = convert_lines_to_mdtraj(dataset.loc[dataset['complex_id']==top_ligand]['ligand_pdb'].values[0])
if DISPLAY:
    ngltraj = visualize_ligand(ligand1)
    ngltraj

In [21]:
worst_ligand = predictions.iloc[predictions.shape[0]-2]['ids']
ligand1 = convert_lines_to_mdtraj(dataset.loc[dataset['complex_id']==worst_ligand]['ligand_pdb'].values[0])
if DISPLAY:
    ngltraj = visualize_ligand(ligand1)
    ngltraj

The protein-ligand complex view.

The preceding simple example, in few yet intuitive lines of code, traces the machine learning arc from featurizing a raw dataset to fitting and evaluating a model.

In this next section, we illustrate deepchem's modularity, and thereby the ease with which one can explore different featurization schemes, different models, and combinations thereof, to achieve the best performance on a given dataset. We will demonstrate this by examining protein-ligand interactions.

In the previous section, we featurized only the ligand. The signal we observed in R^2 reflects the ability of circular fingerprints and random forests to learn general features that make ligands "drug-like." However, the affinity of a drug for a target is determined not only by the drug itself, of course, but the way in which it interacts with a protein.


In [22]:
import deepchem.models.standard
from deepchem.models.standard import SklearnModel
from deepchem.utils.dataset import Dataset
from deepchem.utils.evaluate import Evaluator
from deepchem.hyperparameters import HyperparamOpt

train_dir, validation_dir, test_dir = tempfile.mkdtemp(), tempfile.mkdtemp(), tempfile.mkdtemp()
splittype="random"
train_samples, validation_samples, test_samples = featurized_samples.train_valid_test_split(
    splittype, train_dir, validation_dir, test_dir, seed=2016)

task_types = {"label": "regression"}
performance = pd.DataFrame()

def model_builder(task_types, params_dict, verbosity):
    n_estimators = params_dict["n_estimators"]
    return SklearnModel(
        task_types, params_dict,
        model_instance=RandomForestRegressor(n_estimators=n_estimators))

params_dict = {
    "n_estimators": [10, 20, 40, 80, 160],
    "data_shape": [train_dataset.get_data_shape()],
    }

optimizer = HyperparamOpt(model_builder, {"pIC50": "regression"})
for feature_type in (complex_featurizers + compound_featurizers):
    train_dataset = Dataset(data_dir=train_dir, samples=train_samples, 
                            featurizers=[feature_type], tasks=["label"])
    validation_dataset = Dataset(data_dir=validation_dir, samples=validation_samples, 
                                 featurizers=[feature_type], tasks=["label"])

    for transformer in transformers:
        transformer.transform(train_dataset)
    for transformer in transformers:
        transformer.transform(test_dataset)
        
    best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
    params_dict, train_dataset, test_dataset, output_transformers, metric="r2_score")


Model 0/5, Metric r2_score, Validation set 0: 0.686145
	best_validation_score so  far: -inf
Model 1/5, Metric r2_score, Validation set 1: 0.688894
	best_validation_score so  far: 0.686145
Model 2/5, Metric r2_score, Validation set 2: 0.733131
	best_validation_score so  far: 0.688894
Model 3/5, Metric r2_score, Validation set 3: 0.753555
	best_validation_score so  far: 0.733131
Model 4/5, Metric r2_score, Validation set 4: 0.747959
	best_validation_score so  far: 0.753555
Best hyperparameters: [('n_estimators', 80), ('data_shape', (128,))]
train_score: 0.870723
validation_score: 0.753555

In [23]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

# TODO(rbharath, enf): Need to fix this to work with new hyperparam-opt framework.

#df = pd.DataFrame(performance[['r2_score','split','featurizer']].values, index=performance['n_trees'].values, columns=['r2_score', 'split', 'featurizer'])
#df = df.loc[df['split']=="validation"]
#df = df.drop('split', 1)
#fingerprint_df = df[df['featurizer'].str.contains('fingerprint')].drop('featurizer', 1)
#print fingerprint_df
#fingerprint_df.columns = ['ligand fingerprints']
#grid_df = df[df['featurizer'].str.contains('grid')].drop('featurizer', 1)
#grid_df.columns = ['complex features']
#df = pd.concat([fingerprint_df, grid_df], axis=1)
#print(df)

#plt.clf()
#df.plot()
#plt.ylabel("$R^2$")
#plt.xlabel("Number of trees")

In [36]:
train_dir, validation_dir, test_dir = tempfile.mkdtemp(), tempfile.mkdtemp(), tempfile.mkdtemp()
splittype="random"
train_samples, validation_samples, test_samples = featurized_samples.train_valid_test_split(
    splittype, train_dir, validation_dir, test_dir, seed=2016)

feature_type = complex_featurizers
train_dataset = Dataset(data_dir=train_dir, samples=train_samples, 
                    featurizers=feature_type, tasks=["label"])
validation_dataset = Dataset(data_dir=validation_dir, samples=validation_samples, 
                   featurizers=feature_type, tasks=["label"])
test_dataset = Dataset(data_dir=test_dir, samples=test_samples, 
                   featurizers=feature_type, tasks=["label"])

for transformer in transformers:
    transformer.transform(train_dataset)
for transformer in transformers:
    transformer.transform(valid_dataset)
for transformer in transformers:
    transformer.transform(test_dataset)

model_params = {"data_shape": train_dataset.get_data_shape()}
rf_model = SklearnModel(task_types, model_params, model_instance=RandomForestRegressor(n_estimators=20))
rf_model.fit(train_dataset)
model_dir = tempfile.mkdtemp()
rf_model.save(model_dir)


evaluator = Evaluator(rf_model, train_dataset, output_transformers, verbose=True)
with tempfile.NamedTemporaryFile() as train_csv_out:
  with tempfile.NamedTemporaryFile() as train_stats_out:
    _, train_r2score = evaluator.compute_model_performance(
        train_csv_out, train_stats_out)

evaluator = Evaluator(rf_model, test_dataset, output_transformers, verbose=True)
test_csv_out = tempfile.NamedTemporaryFile()
with tempfile.NamedTemporaryFile() as test_stats_out:
    predictions, test_r2score = evaluator.compute_model_performance(
        test_csv_out, test_stats_out)

train_test_performance = pd.concat([train_r2score, test_r2score])
train_test_performance["split"] = ["train", "test"]
train_test_performance["featurizer"] = [str(feature_type.__class__), str(feature_type.__class__)]
train_test_performance["n_trees"] = [n_trees, n_trees]
print(train_test_performance)


feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
feature_types
[]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-e47c7d5b4d3d> in <module>()
      6 feature_type = complex_featurizers
      7 train_dataset = Dataset(data_dir=train_dir, samples=train_samples, 
----> 8                     featurizers=feature_type, tasks=["label"])
      9 validation_dataset = Dataset(data_dir=validation_dir, samples=validation_samples, 
     10                    featurizers=feature_type, tasks=["label"])

/home/rbharath/deepchem/deepchem/utils/dataset.py in __init__(self, data_dir, tasks, samples, featurizers, use_user_specified_features)
     50       # TODO(rbharath): Still a bit of information leakage.
     51       for df_file, df in zip(samples.dataset_files, samples.iterdataframes()):
---> 52         retval = write_dataset_single_partial((df_file, df))
     53         if retval is not None:
     54           metadata_rows.append(retval)

/home/rbharath/deepchem/deepchem/utils/dataset.py in write_dataset_single(val, data_dir, feature_types, tasks)
    180     return None
    181   task_names = sorted(tasks)
--> 182   ids, X, y, w = _df_to_numpy(df, feature_types, tasks)
    183   X_sums, X_sum_squares, X_n = compute_sums_and_nb_sample(X)
    184   y_sums, y_sum_squares, y_n = compute_sums_and_nb_sample(y, w)

/home/rbharath/deepchem/deepchem/utils/dataset.py in _df_to_numpy(df, feature_types, tasks)
    240       continue
    241     tensors.append(features)
--> 242   x = np.stack(tensors)
    243   sorted_ids = df["mol_id"]
    244 

/home/rbharath/anaconda/lib/python2.7/site-packages/numpy/core/shape_base.pyc in stack(arrays, axis)
    333     arrays = [asanyarray(arr) for arr in arrays]
    334     if not arrays:
--> 335         raise ValueError('need at least one array to stack')
    336 
    337     shapes = set(arr.shape for arr in arrays)

ValueError: need at least one array to stack

In [ ]:
import deepchem.models.deep
from deepchem.models.deep import SingleTaskDNN
import numpy.random
from operator import mul
import itertools

params_dict = {"activation": ["relu"],
                "momentum": [.9],
                "batch_size": [50],
                "init": ["glorot_uniform"],
                "data_shape": [train_dataset.get_data_shape()],
                "learning_rate": np.power(10., np.random.uniform(-5, -2, size=5)),
                "decay": np.power(10., np.random.uniform(-6, -4, size=5)),
                "nb_hidden": [1000],
                "nb_epoch": [40],
                "nesterov": [False],
                "dropout": [.5],
                "nb_layers": [1],
                "batchnorm": [False],
              }


optimizer = HyperparamOpt(SingleTaskDNN, task_types)
best_dnn, best_hyperparams, all_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, output_transformers, metric="r2_score", verbosity=None)

In [ ]:
dnn_test_csv_out = tempfile.NamedTemporaryFile()
dnn_test_stats_out = tempfile.NamedTemporaryFile()
dnn_test_evaluator = Evaluator(best_dnn, test_dataset)
dnn_test_df, dnn_test_r2score = dnn_test_evaluator.compute_model_performance(
    dnn_test_csv_out, dnn_test_stats_out)
dnn_test_r2_score = dnn_test_r2score.iloc[0]["r2_score"]
print("DNN Test set R^2 %f" % (dnn_test_r2_score))

task = "label"
dnn_predicted_test = np.array(dnn_test_df[task + "_pred"])
dnn_true_test = np.array(dnn_test_df[task])

plt.clf()
plt.scatter(dnn_true_test, dnn_predicted_test)
plt.xlabel('Predicted Ki')
plt.ylabel('True Ki')
plt.title(r'DNN predicted vs. true Ki')
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.plot([-3, 3], [-3, 3], marker=".", color='k')

rf_test_csv_out = tempfile.NamedTemporaryFile()
rf_test_stats_out = tempfile.NamedTemporaryFile()
rf_test_evaluator = Evaluator(rf_model, test_dataset)
rf_test_df, rf_test_r2score = rf_test_evaluator.compute_model_performance(
    rf_test_csv_out, rf_test_stats_out)
rf_test_r2_score = rf_test_r2score.iloc[0]["r2_score"]
print("RF Test set R^2 %f" % (rf_test_r2_score))
plt.show()

task = "label"
rf_predicted_test = np.array(rf_test_df[task + "_pred"])
rf_true_test = np.array(rf_test_df[task])
plt.scatter(rf_true_test, rf_predicted_test)
plt.xlabel('Predicted Ki')
plt.ylabel('True Ki')
plt.title(r'RF predicted vs. true Ki')
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.plot([-3, 3], [-3, 3], marker=".", color='k')
plt.show()

In [ ]:
predictions = dnn_test_df.sort(['label'], ascending=[0])

In [ ]:
top_complex = predictions.iloc[0]['ids']
best_complex = dataset.loc[dataset['complex_id']==top_complex]

protein_mdtraj = convert_lines_to_mdtraj(best_complex["protein_pdb"].values[0])
ligand_mdtraj = convert_lines_to_mdtraj(best_complex["ligand_pdb"].values[0])
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)
if DISPLAY:
    ngltraj = visualize_complex(complex_mdtraj)
    ngltraj

In [ ]:
top_complex = predictions.iloc[1]['ids']
best_complex = dataset.loc[dataset['complex_id']==top_complex]

protein_mdtraj = convert_lines_to_mdtraj(best_complex["protein_pdb"].values[0])
ligand_mdtraj = convert_lines_to_mdtraj(best_complex["ligand_pdb"].values[0])
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)
if DISPLAY:
    ngltraj = visualize_complex(complex_mdtraj)
    ngltraj

In [ ]:
top_complex = predictions.iloc[predictions.shape[0]-1]['ids']
best_complex = dataset.loc[dataset['complex_id']==top_complex]

protein_mdtraj = convert_lines_to_mdtraj(best_complex["protein_pdb"].values[0])
ligand_mdtraj = convert_lines_to_mdtraj(best_complex["ligand_pdb"].values[0])
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)
if DISPLAY:
    ngltraj = visualize_complex(complex_mdtraj)
    ngltraj