This notebook walks through the creation of multitask models on MUV [1]. The goal is to demonstrate that multitask methods outperform singletask methods on MUV.
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.
In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')
The MUV dataset is a challenging benchmark in molecular design that consists of 17 different "targets" where there are only a few "active" compounds per target. The goal of working with this dataset is to make a machine learnign model which achieves high accuracy on held-out compounds at predicting activity. To get started, let's download the MUV dataset for us to play with.
In [2]:
import os
import deepchem as dc
current_dir = os.path.dirname(os.path.realpath("__file__"))
dataset_file = "medium_muv.csv.gz"
full_dataset_file = "muv.csv.gz"
# We use a small version of MUV to make online rendering of notebooks easy. Replace with full_dataset_file
# In order to run the full version of this notebook
dc.utils.download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/%s" % dataset_file,
current_dir)
dataset = dc.utils.save.load_from_disk(dataset_file)
print("Columns of dataset: %s" % str(dataset.columns.values))
print("Number of examples in dataset: %s" % str(dataset.shape[0]))
Now, let's visualize some compounds from our dataset
In [3]:
from rdkit import Chem
from rdkit.Chem import Draw
from itertools import islice
from IPython.display import Image, display, HTML
def display_images(filenames):
"""Helper to pretty-print images."""
for filename in filenames:
display(Image(filename))
def mols_to_pngs(mols, basename="test"):
"""Helper to write RDKit mols to png files."""
filenames = []
for i, mol in enumerate(mols):
filename = "MUV_%s%d.png" % (basename, i)
Draw.MolToFile(mol, filename)
filenames.append(filename)
return filenames
num_to_display = 12
molecules = []
for _, data in islice(dataset.iterrows(), num_to_display):
molecules.append(Chem.MolFromSmiles(data["smiles"]))
display_images(mols_to_pngs(molecules))
There are 17 datasets total in MUV as we mentioned previously. We're going to train a multitask model that attempts to build a joint model to predict activity across all 17 datasets simultaneously. There's some evidence [2] that multitask training creates more robust models.
As fair warning, from my experience, this effect can be quite fragile. Nonetheless, it's a tool worth trying given how easy DeepChem makes it to build these models. To get started towards building our actual model, let's first featurize our data.
In [4]:
MUV_tasks = ['MUV-692', 'MUV-689', 'MUV-846', 'MUV-859', 'MUV-644',
'MUV-548', 'MUV-852', 'MUV-600', 'MUV-810', 'MUV-712',
'MUV-737', 'MUV-858', 'MUV-713', 'MUV-733', 'MUV-652',
'MUV-466', 'MUV-832']
featurizer = dc.feat.CircularFingerprint(size=1024)
loader = dc.data.CSVLoader(
tasks=MUV_tasks, smiles_field="smiles",
featurizer=featurizer)
dataset = loader.featurize(dataset_file)
We'll now want to split our dataset into training, validation, and test sets. We're going to do a simple random split using dc.splits.RandomSplitter
. It's worth noting that this will provide overestimates of real generalizability! For better real world estimates of prospective performance, you'll want to use a harder splitter.
In [5]:
splitter = dc.splits.RandomSplitter(dataset_file)
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(
dataset)
#NOTE THE RENAMING:
valid_dataset, test_dataset = test_dataset, valid_dataset
Let's now get started building some models! We'll do some simple hyperparameter searching to build a robust model.
In [6]:
import numpy as np
import numpy.random
params_dict = {"activation": ["relu"],
"momentum": [.9],
"batch_size": [50],
"init": ["glorot_uniform"],
"data_shape": [train_dataset.get_data_shape()],
"learning_rate": [1e-3],
"decay": [1e-6],
"nb_epoch": [1],
"nesterov": [False],
"dropouts": [(.5,)],
"nb_layers": [1],
"batchnorm": [False],
"layer_sizes": [(1000,)],
"weight_init_stddevs": [(.1,)],
"bias_init_consts": [(1.,)],
"penalty": [0.],
}
n_features = train_dataset.get_data_shape()[0]
def model_builder(model_params, model_dir):
model = dc.models.MultitaskClassifier(
len(MUV_tasks), n_features, **model_params)
return model
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
optimizer = dc.hyper.HyperparamOpt(model_builder)
best_dnn, best_hyperparams, all_results = optimizer.hyperparam_search(
params_dict, train_dataset, valid_dataset, [], metric)
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!