Tutorial Part 20: Converting DeepChem models to TensorFlow Estimators

So far, we've walked through many of the scientific details of molecular machine learning, but we've said less about how to use tools like DeepChem in production settings. This tutorial (like the last one) focuses on those practical matters.

When DeepChem was first created, TensorFlow had no standard interface for datasets or models. We created the Dataset and Model classes to fill this hole. More recently, TensorFlow has added the tf.data module as a standard interface for datasets, and the tf.estimator module as a standard interface for models. To enable easy interoperability with other tools, we have added features to Dataset and Model to support these new standards. Using the Estimator interface may make it easier to deploy DeepChem models in production environments.

This example demonstrates how to use these features. Let's begin by loading a dataset and creating a model to analyze it. We'll use a simple MultitaskClassifier with one hidden layer.


This tutorial and the rest in this sequence are designed to be run in Google Colab.


To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')

TensorFlow 1.x selected.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3477  100  3477    0     0  10256      0 --:--:-- --:--:-- --:--:-- 10226
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
installing miniconda to /root/miniconda
installing deepchem
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

deepchem-2.3.0 installation finished!
CPU times: user 2.4 s, sys: 517 ms, total: 2.91 s
Wall time: 1min 56s

In [2]:
import deepchem as dc
import tensorflow as tf
import numpy as np

tasks, datasets, transformers = dc.molnet.load_tox21(reload=False)
train_dataset, valid_dataset, test_dataset = datasets
n_tasks = len(tasks)
n_features = train_dataset.X.shape[1]

model = dc.models.MultitaskClassifier(n_tasks, n_features, layer_sizes=[1000], dropouts=0.25)

Loading raw samples now.
shard_size: 8192
About to start loading CSV from /tmp/tox21.csv.gz
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
TIMING: featurizing shard 0 took 21.888 s
TIMING: dataset construction took 22.158 s
Loading dataset from disk.
TIMING: dataset construction took 0.351 s
Loading dataset from disk.
TIMING: dataset construction took 0.173 s
Loading dataset from disk.
TIMING: dataset construction took 0.176 s
Loading dataset from disk.
TIMING: dataset construction took 0.286 s
Loading dataset from disk.
TIMING: dataset construction took 0.044 s
Loading dataset from disk.
TIMING: dataset construction took 0.038 s
Loading dataset from disk.
WARNING:tensorflow:From /tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.

We want to train the model using the training set, then evaluate it on the test set. As our evaluation metric we will use the ROC AUC, averaged over the 12 tasks included in the dataset. First let's see how to do this with the DeepChem API.

In [3]:
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(test_dataset, [metric]))

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:169: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/optimizers.py:76: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:258: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:260: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/keras_model.py:237: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/losses.py:108: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.

WARNING:tensorflow:From /root/miniconda/lib/python3.6/site-packages/deepchem/models/losses.py:109: The name tf.losses.Reduction is deprecated. Please use tf.compat.v1.losses.Reduction instead.

computed_metrics: [0.770005534034311, 0.8149272185691003, 0.843224224330952, 0.7941699811597237, 0.7050916141963877, 0.7847847847847849, 0.6692734193975505, 0.6598562026685901, 0.8362882956320903, 0.7056690837178643, 0.8348021283671433, 0.7099963045084996]
{'mean-roc_auc_score': 0.7606740659472496}

Simple enough. Now let's see how to do the same thing with the TensorFlow APIs. Fair warning: this is going to take a lot more code!

To begin with, TensorFlow doesn't allow a dataset to be passed directly to a model. Instead, you need to write an "input function" that constructs a particular set of tensors and returns them in a particular format. Fortunately, Dataset's make_iterator() method provides exactly the tensors we need in the form of a tf.data.Iterator. This allows our input function to be very simple.

In [0]:
def input_fn(dataset, epochs):
    x, y, weights = dataset.make_iterator(batch_size=100, epochs=epochs).get_next()
    return {'x': x, 'weights': weights}, y
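
To make the expected format concrete, here is a minimal plain-Python sketch of the same idea: a generator that yields a dict of named features plus the labels, one batch at a time, for a fixed number of epochs. The function name batch_epochs and the toy data are made up for illustration. This is not how make_iterator() is implemented, and it returns Python lists rather than tensors.

```python
def batch_epochs(X, y, w, batch_size, epochs):
    """Yield ({'x': ..., 'weights': ...}, labels) batches for a fixed number of epochs."""
    n = len(X)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            stop = start + batch_size
            # Features and weights travel together in a dict; labels are returned separately.
            features = {'x': X[start:stop], 'weights': w[start:stop]}
            yield features, y[start:stop]

# Toy data: 5 samples, 2 features, 1 task (values are made up).
X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.5, 0.5]]
y = [[1], [0], [1], [0], [1]]
w = [[1.0]] * 5

batches = list(batch_epochs(X, y, w, batch_size=2, epochs=2))
print(len(batches))  # prints 6 (3 batches per epoch, 2 epochs)
```

The keys of the features dict ('x' and 'weights') are the names that the rest of the pipeline uses to find each column.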

Next, you have to use the functions in the tf.feature_column module to create an object representing each feature and weight column (but curiously, not the label column—don't ask me why!). These objects describe the data type and shape of each column, and give each one a name. The names must match the keys in the dict returned by the input function.

In [0]:
x_col = tf.feature_column.numeric_column('x', shape=(n_features,))
weight_col = tf.feature_column.numeric_column('weights', shape=(n_tasks,))
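
The requirement that column names match the dict keys is easy to get wrong, so here is a small plain-Python sketch of the kind of consistency check involved. The Column namedtuple and the check_columns helper are hypothetical stand-ins written for this illustration; they are not part of tf.feature_column.

```python
from collections import namedtuple

# Hypothetical stand-in for a feature column: just a name and a per-sample shape.
Column = namedtuple('Column', ['key', 'shape'])

def check_columns(columns, features):
    """Return a list of problems found when matching columns to a feature dict."""
    problems = []
    for col in columns:
        if col.key not in features:
            problems.append('missing key: %s' % col.key)
            continue
        # Compare the declared per-sample shape with the first sample's length.
        sample = features[col.key][0]
        if (len(sample),) != col.shape:
            problems.append('shape mismatch for %s' % col.key)
    return problems

columns = [Column('x', (3,)), Column('weights', (2,))]
good = {'x': [[0.1, 0.2, 0.3]], 'weights': [[1.0, 1.0]]}
bad = {'features': [[0.1, 0.2, 0.3]], 'weights': [[1.0]]}

print(check_columns(columns, good))
print(check_columns(columns, bad))
```

In the second dict the feature key is misnamed and the weights have the wrong width, so both columns are reported.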

Unlike DeepChem models, which allow arbitrary metrics to be passed to evaluate(), estimators require all metrics to be defined up front when you create the estimator. Unfortunately, TensorFlow doesn't have very good support for multitask models. It provides an AUC metric, but no easy way to average this metric over tasks. We therefore must create a separate metric for every task, then define our own metric function to compute the average of them.

In [0]:
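def mean_auc(labels, predictions, weights):
    # Build one streaming AUC metric per task, then average them.
    metric_ops = []
    update_ops = []
    for i in range(n_tasks):
        metric, update = tf.metrics.auc(labels[:,i], predictions[:,i], weights[:,i])
        metric_ops.append(metric)
        update_ops.append(update)
    mean_metric = tf.reduce_mean(tf.stack(metric_ops))
    # Group the per-task update ops so a single op updates every task.
    update_all = tf.group(*update_ops)
    return mean_metric, update_all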
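
As a framework-free reference for what this metric is meant to compute, the sketch below calculates a weighted ROC AUC per task by comparing every positive/negative pair, then averages over tasks. The helper names weighted_auc and mean_auc_reference and the toy data are assumptions made for illustration. Note that tf.metrics.auc approximates the area using a discretized set of thresholds, so its values can differ slightly from this exact pairwise calculation.

```python
def weighted_auc(labels, scores, weights):
    """Exact ROC AUC for one task via weighted pairwise comparison of positives and negatives."""
    # Samples with zero weight are dropped, so they do not contribute to the metric.
    pos = [(s, w) for l, s, w in zip(labels, scores, weights) if l == 1 and w > 0]
    neg = [(s, w) for l, s, w in zip(labels, scores, weights) if l == 0 and w > 0]
    total = sum(wp for _, wp in pos) * sum(wn for _, wn in neg)
    if total == 0:
        raise ValueError('need at least one weighted positive and one weighted negative')
    correct = 0.0
    for sp, wp in pos:
        for sn, wn in neg:
            if sp > sn:
                correct += wp * wn        # positive ranked above negative
            elif sp == sn:
                correct += 0.5 * wp * wn  # ties count as half
    return correct / total

def mean_auc_reference(labels, predictions, weights):
    """Average the per-task AUC over the columns of (samples x tasks) nested lists."""
    n_tasks = len(labels[0])
    aucs = []
    for i in range(n_tasks):
        aucs.append(weighted_auc([row[i] for row in labels],
                                 [row[i] for row in predictions],
                                 [row[i] for row in weights]))
    return sum(aucs) / n_tasks

# Toy data: 4 samples, 2 tasks (values are made up for illustration).
labels      = [[1, 0], [0, 1], [1, 1], [0, 0]]
predictions = [[0.9, 0.2], [0.1, 0.4], [0.8, 0.7], [0.3, 0.6]]
weights     = [[1.0, 1.0]] * 4
print(mean_auc_reference(labels, predictions, weights))  # prints 0.875
```

Here the first task is ranked perfectly (AUC 1.0) and the second gets three of four pairs right (AUC 0.75), giving a mean of 0.875.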

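Now we create our Estimator. The intended route is to call make_estimator() on the DeepChem model, providing as arguments the objects created above to represent the feature and weight columns, as well as our metric function. In the cell below that call is commented out, and the tf.keras.estimator.model_to_estimator() call used in its place fails with the AttributeError shown, because the MultitaskClassifier object has no layers attribute.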

In [7]:
#estimator = model.make_estimator(feature_columns=[x_col],
#                                 weight_column=weight_col,
#                                 metrics={'mean_auc': mean_auc},
#                                 model_dir='estimator')
estimator = tf.keras.estimator.model_to_estimator(model)

INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpq86w8_0k
INFO:tensorflow:Using the Keras model provided.
AttributeError                            Traceback (most recent call last)
<ipython-input-7-aeaf11067fea> in <module>()
      3 #                                 metrics={'mean_auc': mean_auc},
      4 #                                 model_dir='estimator')
----> 5 estimator = tf.keras.estimator.model_to_estimator(model)

/tensorflow-1.15.2/python3.6/tensorflow_core/python/keras/estimator/__init__.py in model_to_estimator(keras_model, keras_model_path, custom_objects, model_dir, config, checkpoint_format)
    105       config=config,
    106       checkpoint_format=checkpoint_format,
--> 107       use_v2_estimator=False)

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/keras.py in model_to_estimator(keras_model, keras_model_path, custom_objects, model_dir, config, checkpoint_format, use_v2_estimator)
    558   keras_model_fn = _create_keras_model_fn(keras_model, custom_objects,
    559                                           save_object_ckpt)
--> 560   if _any_weight_initialized(keras_model):
    561     # Warn if config passed to estimator tries to update GPUOptions. If a
    562     # session has already been created, the GPUOptions passed to the first

/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/keras.py in _any_weight_initialized(keras_model)
     81   if ops.executing_eagerly_outside_functions():
     82     return True
---> 83   for layer in keras_model.layers:
     84     for weight in layer.weights:
     85       if hasattr(weight, '_keras_initialized'):

AttributeError: 'MultitaskClassifier' object has no attribute 'layers'

Once the estimator has been created, we are finally ready to train and evaluate it! Notice that the input function passed to each method is actually a lambda. This lets us write a single function, then use it with different datasets and numbers of epochs.

In [0]:
estimator.train(input_fn=lambda: input_fn(train_dataset, 100))
print(estimator.evaluate(input_fn=lambda: input_fn(test_dataset, 1)))
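
The lambda trick works because these methods take a callable rather than data, and call it themselves when they need the inputs. The plain-Python sketch below shows the pattern in isolation, along with functools.partial as an equivalent alternative. The functions make_batches and run are hypothetical stand-ins, not part of TensorFlow or DeepChem.

```python
from functools import partial

def make_batches(name, epochs):
    """Hypothetical input function: returns a description instead of tensors."""
    return '%s for %d epoch(s)' % (name, epochs)

def run(input_fn):
    """Hypothetical consumer that calls input_fn itself, without passing any data."""
    return input_fn()

# A lambda binds the arguments now and defers the call until run() invokes it.
train_result = run(lambda: make_batches('train', 100))
# functools.partial achieves the same binding without a lambda.
test_result = run(partial(make_batches, 'test', 1))

print(train_result)  # prints: train for 100 epoch(s)
print(test_result)   # prints: test for 1 epoch(s)
```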

That's a lot of code for something DeepChem can do in three lines. The TensorFlow API is verbose and somewhat confusing. It has seemingly arbitrary limitations, like assuming a model will only ever have one output, and therefore only allowing one label. But for better or worse, it's a standard.

Of course, if you just want to use a DeepChem model with a DeepChem dataset, there is no need for any of this. Just use the DeepChem API. But perhaps you want to use a DeepChem dataset with a model that has been implemented as an estimator. In that case, Dataset.make_iterator() allows you to easily do that. Or perhaps you have higher level workflow code that is written to work with estimators. In that case, make_estimator() allows DeepChem models to easily fit into that workflow.