Tutorial: Deep Life Sciences

Welcome to DeepChem's introductory tutorial for the deep life sciences. This series of notebooks is a step-by-step guide that will get you up to speed on the tools and techniques needed to do deep learning for the life sciences.

Scope: This tutorial will encompass both the machine learning and data handling needed to build systems for the deep life sciences.

Outline

  • Part 1: The Basic Tools of the Deep Life Sciences
  • Part 2: Introduction to Molecular Data Handling
  • Part 3: Molecular Machine Learning
  • Part 4:

Why do the DeepChem Tutorial?

1) Career Advancement: Applying AI in the life sciences is a booming industry at present. There are a host of newly funded startups and initiatives at large pharmaceutical and biotech companies centered around AI. Learning and mastering DeepChem will bring you to the forefront of this field and prepare you for a career in it.

2) Humanitarian Considerations: Disease is the oldest cause of human suffering. From the dawn of human civilization, humans have suffered from pathogens, cancers, and neurological conditions. One of the greatest achievements of the last few centuries has been the development of effective treatments for many diseases. By mastering the skills in this tutorial, you will be able to stand on the shoulders of the giants of the past to help develop new medicine.

3) Lowering the Cost of Medicine: The art of developing new medicine is currently an elite skill that can only be practiced by a small core of expert practitioners. By enabling the growth of open source tools for drug discovery, you can help democratize these skills and open up drug discovery to more competition. Increased competition can help drive down the cost of medicine.

Getting Extra Credit

If you're excited about DeepChem and want to get more involved, there are a couple of things that you can do right now:

Part -1: Prerequisites

This tutorial assumes some basic familiarity with the Python data science ecosystem, including libraries such as Numpy, Pandas, and TensorFlow.

Part 0: Setup

The first step is to get DeepChem up and running. We recommend using conda to perform this install.

conda install -c deepchem -c rdkit -c conda-forge -c omnia deepchem=2.1.0

In [1]:
# Run this cell to see if things work
import deepchem as dc



The Basic Tools of the Deep Life Sciences

What does it take to do deep learning for the life sciences? Well, the first thing we'll need to do is actually handle some data. To start, let's take a look at some simple synthetic data.

To generate this synthetic data, we will use Numpy to create some random arrays.


In [3]:
import numpy as np

data = np.random.random((4, 4))
labels = np.random.random((4,)) # one label for each of the 4 datapoints

We've given these arrays some evocative names: "data" and "labels." For now, don't worry too much about the names; just note that the arrays have different shapes. Let's take a quick look to get a feel for these arrays.


In [4]:
data, labels


Out[4]:
(array([[0.17153735, 0.72653504, 0.75818459, 0.64997769],
        [0.64356789, 0.37895973, 0.46143683, 0.3251195 ],
        [0.51409105, 0.20522909, 0.29532684, 0.35239749],
        [0.49242761, 0.62127102, 0.77898693, 0.90960543]]),
 array([0.01939268, 0.43336842, 0.91222562, 0.23498551]))

In order to work with this data in DeepChem, we need to wrap these arrays so DeepChem knows how to handle them. DeepChem uses a Dataset API to facilitate its handling of datasets. For Numpy arrays, we use DeepChem's NumpyDataset object.


In [6]:
from deepchem.data.datasets import NumpyDataset

dataset = NumpyDataset(data, labels)

Ok, now what? We have these arrays in a NumpyDataset object. What can we do with it? Let's try printing out the object.


In [7]:
dataset


Out[7]:
<deepchem.data.datasets.NumpyDataset at 0x7ff02682c710>

Ok, that's not terribly informative. It's just telling us that dataset is a Python object that lives somewhere in memory. Can we recover the two arrays that we used to construct this object? Luckily, the DeepChem API lets us recover the original arrays through the dataset.X and dataset.y attributes.


In [8]:
dataset.X, dataset.y


Out[8]:
(array([[0.17153735, 0.72653504, 0.75818459, 0.64997769],
        [0.64356789, 0.37895973, 0.46143683, 0.3251195 ],
        [0.51409105, 0.20522909, 0.29532684, 0.35239749],
        [0.49242761, 0.62127102, 0.77898693, 0.90960543]]),
 array([0.01939268, 0.43336842, 0.91222562, 0.23498551]))

This set of transformations raises a few questions. First, what was the point of it all? Why would we want to wrap objects this way instead of working with the raw Numpy arrays? The simple answer is to have a unified API for working with larger datasets. Suppose that X and y are so large that they can't fit easily into memory. What would we do then? Being able to work with an abstract dataset object proves very convenient. In fact, you'll have reason to use this feature of Dataset later in the tutorial series.
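To make this concrete, here's a minimal sketch using DeepChem's DiskDataset, which stores its contents on disk rather than in memory while exposing the same attributes as NumpyDataset (the exact from_numpy signature may vary across DeepChem versions):

from deepchem.data.datasets import DiskDataset

# Write the arrays to disk; the returned object exposes the same
# .X, .y, .w, and .ids attributes as NumpyDataset does.
disk_dataset = DiskDataset.from_numpy(data, labels)
print(disk_dataset.X, disk_dataset.y)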

What else can we do with the dataset object? It turns out that it can be useful to be able to walk through the datapoints in the dataset one at a time. For that, we can use the dataset.itersamples() method.


In [9]:
# itersamples yields one (x, y, w, id) tuple per datapoint;
# we ignore the weight and id here
for x, y, _, _ in dataset.itersamples():
    print(x, y)


[0.17153735 0.72653504 0.75818459 0.64997769] 0.019392679983928796
[0.64356789 0.37895973 0.46143683 0.3251195 ] 0.43336841680990135
[0.51409105 0.20522909 0.29532684 0.35239749] 0.9122256174354443
[0.49242761 0.62127102 0.77898693 0.90960543] 0.23498551323364447

There are a couple of other fields that the dataset object tracks. The first is dataset.ids. This is a listing of unique identifiers for the datapoints in the dataset.


In [10]:
dataset.ids


Out[10]:
array([0, 1, 2, 3], dtype=object)

In addition, the dataset object has a field dataset.w. This is the "example weight" associated with each datapoint. Since we haven't explicitly assigned any weights, this is simply going to be all ones. (We'll see how to assign weights explicitly just below.)


In [12]:
dataset.w


Out[12]:
array([1., 1., 1., 1.])
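If you do want non-uniform weights (say, to emphasize more reliable measurements), you can pass them in when constructing the dataset. Here is a minimal sketch, with a hypothetical choice of weights, assuming NumpyDataset's constructor accepts an optional w argument:

# Hypothetical weights: count the third datapoint twice as heavily
weights = np.array([1., 1., 2., 1.])
weighted_dataset = NumpyDataset(data, labels, w=weights)
print(weighted_dataset.w)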

Alright, we've seen some basic features. What if you want to learn more about NumpyDataset? You should check out our in-depth notebook that goes into much more detail on how to work with NumpyDataset objects.