Tutorial Part 9: Creating a high fidelity dataset from experimental data

Suppose you were given data collected by an experimental collaborator. You would like to use this data to construct a machine learning model.

How do you transform this data into a dataset capable of creating a useful model?

Building models from novel data can present several challenges. Perhaps the data was not recorded in a convenient manner. Additionally, perhaps the data contains noise. This is a common occurance with, for example, biological assays due to the large number of external variables and the difficulty and cost associated with collecting multiple samples. This is a problem because you do not want your model to fit to this noise.

Hence, there are two primary challenges:

  • Parsing data
  • De-noising data

In this tutorial, will walk through an example of curating a dataset from an excel spreadsheet of experimental drug measurements. Before we dive into this example though, let's do a brief review of DeepChem's input file handling and featurization capabilities.

Input Formats

DeepChem supports a whole range of input files. For example, accepted input formats for deepchem include .csv, .sdf, .fasta, .png, .tif and other file formats. The loading for a particular file format is governed by Loader class associated with that format. For example, with a csv input, we use the CSVLoader class underneath the hood. Here's an example of a sample .csv file that fits the requirements of CSVLoader.

  1. A column containing SMILES strings [1].
  2. A column containing an experimental measurement.
  3. (Optional) A column containing a unique compound identifier.

Here's an example of a potential input file.

Compound ID measured log solubility in mols per litre smiles
benzothiazole -1.5 c2ccc1scnc1c2

Here the "smiles" column contains the SMILES string, the "measured log solubility in mols per litre" contains the experimental measurement and "Compound ID" contains the unique compound identifier.

Data Featurization

Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets routinely come in the format of lists of molecules and associated experimental readouts. To
transform lists of molecules into vectors, we need to subclasses of DeepChem loader class dc.data.DataLoader such as dc.data.CSVLoader or dc.data.SDFLoader. Users can subclass dc.data.DataLoader to load arbitrary file formats. All loaders must be passed a dc.feat.Featurizer object. DeepChem provides a number of different subclasses of dc.feat.Featurizer for convenience.


This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.


To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

In [1]:
%tensorflow_version 1.x
!curl -Lo deepchem_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import deepchem_installer
%time deepchem_installer.install(version='2.3.0')

TensorFlow 1.x selected.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2814  100  2814    0     0  21813      0 --:--:-- --:--:-- --:--:-- 21813
add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
installing miniconda to /root/miniconda
installing deepchem
/usr/local/lib/python3.6/dist-packages/sklearn/externals/joblib/__init__.py:15: FutureWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=FutureWarning)
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

deepchem-2.3.0 installation finished!
CPU times: user 2.79 s, sys: 609 ms, total: 3.4 s
Wall time: 3min 40s

Parsing data

In order to read in the data, we will use the pandas data analysis library.

In order to convert the drug names into smiles strings, we will use pubchempy. This isn't a standard DeepChem dependency, but you can install this library with pip install pubchempy.

In [2]:
!pip install pubchempy

Collecting pubchempy
  Downloading https://files.pythonhosted.org/packages/aa/fb/8de3aa9804b614dbc8dc5c16ed061d819cc360e0ddecda3dcd01c1552339/PubChemPy-1.0.4.tar.gz
Building wheels for collected packages: pubchempy
  Building wheel for pubchempy (setup.py) ... done
  Created wheel for pubchempy: filename=PubChemPy-1.0.4-cp36-none-any.whl size=13825 sha256=bd54eb755f3e83b75a2579701aadc27284d998fe33b1fc1e22342e2c109939d8
  Stored in directory: /root/.cache/pip/wheels/10/4d/51/6b843681a9a5aef35f0d0fbce243de46f85080036e16118752
Successfully built pubchempy
Installing collected packages: pubchempy
Successfully installed pubchempy-1.0.4

In [0]:
import os
import pandas as pd
from pubchempy import get_cids, get_compounds

Pandas is magic but it doesn't automatically know where to find your data of interest. You likely will have to look at it first using a GUI.

We will now look at a screenshot of this dataset as rendered by LibreOffice.

To do this, we will import Image and os.

In [0]:
import os
from IPython.display import Image, display

In [0]:
current_dir = os.path.dirname(os.path.realpath('__file__'))

In [0]:
# data_screenshot = os.path.join(current_dir, 'assets/dataset_preparation_gui.png')
# display(Image(filename=data_screenshot))

We see the data of interest is on the second sheet, and contained in columns "TA ID", "N #1 (%)", and "N #2 (%)".

Additionally, it appears much of this spreadsheet was formatted for human readability (multicolumn headers, column labels with spaces and symbols, etc.). This makes the creation of a neat dataframe object harder. For this reason we will cut everything that is unnecesary or inconvenient.

In [7]:
!wget https://github.com/deepchem/deepchem/raw/master/datasets/Positive%20Modulators%20Summary_%20918.TUC%20_%20v1.xlsx

--2020-05-31 02:53:46--  https://github.com/deepchem/deepchem/raw/master/datasets/Positive%20Modulators%20Summary_%20918.TUC%20_%20v1.xlsx
Resolving github.com (github.com)...
Connecting to github.com (github.com)||:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/Positive%20Modulators%20Summary_%20918.TUC%20_%20v1.xlsx [following]
--2020-05-31 02:53:46--  https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/Positive%20Modulators%20Summary_%20918.TUC%20_%20v1.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)...,,, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42852 (42K) [application/octet-stream]
Saving to: ‘Positive Modulators Summary_ 918.TUC _ v1.xlsx’

Positive Modulators 100%[===================>]  41.85K  --.-KB/s    in 0.02s   

2020-05-31 02:53:47 (1.69 MB/s) - ‘Positive Modulators Summary_ 918.TUC _ v1.xlsx’ saved [42852/42852]

In [0]:
raw_data_file = os.path.join(current_dir, 'Positive Modulators Summary_ 918.TUC _ v1.xlsx')
raw_data_excel = pd.ExcelFile(raw_data_file)

# second sheet only
raw_data = raw_data_excel.parse(raw_data_excel.sheet_names[1])

In [9]:
# preview 5 rows of raw dataframe

Unnamed: 0 Unnamed: 1 Unnamed: 2 Metric #1 (-120 mV Peak) Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 NaN NaN NaN Vehicle NaN 4 Replications NaN
1 TA ## Position TA ID Mean SD Threshold (%) = Mean + 4xSD N #1 (%) N #2 (%)
2 1 1-A02 Penicillin V Potassium -12.8689 6.74705 14.1193 -10.404 -18.1929
3 2 1-A03 Mycophenolate Mofetil -12.8689 6.74705 14.1193 -12.4453 -11.7175
4 3 1-A04 Metaxalone -12.8689 6.74705 14.1193 -8.65572 -17.7753

Note that the actual row headers are stored in row 1 and not 0 above.

In [10]:
# remove column labels (rows 0 and 1), as we will replace them
# only take data given in columns "TA ID" "N #1 (%)" (3) and "N #2 (%)" (4)
raw_data = raw_data.iloc[2:, [2, 6, 7]]

## collapse multiindex so that drug names and number indexes are columns
#raw_data.reset_index(level=[1, 2], inplace=True)
# reset the index so we keep the label but number from 0 again

## rename columns
raw_data.columns = ['label', 'drug', 'n1', 'n2']

               Unnamed: 2 Unnamed: 6 Unnamed: 7
2  Penicillin V Potassium    -10.404   -18.1929
3   Mycophenolate Mofetil   -12.4453   -11.7175
4              Metaxalone   -8.65572   -17.7753
5           Terazosin·HCl   -11.5048    16.0825
6          Fluvastatin·Na   -11.1354    -14.553

In [11]:
# preview cleaner dataframe

label drug n1 n2
0 2 Penicillin V Potassium -10.404 -18.1929
1 3 Mycophenolate Mofetil -12.4453 -11.7175
2 4 Metaxalone -8.65572 -17.7753
3 5 Terazosin·HCl -11.5048 16.0825
4 6 Fluvastatin·Na -11.1354 -14.553

This formatting is closer to what we need.

Now, let's take the drug names and get smiles strings for them (format needed for DeepChem).

In [0]:
drugs = raw_data['drug'].values

For many of these, we can retreive the smiles string via the canonical_smiles attribute of the get_compounds object (using pubchempy)

In [13]:
get_compounds(drugs[1], 'name')


In [14]:
get_compounds(drugs[1], 'name')[0].canonical_smiles


However, some of these drug names have variables spaces and symbols (·, (±), etc.), and names that may not be readable by pubchempy.

For this task, we will do a bit of hacking via regular expressions. Also, we notice that all ions are written in a shortened form that will need to be expanded. For this reason we use a dictionary, mapping the shortened ion names to versions recognizable to pubchempy.

Unfortunately you may have several corner cases that will require more hacking.

In [0]:
ion_replacements = {
    'HBr': ' hydrobromide',
    '2Br': ' dibromide',
    'Br': ' bromide',
    'HCl': ' hydrochloride',
    '2H2O': ' dihydrate',
    'H20': ' hydrate',
    'Na': ' sodium'

ion_keys = ['H20', 'HBr', 'HCl', '2Br', '2H2O', 'Br', 'Na']

In [0]:
import re

In [0]:
def compound_to_smiles(cmpd):
    # remove spaces and irregular characters
    compound = re.sub(r'([^\s\w]|_)+', '', cmpd)
    # replace ion names if needed
    for ion in ion_keys:
        if ion in compound:
            compound = compound.replace(ion, ion_replacements[ion])

    # query for cid first in order to avoid timeouterror
    cid = get_cids(compound, 'name')[0]
    smiles = get_compounds(cid)[0].canonical_smiles

    return smiles

Now let's actually convert all these compounds to smiles. This conversion will take a few minutes so might not be a bad spot to go grab a coffee or tea and take a break while this is running! Note that this conversion will sometimes fail so we've added some error handling to catch these cases below.

In [18]:
smiles_map = {}
for i, compound in enumerate(drugs):
    # print("Converting %s to smiles" % i)
        smiles_map[compound] = compound_to_smiles(compound)
        print("Errored on %s" % i)

Errored on 162
Errored on 237
Errored on 303
Errored on 399

In [0]:
smiles_data = raw_data
# map drug name to smiles string
smiles_data['drug'] = smiles_data['drug'].apply(lambda x: smiles_map[x] if x in smiles_map else None)

In [20]:
# preview smiles data

label drug n1 n2
0 2 CC1(C(N2C(S1)C(C2=O)NC(=O)COC3=CC=CC=C3)C(=O)[... -10.404 -18.1929
1 3 CC1=C2COC(=O)C2=C(C(=C1OC)CC=C(C)CCC(=O)OCCN3C... -12.4453 -11.7175
2 4 CC1=CC(=CC(=C1)OCC2CNC(=O)O2)C -8.65572 -17.7753
3 5 COC1=C(C=C2C(=C1)C(=NC(=N2)N3CCN(CC3)C(=O)C4CC... -11.5048 16.0825
4 6 CC(C)N1C2=CC=CC=C2C(=C1C=CC(CC(CC(=O)[O-])O)O)... -11.1354 -14.553

Hooray, we have mapped each drug name to its corresponding smiles code.

Now, we need to look at the data and remove as much noise as possible.

De-noising data

In machine learning, we know that there is no free lunch. You will need to spend time analyzing and understanding your data in order to frame your problem and determine the appropriate model framework. Treatment of your data will depend on the conclusions you gather from this process.

Questions to ask yourself:

  • What are you trying to accomplish?
  • What is your assay?
  • What is the structure of the data?
  • Does the data make sense?
  • What has been tried previously?

For this project (respectively):

  • I would like to build a model capable of predicting the affinity of an arbitrary small molecule drug to a particular ion channel protein
  • For an input drug, data describing channel inhibition
  • A few hundred drugs, with n=2
  • Will need to look more closely at the dataset*
  • Nothing on this particular protein

*This will involve plotting, so we will import matplotlib and seaborn. We will also need to look at molecular structures, so we will import rdkit. We will also use the seaborn library which you can install with pip install seaborn.

In [21]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm

In [0]:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw, PyMol, rdFMCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase

In [0]:
# i will use numpy on occasion for manipulating arrays
import numpy as np

Our goal is to build a small molecule model, so let's make sure our molecules are all small. This can be approximated by the length of each smiles string.

In [0]:
smiles_data['len'] = [len(i) if i is not None else 0 for i in smiles_data['drug']]

In [25]:
smiles_lens = [len(i) if i is not None else 0 for i in smiles_data['drug']]

Text(0, 0.5, 'probability')

Some of these look rather large, len(smiles) > 150. Let's see what they look like.

In [0]:
# indices of large looking molecules
suspiciously_large = np.where(np.array(smiles_lens) > 150)[0]

# corresponding smiles string
long_smiles = smiles_data.loc[smiles_data.index[suspiciously_large]]['drug'].values

In [27]:
# look
Draw._MolsToGridImage([Chem.MolFromSmiles(i) for i in long_smiles], molsPerRow=6)


As suspected, these are not small molecules, so we will remove them from the dataset. The argument here is that these molecules could register as inhibitors simply because they are large. They are more likely to sterically blocks the channel, rather than diffuse inside and bind (which is what we are interested in).

The lesson here is to remove data that does not fit your use case.

In [0]:
# drop large molecules
smiles_data = smiles_data[~smiles_data['drug'].isin(long_smiles)]

Now, let's look at the numerical structure of the dataset.

First, check for NaNs.

In [29]:
nan_rows = smiles_data[smiles_data.isnull().T.any().T]
nan_rows[['n1', 'n2']]

n1 n2
62 NaN -7.8266
162 -12.8456 -11.4627
175 NaN -6.61225
187 NaN -8.23326
233 -8.21781 NaN
237 30.8369 6.16932
262 NaN -12.8788
288 NaN -2.34264
300 NaN -8.19936
301 NaN -10.4633
303 -5.61374 8.42267
311 NaN -8.78722
399 -1.45559 -6.47666

I don't trust n=1, so I will throw these out.

Then, let's examine the distribution of n1 and n2.

In [0]:
df = smiles_data.dropna(axis=0, how='any')

In [31]:
# seaborn jointplot will allow us to compare n1 and n2, and plot each marginal
sns.jointplot('n1', 'n2', data=smiles_data)

<seaborn.axisgrid.JointGrid at 0x7f460e332a58>

We see that most of the data is contained in the gaussian-ish blob centered a bit below zero. We see that there are a few clearly active datapoints located in the bottom left, and one on the top right. These are all distinguished from the majority of the data. How do we handle the data in the blob?

Because n1 and n2 represent the same measurement, ideally they would be of the same value. This plot should be tightly aligned to the diagonal, and the pearson correlation coefficient should be 1. We see this is not the case. This helps gives us an idea of the error of our assay.

Let's look at the error more closely, plotting in the distribution of (n1-n2).

In [32]:
diff_df = df['n1'] - df['n2']

plt.xlabel('difference in n')

Text(0, 0.5, 'probability')

This looks pretty gaussian, let's get the 95% confidence interval by fitting a gaussian via scipy, and taking 2*the standard deviation

In [0]:
from scipy import stats

In [0]:
mean, std = stats.norm.fit(np.asarray(diff_df, dtype=np.float32))

In [35]:
ci_95 = std*2


Now, I don't trust the data outside of the confidence interval, and will therefore drop these datapoints from df.

For example, in the plot above, at least one datapoint has n1-n2 > 60. This is disconcerting.

In [0]:
noisy = diff_df[abs(diff_df) > ci_95]
df = df.drop(noisy.index)

In [37]:
sns.jointplot('n1', 'n2', data=df)

<seaborn.axisgrid.JointGrid at 0x7f460e332390>

Now that data looks much better!

So, let's average n1 and n2, and take the error bar to be ci_95.

In [38]:
avg_df = df[['label', 'drug']]
n_avg = df[['n1', 'n2']].mean(axis=1)
avg_df['n'] = n_avg
avg_df.sort_values('n', inplace=True)

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.

Now, let's look at the sorted data with error bars.

In [39]:
plt.errorbar(np.arange(avg_df.shape[0]), avg_df['n'], yerr=ci_95, fmt='o')
plt.xlabel('drug, sorted')

Text(0, 0.5, 'activity')

Now, let's identify our active compounds.

In my case, this required domain knowledge. Having worked in this area, and having consulted with professors specializing on this channel, I am interested in compounds where the absolute value of the activity is greater than 25. This relates to the desired drug potency we would like to model.

If you are not certain how to draw the line between active and inactive, this cutoff could potentially be treated as a hyperparameter.

In [40]:
actives = avg_df[abs(avg_df['n'])-ci_95 > 25]['n']

plt.errorbar(np.arange(actives.shape[0]), actives, yerr=ci_95, fmt='o')

<ErrorbarContainer object of 3 artists>

In [41]:
# summary
print (raw_data.shape, avg_df.shape, len(actives.index))

(430, 5) (391, 3) 6

In summary, we have:

  • Removed data that did not address the question we hope to answer (small molecules only)
  • Dropped NaNs
  • Determined the noise of our measurements
  • Removed exceptionally noisy datapoints
  • Identified actives (using domain knowledge to determine a threshold)

Determine model type, final form of dataset, and sanity load

Now, what model framework should we use?

Given that we have 392 datapoints and 6 actives, this data will be used to build a low data one-shot classifier (10.1021/acscentsci.6b00367). If there were datasets of similar character, transfer learning could potentially be used, but this is not the case at the moment.

Let's apply logic to our dataframe in order to cast it into a binary format, suitable for classification.

In [42]:
# 1 if condition for active is met, 0 otherwise
avg_df['active'] = (abs(avg_df['n'])-ci_95 > 25).astype(int)

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Now, save this to file.

In [0]:
avg_df.to_csv('modulators.csv', index=False)

Now, we will convert this dataframe to a DeepChem dataset.

In [0]:
import deepchem as dc

In [45]:
dataset_file = 'modulators.csv'
task = ['active']
featurizer_func = dc.feat.ConvMolFeaturizer()

loader = dc.data.CSVLoader(tasks=task, smiles_field='drug', featurizer=featurizer_func)
dataset = loader.featurize(dataset_file)

Loading raw samples now.
shard_size: 8192
About to start loading CSV from modulators.csv
Loading shard 1 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 0 took 1.601 s
TIMING: dataset construction took 1.774 s
Loading dataset from disk.

Lastly, it is often advantageous to numerically transform the data in some way. For example, sometimes it is useful to normalize the data, or to zero the mean. This depends in the task at hand.

Built into DeepChem are many useful transformers, located in the deepchem.transformers.transformers base class.

Because this is a classification model, and the number of actives is low, I will apply a balancing transformer. I treated this transformer as a hyperparameter when I began training models. It proved to unambiguously improve model performance.

In [46]:
transformer = dc.trans.BalancingTransformer(dataset=dataset)
dataset = transformer.transform(dataset)

TIMING: dataset construction took 0.200 s
Loading dataset from disk.

Now let's save the balanced dataset object to disk, and then reload it as a sanity check.

In [0]:
dc.utils.save.save_to_disk(dataset, 'balanced_dataset.joblib')
balanced_dataset = dc.utils.save.load_from_disk('balanced_dataset.joblib')

Tutorial written by Keri McKiernan (github.com/kmckiern) on September 8, 2016

Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

Star DeepChem on GitHub

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

Join the DeepChem Gitter

The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!


[2] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line notation and computerized interpreter for chemical structures." US Environmental Protection Agency, Environmental Research Laboratory, 1987.