Screening ZINC For HIV Inhibition

In this tutorial I will walk through how to efficiently screen a large compound library (ZINC) with DeepChem. Screening a large compound library with a machine learning model is a CPU-bound, pleasingly parallel problem. The code examples assume the available resources are a single very big machine (like an AWS c5.18xlarge), but they should be readily adaptable to other systems (like a supercomputing cluster). At a high level, we will...

  1. Train a machine learning model over labeled data
  2. Transform ZINC into "Work-Units"
  3. Create an inference script which runs predictions over a "Work-Unit"
  4. Load the "Work-Units" into a "Work Queue"
  5. Consume work units from the "Work Queue"
  6. Gather results

1. Train Model On Labeled Data

We are just going to knock out a simple model here. In a real-world problem you would probably try several models and do a little hyperparameter searching; a minimal sweep is sketched after the evaluation cell below.


In [5]:
from deepchem.molnet.load_function import hiv_datasets

In [10]:
from deepchem.models import GraphConvModel
from deepchem.data import NumpyDataset
from sklearn.metrics import average_precision_score
import numpy as np

tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv")
train, valid, test = [NumpyDataset.from_DiskDataset(x) for x in all_datasets]
model = GraphConvModel(1, mode="classification")
model.fit(train)


Loading dataset from disk.
Loading dataset from disk.
Loading dataset from disk.
/home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Out[10]:
94.7494862112294

In [25]:
y_true = np.squeeze(valid.y)
y_pred = model.predict(valid)[:,0,1]
print("Average Precision Score:%s" % average_precision_score(y_true, y_pred))
sorted_results = sorted(zip(y_pred, y_true), reverse=True)
hit_rate_100 = sum(x[1] for x in sorted_results[:100]) / 100
print("Hit Rate Top 100: %s" % hit_rate_100)


Average Precision Score:0.22277598937482013
Hit Rate Top 100: 0.33
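
If you do want to squeeze more out of the validation numbers before committing to the screen, a small sweep over GraphConvModel settings is a reasonable next step. This is only a sketch, not part of the original run; it reuses the train and valid datasets from above and assumes your DeepChem version exposes the dropout and batch_size arguments on GraphConvModel.

In [ ]:
best_score, best_params = -1.0, None
# Try a few dropout / batch size combinations and keep the one with the
# best average precision on the validation set.
for dropout in (0.0, 0.1, 0.25):
    for batch_size in (64, 128):
        candidate = GraphConvModel(1, mode="classification",
                                   dropout=dropout, batch_size=batch_size)
        candidate.fit(train, nb_epoch=10)
        y_pred = candidate.predict(valid)[:, 0, 1]
        score = average_precision_score(np.squeeze(valid.y), y_pred)
        print("dropout=%s batch_size=%s AP=%.4f" % (dropout, batch_size, score))
        if score > best_score:
            best_score, best_params = score, (dropout, batch_size)
print("Best params: %s (AP=%.4f)" % (best_params, best_score))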

Retrain Model Over Full Dataset For The Screen


In [29]:
tasks, all_datasets, transformers = hiv_datasets.load_hiv(featurizer="GraphConv", split=None)

model = GraphConvModel(1, mode="classification", model_dir="/tmp/zinc/screen_model")
model.fit(all_datasets[0])
model.save()


Loading raw samples now.
shard_size: 8192
About to start loading CSV from /tmp/HIV.csv
Loading shard 1 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 0 took 15.701 s
Loading shard 2 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 1 took 15.869 s
Loading shard 3 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 2 took 19.106 s
Loading shard 4 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 3 took 16.267 s
Loading shard 5 of size 8192.
Featurizing sample 0
Featurizing sample 1000
Featurizing sample 2000
Featurizing sample 3000
Featurizing sample 4000
Featurizing sample 5000
Featurizing sample 6000
Featurizing sample 7000
Featurizing sample 8000
TIMING: featurizing shard 4 took 16.754 s
Loading shard 6 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 5 took 0.446 s
TIMING: dataset construction took 98.214 s
Loading dataset from disk.
TIMING: dataset construction took 21.127 s
Loading dataset from disk.
/home/leswing/miniconda3/envs/deepchem/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

2. Create Work-Units

  1. Download All of ZINC15.

Go to http://zinc15.docking.org/tranches/home and download all non-empty tranches in .smi format. I found it easiest to download the wget script from that page and then run it. For the rest of this tutorial I will assume ZINC was downloaded to /tmp/zinc.

The way ZINC delivers the data isn't great for inference. We want "Work-Units" that a single CPU can execute in a reasonable amount of time (10 minutes to an hour). To accomplish this we are going to split the ZINC data into files of 500 thousand lines each.

mkdir /tmp/zinc/screen
find /tmp/zinc -name '*.smi' -exec cat {} \; | grep -iv "smiles" \
     | split -l 500000 - /tmp/zinc/screen/segment

This bash command

  1. finds all .smi files
  2. prints the contents of each file to stdout
  3. removes header lines
  4. splits the stream into multiple files in /tmp/zinc/screen, each 500k molecules long
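
Before queuing anything, it is worth a quick sanity check that the split produced what we expect. The cell below is just a sketch, not part of the original tutorial; it counts the work units and the total number of molecules, assuming the segment files live in /tmp/zinc/screen.

In [ ]:
import os

screen_dir = '/tmp/zinc/screen'
segments = sorted(os.listdir(screen_dir))
total = 0
for segment in segments:
    # Count molecules (one SMILES per line) in each work unit
    with open(os.path.join(screen_dir, segment)) as fin:
        total += sum(1 for _ in fin)
print("%d work units, %d molecules total" % (len(segments), total))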

3. Create Inference Script

Now that we have work units, we need to construct a program which ingests a work unit and logs the result. It is important that the logging mechanism is thread safe! For this example we will get the work unit via a file path and log the result to a file. An easy extension to distribute over multiple computers would be to get the work unit via a URL and log the results to a distributed queue.

Here is what mine looks like

inference.py

import sys
import deepchem as dc
import numpy as np
from rdkit import Chem


def create_dataset(fname, batch_size=50000):
    # Stream the work unit, yielding featurized batches along with the
    # original lines so the predictions can be written back out next to them.
    featurizer = dc.feat.ConvMolFeaturizer()
    fin = open(fname)
    mols, orig_lines = [], []
    for line in fin:
        line = line.strip().split()
        try:
            mol = Chem.MolFromSmiles(line[0])
            if mol is None:
                continue
            mols.append(mol)
            orig_lines.append(line)
        except Exception:
            pass
        if len(mols) > 0 and len(mols) % batch_size == 0:
            features = featurizer.featurize(mols)
            y = np.ones(shape=(len(mols), 1))  # dummy labels; we only predict
            ds = dc.data.NumpyDataset(features, y)
            yield ds, orig_lines
            mols, orig_lines = [], []
    if len(mols) > 0:
        features = featurizer.featurize(mols)
        y = np.ones(shape=(len(mols), 1))
        ds = dc.data.NumpyDataset(features, y)
        yield ds, orig_lines


def evaluate(fname):
    fout_name = "%s_out.smi" % fname
    # Load the model trained above (saved with model_dir="/tmp/zinc/screen_model")
    model = dc.models.TensorGraph.load_from_dir('/tmp/zinc/screen_model')
    for ds, lines in create_dataset(fname):
        # predict() returns (n_samples, n_tasks, n_classes); keep P(active)
        y_pred = np.squeeze(model.predict(ds), axis=1)
        with open(fout_name, 'a') as fout:
            for index, line in enumerate(lines):
                line.append(y_pred[index][1])
                line = [str(x) for x in line]
                fout.write("%s\n" % "\t".join(line))


if __name__ == "__main__":
    evaluate(sys.argv[1])
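
You can smoke-test the script on a single work unit before queuing the whole screen. The segment name below assumes split's default two-letter suffixes, so check what your files are actually called:

>> python inference.py /tmp/zinc/screen/segmentaa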

4. Load "Work-Unit" into a "Work Queue"

We are going to use a flat file as our distribution mechanism: a bash script that calls our inference script once for every work unit. If you are at an academic institution, this would be queueing your jobs with pbs/qsub/slurm. Options for cloud computing would be RabbitMQ or Kafka.


In [ ]:
import os
work_units = os.listdir('/tmp/zinc/screen')
with open('/tmp/zinc/work_queue.sh', 'w') as fout:
    fout.write("#!/bin/bash\n")
    for work_unit in work_units:
        # The work units live in /tmp/zinc/screen, so join against that directory
        full_path = os.path.join('/tmp/zinc/screen', work_unit)
        fout.write("python inference.py %s\n" % full_path)

5. Consume work units from "Work Queue"

We will consume work units from our work queue using a very simple process pool. It takes lines from our "Work Queue" and runs them, running as many processes in parallel as we have CPUs. If you are using a supercomputing cluster system like pbs/qsub/slurm, it will take care of this for you. The key is to use one CPU per work unit to get the highest throughput. We accomplish that here using the Linux utility "taskset".

Using a c5.18xlarge on AWS, this will finish overnight.

process_pool.py

import multiprocessing
import sys
from multiprocessing.pool import Pool

import delegator  # third-party package: delegator.py


def run_command(args):
  q, command = args
  # Reserve a CPU id and pin this work unit to that core with taskset
  cpu_id = q.get()
  try:
    command = "taskset -c %s %s" % (cpu_id, command)
    print("running %s" % command)
    c = delegator.run(command)
    print(c.err)
    print(c.out)
  except Exception as e:
    print(e)
  # Hand the CPU id back so the next work unit can use it
  q.put(cpu_id)


def main(n_processors, command_file):
  commands = [x.strip() for x in open(command_file).readlines()]
  # Drop comment lines such as the #!/bin/bash shebang
  commands = list(filter(lambda x: not x.startswith("#"), commands))
  q = multiprocessing.Manager().Queue()
  for i in range(n_processors):
    q.put(i)
  argslist = [(q, x) for x in commands]
  pool = Pool(processes=n_processors)
  pool.map(run_command, argslist)


if __name__ == "__main__":
  processors = multiprocessing.cpu_count()
  main(processors, sys.argv[1])
>> python process_pool.py /tmp/zinc/work_queue.sh

6. Gather Results

Since we logged our results to *_out.smi files, we now need to gather them all up and sort them by our predictions. The resulting file will be > 40GB. To analyze the data further you can use dask, or put the data in an RDKit PostgreSQL cartridge.

Here I show how to join and sort the data to get the "best" results.

find /tmp/zinc -name '*_out.smi' -exec cat {} \; > /tmp/zinc/screen/results.smi
sort -rg -k 3,3 /tmp/zinc/screen/results.smi > /tmp/zinc/screen/sorted_results.smi
# Put the top 100k scoring molecules in their own file
head -n 100000 /tmp/zinc/screen/sorted_results.smi > /tmp/zinc/screen/top_100k.smi

/tmp/zinc/screen/top_100k.smi is now a small enough file to investigate using standard tools like pandas.
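
For example, here is a minimal pandas sketch (not part of the original notebook) for loading the file. It assumes the three tab-separated columns written by inference.py are the SMILES string, the ZINC ID carried over from the tranche files, and the predicted score.

In [ ]:
import pandas as pd

# Load the top-scoring molecules written by the sort/head pipeline above
top = pd.read_csv('/tmp/zinc/screen/top_100k.smi', sep='\t',
                  names=['smiles', 'zinc_id', 'score'])
top['score'] = top['score'].astype(float)
print(top.describe())
print(top.head())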


In [9]:
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import SVG
from rdkit.Chem.Draw import rdMolDraw2D 
lines = open('/tmp/zinc/screen/top_100k.smi').readlines()[:100]
best_mols = [Chem.MolFromSmiles(x.strip().split()[0]) for x in lines]
best_scores = [x.strip().split()[2] for x in lines]

In [10]:
print(best_scores[0])
best_mols[0]


0.98874843
Out[10]: (rendered molecule image)

In [11]:
print(best_scores[0])
best_mols[1]


0.98874843
Out[11]: (rendered molecule image)

In [12]:
print(best_scores[0])
best_mols[2]


0.98874843
Out[12]: (rendered molecule image)

In [13]:
print(best_scores[0])
best_mols[3]


0.98874843
Out[13]: (rendered molecule image)

The screen seems to favor molecules with one or more sulfonate (SO3) groups. The top-scoring molecules also have low diversity. When creating a "buy list" we want to optimize for more than just predicted activity, for instance diversity and drug-likeness (MPO).
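
As a starting point for that kind of filtering, here is a minimal sketch (not from the original notebook) that keeps only reasonably drug-like molecules using RDKit's QED estimate and greedily drops near-duplicates by Morgan-fingerprint Tanimoto similarity. The 0.7 similarity cutoff and 0.5 QED floor are arbitrary choices to illustrate the idea.

In [ ]:
from rdkit import DataStructs
from rdkit.Chem import QED, AllChem

def diverse_druglike(mols, scores, sim_cutoff=0.7, min_qed=0.5):
    # Greedy pick: walk molecules from best to worst score, keeping the ones
    # that are drug-like and not too similar to anything already kept.
    kept, kept_fps = [], []
    for mol, score in sorted(zip(mols, scores), key=lambda x: -float(x[1])):
        if mol is None or QED.qed(mol) < min_qed:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 2048)
        if any(DataStructs.TanimotoSimilarity(fp, f) >= sim_cutoff for f in kept_fps):
            continue
        kept.append((mol, score))
        kept_fps.append(fp)
    return kept

picks = diverse_druglike(best_mols, best_scores)
print("%d diverse, drug-like molecules out of %d" % (len(picks), len(best_mols)))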

