Note: This tutorial is an IPython notebook which is a web-based interactive environment where code, text and plots can be combined and the code snippets can be executed (and modified) directly. IPython can be installed on Linux, Mac and Windows - please follow the instructions on http://ipython.org/install.html.
Alternatively, we provide the tutorial as an HTML and PDF file.
Sereina Riniker and Gregory Landrum
Novartis Institutes for BioMedical Research, Basel, Switzerland
In this tutorial, we will learn to:
The whole-cell phenotypic screen investigated here is against Plasmodium falciparum (Pf), a protozoan parasite that causes Malaria in humans. Such screening campaigns have been successful in the past in identifying new chemotypes (Chem. Biol. (2012), 19, 116).
The analysis of the HTS data will be done by a standard hit-list triaging workflow: filtering the primary hits for drug-like properties, removing compounds with potentially problematic substructures, and clustering by chemical series. Based on this, a subset of hits is proposed for validation in confirmation assays.
For the predictive model in step 2, we will investigate a set of machine-learning (ML) methods and combine the most promising models by heterogeneous classifier fusion in order to take advantage of the individual strengths of the different ML methodologies.
Finally in step 3, the fusion model is used to rank-order a set of untested commercially available molecules (list obtained from eMolecules) based on the predicted probability to be active. This list is further processed to identify an interesting subset for follow-up screening.
The workflow consists of this tutorial together with a series of Python scripts for the computationally more demanding tasks which can be run in a shell or on a cluster. The Python libraries used during this tutorial are all open-source and freely downloadable from the web. These are the cheminformatics toolkit RDKit, the machine-learning toolkit scikit-learn, pandas for working data tables, and libraries for scientific computing, numpy and scipy.
Tutorials for many of the tools used are available online:
Additionally, the book Python for Data Analysis is an excellent resource for getting started with many of the tools used here.
RDKit version 2013.09, scikit-learn version 0.13, numpy version 1.6.2 and scipy version 0.9.0 were used when developing this tutorial. Figures are plotted using matplotlib version 1.1.0.
The hit list of an HTS screen can contain artifacts that lead to false positives. Artifacts can be due to the chemical structure (i.e. reactive, toxic, interfering in assay signaling, etc.), the assay, screening technology and storage conditions (Curr. Opin. Chem. Biol. (2006), 10, 343). Testing these in a subsequent confirmatory assay would cost time and resources and may mislead the medicial chemists. Therefore, a triaging step is usually performed on the hit-list to remove as many potential false positives as possible. For this, we will use both property filters (e.g. molecular weight) as well as problematic-substructure filters (e.g. rhodanines).
We will use the following property filters for in silico post-processing which are a combination of the filters from the Rapid Elimination of Swill (REOS) described in Nature Rev. (2003), 2, 259 and the ones described in Curr. Opin. Chem. Biol. (2006), 10, 343. We choose softer filters than in REOS because we do not have an experienced medicinal chemist at hand who can look through the discarded molecules and "rescue" interesting compounds.
To remove compounds with potentially problematic substructures, we will use the PAINS (Pan Assay Interference Compounds) filters described in J. Med. Chem. (2010), 53, 2719.
After we have filtered the primary hits, we will have a look at the chemical series in the set. There may be a large number of compounds sharing the same scaffolds/chemotypes; we do not necessarily want to test all of these in a confirmatory assay. We want to strike a good balance between gaining structure-activity relationship (SAR) information from molecules with the same chemotype and testing as many diverse chemotypes as possible. For this, we will cluster the compounds and sample from these clusters.
The data from the primary phenotypic screen is provided by the TDT challenge in the file malariahts_trainingset.txt which we downloaded from the TDT website, gzipped and saved in the directory data.
The file contains six columns:
Normally, the first step would be to classify the compounds (i.e. select hits) based on their activity value using an analytic method such as Z-scores (for a discussion see J. Biomol. Screen. (2011), 16, 775). Here, an activity outcome (true, false, ambiguous) is already provided and we will use this classification in the following.
We start by separating the actives (activity outcome = true) and the inactives (activity outcome = false) in two different files. The molecules with an ambiguous outcome we will ignore.
In [1]:
%pylab inline
In [2]:
import gzip
actives = []
inactives = []
ambiguous = 0
for line in gzip.open('data/malariahts_trainingset.txt.gz', 'r'):
if line[0] == "#": continue
line = line.strip().split()
# contains: [sample_id, value_green, value_red, hit, pec50, smiles]
if line[3] == 'true': # = actives
# store: [sample_id, hit, pec50, smiles]
actives.append((line[0], line[3], line[4], line[5]))
elif line[3] == 'false': # = inactives
inactives.append((line[0], line[3], line[4], line[5]))
else: # ambiguous --> ignore
ambiguous += 1
num_actives = len(actives)
num_inactives = len(inactives)
print "number of actives =", num_actives
print "number of inactives =", num_inactives
print "number of ambiguous molecules =", ambiguous
# write the actives
outfile = gzip.open('data/training_actives.dat.gz', 'w')
outfile.writelines("%s\n" % ("\t".join(a[:4])) for a in actives)
outfile.close()
# and the inactives
outfile = gzip.open('data/training_inactives.dat.gz', 'w')
outfile.writelines("%s\n" % ("\t".join(a[:4])) for a in inactives)
outfile.close()
As we will use RDKit for the handling of molecules, we check if the given canonical SMILES can be converted to an RDKit molecule.
In [3]:
from rdkit import Chem
# check the SMILES of the actives
count = 0
outfile = gzip.open('data/training_actives_cleaned.dat.gz', 'w')
for line in gzip.open('data/training_actives.dat.gz', 'r'):
line = line.strip().split()
# contains: [sample_id, hit, pec50, smiles]
m = Chem.MolFromSmiles(line[3])
if m is not None:
outfile.write("%s\n" % ("\t".join(line[:4])))
else:
count += 1
outfile.close()
num_actives -= count # decrease the number of actives accordingly
print "number of actives with RDKit-invalid SMILES =", count
In [4]:
# check the SMILES of the inactives
count = 0
outfile = gzip.open('data/training_inactives_cleaned.dat.gz', 'w')
for line in gzip.open('data/training_inactives.dat.gz', 'r'):
line = line.strip().split()
# contains: [sample_id, hit, pec50, smiles]
m = Chem.MolFromSmiles(line[3])
if m is not None:
outfile.write("%s\n" % ("\t".join(line[:4])))
else:
count += 1
outfile.close()
num_inactives -= count # decrease the number of inactives accordingly
print "number of inactives with RDKit-invalid SMILES =", count
We start by reading the SMILES of the actives and generate molecules for them.
In [5]:
mols = []
for line in gzip.open('data/training_actives_cleaned.dat.gz', 'r'):
line = line.strip().split()
# contains: [sample_id, hit, pec50, smiles]
m = Chem.MolFromSmiles(line[3])
mols.append((m, line[0]))
print "number of actives =", len(mols)
Next, we calculate the molecular weight of each compound and take only those between 100 - 700 g/mol. Molecules with high molecular weight have often a low membrane permeability.
In [6]:
from rdkit.Chem import Descriptors
# loop over molecules
new_mols = []
bad_mols = []
for m,idx in mols:
mw = Descriptors.MolWt(m)
if mw > 700 or mw < 100:
bad_mols.append((m, idx))
else:
new_mols.append((m, idx))
mols = new_mols
print "number of actives with 100 < MW < 700 =", len(mols)
Let's have a look at the molecule that was discarded.
In [7]:
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
Draw.MolToImage(bad_mols[0][0])
Out[7]:
Next, we check that the number of heavy atoms is between 5 and 50, and that the number of rotatable bonds is less than 12 (this number is higher than the filter used in REOS because RDKit has the tendency to overcount rotatable bonds).
In [8]:
from rdkit.Chem import AllChem
# loop over molecules
new_mols = []
bad_mols = []
for m,idx in mols:
num_ha = m.GetNumHeavyAtoms()
num_rb = AllChem.CalcNumRotatableBonds(m)
if num_ha > 50 or num_ha < 5 or num_rb > 12:
bad_mols.append((m, idx))
else:
new_mols.append((m, idx))
mols = new_mols
print "number of actives with 5-50 heavy atoms and < 12 \
rotatable bonds = %i" % len(mols)
The molecules discarded at this step are below, most of them contain long aliphatic chains:
In [9]:
Draw.MolsToGridImage([m[0] for m in bad_mols],
legends=[m[1] for m in bad_mols],
molsPerRow=4)
Out[9]:
Now, we check the number of hydrogen-bond donors and acceptors. A large number of hydrogen bond donors and/or acceptors can lead to low membrane permeability.
In [10]:
# loop over molecules
new_mols = []
bad_mols = []
for m,idx in mols:
num_hba = AllChem.CalcNumHBA(m) # acceptors
num_hbd = AllChem.CalcNumHBD(m) # donors
if num_hba > 10 or num_hbd > 5:
bad_mols.append((m, idx))
else:
new_mols.append((m, idx))
mols = new_mols
print "number of actives with < 5 H-bond donors and < 10 \
H-bond acceptors = %i" % len(mols)
Here, all molecules pass our filter.
The last property filter is to check the partition coefficient logP of the compounds which is a measure of lipophilicity. For this, we calculate the logP value based on atom contributions as described in J. Chem. Inf. Comput. Sci (1999), 39, 868.
In [11]:
from rdkit.Chem import Crippen
new_mols = []
bad_mols = []
for m,idx in mols:
logp = Crippen.MolLogP(m)
if logp < -5 or logp > 7.5:
bad_mols.append((m, idx))
else:
new_mols.append((m, idx))
mols = new_mols
print "number of actives with -5 < logP < 7.5 = %i" % len(mols)
The molecules discarded at this step are:
In [12]:
Draw.MolsToGridImage([m[0] for m in bad_mols],
legends=[m[1] for m in bad_mols],
molsPerRow=4)
Out[12]:
The PAINS filters described in J. Med. Chem. (2010), 53, 2719 were converted to SMARTS patterns by Macs in Chemistry. There are four files in the directory pains_filters: one file for each of the three families described in the original publication plus one file containing all filters (PAINS.sieve). The files have five columns: FRAGMENT, name, SMARTS, minimum, maximum. Minimum and maximum define how often a specific fragment is allowed to occur in a molecule. However, for the patterns in PAINS.sieve the maximum value is always zero, therefore we can ignore this information.
In [13]:
pains = []
for line in open('pains_filter/PAINS.sieve', 'r'):
if line[0] == "#": continue # ignore header
line = line.strip().split()
if line:
m = Chem.MolFromSmarts(line[2])
pains.append(m)
print "number of PAINS SMARTS patterns =", len(pains)
Now, we loop over the property-filtered compounds and discard any that have a substructure match with one of the PAINS.
In [14]:
new_mols = []
bad_mols = []
for m,idx in mols:
match = False
for patt in pains:
if m.HasSubstructMatch(patt):
match = True
break # there is a match, stop searching
if not match:
new_mols.append((m, idx))
else:
bad_mols.append((m, idx))
mols = new_mols
print "number of actives without problematic \
substructures = %i" % len(mols)
Some of the molecules with problematic substructures are:
In [15]:
Draw.MolsToGridImage([m[0] for m in bad_mols[:20]],
legends=[m[1] for m in bad_mols[:20]],
molsPerRow=4)
Out[15]:
At this point, normally a medicial chemist would get involved, look at the discarded molecules and "rescue" interesting looking ones. This is of course not possible for the tutorial, therefore we will discard all of them.
Here, we are in the fortunate position to have already data from a confirmatory screen available. So, let's compare how many of the filtered molecules were actually tested in the confirmatory screen and how many of the tested ones were true positives.
First, we get the sample IDs of the molecules with a measured pEC50 value.
In [16]:
pEC50 = {}
for line in gzip.open('data/training_actives_cleaned.dat.gz', 'r'):
line = line.strip().split()
# contains: [sample_id, hit, pec50, smiles]
if line[2] != 'NA':
pEC50[line[0]] = float(line[2])
print "number of compounds with pEC50 value =", len(pEC50)
To decide what cutoff to use for active/inactive classification we have to look at the distribution of pEC50 values. For an easy target, a cutoff of EC50 < 10 uM might be too high and we have too many actives, whereas for a difficult target such a cutoff is too low and we find hardly any actives.
In [17]:
import numpy
pec_hist = numpy.histogram([pEC50.values()], bins=20)
num_pec = float(len(pEC50.keys()))
p_gt5 = len([p for p in pEC50.values() if p > 5]) / num_pec * 100
p_gt6 = len([p for p in pEC50.values() if p > 6]) / num_pec * 100
print "percent molecules < 10 uM: %.2f %%" % p_gt5
print "percent molecules < 1 uM: %.2f %%" % p_gt6
# plot it
fig = plt.figure(1, figsize=(10, 5))
plt1 = plt.subplot(111)
plt.axis([4.5, 9.5, 0, 300])
plt.xlabel('pEC50')
plt.ylabel('Number of molecules')
plt1.bar(pec_hist[1][:-1], pec_hist[0], lw=0, width=0.1)
plt.show()
The distribution indicates the target is rather easy with 75% actives at a cutoff of 10 uM. We therefore set the cutoff for (strong) actives to EC50 < 1 uM (i.e. pEC50 > 6.0), but treat the molecules with 1 uM < EC50 < 10 uM as weak actives.
In [18]:
not_measured = 0
positive = 0
medium = 0
negative = 0
for m,idx in mols:
if pEC50.has_key(idx):
if pEC50[idx] > 6.0: # 1uM
positive += 1
elif pEC50[idx] > 5.0: # 10 uM
medium += 1
else:
negative += 1
else:
not_measured += 1
print "number of not measured compounds = %i" % not_measured
print "EC50 < 1 uM:\t%i molecules" % positive
print "1 uM < EC50 < 10 uM:\t%i molecules" % medium
print "EC50 > 10 uM:\t%i molecules" % negative
How many of the molecules that did not pass our triaging filters were actually tested in the confirmatory screen?
In [19]:
measured_idx = set(pEC50.keys())
proposed_idx = set([idx for m,idx in mols])
notproposed_idx = measured_idx.difference(proposed_idx)
notproposed_pos = 0
notproposed_med = 0
notproposed_neg = 0
for k in notproposed_idx:
if pEC50[k] > 6.0: notproposed_pos += 1
elif pEC50[k] > 5.0: notproposed_med += 1
else: notproposed_neg += 1
print "molecules with EC50 < 1 uM:\t%i" % notproposed_pos
print "molecules with 1 uM < EC50 < 10 uM:\t%i" % notproposed_med
print "molecules with EC50 > 10 uM:\t%i" % notproposed_neg
Do these 267 compounds not in our filtered selection but tested in the confirmatory screen contain any PAINS patterns?
In [20]:
notproposed_mols = []
for line in gzip.open('data/training_actives_cleaned.dat.gz', 'r'):
line = line.strip().split()
# contains: [sample_id, hit, pec50, smiles]
if line[0] in notproposed_idx:
notproposed_mols.append((Chem.MolFromSmiles(line[3]), line[0]))
print "number of not proposed molecules = %i" % len(notproposed_mols)
In [21]:
bad_mols = []
for m,idx in notproposed_mols:
match = False
for patt in pains:
if m.HasSubstructMatch(patt):
match = True
break # there is a match, stop searching
if match:
bad_mols.append((m, idx))
print "number of molecules with problematic \
substructures = %i" % len(bad_mols)
So, 259 out of the 269 molecules contain problematic substructures and were therefore not in our selection. Even if many of them were tested active in the confirmatory screen (i.e. EC50 < 1 uM), they may be toxic or have other unwanted properties.
This was just a quick informational exercise. In a normal setting, we would not already have EC50 data as we are picking compounds for a confirmatory assay.
The property and substructure filters we have applied still leave us with many more measured actives than we could submit for a confirmatory screen; we need to pick a subset. In order to obtain as much information as possible from the confirmatory screen, we would like to pick a few members of each chemical series in our set. Since there is no standard computational approach to identify chemical series, we'll use a surrogate and cluster the compounds based on chemical similarity.
For this exercise, we will use the RDK5 fingerprint (a subgraph-based fingerprint similar to the well-known Daylight fingerprint) and the Butina clustering algorithm. The similarity between two fingerprints is calculated using the Tanimoto coefficient, although other similarity metrics could be used as well (J. Chem. Inf. Model. (2012), 52, 2884). We take a cutoff = 0.5 for the distance, i.e. all molecules in a cluster have a maximum distance of 0.5 from the cluster center.
In [25]:
from rdkit import DataStructs
from rdkit.ML.Cluster import Butina
# calculate fingerprints
fps = []
for m,idx in mols:
fps.append(Chem.RDKFingerprint(m, maxPath=5))
# generate distance matrix
dist_matrix = []
num_fps = len(fps)
for i in range(1, num_fps):
similarities = DataStructs.BulkTanimotoSimilarity(fps[i],fps[:i])
dist_matrix.extend([1-x for x in similarities])
# cluster
clusters = Butina.ClusterData(dist_matrix, num_fps, 0.5, isDistData=True) # distance cutoff = 0.5
print "number of clusters =", len(clusters)
num_clust_g5 = len([c for c in clusters if len(c) > 5])
print "number of clusters with more than 5 compounds =", num_clust_g5
The size of the clusters is distributed as follows (for the first half of the clusters).
In [26]:
# plot the size of the first 150 clusters
fig = plt.figure(1, figsize=(18, 5))
plt1 = plt.subplot(111)
plt.axis([0, 151, 0, len(clusters[0])+1])
plt.xlabel('Cluster index')
plt.ylabel('Number of molecules')
plt1.bar(range(1, 151), [len(c) for c in clusters[:150]], lw=0)
plt.show()
Let's have a look at the first twelve molecules in the largest cluster.
In [27]:
Draw.MolsToGridImage([mols[i][0] for i in clusters[0][:12]],
legends=[mols[i][1] for i in clusters[0][:12]],
molsPerRow=4)
Out[27]:
The molecules in the cluster look indeed similar to each other and many share a common scaffold. The scaffold can be obtained by calculating the maximum common substructure (MCS) of all molecules in the cluster. We set the threshold to 0.8, i.e. the MCS has to be present in 80% of the molecules. This allows to retrieve the "real" scaffold even when one or two of the molecules have a slightly different one.
In [28]:
from rdkit.Chem import MCS
mcs = MCS.FindMCS([mols[i][0] for i in clusters[0]], threshold=0.8)
m = Chem.MolFromSmarts(mcs.smarts)
Draw.MolToImage(m)
Out[28]:
For the final list of 500 compounds which we propose for testing in a confirmatory assay, we take from each cluster the cluster centre (i.e. the first molecule) and then we take - starting with the largest cluster - for each cluster the 5 molecules (or 50% if less than 5 molecules are left in the cluster) most similar to the cluster center, until we have selected 500 compounds.
The idea behind this approach is to ensure diversity (because we have respresentatives of each cluster) and the chance of getting some SAR from the results of the confirmatory assay (because we have groups of quite similar molecules).
In [29]:
# take the cluster centre of each cluster
final_mols = [mols[c[0]] for c in clusters]
# sort the molecules within a cluster based on their similarity
# to the cluster centre and sort the clusters based on their size
clusters2 = []
for c in clusters:
if len(c) < 2: continue
fps_clust = [Chem.RDKFingerprint(mols[i][0], maxPath=5) for i in c]
simils = DataStructs.BulkTanimotoSimilarity(fps_clust[0],
fps_clust[1:])
simils = [(s,i) for s,i in zip(simils, c[1:])]
simils.sort(reverse=True)
clusters2.append((len(simils), [i for s,i in simils]))
clusters2.sort(reverse=True)
# take 5 molecules (or 50%) of each cluster starting with the
# largest one
idx = 0
diff = 500 - len(final_mols)
while diff > 0:
c = clusters2[idx][1]
if clusters2[idx][0] > 5:
num_cmps = 5
else:
num_cmps = int(0.5*len(c))+1
if num_cmps > diff: num_cmps = diff
final_mols += [mols[i] for i in c[:num_cmps]]
idx += 1
diff = 500 - len(final_mols)
print "number of selected molecules =", len(final_mols)
The molecules picked from the largest cluster are shown below (aligned by the scaffold), where the first one is the cluster center.
In [30]:
num_clust = len(clusters)
mols_cluster0 = [final_mols[0]] + final_mols[num_clust:num_clust+5]
mcs = MCS.FindMCS([m[0] for m in mols_cluster0]) # get scaffold
scaffold = Chem.MolFromSmarts(mcs.smarts)
AllChem.Compute2DCoords(scaffold) # generate 2D coordinates for scaffold
for m,idx in mols_cluster0: # align molecules to scaffold
AllChem.GenerateDepictionMatching2DStructure(m, scaffold)
Draw.MolsToGridImage([m[0] for m in mols_cluster0],
legends=[m[1] for m in mols_cluster0],
molsPerRow=4)
Out[30]:
From this six molecules, we will already be able to gain some SAR information. For example, the first and second molecule will show the effect between a methoxy and ethoxy group in para-position at the phenyl ring. Or the first and the last molecule will show the effect between a secondary amine (one ethyl substituent) or a tertiary amine (two ethyl substituents).
We write the compounds (identifier, SMILES) to a file in the directory results.
In [31]:
# write them to file
outfile = open('results/compounds_for_confirmatory_assay.txt', 'w')
outfile.write("#SAMPLE\tSMILES\n") # header
outfile.writelines("%s\t%s\r\n" %
(idx,Chem.MolToSmiles(m, isomericSmiles=True)) for m,i in final_mols)
outfile.close()
As a final point of curiosity: how many of the selected molecules were active in the confirmatory data provided as part of the TDT package?
In [32]:
not_measured = 0
positives = 0
medium = 0
negatives = 0
for m,idx in final_mols:
if pEC50.has_key(idx):
if pEC50[idx] > 6.0: # 1uM
positives += 1
elif pEC50[idx] > 5.0: # 10 uM
medium += 1
else:
negatives += 1
else:
not_measured += 1
print "number of not measured compounds =", not_measured
print "EC50 < 1 uM:\t%i molecules" % positives
print "1 uM < EC50 < 10 uM:\t%i molecules" % medium
print "EC50 > 10 uM:\t%i molecules" % negatives
Using the actives and inactives from the phenotypic screen, we can train a machine-learning (ML) model. For the current purpose, we are interested in supervised learning methods. The documentation of scikit-learn describes the different supervised techniques available from this library. In this tutorial, we focus on three methods: random forest (RF), Naive Bayes (NB) and logistic regression (LR), which we investigated in a recent study on ligand-based virtual screening (VS) (J. Chem. Inf. Model. (2013), 53, 2829), and we will use the best (and most complementary) models for heterogeneous classifier fusion. To train the models, we test three different molecular fingerprints: atom pairs (AP), the RDKit fingerprint with a maximum path length = 5 (RDK5), and the circular fingerprint with radius = 2 (Morgan2) (for a description see J. Cheminf. (2013), 5, 26).
Note: We will use all the information from the primary screen to train the predictive model, i.e. we do not apply any property or problematic-substructure filters. First of all, some of the filters may be too "hard" and we would discard valuable information, and secondly the ML methods used here are rather robust against noise (i.e. false positives and false negatives).
As the best combination of ML method and molecular fingerprint was found to be highly target-dependent, we need to evaluate the performance of different combinations in order to select the best (and most complementary) models for heterogeneous classifier fusion (J. Chem. Inf. Model. (2013), 53, 2829).
For the retrospective evaluation, we will randomly select 10% of the data and leave this out for testing. In addition, to reduce artifacts due to the random selection of the testing data, we will repeat this 50 times and calculate the average performance. We chose 50 as a compromise between getting good statistics and computational cost.
We start by randomly selecting 10% of the molecules for testing, we repeat this 50 times. For this tutorial, we only need the indices of the molecules.
In [33]:
import random
# seed for random-number generator
random.seed(23)
# helper function
def selectAndWriteMols(outfile, mollist, num):
random.shuffle(mollist)
outfile.writelines("%i " % i for i in mollist[:num])
outfile.write("\n")
percentage = 0.1 # = 10%
num_test_actives = int(percentage*num_actives)
num_test_inactives = int(percentage*num_inactives)
print "number of test actives = %i" % num_test_actives
print "number of test inactives = %i" % num_test_inactives
# number of repetitions/test lists
num_lists = 50
# loop over the lists
outfile_act = gzip.open('test_lists/test_lists_\
10percent_actives.dat.gz', 'w')
outfile_inact = gzip.open('test_lists/test_lists_\
10percent_inactives.dat.gz', 'w')
for i in range(num_lists):
# actives
selectAndWriteMols(outfile_act, range(num_actives),
num_test_actives)
# inactives
selectAndWriteMols(outfile_inact, range(num_inactives),
num_test_inactives)
outfile_act.close()
outfile_inact.close()
As we discussed in our JCIM paper, when ranking molecules based on an ML model's predicted probability to be active, it is possible for multiple molecules to have the same probability. To break the tie, we make use of the maximum chemical similarity between the molecules and the training actives. The molecule which is more similar to the training actives gets ranked higher. To this end, we calculate for each molecule in the test sets the similarity to the actives in the training set and store the maximum value. This procedure is called MAX group fusion (QSAR Comb. Sci. (2006), 25, 1143).
There is a Python script called calculate\_similarity.py in the directory evaluation to perform this task, which takes the fingerprint name as argument (i.e. rdk5, ap or morgan2). We will discuss here only the important steps of the script, whereas the complete script can be executed separately in a shell or be sent to a cluster for each fingerprint.
Firstly, the actives and inactives are read and the specified fingerprint is calculated for each molecule (e.g. RDK5).
In [33]:
# actives
fps_act = []
mols_act = []
for line in gzip.open('data/training_actives_cleaned.dat.gz', 'r'):
line = line.strip().split()
# contains: [sample_id, hit, pec50, smiles]
m = Chem.MolFromSmiles(line[3])
fp = Chem.RDKFingerprint(m, maxPath=5) # max. path length = 5
fps_act.append(fp)
mols_act.append([line[0], m])
print "number of active fingerprints =", len(fps_act)
In [34]:
# inactives
fps_inact = []
mols_inact = []
for line in gzip.open('data/training_inactives_cleaned.dat.gz', 'r'):
line = line.strip().split()
# contains: [sample_id, hit, pec50, smiles]
m = Chem.MolFromSmiles(line[3])
fp = Chem.RDKFingerprint(m, maxPath=5) # max. path length = 5
fps_inact.append(fp)
mols_inact.append([line[0], m])
print "number of inactive fingerprints =", len(fps_inact)
The first list of indices for the test molecules is read in and the corresponding molecules are selected. In addition, we collect for each molecule the information if it belongs to the actives (1) or inactives (0). This information is required for the evaluation later on.
In [35]:
# indices of actives
inlist = gzip.open('test_lists/test_lists_\
10percent_actives.dat.gz', 'r').readline()
test_indices_act = set(int(l) for l in inlist.strip().split())
# indices of inactives
inlist = gzip.open('test_lists/test_lists_\
10percent_inactives.dat.gz', 'r').readline()
test_indices_inact = set(int(l) for l in inlist.strip().split())
print "number of test actives = %i" % len(test_indices_act)
print "number of test inactives = %i" % len(test_indices_inact)
# test molecules: first the actives than the inactives
test_fps = [fps_act[i] for i in test_indices_act] \
+ [fps_inact[i] for i in test_indices_inact]
test_info = [1]*num_test_actives + [0]*num_test_inactives
test_mols = [mols_act[i] for i in test_indices_act] \
+ [mols_inact[i] for i in test_indices_inact]
print "number of total test molecules = %i" % len(test_fps)
The remaining actives are our training actives.
In [36]:
train_indices_act = set(range(num_actives)).difference(test_indices_act)
train_fps = [fps_act[i] for i in train_indices_act]
print "number of training actives = %i" % len(train_fps)
Now, we loop over the test molecules and calculate the similarities to the training actives. We use again the Tanimoto coefficient to calculate the similarities. These are then sorted and the maximum value is stored (i.e. MAX group fusion).
In [37]:
from rdkit import DataStructs
scores = []
for fp, info in zip(test_fps, test_info):
similarities = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
# sort by descending similarity
similarities.sort(reverse=True)
scores.append([similarities[0], info])
We can also rank order the test molecules based on the maximum similarity to the training actives. This serves as a baseline for our evaluation of the ML models. Different methods exist for the evaluation of the virtual screening (VS) performance of a given method, i.e. how well the actives are separated from the inactives (for a discussion see J. Cheminf. (2013), 5, 26).
In [38]:
# sort by descending similarity
sorted_scores = sorted(scores, reverse=True)
The receiver operating characteristic (ROC) is defined as the true positives rate (TPR) over the false positives rate (FPR) and visualizes the ordering performance of a method. The ROC curve can be obtained using the corresponding RDKit function. The filled circle indicates the point at which 5% of the molecules have been picked.
In [39]:
from rdkit.ML.Scoring import Scoring
roc = Scoring.CalcROC(sorted_scores, -1)
# number of molecules in the first 5%
num_test_mols = int(0.05 * (num_test_actives + num_test_inactives))
# figure
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.axis([0.0, 1.0, 0.0, 1.0])
plt.plot(roc[0], roc[1], c='r', label='RDK5') # ROC curve
plt.scatter([roc[0][num_test_mols]], [roc[1][num_test_mols]],
c='r', lw=0, s=40) # 5% marker
plt.plot([0, 1], [0, 1], c='k', label='Random') # random distribution
plt.legend(loc='lower right')
plt.show()
The ordering performance over the whole set can be represented by the area under the ROC curve (AUC). The AUC gives values in the range [0, 1] where 0 corresponds to all actives at the bottom, 1 corresponds to all actives on top, and 0.5 indicates random distribution. However, in this context we are mainly interested in the ranking performance in the early part of the list, e.g. the top 5%, because only these molecules will be considered for subsequent confirmatory assays. This can be measured by a so-called "early-recognition" method such as the enrichment factor at 5% (EF5%). If the ratio of actives to inactives in the test set is below 5%, the maximum value of EF5% is 20, the minimum value is 0 and a value of 1 corresponds to random distribution (J. Chem. Inf. Model. (2007), 47, 488). This is the case here where the ratio is 0.5%. For a detailed discussion of the different evaluation methods and their differences see J. Cheminf. (2013), 5, 26.
Here, we calculate the AUC and EF5% using the RDKit implementations.
In [40]:
auc = Scoring.CalcAUC(sorted_scores, -1)
ef = Scoring.CalcEnrichment(sorted_scores, -1, [0.05])
print "AUC = %.2f, EF5%% = %.2f" % (auc, ef[0])
These results show that the actives and inactives can already be distiguished well by RDK5 similarity. By random distribution, we would have 7.6 of the 152 actives in the first 5% (i.e. 1476 molecules). Using RDK5 similarity, we have instead:
In [41]:
num_random_actives = 0.05*num_test_actives
actives_in_5percent = num_random_actives*ef[0]
print "number of actives in the first 5%% using RDK5 \
similarity = %.1f" % actives_in_5percent
Let's have a look at the similarity distribution between the training actives and the test actives.
In [42]:
sim_hist = numpy.histogram([s for s,i in scores if i == 1], bins=20)
# plot it
fig = plt.figure(1, figsize=(10, 5))
plt1 = plt.subplot(111)
plt.axis([0, 1, 0, 50])
plt.xlabel('RDK5 similarity')
plt.ylabel('Number of molecules')
plt1.bar(sim_hist[1][:-1], sim_hist[0], lw=0, width=0.02)
plt.show()
As the high enrichment factor indicated, many actives in the test set have a high similarity to at least one of the training actives.
One important property of a VS-method is its so-called scaffold-hopping ability (for a review see for example Future Med. Chem. (2011), 3, 405). This means how structurally diverse the top ranked actives are, because one is interested to know as many different active chemotypes as possible. Here, we can look at the top 20 actives and the bottom 20 actives together with the (maximum) similarities.
In [43]:
indices_act = [[s[0], i, s[1]] for i,s in enumerate(scores)]
indices_act.sort(reverse=True)
indices_act = [i for s,i,info in indices_act if info == 0]
# top 20 actives
Draw.MolsToGridImage([test_mols[i][1] for i in indices_act[:20]],
legends=[("s = %.3f" % scores[i][0]) for i in indices_act[:20]],
molsPerRow=4)
Out[43]:
In [44]:
# bottom 20 actives
Draw.MolsToGridImage([test_mols[i][1] for i in indices_act[-20:]],
legends=[("s = %.3f" % scores[i][0]) for i in indices_act[-20:]],
molsPerRow=4)
Out[44]: