In [1]:
cd ~/datasets
In [2]:
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['savefig.dpi'] = 100
pd.options.display.max_columns = 10
pd.options.display.max_rows = 10
We will be working on a mutagenicity dataset released by Kazius et al. 4337 compounds, provided as the file mols.sdf, were subjected to the AMES test; the results are given in labels.csv. We will clean the molecules and perform a brief chemical space analysis, before finally assessing potential predictive models built on the data.
In [3]:
import skchem
import pandas as pd
We can use skchem.read_sdf to import the sdf file:
In [4]:
ms_raw = skchem.read_sdf('mols.sdf'); ms_raw
Out[4]:
And pandas to import the labels:
In [5]:
y_raw = pd.read_csv('labels.csv').set_index('name').squeeze(); y_raw
Out[5]:
Quickly check the class balance:
In [9]:
y_raw.value_counts().plot.bar()
Out[9]:
And binarize them:
In [8]:
import numpy as np
y_bin = (y_raw == 'mutagen').astype(np.uint8); y_bin
Out[8]:
The classes are (mercifully) quite balanced.
The first step is to apply a Transformer to canonicalize the representations. Specifically, we will use the ChemAxon Standardizer wrapper. Some compounds are likely to fail this procedure; however, they are likely to still be valid structures, so we will use the keep_failed configuration option on the object to keep these, rather than returning None or raising an error.
In [8]:
std = skchem.standardizers.ChemAxonStandardizer(keep_failed=True)
In [9]:
ms = std.transform(ms_raw); ms
Out[9]:
Next, we filter out compounds that fall outside the domain we want to model, using scikit-chem's filters: non-organic compounds, and compounds outside sensible mass and atom count ranges. Passing the labels alongside the molecules keeps the two aligned; we start from the raw string labels, y_raw:
In [10]:
of = skchem.filters.OrganicFilter()
ms, y = of.filter(ms, y_raw)
In [11]:
mf = skchem.filters.MassFilter(above=100, below=900)
ms, y = mf.filter(ms, y)
In [12]:
nf = skchem.filters.AtomNumberFilter(above=5, below=100, include_hydrogens=True)
ms, y = nf.filter(ms, y)
We would like to calculate some features that require three dimensional coordinates, so we will next generate three dimensional conformers using the Universal Force Field. Additionally, embedding may fail for some compounds; these should be dropped from the dataset. In order to do this, we will use the transform_filter method:
In [13]:
uff = skchem.forcefields.UFF()
ms, y = uff.transform_filter(ms, y)
In [14]:
len(ms)
Out[14]:
As we can see, we get a warning that 3 molecules failed to embed and have been dropped. If we didn't care about the warnings, we could have set the warn_on_fail property to False (or set it using a keyword argument at initialization). Conversely, if we really cared about failures, we could have set error_on_fail to True, which would raise an error if any Mols failed to embed.
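For illustration, here is a minimal sketch of the two failure-handling modes described above, assuming (as the text suggests) that both options can also be passed as keyword arguments at initialization:
# silence the embedding-failure warnings entirely (assumed keyword argument)
quiet_uff = skchem.forcefields.UFF(warn_on_fail=False)
# or fail loudly: raise an error if any molecule cannot be embedded (assumed keyword argument)
strict_uff = skchem.forcefields.UFF(error_on_fail=True)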
The labels are string categories; pandas' str.get_dummies method one-hot encodes them:
In [15]:
y_raw.str.get_dummies()
Out[15]:
We will use this to binarize the filtered labels, keeping the mutagen column:
In [16]:
y = y.str.get_dummies()['mutagen']
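As a standalone illustration (toy labels, plain pandas, not part of the dataset), str.get_dummies one-hot encodes each category into its own column, and selecting the mutagen column gives a binary vector:
import pandas as pd

labels = pd.Series(['mutagen', 'nonmutagen', 'mutagen'])
print(labels.str.get_dummies())
#    mutagen  nonmutagen
# 0        1           0
# 1        0           1
# 2        1           0
print(labels.str.get_dummies()['mutagen'].tolist())  # [1, 0, 1]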
Amongst other options, scikit-chem provides access to chemical space plotting functionality through the mol accessor. This featurizes the molecules using a passed featurizer (or a string shortcut), and applies a dimensionality reduction technique to reduce the feature space to two dimensions, which are then plotted. In this example, we use circular Morgan fingerprints, reduced by t-SNE, to visualize the structural diversity of the dataset.
In [17]:
ms.mol.visualize(fper='morgan',
                 dim_red='tsne', dim_red_kw={'method': 'exact'},
                 c=y, cmap='copper')
The data appears to be reasonably separable in structural space, so we may suspect that Morgan fingerprints will be a good representation for modelling the data.
In [18]:
mf = skchem.descriptors.MorganFeaturizer()
X, y = mf.transform(ms, y); X
Out[18]:
All of the above steps can be combined into a single Pipeline, which can then be applied to the raw data in one call:
In [35]:
pipeline = skchem.pipeline.Pipeline([
    skchem.standardizers.ChemAxonStandardizer(keep_failed=True),
    skchem.forcefields.UFF(),
    skchem.filters.OrganicFilter(),
    skchem.filters.MassFilter(above=100, below=1000),
    skchem.filters.AtomNumberFilter(above=5, below=100),
    skchem.descriptors.MorganFeaturizer()
])
X, y = pipeline.transform_filter(ms_raw, y_raw)
In this section, we will try building some basic scikit-learn models on the data.
To decide on the best model to use, we should perform some model selection. This will require comparing the relative performance of a selection of candidate models, each trained on the same training set and evaluated on a validation set.
In cheminformatics, partitioning datasets usually requires some thought, as chemical datasets often vastly overrepresent certain scaffolds and underrepresent others. In order to get as unbiased an estimate of performance as possible, one can either downsample compounds in regions of high density, or artificially favor splits that pool molecules that are very close in chemical space into the same split.
scikit-chem provides this functionality in the SimThresholdSplit class, which applies single-link hierarchical clustering to produce a large number of clusters consisting of highly similar compounds. These clusters are then randomly assigned to the desired splits, such that no split contains compounds that are more similar to compounds in any other split than the clustering threshold.
In [37]:
cv = skchem.cross_validation.SimThresholdSplit(fper=None, n_jobs=4).fit(X)
train, valid, test = cv.split((60, 20, 20))
X_train, X_valid, X_test = X[train], X[valid], X[test]
y_train, y_valid, y_test = y[train], y[valid], y[test]
In [21]:
import sklearn.ensemble
import sklearn.linear_model
import sklearn.naive_bayes
In [38]:
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=100)
nb = sklearn.naive_bayes.BernoulliNB()
lr = sklearn.linear_model.LogisticRegression()
In [39]:
X_train.shape, y_train.shape
Out[39]:
In [42]:
rf_score = rf.fit(X_train, y_train).score(X_valid, y_valid)
nb_score = nb.fit(X_train, y_train).score(X_valid, y_valid)
lr_score = lr.fit(X_train, y_train).score(X_valid, y_valid)
print(rf_score, nb_score, lr_score)
Random Forests appear to work best (although we should really have chosen the hyperparameters using random or grid search).
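As a sketch of what that could look like (hypothetical parameter grid, and using scikit-learn's GridSearchCV with ordinary k-fold cross-validation rather than the similarity-aware split above):
import sklearn.ensemble
import sklearn.model_selection

# small, illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10, 30]}
search = sklearn.model_selection.GridSearchCV(
    sklearn.ensemble.RandomForestClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)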
In [43]:
rf.fit(X_train.append(X_valid), y_train.append(y_valid)).score(X_test, y_test)
Out[43]: