It is possible to liken ageing to a disease. If this analogy is accepted, it can be developed further: there should be remedies for it. Such medicines are called geroprotectors. More rigorously, geroprotectors are drugs (or drug-like compounds) that slow down undesirable processes associated with ageing.
The species Caenorhabditis elegans (a nematode, or roundworm) is the subject of numerous experiments that measure the effect of various drugs on mean, median, and maximum lifespan. These experimental data are used below, and throughout this presentation the word "geroprotector" means a compound that has a positive impact on roundworms' lifespan.
The goal of the current study is to generate a list of compounds that are not marked as known geroprotectors, but are geroprotectors with high probability. In other words, the task is to extend the given list of geroprotectors by finding new ones.
The methodology is quite similar to PU learning (positive-vs-unlabeled learning). Unlike in binary classification, there are no initially given negative examples; there are only positive and unlabeled ones (in this case, verified geroprotectors and random compounds respectively). However, unlabeled examples are treated as if they were negative. This makes accuracy an irrelevant score, because labeling a nominally negative example as positive can be either a mistake or exactly the right decision. It is better to use ROC-AUC, since it depends only on how examples are ranked by their probability of belonging to the positive class. It is also sometimes possible to weight the classes so that false negatives are penalized more heavily than false positives. It has been shown that "in principle, the PU classification problem can be solved by a standard cost-sensitive classifier such as the weighted support vector machine" [2]. Finally, when the classifier is built and predictions are made, negative examples with a high estimated probability of a positive label are delivered as new positive examples.
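As a minimal sketch of that cost-sensitive approach (the toy data and the class weight of 10 below are illustrative assumptions, not values used in this study):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

np.random.seed(0)
X = np.random.rand(100, 5)  # toy feature matrix
# Toy labels: 1 marks "verified positives", 0 marks unlabeled examples treated as negatives.
y = (X[:, 0] + 0.3 * np.random.rand(100) > 0.8).astype(int)

# Errors on the positive class are penalized 10 times heavier than on the "negative" one.
weighted_svm = SVC(class_weight={0: 1, 1: 10}, probability=True).fit(X, y)
scores = weighted_svm.predict_proba(X)[:, 1]
print "ROC-AUC: {}".format(roc_auc_score(y, scores))
# "Negative" examples with the highest scores become candidate positives.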
The main limitation of the study is that only chemical features of compounds are involved. For example, some compounds are cures at low concentrations but become poisonous at high ones; however, concentration is not used here. Features derived from the interaction between a drug and an organism's cells are not used either. Perhaps this will be addressed in the future.
The file geroprotectors.csv is based on the Curated Database of Geroprotectors: http://geroprotectors.org/
The file zinc_smiles.csv is based on the ZINC Database: http://zinc.docking.org/browse/subsets/
For more details on the search for geroprotectors or on PU learning, please look at:
Moskalev A, Zhavoronkov A, et al. In search for geroprotectors: in silico screening and in vitro validation of signalome-level mimetics of young healthy state. Aging (Albany NY), 2016.
du Plessis M, et al. Analysis of learning from positive and unlabeled data. NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014.
The code is written in Python 2.7.
To successfully run the notebook, one needs libraries from the Python scientific stack such as numpy, pandas, matplotlib, seaborn, sklearn, and xgboost. All of them can be installed via pip. The recommended sklearn version is 0.18.1, because older versions do not have the model_selection module. To upgrade the sklearn installation, it is enough to run:
pip install scikit-learn --upgrade
In addition, the RDKit library is necessary for working with chemical data. To install it on Ubuntu, run:
sudo apt-get install python-rdkit librdkit1 rdkit-data
Alternatively, it can be installed via conda. Please look at the official instructions for more details and for information about installation on other operating systems.
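After installation, a quick sanity check (a suggestion, not part of the original notebook) is to print the versions of the key libraries:

import numpy, pandas, matplotlib, seaborn, sklearn, xgboost
for module in [numpy, pandas, matplotlib, seaborn, sklearn, xgboost]:
    print module.__name__, module.__version__
# RDKit exposes its version differently:
from rdkit import rdBase
print 'rdkit', rdBase.rdkitVersion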
In [1]:
import os
from collections import OrderedDict
from itertools import product
from functools import partial
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import roc_auc_score
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.externals import joblib
# Startup settings can not suppress a warning from `XGBClassifier` and so this is needed.
import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from xgboost import XGBClassifier
from rdkit.Chem import MolFromSmiles, rdMolDescriptors, Draw
from rdkit.DataStructs import ConvertToNumpyArray
In [2]:
np.random.seed(seed=361)
Of course, one can try any other random seed. The seeds tried during the preparation of this study are listed further below, after the univariate ROC-AUC cells.
In [3]:
df = pd.read_csv('external_data/geroprotectors.csv', sep='\t')
df.head()
Out[3]:
In [4]:
len(df.index)
Out[4]:
In [5]:
len(df['Name'].unique())
Out[5]:
In [6]:
len(df['SMILES'].unique())
Out[6]:
In [7]:
len(df[pd.notnull(df['Mean']) & pd.notnull(df['Median']) & pd.notnull(df['Max'])])
Out[7]:
In [8]:
len(df[pd.notnull(df['Mean']) & pd.notnull(df['Median'])])
Out[8]:
In [9]:
len(df[pd.notnull(df['Median']) & pd.notnull(df['Max'])])
Out[9]:
In [10]:
len(df[pd.notnull(df['Mean']) & pd.notnull(df['Max'])])
Out[10]:
Hence, there is not enough data to predict the values of one statistic from the values of another.
In [11]:
df.describe()
Out[11]:
In [12]:
def smiles_to_numpy(smiles, radius=2):
    """
    Takes Morgan fingerprints of each element
    of `smiles` and packs the results into an array.
    Morgan fingerprints are chosen, because they
    provide bit vectors of fixed length.
    @type smiles: pandas.Series
    @type radius: int
    @rtype: numpy.ndarray
    """
    molecules = map(MolFromSmiles, smiles.tolist())
    to_morgan = partial(rdMolDescriptors.GetMorganFingerprintAsBitVect,
                        radius=radius)
    fingerprints = map(to_morgan, molecules)
    # However, RDKit does not fit in the functional paradigm,
    # so explicit conversion to numpy arrays is needed.
    np_fingerprints = []
    for fingerprint in fingerprints:
        arr = np.zeros((1,))
        ConvertToNumpyArray(fingerprint, arr)
        np_fingerprints.append(arr)
    result = np.vstack(tuple(np_fingerprints))
    return result
In [13]:
geroprotectors = smiles_to_numpy(df['SMILES'])
geroprotectors.shape
Out[13]:
In [14]:
path_to_zinc = 'external_data/zinc_smiles.csv'
if os.path.isfile(path_to_zinc):
    zinc_smiles = pd.read_csv(path_to_zinc)
else:
    zinc_smiles = pd.read_csv("https://docs.google.com/spreadsheets/d/" +
                              "1MXYjjwiAmyHHvHPv7I7kNUsWrYhmoD2vitJwRcwncfQ/" +
                              "export?format=csv")
zinc_smiles.head()
Out[14]:
In [15]:
random_indices = np.random.choice(np.arange(len(zinc_smiles.index)), size=5000, replace=False)
random_indices
Out[15]:
In [16]:
random_smiles = zinc_smiles.iloc[random_indices, :]
random_smiles.head()
Out[16]:
In [17]:
len(random_smiles['SMILES'].unique())
Out[17]:
In [18]:
random_compounds = smiles_to_numpy(random_smiles['SMILES'])
random_compounds.shape
Out[18]:
In [19]:
unique_gp = np.vstack({tuple(row) for row in geroprotectors}).shape[0]
try:
    assert unique_gp == geroprotectors.shape[0]
except AssertionError:
    print "# of duplicated fingerprints for geroprotectors: {}.".format(
        geroprotectors.shape[0] - unique_gp)
else:
    print "There are no duplicated fingerprints for geroprotectors."
In [20]:
unique_rc = np.vstack({tuple(row) for row in random_compounds}).shape[0]
try:
    assert unique_rc == random_compounds.shape[0]
except AssertionError:
    print "# of duplicated fingerprints for random compounds: {}.".format(
        random_compounds.shape[0] - unique_rc)
else:
    print "There are no duplicated fingerprints for random compounds."
In [21]:
unique_all = np.vstack({tuple(row)
                        for row in np.vstack((geroprotectors, random_compounds))}).shape[0]
try:
    assert unique_all == unique_gp + unique_rc
except AssertionError:
    print "Length of intersection between two samples: {}.".format(
        unique_gp + unique_rc - unique_all)
else:
    print "There is no intersection between geroprotectors and random compounds."
Although random_compounds has duplicate fingerprints (e.g. due to hash collisions turning different SMILES into the same fingerprint), the intersection between randomly chosen compounds and known geroprotectors is empty.
In [22]:
positives = np.hstack((geroprotectors, np.ones((geroprotectors.shape[0], 1))))
negatives = np.hstack((random_compounds, np.zeros((random_compounds.shape[0], 1))))
sample = np.vstack((positives, negatives))
sample.shape
Out[22]:
In [23]:
border = positives.shape[0]
Below, the hypothesis that some individual bits of the fingerprints allow separating geroprotectors from random compounds is tested.
In [24]:
scores = {}
for i in range(sample.shape[1] - 1):
    scores[i] = roc_auc_score(sample[:, -1], sample[:, i])
In [25]:
n_to_print = 10
for position in sorted(scores, key=scores.get, reverse=True)[:n_to_print]:
    print "Bit position: {}, ROC-AUC: {}".format(position, scores[position])
In [26]:
for position in sorted(scores, key=scores.get)[:n_to_print]:
    print "Bit position: {}, ROC-AUC: {}".format(position, scores[position])
After inverting predictions (i.e. replacing a score s with 1 - s), ROC-AUC above 67% is reached by feature 1152 alone for all tested random seeds, which are as follows: 19, 26, 44, 123, 187, 361, 638, 1001, 3249, 128500.
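A quick illustrative check (not an original cell): flipping a binary feature turns a ROC-AUC of s into 1 - s, so a bit with univariate ROC-AUC of about 0.33 reaches about 0.67 after inversion.

# Inverting bit 1152 flips its univariate ROC-AUC around 0.5:
print roc_auc_score(sample[:, -1], 1 - sample[:, 1152])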
In [27]:
np.mean(positives[:, 1152])
Out[27]:
In [28]:
np.mean(negatives[:, 1152])
Out[28]:
Thus, there is no silver bullet amongst the features, but there are relationships between the target and the features.
In [29]:
fig = plt.figure(figsize=(10, 7.5))
ax = fig.add_subplot(111)
ax.set_title("Fingerprint bits distribution by importance")
ax.set_xlabel("Univariate ROC-AUC")
ax.set_ylabel("Number of features")
_ = ax.hist([scores[key] for key in scores.keys()], 50)
More than half of the features seem to be of little use. The next section surveys the effect of their removal, as well as some other issues.
In [30]:
def take_balanced_subsample(iterable, size):
    """
    From each element of `iterable` draws
    `size` objects (i.e. rows) and then
    concatenates the results.
    @type iterable: list-like(numpy.ndarray)
    @type size: int
    @rtype: numpy.ndarray
    """
    subsamples = []
    for arr in iterable:
        subsample = arr[np.random.choice(np.arange(arr.shape[0]), size=size, replace=False), :]
        subsamples.append(subsample)
    return np.vstack(tuple(subsamples))
In [31]:
def tanimoto_similarity(first, second):
    """
    Computes the Tanimoto similarity of two arrays.
    It is defined as the ratio of the number of bits
    active in both arrays to the number of bits
    active in at least one of the arrays.
    @type first: numpy.ndarray
    @type second: numpy.ndarray
    @rtype: float
    """
    bool_first = first.astype(bool)
    bool_second = second.astype(bool)
    intersection_length = np.sum(np.bitwise_and(bool_first, bool_second).astype(int))
    union_length = np.sum(np.bitwise_or(bool_first, bool_second).astype(int))
    return float(intersection_length) / union_length
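A tiny worked example (not an original cell): two fingerprints sharing one of three active bits have a similarity of 1/3. Note that RDKit also provides DataStructs.TanimotoSimilarity for its native bit vectors; the numpy-based version above works directly on the unpacked arrays.

# 1 bit active in both, 3 bits active in at least one -> 0.333...
print tanimoto_similarity(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))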
In [32]:
def fill_tanimoto_matrix(arr):
    """
    Assuming that each row of `arr` represents
    an object, fills the matrix of pairwise
    Tanimoto similarities.
    @type arr: numpy.ndarray
    @rtype: numpy.ndarray
    """
    size = arr.shape[0]
    tanimoto_matrix = np.zeros((size, size))
    for i, j in product(range(size), range(size)):
        tanimoto_matrix[i, j] = tanimoto_similarity(arr[i, :], arr[j, :])
    return tanimoto_matrix
In [33]:
halfsize = 10
subsample = take_balanced_subsample([positives, negatives], halfsize)[:, :-1]
tanimoto_matrix = fill_tanimoto_matrix(subsample)
In [34]:
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111)
ax.set_title("Tanimoto similarities with all features involved")
_ = sns.heatmap(tanimoto_matrix, ax=ax, annot=True, vmax=0.5)
In [35]:
relevant_features = [key for key in scores.keys() if abs(scores[key] - 0.5) > 0.01]
subspaced_subsample = subsample[:, relevant_features]
subspaced_tanimoto_matrix = fill_tanimoto_matrix(subspaced_subsample)
In [36]:
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111)
ax.set_title("Tanimoto similarities after weak features are excluded")
_ = sns.heatmap(subspaced_tanimoto_matrix, ax=ax, annot=True, vmax=0.5)
It can be seen that the removal of weak features makes compounds more similar to each other.
It is also clear from the heatmaps that known geroprotectors are dissimilar to each other to approximately the same extent as they are dissimilar to random compounds.
A trick applied here is that, instead of dropping irrelevant features, all features are weighted according to their relevance. With binary features, PCA would otherwise treat all of them as equally important; the suggested weighting solves this issue.
In [37]:
weights = np.array([abs(scores[key] - 0.5) for key in scores.keys()])
weighted_positives = positives[:, :-1] * weights
weighted_negatives = negatives[:, :-1] * weights
weighted_sample = sample[:, :-1] * weights
In [38]:
pca = PCA(n_components=10, random_state=361)
pca.fit(weighted_sample)
pca.explained_variance_ratio_
Out[38]:
In [39]:
pca = PCA(n_components=1, random_state=361)
roc_auc_score(sample[:, -1], pca.fit_transform(weighted_sample)[:, 0])
Out[39]:
In [40]:
pca = PCA(n_components=3, random_state=361)
pca.fit(weighted_sample)
pcaed_positives = pca.transform(weighted_positives)
pcaed_negatives = pca.transform(weighted_negatives)
In [41]:
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111)
ax.set_title("Projection on the first two principal components")
ax.set_xlabel("1st")
ax.set_ylabel("2nd")
ax.scatter(pcaed_negatives[:, 0], pcaed_negatives[:, 1])
_ = ax.scatter(pcaed_positives[:, 0], pcaed_positives[:, 1], c='Red', s=50)
In [42]:
fig = plt.figure(figsize=(15, 7.5))
ax_one = fig.add_subplot(121)
ax_one.set_title("KDE for geroprotectors")
ax_one.set_xlabel("1st")
ax_one.set_ylabel("2nd")
sns.kdeplot(pcaed_positives[:, 0], pcaed_positives[:, 1],
cmap='Reds', shade=True, shade_lowest=False, ax=ax_one)
ax_two = fig.add_subplot(122, sharex=ax_one, sharey=ax_one)
ax_two.set_title("KDE for random compounds")
ax_two.set_xlabel("1st")
ax_two.set_ylabel("2nd")
_ = sns.kdeplot(pcaed_negatives[:, 0], pcaed_negatives[:, 1],
cmap='Blues', shade=True, shade_lowest=False, ax=ax_two)
The above graph can be used for extending the list of geroprotectors. There is a hypothesis that a solid share of the random compounds from the bottom right cluster (i.e. the one centered near (0.13, -0.07)) are geroprotectors. This hypothesis seems quite plausible.
In [43]:
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111)
ax.set_title("Projection on the first and the third principal components")
ax.set_xlabel("1st")
ax.set_ylabel("3rd")
ax.scatter(pcaed_negatives[:, 0], pcaed_negatives[:, 2])
_ = ax.scatter(pcaed_positives[:, 0], pcaed_positives[:, 2], c='Red', s=50)
In [44]:
fig = plt.figure(figsize=(15, 7.5))
ax_one = fig.add_subplot(121)
ax_one.set_title("KDE for geroprotectors")
ax_one.set_xlabel("1st")
ax_one.set_ylabel("3rd")
sns.kdeplot(pcaed_positives[:, 0], pcaed_positives[:, 2],
cmap='Reds', shade=True, shade_lowest=False, ax=ax_one)
ax_two = fig.add_subplot(122, sharex=ax_one, sharey=ax_one)
ax_two.set_title("KDE for random compounds")
ax_two.set_xlabel("1st")
ax_two.set_ylabel("3rd")
_ = sns.kdeplot(pcaed_negatives[:, 0], pcaed_negatives[:, 2],
cmap='Blues', shade=True, shade_lowest=False, ax=ax_two)
In [45]:
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111)
ax.set_title("Projection on the second and the third principal components")
ax.set_xlabel("2nd")
ax.set_ylabel("3rd")
ax.scatter(pcaed_negatives[:, 1], pcaed_negatives[:, 2])
_ = ax.scatter(pcaed_positives[:, 1], pcaed_positives[:, 2], c='Red', s=50)
In [46]:
fig = plt.figure(figsize=(15, 7.5))
ax_one = fig.add_subplot(121)
ax_one.set_title("KDE for geroprotectors")
ax_one.set_xlabel("2nd")
ax_one.set_ylabel("3rd")
sns.kdeplot(pcaed_positives[:, 1], pcaed_positives[:, 2],
cmap='Reds', shade=True, shade_lowest=False, ax=ax_one)
ax_two = fig.add_subplot(122, sharex=ax_one, sharey=ax_one)
ax_two.set_title("KDE for random compounds")
ax_two.set_xlabel("2nd")
ax_two.set_ylabel("3rd")
_ = sns.kdeplot(pcaed_negatives[:, 1], pcaed_negatives[:, 2],
cmap='Blues', shade=True, shade_lowest=False, ax=ax_two)
In [47]:
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111, projection='3d')
ax.set_title("Projection on the first three principal components")
ax.set_xlabel("1st")
ax.set_ylabel("2nd")
ax.set_zlabel("3rd")
ax.scatter(pcaed_negatives[:, 0], pcaed_negatives[:, 1], pcaed_negatives[:, 2])
_ = ax.scatter(pcaed_positives[:, 0], pcaed_positives[:, 1], pcaed_positives[:, 2],
c='Red', s=50)
The conclusion is that there are clusters with a high concentration of known geroprotectors and a low concentration of unlabeled compounds. Unlabeled compounds from such clusters are potential geroprotectors.
In [48]:
X = sample[:, :-1]
y = sample[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=361)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[48]:
In [49]:
np.sum(y_train) / y_train.shape[0], np.sum(y_test) / y_test.shape[0]
Out[49]:
When generating the PCA-based features, y_test can not be used at all; X_test can be transformed, but it is better to avoid fitting PCA to it.
In [50]:
scores = {}
for i in range(X_train.shape[1]):
    scores[i] = roc_auc_score(y_train, X_train[:, i])
weights = np.array([abs(scores[key] - 0.5) for key in scores.keys()])
weighted_X_train = X_train * weights
weighted_X_test = X_test * weights
In [51]:
pca = PCA(n_components=50, random_state=361)
pca.fit(weighted_X_train)
evr = pca.explained_variance_ratio_
In [52]:
fig = plt.figure(figsize=(10, 7.5))
ax = fig.add_subplot(111)
ax.set_title("Dependence of explained variance ratio on number of principal components")
ax.set_xlabel("Number of principal components")
ax.set_ylabel("Explained variance ratio")
_ = ax.plot(evr.cumsum())
Thus, the most informative components are the first three. However, they do not explain that much of the variance, so it is better to include the first 20 components.
In [53]:
pca = PCA(n_components=20, random_state=361)
pcaed_X_train = pca.fit_transform(weighted_X_train)
pcaed_X_test = pca.transform(weighted_X_test)
In [54]:
for i in range(3):
    raw_train = roc_auc_score(y_train, pcaed_X_train[:, i])
    raw_test = roc_auc_score(y_test, pcaed_X_test[:, i])
    print "Univariate ROC-AUC for component {}; train: {}, test: {}".format(
        i, max(raw_train, 1 - raw_train), max(raw_test, 1 - raw_test))
These univariate benchmarks look quite decent, though not perfect. In any case, this is enough to include the corresponding features.
Now, the less relevant initial features are filtered out with Pearson's $\chi^2$-test. It is appropriate here, because the features are Boolean.
How should the number of features to select be chosen? There is a heuristic that it is desirable to have at least 30 observations per independent variable. Given 20 principal components and approximately 5,100 examples in total, an estimate of the number of features is: $$(0.8 \times 0.75 \times 5100) \,/\, 30 - 20 \approx 80,$$ where the factors in the numerator adjust for train-test splitting and cross-validation.
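Spelled out as a quick sketch (the sample size is rounded; this is illustrative arithmetic, not an original cell):

n_total = 5100        # approximate size of the whole sample
train_share = 0.75    # train_test_split keeps 75% of examples for training by default
cv_share = 0.8        # each fold of 5-fold cross-validation trains on 80% of the training set
per_feature = 30      # desired number of observations per independent variable
n_pca = 20            # principal components that are already included
print train_share * cv_share * n_total / per_feature - n_pca  # 82.0, rounded down to 80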
In [55]:
feature_selector = SelectKBest(chi2, k=80)
subspaced_X_train = feature_selector.fit_transform(X_train, y_train)
subspaced_X_test = feature_selector.transform(X_test)
In [56]:
fs_and_pca_X_train = np.hstack((subspaced_X_train, pcaed_X_train))
fs_and_pca_X_test = np.hstack((subspaced_X_test, pcaed_X_test))
fs_and_pca_X_train.shape, fs_and_pca_X_test.shape, y_train.shape, y_test.shape
Out[56]:
In [57]:
all_and_pca_X_train = np.hstack((X_train, pcaed_X_train))
all_and_pca_X_test = np.hstack((X_test, pcaed_X_test))
all_and_pca_X_train.shape, all_and_pca_X_test.shape, y_train.shape, y_test.shape
Out[57]:
Actually, all_and_pca_X_train and all_and_pca_X_test are not used further in this presentation, because the PCA-based features are derived from the initial ones by a linear transformation, so there is multicollinearity, which leads to overfitting.
In [58]:
class ModelsTuner(object):
    """
    A convenience tool for repetitive
    pieces of code.
    Selects not only optimal hyperparameters,
    but also the optimal feature subspace, where
    all candidate subspaces are values of
    `X_train_dict`.
    """

    def __init__(self, X_train_dict, X_test_dict, y_train, y_test):
        """
        @type X_train_dict: dict(str -> numpy.ndarray)
        @type X_test_dict: dict(str -> numpy.ndarray)
        @type y_train: numpy.ndarray
        @type y_test: numpy.ndarray
        """
        try:
            assert X_train_dict.keys() == X_test_dict.keys()
        except AssertionError:
            raise ValueError("Mismatching dictionaries.")
        self.X_train_dict = X_train_dict
        self.X_test_dict = X_test_dict
        self.y_train = y_train
        self.y_test = y_test
        self.best_subspace_key_ = None
        self.best_params_ = None

    def tune(self, clf, grid_params, kf, scoring, verbose=True):
        """
        Tunes `clf` on the grid specified by `grid_params`.
        For evaluation, `kf` and `scoring` are used.
        @type clf: any sklearn classifier
        @type grid_params: dict
        @type kf: any sklearn folds
        @type scoring: str
        @type verbose: bool
        """
        grid_search_cv = GridSearchCV(clf, grid_params, cv=kf, scoring=scoring, refit=False)
        subspaces_scores = {}
        subspaces_best_params = {}
        for key, sample in self.X_train_dict.items():
            # Note that `self.y_train` is used, not a global variable.
            grid_search_cv.fit(sample, self.y_train)
            subspaces_scores[key] = grid_search_cv.best_score_
            subspaces_best_params[key] = grid_search_cv.best_params_
            if verbose:
                print "-*- Results for {}".format(key)
                print "Best CV mean score: {}".format(grid_search_cv.best_score_)
                means = grid_search_cv.cv_results_['mean_test_score']
                stds = grid_search_cv.cv_results_['std_test_score']
                print "Detailed results:"
                for mean, std, params in zip(means, stds,
                                             grid_search_cv.cv_results_['params']):
                    print("%0.3f (+/-%0.03f) for %r" % (mean, 2 * std, params))
                print ''
        self.best_subspace_key_ = max(subspaces_scores, key=subspaces_scores.get)
        self.best_params_ = subspaces_best_params[self.best_subspace_key_]
        if isinstance(clf, SVC):
            # ROC-AUC on the test set requires probability estimates.
            self.best_params_['probability'] = True
        clf = clone(clf).set_params(**self.best_params_).fit(
            self.X_train_dict[self.best_subspace_key_], self.y_train)
        if verbose:
            print "-*- Best feature subspace is {}.".format(self.best_subspace_key_)
            y_hat = clf.predict_proba(self.X_test_dict[self.best_subspace_key_])[:, 1]
            print "-*- Test set score: {}".format(roc_auc_score(self.y_test, y_hat))
        return clf
What scores can be achieved by some popular ensemble method (say, XGBoost with 300 trees) trained on the initial binary features?
In [59]:
clf = XGBClassifier(seed=361)
grid_params = {'n_estimators': [300]}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=361)
In [60]:
X_train_dict = OrderedDict([('2048f', X_train)])
X_test_dict = OrderedDict([('2048f', X_test)])
In [61]:
tuner = ModelsTuner(X_train_dict, X_test_dict, y_train, y_test)
clf = tuner.tune(clf, grid_params, kf, 'roc_auc')
This is a relatively high score. Is it possible to beat it?
In [62]:
clf = KNeighborsClassifier()
grid_params = {'n_neighbors': [20, 30, 40, 50],
'weights': ['uniform', 'distance'],
'p': [1, 2]}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=361)
Because the K-Nearest Neighbors classifier is vulnerable to the curse of dimensionality, only a few features are used.
In [63]:
X_train_dict = OrderedDict([('{}pc'.format(i), pcaed_X_train[:, :i]) for i in range(1, 11)])
X_test_dict = OrderedDict([('{}pc'.format(i), pcaed_X_test[:, :i]) for i in range(1, 11)])
In [64]:
tuner = ModelsTuner(X_train_dict, X_test_dict, y_train, y_test)
clf = tuner.tune(clf, grid_params, kf, 'roc_auc')
In [65]:
clf = SVC(random_state=361)
grid_params = {'C': [0.5, 1, 2],
'kernel': ['rbf', 'poly', 'sigmoid'],
'class_weight': [None, 'balanced']}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=361)
For SVM there are two options: a small-to-moderate number of principal components, or all chosen features (both those selected from the initial set and the engineered ones). There is no option with absolutely all features, because that is too time-consuming (by default, SVC does not predict probabilities, and intensive computations are required to do so); a cheaper alternative for plain evaluation is sketched below.
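As a side note (an addition, not an original cell): ROC-AUC needs only a ranking, so for plain evaluation the decision_function of SVC can replace the costly probability=True fit; within ModelsTuner, however, predict_proba is required.

# Margins rank objects just like probabilities would, but are much cheaper to obtain.
svm = SVC(random_state=361).fit(pcaed_X_train[:, :7], y_train)
print roc_auc_score(y_test, svm.decision_function(pcaed_X_test[:, :7]))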
In [66]:
X_train_dict = OrderedDict([('7pc', pcaed_X_train[:, :7]),
('80f_20pc', fs_and_pca_X_train)])
X_test_dict = OrderedDict([('7pc', pcaed_X_test[:, :7]),
('80f_20pc', fs_and_pca_X_test)])
In [67]:
tuner = ModelsTuner(X_train_dict, X_test_dict, y_train, y_test)
clf = tuner.tune(clf, grid_params, kf, 'roc_auc')
The grid is rather coarse than fine, because both the sklearn and the xgboost implementations run slowly on a regular laptop.
In [68]:
clf = XGBClassifier(seed=361)
grid_params = {'n_estimators': [200, 300],
'max_depth': [3, 5],
'subsample': [0.8, 1],
'scale_pos_weight': [1, 5, 35]}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=361)
In [69]:
X_train_dict = OrderedDict([('20pc', pcaed_X_train),
('80f', subspaced_X_train),
('80f_20pc', fs_and_pca_X_train),
('2048f', X_train)])
X_test_dict = OrderedDict([('20pc', pcaed_X_test),
('80f', subspaced_X_test),
('80f_20pc', fs_and_pca_X_test),
('2048f', X_test)])
In [70]:
tuner = ModelsTuner(X_train_dict, X_test_dict, y_train, y_test)
clf = tuner.tune(clf, grid_params, kf, 'roc_auc')
The model trained on all features has higher cross-validation scores than the model trained on the selected subspace, but it takes much more time to fit. It is also not as transparent as, say, K-Nearest Neighbors run on three principal components.
In [71]:
y_train_gb = clf.predict_proba(X_train_dict[tuner.best_subspace_key_])[:, 1]
y_test_gb = clf.predict_proba(X_test_dict[tuner.best_subspace_key_])[:, 1]
In [72]:
sorted([(k, v) for k, v in clf.booster().get_score(importance_type='weight').items()],
key=lambda t: t[1], reverse=True)[:15]
Out[72]:
In [73]:
sorted([(k, round(v, 2)) for k, v in clf.booster().get_score(importance_type='gain').items()],
key=lambda t: t[1], reverse=True)[:15]
Out[73]:
The fact that the two lists of the most important features are totally different helps explain why features with poor univariate ROC-AUC can still contribute to the overall model. For example, they can provide splits that, in turn, enable splits that substantially reduce impurity. More generally, they can play their role only in interaction with other features.
One way or another, features with the highest univariate ROC-AUC, such as 1152, 378, and 1057, are at the top by number of splits, but not at the top by heterogeneity reduction. Feature 1380 is the most important by gain, and it has a high univariate ROC-AUC (although it is not in the top-10 features by that measure). At the same time, the majority of features with the highest univariate ROC-AUC have neither top-level weight importance nor top-level gain importance.
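One way to quantify how different the two rankings are is a rank correlation over the features present in both lists (an illustrative check, not an original cell; scipy is available because sklearn depends on it):

from scipy.stats import spearmanr
weight_scores = clf.booster().get_score(importance_type='weight')
gain_scores = clf.booster().get_score(importance_type='gain')
common_features = sorted(set(weight_scores) & set(gain_scores))
rho, _ = spearmanr([weight_scores[f] for f in common_features],
                   [gain_scores[f] for f in common_features])
print "Spearman correlation between weight and gain rankings: {}".format(rho)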
In [74]:
clf = ExtraTreesClassifier(random_state=361)
grid_params = {'n_estimators': [100, 200, 300],
'max_depth': [10, None],
'class_weight': [None, {0: 1, 1: 5}, 'balanced']}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=361)
The same feature subspaces are used for all ensemble methods.
In [75]:
X_train_dict = OrderedDict([('20pc', pcaed_X_train),
('80f', subspaced_X_train),
('80f_20pc', fs_and_pca_X_train),
('2048f', X_train)])
X_test_dict = OrderedDict([('20pc', pcaed_X_test),
('80f', subspaced_X_test),
('80f_20pc', fs_and_pca_X_test),
('2048f', X_test)])
In [76]:
tuner = ModelsTuner(X_train_dict, X_test_dict, y_train, y_test)
clf = tuner.tune(clf, grid_params, kf, 'roc_auc')
In [77]:
y_train_et = clf.predict_proba(X_train_dict[tuner.best_subspace_key_])[:, 1]
y_test_et = clf.predict_proba(X_test_dict[tuner.best_subspace_key_])[:, 1]
In [78]:
sorted(zip(['f{}'.format(i) for i in range(2048)],  # 0-based names, consistent with XGBoost's
           clf.feature_importances_.tolist()),
       key=lambda t: t[1], reverse=True)[:10]
Out[78]:
The feature importances look different from those of Gradient Boosting. Therefore, it might be a good idea to use stacking on top of the predictions made by these two methods; a minimal sketch follows.
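A minimal blending sketch on top of the two prediction vectors computed above (the logistic regression meta-model is an assumption of this sketch, not part of the original pipeline; note also that proper stacking would train the meta-model on out-of-fold predictions, while fitting on in-sample predictions, as below, gives an optimistic estimate):

from sklearn.linear_model import LogisticRegression

meta_X_train = np.column_stack((y_train_gb, y_train_et))  # predictions as meta-features
meta_X_test = np.column_stack((y_test_gb, y_test_et))
meta_clf = LogisticRegression().fit(meta_X_train, y_train)
blended = meta_clf.predict_proba(meta_X_test)[:, 1]
print "Blended test ROC-AUC: {}".format(roc_auc_score(y_test, blended))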
In [79]:
clf = RandomForestClassifier(random_state=361)
grid_params = {'n_estimators': [100, 200, 300],
'max_depth': [10, None],
'class_weight': [None, {0: 1, 1: 5}, 'balanced']}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=361)
In [80]:
X_train_dict = OrderedDict([('20pc', pcaed_X_train),
('80f', subspaced_X_train),
('80f_20pc', fs_and_pca_X_train),
('2048f', X_train)])
X_test_dict = OrderedDict([('20pc', pcaed_X_test),
('80f', subspaced_X_test),
('80f_20pc', fs_and_pca_X_test),
('2048f', X_test)])
In [81]:
tuner = ModelsTuner(X_train_dict, X_test_dict, y_train, y_test)
clf = tuner.tune(clf, grid_params, kf, 'roc_auc')
In [82]:
y_train_rf = clf.predict_proba(X_train_dict[tuner.best_subspace_key_])[:, 1]
y_test_rf = clf.predict_proba(X_test_dict[tuner.best_subspace_key_])[:, 1]
In [83]:
sorted(zip(['f{}'.format(i) for i in range(2048)],  # 0-based names, consistent with XGBoost's
           clf.feature_importances_.tolist()),
       key=lambda t: t[1], reverse=True)[:10]
Out[83]:
There is a tie between the cross-validation scores of Extra Trees and Random Forest: the difference is not significant (0.946 vs 0.94 respectively). Either method could be declared the winner. However, let us decide that Random Forest wins.
In [84]:
_ = joblib.dump(clf, 'geroprotectors_classifier.pkl')
Having a method that wins during cross-validation and demonstrates a solid result on the test set, we can fit it to all available data and find the unlabeled objects with the highest estimated probability of belonging to the positive class. Refitting to the whole dataset can improve performance, because more training data is better than less (provided, of course, that the additional data are appropriate).
In [85]:
X_total = sample[:, :-1]
X_total.shape, y.shape
Out[85]:
In [86]:
selected_params = tuner.best_params_
selected_params['random_state'] = 361
clf = RandomForestClassifier(**selected_params)
clf
Out[86]:
In [87]:
clf.fit(X_total, y)
y_hat = clf.predict_proba(X_total)[:, 1]
roc_auc_score(y, y_hat)
Out[87]:
The above value means that Random Forest with the passed parameters is capable of memorizing the training data. Anyway, we already know from the test set that it is able to generalize well.
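When the whole dataset is used for fitting, the out-of-bag estimate can serve as a cheap additional sanity check of generalization (an illustrative addition, not an original cell; with hundreds of trees, every example is out-of-bag for some of them):

oob_clf = RandomForestClassifier(oob_score=True, **selected_params).fit(X_total, y)
print "Out-of-bag ROC-AUC: {}".format(roc_auc_score(y, oob_clf.oob_decision_function_[:, 1]))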
In [88]:
min(y_hat[:border])
Out[88]:
In [89]:
max(y_hat[border:])
Out[89]:
A naive interpretation is that no unlabeled compound has a probability of being a geroprotector higher than 0.3. Of course, this is wrong. Random Forest merely estimates the share of trees that classify an object as a positive example. The figures reported as probabilities are confidences in belonging to the positive class rather than quantifications of real-world uncertainty. Also take into consideration that treating all unlabeled examples as negative ones can decrease the estimated probabilities too. One way or another, the only thing that matters is that the higher the estimated probability is, the higher the estimated chances of being a geroprotector are.
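To make the "share of trees" reading concrete: in sklearn, predict_proba of a Random Forest is the average of the per-tree leaf class frequencies (a quick check, not an original cell):

per_tree_average = np.mean([tree.predict_proba(X_total)[:, 1] for tree in clf.estimators_], axis=0)
print np.allclose(per_tree_average, y_hat)  # True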
To see that a low estimated probability is not evidence that a compound is probably not a geroprotector, let us look at the test set. It contains known geroprotectors, but the highest estimated probabilities are as follows.
In [90]:
np.sort(y_test_rf)[-int(np.sum(y_test)):][::-1]
Out[90]:
There are int(np.sum(y_test)) known geroprotectors in it, but some of the highest estimated probabilities are only about 20%. Moreover, since ROC-AUC on the test set is less than 1, there are geroprotectors with even lower estimated probabilities.
So compounds that often end up in leaves with a high concentration of known geroprotectors are similar to known geroprotectors and may be geroprotectors too.
In [91]:
fig = plt.figure(figsize=(10, 7.5))
ax = fig.add_subplot(111)
ax.set_title("Distribution of compounds by estimated probability of being a geroprotector")
ax.set_xlabel("Estimated probability")
ax.set_ylabel("Number of compounds")
_ = ax.hist(y_hat, 100)
Let us find the three unlabeled examples with the highest estimated probabilities of being a geroprotector.
In [92]:
n_examples = 3
indices = y_hat[border:].argsort()[-n_examples:][::-1].tolist()
indices = [x + border for x in indices]
indices
Out[92]:
In [93]:
found_candidates = random_smiles.iloc[indices, 0].values.tolist()
found_candidates
Out[93]:
One can search for these SMILES strings using sources such as ChemSpider: http://www.chemspider.com/. Do not forget to remove the Python single quotes from the SMILES.
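A convenience, not an original cell: print the candidates one per line, without quotes, ready to paste into a search field.

for candidate in found_candidates:
    print candidate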
In [94]:
for candidate in found_candidates:
    molecule = MolFromSmiles(candidate)
    Draw.MolToMPL(molecule, size=(250, 250))
Data analysis is a promising way to create long-lists of compounds that possess some desired properties. Candidates from such lists can then be tested experimentally. If the lists contain predominantly high-quality candidates, the costs of searching for new compounds are seriously reduced.
In this presentation, machine learning techniques are applied in order to find potential geroprotectors, and, finally, a set of three candidates is demonstrated. It is worth mentioning that the quality scores of the algorithms are high, and this is evidence for the existence of a general chemical or biological rule that describes geroprotectors. Although ensemble methods are black boxes, domain specialists can start from the cluster visualizations provided during the exploratory analysis. Dimensionality reduction makes the data more transparent at the cost of a slight decrease in quality scores.
Further improvements can be achieved, in particular, by introducing new features. For example, it is possible to enrich the existing dataset using other sources (e.g. sources on gene expression).
Another way to improve the study lies in revising its methodology. The current methodology can detect unknown geroprotectors that are close enough to known geroprotectors, but it can not detect groups of unknown geroprotectors that are dissimilar to any known geroprotector. In other words, the current methodology is able to find regularities and rules in the distribution of geroprotectors, but it can not extrapolate beyond them in order to cover totally unknown groups of geroprotectors.
Finally, techniques such as stacking or blending can be applied in order to combine predictions made by several algorithms, as sketched above after the Extra Trees importances.