Classification

Consider a binary classification problem. The data and target files are available online. The domain of the problem is chemoinformatics: the data describe the toxicity of roughly 4,000 small molecules. Building a predictive system involves three steps:

  1. data conversion: transform the instances into a suitable graph format. This is done with specialized programs for each (domain, format) pair. In this example the molecular graphs are encoded in the gSpan format, so we use the 'gspan' tool.

  2. data vectorization: transform the graphs into sparse vectors. This is done with the EDeN tool. The vectorizer takes as parameters the (maximal) size of the fragments used as features, expressed as the pair 'radius' and 'distance'. For details see: F. Costa, K. De Grave, "Fast Neighborhood Subgraph Pairwise Distance Kernel", 27th International Conference on Machine Learning (ICML), 2010.

  3. modelling: fit a predictive system and evaluate its performance. This is done with the tools offered by the scikit-learn library. In this example we use a Stochastic Gradient Descent linear classifier.

The following cells contain the code for each step.

Install the library
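EDeN is distributed via its GitHub repository; assuming pip is available, the install is along these lines (the exact URL and branch are assumptions — check the project page for the current location):

```shell
# Install EDeN from GitHub (URL assumed; verify against the project page
# for the current repository location and any pinned dependencies).
pip install git+https://github.com/fabriziocosta/EDeN.git
```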

1 Conversion

load a target file


In [4]:
from eden.util import load_target
y = load_target('http://www.bioinf.uni-freiburg.de/~costa/bursi.target')
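The target file is plain text with one numeric class label per line, aligned with the instance order in the data file. A stdlib sketch of the same idea — `read_target` and the throwaway file are illustrative, not part of EDeN; the real `load_target` also accepts URLs and returns a numpy array:

```python
# Minimal sketch of reading a one-label-per-line target file.
import os
import tempfile

def read_target(path):
    """Read one numeric class label per line into a list of ints."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

# tiny demonstration with a throwaway file
with tempfile.NamedTemporaryFile('w', suffix='.target', delete=False) as tmp:
    tmp.write('1\n-1\n1\n')
    path = tmp.name

labels = read_target(path)
os.unlink(path)
print(labels)  # [1, -1, 1]
```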

load data and convert it to graphs


In [9]:
from eden.converter.graph.gspan import gspan_to_eden
graphs = gspan_to_eden('http://www.bioinf.uni-freiburg.de/~costa/bursi.gspan')
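Each gSpan entry begins with a `t # <id>` line, followed by `v <id> <label>` vertex lines and `e <src> <dst> <label>` edge lines. A stdlib sketch of the parse, using plain dicts in place of the networkx graphs that `gspan_to_eden` produces (the sample molecule string is made up):

```python
def parse_gspan(text):
    """Parse gSpan-formatted text into a list of simple graph dicts."""
    graphs = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == 't':                      # 't # <graph-id>' starts a new graph
            graphs.append({'nodes': {}, 'edges': []})
        elif parts[0] == 'v':                    # 'v <node-id> <label>'
            graphs[-1]['nodes'][int(parts[1])] = parts[2]
        elif parts[0] == 'e':                    # 'e <src> <dst> <label>'
            graphs[-1]['edges'].append((int(parts[1]), int(parts[2]), parts[3]))
    return graphs

# a made-up two-atom molecule: C-O with a single bond
sample = """t # 0
v 0 C
v 1 O
e 0 1 1"""
gs = parse_gspan(sample)
print(gs[0]['nodes'])   # {0: 'C', 1: 'O'}
print(gs[0]['edges'])   # [(0, 1, '1')]
```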

2 Vectorization

set up the vectorizer


In [10]:
from eden.graph import Vectorizer
vectorizer = Vectorizer(r=2, d=0)
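Roughly, the NSPDK features behind `r` and `d` are pairs of rooted neighborhood subgraphs: for every vertex, take the subgraph induced by nodes within radius at most r, and pair it with a second neighborhood whose root lies at distance d (with d=0 the pair degenerates to a single neighborhood). A much-simplified stdlib sketch that only records, for each root and radius, the multiset of node labels within that radius — the real kernel canonicalizes whole induced subgraphs, which this does not:

```python
from collections import deque

def bfs_distances(adj, root):
    """Hop distances from root over an unweighted adjacency dict."""
    dist = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def neighborhood_features(nodes, adj, r):
    """For each root, one feature per radius 0..r: the sorted multiset of
    node labels within that radius (a stand-in for NSPDK's canonical
    neighborhood subgraph encoding)."""
    feats = set()
    for root in nodes:
        dist = bfs_distances(adj, root)
        for radius in range(r + 1):
            labels = sorted(nodes[v] for v, d in dist.items() if d <= radius)
            feats.add((radius, '|'.join(labels)))
    return feats

# toy path graph C-O-H
nodes = {0: 'C', 1: 'O', 2: 'H'}
adj = {0: [1], 1: [0, 2], 2: [1]}
feats = neighborhood_features(nodes, adj, r=2)
print(sorted(feats))
```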

extract features and build data matrix


In [11]:
%%time
X = vectorizer.transform( graphs )
print('Instances: %d Features: %d with an avg of %d features per instance' % (X.shape[0], X.shape[1], X.getnnz() // X.shape[0]))


Instances: 4337 Features: 1048577 with an avg of 21 features per instance
CPU times: user 21.3 s, sys: 86.5 ms, total: 21.3 s
Wall time: 23.3 s
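Note that 1048577 is not the number of distinct fragments observed but the fixed size of the hashed feature space: 2^20 + 1 columns, into which each fragment identifier is mapped. A sketch of the idea, assuming (as is standard for feature hashing) a hash taken modulo the space size — the key format and md5 choice here are illustrative, not EDeN's actual scheme:

```python
import hashlib

SPACE = 2 ** 20 + 1   # 1048577, matching the printed feature count

def feature_index(fragment_key):
    """Map an arbitrary fragment identifier to a column index by hashing.
    Deterministic, so the same fragment always lands in the same column."""
    digest = hashlib.md5(fragment_key.encode()).hexdigest()
    return int(digest, 16) % SPACE

# a made-up fragment identifier
idx = feature_index('radius=2|distance=0|C|O')
print(0 <= idx < SPACE)  # True
```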

3 Modelling

Induce a predictor and evaluate its performance


In [12]:
%%time
#induce a predictive model
from sklearn.linear_model import SGDClassifier
predictor = SGDClassifier(average=True, class_weight='balanced', shuffle=True, n_jobs=-1)

from sklearn.model_selection import cross_val_score
scores = cross_val_score(predictor, X, y, cv=10, scoring='roc_auc')

import numpy as np
print('AUC ROC: %.4f +- %.4f' % (np.mean(scores),np.std(scores)))


AUC ROC: 0.8864 +- 0.0139
CPU times: user 440 ms, sys: 95.6 ms, total: 536 ms
Wall time: 535 ms
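The ROC AUC reported above equals the probability that a uniformly chosen positive instance is scored above a uniformly chosen negative one, which the rank-sum (Mann-Whitney) statistic computes directly. A stdlib sketch, with ties sharing average ranks:

```python
def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic.
    y_true holds 1 for positives and -1 (or 0) for negatives."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # assign 1-based ranks, averaging over groups of tied scores
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = [r for r, t in zip(ranks, y_true) if t == 1]
    n_pos, n_neg = len(pos), len(y_true) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

print(roc_auc([1, 1, -1, -1], [0.9, 0.8, 0.3, 0.1]))  # 1.0
print(roc_auc([1, -1], [0.2, 0.7]))                   # 0.0
```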
