Consider a binary classification problem from chemoinformatics: the data describes the toxicity of about 4000 small molecules, and both the data and the target files are available online. Building a predictive system happens in three steps:
data conversion: transform the instances into a suitable graph format. This is done with specialized programs for each (domain, format) pair. In this example the molecular graphs are encoded in the gSpan format, so we use the 'gspan' converter.
data vectorization: transform the graphs into sparse vectors. This is done with the EDeN tool. The vectorizer takes as parameters the maximal size of the fragments used as features, expressed as the pair 'radius' and 'distance'. For details see: F. Costa, K. De Grave, "Fast Neighborhood Subgraph Pairwise Distance Kernel", 27th International Conference on Machine Learning (ICML), 2010.
modelling: fit a predictive system and evaluate its performance. This is done with the tools offered by the scikit-learn library. In this example we use a Stochastic Gradient Descent linear classifier.
The following cells contain the code for each step.
Install the library
pip install git+https://github.com/fabriziocosta/EDeN.git --user
load a target file
In [1]:
from eden.util import load_target
y = load_target('http://www.bioinf.uni-freiburg.de/~costa/bursi.target')
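For reference, here is a minimal sketch of what this step produces, assuming the target file is plain text with one numeric class label per line (the use of requests and the exact file layout are assumptions, not part of EDeN's API):

import requests
import numpy as np

# download the raw target file and take the last token on each line as the label
# (assumption about the file layout; load_target does this parsing for us above)
raw = requests.get('http://www.bioinf.uni-freiburg.de/~costa/bursi.target').text
y_manual = np.array([int(line.split()[-1]) for line in raw.splitlines() if line.strip()])
print('Loaded %d labels with classes %s' % (len(y_manual), np.unique(y_manual)))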
load data and convert it to graphs
In [2]:
from eden.converter.graph.gspan import gspan_to_eden
graphs = gspan_to_eden('http://www.bioinf.uni-freiburg.de/~costa/bursi.gspan')
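The gSpan format mentioned above encodes each molecule as a block of vertex and edge lines; a small illustrative fragment (the atom and bond labels below are made up, not taken from the bursi file) looks like this:

t # 0
v 0 C
v 1 O
e 0 1 1

Here 't # id' starts a new graph, 'v id label' declares a labelled vertex and 'e src dst label' a labelled edge; gspan_to_eden turns each such block into a graph object that the vectorizer can consume.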
setup the vectorizer
In [3]:
from eden.graph import Vectorizer
vectorizer = Vectorizer(r=2, d=5)
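The parameter r is the maximal radius of the neighborhood subgraphs and d is the maximal distance between their roots; every pair of such subgraphs becomes one feature. As a sketch of the kind of input the vectorizer expects (assuming, as in EDeN's examples, networkx graphs with a 'label' attribute on nodes and edges; the toy labels are hypothetical):

import networkx as nx

# toy molecular graph: two labelled atoms joined by a labelled bond (hypothetical labels)
toy = nx.Graph()
toy.add_node(0, label='C')
toy.add_node(1, label='O')
toy.add_edge(0, 1, label='1')

X_toy = vectorizer.transform([toy])
print('Toy graph mapped to a sparse vector with %d non-zero features' % X_toy.getnnz())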
extract features and build data matrix
In [4]:
%%time
X = vectorizer.transform(graphs)
print('Instances: %d Features: %d with an avg of %d features per instance' % (X.shape[0], X.shape[1], X.getnnz() / X.shape[0]))
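X is a scipy sparse matrix with one row per molecule, so individual instances can be inspected with standard scipy.sparse calls; a quick sketch:

# sparse feature vector of the first molecule
row = X.getrow(0)
print('First instance: %d non-zero features out of %d dimensions' % (row.getnnz(), X.shape[1]))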
Induce a predictor and evaluate its performance
In [5]:
%%time
# induce a predictive model
from sklearn.linear_model import SGDClassifier
predictor = SGDClassifier(average=True, class_weight='balanced', shuffle=True, n_jobs=-1)
# estimate performance with 10-fold cross-validation using the ROC AUC score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(predictor, X, y, cv=10, scoring='roc_auc')
import numpy as np
print('AUC ROC: %.4f +- %.4f' % (np.mean(scores), np.std(scores)))
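Cross-validation only estimates the quality of the model; to obtain a predictor for new molecules, fit on the full data matrix and reuse the same vectorizer on newly converted graphs. A minimal sketch of standard scikit-learn usage, shown here on the training matrix itself:

# fit the final model on all available data
predictor.fit(X, y)

# predictions and decision margins (in practice these would be computed on new
# molecules converted and vectorized exactly as above)
predictions = predictor.predict(X)
margins = predictor.decision_function(X)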