Annotation

Consider a binary classification problem. We fit a predictor and use it to assign a weight score to each node of each instance; this operation is called "annotation". For illustration, we display a few annotated graphs. We then check whether fitting a predictor on the annotated instances improves predictive performance.

Load the data and convert it to graphs



In [1]:

    
from eden.converter.graph.gspan import gspan_to_eden
from eden.util import random_bipartition_iter

# input files in gSpan format: one for the positive and one for the negative class
pos = 'bursi.pos.gspan'
neg = 'bursi.neg.gspan'

# convert each file to an iterable of graphs
iterable_pos = gspan_to_eden(pos)
iterable_neg = gspan_to_eden(neg)

# split each class into a train part and a test part
train_test_split = 0.9
iterable_pos_train, iterable_pos_test = random_bipartition_iter(iterable_pos, relative_size=train_test_split)
iterable_neg_train, iterable_neg_test = random_bipartition_iter(iterable_neg, relative_size=train_test_split)
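
The split above works on iterables rather than on lists. As a rough illustration of how a random bipartition of a stream could be done, here is a minimal sketch in plain Python. The function name `bipartition_sketch` and its `seed` parameter are hypothetical, and this is not EDeN's implementation of `random_bipartition_iter`; it only assumes that each item goes to the first part with probability `relative_size`.

```python
import random
from itertools import tee


def bipartition_sketch(iterable, relative_size=0.5, seed=1):
    """Split an iterable into two lazy, disjoint parts (illustrative only).

    Each item is sent to the first part with probability `relative_size`
    and to the second part otherwise.
    """
    rng = random.Random(seed)
    # Attach one random decision to every item. tee caches the pairs, so
    # both output iterators see the same decision for the same item.
    flagged = ((item, rng.random() < relative_size) for item in iterable)
    left, right = tee(flagged)
    first = (item for item, keep in left if keep)
    second = (item for item, keep in right if not keep)
    return first, second


train, test = bipartition_sketch(range(1000), relative_size=0.9)
train, test = list(train), list(test)
```

With this scheme the part sizes are only approximately 90% and 10%, because every item is assigned independently.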

Set up the vectorizer



In [12]:

    
from eden.graph import Vectorizer
vectorizer = Vectorizer( complexity=2 )



In [13]:

    
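%%time
from itertools import tee
from eden.util import fit, estimate

# each iterable can be consumed only once: keep one copy for later cells
# and pass the other copy (trailing underscore) to fit and estimate
iterable_pos_train, iterable_pos_train_ = tee(iterable_pos_train)
iterable_neg_train, iterable_neg_train_ = tee(iterable_neg_train)
iterable_pos_test, iterable_pos_test_ = tee(iterable_pos_test)
iterable_neg_test, iterable_neg_test_ = tee(iterable_neg_test)

# fit on the training graphs, then report performance on the held-out test graphs
estimator = fit(iterable_pos_train_, iterable_neg_train_, vectorizer, n_iter_search=5)
estimate(iterable_pos_test_, iterable_neg_test_, estimator, vectorizer)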

Test set
Instances: 405 ; Features: 1048577 with an avg of 81 features per instance
--------------------------------------------------------------------------------
Test Estimate
             precision    recall  f1-score   support

         -1       0.67      0.82      0.74       165
          1       0.86      0.72      0.78       240

avg / total       0.78      0.76      0.76       405

APR: 0.846
ROC: 0.808
CPU times: user 9.43 s, sys: 2.04 s, total: 11.5 s
Wall time: 17.4 s
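
The cell above duplicates every iterable with `itertools.tee` before handing it to `fit` and `estimate`. The reason is that an iterator is exhausted after one pass. The following self-contained sketch shows the pattern on a toy generator; `squares` is a stand-in invented for this example, not part of EDeN.

```python
from itertools import tee


def squares(n):
    """Toy one-shot stream standing in for an iterable of graphs."""
    for i in range(n):
        yield i * i


stream = squares(4)

# First use: keep one copy, consume the other.
stream, stream_copy = tee(stream)
first_pass = list(stream_copy)

# Second use: the kept copy is still intact and can be split again.
stream, stream_copy = tee(stream)
second_pass = list(stream_copy)

# Without tee, a second pass over a plain generator yields nothing.
plain = squares(4)
list(plain)
leftover = list(plain)
```

Note that `tee` buffers the items that one copy has produced and the other has not yet read, so this convenience is paid for in memory when one copy runs far ahead.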

Annotate the instances and display the resulting graphs

Display the first few annotated graphs as examples, coloring the vertices by the annotated 'importance' attribute.



In [14]:

    
help(vectorizer.annotate)

Help on method annotate in module eden.graph:

annotate(self, graphs, estimator=None, reweight=1.0, relabel=False) method of eden.graph.Vectorizer instance
    Given a list of networkx graphs, and a fitted estimator, it returns a list of networkx 
    graphs where each vertex has an additional attribute with key 'importance'.
    The importance value of a vertex corresponds to the part of the score that is imputable 
    to the neighborhood of radius r+d of the vertex. 
    It can overwrite the label attribute with the sparse vector corresponding to the vertex induced features.
    
    Parameters
    ----------
    estimator : scikit-learn predictor trained on data sampled from the same distribution. 
      If None the vertex weigths are by default 1.
    
    reweight : float
      Update the 'weight' information of each vertex as a linear combination of the current weight and 
      the absolute value of the score computed by the estimator. 
      If reweight = 0 then do not update.
      If reweight = 1 then discard the current weight information and use only abs( score )
      If reweight = 0.5 then update with the aritmetic mean of the current weight information 
      and the abs( score )
    
    relabel : bool
      If True replace the label attribute of each vertex with the 
      sparse vector encoding of all features that have that vertex as root. Create a new attribute 
      'original_label' to store the previous label. If the 'original_label' attribute is already present
      then it is left untouched: this allows an iterative application of the relabeling procedure while 
      preserving the original information.
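
The three cases listed for `reweight` in the help text are consistent with a convex combination of the current weight and `abs(score)`. The sketch below writes that reading out; the function name `reweight_sketch` is hypothetical, and the exact formula used inside EDeN is an assumption inferred from the docstring, not taken from the library source.

```python
def reweight_sketch(current_weight, score, reweight=1.0):
    """Blend the current vertex weight with abs(score) (assumed formula).

    reweight = 0   -> keep the current weight
    reweight = 1   -> use abs(score) only
    reweight = 0.5 -> arithmetic mean of the two
    """
    return reweight * abs(score) + (1.0 - reweight) * current_weight


kept = reweight_sketch(1.0, -0.5, reweight=0.0)
replaced = reweight_sketch(1.0, -0.5, reweight=1.0)
averaged = reweight_sketch(1.0, -0.5, reweight=0.5)
blended = reweight_sketch(1.0, -0.5, reweight=0.6)
```

Under this reading, the value `reweight = 0.6` used later in this notebook would put slightly more emphasis on the estimator's score than on the weight accumulated so far.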



In [15]:

    
%matplotlib inline
from itertools import tee, islice
from eden.util.display import draw_graph

# keep a copy of the training graphs for later cells
iterable_pos_train, iterable_pos_train_ = tee(iterable_pos_train)

# annotate the graphs with the fitted estimator and draw only the first three
graphs = vectorizer.annotate(iterable_pos_train_, estimator=estimator)
graphs = islice(graphs, 3)
for graph in graphs:
    draw_graph(graph, vertex_color='importance', size=10)



In [16]:

    
%matplotlib inline
from itertools import tee, islice
from eden.modifier.graph.vertex_attributes import colorize_binary
from eden.util.display import draw_graph

# keep a copy of the training graphs for later cells
iterable_pos_train, iterable_pos_train_ = tee(iterable_pos_train)

graphs = vectorizer.annotate(iterable_pos_train_, estimator=estimator)

# derive a two-valued 'color_value' attribute from 'importance', using level=0 as the threshold
graphs = colorize_binary(graph_list=graphs, output_attribute='color_value', input_attribute='importance', level=0)

# draw only the first three graphs
graphs = islice(graphs, 3)
for graph in graphs:
    draw_graph(graph, vertex_color='color_value', size=10)
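
The cell above passes the annotated graphs through `colorize_binary` with `level=0`. A plausible reading, not confirmed by the source, is that each vertex gets one of two color values depending on whether its importance lies above the threshold. The sketch below illustrates that idea on plain dictionaries; `colorize_binary_sketch` and the toy vertex data are hypothetical, and the real function operates on networkx graphs.

```python
def colorize_binary_sketch(vertices, input_attribute='importance',
                           output_attribute='color_value', level=0):
    """Set a 0/1 attribute on each vertex by thresholding another attribute.

    `vertices` maps a vertex id to its attribute dictionary. This mimics the
    assumed behaviour only; it is not EDeN's implementation.
    """
    for attributes in vertices.values():
        above = attributes[input_attribute] > level
        attributes[output_attribute] = 1 if above else 0
    return vertices


# Toy annotated vertices with made-up importance values.
toy_vertices = {
    0: {'label': 'C', 'importance': 0.8},
    1: {'label': 'N', 'importance': -0.3},
    2: {'label': 'O', 'importance': 0.0},
}
colored = colorize_binary_sketch(toy_vertices)
color_values = [colored[v]['color_value'] for v in sorted(colored)]
```

With a threshold of 0, such a coloring separates vertices that push the score toward the positive class from those that do not, which is easier to read than a continuous color scale.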

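Create a data matrix again, this time from the annotated graphs. Note that the graphs are now weighted: annotating with a non-zero reweight updates the 'weight' information of each vertex.

Evaluate the predictive performance on the weighted graphs, repeating the annotate-and-refit cycle for a few iterations.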



In [17]:

    
%%time
from itertools import tee
from eden.util import fit, estimate

a_estimator = estimator  # start from the estimator fitted on the unweighted graphs
num_iterations = 3
reweight = 0.6  # mix of current weight and abs(score); see help(vectorizer.annotate)
for i in range(num_iterations):
    print('Iteration %d' % i)

    # annotate (and reweight) all four sets with the current estimator
    iterable_pos_train_ = vectorizer.annotate(iterable_pos_train, estimator=a_estimator, reweight=reweight)
    iterable_neg_train_ = vectorizer.annotate(iterable_neg_train, estimator=a_estimator, reweight=reweight)
    iterable_pos_test_ = vectorizer.annotate(iterable_pos_test, estimator=a_estimator, reweight=reweight)
    iterable_neg_test_ = vectorizer.annotate(iterable_neg_test, estimator=a_estimator, reweight=reweight)

    # keep one copy for the next iteration and one for fit/estimate
    iterable_pos_train, iterable_pos_train_ = tee(iterable_pos_train_)
    iterable_neg_train, iterable_neg_train_ = tee(iterable_neg_train_)
    iterable_pos_test, iterable_pos_test_ = tee(iterable_pos_test_)
    iterable_neg_test, iterable_neg_test_ = tee(iterable_neg_test_)

    # refit on the reweighted training graphs and evaluate on the reweighted test graphs
    a_estimator = fit(iterable_pos_train_, iterable_neg_train_, vectorizer)
    estimate(iterable_pos_test_, iterable_neg_test_, a_estimator, vectorizer)


Iteration 0
Test set
Instances: 405 ; Features: 1048577 with an avg of 81 features per instance
--------------------------------------------------------------------------------
Test Estimate
             precision    recall  f1-score   support

         -1       0.67      0.79      0.73       165
          1       0.84      0.73      0.78       240

avg / total       0.77      0.76      0.76       405

APR: 0.872
ROC: 0.837
Iteration 1
Test set
Instances: 405 ; Features: 1048577 with an avg of 81 features per instance
--------------------------------------------------------------------------------
Test Estimate
             precision    recall  f1-score   support

         -1       0.66      0.79      0.72       165
          1       0.84      0.72      0.78       240

avg / total       0.77      0.75      0.76       405

APR: 0.852
ROC: 0.825
Iteration 2
Test set
Instances: 405 ; Features: 1048577 with an avg of 81 features per instance
--------------------------------------------------------------------------------
Test Estimate
             precision    recall  f1-score   support

         -1       0.68      0.79      0.73       165
          1       0.84      0.74      0.79       240

avg / total       0.77      0.76      0.76       405

APR: 0.867
ROC: 0.837
CPU times: user 1min 12s, sys: 15.1 s, total: 1min 27s
Wall time: 2min 6s
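
To make the comparison with the first model easier to read, the short sketch below collects the APR and ROC values printed above and computes the change of each iteration relative to the estimator fitted on the unweighted graphs. The numbers are copied from the outputs in this notebook; the variable names are chosen for this example only.

```python
# Scores reported by the estimator fitted on the unweighted graphs.
baseline = {'APR': 0.846, 'ROC': 0.808}

# Scores reported after each annotate-and-refit iteration.
iterations = [
    {'APR': 0.872, 'ROC': 0.837},
    {'APR': 0.852, 'ROC': 0.825},
    {'APR': 0.867, 'ROC': 0.837},
]

# Difference of each iteration with respect to the baseline.
deltas = [
    dict((name, round(scores[name] - baseline[name], 3)) for name in baseline)
    for scores in iterations
]

for i, delta in enumerate(deltas):
    print('Iteration %d: APR %+.3f, ROC %+.3f' % (i, delta['APR'], delta['ROC']))
```

Both ranking metrics are higher than the baseline in all three iterations, while the averaged f1-score stays at 0.76. The comparison rests on a single train/test split, and the first model was fitted with `n_iter_search=5` whereas the models in the loop use the default setting of `fit`, so the differences should be read with some caution.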