pycobra and scikit-learn

This notebook demonstrates pycobra's integration with the scikit-learn ecosystem. We will also give an example of pycobra's performance on some real-world data-sets.


In [1]:
from pycobra.cobra import Cobra
from pycobra.ewa import Ewa
from pycobra.diagnostics import Diagnostics
from pycobra.visualisation import Visualisation
import numpy as np
%matplotlib inline

Let's set up a synthetic data-set just to show that the COBRA estimator is scikit-learn compatible.


In [2]:
# setting up our random data-set
rng = np.random.RandomState(1)

# D1 = train machines; D2 = create COBRA; D3 = calibrate epsilon, alpha; D4 = testing
n_features = 20
D1, D2, D3, D4 = 200, 200, 200, 200
D = D1 + D2 + D3 + D4
X = rng.uniform(-1, 1, D * n_features).reshape(D, n_features)
Y = np.power(X[:,1], 2) + np.power(X[:,3], 3) + np.exp(X[:,10]) 

# training data-set (D1 + D2)
X_train = X[:D1 + D2]
Y_train = Y[:D1 + D2]
# data to calibrate epsilon (D3)
X_eps = X[D1 + D2:D1 + D2 + D3]
Y_eps = Y[D1 + D2:D1 + D2 + D3]
# held-out data for testing (D4)
X_test = X[D1 + D2 + D3:D1 + D2 + D3 + D4]
Y_test = Y[D1 + D2 + D3:D1 + D2 + D3 + D4]

Similar to other scikit-learn estimators, we set up our machine by creating an object and then fitting it. Since we are not passing an epsilon value, we pass data to the set_epsilon method to find an optimal epsilon. The optimal epsilon is found through scikit-learn's GridSearchCV. The grid_points parameter decides how many candidate epsilon values are traversed.


In [3]:
cobra = Cobra()

In [4]:
cobra.set_epsilon(X_epsilon=X_eps, y_epsilon=Y_eps, grid_points=5)

In [5]:
cobra.epsilon


Out[5]:
0.83454040481024494

In [6]:
cobra.fit(X_train, Y_train)


Out[6]:
Cobra(epsilon=0.83454040481024494, machines=None, random_state=None)
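
For intuition, epsilon controls which training points COBRA retains when answering a query: a point is kept only if every constituent machine's prediction for it lies within epsilon of that machine's prediction for the query, and the retained responses are then averaged. A minimal sketch of that rule in plain numpy (the function name and signature here are illustrative, not pycobra's actual implementation, which can also keep points agreed on by only a subset of machines):

import numpy as np

def cobra_aggregate(train_preds, query_preds, y_train, epsilon):
    # train_preds: (n_samples, n_machines) machine predictions on training data
    # query_preds: (n_machines,) machine predictions for a single query point
    # keep points where every machine's prediction is within epsilon of the query's
    keep = (np.abs(train_preds - query_preds) <= epsilon).all(axis=1)
    # the COBRA prediction is the average response over the retained points
    return y_train[keep].mean()

A larger epsilon keeps more points and behaves like a plain average; a smaller epsilon is more selective but risks retaining no points at all.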

We now see if our object can fit into the scikit-learn pipeline and grid search - and it can!


In [7]:
from sklearn.utils.estimator_checks import check_estimator
check_estimator(Cobra) #passes
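
Since Cobra passes check_estimator, it can be used wherever scikit-learn expects an estimator. As a minimal sketch (the scaler and the fixed epsilon are arbitrary choices for illustration):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Cobra slots into a Pipeline like any other scikit-learn estimator
pipe = Pipeline([("scale", StandardScaler()), ("cobra", Cobra(epsilon=0.5))])
pipe.fit(X_train, Y_train)
pipe.predict(X_test[:5])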

Exponentially Weighted Average Aggregate

Let us also demonstrate the EWA predictor. You can read more about it in the paper by A. Dalalyan and A. B. Tsybakov.


In [8]:
ewa = Ewa()

In [9]:
ewa.set_beta(X_beta=X_eps, y_beta=Y_eps)

If we fit EWA without passing beta, a cross-validation is performed to find the optimal beta.


In [10]:
ewa.fit(X_train, Y_train)


Out[10]:
Ewa(beta=0.01, random_state=None)

In [11]:
check_estimator(Ewa) #passes

EWA assigns weights to each machine based on its MSE. We can check the weights of each machine with the plot_machine_weights method.
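
Schematically, the weights are exponential in the negative error, with beta acting as a temperature that controls how sharply the weights concentrate on the best machine. A rough sketch of the idea (the exact risk estimate and normalisation used by pycobra may differ):

import numpy as np

def exponential_weights(mses, beta):
    # machines with lower MSE receive exponentially larger weight
    w = np.exp(-np.asarray(mses) / beta)
    return w / w.sum()

# near-identical errors give near-uniform weights,
# matching the roughly 0.25 weights shown below
exponential_weights([0.10, 0.10, 0.10, 0.10], beta=0.01)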


In [12]:
ewa.plot_machine_weights()



In [13]:
ewa.machine_weight_


Out[13]:
{'lasso': 0.25001661060829494,
 'random_forest': 0.24990096862037336,
 'ridge': 0.25004298116939,
 'tree': 0.25003943960194158}

Like the Cobra estimator, Ewa is also a scikit-learn compatible estimator. It also fits into the Visualisation class, as demonstrated in the visualisation notebook.

Predicting?

Like the other scikit-learn predictors, we predict on new data by simply using the predict() method.


In [14]:
query = X_test[0].reshape(1, -1)

In [15]:
cobra.predict(query)


Out[15]:
array([ 0.75129517])

In [16]:
ewa.predict(query)


Out[16]:
array([ 0.62147997])

Why pycobra?

There are scikit-learn estimators which already perform well in basic regression tasks - why use pycobra? The Cobra estimator has the advantage of a theoretical bound on its performance: it performs at least as well as the best of the estimators used to create it, up to a remainder term which decays to zero. The Ewa estimator benefits from similar bounds.
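
Schematically, the guarantee is an oracle inequality: writing $T_n$ for the aggregate, $r_1, \dots, r_M$ for the constituent machines and $r^\ast$ for the true regression function, it reads (constants and exact rates omitted)

$$\mathbb{E}\big[T_n(X) - r^\ast(X)\big]^2 \;\le\; \min_{1 \le m \le M} \mathbb{E}\big[r_m(X) - r^\ast(X)\big]^2 + R_n, \qquad R_n \to 0 \ \text{as} \ n \to \infty.$$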

pycobra also lets you compare the scikit-learn estimators used in the aggregation - unlike scikit-learn's own ensemble methods for regression, pycobra's algorithms are built on top of other scikit-learn-like estimators.

pycobra for classification

pycobra also implements the classification algorithm introduced by Mojirsheibani [1999], Combining Classifiers via Discretization, Journal of the American Statistical Association.

ClassifierCobra operates exactly as COBRA in the sense that data points are selected according to how closely the machines' predictions for them agree with the predictions for the new query point. Then, instead of forming a weighted average as COBRA does, ClassifierCobra assigns a label to the new point by majority vote among the selected points. A schematic of that rule follows.
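
A sketch in plain numpy (the function name and signature are illustrative, not pycobra's actual implementation):

import numpy as np

def classifier_cobra_label(train_preds, query_preds, y_train):
    # train_preds: (n_samples, n_classifiers) labels predicted on training data
    # query_preds: (n_classifiers,) labels predicted for a single query point
    # select points whose predicted labels all agree with the query's
    selected = (train_preds == query_preds).all(axis=1)
    # majority vote among the responses of the selected points
    labels, counts = np.unique(y_train[selected], return_counts=True)
    return labels[np.argmax(counts)]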


In [17]:
from sklearn import datasets
from sklearn.metrics import accuracy_score
bc = datasets.load_breast_cancer()
X = bc.data[:-20]
y = bc.target[:-20]
X_test = bc.data[-20:]
y_test = bc.target[-20:]

In [18]:
from pycobra.classifiercobra import ClassifierCobra
check_estimator(ClassifierCobra)

In [19]:
cc = ClassifierCobra()

In [20]:
cc.fit(X, y)


Out[20]:
ClassifierCobra(random_state=None)

In [21]:
cc.predict(X_test)


Out[21]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.])

Let's see how it works in a practical case.


In [22]:
cc_diag = Diagnostics(cc, X_test, y_test)

In [23]:
cc_diag.load_errors()

In [24]:
cc_diag.machine_error


Out[24]:
{'ClassifierCobra': 0.050000000000000044,
 'knn': 0.050000000000000044,
 'sgd': 0.050000000000000044,
 'svm': 0.30000000000000004,
 'tree': 0.050000000000000044}

Quite well!

Real-world datasets

We have demonstrated in the regression notebook how pycobra works on synthetic data-sets. Let's see pycobra in action on some scikit-learn regression datasets.


In [25]:
diabetes = datasets.load_diabetes()

diabetes_X_train = diabetes.data[:-40]
diabetes_X_test = diabetes.data[-20:]
# part of the data to find an appropriate epsilon
diabetes_X_eps = diabetes.data[-40:-20]

diabetes_y_train = diabetes.target[:-40]
diabetes_y_test = diabetes.target[-20:]
diabetes_y_eps = diabetes.target[-40:-20]

We don't know what epsilon value to choose for these data-sets, so by passing X_eps and y_eps to set_epsilon we can get an idea of what might be a good epsilon value.


In [26]:
COBRA_diabetes = Cobra()
COBRA_diabetes.set_epsilon(X_epsilon=diabetes_X_eps, y_epsilon=diabetes_y_eps, grid_points=50)
COBRA_diabetes.fit(diabetes_X_train, diabetes_y_train)


Out[26]:
Cobra(epsilon=67.106571256106747, machines=None, random_state=None)

Predicting using the COBRA predictor is again similar to using a scikit-learn estimator.


In [27]:
COBRA_diabetes.predict(diabetes_X_test)


Out[27]:
array([ 169.55      ,  118.45      ,  163.96153846,  123.36842105,
        205.3       ,  114.1038961 ,  248.73684211,  110.4516129 ,
        121.72631579,  118.11956522,  228.30769231,  101.765625  ,
        120.76470588,  109.89705882,   94.08064516,  176.43589744,
        107.48529412,  116.12      ,  149.12      ,  124.        ])

Let's compare our MSEs using the Diagnostics class now.


In [28]:
cobra_diagnostics = Diagnostics(COBRA_diabetes, diabetes_X_test, diabetes_y_test, load_MSE=True)

In [29]:
cobra_diagnostics.machine_MSE


Out[29]:
{'COBRA': 2574.9261032202858,
 'lasso': 1890.4566824060098,
 'random_forest': 3519.2169999999996,
 'ridge': 2046.7959296145473,
 'tree': 5294.0500000000002}

Let us similarly use COBRA on the Boston housing data set.


In [30]:
boston = datasets.load_boston()

boston_X_train = boston.data[:-40]
boston_X_test = boston.data[-20:]
boston_X_eps = boston.data[-40:-20]

boston_y_train = boston.target[:-40]
boston_y_test = boston.target[-20:]
boston_y_eps = boston.target[-40:-20]

In [31]:
COBRA_boston = Cobra()
COBRA_boston.set_epsilon(X_epsilon=boston_X_eps, y_epsilon=boston_y_eps, grid_points=50)
COBRA_boston.fit(boston_X_train, boston_y_train)


/Users/bhargavvader/open_source/pycobra/venv/lib/python2.7/site-packages/sklearn/linear_model/coordinate_descent.py:491: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
Out[31]:
Cobra(epsilon=7.1981878724376198, machines=None, random_state=None)

In [32]:
cobra_diagnostics = Diagnostics(COBRA_boston, boston_X_test, boston_y_test, load_MSE=True)

In [33]:
cobra_diagnostics.machine_MSE


Out[33]:
{'COBRA': 34.990952015164297,
 'lasso': 17.055321704329554,
 'random_forest': 21.819779999999998,
 'ridge': 19.601633683218243,
 'tree': 19.169499999999996}