Interoperability with sklearn

In this notebook, we demonstrate the interoperability of arboretum with sklearn.model_selection for cross-validation and parameter search, and with sklearn.pipeline for feature selection inside a pipeline. We will be working with the ALS dataset, a wide, noisy dataset that tree models struggle with.
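
Arboretum's estimators plug into these tools because they follow scikit-learn's estimator conventions: an __init__ that only stores hyperparameters, fit/predict methods, and get_params/set_params so that utilities like GridSearchCV can clone and reconfigure them. Here is a minimal sketch of that contract (a generic illustration, not arboretum's actual implementation):

In [ ]:
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MiniRegressor(BaseEstimator, RegressorMixin):
    """Toy regressor showing the minimal sklearn-compatible interface."""
    def __init__(self, n_trees=100, min_leaf=5):
        # __init__ only records hyperparameters; BaseEstimator derives
        # get_params/set_params from this signature, which is what lets
        # GridSearchCV clone and reconfigure the estimator.
        self.n_trees = n_trees
        self.min_leaf = min_leaf

    def fit(self, X, y):
        self.mean_ = y.mean()  # stand-in for real training
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)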


In [1]:
from arboretum.datasets import load_als
from arboretum import RFRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error as mse

# Load the ALS train/test split
xtr, ytr, xte, yte = load_als()

# Fit sklearn's forest and arboretum's side by side with matching settings
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5)
rf.fit(xtr, ytr)
myrf = RFRegressor(n_trees=100, min_leaf=5)
myrf.fit(xtr, ytr)


Out[1]:
RFRegressor(min_leaf=5, n_trees=100, max_features=None, max_depth=None)

In [2]:
# Compare held-out MSE for the two models
pred = rf.predict(xte)
mypred = myrf.predict(xte)
mse(yte, pred), mse(yte, mypred)


Out[2]:
(0.25643977501041193, 0.26140719623082576)

Grid Search CV

Next, we run a one-parameter grid search for each model over the minimum leaf size. To speed things up in the notebook, we'll cap max_features (the number of features considered at each split) at 30.


In [5]:
rf.max_features = 30  # cap features per split to speed up the search
params = {'min_samples_leaf': [1, 5, 10, 20]}
gcv = GridSearchCV(rf, params, scoring='neg_mean_squared_error')
gcv.fit(xtr, ytr)
pred = gcv.predict(xte)
mse(yte, pred), gcv.best_score_, gcv.best_params_


Out[5]:
(0.27238839781844454, -0.2686713399674493, {'min_samples_leaf': 1})

In [7]:
myrf.max_features = 30  # cap features per split to speed up the search
myparams = {'min_leaf': [1, 5, 10, 20]}
mygcv = GridSearchCV(myrf, myparams, scoring='neg_mean_squared_error')
mygcv.fit(xtr, ytr)
mypred = mygcv.predict(xte)
mse(yte, mypred), mygcv.best_score_, mygcv.best_params_


Out[7]:
(0.27147188893224822, -0.26986941019167443, {'min_leaf': 10})
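
Since RFRegressor implements the same interface, other model_selection utilities work unchanged. For example, cross-validated scoring (a sketch, not executed here):

In [ ]:
from sklearn.model_selection import cross_val_score

# 5-fold CV on the training set; scores are negated MSE, so higher is better
scores = cross_val_score(myrf, xtr, ytr, scoring='neg_mean_squared_error', cv=5)
scores.mean()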

Pipeline/Feature Selection

Next, we'll set up a pipeline that chains a simple univariate feature selection step with our model. Since feature selection now happens upstream, we'll set the models back to using all features.


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
rf.max_features = None  # selection now happens upstream in the pipeline
skb = SelectKBest(f_regression, k=30)
pipe = Pipeline([('select', skb), ('model', rf)])
pipe.fit(xtr, ytr)
pred = pipe.predict(xte)
mse(yte, pred)


Out[10]:
0.26722016810894289

In [11]:
myrf.max_features = None
mypipe = Pipeline([('select', skb), ('model', myrf)])
mypipe.fit(xtr, ytr)
mypred = mypipe.predict(xte)
mse(yte, mypred)


Out[11]:
0.26832628758585897
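
The two pieces compose: GridSearchCV can tune pipeline steps jointly using sklearn's step__parameter naming convention. A sketch (not executed here) that searches over the number of selected features and arboretum's leaf size:

In [ ]:
# 'select__k' and 'model__min_leaf' reach into the named pipeline steps
pipe_params = {'select__k': [10, 30, 100], 'model__min_leaf': [5, 10]}
pipe_gcv = GridSearchCV(mypipe, pipe_params, scoring='neg_mean_squared_error')
pipe_gcv.fit(xtr, ytr)
pipe_gcv.best_params_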

Conclusion

A lot of the value of scikit-learn is in the 'plumbing' code for repetitive tasks like cross-validation, evaluation, and feature selection. In this notebook, we showed how to use arboretum with these parts of sklearn.