Comparing Lolo and Scikit-Learn

The purpose of this notebook is to compare the use and output of random forest models in Lolo and scikit-learn.


In [1]:
%matplotlib inline
from matplotlib import pyplot as plt
from lolopy.learners import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor as SKRFRegressor
from sklearn.datasets import load_boston
import numpy as np

Create the Dataset

We'll use the well-known Boston Housing Prices dataset.


In [2]:
X, y = load_boston(return_X_y=True)
print('Training set size:', X.shape)


Training set size: (506, 13)
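A caveat for readers running this today: `load_boston` was deprecated and removed in scikit-learn 1.2. A minimal sketch of a substitute, using `load_diabetes` (which ships with scikit-learn, so no download is needed); the shapes differ from Boston's, but the rest of the notebook works unchanged:

```python
from sklearn.datasets import load_diabetes

# load_boston was removed in scikit-learn 1.2; load_diabetes is a bundled
# regression dataset that serves as a drop-in substitute for this comparison
X, y = load_diabetes(return_X_y=True)
print('Training set size:', X.shape)  # (442, 10)
```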

Train a Scikit-Learn Random Forest

Train the model on the entire Boston dataset, then predict the housing price for every entry in the dataset.


In [3]:
model = SKRFRegressor(n_estimators=len(X))

In [4]:
%%time 
model.fit(X, y)


CPU times: user 1.75 s, sys: 78.1 ms, total: 1.83 s
Wall time: 1.85 s
Out[4]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=506, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [5]:
%%time
sk_pred = model.predict(X)


CPU times: user 78.1 ms, sys: 31.2 ms, total: 109 ms
Wall time: 72.7 ms

Train Model Using Lolo

Train the model and get the predictions with uncertainties.


In [6]:
model = RandomForestRegressor(num_trees=len(X))

In [7]:
%%time
model.fit(X, y)


CPU times: user 1.09 s, sys: 672 ms, total: 1.77 s
Wall time: 7.81 s
Out[7]:
RandomForestRegressor(num_trees=506, subsetStrategy=4, useJackknife=True)

In [8]:
%%time
lolo_pred, lolo_std = model.predict(X, return_std=True)


CPU times: user 734 ms, sys: 609 ms, total: 1.34 s
Wall time: 5.32 s

Note that lolopy follows the same `fit`/`predict` API as the scikit-learn model.
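Because both libraries expose the same estimator interface, a model-agnostic helper can drive either one. A minimal sketch (the helper name is hypothetical, and it is demonstrated here with the scikit-learn model, since lolopy requires a JVM):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor as SKRFRegressor

def fit_and_predict(model, X, y):
    # Works with lolopy's or sklearn's RandomForestRegressor alike,
    # since both expose the fit/predict estimator interface
    model.fit(X, y)
    return model.predict(X)

# Demonstrated with the sklearn model on a bundled dataset
X, y = load_diabetes(return_X_y=True)
pred = fit_and_predict(SKRFRegressor(n_estimators=10), X, y)
```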

Plot the Results

Show that Lolo produces a model reasonably similar to sklearn's.


In [9]:
fig, axs = plt.subplots(1, 2, sharey=True)

axs[0].errorbar(y, lolo_pred, lolo_std, fmt='o', ms=2.5, ecolor='gray')
axs[1].scatter(y, sk_pred, s=5)

lim = [0, 55]

for ax, n in zip(axs, ['Lolo', 'sklearn']):
    ax.set_xlim(lim)
    ax.set_ylim(lim)
    ax.set_xlabel('House Price, True (k$)')
    ax.plot(lim, lim, 'k--')
    ax.text(5, 50, n, fontsize=16)
    
axs[0].set_ylabel('House Price, Predicted (k$)')
fig.set_size_inches(6, 3)
fig.tight_layout()


Lolo produces a random forest model very close to scikit-learn's, and additionally provides uncertainty estimates (error bars) on its predictions.
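One natural follow-up is to check how well calibrated those error bars are, e.g. what fraction of true values fall within two predicted standard errors. A hedged sketch of that check; since lolopy needs a JVM, it is illustrated with scikit-learn's `GaussianProcessRegressor`, whose `predict` also supports `return_std=True`:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.gaussian_process import GaussianProcessRegressor

X, y = load_diabetes(return_X_y=True)

# Any estimator whose predict returns (mean, std) works here,
# including lolopy's RandomForestRegressor with return_std=True
gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
pred, std = gp.predict(X, return_std=True)

# Fraction of true values within two predicted standard errors
coverage = float(np.mean(np.abs(y - pred) <= 2 * std))
```

Evaluating on held-out data rather than the training set would give a more honest calibration estimate.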
