Plotting Cross-Validated Predictions On The Boston Dataset

This notebook illustrates plotting cross-validated predictions on the Boston dataset. It is a version of the scikit-learn example Plotting Cross-Validated Predictions.

The main point it shows is using pandas structures throughout the code, which integrate nicely with seaborn.
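
For instance (a minimal sketch with made-up toy data, not a cell from the original notebook): ibex wraps scikit-learn estimators so that they consume and produce pandas structures, which is what lets the `.head()` calls below work on prediction results.

import pandas as pd
from sklearn import linear_model
from ibex.sklearn import linear_model as pd_linear_model

X = pd.DataFrame({'a': [1., 2., 3.], 'b': [4., 5., 6.]})
y = pd.Series([1., 2., 4.])

# Plain scikit-learn predicts into a bare numpy array:
linear_model.LinearRegression().fit(X, y).predict(X)
# The ibex wrapper returns a pandas Series carrying X's index:
pd_linear_model.LinearRegression().fit(X, y).predict(X)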


In [14]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn import model_selection
import seaborn as sns
sns.set_style('whitegrid')

from ibex.sklearn import linear_model as pd_linear_model
from ibex.sklearn import ensemble as pd_ensemble
from ibex import xgboost as pd_xgboost
from ibex.sklearn import model_selection as pd_model_selection

%pylab inline


Populating the interactive namespace from numpy and matplotlib

Loading The Data

First we load the dataset into a pandas.DataFrame.


In [9]:
dataset = datasets.load_boston()
boston = pd.DataFrame(dataset.data, columns=dataset.feature_names)
features = dataset.feature_names
boston['price'] = dataset.target
boston.head()


Out[9]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT price
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
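
Note that `datasets.load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2. If you are on a recent version, a sketch of an alternative load, assuming the OpenML copy of the dataset (`'boston'`, version 1, whose target column is named MEDV):

from sklearn.datasets import fetch_openml

# Fetch the same data from OpenML as a DataFrame (requires network access).
bunch = fetch_openml(name='boston', version=1, as_frame=True)
boston = bunch.frame.rename(columns={'MEDV': 'price'})
# Depending on the version, CHAS and RAD may load as categoricals; cast if so.
boston = boston.astype(float)
features = [c for c in boston.columns if c != 'price']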

Building The Cross Validated Predictions

We will use a linear predictor, a random-forest predictor, and an XGBoost predictor.


In [10]:
linear_y_hat = pd_model_selection.cross_val_predict(
    pd_linear_model.LinearRegression(),
    boston[features],
    boston.price)
linear_y_hat.head()


Out[10]:
0    30.598898
1    24.489412
2    29.747524
3    28.648380
4    27.552867
dtype: float64
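
Each entry of `linear_y_hat` is an out-of-fold prediction: `cross_val_predict` splits the data into folds (three or five by default, depending on the scikit-learn version) and predicts each sample with a model fitted on the other folds. Since the result is a regular pandas Series aligned with `boston.price`, scoring it is a one-liner (a quick check, not in the original notebook):

from sklearn import metrics

# Out-of-fold R^2: each prediction comes from a model that never saw that sample.
metrics.r2_score(boston.price, linear_y_hat)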

In [11]:
linear_cv = pd.concat([linear_y_hat, boston.price], axis=1)
linear_cv['type'] = 'linear'
linear_cv.columns = ['y_hat', 'y', 'regressor']
linear_cv.head()


Out[11]:
y_hat y regressor
0 30.598898 24.0 linear
1 24.489412 21.6 linear
2 29.747524 34.7 linear
3 28.648380 33.4 linear
4 27.552867 36.2 linear

In [32]:
rf_y_hat = pd_model_selection.cross_val_predict(
    pd_ensemble.RandomForestRegressor(),
    boston[features],
    boston.price)
rf_cv = pd.concat([rf_y_hat, boston.price], axis=1)
rf_cv['type'] = 'rf'
rf_cv.columns = ['y_hat', 'y', 'regressor']

In [33]:
xgb_rf_y_hat = pd_model_selection.cross_val_predict(
    pd_xgboost.XGBRegressor(),
    boston[features],
    boston.price)
xgb_rf_cv = pd.concat([xgb_rf_y_hat, boston.price], axis=1)
xgb_rf_cv['type'] = 'xgb_rf'
xgb_rf_cv.columns = ['y_hat', 'y', 'regressor']
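
The three cells above repeat the same pattern; a sketch of a refactor into a small helper (the name `cv_frame` is hypothetical, not part of ibex or the original notebook):

def cv_frame(estimator, name):
    # Out-of-fold predictions for `estimator`, as one tidy frame.
    y_hat = pd_model_selection.cross_val_predict(
        estimator, boston[features], boston.price)
    return pd.DataFrame({'y_hat': y_hat, 'y': boston.price, 'regressor': name})

linear_cv = cv_frame(pd_linear_model.LinearRegression(), 'linear')
rf_cv = cv_frame(pd_ensemble.RandomForestRegressor(), 'rf')
xgb_rf_cv = cv_frame(pd_xgboost.XGBRegressor(), 'xgb_rf')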

In [34]:
cvs = pd.concat([linear_cv, rf_cv, xgb_rf_cv])
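
Since the three frames share the same 0..505 index, `cvs` has repeated index values; that is harmless below, but if it matters downstream, a fresh index is one keyword away (assuming pandas' usual concat semantics):

cvs = pd.concat([linear_cv, rf_cv, xgb_rf_cv], ignore_index=True)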

In [35]:
cvs.regressor.unique()


Out[35]:
array(['linear', 'rf', 'xgb_rf'], dtype=object)

Plotting The Cross-Validated Predictions

Finally, we can plot the results:


In [37]:
min_, max_ = cvs[['y_hat', 'y']].min().min(), cvs[['y_hat', 'y']].max().max()
sns.lmplot(
    x='y',
    y='y_hat',
    hue='regressor',
    data=cvs,
    palette={'linear': 'grey', 'rf': 'brown', 'xgb_rf': 'green'});
# Draw the y_hat == y diagonal: points on it are perfect predictions.
plot(np.linspace(min_, max_, 100), np.linspace(min_, max_, 100), '--', color='darkgrey');
tick_params(colors='0.6')
xlim((min_, max_))
ylim((min_, max_))
figtext(
    0,
    -0.1,
    'Cross-validated predictions for the linear, random-forest, and XGBoost regressors on the price in the Boston dataset;\n'
    'the linear regressor has inferior performance here, in particular for lower prices');
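
The caption's claim about the linear regressor can be checked numerically; because `cvs` is tidy, per-regressor scores are a groupby away (a quick sketch, not in the original notebook):

from sklearn import metrics

# Out-of-fold R^2 per regressor.
cvs.groupby('regressor').apply(lambda g: metrics.r2_score(g.y, g.y_hat))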


