In [1]:
%matplotlib inline
In [2]:
from matplotlib import pyplot as plt
from matplotlib.cm import PuOr
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats  # ensure sp.stats is available
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
In [3]:
from pycebox.ice import ice, ice_plot
In [4]:
# from random.org, for reproducibility
np.random.seed(400845)
This tutorial recreates the first example from Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation using pycebox. For details of pycebox's API, consult the documentation.
First we generate 1,000 data points from the model $X_1, X_2, X_3 \sim U(-1, 1)$, $\varepsilon \sim N(0, 1)$,
$$y = 0.2 X_1 - 5 X_2 + 10 X_2 \cdot \mathbb{I}(X_3 \geq 0) + \varepsilon,$$ where $\mathbb{I}(\cdot)$ is the indicator function. Note that the coefficient of $X_2$ is $-5$ when $X_3 < 0$ and $+5$ when $X_3 \geq 0$, so the model contains a strong interaction between $X_2$ and $X_3$.
In [5]:
N = 1000
In [6]:
df = pd.DataFrame(sp.stats.uniform.rvs(-1, 2, size=(N, 3)),
                  columns=['x1', 'x2', 'x3'])
noise = sp.stats.norm.rvs(size=N)
In [7]:
y = 0.2 * df.x1 - 5 * df.x2 + 10 * df.x2 * (df.x3 >= 0) + noise
We will study the relationship between $y$ and $X_2$, which is shown below.
In [8]:
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df.x2, y, c='k', alpha=0.5);
ax.set_xlim(-1.05, 1.05);
ax.set_xlabel('$X_2$');
ax.set_ylabel('$y$');
ax.set_title('Data');
We fit a scikit-learn GradientBoostingRegressor to the data.
In [9]:
gbm = GradientBoostingRegressor()
gbm.fit(df.values, y)
Out[9]:
We can now use pycebox's ice function to generate individual conditional expectation curves with respect to the fitted model and the predictor $X_2$.
In [10]:
ice_df = ice(df, 'x2', gbm.predict, num_grid_points=100)
Each column of ice_df corresponds to a data point, with the $X_2$ value removed; each row corresponds to an $X_2$ value.
In [11]:
ice_df.head()
Out[11]:
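To make this structure concrete, here is a quick sketch of how one might inspect the frame. With num_grid_points=100 and 1,000 data points we would expect roughly 100 rows and 1,000 columns, though the exact row count depends on how pycebox constructs the grid.
# A rough sanity check of the ICE DataFrame's structure (a sketch; the exact
# number of rows depends on how pycebox builds the x2 grid from num_grid_points)
print(ice_df.shape)       # roughly (100, 1000): one row per grid value of x2,
                          # one column per data point
print(ice_df.index[:5])   # the index holds the grid of x2 values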
The individual conditional expectation curves in ice_df can now be plotted using your visualization package of choice. pycebox includes a convenience function, ice_plot, for plotting the individual conditional expectation curves using matplotlib.
In [12]:
fig, (data_ax, ice_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
data_ax.scatter(df.x2, y, c='k', alpha=0.5);
data_ax.set_xlim(-1.05, 1.05);
data_ax.set_xlabel('$X_2$');
data_ax.set_ylabel('$y$');
data_ax.set_title('Data');
ice_plot(ice_df, frac_to_plot=0.1,
         c='k', alpha=0.25,
         ax=ice_ax);
ice_ax.set_xlabel('$X_2$');
ice_ax.set_ylabel('$y$');
ice_ax.set_title('ICE curves');
Inspecting the ICE curves, it seems quite likely that there is an important interaction. Since we know the data generating process, we know that the key interaction is between $X_2$ and $X_3$. The function ice_plot accepts a color_by keyword argument, which takes either a string naming a column in the initial DataFrame or a function applied to the column index of the ICE DataFrame; the resulting values are used to color the ICE curves.
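For example, a hypothetical sketch of the callable form, coloring each curve by the sign of its data point's $X_3$ value (this assumes, per the description above, that the callable receives a DataFrame built from the ICE DataFrame's column index, with the original feature names):
# Hypothetical sketch: pass a callable to color_by instead of a column name.
# Assumes the callable is applied to a DataFrame built from ice_df's column
# index, as described above, so data['x3'] holds each curve's x3 value.
fig, ax = plt.subplots(figsize=(8, 6))
ice_plot(ice_df, frac_to_plot=0.1,
         color_by=lambda data: np.sign(data['x3']), cmap=PuOr,
         ax=ax)
ax.set_xlabel('$X_2$')
ax.set_ylabel('$y$')
ax.set_title('ICE curves colored by the sign of $X_3$');
The cell below uses the simpler string form, color_by='x3', which colors each curve by its data point's $X_3$ value directly.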
In [13]:
fig, (data_ax, ice_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
data_ax.scatter(df.x2, y, c='k', alpha=0.5);
data_ax.set_xlim(-1.05, 1.05);
data_ax.set_xlabel('$X_2$');
data_ax.set_ylabel('$y$');
data_ax.set_title('Data');
ice_plot(ice_df, frac_to_plot=0.1,
         color_by='x3', cmap=PuOr,
         ax=ice_ax);
ice_ax.set_xlabel('$X_2$');
ice_ax.set_ylabel('$y$');
ice_ax.set_title('ICE Curves');
This plot makes the interaction between $X_2$ and $X_3$ quite transparent.
Additionally, ice_plot accepts the plot_points keyword. When plot_points=True, the predicted value for each data point is plotted on its ICE curve.
In [14]:
fig, (data_ax, ice_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
data_ax.scatter(df.x2, y, c='k', alpha=0.5);
data_ax.set_xlim(-1.05, 1.05);
data_ax.set_xlabel('$X_2$');
data_ax.set_ylabel('$y$');
data_ax.set_title('Data');
ice_plot(ice_df, frac_to_plot=0.1,
         plot_points=True, point_kwargs={'color': 'k', 'alpha': 0.75},
         color_by='x3', cmap=PuOr,
         ax=ice_ax);
ice_ax.set_xlabel('$X_2$');
ice_ax.set_ylabel('$y$');
ice_ax.set_title('ICE Curves');