Some people complain that machine learning models are black boxes.
These people will argue we cannot see how these models are working on any given dataset, so we can neither extract insight nor identify problems with the model.
By and large, people making this claim are unfamiliar with partial dependence plots.
Partial dependence plots show how each variable or predictor affects the model's predictions.
This is useful for answering practical questions about which variables drive a model's predictions, and in which direction.
If you are familiar with linear or logistic regression models, partial dependence plots can be interpreted similarly to the coefficients in those models.
But partial dependence plots can capture more complex patterns from your data, and they can be used with any model.
If you aren't familiar with linear or logistic regressions, don't get caught up on that comparison.
We will show a couple examples below, explain what they mean, and then talk about the code.
We'll begin with 2 partial dependence plots showing the relationship (according to the model) between Price and other variables from the Melbourne Housing dataset.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
# These imports follow the older scikit-learn API used throughout this tutorial;
# sklearn.ensemble.partial_dependence and Imputer were removed in later releases.
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence
from sklearn.preprocessing import Imputer

cols_to_use = ['Distance', 'Landsize', 'BuildingArea']

def get_some_data():
    data = pd.read_csv('input/melbourne_data.csv')
    y = data.Price
    X = data[cols_to_use]
    my_imputer = Imputer()
    imputed_X = my_imputer.fit_transform(X)  # Fill in missing values; returns a numpy array.
    return imputed_X, y

X, y = get_some_data()
my_model = GradientBoostingRegressor()
my_model.fit(X, y)
# Plot partial dependence for Distance (column 0) and BuildingArea (column 2).
my_plots = plot_partial_dependence(my_model, features=[0, 2], X=X,
                                   feature_names=cols_to_use, grid_resolution=10)
plt.show()
The left plot shows the partial dependence of our target, Price, on the Distance variable.
Distance in this dataset measures the distance to Melbourne's central business district.
The partial dependence plot is calculated only after the model has been fit, and the model is fit on real data.
In that real data, houses in different parts of town may differ in myriad ways (different ages, sizes, and so on).
But once the model is fit, we can start by taking all the characteristics of a single house.
Say, a house with 2 bedrooms, 2 bathrooms, a large lot, an age of 10 years, and the like.
We then use the model to predict the price of that house, but we change the Distance variable before each prediction.
We first predict the price for that house with Distance set to 4.
We then set Distance to 5, then 6, and so on.
Finally, we trace out how the predicted price changes (on the vertical axis) as we move from small values of Distance to large values (on the horizontal axis).
In this description, we used only a single house.
But because of interactions with that house's other characteristics, the partial dependence plot for a single house may be atypical.
So instead, we repeat that experiment with many houses, and we plot the average predicted price on the vertical axis.
You'll see some negative numbers on the vertical axis, but that doesn't mean a house would sell for a negative price.
Instead, it means the predicted prices are below the overall average predicted price.
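To make that averaging idea concrete, here is a minimal hand-rolled sketch of the procedure, assuming the my_model and X objects from the code cell above (with column 0 holding Distance). It only illustrates the idea, not how scikit-learn computes these plots internally, and the centering step at the end is just one way to think about the negative values.

import numpy as np

def manual_partial_dependence(model, X, col_index, grid_values):
    """Average the model's predictions while forcing one column to each grid value."""
    averaged_predictions = []
    for value in grid_values:
        X_modified = X.copy()
        X_modified[:, col_index] = value  # Give every house the same Distance.
        averaged_predictions.append(model.predict(X_modified).mean())
    return np.array(averaged_predictions)

# Evaluate the curve at 10 points spanning the observed range of Distance (column 0).
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), num=10)
pdp_values = manual_partial_dependence(my_model, X, col_index=0, grid_values=grid)

# Subtracting the overall average prediction recenters the curve around zero,
# which is one way to read the negative values discussed above.
pdp_centered = pdp_values - my_model.predict(X).mean()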
In the left graph, we see house prices fall as we get further from the central business district, though there seems to be a nice suburb about 16 kilometers out where home prices are higher than in other suburbs.
The right graph shows the impact of building area, which is interpreted similarly.
A larger building area means higher prices.
These plots are useful both for extracting insights and for sanity checking that your model is learning something you think is sensible.
In [2]:
from sklearn.ensemble.partial_dependence import partial_dependence, plot_partial_dependence

X, y = get_some_data()
# Partial dependence plots were originally implemented only for Gradient Boosting models.
my_model = GradientBoostingRegressor()
my_model.fit(X, y)
my_plots = plot_partial_dependence(my_model,
                                   features=[0, 2],     # Column numbers of the plots we want to show.
                                   X=X,                  # Raw predictors data.
                                   feature_names=['Distance', 'Landsize', 'BuildingArea'],  # Labels on the plots.
                                   grid_resolution=10)   # Number of values to plot on the x-axis.
plt.show()
Some tips related to plot_partial_dependence:

grid_resolution determines how many different points are plotted along the horizontal axis. These plots tend to look jagged as that value increases, because you pick up lots of the randomness or noise in your model. It's best not to take the small or jagged fluctuations too literally; smaller values of grid_resolution smooth them out. It's also much less of an issue for datasets with many rows. (A quick way to see this effect is sketched after these tips.)

partial_dependence is used to get the raw data making up the plot, rather than making the visual plot itself. This is useful if you want to control how it is visualized using a plotting package like Seaborn. With moderate effort, you could make much nicer looking plots; a sketch of this appears at the end of the page.

Partial dependence plots are a great way (though not the only way) to extract insights from complex models. They can be incredibly powerful for communicating those insights to colleagues or non-technical users.
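To see the grid_resolution effect described in the tips above, one small sketch is to redraw the Distance plot at a coarse and a fine resolution; this assumes the my_model, X, and cols_to_use objects from the cells above.

import matplotlib.pyplot as plt
from sklearn.ensemble.partial_dependence import plot_partial_dependence

# Redraw the Distance plot at two resolutions to compare how smooth the curves look.
for resolution in [5, 50]:
    plot_partial_dependence(my_model, features=[0], X=X,
                            feature_names=cols_to_use, grid_resolution=resolution)
    plt.title('grid_resolution = %d' % resolution)
    plt.show()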
There are a variety of opinions on how to interpret these plots when they come from non-experimental data.
Some claim you can conclude nothing about cause-and-effect relationships from data unless it comes from experiments. Others are more positive about what can be learned from non-experimental data (also called observational data).
It's a divisive topic in the data science world, beyond the scope of this tutorial.
However, most agree that these plots are useful for understanding your model.
And given the messiness of most real-world data sources, they are also a good sanity check that your model is capturing realistic patterns.
The plot_partial_dependence function is an easy way to get these plots, though the results aren't visually beautiful.
The partial_dependence function gives you the raw data, in case you want to make presentation-quality graphs.
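As a closing sketch of that raw-data route, here is one way to pull the values behind the Distance plot and draw them with plain matplotlib; it assumes the my_model and X objects and the older partial_dependence API from the cells above, and the labels are only illustrative.

import matplotlib.pyplot as plt
from sklearn.ensemble.partial_dependence import partial_dependence

# Get the raw partial dependence values for Distance (column 0).
pdp_values, pdp_axes = partial_dependence(my_model,
                                          target_variables=[0],
                                          X=X,
                                          grid_resolution=10)

# pdp_values has one row for this regressor; pdp_axes[0] holds the Distance grid.
plt.plot(pdp_axes[0], pdp_values[0])
plt.xlabel('Distance')
plt.ylabel('Partial dependence of Price')
plt.title('Hand-styled partial dependence plot')
plt.show()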