Copyright 2017 J. Patrick Hall, jphall@gwu.edu
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Based on: Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?: Explaining the Predictions of Any Classifier." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144. ACM, 2016.
http://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf
Instead of perturbing a single row of interest to create a local region in which to fit a linear model, some of these examples use a practical sample from the data, say all one-story homes, to create an approximately local region in which to fit a linear model. That linear model can then be validated, and its region examined, to explain the local prediction behavior of a more complex model.
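For contrast, the perturbation approach in the LIME paper generates its local region by drawing random perturbations around the single row being explained, then fits a weighted linear model to the complex model's predictions on that perturbed sample. A minimal sketch of the perturbation step, assuming a numeric pandas row of interest (the names perturb, row, and col_stds are illustrative and not used elsewhere in this notebook):

import numpy as np
import pandas as pd

def perturb(row, col_stds, n_samples=1000, seed=12345):
    # draw normally distributed perturbations around the row of interest;
    # this perturbed sample defines the local region for the linear model
    rng = np.random.RandomState(seed)
    noise = rng.normal(0.0, 1.0, size=(n_samples, len(row)))
    return pd.DataFrame(row.values + noise * col_stds.values,
                        columns=row.index)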
In [1]:
# imports
import h2o
import operator
import numpy as np
import pandas as pd
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
In [2]:
# start h2o
h2o.init()
h2o.remove_all()
In [3]:
# load data
path = '../../03_regression/data/train.csv'
frame = h2o.import_file(path=path)
In [4]:
# assign target and inputs
y = 'SalePrice'
X = [name for name in frame.columns if name not in [y, 'Id']]
In [5]:
# determine column types
# impute
reals, enums = [], []
for key, val in frame.types.items():
    if key in X:
        if val == 'enum':
            enums.append(key)
        else:
            reals.append(key)
_ = frame[reals].impute(method='median')
_ = frame[enums].impute(method='mode')
In [6]:
# split into training and validation
train, valid = frame.split_frame([0.7])
In [7]:
# print out correlated pairs
corr = train[reals].cor().as_data_frame()
for i in range(0, corr.shape[0]):
    for j in range(0, corr.shape[1]):
        if i != j:
            if np.abs(corr.iat[i, j]) > 0.7:
                print(corr.columns[i], corr.columns[j])
In [8]:
# drop one variable from each correlated pair found above
X_reals_decorr = [i for i in reals if i not in ['GarageYrBlt', 'TotRmsAbvGrd', 'TotalBsmtSF', 'GarageCars']]
In [9]:
# train GBM model
model = H2OGradientBoostingEstimator(ntrees=100,
                                     max_depth=10,
                                     distribution='huber',
                                     learn_rate=0.1,
                                     stopping_rounds=5,
                                     seed=12345)
model.train(y=y, x=X_reals_decorr, training_frame=train, validation_frame=valid)
preds = valid['Id'].cbind(model.predict(valid))
In [10]:
# approximately local sample: one-story homes from the validation set
local_frame = preds.cbind(valid.drop(['Id']))
local_frame = local_frame[local_frame['HouseStyle'] == '1Story']
# log transform predictions to tame the skewed price distribution
local_frame['predict'] = local_frame['predict'].log()
local_frame.describe()
In [11]:
%matplotlib inline
In [12]:
# initialize
local_glm = H2OGeneralizedLinearEstimator(lambda_search=True)
# train
local_glm.train(x=X_reals_decorr, y='predict', training_frame=local_frame)
# coefs
print('\nLocal GLM Coefficients:')
for c_name, c_val in sorted(local_glm.coef().items(), key=operator.itemgetter(1)):
    if c_val != 0.0:
        print('%s %s' % (str(c_name + ':').ljust(25), c_val))
# r2
print('\nLocal GLM R-square:\n%.2f' % local_glm.r2())
In [13]:
# ranked predictions plot
pred_frame = local_frame.cbind(local_glm.predict(local_frame))\
                        .as_data_frame()[['predict', 'predict0']]
pred_frame.columns = ['ML Preds.', 'Surrogate Preds.']
pred_frame.sort_values(by='ML Preds.', inplace=True)
pred_frame.reset_index(inplace=True, drop=True)
_ = pred_frame.plot(title='Ranked Predictions Plot')
A ranked predictions plot is a way to visually check whether the surrogate model is a good fit for the complex model. The y-axis is the numeric prediction of both models for a given point. The x-axis is the rank of a point when all points are sorted by their GBM prediction, from lowest on the left to highest on the right. When both sets of predictions are closely aligned, as they are above, this is a good indication that the linear model fits the complex, nonlinear GBM well in the approximately local region.
Both the R2 value and the ranked predictions plot show that the linear model is a good fit in the practical, approximately local sample. This means the regression coefficients are likely a very accurate representation of the behavior of the nonlinear model in this region.
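A numeric companion to the visual check is the rank correlation between the two columns of pred_frame; a value near 1 means the surrogate preserves the GBM's ordering of predictions. A short sketch (this check is illustrative, not part of the original workflow):

# Spearman rank correlation between the GBM and surrogate predictions
rho = pred_frame['ML Preds.'].corr(pred_frame['Surrogate Preds.'],
                                   method='spearman')
print('Prediction rank correlation: %.2f' % rho)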
The local GLM coefficients multiplied by the values in a specific row are estimates of how much each variable contributed to that prediction decision. These values show how a variable and its values were weighted in any given decision by the model. They are crucially important for machine learning interpretability and are often referred to as "local feature importance", "reason codes", or "turn-down codes." The latter phrases are borrowed from credit scoring: credit lenders must provide reasons for turning down a credit application, even for automated decisions. Reason codes can be extracted easily from LIME local feature importance values by simply ranking the variables that played the largest role in a given decision, as sketched after the contribution plot below.
In [14]:
row = 20 # select a row to describe
local_contrib_frame = pd.DataFrame(columns=['Name', 'Local Contribution', 'Sign'])
# multiply values in row by local glm coefficients
for name in local_frame[row, :].columns:
    contrib = 0.0
    try:
        # local contribution = row value * local GLM coefficient
        contrib = local_frame[row, name]*local_glm.coef()[name]
    except (KeyError, TypeError):
        pass  # skip categorical inputs and variables with no coefficient
    if contrib != 0.0:
        local_contrib_frame = local_contrib_frame.append({'Name': name,
                                                          'Local Contribution': contrib,
                                                          'Sign': contrib > 0},
                                                         ignore_index=True)
# plot
_ = local_contrib_frame.plot(x='Name',
                             y='Local Contribution',
                             kind='bar',
                             title='Local Contributions for Row ' + str(row) + '\n',
                             color=local_contrib_frame.Sign.map({True: 'r', False: 'b'}),
                             legend=False)
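Since local_contrib_frame now holds each variable's contribution for this row, reason codes fall out of a simple ranking by contribution magnitude. A minimal sketch (the three-code cutoff is an arbitrary choice for illustration):

# rank variables by absolute local contribution; the top few act as
# reason codes for this row's prediction
ranked = local_contrib_frame.reindex(
    local_contrib_frame['Local Contribution'].abs()
                       .sort_values(ascending=False).index)
print(ranked[['Name', 'Local Contribution']].head(3))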
In [15]:
# approximately local sample: the lowest decile of GBM predictions
local_frame = preds.cbind(valid.drop(['Id'])).as_data_frame()
local_frame.sort_values('predict', axis=0, inplace=True)
local_frame = local_frame.iloc[0: local_frame.shape[0]//10, :]
local_frame = h2o.H2OFrame(local_frame)
local_frame['predict'] = local_frame['predict'].log()
local_frame.describe()
In [16]:
# initialize
local_glm = H2OGeneralizedLinearEstimator(lambda_search=True)
# train
local_glm.train(x=X_reals_decorr, y='predict', training_frame=local_frame)
# ranked predictions plot
pred_frame = local_frame.cbind(local_glm.predict(local_frame))\
                        .as_data_frame()[['predict', 'predict0']]
pred_frame.columns = ['ML Preds.', 'Surrogate Preds.']
pred_frame.sort_values(by='ML Preds.', inplace=True)
pred_frame.reset_index(inplace=True, drop=True)
_ = pred_frame.plot(title='Ranked Predictions Plot')
# r2
print('\nLocal GLM R-square:\n%.2f' % local_glm.r2())
# coefs
print('\nLocal GLM Coefficients:')
for c_name, c_val in sorted(local_glm.coef().items(), key=operator.itemgetter(1)):
    if c_val != 0.0:
        print('%s %s' % (str(c_name + ':').ljust(25), c_val))
Here the R2 value and the ranked predictions plot show a slightly less accurate fit in the local sample, so the regression coefficients and reason codes may be somewhat more approximate than those in the first example.
In [17]:
row = 30 # select a row to describe
local_contrib_frame = pd.DataFrame(columns=['Name', 'Local Contribution', 'Sign'])
# multiply values in row by local glm coefficients
for name in local_frame[row, :].columns:
    contrib = 0.0
    try:
        # local contribution = row value * local GLM coefficient
        contrib = local_frame[row, name]*local_glm.coef()[name]
    except (KeyError, TypeError):
        pass  # skip categorical inputs and variables with no coefficient
    if contrib != 0.0:
        local_contrib_frame = local_contrib_frame.append({'Name': name,
                                                          'Local Contribution': contrib,
                                                          'Sign': contrib > 0},
                                                         ignore_index=True)
# plot
_ = local_contrib_frame.plot(x='Name',
                             y='Local Contribution',
                             kind='bar',
                             title='Local Contributions for Row ' + str(row) + '\n',
                             color=local_contrib_frame.Sign.map({True: 'r', False: 'b'}),
                             legend=False)
In [18]:
h2o.cluster().shutdown(prompt=True)