License


Copyright 2017 J. Patrick Hall, jphall@gwu.edu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Local Interpretable Model-Agnostic Explanations (LIME)


Based on: Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144. ACM, 2016.

http://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf

Instead of perturbing a single row of interest to create a local region in which to fit a linear model, some of these examples use a practical sample from the data, say all one-story homes, to create an approximately local region in which to fit a linear model. That linear model can then be validated, and its coefficients examined, to explain local prediction behavior.

Preliminaries: imports, start h2o, load and clean data


In [1]:
# imports
import h2o 
import operator
import numpy as np
import pandas as pd
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [2]:
# start h2o
h2o.init()
h2o.remove_all()


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_112"; Java(TM) SE Runtime Environment (build 1.8.0_112-b16); Java HotSpot(TM) 64-Bit Server VM (build 25.112-b16, mixed mode)
  Starting server from /Users/phall/anaconda/lib/python3.5/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/tc/0ss1l73113j3wdyjsxmy1j2r0000gn/T/tmpk5t7btqn
  JVM stdout: /var/folders/tc/0ss1l73113j3wdyjsxmy1j2r0000gn/T/tmpk5t7btqn/h2o_phall_started_from_python.out
  JVM stderr: /var/folders/tc/0ss1l73113j3wdyjsxmy1j2r0000gn/T/tmpk5t7btqn/h2o_phall_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: 01 secs
H2O cluster version: 3.12.0.1
H2O cluster version age: 2 months and 9 days
H2O cluster name: H2O_from_python_phall_og2isp
H2O cluster total nodes: 1
H2O cluster free memory: 3.556 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
Python version: 3.5.2 final

Load and prepare data for modeling


In [3]:
# load data
path = '../../03_regression/data/train.csv'
frame = h2o.import_file(path=path)


Parse progress: |█████████████████████████████████████████████████████████| 100%

In [4]:
# assign target and inputs
y = 'SalePrice'
X = [name for name in frame.columns if name not in [y, 'Id']]

LIME is simpler to use with data containing no missing values

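Before imputing, it can help to confirm which inputs actually contain missing values. A minimal sketch, assuming frame and X from the cells above; H2OFrame.nacnt() returns the number of missing values in each column.

# sketch: count missing values in each input column before imputing
na_counts = dict(zip(frame.columns, frame.nacnt()))
print({name: int(count) for name, count in na_counts.items()
       if name in X and count > 0})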

In [5]:
# determine column types
# impute
reals, enums = [], []
for key, val in frame.types.items():
    if key in X:
        if val == 'enum':
            enums.append(key)
        else: 
            reals.append(key)
            
_ = frame[reals].impute(method='median')
_ = frame[enums].impute(method='mode')

In [6]:
# split into training and validation
train, valid = frame.split_frame([0.7])

LIME can be unstable with data in which strong correlations exist between input variables


In [7]:
# print out correlated pairs
corr = train[reals].cor().as_data_frame()
for i in range(0, corr.shape[0]):
    for j in range(0, corr.shape[1]):
        if i != j:
            if np.abs(corr.iat[i, j]) > 0.7:
                print(corr.columns[i], corr.columns[j])


YearBuilt GarageYrBlt
GrLivArea TotRmsAbvGrd
1stFlrSF TotalBsmtSF
TotalBsmtSF 1stFlrSF
TotRmsAbvGrd GrLivArea
GarageCars GarageArea
GarageArea GarageCars
GarageYrBlt YearBuilt

Remove one variable from each correlated pair


In [8]:
X_reals_decorr = [i for i in reals if i not in  ['GarageYrBlt', 'TotRmsAbvGrd', 'TotalBsmtSF', 'GarageCars']]
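
The hand-picked drop list above could also be derived programmatically. A minimal sketch, assuming corr and reals from the cells above; note that it may keep a different member of each pair than the manual choice, and X_reals_decorr2 is just an illustrative name.

# sketch: derive a decorrelated input list from the correlation matrix instead of hard-coding it
drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        # drop the second member of each highly correlated pair,
        # unless the first member was already dropped
        if np.abs(corr.iat[i, j]) > 0.7 and corr.columns[i] not in drop:
            drop.add(corr.columns[j])
X_reals_decorr2 = [name for name in reals if name not in drop]  # illustrative name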

Train a predictive model


In [9]:
# train GBM model
model = H2OGradientBoostingEstimator(ntrees=100,
                                     max_depth=10,
                                     distribution='huber',
                                     learn_rate=0.1,
                                     stopping_rounds=5,
                                     seed=12345)

model.train(y=y, x=X_reals_decorr, training_frame=train, validation_frame=valid)

preds = valid['Id'].cbind(model.predict(valid))


gbm Model Build progress: |███████████████████████████████████████████████| 100%
gbm prediction progress: |████████████████████████████████████████████████| 100%

Build local linear surrogate models to help interpret the model

Create a local region based on HouseStyle


In [10]:
# join GBM predictions onto the validation inputs
local_frame = preds.cbind(valid.drop(['Id']))
# restrict to a practical, approximately local region: one-story homes
local_frame = local_frame[local_frame['HouseStyle'] == '1Story']
# log-transform the predictions; the surrogate will model log(predicted SalePrice)
local_frame['predict'] = local_frame['predict'].log()
local_frame.describe()


Rows:209
Cols:82


[describe() output elided: an 82-column summary table (types, mins, means, maxs, sigma, zero and missing counts) plus the first ten rows; every row has HouseStyle == '1Story', no column contains missing values, and the 'predict' column holds log-scale GBM predictions.]

Train penalized linear model in local region

  • Check R2 to ensure the surrogate model is a good fit for the GBM's predictions
  • Use a ranked predictions plot to visually confirm that the surrogate tracks the GBM's predictions
  • Use the trained GLM and its coefficients to understand the local region of the response function

In [11]:
%matplotlib inline

In [12]:
# initialize
local_glm = H2OGeneralizedLinearEstimator(lambda_search=True)

# train 
local_glm.train(x=X_reals_decorr, y='predict', training_frame=local_frame)

# coefs
print('\nLocal GLM Coefficients:')
for c_name, c_val in sorted(local_glm.coef().items(), key=operator.itemgetter(1)):
    if c_val != 0.0:
        print('%s %s' % (str(c_name + ':').ljust(25), c_val))
        
# r2
print('\nLocal GLM R-square:\n%.2f' % local_glm.r2())


glm Model Build progress: |███████████████████████████████████████████████| 100%

Local GLM Coefficients:
KitchenAbvGr:             -0.1508258670073023
YrSold:                   -0.0038514394943816705
MSSubClass:               -0.00011210487426281287
EnclosedPorch:            -9.130093138694107e-05
2ndFlrSF:                 -3.9194722085630606e-05
MoSold:                   -1.5547455415244997e-05
LotArea:                  3.329666938941041e-06
MiscVal:                  6.695561375680934e-06
OpenPorchSF:              7.334408876566789e-05
GrLivArea:                7.801707438135216e-05
WoodDeckSF:               8.338161186914433e-05
ScreenPorch:              9.387310676953871e-05
BsmtFinSF1:               0.00010125805853647395
GarageArea:               0.00016468298708688275
1stFlrSF:                 0.000240835733035765
HalfBath:                 0.0004316804511827515
LotFrontage:              0.0005389665055779621
YearRemodAdd:             0.0010353227962118725
BsmtFullBath:             0.0019456452443123145
YearBuilt:                0.003081698342094866
BedroomAbvGr:             0.010692862700010637
Fireplaces:               0.014672329995422593
OverallCond:              0.016722109721306774
OverallQual:              0.09232478225180206
Intercept:                10.437151063047025

Local GLM R-square:
0.97

In [13]:
# ranked predictions plot
pred_frame = local_frame.cbind(local_glm.predict(local_frame))\
                        .as_data_frame()[['predict', 'predict0']]
pred_frame.columns = ['ML Preds.', 'Surrogate Preds.']
pred_frame.sort_values(by='ML Preds.', inplace=True)
pred_frame.reset_index(inplace=True, drop=True)
_ = pred_frame.plot(title='Ranked Predictions Plot')


glm prediction progress: |████████████████████████████████████████████████| 100%

A ranked predictions plot is a way to visually check whether the surrogate model is a good fit for the complex model. The y-axis shows each model's numeric prediction for a given point. The x-axis is the rank of each point when the points are sorted by their GBM prediction, from lowest on the left to highest on the right. When both sets of predictions are closely aligned, as they are above, this is a good indication that the linear model fits the complex, nonlinear GBM well in the approximately local region.

Both the R2 and ranked predictions plot show the linear model is a good fit in the practical, approximately local sample. This means the regression coefficients are likely a very accurate representation of the behavior of the nonlinear model in this region.
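
As a complement to the visual check, the surrogate's fidelity can also be summarized numerically. A minimal sketch, assuming the pred_frame built in the cell above:

# sketch: numeric summary of surrogate fidelity (assumes pred_frame from the cell above)
diff = pred_frame['ML Preds.'] - pred_frame['Surrogate Preds.']
print('RMSE between GBM and surrogate predictions: %.4f' % ((diff ** 2).mean() ** 0.5))
print('Correlation between GBM and surrogate predictions: %.4f'
      % pred_frame['ML Preds.'].corr(pred_frame['Surrogate Preds.']))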

Create explanations (or 'reason codes') for a row in the local set

The local GLM coefficients, multiplied by the values in a specific row, are estimates of how much each variable contributed to that prediction. These values tell you how a variable and its values were weighted in any given decision by the model. They are crucially important for machine learning interpretability and are often referred to as "local feature importance", "reason codes", or "turn-down codes." The latter phrases are borrowed from credit scoring: credit lenders must provide reasons for turning down a credit application, even for automated decisions. Reason codes can be extracted from LIME local feature importance values by ranking the variables that played the largest role in any given decision (a short sketch of this ranking step follows the next cell).


In [14]:
row = 20 # select a row to describe
local_contrib_frame = pd.DataFrame(columns=['Name', 'Local Contribution', 'Sign'])

# multiply values in row by local glm coefficients
for name in local_frame[row, :].columns:
    contrib = 0.0
    try:
        contrib = local_frame[row, name]*local_glm.coef()[name]
    except (KeyError, TypeError):
        pass # skip columns with no GLM coefficient and non-numeric columns
    if contrib != 0.0:
        local_contrib_frame = local_contrib_frame.append({'Name':name,
                                                          'Local Contribution': contrib,
                                                          'Sign': contrib > 0}, 
                                                         ignore_index=True)

# plot
_ = local_contrib_frame.plot(x = 'Name',
                             y = 'Local Contribution',
                             kind='bar', 
                             title='Local Contributions for Row ' + str(row) + '\n', 
                             color=local_contrib_frame.Sign.map({True: 'r', False: 'b'}), 
                             legend=False)

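To report reason codes as text rather than a plot, the same contributions can be ranked by absolute magnitude. A minimal sketch, assuming the local_contrib_frame built above; top_n is an arbitrary choice:

# sketch: reason codes = variables ranked by absolute local contribution
top_n = 3  # arbitrary number of reason codes to report
ranked = local_contrib_frame.reindex(
    local_contrib_frame['Local Contribution'].abs().sort_values(ascending=False).index)
print(ranked.head(top_n)[['Name', 'Local Contribution']])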

Create a local region based on predicted SalePrice quantiles (here, the lowest decile of predictions)


In [15]:
# join GBM predictions onto the validation inputs and sort by predicted SalePrice
local_frame = preds.cbind(valid.drop(['Id'])).as_data_frame()
local_frame.sort_values('predict', axis=0, inplace=True)
# keep the lowest decile of predictions as the approximately local region
local_frame = local_frame.iloc[0: local_frame.shape[0]//10, :]
local_frame = h2o.H2OFrame(local_frame)
# log-transform the predictions; the surrogate will model log(predicted SalePrice)
local_frame['predict'] = local_frame['predict'].log()
local_frame.describe()


Parse progress: |█████████████████████████████████████████████████████████| 100%
Rows:44
Cols:82


[describe() output elided: an 82-column summary table (types, mins, means, maxs, sigma, zero and missing counts) plus the first ten rows; these 44 rows are the validation homes with the lowest predicted SalePrice, and the 'predict' column holds log-scale GBM predictions.]

Train penalized linear model in local region


In [16]:
# initialize
local_glm = H2OGeneralizedLinearEstimator(lambda_search=True)

# train 
local_glm.train(x=X_reals_decorr, y='predict', training_frame=local_frame)

# ranked predictions plot
pred_frame = local_frame.cbind(local_glm.predict(local_frame))\
                        .as_data_frame()[['predict', 'predict0']]
pred_frame.columns = ['ML Preds.', 'Surrogate Preds.']
pred_frame.sort_values(by='ML Preds.', inplace=True)
pred_frame.reset_index(inplace=True, drop=True)
_ = pred_frame.plot(title='Ranked Predictions Plot')

# r2
print('\nLocal GLM R-square:\n%.2f' % local_glm.r2())

# coefs
print('\nLocal GLM Coefficients:')
for c_name, c_val in sorted(local_glm.coef().items(), key=operator.itemgetter(1)):
    if c_val != 0.0:
        print('%s %s' % (str(c_name + ':').ljust(25), c_val))


glm Model Build progress: |███████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%

Local GLM R-square:
0.87

Local GLM Coefficients:
FullBath:                 -0.10560466842690448
HalfBath:                 -0.1011673167208127
MoSold:                   -0.008566732387828832
YrSold:                   -0.002307356382236561
EnclosedPorch:            -0.00047646418589632216
MiscVal:                  -7.659111768804664e-05
BsmtFinSF2:               -4.009348055311646e-06
2ndFlrSF:                 2.760992025667719e-07
BsmtFinSF1:               1.0698104588444621e-05
LowQualFinSF:             4.571947416837975e-05
WoodDeckSF:               5.7321738441605914e-05
GarageArea:               0.00017678332091345634
GrLivArea:                0.0003117252960306305
LotFrontage:              0.00095008001255757
YearRemodAdd:             0.001492563117669517
YearBuilt:                0.002536554911375407
BedroomAbvGr:             0.023509735677891618
OverallQual:              0.03550114547540719
BsmtFullBath:             0.038584610439521966
OverallCond:              0.04580958970433358
Fireplaces:               0.11432285083629967
BsmtHalfBath:             0.1795995633777401
Intercept:                7.506954243761411

Here the R2 and ranked predictions plot show a somewhat less accurate fit in this local sample, so the regression coefficients and reason codes are likely more approximate than those in the first example.

Create explanations (or 'reason codes') for a row in the local set


In [17]:
row = 30 # select a row to describe
local_contrib_frame = pd.DataFrame(columns=['Name', 'Local Contribution', 'Sign'])

# multiply values in row by local glm coefficients
for name in local_frame[row, :].columns:
    contrib = 0.0
    try:
        contrib = local_frame[row, name]*local_glm.coef()[name]
    except (KeyError, TypeError):
        pass # skip columns with no GLM coefficient and non-numeric columns
    if contrib != 0.0:
        local_contrib_frame = local_contrib_frame.append({'Name':name,
                                                          'Local Contribution': contrib,
                                                          'Sign': contrib > 0}, 
                                                         ignore_index=True)
# plot            
_ = local_contrib_frame.plot(x = 'Name',
                             y = 'Local Contribution',
                             kind='bar', 
                             title='Local Contributions for Row ' + str(row) + '\n', 
                             color=local_contrib_frame.Sign.map({True: 'r', False: 'b'}),
                             legend=False)


Shutdown H2O


In [18]:
h2o.cluster().shutdown(prompt=True)


Are you sure you want to shutdown the H2O instance running at http://127.0.0.1:54321 (Y/N)? y
H2O session _sid_bca1 closed.