Calculating SHAP Values for GBM models in H2O

Note: This example is adapted from the front-page XGBoost example published with the shap package: https://github.com/slundberg/shap/blob/master/notebooks/tree_explainer/Front%20page%20example%20(XGBoost).ipynb



In [1]:

    
import h2o
import shap
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o import H2OFrame

# initialize H2O
h2o.init()

# load JS visualization code to notebook
shap.initjs()









    



versionFromGradle='3.23.0',projectVersion='3.23.0.99999',branch='michalk_treeshap-expose',lastCommitHash='e945ade4bf70cb04112390d86e1af63d14c53832',gitDescribe='jenkins-master-4610-17-ge945ade-dirty',compiledOn='2019-03-21 11:02:19',compiledBy='mkurka'
Checking whether there is an H2O instance running at http://localhost:54321 . connected.






    




H2O cluster uptime:	2 mins 01 secs
H2O cluster timezone:	America/Los_Angeles
H2O data parsing timezone:	UTC
H2O cluster version:	3.23.0.99999
H2O cluster version age:	1 hour and 25 minutes
H2O cluster name:	mkurka
H2O cluster total nodes:	1
H2O cluster free memory:	3.277 Gb
H2O cluster total cores:	8
H2O cluster allowed cores:	8
H2O cluster status:	locked, healthy
H2O connection url:	http://localhost:54321
H2O connection proxy:	None
H2O internal security:	False
H2O API Extensions:	XGBoost, Algos, AutoML, Core V3, Core V4
Python version:	2.7.14 final



In [2]:

    
# train a GBM model in H2O
X, y = shap.datasets.boston()
boston_housing = H2OFrame(X).cbind(H2OFrame(y, column_names=["medv"]))

model = H2OGradientBoostingEstimator(learn_rate=0.01, ntrees=100)
model.train(training_frame=boston_housing, y="medv")









    



Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
gbm Model Build progress: |███████████████████████████████████████████████| 100%



In [3]:

    
# calculate SHAP values with the model's predict_contributions function
contributions = model.predict_contributions(boston_housing)
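To make clear what the returned contributions represent, here is a minimal brute-force sketch of Shapley values for a toy function. It is not the TreeSHAP algorithm that H2O runs on the GBM trees. The names `toy_model`, `value`, `shapley_values`, and the background rows are all hypothetical, and the value function (averaging over background rows) is an assumption chosen for simplicity. The sketch shows the key property that the per-feature contributions plus a bias term add up to the prediction.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical model: one additive term and one interaction term.
    return 2.0 * x[0] + x[1] * x[2]


def value(subset, x, background):
    # Average model output when the features in `subset` take the row's
    # values and all other features are taken from each background row.
    total = 0.0
    for b in background:
        mixed = [x[i] if i in subset else b[i] for i in range(len(x))]
        total += toy_model(mixed)
    return total / len(background)


def shapley_values(x, background):
    # Exact Shapley values by enumerating every coalition of the other features.
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}, x, background)
                                 - value(s, x, background))
        phis.append(phi)
    return phis


# Hypothetical background data and the row to explain.
background = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
row = [3.0, 2.0, 1.0]

phi = shapley_values(row, background)
bias_term = value(set(), row, background)  # expected output with no features known
prediction = toy_model(row)

# Contributions plus the bias term reconstruct the prediction.
print(sum(phi) + bias_term, prediction)
```

Enumerating coalitions is exponential in the number of features, which is why tree-specific algorithms are used in practice.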



In [4]:

    
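# convert the H2O Frame to a NumPy array for use with shap's visualization functions
# (.values replaces DataFrame.as_matrix(), which was removed in pandas 1.0)
contributions_matrix = contributions.as_data_frame().values
# the first 13 columns hold the SHAP values, one column per feature
shap_values = contributions_matrix[:,0:13]
# the last column is the bias term (expected value); taking min() assumes it is the same in every row
expected_value = contributions_matrix[:,13].min()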
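The split above can be sanity-checked on a small table. The sketch below uses a hypothetical three-feature contributions table with made-up numbers, laid out the way the cell above assumes: feature contributions first, bias term in the last column. It separates the two parts and rebuilds each row's prediction as the sum of its contributions plus the bias term.

```python
# Hypothetical contributions table: three feature columns, bias term last.
contributions_rows = [
    [1.5, -0.5, 0.25, 22.0],
    [-2.0, 0.75, 0.0, 22.0],
    [0.5, 0.5, -1.0, 22.0],
]

n_features = len(contributions_rows[0]) - 1

# Per-feature SHAP values: every column except the last.
shap_rows = [r[:n_features] for r in contributions_rows]

# Bias term: the last column, assumed to be constant across rows.
bias_terms = [r[n_features] for r in contributions_rows]
expected = bias_terms[0]
assert all(b == expected for b in bias_terms)

# Each prediction is the bias term plus the sum of that row's contributions.
reconstructed = [sum(r) + expected for r in shap_rows]
print(reconstructed)
```

With real model output, the same check would compare the reconstructed values against the model's predictions for the same rows.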



In [5]:

    
# visualize the first prediction's explanation
shap.force_plot(expected_value, shap_values[0,:], X.iloc[0,:])









    Out[5]:

  [Interactive force plot for the first prediction; not rendered in this export]



In [6]:

    
# visualize the training set predictions
shap.force_plot(expected_value, shap_values, X)









    Out[6]:

  [Interactive force plot for the training set predictions; not rendered in this export]



In [7]:

    
# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("RM", shap_values, X)



In [8]:

    
# summarize the effects of all the features
shap.summary_plot(shap_values, X)



In [9]:

    
# summarize overall feature importance as a bar chart of mean absolute SHAP values
shap.summary_plot(shap_values, X, plot_type="bar")
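The bar variant of the summary plot ranks features by their mean absolute SHAP value. The sketch below reproduces that ranking by hand on a small hypothetical table. The feature names are columns from the Boston housing data, but the numbers are made up for illustration.

```python
# Hypothetical SHAP values: one row per observation, one column per feature.
feature_names = ["RM", "LSTAT", "CRIM"]
shap_rows = [
    [1.5, -0.5, 0.25],
    [-2.0, 0.75, 0.0],
    [0.5, 0.5, -1.0],
]

n_rows = len(shap_rows)

# Mean absolute SHAP value per feature: sign is ignored, so features that
# push predictions strongly in either direction rank as important.
mean_abs = {
    name: sum(abs(r[j]) for r in shap_rows) / n_rows
    for j, name in enumerate(feature_names)
}

# Most important feature first.
ranking = sorted(mean_abs, key=mean_abs.get, reverse=True)
print(ranking)
```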


