Note: this example is adapted from an example published in the shap package: https://github.com/slundberg/shap/blob/master/notebooks/tree_explainer/Front%20page%20example%20(XGBoost).ipynb
In [4]:
import h2o
import shap
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o import H2OFrame
# initialize H2O
h2o.init()
# load JS visualization code to notebook
shap.initjs()
In [6]:
# train a distributed random forest model in H2O
X, y = shap.datasets.boston()
boston_housing = H2OFrame(X).cbind(H2OFrame(y, column_names=["medv"]))
model = H2ORandomForestEstimator(ntrees=100)
model.train(training_frame=boston_housing, y="medv")
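When x is omitted, H2O uses every column other than the response as a predictor. A minimal sketch of the equivalent call with the predictor columns passed explicitly:
# equivalent training call with the predictors listed explicitly
predictors = [c for c in boston_housing.columns if c != "medv"]
model.train(x=predictors, y="medv", training_frame=boston_housing)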
In [7]:
# calculate SHAP values using the predict_contributions function
contributions = model.predict_contributions(boston_housing)
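The returned frame contains one contribution column per feature plus a BiasTerm column, and for a regression model the values in each row sum to the model's prediction for that row. A minimal sanity check:
import numpy as np
# per-feature contributions plus the bias term should reproduce the prediction for each row
preds = model.predict(boston_housing).as_data_frame().values.ravel()
contrib_sums = contributions.as_data_frame().values.sum(axis=1)
print(np.allclose(preds, contrib_sums))  # expected to print True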
In [8]:
# convert the H2O frame to a NumPy array for use with shap's visualization functions
contributions_matrix = contributions.as_data_frame().values
# SHAP values are returned for every feature (the first 13 columns)
shap_values = contributions_matrix[:, 0:13]
# the expected value (bias term) is the last returned column and is the same for every row
expected_value = contributions_matrix[:, 13].min()
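Taking the minimum of the last column works because the bias term is identical for every row (it is the model's expected prediction). A quick check:
import numpy as np
# the BiasTerm column holds one repeated value, so min(), max(), or the first entry all give the expected value
print(np.unique(contributions_matrix[:, 13]))  # expected: a single value equal to expected_value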
In [9]:
# visualize the first prediction's explanation
shap.force_plot(expected_value, shap_values[0,:], X.iloc[0,:])
Out[9]:
In [10]:
# visualize the training set predictions
shap.force_plot(expected_value, shap_values, X)
Out[10]:
In [11]:
# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("RM", shap_values, X)
In [12]:
# summarize the effects of all the features
shap.summary_plot(shap_values, X)
In [13]:
shap.summary_plot(shap_values, X, plot_type="bar")
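The bar summary plot ranks features by the mean absolute SHAP value. The same global ranking can be reproduced directly from the SHAP value matrix; a minimal sketch:
import numpy as np
# compute mean |SHAP value| per feature and print features from most to least important
mean_abs_shap = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(X.columns, mean_abs_shap), key=lambda t: -t[1]):
    print(f"{name}: {value:.3f}")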