Simple Boston Demo

The PartitionExplainer is still in an alpha state, but this notebook demonstrates how to use it right now. I am releasing it early to get feedback and to show how I am working to address concerns about the speed of our model-agnostic approaches and the impact of feature correlations. This is all as-yet unpublished work, so treat it accordingly.

When given a balanced partition tree, PartitionExplainer has $O(M^2)$ runtime, where $M$ is the number of input features. This is much better than the $O(2^M)$ runtime of KernelExplainer.
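To make the $O(2^M)$ figure concrete, here is a minimal brute-force sketch of exact Shapley values against a set of background rows. It is not how KernelExplainer or PartitionExplainer is implemented; the function name `exact_shapley`, the toy linear model, and the random data are all assumptions made for illustration. The point is only that the value function has to be evaluated once per coalition, and there are $2^M$ coalitions.

```python
import itertools
import math

import numpy as np


def exact_shapley(f, x, refs):
    """Brute-force Shapley values of f at x, relative to background rows refs.

    Features inside a coalition take their value from x; all other features
    keep the values of each background row, and f is averaged over the rows.
    """
    M = x.shape[0]

    # Evaluate the value function once for every one of the 2^M coalitions.
    values = {}
    for size in range(M + 1):
        for S in itertools.combinations(range(M), size):
            idx = np.array(S, dtype=int)
            masked = refs.copy()
            masked[:, idx] = x[idx]
            values[S] = f(masked).mean()

    # Combine the coalition values with the usual Shapley weights.
    phi = np.zeros(M)
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            weight = (math.factorial(size) * math.factorial(M - size - 1)
                      / math.factorial(M))
            for S in itertools.combinations(others, size):
                S_with_i = tuple(sorted(S + (i,)))
                phi[i] += weight * (values[S_with_i] - values[S])
    return phi, len(values)


# Toy setup (hypothetical): a linear model on 4 random features.
w = np.array([1.0, -2.0, 0.5, 3.0])


def linear_model(X):
    return X @ w


rng = np.random.default_rng(0)
x = rng.normal(size=4)
refs = rng.normal(size=(20, 4))

phi, n_coalitions = exact_shapley(linear_model, x, refs)
print(n_coalitions, 2 ** len(x))  # number of coalitions evaluated vs 2^M
```

For the 13 Boston features the same enumeration would need $2^{13} = 8192$ coalitions, compared with $13^2 = 169$ for a quantity that grows like $M^2$ (ignoring constant factors in both cases).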



In [3]:

    
import numpy as np
import scipy as sp
import scipy.cluster
import scipy.spatial  # sp.spatial.distance.pdist is used below
import matplotlib.pyplot as pl
import xgboost
import shap
import pandas as pd

Train the model



In [4]:

    
X,y = shap.datasets.boston()

model = xgboost.XGBRegressor(n_estimators=100, subsample=0.3)
model.fit(X, y)

x = X.values[0:1,:]
refs = X.values[1:100] # use rows 1-99 (99 samples) as background references; using the whole dataset would be slower

[11:36:43] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.

Compute a hierarchical clustering of the input features



In [5]:

    
D = sp.spatial.distance.pdist(X.fillna(X.mean()).T, metric="correlation")
cluster_matrix = sp.cluster.hierarchy.complete(D)
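The clustering above is returned as a SciPy linkage matrix, which is what defines the partition tree over features. As a rough sketch of how to read that matrix, the example below builds one from synthetic data (the four made-up features and the helper `cluster_members` are assumptions, not part of this notebook) and lists which original features sit under each node. Each row of the matrix is `[child_a, child_b, distance, leaf_count]`; indices below `M` are original features, and index `M + i` refers to the cluster created by row `i`.

```python
import numpy as np
import scipy.cluster.hierarchy
import scipy.spatial.distance

# Synthetic data (hypothetical): features 0 and 1 are strongly correlated,
# as are features 2 and 3; the two pairs are independent of each other.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=200),
])

# Same recipe as in the notebook: correlation distance between feature
# columns, then complete-linkage clustering.
D = scipy.spatial.distance.pdist(data.T, metric="correlation")
Z = scipy.cluster.hierarchy.complete(D)


def cluster_members(Z):
    """Map every node index (leaf or merged cluster) to its leaf features."""
    M = Z.shape[0] + 1  # number of leaves
    members = {i: [i] for i in range(M)}
    for row, (a, b, _dist, _count) in enumerate(Z):
        members[M + row] = members[int(a)] + members[int(b)]
    return members


members = cluster_members(Z)
for node in range(data.shape[1], len(members)):
    print(node, sorted(members[node]))
```

With `M` features the matrix always has `M - 1` rows, and the last row is the root that contains every feature.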



In [6]:

    
# plot the clustering
pl.figure(figsize=(15, 6))
pl.title('Hierarchical Clustering Dendrogram')
pl.xlabel('feature')  # the leaves are input features, not samples
pl.ylabel('distance')
sp.cluster.hierarchy.dendrogram(
    cluster_matrix,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=10.,  # font size for the x axis labels
    labels=X.columns
)
pl.show()

Explain the first sample with PartitionExplainer



In [12]:

    
# define the model as a python function 
f = lambda x: model.predict(x, output_margin=True, validate_features=False)

# explain the model
e = shap.PartitionExplainer(f, refs, cluster_matrix)
shap_values = e.shap_values(x, tol=-1)
# ...or use something like e.shap_values(x, tol=0.001) to prune the partition tree and so run faster

Compare with TreeExplainer



In [13]:

    
explainer = shap.TreeExplainer(model, refs, feature_dependence="independent")
shap_values2 = explainer.shap_values(x)



In [14]:

    
pl.plot(shap_values2[0], label="TreeExplainer")
pl.plot(shap_values[0], label="PartitionExplainer")
pl.legend()

    Out[14]:

<matplotlib.legend.Legend at 0x1c21312f98>
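The plot gives a visual comparison; a numeric summary can make small disagreements easier to spot. The helper below is only a sketch: `compare_attributions` is a hypothetical name, not part of shap, and it assumes both explainers return one attribution per feature in the same order.

```python
import numpy as np


def compare_attributions(a, b):
    """Summarize how far apart two attribution vectors are."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    if a.shape != b.shape:
        raise ValueError("attribution vectors must have the same length")
    return {
        # largest per-feature disagreement
        "max_abs_diff": float(np.max(np.abs(a - b))),
        # difference in total attribution
        "sum_diff": float(a.sum() - b.sum()),
    }


# In this notebook it could be called as:
# compare_attributions(shap_values[0], shap_values2[0])
```

If both methods explain the same model output against the same background, `sum_diff` should be close to zero even when individual features differ, since Shapley-style attributions sum to the gap between the prediction and the background average.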