SHAP with XGBoost


In [2]:
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
import pandas as pd

In [3]:
X,y = shap.datasets.boston()
X.head()


Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

In [4]:
print(y.shape)  # predict house price
y[4:10]


(506,)
Out[4]:
array([36.2, 28.7, 22.9, 27.1, 16.5, 18.9])

In [5]:
y = pd.DataFrame(y)
y.head()


Out[5]:
0
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2

In [6]:
# for regression, a stratified split cannot be used directly (stratify expects discrete class labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=410, train_size=0.75, test_size=0.25)
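Although `stratify` cannot take a continuous target directly, one common workaround (a sketch, not part of this notebook; the synthetic `X`, `y`, and bin count are assumptions for illustration) is to bin `y` into quantiles and stratify on the bins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic continuous target and features, for illustration only
rng = np.random.default_rng(0)
y = rng.normal(22, 9, size=200)
X = rng.normal(size=(200, 3))

# Bin y into quartiles so each stratum has roughly equal size
bins = np.quantile(y, [0.25, 0.5, 0.75])
y_binned = np.digitize(y, bins)  # labels 0-3

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y_binned, train_size=0.75, random_state=410)
print(X_tr.shape, X_te.shape)
```

This keeps the distribution of the target roughly balanced between train and test, which a purely random split does not guarantee on small datasets.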

In [7]:
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

In [8]:
param_dist = {'learning_rate':0.01}
model = xgb.XGBRegressor(**param_dist)
model.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=False)


Out[8]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [9]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

In [10]:
print(shap_values.shape)
shap_values


(506, 13)
Out[10]:
array([[ 1.10400066e-01,  0.00000000e+00,  0.00000000e+00, ...,
         2.28839107e-02, -4.38051298e-03,  3.22986555e+00],
       [ 1.69313133e-01,  0.00000000e+00,  0.00000000e+00, ...,
         2.28839107e-02, -4.38051298e-03,  1.39101851e+00],
       [ 2.04872221e-01,  0.00000000e+00,  0.00000000e+00, ...,
         1.93794537e-02,  2.42282338e-02,  4.48239899e+00],
       ...,
       [ 1.86996222e-01,  0.00000000e+00,  0.00000000e+00, ...,
        -1.97763480e-02, -4.38051298e-03,  2.35479021e+00],
       [ 1.70009464e-01,  0.00000000e+00,  0.00000000e+00, ...,
        -2.07108743e-02,  5.59995696e-03,  2.37453294e+00],
       [ 1.67378768e-01,  0.00000000e+00,  0.00000000e+00, ...,
        -2.07108743e-02, -4.38051298e-03,  1.41540968e+00]], dtype=float32)

In [11]:
# If the JS library loads successfully, this generates an interactive visualization
shap.initjs()

check_row = 7  # check each individual case
shap.force_plot(explainer.expected_value, shap_values[check_row,:], X.iloc[check_row,:])


Out[11]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
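The force plot works because SHAP values are additive: the model's prediction for a row equals `explainer.expected_value` plus the sum of that row's SHAP values. A minimal numpy sketch of this identity, using made-up numbers (not the notebook's actual outputs):

```python
import numpy as np

expected_value = 22.5  # hypothetical base value (mean model output over the data)
row_shap = np.array([0.58, 0.12, -1.16, 0.29, -2.85])  # hypothetical SHAP values for one row

# Prediction = base value + sum of the row's feature contributions
prediction = expected_value + row_shap.sum()
print(round(prediction, 2))  # 19.48
```

On the real model, the same check is `explainer.expected_value + shap_values[i].sum()`, which should match `model.predict` for row `i` up to floating-point error.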

Summarize Feature Importance

Summary Plot

  • Pink points mark higher feature values; the farther a point sits from the center, the larger that value's impact on the prediction. Points toward the left push the prediction toward lower values (the negative class); points toward the right push it toward higher values (the positive class).
  • Features are ranked from top to bottom in decreasing order of importance.

In [12]:
shap.summary_plot(shap_values, X)



In [37]:
X.iloc[7:9, :]


Out[37]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93

In [40]:
shap_values[7:9, :]


Out[40]:
array([[ 5.77595115e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.21741794e-01, -1.16346812e+00,
         5.98867657e-04,  2.94446021e-01,  0.00000000e+00,
         0.00000000e+00,  4.10547182e-02,  4.53771232e-03,
        -2.85411382e+00],
       [ 5.87462187e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  2.35860541e-01, -1.19827282e+00,
         1.06076524e-03,  3.01349074e-01,  0.00000000e+00,
         0.00000000e+00,  1.39980584e-01,  5.59995696e-03,
        -3.39539766e+00]], dtype=float32)

Check Individual Cases

  • Only 2 records are shown here: red marks the higher feature value and blue the lower one, relative to each other rather than to the whole population.
  • A record toward the top left tends to contribute to the negative class/lower-value prediction; toward the top right, to the positive class/higher-value prediction.
    • In real work, however, we have seen opposite cases, where a record with a genuinely high positive-class probability had feature values that all pointed toward the negative class.

In [36]:
shap.summary_plot(shap_values[7:9, :], X.iloc[7:9, :])



In [55]:
# compare with xgboost feature importance (by default it ranks features by gain)
print(X.columns)
model.feature_importances_


Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')
Out[55]:
array([0.16103059, 0.        , 0.        , 0.        , 0.09178744,
       0.27858293, 0.00161031, 0.13204509, 0.        , 0.        ,
       0.01771337, 0.01127214, 0.30595812], dtype=float32)

In [57]:
# using the mean absolute SHAP value to rank the features
shap.summary_plot(shap_values, X, plot_type="bar")
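The bar plot above ranks each feature by the mean of its absolute SHAP values across all rows. That ordering can be reproduced by hand; a numpy sketch with made-up SHAP values for a few features (the numbers and feature subset are illustrative, not the notebook's actual outputs):

```python
import numpy as np

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX']  # subset, for illustration
shap_values = np.array([[ 0.11, 0.0, 0.0, 0.02, -0.004],
                        [ 0.17, 0.0, 0.0, 0.02, -0.004],
                        [-0.20, 0.0, 0.0, 0.02,  0.024]])  # hypothetical (3 rows x 5 features)

# Mean |SHAP| per feature, then sort in decreasing order of importance
mean_abs = np.abs(shap_values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
for i in order:
    print(feature_names[i], round(mean_abs[i], 4))
```

Unlike xgboost's gain-based `feature_importances_`, this ranking is computed in the units of the model output, which is why the two orderings can disagree.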