SHAP with XGBoost


In [2]:
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
import pandas as pd

In [3]:
X,y = shap.datasets.boston()
X.head()


Out[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

In [4]:
print(y.shape)  # predict house price
y[4:10]


(506,)
Out[4]:
array([36.2, 28.7, 22.9, 27.1, 16.5, 18.9])

In [5]:
y = pd.DataFrame(y)
y.head()


Out[5]:
0
0 24.0
1 21.6
2 34.7
3 33.4
4 36.2

In [6]:
# for regression, a stratified split cannot be used directly (stratify expects discrete class labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=410, train_size=0.75, test_size=0.25)
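Although `stratify` cannot take a continuous target directly, one common workaround (a sketch, not part of this notebook; the synthetic `X`, `y`, and bin count are assumptions for illustration) is to bin `y` into quantiles and stratify on the bins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic continuous target and features, for illustration only
rng = np.random.default_rng(0)
y = rng.normal(22, 9, size=200)
X = rng.normal(size=(200, 3))

# Bin y into quartiles so each stratum has roughly equal size
bins = np.quantile(y, [0.25, 0.5, 0.75])
y_binned = np.digitize(y, bins)  # labels 0-3

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y_binned, train_size=0.75, random_state=410)
print(X_tr.shape, X_te.shape)
```

This keeps the distribution of the target roughly balanced between train and test, which a purely random split does not guarantee on small datasets.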

In [7]:
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

In [8]:
param_dist = {'learning_rate':0.01}
model = xgb.XGBRegressor(**param_dist)
model.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        verbose=False)


Out[8]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.01, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [9]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

In [10]:
print(shap_values.shape)
shap_values


(506, 13)
Out[10]:
array([[ 1.10400066e-01,  0.00000000e+00,  0.00000000e+00, ...,
         2.28839107e-02, -4.38051298e-03,  3.22986555e+00],
       [ 1.69313133e-01,  0.00000000e+00,  0.00000000e+00, ...,
         2.28839107e-02, -4.38051298e-03,  1.39101851e+00],
       [ 2.04872221e-01,  0.00000000e+00,  0.00000000e+00, ...,
         1.93794537e-02,  2.42282338e-02,  4.48239899e+00],
       ...,
       [ 1.86996222e-01,  0.00000000e+00,  0.00000000e+00, ...,
        -1.97763480e-02, -4.38051298e-03,  2.35479021e+00],
       [ 1.70009464e-01,  0.00000000e+00,  0.00000000e+00, ...,
        -2.07108743e-02,  5.59995696e-03,  2.37453294e+00],
       [ 1.67378768e-01,  0.00000000e+00,  0.00000000e+00, ...,
        -2.07108743e-02, -4.38051298e-03,  1.41540968e+00]], dtype=float32)

In [11]:
# If the JS library loads successfully, this generates an interactive visualization
shap.initjs()

check_row = 7  # check each individual case
shap.force_plot(explainer.expected_value, shap_values[check_row,:], X.iloc[check_row,:])


Out[11]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
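The force plot works because SHAP values are additive: the model's prediction for a row equals `explainer.expected_value` plus the sum of that row's SHAP values. A minimal numpy sketch of this identity, using made-up numbers (not the notebook's actual outputs):

```python
import numpy as np

expected_value = 22.5  # hypothetical base value (mean model output over the data)
row_shap = np.array([0.58, 0.12, -1.16, 0.29, -2.85])  # hypothetical SHAP values for one row

# Prediction = base value + sum of the row's feature contributions
prediction = expected_value + row_shap.sum()
print(round(prediction, 2))  # 19.48
```

On the real model, the same check is `explainer.expected_value + shap_values[i].sum()`, which should match `model.predict` for row `i` up to floating-point error.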

Summarize Feature Importance

Summary Plot

  • Pink points mark higher feature values; the farther a point sits from the center, the larger that value's impact on the prediction. Points toward the left push the prediction toward lower values (the negative class); points toward the right push it toward higher values (the positive class).
  • Features are ranked from top to bottom in decreasing order of importance.

In [12]:
shap.summary_plot(shap_values, X)



In [37]:
X.iloc[7:9, :]


Out[37]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
7 0.14455 12.5 7.87 0.0 0.524 6.172 96.1 5.9505 5.0 311.0 15.2 396.90 19.15
8 0.21124 12.5 7.87 0.0 0.524 5.631 100.0 6.0821 5.0 311.0 15.2 386.63 29.93

In [40]:
shap_values[7:9, :]


Out[40]:
array([[ 5.77595115e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  1.21741794e-01, -1.16346812e+00,
         5.98867657e-04,  2.94446021e-01,  0.00000000e+00,
         0.00000000e+00,  4.10547182e-02,  4.53771232e-03,
        -2.85411382e+00],
       [ 5.87462187e-01,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00,  2.35860541e-01, -1.19827282e+00,
         1.06076524e-03,  3.01349074e-01,  0.00000000e+00,
         0.00000000e+00,  1.39980584e-01,  5.59995696e-03,
        -3.39539766e+00]], dtype=float32)

Check Individual Cases

  • Only 2 records are shown here: red marks the higher feature value and blue the lower one, relative to each other rather than to the whole population.
  • A record toward the top left tends to contribute to the negative class/lower-value prediction; toward the top right, to the positive class/higher-value prediction.
    • In real work, however, we have seen opposite cases, where a record with a genuinely high positive-class probability had feature values that all pointed toward the negative class.

In [36]:
shap.summary_plot(shap_values[7:9, :], X.iloc[7:9, :])



In [55]:
# compare with xgboost feature importance (by default it ranks features by gain)
print(X.columns)
model.feature_importances_


Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')
Out[55]:
array([0.16103059, 0.        , 0.        , 0.        , 0.09178744,
       0.27858293, 0.00161031, 0.13204509, 0.        , 0.        ,
       0.01771337, 0.01127214, 0.30595812], dtype=float32)

In [57]:
# using the mean absolute SHAP value to rank the features
shap.summary_plot(shap_values, X, plot_type="bar")
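The bar plot above ranks each feature by the mean of its absolute SHAP values across all rows. That ordering can be reproduced by hand; a numpy sketch with made-up SHAP values for a few features (the numbers and feature subset are illustrative, not the notebook's actual outputs):

```python
import numpy as np

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX']  # subset, for illustration
shap_values = np.array([[ 0.11, 0.0, 0.0, 0.02, -0.004],
                        [ 0.17, 0.0, 0.0, 0.02, -0.004],
                        [-0.20, 0.0, 0.0, 0.02,  0.024]])  # hypothetical (3 rows x 5 features)

# Mean |SHAP| per feature, then sort in decreasing order of importance
mean_abs = np.abs(shap_values).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
for i in order:
    print(feature_names[i], round(mean_abs[i], 4))
```

Unlike xgboost's gain-based `feature_importances_`, this ranking is computed in the units of the model output, which is why the two orderings can disagree.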