Workshop: Explaining and Interpreting XGBoost Models

This workshop will focus on two primary objectives:

Understanding the overall dynamics of your data and your model:

  • Using more sophisticated modeling packages (like XGBoost) to understand more complicated dynamics in the data

  • How to approach data exploration to understand more complicated relationships between the variables in your data

  • Why the "coherence" of a model is important - arguably, on the same level as its predictive performance

  • How to assess the "coherence" of a model using ICE plots

Understanding and explaining individual predictions from the model

  • How to ascribe "reasons" to individual predictions

  • How to "consolidate" features to make the reasons more coherent and understandable

  • Using visualizations, both standalone and from the SHAP package


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('classic')  # I prefer the appearance of the "old-school" color scheme

%matplotlib inline
from sklearn.metrics import roc_auc_score, confusion_matrix, log_loss, accuracy_score, r2_score, roc_curve, precision_recall_curve
import xgboost as xgb


pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
np.set_printoptions(edgeitems=30, linewidth=100000)

In [2]:
# must have ml_insights installed ('pip install ml_insights')
import ml_insights as mli

In [3]:
mli.__version__


Out[3]:
'0.0.21'

Load data

You can get the data at:

https://drive.google.com/file/d/1HCz_68wCH_3Qp0pGz3R4Q0fmd4FA2Pah/view?usp=sharing

or

https://bit.ly/2JA3Xcm

Download the file and place it in the same directory as the notebook.


In [4]:
df_ins = pd.read_csv('ins_data_ODSC_West_2019.csv')
df_ins.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10670 entries, 0 to 10669
Data columns (total 20 columns):
Unnamed: 0                   10670 non-null int64
has_umbrella                 10670 non-null int64
homeown_premium              10670 non-null float64
auto_premium                 10670 non-null float64
num_home_pol                 10670 non-null float64
home_dwell_cov               8781 non-null float64
home_pers_prop_cov           10015 non-null float64
num_vehicles                 10514 non-null float64
max_vehicle_year             10514 non-null float64
min_vehicle_year             10514 non-null float64
num_drivers_by_dob           10670 non-null float64
yob_policyholder             8706 non-null float64
max_driver_yob               9334 non-null float64
min_driver_yob               9334 non-null float64
state_abbrev                 10632 non-null object
avg_homeown_rate_in_state    10632 non-null float64
median_household_income      10518 non-null float64
median_house_value           10518 non-null float64
fold_num_1                   10670 non-null int64
fold_num_2                   10670 non-null int64
dtypes: float64(15), int64(4), object(1)
memory usage: 1.6+ MB

In [ ]:

Data Exploration

This data set contains information about different insurance customers who have both home insurance and auto insurance with an agency. The information includes their premiums, coverages, vehicles, drivers, state of residence, and zip-code demographics.

In this exercise we will build a model to predict who is likely to have an umbrella policy. Imagine that our goal is to recommend, to new agencies (which currently don't sell umbrella policies), which customers are the best candidates. Our binary outcome variable is has_umbrella.

We will focus on techniques for exploring, understanding, and interpreting data and models in a binary classification framework.


In [ ]:

Look at the "baseline" (marginal) probability of the outcome variable

It is always a good idea to know ahead of time what the baseline probability of your outcome variable is. Without this information, it is difficult to properly interpret the effectiveness of your model.


In [5]:
df_ins.has_umbrella.value_counts(), np.mean(df_ins.has_umbrella)


Out[5]:
(0    9457
 1    1213
 Name: has_umbrella, dtype: int64, 0.11368322399250234)

Initial Predictors

To start, let's look only at the variables homeown_premium and auto_premium and see how they relate to the outcome variable has_umbrella.


In [6]:
plt.hist(df_ins.homeown_premium)


Out[6]:
(array([9.894e+03, 6.010e+02, 1.070e+02, 3.800e+01, 1.300e+01, 9.000e+00, 1.000e+00, 3.000e+00, 0.000e+00, 4.000e+00]),
 array([    0.,  2270.,  4540.,  6810.,  9080., 11350., 13620., 15890., 18160., 20430., 22700.]),
 <a list of 10 Patch objects>)

In [7]:
# Generally a good idea to specify your own bin edges

plt.hist(df_ins.homeown_premium, bins=np.linspace(0,10000,101));



In [8]:
bins_1 = np.linspace(0,2000,9)
bins_2 = np.array([2000,3000, 6000, 10000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
# histogram_pair shows the feature's distribution alongside the rate of the binary outcome within each bin
mli.histogram_pair(df_ins.homeown_premium, df_ins.has_umbrella, bins=bins_final);



In [9]:
plt.hist(df_ins.auto_premium);



In [10]:
plt.hist(df_ins.auto_premium,np.linspace(0,15000,30+1));



In [ ]:


In [11]:
bins_1 = np.linspace(0,4000,8+1)
bins_2 = np.array([4000,6000, 8000, 15000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
mli.histogram_pair(df_ins.auto_premium, df_ins.has_umbrella, bins=bins_final);



In [12]:
## We could do this for all of the variables to understand each one's relationship to the outcome...

In [ ]:

Simple model: 2 variables

It is often a good idea to build a model on a small number of variables first, just to get a sense of the behavior.


In [13]:
chosen_fold_variant = 'fold_num_1'
test_fold_num = 0

In [14]:
features_1 = ['auto_premium','homeown_premium']

Question

  • How high do you expect your precision to get? (at, say, recall=.1)
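
Keep in mind the baseline computed earlier: only about 11.4% of customers have an umbrella policy, so a model with no signal at all would achieve a precision of roughly 0.114 at any recall.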

In [ ]:


In [ ]:


In [15]:
# Define train and test sets

X_train_1 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1]
X_test_1 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1]

y_train = df_ins.has_umbrella[df_ins[chosen_fold_variant]!=test_fold_num]
y_test = df_ins.has_umbrella[df_ins[chosen_fold_variant]==test_fold_num]

In [ ]:

First model - default XGBoost


In [16]:
xgb_def1 = xgb.XGBClassifier()

In [17]:
xgb_def1.fit(X_train_1,y_train)
pred_probs_def1 = xgb_def1.predict_proba(X_test_1)[:,1]

Question: How should we evaluate this model?


In [18]:
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)


Out[18]:
(0.5888486568695068, 0.35346403439752083)

In [19]:
mli.plot_pr_curves([y_test],[pred_probs_def1])


To Do: Perform your own evaluation of the model here
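
One possible approach, as a sketch: answer the earlier question directly by reading off the precision at recall ≈ 0.1 (precision_recall_curve was already imported at the top of the notebook).

In [ ]:
# A sketch: find the point on the PR curve where recall is closest to 0.1
prec, rec, _ = precision_recall_curve(y_test, pred_probs_def1)
idx = np.argmin(np.abs(rec - 0.1))
print('precision at recall ~ 0.1:', prec[idx], '(recall =', rec[idx], ')')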


In [ ]:


In [ ]:


In [ ]:

Other, less commonly used ways to "examine" your model

  • Calibration (Reliability Diagram)
  • Sensitivity to changes in individual features (Individual Conditional Expectation (ICE) plots)
  • Coherence - does the "story" of the model hold together?

Calibration

Idea: When your model predicts a probability of 20%, does the outcome really happen 20% of the time?

Method:

  • Look at the test-set cases where your model predicted outcome 1 with a probability of (approximately) 20%.
  • Look at the fraction of those cases where the outcome actually occurred.
  • Determine whether that fraction is within the expected range if the predicted probability were accurate.
  • Doing this requires binning the data.
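
Before calling the packaged function below, here is a minimal hand-rolled sketch of the same binned comparison, assuming the y_test and pred_probs_def1 objects from the cells above (mli.plot_reliability_diagram in the next cell does this properly, with error bars and sensible bins):

In [ ]:
# A sketch of a binned calibration check: within each bin of predicted
# probabilities, compare the mean prediction to the actual outcome rate
bins = np.linspace(0, 1, 11)
bin_ids = np.digitize(pred_probs_def1, bins) - 1
for b in range(len(bins) - 1):
    mask = (bin_ids == b)
    if mask.sum() > 0:
        print('bin [{:.1f}, {:.1f}): mean pred = {:.3f}, actual rate = {:.3f}, n = {}'.format(
            bins[b], bins[b + 1], pred_probs_def1[mask].mean(), y_test.values[mask].mean(), mask.sum()))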

In [20]:
plt.figure(figsize=(12,4))
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))

mli.plot_reliability_diagram(y_test,pred_probs_def1,size_points=False, bins=bins_final);


Question: What do you think of the above reliability diagram?


In [ ]:

Individual Conditional Expectation Plot

Idea:

  • Hold all features constant except for one.
  • Vary the value of that feature over a range of values.
  • Examine how the model prediction changes.
  • Examine how the response curve may take different shapes (because of interactions).
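
Before using the ModelXRay convenience class below, here is what a single ICE curve looks like computed by hand - a sketch assuming the xgb_def1 model and X_test_1 from above (the choice of row and feature is purely illustrative):

In [ ]:
# A sketch of one ICE curve: take a single row, sweep one feature over a
# grid of values, and watch how the predicted probability changes
row = X_test_1.iloc[[0]].copy()
grid = np.linspace(X_test_1['auto_premium'].min(), X_test_1['auto_premium'].max(), 50)
preds = []
for v in grid:
    row['auto_premium'] = v
    preds.append(xgb_def1.predict_proba(row)[0, 1])
plt.plot(grid, preds)
plt.xlabel('auto_premium')
plt.ylabel('predicted P(has_umbrella)');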

In [21]:
### Define an example set
X_explore_1 =  X_test_1.sample(100)

### "Mutate" the data points and make predictions on them
mxr_def_1 = mli.ModelXRay(xgb_def1,X_explore_1)

In [22]:
### Visualize the responses
indices = mxr_def_1.feature_dependence_plots()


Question: What do you think about these ICE plots?


In [ ]:


In [ ]:

Second model: Let's try adding a bit more regularization


In [23]:
# Relative to the defaults: many more (but shallower) trees, a lower learning
# rate, and stronger regularization via gamma, reg_lambda, and reg_alpha
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9, 
                                 reg_lambda=3, reg_alpha=2)

In [24]:
xgb_complex1.fit(X_train_1,y_train, eval_metric='logloss')#, eval_set=[(X_test_1, y_test)])
pred_probs_complex_1 = xgb_complex1.predict_proba(X_test_1)[:,1]

In [25]:
## Performance of this model
roc_auc_score(y_test, pred_probs_complex_1), log_loss(y_test, pred_probs_complex_1)


Out[25]:
(0.5867353705766885, 0.3530065483201448)

In [26]:
## Performance of previous model
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)


Out[26]:
(0.5888486568695068, 0.35346403439752083)

In [27]:
mli.plot_pr_curves([y_test,y_test],[pred_probs_def1,pred_probs_complex_1])



In [ ]:


In [28]:
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize=(10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex_1,size_points=False, 
                             bins=bins_final);



In [29]:
### Examine Model More Deeply

mxr_complex_1 = mli.ModelXRay(xgb_complex1,X_explore_1)

In [30]:
## Result for more regularized model

mxr_complex_1.feature_dependence_plots(pts_selected=indices)


Out[30]:
array([89, 31, 47, 99, 55])

In [31]:
## Results for less regularized model

mxr_def_1.feature_dependence_plots(pts_selected=indices)


Out[31]:
array([89, 31, 47, 99, 55])

Questions

  • Which model looks more "coherent"?
  • Which model would you feel more comfortable justifying to your boss / clients?
  • Which model do you expect to generalize better to new cases?

One more question

  • What do you think accounts for the non-monotonicity of the response with respect to the premiums?

In [ ]:


In [ ]:

Next model: Add in one additional feature


In [32]:
features_1a = ['auto_premium','homeown_premium'
                 ,'avg_homeown_rate_in_state']

In [33]:
X_train_1a = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1a]
X_test_1a = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1a]

In [34]:
xgb_complex1a= xgb.XGBClassifier(n_estimators=1000, learning_rate=.02, gamma=6, max_depth=3, subsample=.8, 
                                 reg_lambda=3, reg_alpha=2)

In [35]:
xgb_complex1a.fit(X_train_1a,y_train, eval_metric='logloss')#, eval_set=[(X_test_1a, y_test)])


Out[35]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=6,
              learning_rate=0.02, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.8, verbosity=1)

In [36]:
pred_probs_complex_1a = xgb_complex1a.predict_proba(X_test_1a)[:,1]

In [37]:
roc_auc_score(y_test, pred_probs_complex_1a), log_loss(y_test, pred_probs_complex_1a)


Out[37]:
(0.8154444532866465, 0.2903881074906859)

In [38]:
mli.plot_pr_curves([y_test,y_test, y_test],[pred_probs_def1,pred_probs_complex_1, pred_probs_complex_1a])



In [ ]:


In [39]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex_1a,size_points=False, 
                             bins=bins_final);



In [40]:
X_explore_1a = X_test_1a.sample(100)
mxr1a = mli.ModelXRay(xgb_complex1a,X_explore_1a)

In [41]:
mxr1a.feature_dependence_plots()


Out[41]:
array([26, 20, 25,  7, 90])

Question

  • What conclusions do you draw from the shapes of the curves above?

In [ ]:


In [ ]:


In [ ]:

Next model: Add in a bunch more variables


In [42]:
features_2 = ['auto_premium','homeown_premium', 'home_dwell_cov', 'home_pers_prop_cov',
                'num_home_pol',
                'yob_policyholder','min_vehicle_year', 'max_vehicle_year', 'num_vehicles',
                  'max_driver_yob', 'min_driver_yob',
                'median_household_income','median_house_value'
                  ,'avg_homeown_rate_in_state'
                ]

In [43]:
X_train_2 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_2]
X_test_2 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_2]

In [44]:
xgb_complex2 = xgb.XGBClassifier(n_estimators = 1500, learning_rate=.03, max_depth=4, subsample=.8, gamma=6, 
                                 reg_alpha=2, reg_lambda=5, colsample_bytree=.7, colsample_bylevel=.7)
xgb_complex2.fit(X_train_2,y_train, eval_metric='logloss')#, eval_set=[(X_test_2, y_test)])
pred_probs_complex2 = xgb_complex2.predict_proba(X_test_2)[:,1]

In [45]:
roc_auc_score(y_test, pred_probs_complex2), log_loss(y_test, pred_probs_complex2)


Out[45]:
(0.8627402868410349, 0.25905745915111483)

In [46]:
mli.plot_pr_curves([y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2])



In [47]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex2,size_points=False, 
                             bins=bins_final);



In [48]:
X_explore_2 =  X_test_2[X_test_2.median_house_value<1000000].sample(100)
mxr_complex2 = mli.ModelXRay(xgb_complex2,X_explore_2)

In [49]:
mxr_complex2.feature_dependence_plots(pts_selected=indices)


Out[49]:
array([89, 31, 47, 99, 55])

In [50]:
## Same plots, with the y-axis on the log-odds (logit) scale
mxr_complex2.feature_dependence_plots(pts_selected=indices, y_scaling='logit')


Out[50]:
array([89, 31, 47, 99, 55])

In [ ]:

For comparison's sake, let's look at the default XGBoost settings


In [51]:
xgb_def2 = xgb.XGBClassifier()
xgb_def2.fit(X_train_2,y_train)
pred_probs_def2 = xgb_def2.predict_proba(X_test_2)[:,1]

In [52]:
roc_auc_score(y_test, pred_probs_def2), log_loss(y_test, pred_probs_def2)


Out[52]:
(0.8668707004792474, 0.25708338170573364)

In [53]:
mli.plot_pr_curves([y_test, y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2, pred_probs_def2])



In [54]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_def2,size_points=False, 
                             bins=bins_final);



In [55]:
mxr_def2 = mli.ModelXRay(xgb_def2,X_explore_2)

In [56]:
indices = mxr_def2.feature_dependence_plots()



In [57]:
indices = mxr_def2.feature_dependence_plots(y_scaling='logit')