Workshop: Explaining and Interpreting XGBoost Models

This workshop will focus on two primary objectives:

Understanding the overall dynamics of your data and your model:

  • Using more sophisticated modeling packages (like XGBoost) to understand more complicated dynamics in the data

  • How to approach data exploration to understand more complicated relationships between the variables in your data

  • Why the "coherence" of a model is important - arguably, on the same level as its predictive performance

  • How to assess the "coherence" of a model using ICE plots

Understanding and explaining individual predictions from the model

  • How to ascribe "reasons" to individual predictions

  • How to "consolidate" features to make the reasons more coherent and understandable

  • Using visualizations independently and from the SHAP package


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('classic')  # I prefer the appearance of the "old-school" color scheme

%matplotlib inline
from sklearn.metrics import roc_auc_score, confusion_matrix, log_loss, accuracy_score, r2_score, roc_curve, precision_recall_curve
import xgboost as xgb


pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
np.set_printoptions(edgeitems=30, linewidth=100000)

In [2]:
# must have ml_insights installed ('pip install ml_insights')
import ml_insights as mli

In [3]:
mli.__version__


Out[3]:
'0.0.21'

Load data

You can get the data at:

https://drive.google.com/file/d/1HCz_68wCH_3Qp0pGz3R4Q0fmd4FA2Pah/view?usp=sharing

or

https://bit.ly/2JA3Xcm

Download the file and place in the same directory as the notebook.


In [4]:
df_ins = pd.read_csv('ins_data_ODSC_West_2019.csv')
df_ins.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10670 entries, 0 to 10669
Data columns (total 20 columns):
Unnamed: 0                   10670 non-null int64
has_umbrella                 10670 non-null int64
homeown_premium              10670 non-null float64
auto_premium                 10670 non-null float64
num_home_pol                 10670 non-null float64
home_dwell_cov               8781 non-null float64
home_pers_prop_cov           10015 non-null float64
num_vehicles                 10514 non-null float64
max_vehicle_year             10514 non-null float64
min_vehicle_year             10514 non-null float64
num_drivers_by_dob           10670 non-null float64
yob_policyholder             8706 non-null float64
max_driver_yob               9334 non-null float64
min_driver_yob               9334 non-null float64
state_abbrev                 10632 non-null object
avg_homeown_rate_in_state    10632 non-null float64
median_household_income      10518 non-null float64
median_house_value           10518 non-null float64
fold_num_1                   10670 non-null int64
fold_num_2                   10670 non-null int64
dtypes: float64(15), int64(4), object(1)
memory usage: 1.6+ MB

In [ ]:

Data Exploration

This data set contains information about different insurance customers who have both home insurance and auto insurance with an agency. The information includes their premiums, coverages, vehicles, drivers, state of residence, and zip-code demographics.

In this exercise we will be building a model to predict who is likely to have an umbrella policy. Imagine, that our goal is to be able to recommend to new agencies (which currently don't sell umbrella policies) which customers are the best candidates. Our binary outcome variable is has_umbrella

We will focus on techniques for exploring, understanding, and interpreting data and models in a binary classification framework.


In [ ]:

Look at the "baseline" (marginal) probability of the outcome variable

It is always a good idea to know ahead of time what the baseline probability of your outcome variable is. Without this information, it is difficult to properly interpret the effectiveness of your model.


In [5]:
df_ins.has_umbrella.value_counts(), np.mean(df_ins.has_umbrella)


Out[5]:
(0    9457
 1    1213
 Name: has_umbrella, dtype: int64, 0.11368322399250234)

Initial Predictors

To start, let's look only at the variables homeown_premium and auto_premium and see how they relate to the outcome variable has_umbrella


In [6]:
plt.hist(df_ins.homeown_premium)


Out[6]:
(array([9.894e+03, 6.010e+02, 1.070e+02, 3.800e+01, 1.300e+01, 9.000e+00, 1.000e+00, 3.000e+00, 0.000e+00, 4.000e+00]),
 array([    0.,  2270.,  4540.,  6810.,  9080., 11350., 13620., 15890., 18160., 20430., 22700.]),
 <a list of 10 Patch objects>)

In [7]:
# Generally a good idea to put in custom bin widths

plt.hist(df_ins.homeown_premium, bins=np.linspace(0,10000,101));



In [8]:
bins_1 = np.linspace(0,2000,9)
bins_2 = np.array([2000,3000, 6000, 10000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
mli.histogram_pair(df_ins.homeown_premium, df_ins.has_umbrella, bins=bins_final);



In [9]:
plt.hist(df_ins.auto_premium);



In [10]:
plt.hist(df_ins.auto_premium,np.linspace(0,15000,30+1));



In [ ]:


In [11]:
bins_1 = np.linspace(0,4000,8+1)
bins_2 = np.array([4000,6000, 8000, 15000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
mli.histogram_pair(df_ins.auto_premium, df_ins.has_umbrella, bins=bins_final);



In [12]:
## Could do this on all the variables to understand...

In [ ]:

Simple model: 2 variables

It is often a good idea to build a model on a small number of variables first, just to get a sense of the behavior.


In [13]:
chosen_fold_variant = 'fold_num_1'
test_fold_num = 0

In [14]:
features_1 = ['auto_premium','homeown_premium']

Question

  • ### How high do you expect your precision to get? (at, say, recall=.1)

In [ ]:


In [ ]:


In [15]:
# Define train and test sets

X_train_1 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1]
X_test_1 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1]

y_train = df_ins.has_umbrella[df_ins[chosen_fold_variant]!=test_fold_num]
y_test = df_ins.has_umbrella[df_ins[chosen_fold_variant]==test_fold_num]

In [ ]:

First model - default XGBoost


In [16]:
xgb_def1 = xgb.XGBClassifier()

In [17]:
xgb_def1.fit(X_train_1,y_train)
pred_probs_def1 = xgb_def1.predict_proba(X_test_1)[:,1]

Question: How should we evaluate this model?


In [18]:
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)


Out[18]:
(0.5888486568695068, 0.35346403439752083)

In [19]:
mli.plot_pr_curves([y_test],[pred_probs_def1])


To Do: Perform your own evaluation of the model here


In [ ]:


In [ ]:


In [ ]:

Other less utilized ways to "examine" your model

  • ### Calibration (Reliability Diagram)
  • ### Variation to individual Parameter changes (Individual Conditional Expectation (ICE) plots)
  • ### Coherence - does the "story" of the model hold together?

Calibration

Idea: When your model predicts probability of 20%, does it really happen 20% of the time?

Method:

  • #### Look at test set when your model predicted outcome 1 with a probability of (approximately) 20%.
  • #### Look at fraction that actually occured.
  • #### Determine if it is within the expected range of frequency if the probability were accurate
  • #### Doing this requires binning the data

In [20]:
plt.figure(figsize=(12,4))
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))

mli.plot_reliability_diagram(y_test,pred_probs_def1,size_points=False, bins=bins_final);


Question: What do you think of the above reliability diagram?


In [ ]:

Individual Conditional Expectation Plot

Idea:

  • #### Hold all features constant except for one.
  • #### Vary the value of the feature over a range of values
  • #### Examine how the model prediction changes
  • #### Examine how the response curve may take different shapes (because of interactions)

In [21]:
### Define an example set
X_explore_1 =  X_test_1.sample(100)

### "Mutate" the data points and make predictions on them
mxr_def_1 = mli.ModelXRay(xgb_def1,X_explore_1)

In [22]:
### Visualize the responses
indices = mxr_def_1.feature_dependence_plots()


Question: What do you think about these ICE plots?


In [ ]:


In [ ]:

Second model: Let's try adding a bit more regularization


In [23]:
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9, 
                                 reg_lambda=3, reg_alpha=2)

In [24]:
xgb_complex1.fit(X_train_1,y_train, eval_metric='logloss')#, eval_set=[(X_test_1, y_test)])
pred_probs_complex_1 = xgb_complex1.predict_proba(X_test_1)[:,1]

In [25]:
## Performance of this model
roc_auc_score(y_test, pred_probs_complex_1), log_loss(y_test, pred_probs_complex_1)


Out[25]:
(0.5867353705766885, 0.3530065483201448)

In [26]:
## Performance of previous model
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)


Out[26]:
(0.5888486568695068, 0.35346403439752083)

In [27]:
mli.plot_pr_curves([y_test,y_test],[pred_probs_def1,pred_probs_complex_1])



In [ ]:


In [28]:
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize=(10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex_1,size_points=False, 
                             bins=bins_final);



In [29]:
### Examine Model More Deeply

mxr_complex_1 = mli.ModelXRay(xgb_complex1,X_explore_1)

In [30]:
## Result for more regularized model

mxr_complex_1.feature_dependence_plots(pts_selected=indices)


Out[30]:
array([89, 31, 47, 99, 55])

In [31]:
## Results for less regualrized model

mxr_def_1.feature_dependence_plots(pts_selected=indices)


Out[31]:
array([89, 31, 47, 99, 55])

Questions

  • ### Which model looks more "coherent" ?
  • ### Which model would you feel more comfortable justifying to your boss / clients?
  • ### Which model do you expect to generalize better to new cases?

One more question

  • ### What do you think accounts for the non-monotonicity of the response wrt the premiums?

In [ ]:


In [ ]:

Next model: Add in one additional feature


In [32]:
features_1a = ['auto_premium','homeown_premium'
                 ,'avg_homeown_rate_in_state']

In [33]:
X_train_1a = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1a]
X_test_1a = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1a]

In [34]:
xgb_complex1a= xgb.XGBClassifier(n_estimators=1000, learning_rate=.02, gamma=6, max_depth=3, subsample=.8, 
                                 reg_lambda=3, reg_alpha=2)

In [35]:
xgb_complex1a.fit(X_train_1a,y_train, eval_metric='logloss')#, eval_set=[(X_test_1a, y_test)])


Out[35]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=6,
              learning_rate=0.02, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.8, verbosity=1)

In [36]:
pred_probs_complex_1a = xgb_complex1a.predict_proba(X_test_1a)[:,1]

In [37]:
roc_auc_score(y_test, pred_probs_complex_1a), log_loss(y_test, pred_probs_complex_1a)


Out[37]:
(0.8154444532866465, 0.2903881074906859)

In [38]:
mli.plot_pr_curves([y_test,y_test, y_test],[pred_probs_def1,pred_probs_complex_1, pred_probs_complex_1a])



In [ ]:


In [39]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex_1a,size_points=False, 
                             bins=bins_final);



In [40]:
X_explore_1a = X_test_1a.sample(100)
mxr1a = mli.ModelXRay(xgb_complex1a,X_explore_1a)

In [41]:
mxr1a.feature_dependence_plots()


Out[41]:
array([26, 20, 25,  7, 90])

Question

  • ### What conclusions do you draw from the shapes of the curves above?

In [ ]:


In [ ]:


In [ ]:

Next model: Add in a bunch more variables


In [42]:
features_2 = ['auto_premium','homeown_premium', 'home_dwell_cov', 'home_pers_prop_cov',
                'num_home_pol',
                'yob_policyholder','min_vehicle_year', 'max_vehicle_year', 'num_vehicles',
                  'max_driver_yob', 'min_driver_yob',
                'median_household_income','median_house_value'
                  ,'avg_homeown_rate_in_state'
                ]

In [43]:
X_train_2 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_2]
X_test_2 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_2]

In [44]:
xgb_complex2 = xgb.XGBClassifier(n_estimators = 1500, learning_rate=.03, max_depth=4, subsample=.8, gamma=6, 
                                 reg_alpha=2, reg_lambda=5, colsample_bytree=.7, colsample_bylevel=.7)
xgb_complex2.fit(X_train_2,y_train, eval_metric='logloss')#, eval_set=[(X_test_2, y_test)])
pred_probs_complex2 = xgb_complex2.predict_proba(X_test_2)[:,1]

In [45]:
roc_auc_score(y_test, pred_probs_complex2), log_loss(y_test, pred_probs_complex2)


Out[45]:
(0.8627402868410349, 0.25905745915111483)

In [46]:
mli.plot_pr_curves([y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2])



In [47]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex2,size_points=False, 
                             bins=bins_final);



In [48]:
X_explore_2 =  X_test_2[X_test_2.median_house_value<1000000].sample(100)
mxr_complex2 = mli.ModelXRay(xgb_complex2,X_explore_2)

In [49]:
mxr_complex2.feature_dependence_plots(pts_selected=indices)


Out[49]:
array([89, 31, 47, 99, 55])

In [50]:
mxr_complex2.feature_dependence_plots(pts_selected=indices, y_scaling='logit')


Out[50]:
array([89, 31, 47, 99, 55])

In [ ]:

For comparison sake, let's look at the default XGB settings


In [51]:
xgb_def2 = xgb.XGBClassifier()
xgb_def2.fit(X_train_2,y_train)
pred_probs_def2 = xgb_def2.predict_proba(X_test_2)[:,1]

In [52]:
roc_auc_score(y_test, pred_probs_def2), log_loss(y_test, pred_probs_def2)


Out[52]:
(0.8668707004792474, 0.25708338170573364)

In [53]:
mli.plot_pr_curves([y_test, y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2, pred_probs_def2])



In [54]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_def2,size_points=False, 
                             bins=bins_final);



In [55]:
mxr_def2 = mli.ModelXRay(xgb_def2,X_explore_2)

In [56]:
indices = mxr_def2.feature_dependence_plots()



In [57]:
indices = mxr_def2.feature_dependence_plots(y_scaling='logit')



In [ ]:

Summary of Part I

  • ### Model understanding is still possible even when models are more complex
  • ### Need to broaden your thinking beyond the assumptions of linear / logistic regression
  • ### Effects of a variable are not constant across its range of values (non-linearity)
  • ### Shape of the response curve may vary depending on values of other variables (interactions)
  • ### But can still explore and understand them!

In [ ]:


In [ ]:


In [ ]:


In [ ]:

Part II: Giving reasons to model predictions

  • ### In the previous, we focused our attention to understanding the overall dynamics of the model.

  • ### However, in many cases, we want to understand the reasons behind a specific prediction.

Goal:

  • ### Explain why a particular instance is "different" from average
  • ### Which features / concepts contributed most to its "distinctiveness"?

SHAP

Idea:

  • #### arbitrarily order the features $X_1, X_2, \ldots, X_k$.

  • #### Compare $P(Y=1|X_1 = x_1, X_2 = x_2, \ldots, X_{j-1}=x_{j-1})$ to $P(Y=1|X_1 = x_1, X_2 = x_2, \ldots, X_j=x_j)$ (actually, compare the log odds) and attribute the difference to feature $X_j$ being equal to $x_j$.

  • #### Average the "attributions" to a feature over all possible orderings of the features

This is motivated by the "Shapley Value" in Game Theory

Same approach attributes ~98% of the "power" in the Security Council to the 5 permanent members

The "magic" of SHAP is that step (3) (averaging over all possible orderings) is typically computationally prohibitive, but for tree-based models they found a clever way to compute it by exploiting the structure.

References

A Unified Approach to Interpreting Model Predictions Scott Lundberg, Su-In Lee https://arxiv.org/abs/1705.07874

Consistent Individualized Feature Attribution for Tree Ensembles Scott M. Lundberg, Gabriel G. Erion, Su-In Lee https://arxiv.org/abs/1802.03888


In [ ]:


In [58]:
reas_df = mli.get_reason_score_matrix(xgb_complex2, X_test_2)
reas_df.round(decimals=2).head()


Out[58]:
auto_premium homeown_premium home_dwell_cov home_pers_prop_cov num_home_pol yob_policyholder min_vehicle_year max_vehicle_year num_vehicles max_driver_yob min_driver_yob median_household_income median_house_value avg_homeown_rate_in_state Intercept
0 0.10 -0.05 -0.23 -0.19 -0.04 -0.28 -0.03 -0.12 0.02 -0.16 -0.08 -0.17 -0.48 -0.07 -2.17
1 -0.00 -0.06 0.07 0.15 -0.04 0.10 -0.06 -0.08 0.01 0.13 -0.03 -0.10 0.24 -0.65 -2.17
2 0.01 0.01 -0.17 -0.41 -0.04 0.24 -0.07 -0.13 0.02 -0.01 0.04 -0.10 -0.00 -0.48 -2.17
3 -0.16 0.02 -0.13 -0.24 -0.04 -0.70 -0.05 0.04 0.01 0.22 0.03 -0.04 -0.11 -0.31 -2.17
4 -0.16 -0.05 -0.48 -0.25 -0.04 -0.54 -0.02 -0.04 0.02 0.15 0.01 -0.12 -0.27 -0.60 -2.17

In [59]:
## Look at correlations between the reasons
pd.DataFrame(np.round(np.corrcoef(reas_df.values[:,:-1].T), decimals=2), columns = reas_df.columns[:-1], index = reas_df.columns[:-1])


Out[59]:
auto_premium homeown_premium home_dwell_cov home_pers_prop_cov num_home_pol yob_policyholder min_vehicle_year max_vehicle_year num_vehicles max_driver_yob min_driver_yob median_household_income median_house_value avg_homeown_rate_in_state
auto_premium 1.00 -0.04 0.00 -0.00 0.01 0.04 0.02 0.13 0.16 -0.08 -0.08 0.04 0.10 0.09
homeown_premium -0.04 1.00 0.11 0.10 0.08 0.05 0.05 0.02 0.03 0.06 0.06 0.08 0.09 0.10
home_dwell_cov 0.00 0.11 1.00 0.76 0.15 0.05 0.06 0.17 0.25 0.01 0.08 0.37 0.43 0.06
home_pers_prop_cov -0.00 0.10 0.76 1.00 0.12 0.11 0.02 0.14 0.29 0.06 0.15 0.29 0.31 0.09
num_home_pol 0.01 0.08 0.15 0.12 1.00 0.01 0.01 0.03 0.06 0.09 0.08 0.08 0.07 -0.07
yob_policyholder 0.04 0.05 0.05 0.11 0.01 1.00 -0.03 0.01 0.02 0.07 0.47 0.03 -0.01 0.22
min_vehicle_year 0.02 0.05 0.06 0.02 0.01 -0.03 1.00 0.40 -0.28 -0.01 -0.05 0.15 0.05 -0.06
max_vehicle_year 0.13 0.02 0.17 0.14 0.03 0.01 0.40 1.00 0.18 -0.07 -0.02 0.16 0.07 -0.08
num_vehicles 0.16 0.03 0.25 0.29 0.06 0.02 -0.28 0.18 1.00 -0.08 -0.00 0.05 0.06 0.00
max_driver_yob -0.08 0.06 0.01 0.06 0.09 0.07 -0.01 -0.07 -0.08 1.00 0.63 -0.06 0.01 -0.04
min_driver_yob -0.08 0.06 0.08 0.15 0.08 0.47 -0.05 -0.02 -0.00 0.63 1.00 -0.03 0.01 0.04
median_household_income 0.04 0.08 0.37 0.29 0.08 0.03 0.15 0.16 0.05 -0.06 -0.03 1.00 0.67 0.12
median_house_value 0.10 0.09 0.43 0.31 0.07 -0.01 0.05 0.07 0.06 0.01 0.01 0.67 1.00 0.28
avg_homeown_rate_in_state 0.09 0.10 0.06 0.09 -0.07 0.22 -0.06 -0.08 0.00 -0.04 0.04 0.12 0.28 1.00

In [60]:
## get_reason_matrix is essentially the following 3 lines
# X_test_dmat = xgb.DMatrix(X_test_2)
# reas_mat = xgb_complex2.get_booster().predict(X_test_dmat, pred_contribs=True)
# reas_df_2 = pd.DataFrame(reas_mat, columns = list(X_test_2.columns)+['Intercept'])
# reas_df_2.head()

In [61]:
## Demonstrate that these numbers sum up to the overall log_odds

log_odds_vec = np.sum(reas_df, axis=1)
pv = 1/(1+np.exp(-log_odds_vec))
np.sum(np.abs(pv- pred_probs_complex2)>.0001)


Out[61]:
0

In [62]:
# Get the average "impact" of each variable
reas_df.abs().mean().sort_values(ascending=False)


Out[62]:
Intercept                    2.171552
avg_homeown_rate_in_state    0.502415
yob_policyholder             0.336279
home_pers_prop_cov           0.310728
home_dwell_cov               0.300321
median_house_value           0.249097
median_household_income      0.121956
auto_premium                 0.104172
max_driver_yob               0.103864
max_vehicle_year             0.067970
num_home_pol                 0.064425
min_driver_yob               0.060194
min_vehicle_year             0.058359
num_vehicles                 0.051028
homeown_premium              0.044403
dtype: float32

In [63]:
def analyze_effect(feature_name, reas_df, test_df, fillna_val=0):
    plt.subplot(2,2,1)
    plt.scatter(test_df[feature_name].fillna(fillna_val), reas_df[feature_name], alpha=.1)
    plt.subplot(2,2,2)
    plt.scatter(reas_df[feature_name], test_df[feature_name].fillna(fillna_val), alpha=.1)
    plt.subplot(2,2,3)
    plt.hist(test_df[feature_name].fillna(fillna_val))
    plt.subplot(2,2,4)
    plt.hist(reas_df[feature_name])

In [64]:
plt.figure(figsize=(10,8))
analyze_effect('yob_policyholder',reas_df, X_test_2, fillna_val=1900)



In [65]:
plt.figure(figsize=(10,8))
analyze_effect('num_home_pol',reas_df, X_test_2, fillna_val=0)



In [66]:
plt.figure(figsize=(10,8))
analyze_effect('avg_homeown_rate_in_state',reas_df, X_test_2, fillna_val=5000)



In [67]:
plt.figure(figsize=(10,8))
analyze_effect('auto_premium',reas_df, X_test_2, fillna_val=-1000)



In [68]:
plt.figure(figsize=(10,8))
analyze_effect('median_house_value',reas_df, X_test_2, fillna_val=-1000)



In [69]:
plt.figure(figsize=(10,8))
analyze_effect('median_household_income',reas_df, X_test_2, fillna_val=-10000)


Consolidating Reasons for better Interpretability

Often, ascribing value to individual features may yield confusing results. For example, many individual features may represent the same concept or otherwise be highly correlated. In these cases, it may be somewhat arbitary which of the correlated features gets "credit" for the impact. In other cases, there may be many similar variables, each of which has a tiny impact, but collectively have a much greater impact. For this reason, we may want to "consolidate" the impact of features.


In [70]:
# Create a dictionary mapping the "group name" to the list of features included in that group
reason_mapping_umb = {
              'Value_of_Real_Estate':['homeown_premium', 'home_dwell_cov','home_pers_prop_cov','num_home_pol'],
              'State_Specific_Factors':['avg_homeown_rate_in_state'],
              'Value_of_Automobiles':['min_vehicle_year', 'max_vehicle_year', 'num_vehicles','auto_premium'],
              'Age_of_Policyholder_and_Family':['yob_policyholder', 'min_driver_yob', 'max_driver_yob'],
              'Zipcode_Wealth':['median_household_income','median_house_value']}

In [71]:
cons_df = mli.consolidate_reason_scores(reas_df, reason_mapping_umb)

In [72]:
cons_df.head()


Out[72]:
Value_of_Real_Estate State_Specific_Factors Value_of_Automobiles Age_of_Policyholder_and_Family Zipcode_Wealth
0 -0.512408 -0.069297 -0.027797 -0.517467 -0.649828
1 0.128312 -0.652693 -0.132531 0.199833 0.144933
2 -0.614153 -0.482755 -0.175748 0.268490 -0.103152
3 -0.390248 -0.308211 -0.149260 -0.452115 -0.144190
4 -0.821800 -0.595606 -0.203028 -0.379293 -0.390431

In [73]:
cons_df.abs().mean().sort_values(ascending=False)


Out[73]:
Value_of_Real_Estate              0.610747
State_Specific_Factors            0.502415
Age_of_Policyholder_and_Family    0.414252
Zipcode_Wealth                    0.350292
Value_of_Automobiles              0.158169
dtype: float32

In [74]:
cons_df.columns


Out[74]:
Index(['Value_of_Real_Estate', 'State_Specific_Factors',
       'Value_of_Automobiles', 'Age_of_Policyholder_and_Family',
       'Zipcode_Wealth'],
      dtype='object')

In [75]:
## Look at correlations between the reasons
pd.DataFrame(np.round(np.corrcoef(cons_df.values.T), decimals=2), columns = cons_df.columns, index = cons_df.columns)


Out[75]:
Value_of_Real_Estate State_Specific_Factors Value_of_Automobiles Age_of_Policyholder_and_Family Zipcode_Wealth
Value_of_Real_Estate 1.00 0.08 0.16 0.10 0.41
State_Specific_Factors 0.08 1.00 0.00 0.18 0.24
Value_of_Automobiles 0.16 0.00 1.00 -0.01 0.15
Age_of_Policyholder_and_Family 0.10 0.18 -0.01 1.00 0.00
Zipcode_Wealth 0.41 0.24 0.15 0.00 1.00

In [76]:
reason_string_vector = mli.predict_reason_strings(xgb_complex2, X_test_2, reason_mapping_umb, .3)

In [77]:
pd.Series(reason_string_vector).value_counts()


Out[77]:
                                                                                             1027
Value_of_Real_Estate                                                                          219
Age_of_Policyholder_and_Family                                                                200
Zipcode_Wealth                                                                                112
State_Specific_Factors                                                                         89
Value_of_Real_Estate;Zipcode_Wealth                                                            46
Value_of_Real_Estate;Age_of_Policyholder_and_Family                                            44
State_Specific_Factors;Zipcode_Wealth                                                          42
Age_of_Policyholder_and_Family;Value_of_Real_Estate                                            30
Value_of_Real_Estate;Zipcode_Wealth;Age_of_Policyholder_and_Family                             29
State_Specific_Factors;Value_of_Real_Estate                                                    29
Age_of_Policyholder_and_Family;Zipcode_Wealth                                                  28
State_Specific_Factors;Age_of_Policyholder_and_Family                                          24
Zipcode_Wealth;Age_of_Policyholder_and_Family                                                  22
Value_of_Real_Estate;Value_of_Automobiles                                                      16
State_Specific_Factors;Value_of_Real_Estate;Zipcode_Wealth                                     13
Value_of_Real_Estate;Age_of_Policyholder_and_Family;Zipcode_Wealth                             13
State_Specific_Factors;Value_of_Real_Estate;Age_of_Policyholder_and_Family                     13
State_Specific_Factors;Age_of_Policyholder_and_Family;Value_of_Real_Estate                     11
State_Specific_Factors;Zipcode_Wealth;Age_of_Policyholder_and_Family                           11
State_Specific_Factors;Zipcode_Wealth;Value_of_Real_Estate                                      9
Zipcode_Wealth;Value_of_Real_Estate                                                             8
Value_of_Automobiles                                                                            6
Value_of_Real_Estate;Age_of_Policyholder_and_Family;Value_of_Automobiles                        6
Age_of_Policyholder_and_Family;Zipcode_Wealth;Value_of_Real_Estate                              5
State_Specific_Factors;Age_of_Policyholder_and_Family;Zipcode_Wealth                            5
Value_of_Real_Estate;Zipcode_Wealth;Value_of_Automobiles                                        4
State_Specific_Factors;Value_of_Real_Estate;Age_of_Policyholder_and_Family;Zipcode_Wealth       4
State_Specific_Factors;Age_of_Policyholder_and_Family;Value_of_Real_Estate;Zipcode_Wealth       4
State_Specific_Factors;Zipcode_Wealth;Value_of_Real_Estate;Age_of_Policyholder_and_Family       3
Age_of_Policyholder_and_Family;Value_of_Real_Estate;Zipcode_Wealth                              3
State_Specific_Factors;Value_of_Real_Estate;Zipcode_Wealth;Age_of_Policyholder_and_Family       3
Value_of_Real_Estate;State_Specific_Factors;Zipcode_Wealth                                      3
Value_of_Real_Estate;Age_of_Policyholder_and_Family;Zipcode_Wealth;Value_of_Automobiles         3
State_Specific_Factors;Age_of_Policyholder_and_Family;Zipcode_Wealth;Value_of_Real_Estate       3
Age_of_Policyholder_and_Family;Value_of_Real_Estate;Value_of_Automobiles                        1
Value_of_Real_Estate;State_Specific_Factors;Zipcode_Wealth;Age_of_Policyholder_and_Family       1
Value_of_Automobiles;Age_of_Policyholder_and_Family                                             1
Age_of_Policyholder_and_Family;Value_of_Automobiles                                             1
Zipcode_Wealth;Value_of_Real_Estate;Age_of_Policyholder_and_Family                              1
Value_of_Real_Estate;Value_of_Automobiles;Zipcode_Wealth                                        1
State_Specific_Factors;Value_of_Real_Estate;Value_of_Automobiles                                1
Zipcode_Wealth;Value_of_Automobiles                                                             1
Age_of_Policyholder_and_Family;Zipcode_Wealth;Value_of_Automobiles                              1
Age_of_Policyholder_and_Family;State_Specific_Factors;Value_of_Real_Estate                      1
Value_of_Real_Estate;State_Specific_Factors                                                     1
dtype: int64

In [78]:
pd.Series(reason_string_vector[pred_probs_complex2>.5]).value_counts()


Out[78]:
Value_of_Real_Estate;Zipcode_Wealth;Age_of_Policyholder_and_Family                           13
State_Specific_Factors;Value_of_Real_Estate;Zipcode_Wealth                                   12
State_Specific_Factors;Value_of_Real_Estate;Age_of_Policyholder_and_Family                   12
State_Specific_Factors;Value_of_Real_Estate                                                   9
State_Specific_Factors;Zipcode_Wealth;Value_of_Real_Estate                                    9
State_Specific_Factors;Age_of_Policyholder_and_Family;Value_of_Real_Estate                    7
Value_of_Real_Estate;Zipcode_Wealth                                                           6
Value_of_Real_Estate;Age_of_Policyholder_and_Family;Zipcode_Wealth                            4
State_Specific_Factors;Zipcode_Wealth                                                         4
State_Specific_Factors;Value_of_Real_Estate;Age_of_Policyholder_and_Family;Zipcode_Wealth     4
State_Specific_Factors;Age_of_Policyholder_and_Family;Zipcode_Wealth;Value_of_Real_Estate     3
Value_of_Real_Estate;Age_of_Policyholder_and_Family;Zipcode_Wealth;Value_of_Automobiles       3
State_Specific_Factors;Zipcode_Wealth;Age_of_Policyholder_and_Family                          3
State_Specific_Factors;Age_of_Policyholder_and_Family;Value_of_Real_Estate;Zipcode_Wealth     3
State_Specific_Factors;Value_of_Real_Estate;Zipcode_Wealth;Age_of_Policyholder_and_Family     3
State_Specific_Factors;Age_of_Policyholder_and_Family;Zipcode_Wealth                          2
Value_of_Real_Estate;Age_of_Policyholder_and_Family                                           2
State_Specific_Factors;Zipcode_Wealth;Value_of_Real_Estate;Age_of_Policyholder_and_Family     2
State_Specific_Factors;Age_of_Policyholder_and_Family                                         2
Value_of_Real_Estate;Zipcode_Wealth;Value_of_Automobiles                                      2
Value_of_Real_Estate;Age_of_Policyholder_and_Family;Value_of_Automobiles                      2
Value_of_Real_Estate;State_Specific_Factors;Zipcode_Wealth                                    2
Age_of_Policyholder_and_Family;Value_of_Real_Estate;Zipcode_Wealth                            1
Value_of_Real_Estate;State_Specific_Factors;Zipcode_Wealth;Age_of_Policyholder_and_Family     1
Value_of_Real_Estate;State_Specific_Factors                                                   1
State_Specific_Factors;Value_of_Real_Estate;Value_of_Automobiles                              1
dtype: int64

In [ ]:


In [ ]:

SHAP package

Since the initial version of this workshop, the SHAP package has expanded some of its visualization capabilities. So I have added a section to demonstrate some of them


In [79]:
# if needed
# !pip install shap

In [80]:
import shap
shap.initjs()
X_test_2_sm = X_test_2.iloc[:500,:]



In [81]:
explainer = shap.TreeExplainer(xgb_complex2)

In [82]:
shap_values = explainer.shap_values(X_test_2_sm)

In [83]:
row_num = 11
shap.force_plot(explainer.expected_value, shap_values[row_num,:], X_test_2_sm.iloc[row_num,:], matplotlib=True)



In [84]:
cons_shap_df = mli.consolidate_reason_scores(pd.DataFrame(shap_values, columns=X_test_2_sm.columns), reason_mapping_umb)

In [85]:
shap.force_plot(explainer.expected_value, cons_shap_df.iloc[row_num,:].values, feature_names=cons_shap_df.columns,matplotlib=True)



In [ ]:

Interesting interaction between age and neighborhood income


In [86]:
shap_interaction_values = shap.TreeExplainer(xgb_complex2).shap_interaction_values(X_test_2_sm)

In [87]:
shap.summary_plot(shap_interaction_values, X_test_2_sm)



In [88]:
shap.dependence_plot(
    ("median_household_income", "median_household_income"),
    shap_interaction_values, X_test_2_sm,
    display_features=X_test_2_sm
)



In [89]:
shap.dependence_plot(
    ("median_household_income", "yob_policyholder"),
    shap_interaction_values, X_test_2_sm,
    display_features=X_test_2_sm
)


/Users/brianlucena/.pyenv/versions/3.7.3/envs/public_env/lib/python3.7/site-packages/matplotlib/colors.py:527: RuntimeWarning: invalid value encountered in less
  xa[xa < 0] = -1

In [90]:
shap.dependence_plot(
    ("yob_policyholder", "yob_policyholder"),
    shap_interaction_values, X_test_2_sm,
    display_features=X_test_2_sm
)



In [91]:
shap.dependence_plot(
    ("yob_policyholder", "median_household_income"),
    shap_interaction_values, X_test_2_sm,
    display_features=X_test_2_sm
)



In [ ]:

Wrap-up

  • Please rate this talk on the ODSC app!
  • Also, if you find the ml_insights package useful, please give it a star on Github!

In [ ]:


In [ ]:


In [ ]:

If we end up having extra time

Idea

  • ### Create a CVModel object out of your desired model
  • ### Create a column for the fold identifier you wish to use. Let $k$ be the number of different folds
  • ### Calling fit on this model (with arguments X, y, foldnum) will automatically fit $k$ models (each one leaving out a different fold in training). If desired, will also train an additional model on all data ('overall_model')
  • ### Calling predict_proba (with arguments X, foldnum) will automatically "route" each row to the model that was not trained on it. (i.e. an element from fold 3 will get the model that was trained on all the other folds except fold 3)
  • ### Therefore, you get a set of predictions, each from a slightly different model, but all of which are "valid" in the sense that they did not have the answer in their training set.
  • ### Allows this process to be fairly simple and elegant
  • ### CVModel object can also be used on future predictions, if desired.

In [92]:
X_full_1 = df_ins.loc[:,features_2]
y_full_1 = df_ins.has_umbrella
foldnum_vec_1 = df_ins.fold_num_1
foldnum_vec_2 = df_ins.fold_num_2  ## all of a particular agent goes to the same fold

In [93]:
pd.Series(foldnum_vec_2).value_counts()


Out[93]:
0    2252
1    2200
3    2133
2    2053
4    2032
Name: fold_num_2, dtype: int64

In [94]:
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9, 
                                 reg_lambda=3, reg_alpha=2)
xgbcv_complex1 = mli.CVModel(xgb_complex1)

In [95]:
xgbcv_complex1.fit(X_full_1, y_full_1, foldnum_vec_1)


Leave out fold 0 and train on the rest
Leave out fold 1 and train on the rest
Leave out fold 2 and train on the rest
Leave out fold 3 and train on the rest
Leave out fold 4 and train on the rest
Train the overall model
Out[95]:
CVModel(base_estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=5,
                                     learning_rate=0.01, max_delta_step=0,
                                     max_depth=2, min_child_weight=1,
                                     missing=None, n_estimators=1000, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=2, reg_lambda=3,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=0.9, verbosity=1))

In [96]:
pred_probs_cv_1 = xgbcv_complex1.predict_proba(X_full_1, foldnum_vec_1)[:,1]

In [97]:
roc_auc_score(y_full_1, pred_probs_cv_1), log_loss(y_full_1, pred_probs_cv_1)


Out[97]:
(0.851853676043629, 0.2615836962261281)

In [ ]:


In [98]:
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9, 
                                 reg_lambda=3, reg_alpha=2)
xgbcv_complex1 = mli.CVModel(xgb_complex1)

In [99]:
xgbcv_complex1.fit(X_full_1, y_full_1, foldnum_vec_2)


Leave out fold 0 and train on the rest
Leave out fold 1 and train on the rest
Leave out fold 2 and train on the rest
Leave out fold 3 and train on the rest
Leave out fold 4 and train on the rest
Train the overall model
Out[99]:
CVModel(base_estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=5,
                                     learning_rate=0.01, max_delta_step=0,
                                     max_depth=2, min_child_weight=1,
                                     missing=None, n_estimators=1000, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=2, reg_lambda=3,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=0.9, verbosity=1))

In [100]:
pred_probs_cv_1 = xgbcv_complex1.predict_proba(X_full_1, foldnum_vec_2)[:,1]

In [101]:
roc_auc_score(y_full_1, pred_probs_cv_1), log_loss(y_full_1, pred_probs_cv_1)


Out[101]:
(0.8511410305037571, 0.261510705296177)

In [ ]:


In [102]:
xgbcv_complex1.model_dict


Out[102]:
{0: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=5,
               learning_rate=0.01, max_delta_step=0, max_depth=2,
               min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
               nthread=None, objective='binary:logistic', random_state=0,
               reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
               silent=None, subsample=0.9, verbosity=1),
 1: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=5,
               learning_rate=0.01, max_delta_step=0, max_depth=2,
               min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
               nthread=None, objective='binary:logistic', random_state=0,
               reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
               silent=None, subsample=0.9, verbosity=1),
 2: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=5,
               learning_rate=0.01, max_delta_step=0, max_depth=2,
               min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
               nthread=None, objective='binary:logistic', random_state=0,
               reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
               silent=None, subsample=0.9, verbosity=1),
 3: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=5,
               learning_rate=0.01, max_delta_step=0, max_depth=2,
               min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
               nthread=None, objective='binary:logistic', random_state=0,
               reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
               silent=None, subsample=0.9, verbosity=1),
 4: XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=5,
               learning_rate=0.01, max_delta_step=0, max_depth=2,
               min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
               nthread=None, objective='binary:logistic', random_state=0,
               reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
               silent=None, subsample=0.9, verbosity=1),
 'overall_model': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
               colsample_bynode=1, colsample_bytree=1, gamma=5,
               learning_rate=0.01, max_delta_step=0, max_depth=2,
               min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
               nthread=None, objective='binary:logistic', random_state=0,
               reg_alpha=2, reg_lambda=3, scale_pos_weight=1, seed=None,
               silent=None, subsample=0.9, verbosity=1)}

In [ ]:

If we have even more time....


In [103]:
def auc_sim(prob_vec, num_trials=1000):
    out_vec = np.zeros(num_trials)
    for i in range(num_trials):
        sample_results = np.random.binomial(n=1, p=prob_vec)
        out_vec[i] = roc_auc_score(sample_results, prob_vec)
    return(out_vec)

In [104]:
sample_auc_scores = auc_sim(pred_probs_complex2)


plt.hist(sample_auc_scores, bins = np.linspace(.7,.9,100+1));
plt.axvline(roc_auc_score(y_test, pred_probs_complex2))


Out[104]:
<matplotlib.lines.Line2D at 0x1379753c8>

In [105]:
def log_loss_sim(prob_vec, num_trials=1000):
    out_vec = np.zeros(num_trials)
    for i in range(num_trials):
        sample_results = np.random.binomial(n=1, p=prob_vec)
        out_vec[i] = log_loss(sample_results, prob_vec)
    return(out_vec)

In [106]:
sample_log_loss_scores = log_loss_sim(pred_probs_complex2, num_trials=5000)

In [107]:
plt.hist(sample_log_loss_scores, bins = np.linspace(.2,.4,101));
plt.axvline(log_loss(y_test, pred_probs_complex2))


Out[107]:
<matplotlib.lines.Line2D at 0x13a504fd0>

In [ ]:


In [ ]: