Using more sophisticated modeling packages (like XGBoost) to understand more complicated dynamics in the data
How to approach data exploration to understand more complicated relationships between the variables in your data
Why the "coherence" of a model is important - arguably, on the same level as its predictive performance
How to assess the "coherence" of a model using ICE plots
How to ascribe "reasons" to individual predictions
How to "consolidate" features to make the reasons more coherent and understandable
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('classic') # I prefer the appearance of the "old-school" color scheme
%matplotlib inline
from sklearn.metrics import roc_auc_score, confusion_matrix, log_loss, accuracy_score, r2_score, roc_curve, precision_recall_curve
import xgboost as xgb
pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
np.set_printoptions(edgeitems=30, linewidth=100000)
In [2]:
# must have ml_insights installed ('pip install ml_insights')
import ml_insights as mli
In [ ]:
You can get the data at:
https://drive.google.com/file/d/1UDIemsSUseF5yyv5BEABlLOKSeCnfZeo/view?usp=sharing
Download the file and place it in the same directory as this notebook.
In [3]:
df_ins = pd.read_csv('ins_data_ODSC_East_2019.csv')
df_ins.info()
This data set contains information about insurance customers who have both home insurance and auto insurance with an agency, including their premiums, coverages, vehicles, drivers, state of residence, and zip-code demographics.
In this exercise we will build a model to predict which customers are likely to have an umbrella policy. Imagine that our goal is to recommend, to new agencies (which currently don't sell umbrella policies), which customers are the best candidates. Our binary outcome variable is has_umbrella.
We will focus on techniques for exploring, understanding, and interpreting data and models in a binary classification framework.
In [ ]:
In [4]:
df_ins.has_umbrella.value_counts(), np.mean(df_ins.has_umbrella)
Out[4]:
In [5]:
plt.hist(df_ins.homeown_premium)
Out[5]:
In [6]:
plt.hist(df_ins.homeown_premium, bins=np.linspace(0,10000,101));
In [7]:
bins_1 = np.linspace(0,2000,9)
bins_2 = np.array([2000,3000, 6000, 10000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
mli.histogram_pair(df_ins.homeown_premium, df_ins.has_umbrella, bins=bins_final);
In [8]:
plt.hist(df_ins.auto_premium);
In [9]:
plt.hist(df_ins.auto_premium,np.linspace(0,15000,30+1));
In [ ]:
In [10]:
bins_1 = np.linspace(0,4000,8+1)
bins_2 = np.array([4000,6000, 8000, 15000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
mli.histogram_pair(df_ins.auto_premium, df_ins.has_umbrella, bins=bins_final);
In [11]:
## We could repeat this process on all the variables to understand each one's marginal relationship with the outcome...
In [ ]:
In [12]:
chosen_fold_variant = 'fold_num_1'
test_fold_num = 0
In [13]:
features_1 = ['auto_premium','homeown_premium']
In [ ]:
In [ ]:
In [14]:
# Define train and test sets
X_train_1 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1]
X_test_1 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1]
y_train = df_ins.has_umbrella[df_ins[chosen_fold_variant]!=test_fold_num]
y_test = df_ins.has_umbrella[df_ins[chosen_fold_variant]==test_fold_num]
In [ ]:
In [15]:
xgb_def1 = xgb.XGBClassifier()
In [16]:
xgb_def1.fit(X_train_1,y_train)
pred_probs_def1 = xgb_def1.predict_proba(X_test_1)[:,1]
In [17]:
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)
Out[17]:
In [18]:
mli.plot_pr_curves([y_test],[pred_probs_def1])
In [19]:
plt.figure(figsize=(12,4))
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))
mli.plot_reliability_diagram(y_test,pred_probs_def1,size_points=False, bins=bins_final);
In [ ]:
In [20]:
### Define an example set
X_explore_1 = X_test_1.sample(100)
### "Mutate" the data points and make predictions on them
mxr_def_1 = mli.ModelXRay(xgb_def1,X_explore_1)
In [21]:
### Visualize the responses
indices = mxr_def_1.feature_dependence_plots()
In [ ]:
In [ ]:
In [22]:
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9,
reg_lambda=3, reg_alpha=2)
In [23]:
xgb_complex1.fit(X_train_1,y_train, eval_metric='logloss')#, eval_set=[(X_test_1, y_test)])
pred_probs_complex_1 = xgb_complex1.predict_proba(X_test_1)[:,1]
In [24]:
## Performance of this model
roc_auc_score(y_test, pred_probs_complex_1), log_loss(y_test, pred_probs_complex_1)
Out[24]:
In [25]:
## Performance of previous model
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)
Out[25]:
In [26]:
mli.plot_pr_curves([y_test,y_test],[pred_probs_def1,pred_probs_complex_1])
In [ ]:
In [27]:
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize=(10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex_1,size_points=False,
bins=bins_final);
In [28]:
### Examine Model More Deeply
mxr_complex_1 = mli.ModelXRay(xgb_complex1,X_explore_1)
In [29]:
mxr_complex_1.feature_dependence_plots(pts_selected=indices)
Out[29]:
In [30]:
mxr_def_1.feature_dependence_plots(pts_selected=indices)
Out[30]:
In [ ]:
In [ ]:
In [31]:
features_1a = ['auto_premium', 'homeown_premium', 'avg_homeown_rate_in_state']
In [32]:
X_train_1a = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1a]
X_test_1a = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1a]
In [33]:
xgb_complex1a= xgb.XGBClassifier(n_estimators=1000, learning_rate=.02, gamma=6, max_depth=3, subsample=.8,
reg_lambda=3, reg_alpha=2)
In [34]:
xgb_complex1a.fit(X_train_1a,y_train, eval_metric='logloss')#, eval_set=[(X_test_1a, y_test)])
Out[34]:
In [35]:
pred_probs_complex_1a = xgb_complex1a.predict_proba(X_test_1a)[:,1]
In [36]:
roc_auc_score(y_test, pred_probs_complex_1a), log_loss(y_test, pred_probs_complex_1a)
Out[36]:
In [37]:
mli.plot_pr_curves([y_test,y_test, y_test],[pred_probs_def1,pred_probs_complex_1, pred_probs_complex_1a])
In [ ]:
In [ ]:
In [38]:
X_explore_1a = X_test_1a.sample(100)
mxr1a = mli.ModelXRay(xgb_complex1a,X_explore_1a)
In [39]:
mxr1a.feature_dependence_plots()
Out[39]:
In [40]:
features_2 = ['auto_premium','homeown_premium', 'home_dwell_cov', 'home_pers_prop_cov',
'num_home_pol',
'yob_policyholder','min_vehicle_year', 'max_vehicle_year', 'num_vehicles',
'max_driver_yob', 'min_driver_yob',
'median_household_income','median_house_value'
,'avg_homeown_rate_in_state'
]
In [41]:
X_train_2 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_2]
X_test_2 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_2]
In [42]:
xgb_complex2 = xgb.XGBClassifier(n_estimators = 1500, learning_rate=.03, max_depth=4, subsample=.8, gamma=6,
reg_alpha=2, reg_lambda=5, colsample_bytree=.7, colsample_bylevel=.7)
xgb_complex2.fit(X_train_2,y_train, eval_metric='logloss')#, eval_set=[(X_test_2, y_test)])
pred_probs_complex2 = xgb_complex2.predict_proba(X_test_2)[:,1]
In [43]:
roc_auc_score(y_test, pred_probs_complex2), log_loss(y_test, pred_probs_complex2)
Out[43]:
In [44]:
mli.plot_pr_curves([y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2])
In [45]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex2,size_points=False,
bins=bins_final);
In [46]:
X_explore_2 = X_test_2[X_test_2.median_house_value<1000000].sample(100)
mxr_complex2 = mli.ModelXRay(xgb_complex2,X_explore_2)
In [47]:
mxr_complex2.feature_dependence_plots(pts_selected=indices)
Out[47]:
In [48]:
mxr_complex2.feature_dependence_plots(pts_selected=indices, y_scaling='logit')
Out[48]:
In [ ]:
In [49]:
xgb_def2 = xgb.XGBClassifier()
xgb_def2.fit(X_train_2,y_train)
pred_probs_def2 = xgb_def2.predict_proba(X_test_2)[:,1]
In [50]:
roc_auc_score(y_test, pred_probs_def2), log_loss(y_test, pred_probs_def2)
Out[50]:
In [51]:
mli.plot_pr_curves([y_test, y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2, pred_probs_def2])
In [52]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_def2,size_points=False,
bins=bins_final);
In [53]:
mxr_def2 = mli.ModelXRay(xgb_def2,X_explore_2)
In [54]:
indices = mxr_def2.feature_dependence_plots()
In [55]:
indices = mxr_def2.feature_dependence_plots(y_scaling='logit')
In [56]:
def auc_sim(prob_vec, num_trials=1000):
    """Parametric bootstrap: simulate outcomes from the predicted probabilities
    and compute the AUC we would expect if the model were perfectly calibrated."""
    out_vec = np.zeros(num_trials)
    for i in range(num_trials):
        sample_results = np.random.binomial(n=1, p=prob_vec)
        out_vec[i] = roc_auc_score(sample_results, prob_vec)
    return out_vec
In [ ]:
In [57]:
sample_auc_scores = auc_sim(pred_probs_complex2)
In [58]:
plt.hist(sample_auc_scores, bins = np.linspace(.7,.9,100+1));
plt.axvline(roc_auc_score(y_test, pred_probs_complex2))
Out[58]:
In [59]:
def log_loss_sim(prob_vec, num_trials=1000):
    """Same parametric bootstrap as auc_sim, but for the log loss."""
    out_vec = np.zeros(num_trials)
    for i in range(num_trials):
        sample_results = np.random.binomial(n=1, p=prob_vec)
        out_vec[i] = log_loss(sample_results, prob_vec)
    return out_vec
In [60]:
sample_log_loss_scores = log_loss_sim(pred_probs_complex2, num_trials=5000)
In [61]:
plt.hist(sample_log_loss_scores, bins = np.linspace(.2,.4,101));
plt.axvline(log_loss(y_test, pred_probs_complex2))
Out[61]:
### In the previous sections, we focused on understanding the overall dynamics of the model.
### However, in many cases, we want to understand the reasons behind a specific prediction.
The SHAP approach can be summarized as follows:
#### 1. Arbitrarily order the features $X_1, X_2, \ldots, X_k$.
#### 2. Compare $P(Y=1|X_1 = x_1, X_2 = x_2, \ldots, X_{j-1}=x_{j-1})$ to $P(Y=1|X_1 = x_1, X_2 = x_2, \ldots, X_j=x_j)$ (actually, compare the log-odds) and attribute the difference to feature $X_j$ being equal to $x_j$.
#### 3. Average these "attributions" for each feature over all possible orderings of the features.
The "magic" of SHAP is that step (3) (averaging over all possible orderings) is typically computationally prohibitive, but for tree-based models the authors found a clever way to compute it exactly by exploiting the tree structure.
Lundberg, S., Lee, S.-I. "A Unified Approach to Interpreting Model Predictions." https://arxiv.org/abs/1705.07874
Lundberg, S. M., Erion, G. G., Lee, S.-I. "Consistent Individualized Feature Attribution for Tree Ensembles." https://arxiv.org/abs/1802.03888
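For intuition, the averaging-over-orderings procedure in steps (1)-(3) can be computed by brute force for a tiny score function. The function `f` below is a hypothetical stand-in for a model's log-odds output, and a fixed baseline replaces the true conditional probabilities, so this is a simplified "interventional" sketch rather than the exact SHAP computation:

```python
import itertools
import numpy as np

def shapley_attributions(f, x, baseline):
    """Brute-force Shapley values for a single prediction.

    v(S) evaluates f with the features in S taken from x and the rest
    taken from the baseline (a simple stand-in for the conditional
    expectations described in the text).
    """
    k = len(x)

    def v(subset):
        z = baseline.copy()
        z[list(subset)] = x[list(subset)]
        return f(z)

    attributions = np.zeros(k)
    orderings = list(itertools.permutations(range(k)))
    for order in orderings:
        fixed = []
        for j in order:
            before = v(fixed)           # score before fixing feature j
            fixed = fixed + [j]
            attributions[j] += v(fixed) - before  # credit the change to j
    return attributions / len(orderings)

# Toy score function with an interaction term
f = lambda z: 2.0 * z[0] + z[1] * z[2]
x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)

phi = shapley_attributions(f, x, baseline)
# phi is approximately [2., 3., 3.]: the interaction z1*z2 is split
# evenly, and the attributions sum to f(x) - f(baseline)
```

Note that this costs $O(k! \cdot k)$ model evaluations, which is exactly why the exact tree-based algorithm matters in practice.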
In [ ]:
In [62]:
reas_df = mli.get_reason_score_matrix(xgb_complex2, X_test_2)
reas_df.round(decimals=2).head()
Out[62]:
In [63]:
## Look at correlations between the reasons
pd.DataFrame(np.round(np.corrcoef(reas_df.values[:,:-1].T), decimals=2), columns = reas_df.columns[:-1], index = reas_df.columns[:-1])
Out[63]:
In [64]:
## get_reason_score_matrix is essentially the following three lines (the head() is just for display)
# X_test_dmat = xgb.DMatrix(X_test_2)
# reas_mat = xgb_complex2.get_booster().predict(X_test_dmat, pred_contribs=True)
# reas_df_2 = pd.DataFrame(reas_mat, columns = list(X_test_2.columns)+['Intercept'])
# reas_df_2.head()
In [65]:
## Demonstrate that these numbers sum up to the overall log_odds
log_odds_vec = np.sum(reas_df, axis=1)
pv = 1/(1+np.exp(-log_odds_vec))
np.sum(np.abs(pv- pred_probs_complex2)>.0001)
Out[65]:
In [66]:
# Get the average "impact" of each variable
reas_df.abs().mean().sort_values(ascending=False)
Out[66]:
In [67]:
def analyze_effect(feature_name, reas_df, test_df, fillna_val=0):
    """Plot a feature's values against its reason scores, plus the marginal
    distribution of each, in a 2x2 grid."""
    plt.subplot(2, 2, 1)
    plt.scatter(test_df[feature_name].fillna(fillna_val), reas_df[feature_name], alpha=.1)
    plt.subplot(2, 2, 2)
    plt.scatter(reas_df[feature_name], test_df[feature_name].fillna(fillna_val), alpha=.1)
    plt.subplot(2, 2, 3)
    plt.hist(test_df[feature_name].fillna(fillna_val))
    plt.subplot(2, 2, 4)
    plt.hist(reas_df[feature_name])
In [68]:
plt.figure(figsize=(10,8))
analyze_effect('yob_policyholder',reas_df, X_test_2, fillna_val=1900)
In [69]:
plt.figure(figsize=(10,8))
analyze_effect('num_home_pol',reas_df, X_test_2, fillna_val=0)
In [70]:
plt.figure(figsize=(10,8))
analyze_effect('avg_homeown_rate_in_state',reas_df, X_test_2, fillna_val=5000)
In [71]:
plt.figure(figsize=(10,8))
analyze_effect('auto_premium',reas_df, X_test_2, fillna_val=-1000)
In [72]:
plt.figure(figsize=(10,8))
analyze_effect('median_house_value',reas_df, X_test_2, fillna_val=-1000)
In [73]:
plt.figure(figsize=(10,8))
analyze_effect('median_household_income',reas_df, X_test_2, fillna_val=-10000)
Ascribing value to individual features can yield confusing results. For example, several features may represent the same underlying concept or otherwise be highly correlated; in those cases it can be somewhat arbitrary which of the correlated features gets "credit" for the impact. In other cases, many similar variables may each have a tiny impact but a much larger impact collectively. For these reasons, we may want to "consolidate" the impact of features into groups.
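Because each row's reason scores sum to the prediction's log-odds, consolidation amounts to summing the per-feature scores within each group, and the grouped scores still sum to the same log-odds. A minimal sketch of that operation (the toy reason matrix and group names below are illustrative, not the notebook's data):

```python
import pandas as pd

def consolidate_scores(reason_df, mapping):
    """Sum per-feature reason scores into per-group scores.

    mapping: {group_name: [feature columns in that group]}
    """
    out = pd.DataFrame(index=reason_df.index)
    for group, cols in mapping.items():
        out[group] = reason_df[cols].sum(axis=1)
    return out

# Illustrative toy reason matrix
reas = pd.DataFrame({'premium_a': [0.5, -0.2],
                     'premium_b': [0.1, 0.3],
                     'age': [-0.4, 0.2]})
cons = consolidate_scores(reas, {'Premiums': ['premium_a', 'premium_b'],
                                 'Age': ['age']})
# cons['Premiums'] is approximately [0.6, 0.1]
```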
In [74]:
# Create a dictionary mapping the "group name" to the list of features included in that group
reason_mapping_umb = {
'Value_of_Real_Estate':['homeown_premium', 'home_dwell_cov','home_pers_prop_cov','num_home_pol'],
'State_Specific_Factors':['avg_homeown_rate_in_state'],
'Value_of_Automobiles':['min_vehicle_year', 'max_vehicle_year', 'num_vehicles','auto_premium'],
'Age_of_Policyholder_and_Family':['yob_policyholder', 'min_driver_yob', 'max_driver_yob'],
'Zipcode_Wealth':['median_household_income','median_house_value']}
In [75]:
cons_df = mli.consolidate_reason_scores(reas_df, reason_mapping_umb)
In [76]:
cons_df.head()
Out[76]:
In [77]:
cons_df.abs().mean().sort_values(ascending=False)
Out[77]:
In [78]:
cons_df.columns
Out[78]:
In [79]:
## Look at correlations between the reasons
pd.DataFrame(np.round(np.corrcoef(cons_df.values.T), decimals=2), columns = cons_df.columns, index = cons_df.columns)
Out[79]:
In [80]:
reason_string_vector = mli.predict_reason_strings(xgb_complex2, X_test_2, reason_mapping_umb, .3)
In [81]:
pd.Series(reason_string_vector).value_counts()
Out[81]:
In [82]:
pd.Series(reason_string_vector[pred_probs_complex2>.5]).value_counts()
Out[82]:
In [ ]:
The `CVModel` wrapper (used below as `mli.CVModel`) works as follows:
Calling `fit` on this model (with arguments `X`, `y`, `foldnum`) will automatically fit $k$ models, each one leaving out a different fold in training. If desired, it will also train an additional model on all the data (`'overall_model'`).
Calling `predict_proba` (with arguments `X`, `foldnum`) will automatically "route" each row to the model that was not trained on it (i.e., an element from fold 3 will get the model that was trained on all the other folds except fold 3).
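A minimal sketch of this fold-routing idea (not the actual ml_insights implementation; a stand-in logistic regression on synthetic data is used for illustration):

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import LogisticRegression

class SimpleCVModel:
    """Minimal fold-routing sketch: one model per fold, each trained
    on the rows *outside* that fold; prediction routes each row to the
    model that never saw it."""

    def __init__(self, base_model):
        self.base_model = base_model
        self.model_dict = {}

    def fit(self, X, y, foldnums):
        for fold in np.unique(foldnums):
            m = deepcopy(self.base_model)
            mask = (foldnums != fold)          # train on all other folds
            m.fit(X[mask], y[mask])
            self.model_dict[fold] = m
        return self

    def predict_proba(self, X, foldnums):
        out = np.zeros((len(X), 2))
        for fold, m in self.model_dict.items():
            mask = (foldnums == fold)          # rows this model never saw
            out[mask] = m.predict_proba(X[mask])
        return out

# Tiny usage check with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=60) > 0).astype(int)
folds = np.arange(60) % 3

cvm = SimpleCVModel(LogisticRegression()).fit(X, y, folds)
probs = cvm.predict_proba(X, folds)[:, 1]  # out-of-fold probabilities
```

The resulting probabilities are all out-of-fold, so metrics computed on them estimate generalization performance without a separate holdout.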
In [83]:
X_full_1 = df_ins.loc[:,features_2]
y_full_1 = df_ins.has_umbrella
foldnum_vec_1 = df_ins.fold_num_1
foldnum_vec_2 = df_ins.fold_num_2 ## all customers of a particular agent go to the same fold
In [84]:
pd.Series(foldnum_vec_2).value_counts()
Out[84]:
In [85]:
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9,
reg_lambda=3, reg_alpha=2)
xgbcv_complex1 = mli.CVModel(xgb_complex1)
In [86]:
xgbcv_complex1.fit(X_full_1, y_full_1, foldnum_vec_1)
Out[86]:
In [87]:
pred_probs_cv_1 = xgbcv_complex1.predict_proba(X_full_1, foldnum_vec_1)[:,1]
In [88]:
roc_auc_score(y_full_1, pred_probs_cv_1), log_loss(y_full_1, pred_probs_cv_1)
Out[88]:
In [ ]:
In [89]:
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9,
reg_lambda=3, reg_alpha=2)
xgbcv_complex1 = mli.CVModel(xgb_complex1)
In [90]:
xgbcv_complex1.fit(X_full_1, y_full_1, foldnum_vec_2)
Out[90]:
In [91]:
pred_probs_cv_1 = xgbcv_complex1.predict_proba(X_full_1, foldnum_vec_2)[:,1]
In [92]:
roc_auc_score(y_full_1, pred_probs_cv_1), log_loss(y_full_1, pred_probs_cv_1)
Out[92]:
In [ ]:
In [93]:
xgbcv_complex1.model_dict
Out[93]: