Using more sophisticated modeling packages (like XGBoost) to capture more complicated dynamics in the data
How to approach data exploration to understand complex relationships between the variables in your data
Why the "coherence" of a model is important - arguably as important as its predictive performance
How to assess the "coherence" of a model using ICE plots
How to ascribe "reasons" to individual predictions
How to "consolidate" features to make those reasons more coherent and understandable
Using visualizations both independently and from the SHAP package
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('classic') # I prefer the appearance of the "old-school" color scheme
%matplotlib inline
from sklearn.metrics import roc_auc_score, confusion_matrix, log_loss, accuracy_score, r2_score, roc_curve, precision_recall_curve
import xgboost as xgb
pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
np.set_printoptions(edgeitems=30, linewidth=100000)
In [2]:
# must have ml_insights installed ('pip install ml_insights')
import ml_insights as mli
In [3]:
mli.__version__
Out[3]:
You can get the data at:
https://drive.google.com/file/d/1HCz_68wCH_3Qp0pGz3R4Q0fmd4FA2Pah/view?usp=sharing
Download the file and place it in the same directory as the notebook.
In [4]:
df_ins = pd.read_csv('ins_data_ODSC_West_2019.csv')
df_ins.info()
This data set contains information about insurance customers who have both home insurance and auto insurance with an agency. The information includes their premiums, coverages, vehicles, drivers, state of residence, and zip-code demographics.
In this exercise we will build a model to predict who is likely to have an umbrella policy. Imagine that our goal is to recommend, to new agencies (which currently don't sell umbrella policies), which customers are the best candidates. Our binary outcome variable is has_umbrella.
We will focus on techniques for exploring, understanding, and interpreting data and models in a binary classification framework.
In [5]:
df_ins.has_umbrella.value_counts(), np.mean(df_ins.has_umbrella)
Out[5]:
In [6]:
plt.hist(df_ins.homeown_premium)
Out[6]:
In [7]:
# Generally a good idea to specify custom bin edges rather than rely on the defaults
plt.hist(df_ins.homeown_premium, bins=np.linspace(0,10000,101));
In [8]:
bins_1 = np.linspace(0,2000,9)
bins_2 = np.array([2000,3000, 6000, 10000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
mli.histogram_pair(df_ins.homeown_premium, df_ins.has_umbrella, bins=bins_final);
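If you're curious what histogram_pair is showing, here is a rough manual equivalent (a sketch of what the plot conveys, not its actual implementation): bin the feature, then compare the count and the empirical umbrella rate within each bin.
In [ ]:
# Sketch: a manual approximation of what histogram_pair visualizes,
# i.e. per-bin counts alongside the per-bin rate of the binary outcome
binned = pd.cut(df_ins.homeown_premium, bins=bins_final)
df_ins.groupby(binned)['has_umbrella'].agg(['count', 'mean'])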
In [9]:
plt.hist(df_ins.auto_premium);
In [10]:
plt.hist(df_ins.auto_premium,np.linspace(0,15000,30+1));
In [11]:
bins_1 = np.linspace(0,4000,8+1)
bins_2 = np.array([4000,6000, 8000, 15000])
bins_final = np.unique(np.concatenate((bins_1, bins_2)))
mli.histogram_pair(df_ins.auto_premium, df_ins.has_umbrella, bins=bins_final);
In [12]:
## Could repeat this on all the variables to understand each one's relationship with the outcome (see the sketch below)
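For instance, a quick hypothetical loop over the numeric columns. The equal-width bin edges here are naive; in practice you would tune them per variable as we did above.
In [ ]:
# Sketch: repeat the paired-histogram exploration across numeric features.
# Naive equal-width bins per column; tune bins per variable in practice.
numeric_cols = df_ins.select_dtypes(include=np.number).columns.drop('has_umbrella')
for col in numeric_cols[:5]:  # first few only, for brevity
    col_bins = np.linspace(df_ins[col].min(), df_ins[col].max(), 21)
    mli.histogram_pair(df_ins[col], df_ins.has_umbrella, bins=col_bins)
    plt.show()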
In [13]:
chosen_fold_variant = 'fold_num_1'
test_fold_num = 0
In [14]:
features_1 = ['auto_premium','homeown_premium']
In [15]:
# Define train and test sets
X_train_1 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1]
X_test_1 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1]
y_train = df_ins.has_umbrella[df_ins[chosen_fold_variant]!=test_fold_num]
y_test = df_ins.has_umbrella[df_ins[chosen_fold_variant]==test_fold_num]
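A quick sanity check on the fold-based split never hurts: the train and test pieces should have comparable sizes and comparable outcome base rates.
In [ ]:
# Sanity check: split sizes and outcome base rates in train vs. test
print(X_train_1.shape, X_test_1.shape)
print(y_train.mean(), y_test.mean())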
In [16]:
xgb_def1 = xgb.XGBClassifier()
In [17]:
xgb_def1.fit(X_train_1,y_train)
pred_probs_def1 = xgb_def1.predict_proba(X_test_1)[:,1]
In [18]:
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)
Out[18]:
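As a reminder of what log loss measures, here it is computed by hand: the negative mean log-likelihood of the observed outcomes under the predicted probabilities. It should match sklearn's value above.
In [ ]:
# Manual log loss: -mean( y*log(p) + (1-y)*log(1-p) ),
# clipping p away from 0 and 1 to avoid log(0)
p = np.clip(pred_probs_def1, 1e-15, 1 - 1e-15)
-np.mean(y_test * np.log(p) + (1 - y_test) * np.log(1 - p))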
In [19]:
mli.plot_pr_curves([y_test],[pred_probs_def1])
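The same curve can be built directly from sklearn's precision_recall_curve (imported at the top); presumably the plotting helper wraps something similar.
In [ ]:
# Raw precision-recall pairs from sklearn, plotted by hand
prec, rec, _ = precision_recall_curve(y_test, pred_probs_def1)
plt.plot(rec, prec)
plt.xlabel('Recall')
plt.ylabel('Precision');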
In [20]:
plt.figure(figsize=(12,4))
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))
mli.plot_reliability_diagram(y_test,pred_probs_def1,size_points=False, bins=bins_final);
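The reliability diagram answers: when the model says 20%, does the event actually happen about 20% of the time? Here is a hand-rolled version, grouping the test set by predicted-probability bin.
In [ ]:
# Manual calibration check: mean predicted probability vs. observed
# positive rate within each probability bin
bin_ids = np.digitize(pred_probs_def1, bins_final)
calib = pd.DataFrame({'pred': pred_probs_def1, 'actual': y_test.values, 'bin': bin_ids})
calib.groupby('bin')[['pred', 'actual']].mean()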
In [21]:
### Define an example set
X_explore_1 = X_test_1.sample(100)
### "Mutate" the data points and make predictions on them
mxr_def_1 = mli.ModelXRay(xgb_def1,X_explore_1)
In [22]:
### Visualize the responses
indices = mxr_def_1.feature_dependence_plots()
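Under the hood, an ICE plot is simple: take one real data point, sweep a single feature across a grid while holding everything else fixed, and record the model's prediction at each grid value. Below is a miniature version of what ModelXRay automates (the grid choice here is an illustrative assumption).
In [ ]:
# One ICE curve by hand: vary auto_premium for a single customer,
# holding their other features fixed, and track the predicted probability
row = X_explore_1.iloc[[0]]
grid = np.linspace(X_explore_1.auto_premium.min(), X_explore_1.auto_premium.max(), 50)
ice = [xgb_def1.predict_proba(row.assign(auto_premium=v))[:, 1][0] for v in grid]
plt.plot(grid, ice)
plt.xlabel('auto_premium')
plt.ylabel('predicted probability');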
In [23]:
xgb_complex1 = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2, subsample=.9,
reg_lambda=3, reg_alpha=2)
In [24]:
xgb_complex1.fit(X_train_1,y_train, eval_metric='logloss')#, eval_set=[(X_test_1, y_test)])
pred_probs_complex_1 = xgb_complex1.predict_proba(X_test_1)[:,1]
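The commented-out eval_set hints at another option: rather than fixing n_estimators, let a held-out set stop the boosting once logloss stops improving. A sketch follows; for illustration it reuses the test fold, which leaks information, so use a separate validation fold in practice. Note the early-stopping API differs across xgboost versions.
In [ ]:
# Hypothetical early-stopping variant (older sklearn-wrapper style,
# where early_stopping_rounds is passed to fit())
xgb_es = xgb.XGBClassifier(n_estimators=1000, learning_rate=.01, gamma=5, max_depth=2,
                           subsample=.9, reg_lambda=3, reg_alpha=2)
xgb_es.fit(X_train_1, y_train, eval_metric='logloss',
           eval_set=[(X_test_1, y_test)], early_stopping_rounds=50, verbose=False)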
In [25]:
## Performance of this model
roc_auc_score(y_test, pred_probs_complex_1), log_loss(y_test, pred_probs_complex_1)
Out[25]:
In [26]:
## Performance of previous model
roc_auc_score(y_test, pred_probs_def1), log_loss(y_test, pred_probs_def1)
Out[26]:
In [27]:
mli.plot_pr_curves([y_test,y_test],[pred_probs_def1,pred_probs_complex_1])
In [28]:
bins1 = np.linspace(0,.2,8+1)
bins2 = np.linspace(.2,1,8+1)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize=(10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex_1,size_points=False,
bins=bins_final);
In [29]:
### Examine Model More Deeply
mxr_complex_1 = mli.ModelXRay(xgb_complex1,X_explore_1)
In [30]:
## Results for the more regularized model
mxr_complex_1.feature_dependence_plots(pts_selected=indices)
Out[30]:
In [31]:
## Results for the less regularized model
mxr_def_1.feature_dependence_plots(pts_selected=indices)
Out[31]:
In [32]:
features_1a = ['auto_premium', 'homeown_premium', 'avg_homeown_rate_in_state']
In [33]:
X_train_1a = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_1a]
X_test_1a = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_1a]
In [34]:
xgb_complex1a= xgb.XGBClassifier(n_estimators=1000, learning_rate=.02, gamma=6, max_depth=3, subsample=.8,
reg_lambda=3, reg_alpha=2)
In [35]:
xgb_complex1a.fit(X_train_1a,y_train, eval_metric='logloss')#, eval_set=[(X_test_1a, y_test)])
Out[35]:
In [36]:
pred_probs_complex_1a = xgb_complex1a.predict_proba(X_test_1a)[:,1]
In [37]:
roc_auc_score(y_test, pred_probs_complex_1a), log_loss(y_test, pred_probs_complex_1a)
Out[37]:
In [38]:
mli.plot_pr_curves([y_test,y_test, y_test],[pred_probs_def1,pred_probs_complex_1, pred_probs_complex_1a])
In [39]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex_1a,size_points=False,
bins=bins_final);
In [40]:
X_explore_1a = X_test_1a.sample(100)
mxr1a = mli.ModelXRay(xgb_complex1a,X_explore_1a)
In [41]:
mxr1a.feature_dependence_plots()
Out[41]:
In [42]:
features_2 = ['auto_premium', 'homeown_premium', 'home_dwell_cov', 'home_pers_prop_cov',
              'num_home_pol',
              'yob_policyholder', 'min_vehicle_year', 'max_vehicle_year', 'num_vehicles',
              'max_driver_yob', 'min_driver_yob',
              'median_household_income', 'median_house_value',
              'avg_homeown_rate_in_state']
In [43]:
X_train_2 = df_ins.loc[df_ins[chosen_fold_variant]!=test_fold_num,features_2]
X_test_2 = df_ins.loc[df_ins[chosen_fold_variant]==test_fold_num,features_2]
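Some of these new columns (vehicle years, driver birth years, demographics) may be missing for some customers. XGBoost handles NaNs natively by learning a default split direction, but it is still worth seeing how much is missing.
In [ ]:
# Count missing values per feature; XGBoost tolerates NaNs,
# but heavy missingness is still worth knowing about
X_train_2.isnull().sum()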
In [44]:
xgb_complex2 = xgb.XGBClassifier(n_estimators = 1500, learning_rate=.03, max_depth=4, subsample=.8, gamma=6,
reg_alpha=2, reg_lambda=5, colsample_bytree=.7, colsample_bylevel=.7)
xgb_complex2.fit(X_train_2,y_train, eval_metric='logloss')#, eval_set=[(X_test_2, y_test)])
pred_probs_complex2 = xgb_complex2.predict_proba(X_test_2)[:,1]
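Before x-raying this model, a quick and rough look at which features the trees lean on. Importances from the sklearn wrapper are a coarse guide only; the ICE plots below say much more about the shape of each effect.
In [ ]:
# Default feature importances from the sklearn wrapper (coarse guide only)
imp = pd.Series(xgb_complex2.feature_importances_, index=features_2).sort_values()
imp.plot(kind='barh');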
In [45]:
roc_auc_score(y_test, pred_probs_complex2), log_loss(y_test, pred_probs_complex2)
Out[45]:
In [46]:
mli.plot_pr_curves([y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2])
In [47]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_complex2,size_points=False,
bins=bins_final);
In [48]:
X_explore_2 = X_test_2[X_test_2.median_house_value<1000000].sample(100)
mxr_complex2 = mli.ModelXRay(xgb_complex2,X_explore_2)
In [49]:
mxr_complex2.feature_dependence_plots(pts_selected=indices)
Out[49]:
In [50]:
mxr_complex2.feature_dependence_plots(pts_selected=indices, y_scaling='logit')
Out[50]:
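y_scaling='logit' re-plots the same curves in log-odds, logit(p) = log(p/(1-p)), which is the additive scale the boosting model actually works on. Differences that look negligible near 0 or 1 in probability space become visible.
In [ ]:
# The logit transform magnifies movement near the probability extremes
from scipy.special import logit
probs = np.array([0.01, 0.05, 0.5, 0.95, 0.99])
logit(probs)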
In [51]:
xgb_def2 = xgb.XGBClassifier()
xgb_def2.fit(X_train_2,y_train)
pred_probs_def2 = xgb_def2.predict_proba(X_test_2)[:,1]
In [52]:
roc_auc_score(y_test, pred_probs_def2), log_loss(y_test, pred_probs_def2)
Out[52]:
In [53]:
mli.plot_pr_curves([y_test, y_test, y_test],[pred_probs_complex_1a,pred_probs_complex2, pred_probs_def2])
In [54]:
bins1 = np.linspace(0,.1,5)
bins2 = np.linspace(.2,1,9)
bins_final = np.unique(np.concatenate((bins1, bins2)))
plt.figure(figsize = (10,4))
mli.plot_reliability_diagram(y_test,pred_probs_def2,size_points=False,
bins=bins_final);
In [55]:
mxr_def2 = mli.ModelXRay(xgb_def2,X_explore_2)
In [56]:
indices = mxr_def2.feature_dependence_plots()
In [57]:
indices = mxr_def2.feature_dependence_plots(y_scaling='logit')