This workbook is intended to demonstrate why probability calibration may be useful, and how to do it using the ML-Insights package. This is an abridged version of a longer workbook (which goes into more detail).
We build a random forest classifier to predict mortality. We show that the uncalibrated model performs much worse than the calibrated version on log-loss (and Brier score).
We demonstrate the "isotonic" and "sigmoid" calibration methods of sklearn's CalibratedClassifierCV and show that they are not as effective.
Finally, we show that even methods like boosting, which we might expect to be well-calibrated, benefit significantly from the calibration capabilities provided by ML-Insights.
*MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). https://mimic.physionet.org
In [1]:
# "pip install ml_insights" in terminal if needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ml_insights as mli
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss
from sklearn import clone
from sklearn.calibration import CalibratedClassifierCV
In the next few cells, we load in some data, inspect it, select columns for our features and outcome (mortality) and fill in missing values with the median of that column.
In [2]:
# Load dataset derived from the MIMIC database
lab_aug_df = pd.read_csv("data/lab_vital_icu_table.csv")
In [3]:
lab_aug_df.head(10)
Out[3]:
In [4]:
X = lab_aug_df.loc[:,['aniongap_min', 'aniongap_max',
'albumin_min', 'albumin_max', 'bicarbonate_min', 'bicarbonate_max',
'bilirubin_min', 'bilirubin_max', 'creatinine_min', 'creatinine_max',
'chloride_min', 'chloride_max',
'hematocrit_min', 'hematocrit_max', 'hemoglobin_min', 'hemoglobin_max',
'lactate_min', 'lactate_max', 'platelet_min', 'platelet_max',
'potassium_min', 'potassium_max', 'ptt_min', 'ptt_max', 'inr_min',
'inr_max', 'pt_min', 'pt_max', 'sodium_min', 'sodium_max', 'bun_min',
'bun_max', 'wbc_min', 'wbc_max','sysbp_max', 'sysbp_mean', 'diasbp_min', 'diasbp_max', 'diasbp_mean',
'meanbp_min', 'meanbp_max', 'meanbp_mean', 'resprate_min',
'resprate_max', 'resprate_mean', 'tempc_min', 'tempc_max', 'tempc_mean',
'spo2_min', 'spo2_max', 'spo2_mean']]
y = lab_aug_df['hospital_expire_flag']
In [5]:
# Fill missing values in each column with that column's median
X = X.fillna(X.median())
Now we divide the data into training and test sets via a 70/30 split.
In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=942)
Next, we fit a Random Forest model to our training data using the SplineCalibratedClassifierCV class in ML-Insights. The resulting object is a model containing the usual "predict" and "predict_proba" methods. It also contains the uncalibrated classifier and the calibration function that "corrects" the output to give more accurate probabilities. This lets us see exactly why and how the calibration helps us perform better on metrics such as log-loss and Brier score.
Note that the percentage of trees that voted "yes" in a random forest is better understood as a mere score. A higher value should generally indicate a higher probability of mortality, but there is no reason to expect these to be well-calibrated probabilities. The fact that, say, 60% of the trees voted "yes" on a particular case does not mean that the case has a 60% probability of mortality.
We will demonstrate this empirically later.
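As a loose preview of that check, the sketch below (purely illustrative; the helper function, bin count, and placeholder names are not part of ML-Insights) bins held-out scores and compares each bin's mean score to the observed mortality rate. For well-calibrated probabilities the two columns would roughly agree.
In [ ]:
# Illustrative sketch only: bin held-out scores and compare each bin's mean score
# to the observed positive rate. `scores`/`labels` stand in for arrays such as
# test_res_uncalib and y_test, which are defined later in this notebook.
def score_vs_empirical_rate(scores, labels, n_bins=10):
    bins = np.linspace(0, 1, n_bins + 1)
    bin_ids = np.digitize(scores, bins[1:-1])      # bin index 0..n_bins-1 per score
    df = pd.DataFrame({'score': scores, 'label': labels, 'bin': bin_ids})
    return df.groupby('bin').agg(mean_score=('score', 'mean'),
                                 empirical_rate=('label', 'mean'),
                                 count=('label', 'size'))

# Example usage once the test-set objects exist:
# score_vs_empirical_rate(test_res_uncalib, np.asarray(y_test))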
In [7]:
rfm = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942, n_jobs=-1 )
rfm_calib = mli.SplineCalibratedClassifierCV(rfm)
rfm_calib.fit(X_train,y_train)
Out[7]:
In [8]:
test_res_uncalib = rfm_calib.uncalibrated_classifier.predict_proba(X_test)[:,1]
test_res_calib = rfm_calib.predict_proba(X_test)[:,1]
As you can see below, there is a significant improvement after calibration (lower loss is better).
In [9]:
log_loss(y_test,test_res_uncalib)
Out[9]:
In [10]:
log_loss(y_test,test_res_calib)
Out[10]:
The following shows that the predict_proba method of rfm_calib is the same as using the uncalibrated classifier and then applying the calibration function.
In [11]:
test_res_calib_2 = rfm_calib.calib_func(test_res_uncalib)
log_loss(y_test,test_res_calib_2)
Out[11]:
The default logistic option is intended to reduce the log-loss (a.k.a. deviance), but we see here that it improves the Brier score as well.
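As a reminder of what these two metrics measure, the sketch below computes them directly from predicted probabilities (the clipping constant is just a numerical safeguard, not part of the definitions); the notebook itself uses sklearn's log_loss and brier_score_loss.
In [ ]:
# Hand-rolled versions of the two metrics, for intuition only.
def log_loss_manual(y_true, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)                # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def brier_score_manual(y_true, p):
    return np.mean((p - y_true) ** 2)           # mean squared error of the probabilities

# e.g. log_loss_manual(np.asarray(y_test), test_res_calib)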
In [12]:
brier_score_loss(y_test,test_res_uncalib)
Out[12]:
In [13]:
brier_score_loss(y_test,test_res_calib)
Out[13]:
This next plot shows the calibration function that was estimated. If the original model had been perfectly calibrated, we would expect the calibration function to be y=x. Instead, the plot shows that the Random Forest scores overestimate the probability between 0 and 0.1 and underestimate it after that.
In [14]:
plt.plot(np.linspace(0,1,101),rfm_calib.calib_func(np.linspace(0,1,101)))
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
Out[14]:
Using histograms and bin counts, we can look at estimated empirical probabilities on the test set to see how well our calibration is working.
In [15]:
# Side by side histograms showing scores of positive vs negative cases
fig, axis = plt.subplots(1,2, figsize = (8,4))
ax=axis.flatten()
countvec0_test = ax[0].hist(test_res_uncalib[np.where(y_test==0)],bins=20,range=[0,1]);
countvec1_test = ax[1].hist(test_res_uncalib[np.where(y_test==1)],bins=20,range=[0,1]);
In [16]:
emp_prob_vec_test = countvec1_test[0]/(countvec0_test[0]+countvec1_test[0])
The plot below shows how well the calibrated probabilities fit the empirical (binned) probabilities on an independent test set.
In [17]:
plt.plot(np.linspace(0,1,101),rfm_calib.calib_func(np.linspace(0,1,101)))
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(.025,.975,20), emp_prob_vec_test, 'g+')
Out[17]:
Note that sklearn has a CalibratedClassifierCV class, but it does not seem to work as well.
In [18]:
clf_isotonic_xval = CalibratedClassifierCV(rfm, method='isotonic', cv=5)
clf_isotonic_xval.fit(X_train,y_train)
prob_pos_isotonic_xval = clf_isotonic_xval.predict_proba(X_test)[:, 1]
In [19]:
log_loss(y_test,prob_pos_isotonic_xval), log_loss(y_test,test_res_calib)
Out[19]:
In [20]:
brier_score_loss(y_test,prob_pos_isotonic_xval), brier_score_loss(y_test,test_res_calib)
Out[20]:
In [21]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib,prob_pos_isotonic_xval,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
plt.plot(np.linspace(0,1,101),rfm_calib.calib_func(np.linspace(0,1,101)),'b')
Out[21]:
The CalibratedClassifierCV in sklearn averages the results of multiple calibrated models, which introduces significant variance and hampers its performance. By contrast, ML-Insights refits the full model and computes a single calibration on the entire set of cross-validated predictions.
For this reason, the log-loss is significantly worse (and the Brier score is worse as well) with the isotonic variant, though both are improved over the uncalibrated version.
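To make that contrast concrete, here is a rough sketch of the "pool the cross-validated predictions, then fit a single calibration" idea. This is an illustration under assumptions, not the actual ML-Insights implementation; isotonic regression stands in here for the spline-based fit.
In [ ]:
# Sketch of pooled cross-validated calibration (illustrative; ML-Insights fits a
# spline-based calibration rather than the isotonic regression used here).
from sklearn.model_selection import cross_val_predict
from sklearn.isotonic import IsotonicRegression

def pooled_calibration_sketch(base_model, X, y, cv=5):
    # 1. Out-of-fold scores for every training point, pooled into one vector
    oof_scores = cross_val_predict(base_model, X, y, cv=cv, method='predict_proba')[:, 1]
    # 2. A single calibration map fit on the pooled out-of-fold scores
    calib = IsotonicRegression(out_of_bounds='clip').fit(oof_scores, y)
    # 3. Refit the base model on all of the training data
    full_model = clone(base_model).fit(X, y)
    return full_model, calib

# Usage sketch:
# full_model, calib = pooled_calibration_sketch(rfm, X_train, y_train)
# calib.predict(full_model.predict_proba(X_test)[:, 1])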
In [22]:
clf_sigmoid_xval = CalibratedClassifierCV(rfm, method='sigmoid', cv=5)
clf_sigmoid_xval.fit(X_train,y_train)
prob_pos_sigmoid_xval = clf_sigmoid_xval.predict_proba(X_test)[:, 1]
In [23]:
log_loss(y_test,prob_pos_sigmoid_xval), log_loss(y_test,test_res_calib)
Out[23]:
In [24]:
brier_score_loss(y_test,prob_pos_sigmoid_xval), brier_score_loss(y_test,test_res_calib)
Out[24]:
In [25]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib,prob_pos_sigmoid_xval,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
Out[25]:
The sigmoid variant assumes a strict parametric form for the calibration function, which is inaccurate in many cases; as a result, its performance here is not very good.
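For context, the "sigmoid" method is Platt scaling: a score s is mapped to roughly 1 / (1 + exp(A*s + B)), with only the two parameters A and B fitted. The sketch below uses made-up parameter values purely to illustrate how rigid that family of curves is compared to a free-form calibration function.
In [ ]:
# Platt-style sigmoid calibration has only two free parameters, so it can only
# produce S-shaped correction curves (parameter values here are arbitrary).
def platt_sigmoid(s, A, B):
    return 1.0 / (1.0 + np.exp(A * s + B))

s_grid = np.linspace(0, 1, 101)
for A, B in [(-4, 2), (-8, 4), (-2, 1)]:
    plt.plot(s_grid, platt_sigmoid(s_grid, A, B), label='A=%d, B=%d' % (A, B))
plt.plot(s_grid, s_grid, 'k', label='identity')
plt.legend()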
It is perhaps unsurprising that random forest vote percentages would need calibration, since they are not intended to estimate the true probabilities. However, as we see below, even methods like boosting perform better if calibration is applied. (Warning: the next cell may take 30+ minutes to run. For a quicker demonstration, decrease max_depth and/or n_estimators.)
In [26]:
gbm = GradientBoostingClassifier(n_estimators = 1000, max_depth=7, learning_rate=.02, random_state=942)
gbm_calib = mli.SplineCalibratedClassifierCV(gbm)
gbm_calib.fit(X_train,y_train)
Out[26]:
In [30]:
test_res_uncalib_gbm = gbm_calib.uncalibrated_classifier.predict_proba(X_test)[:,1]
test_res_calib_gbm = gbm_calib.predict_proba(X_test)[:,1]
In [31]:
log_loss(y_test,test_res_uncalib_gbm)
Out[31]:
In [32]:
log_loss(y_test,test_res_calib_gbm)
Out[32]:
In [33]:
# Side by side histograms showing scores of positive vs negative cases
fig, axis = plt.subplots(1,2, figsize = (8,4))
ax=axis.flatten()
countvec0_test_gbm = ax[0].hist(test_res_uncalib_gbm[np.where(y_test==0)],bins=20,range=[0,1]);
countvec1_test_gbm = ax[1].hist(test_res_uncalib_gbm[np.where(y_test==1)],bins=20,range=[0,1]);
In [34]:
emp_prob_vec_test_gbm = countvec1_test_gbm[0]/(countvec0_test_gbm[0]+countvec1_test_gbm[0])
It is interesting to see that the boosting model has a very different calibration function.
In [35]:
plt.plot(np.linspace(0,1,101),gbm_calib.calib_func(np.linspace(0,1,101)))
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(.025,.975,20), emp_prob_vec_test_gbm, 'g+')
Out[35]:
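To compare it directly with the random forest's calibration function, one can overlay the two curves. This is a small illustrative plot reusing the objects already fit above.
In [ ]:
# Overlay the random forest and gradient boosting calibration functions
grid = np.linspace(0, 1, 101)
plt.plot(grid, rfm_calib.calib_func(grid), label='random forest calibration')
plt.plot(grid, gbm_calib.calib_func(grid), label='gradient boosting calibration')
plt.plot(grid, grid, 'k', label='identity')
plt.legend()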