This workbook demonstrates why probability calibration can be useful, and how to do it using the prob_calibration_function in the ML-Insights package.
We build a random forest model, show that the vote percentage it produces is not a well-calibrated probability, and then show how to use an independent validation set and the prob_calibration_function to properly calibrate it so that accurate probabilities can be obtained.
*MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). https://mimic.physionet.org
In [1]:
# "pip install ml_insights" in terminal if needed
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ml_insights as mli
%matplotlib inline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss
from sklearn import clone
In the next few cells, we load in some data, inspect it, select columns for our features and outcome (mortality), and fill in missing values with each column's median.
In [2]:
# Load dataset derived from the MIMIC database
lab_aug_df = pd.read_csv("data/lab_vital_icu_table.csv")
In [3]:
lab_aug_df.head(10)
Out[3]:
In [4]:
X = lab_aug_df.loc[:,['aniongap_min', 'aniongap_max',
'albumin_min', 'albumin_max', 'bicarbonate_min', 'bicarbonate_max',
'bilirubin_min', 'bilirubin_max', 'creatinine_min', 'creatinine_max',
'chloride_min', 'chloride_max',
'hematocrit_min', 'hematocrit_max', 'hemoglobin_min', 'hemoglobin_max',
'lactate_min', 'lactate_max', 'platelet_min', 'platelet_max',
'potassium_min', 'potassium_max', 'ptt_min', 'ptt_max', 'inr_min',
'inr_max', 'pt_min', 'pt_max', 'sodium_min', 'sodium_max', 'bun_min',
'bun_max', 'wbc_min', 'wbc_max','sysbp_max', 'sysbp_mean', 'diasbp_min', 'diasbp_max', 'diasbp_mean',
'meanbp_min', 'meanbp_max', 'meanbp_mean', 'resprate_min',
'resprate_max', 'resprate_mean', 'tempc_min', 'tempc_max', 'tempc_mean',
'spo2_min', 'spo2_max', 'spo2_mean']]
y = lab_aug_df['hospital_expire_flag']
In [5]:
# Impute each column's median to replace NAs
X = X.fillna(X.median())
Now we divide the data into training, validation, and test sets. The training set will be used to fit the model, the validation set will be used to calibrate the probabilities, and the test set will be used to evaluate performance. We use a 60-20-20 split (achieved by first doing an 80/20 split and then splitting the 80 into 75/25).
In [6]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=942)
In [7]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=942)
Next, we fit a Random Forest model to our training data. Then we use that model to predict "probabilities" on our validation and test sets.
I put quotes around "probabilities" because these numbers, which are the percentage of trees that voted "yes", are better understood as mere scores. A higher value should generally indicate a higher probability of mortality, but there is no reason to expect these to be well-calibrated probabilities. The fact that, say, 60% of the trees voted "yes" on a particular case does not mean that the case has a 60% probability of mortality.
We will demonstrate this empirically later.
In [8]:
rfmodel1 = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942, n_jobs=-1 )
rfmodel1.fit(X_train,y_train)
Out[8]:
In [9]:
val_res = rfmodel1.predict_proba(X_val)[:,1]
test_res_uncalib = rfmodel1.predict_proba(X_test)[:,1]
Below are histograms of the random forest vote percentage on the validation set (top row) and the test set (bottom row), for the negative cases (survivors, left side) and the positive cases (mortalities, right side). As we would expect, the distribution for the mortalities is shifted to the right. We store the "count vectors" so that we can calculate some empirical probabilities. We use 20 equal-width bins.
In [10]:
# Side by side histograms showing scores of positive vs negative cases
fig, axis = plt.subplots(2,2, figsize = (8,8))
ax=axis.flatten()
countvec0_val = ax[0].hist(val_res[np.where(y_val==0)],bins=20,range=[0,1]);
countvec1_val = ax[1].hist(val_res[np.where(y_val==1)],bins=20,range=[0,1]);
countvec0_test = ax[2].hist(test_res_uncalib[np.where(y_test==0)],bins=20,range=[0,1]);
countvec1_test = ax[3].hist(test_res_uncalib[np.where(y_test==1)],bins=20,range=[0,1]);
Next, we see the numerical counts in each bin for the validation set.
For example, the first entry shows that among the cases with a vote percentage between 0 and .05, there were 4968 survivors and 62 mortalities. For those with a vote percentage between .05 and .10, there were 2450 survivors and 138 mortalities, and so on. This allows us to calculate some empirical probabilities and see visually how well (or poorly) calibrated the vote percentage is.
In [11]:
list(zip(countvec0_val[0],countvec1_val[0]))
Out[11]:
Using the counts above, we can calculate empirical probabilities for each bin. Below, we can see the endpoints of each bin and the empirical probability of mortality for entries in that bin. So for example, for entries with a vote percentage between .45 and .5, we can see that the empirical probability of mortality was .756 (59/78), which suggests that the vote percentage is not a well-calibrated probability.
In [12]:
emp_prob_vec_val = countvec1_val[0]/(countvec0_val[0]+countvec1_val[0])
list(zip([(i/20, (i+1)/20) for i in range(20)],emp_prob_vec_val))
Out[12]:
To see this more visually: for a well-calibrated probability, the points should fall (roughly) on the line y = x. However, as we see in the plot below, most of the points lie above that line. This suggests that (in particular for entries with vote percentage > .4) the true probability is considerably higher than the vote percentage.
In [13]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(0.05,1,20)-.025,emp_prob_vec_val,'ro')
Out[13]:
So how do we fix this? We can use the validation set to create a calibration function that maps the uncalibrated scores to proper probabilities. The ml-insights package has a function called prob_calibration_function which does just that: give it the scores and the true 0/1 values, and it will return a calibration function. That calibration function can then be applied to the model output to yield well-calibrated probabilities.
In [14]:
calibrate_rf = mli.prob_calibration_function(y_val,val_res)
The function takes a little time to run, primarily because it cross-validates the regularization parameter C (to balance fitting the current data against generalizing well). There are additional settings on the function that can be used to speed this up, if desired.
As you can see below, the resulting function is a smooth cubic spline that fits the data well without over-fitting.
In [15]:
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(0.05,1,20)-.025,emp_prob_vec_val,'ro')
Out[15]:
Now let's demonstrate that we actually get more accurate probabilities after calibration than we did before. We will use the log_loss and brier_score_loss metrics to determine which performs better on the test set: with or without calibration.
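For intuition, here is a minimal hand-rolled sketch of what these two metrics compute (the function names below are just illustrative; in the cells that follow we use the sklearn implementations):
In [ ]:
# Rough equivalents of sklearn's log_loss and brier_score_loss, for intuition only
def log_loss_manual(y_true, p):
    p = np.clip(p, 1e-15, 1 - 1e-15)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def brier_score_manual(y_true, p):
    return np.mean((p - y_true) ** 2)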
In [16]:
log_loss(y_test,test_res_uncalib)
Out[16]:
In [17]:
test_res_calib=calibrate_rf(test_res_uncalib)
In [18]:
log_loss(y_test,test_res_calib)
Out[18]:
In [19]:
brier_score_loss(y_test,test_res_uncalib)
Out[19]:
In [20]:
brier_score_loss(y_test,test_res_calib)
Out[20]:
As you can see, there is a significant improvement after calibration (lower loss is better).
The next plot shows the empirical probabilities of the test-set data (which was not seen by the calibration process).
You can see that the curve fits the test data reasonably well, though not quite as well as it fits the validation set (since the calibration was trained on the validation set).
Note: the divide-by-zero warning below just comes from bins where the counts of both survivors and mortalities were 0.
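If you want to avoid that warning altogether, one option (just a sketch, not needed for the analysis; the variable names are illustrative) is to mask out the empty bins before dividing:
In [ ]:
# Only compute empirical probabilities for bins that contain at least one case
bin_totals = countvec0_test[0] + countvec1_test[0]
nonempty = bin_totals > 0
emp_prob_nonempty = countvec1_test[0][nonempty] / bin_totals[nonempty]
bin_centers = np.linspace(.025, .975, 20)[nonempty]
plt.plot(bin_centers, emp_prob_nonempty, 'y+')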
In [21]:
emp_prob_vec_test = countvec1_test[0]/(countvec0_test[0]+countvec1_test[0])
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
#plt.plot(np.linspace(.025,.975,20),emp_prob_vec_val,'ro')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'y+')
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
Out[21]:
ML-Insights also provides a SplineCalibratedClassifierCV class that wraps the model and the calibration together. With cv='prefit', it takes an already-fitted model and uses the data passed to fit only for calibration. The histogram below shows how closely its test-set predictions agree with the ones we obtained by applying calibrate_rf to the uncalibrated scores.
In [22]:
rfm_calib = mli.SplineCalibratedClassifierCV(rfmodel1, cv='prefit')
rfm_calib.fit(X_val,y_val)
Out[22]:
In [23]:
test_res_calib_alt = rfm_calib.predict_proba(X_test)[:,1]
In [24]:
plt.hist(test_res_calib_alt - test_res_calib)
Out[24]:
The default calibration uses a logistic regression fitting procedure, which is best if your metric of interest is log-loss. If the Brier score is the metric of interest, it is better to use method='ridge', since that will use an L2 (mean squared error) loss function as the calibration objective.
In [25]:
calibrate_rf_L2 = mli.prob_calibration_function(y_val,val_res,method='ridge')
In [26]:
test_res_calib_L2=calibrate_rf_L2(test_res_uncalib)
In [27]:
log_loss(y_test,test_res_calib_L2)
Out[27]:
In [28]:
brier_score_loss(y_test,test_res_calib_L2)
Out[28]:
As you can see, the L2 calibration improves (slightly) on the Brier score loss, while doing (significantly) worse on the log-loss. In general, the default logistic method is advised for calibrating probabilities.
For comparison, we also evaluate the SplineCalibratedClassifierCV (cv='prefit') approach from above on the test set with both metrics.
In [29]:
rfm_calib = mli.SplineCalibratedClassifierCV(rfmodel1, cv='prefit')
rfm_calib.fit(X_val,y_val)
Out[29]:
In [30]:
test_res_calib_direct = rfm_calib.predict_proba(X_test)[:,1]
In [31]:
log_loss(y_test,test_res_calib_direct)
Out[31]:
In [32]:
brier_score_loss(y_test,test_res_calib_direct)
Out[32]:
At the heart of the spline-fitting component of the calibration is a penalized logistic regression, driven by the LogisticRegressionCV class in sklearn. It searches over a range of "C" values, where C is a parameter that controls the regularization (shrinkage) of the coefficients.
By default, prob_calibration_function searches over 43 values of C, logarithmically spaced between 10^-4 and 10^10 (in numpy terms, reg_param_vec = 10**np.linspace(-4,10,43)). However, you have the option to specify your own reg_param_vec vector of C values to try in the cross-validation.
Recall that earlier, when we ran
calibrate_rf = mli.prob_calibration_function(y_val, val_res)
we got the message:
Trying 43 values of C between 0.0001 and 10000000000.0
Best value found C = [ 1000.]
If we wish to optimize more fully, we might want to try a range of points between 100 and 10000 for C. Below we show how to do this.
In [33]:
new_C_vec = np.linspace(100,10000,101)
new_C_vec
Out[33]:
In [34]:
calibrate_rf_new = mli.prob_calibration_function(y_val, val_res, reg_param_vec=new_C_vec)
Note that the run-time of the calibration depends on the number of C values that are tried. In general, if you want to save some time, you can specify a range with fewer points in reg_param_vec; for a more thorough search (which takes more time), you can specify more points.
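For example, one quick option (just a sketch of the trade-off; coarse_C_vec and calibrate_rf_fast are illustrative names) is to use a much coarser logarithmic grid than the default:
In [ ]:
# A coarser grid of 15 C values between 10^-4 and 10^10 runs faster than the default 43
coarse_C_vec = 10**np.linspace(-4, 10, 15)
calibrate_rf_fast = mli.prob_calibration_function(y_val, val_res, reg_param_vec=coarse_C_vec)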
In [35]:
test_res_calib_new=calibrate_rf_new(test_res_uncalib)
In [36]:
log_loss(y_test,test_res_calib_new)
Out[36]:
We see that, in this case, the log_loss was slightly improved from our previous calibration.
Note that sklearn has a CalibratedClassifierCV class, but it does not seem to work as well. Here we will use it in the same fashion as we did above: we assume we have a model that is already fit and then use the additional data to calibrate (this is why we use the option cv='prefit'). Unlike ml-insights, the sklearn functionality creates a new model object that integrates the calibration rather than keeping the model and calibration separate.
In [37]:
from sklearn.calibration import CalibratedClassifierCV
In [38]:
clf_isotonic = CalibratedClassifierCV(rfmodel1, method='isotonic',cv="prefit")
clf_isotonic.fit(X_val, y_val)
prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]
In [39]:
log_loss(y_test,prob_pos_isotonic)
Out[39]:
In [40]:
brier_score_loss(y_test,prob_pos_isotonic)
Out[40]:
The log loss and Brier score loss are actually worse (higher) than for the uncalibrated version, and considerably worse than for the calibrated method using ml-insights.
In [41]:
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_val,'ro')
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib,prob_pos_isotonic,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
Out[41]:
From the plot above, the isotonic approach gives more of a "step function" calibration than a smooth, spline-like curve. It also appears to be off by quite a bit in the region where it predicts a probability around 0.2 but the actual (empirical) probability is closer to 0.05.
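That step shape is a property of isotonic regression itself, which fits a monotone, piecewise-constant function. As a rough illustration (a sketch using sklearn's IsotonicRegression directly on the validation scores, not the internals of CalibratedClassifierCV), you can see the steps explicitly alongside the spline calibration:
In [ ]:
from sklearn.isotonic import IsotonicRegression

# Fit a monotone step function mapping validation scores to outcomes
iso = IsotonicRegression(out_of_bounds='clip')
iso.fit(val_res, y_val)

grid = np.linspace(0, 1, 101)
plt.plot(grid, iso.predict(grid), 'c')   # piecewise-constant isotonic fit
plt.plot(grid, calibrate_rf(grid))       # smooth spline calibration for comparison
plt.plot(grid, grid, 'k')                # y = x reference line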
In [42]:
clf_sigmoid = CalibratedClassifierCV(rfmodel1, method='sigmoid',cv="prefit")
clf_sigmoid.fit(X_val, y_val)
prob_pos_sigmoid = clf_sigmoid.predict_proba(X_test)[:, 1]
In [43]:
log_loss(y_test,prob_pos_sigmoid)
Out[43]:
In [44]:
brier_score_loss(y_test,prob_pos_sigmoid)
Out[44]:
Again, the results are not very reassuring.
In [45]:
#plt.plot(np.linspace(.025,.975,20),emp_prob_vec_val,'ro')
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib,prob_pos_sigmoid,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
Out[45]:
Above, we see that the sigmoid option assumes a rather strict parametric form, and thus fits badly if the reality is not of that shape.
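Concretely, the sigmoid method is Platt scaling, which (as a rough sketch, not the exact sklearn internals) learns only a slope and an intercept in a function of the form below, so the calibration curve is forced into a logistic S-shape:
In [ ]:
# Platt-style sigmoid: only the slope a and intercept b are learned from the data
def platt_sigmoid(score, a, b):
    return 1.0 / (1.0 + np.exp(a * score + b))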
One disadvantage of the above approach is that the model uses one set of data for training and another set for calibration. As a result, neither step benefits from the full amount of available data. Below we outline the ML-Insights capability to do cross-validated training and calibration, and compare it to the existing sklearn functionality.
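The general idea (a rough sketch of the approach, not necessarily the exact internals of SplineCalibratedClassifierCV; variable names are illustrative) is to collect out-of-fold scores via cross-validation, fit the calibration function on those scores, and then refit the model on all of the data:
In [ ]:
# Sketch of cross-validated calibration: out-of-fold scores -> calibration, then refit on all data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=942)
oof_scores = np.zeros(len(y_train_val))
for train_idx, cal_idx in skf.split(X_train_val, y_train_val):
    fold_model = clone(rfmodel1)
    fold_model.fit(X_train_val.iloc[train_idx], y_train_val.iloc[train_idx])
    oof_scores[cal_idx] = fold_model.predict_proba(X_train_val.iloc[cal_idx])[:, 1]

# Calibrate on the out-of-fold scores, then refit the model on the full training+validation data
calib_func_cv = mli.prob_calibration_function(y_train_val, oof_scores)
full_model = clone(rfmodel1).fit(X_train_val, y_train_val)
test_probs_sketch = calib_func_cv(full_model.predict_proba(X_test)[:, 1])

SplineCalibratedClassifierCV packages these steps up for us, as shown in the next cell.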
In [46]:
rfm = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942, n_jobs=-1 )
rfm_cv = mli.SplineCalibratedClassifierCV(rfm)
rfm_cv.fit(X_train_val,y_train_val)
Out[46]:
In [47]:
test_res_uncalib_cv = rfm_cv.uncalibrated_classifier.predict_proba(X_test)[:,1]
In [48]:
#test_res_calib_cv = calibration_function(test_res_uncalib_cv)
test_res_calib_cv = rfm_cv.predict_proba(X_test)[:,1]
In [49]:
#log_loss(y_test,test_res_uncalib_cv)
In [50]:
#brier_score_loss(y_test,test_res_uncalib_cv)
In [51]:
log_loss(y_test,test_res_calib_cv)
Out[51]:
In [52]:
brier_score_loss(y_test,test_res_calib_cv)
Out[52]:
In [53]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
plt.plot(np.linspace(0,1,101),rfm_cv.calib_func(np.linspace(0,1,101)))
Out[53]:
In [54]:
rfm = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942)
rfm_cv_L2 = mli.SplineCalibratedClassifierCV(rfm, method='ridge')
rfm_cv_L2.fit(X_train_val,y_train_val)
Out[54]:
In [55]:
test_res_calib_cv_L2 = rfm_cv_L2.predict_proba(X_test)[:,1]
In [56]:
log_loss(y_test,test_res_calib_cv_L2)
Out[56]:
In [57]:
brier_score_loss(y_test,test_res_calib_cv_L2)
Out[57]:
In [58]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
plt.plot(np.linspace(0,1,101),rfm_cv.calib_func(np.linspace(0,1,101)),'g')
plt.plot(np.linspace(0,1,101),rfm_cv_L2.calib_func(np.linspace(0,1,101)),'r')
Out[58]:
In [59]:
clf_isotonic_xval = CalibratedClassifierCV(rfm, method='isotonic', cv=5)
clf_isotonic_xval.fit(X_train_val,y_train_val)
prob_pos_isotonic_xval = clf_isotonic_xval.predict_proba(X_test)[:, 1]
In [60]:
log_loss(y_test,prob_pos_isotonic_xval)
Out[60]:
In [61]:
brier_score_loss(y_test,prob_pos_isotonic_xval)
Out[61]:
In [62]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib_cv,prob_pos_isotonic_xval,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
#plt.plot(np.linspace(0,1,101),calibration_function(np.linspace(0,1,101)),'g')
Out[62]:
In [63]:
clf_sigmoid_xval = CalibratedClassifierCV(rfm, method='sigmoid', cv=5)
clf_sigmoid_xval.fit(X_train_val,y_train_val)
prob_pos_sigmoid_xval = clf_sigmoid_xval.predict_proba(X_test)[:, 1]
In [64]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib_cv,prob_pos_sigmoid_xval,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
#plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
#plt.plot(np.linspace(0,1,101),calibration_function(np.linspace(0,1,101)),'g')
Out[64]:
In [65]:
log_loss(y_test,prob_pos_sigmoid_xval)
Out[65]:
In [66]:
brier_score_loss(y_test,prob_pos_sigmoid_xval)
Out[66]: