Probability Calibration with SplineCalib

This workbook demonstrates the SplineCalib algorithm detailed in the paper "Spline-Based Probability Calibration" https://arxiv.org/abs/1809.07751

We build a random forest model and demonstrate that using the vote percentage as a probability is not well-calibrated. We then show different approaches on how to use SplineCalib to appropriately calibrate the model.

We also show how to serialize the calibration object to be able to save it on disk and re-use it.

MIMIC ICU Data*

We illustrate this process using a mortality model on the MIMIC ICU data

*MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). https://mimic.physionet.org

``````

In [1]:

# "pip install ml_insights" in terminal if needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ml_insights as mli
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score

mli.__version__

``````
``````

Out[1]:

'0.1.3'

``````

In the next few cells, we load in some data, inspect it, select columns for our features and outcome (mortality) and fill in missing values with the median of that column.

``````

In [2]:

# Load dataset derived from the MMIC database

``````
``````

Out[2]:

.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}

.dataframe tbody tr th {
vertical-align: top;
}

text-align: right;
}

subject_id
icustay_id
aniongap_min
aniongap_max
albumin_min
albumin_max
bicarbonate_min
bicarbonate_max
bilirubin_min
...
meanbp_mean
resprate_min
resprate_max
resprate_mean
tempc_min
tempc_max
tempc_mean
spo2_min
spo2_max
spo2_mean

0
9
150750
220597
13.0
13.0
NaN
NaN
26.0
30.0
0.4
...
98.850000
14.0
19.0
14.369565
35.500001
37.888887
37.049383
95.0
100.0
97.650000

1
13
143045
263738
10.0
14.0
3.9
3.9
23.0
24.0
0.4
...
93.772727
11.0
25.0
15.320000
35.944443
37.400002
36.653534
94.0
100.0
97.700000

2
20
157681
264490
12.0
12.0
NaN
NaN
21.0
21.0
NaN
...
75.058333
10.0
27.0
15.404762
35.900002
37.299999
36.545714
95.0
100.0
98.435897

3
28
162569
225559
13.0
13.0
NaN
NaN
23.0
23.0
NaN
...
69.133333
9.0
32.0
16.677419
35.900002
37.700001
37.033333
92.0
100.0
96.419355

4
37
188670
213503
9.0
10.0
NaN
NaN
33.0
35.0
NaN
...
73.297610
15.0
30.0
22.241379
36.833335
38.055556
37.333334
89.0
99.0
96.533333

5
71
111944
211832
13.0
30.0
3.6
4.7
17.0
26.0
0.4
...
79.222208
13.0
25.0
17.130435
35.722224
37.833332
37.351852
99.0
100.0
99.862069

6
72
156857
239612
18.0
18.0
NaN
NaN
20.0
20.0
3.5
...
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN

7
78
100536
233150
9.0
9.0
2.7
3.1
26.0
26.0
0.8
...
121.129705
11.0
24.0
16.764706
36.333334
36.833335
36.577778
96.0
100.0
98.470588

8
88
123010
297289
13.0
18.0
NaN
NaN
19.0
26.0
NaN
...
91.884615
9.0
45.0
20.352941
35.722224
39.111112
37.810185
99.0
100.0
99.962963

9
95
160891
216431
13.0
17.0
NaN
NaN
23.0
26.0
NaN
...
93.952386
14.0
20.0
16.363636
35.722224
36.666667
36.305556
96.0
100.0
98.071429

10 rows × 79 columns

``````
``````

In [3]:

# Choose a subset of variables

X = lab_aug_df.loc[:,['aniongap_min', 'aniongap_max',
'albumin_min', 'albumin_max', 'bicarbonate_min', 'bicarbonate_max',
'bilirubin_min', 'bilirubin_max', 'creatinine_min', 'creatinine_max',
'chloride_min', 'chloride_max',
'hematocrit_min', 'hematocrit_max', 'hemoglobin_min', 'hemoglobin_max',
'lactate_min', 'lactate_max', 'platelet_min', 'platelet_max',
'potassium_min', 'potassium_max', 'ptt_min', 'ptt_max', 'inr_min',
'inr_max', 'pt_min', 'pt_max', 'sodium_min', 'sodium_max', 'bun_min',
'bun_max', 'wbc_min', 'wbc_max','sysbp_max', 'sysbp_mean', 'diasbp_min', 'diasbp_max', 'diasbp_mean',
'meanbp_min', 'meanbp_max', 'meanbp_mean', 'resprate_min',
'resprate_max', 'resprate_mean', 'tempc_min', 'tempc_max', 'tempc_mean',
'spo2_min', 'spo2_max', 'spo2_mean']]
y = lab_aug_df['hospital_expire_flag']

``````
``````

In [4]:

# Impute the median for in each column to replace NA's

median_vec = [X.iloc[:,i].median() for i in range(len(X.columns))]

for i in range(len(X.columns)):
X.iloc[:,i].fillna(median_vec[i],inplace=True)

``````

Now we divide the data into training, calibration, and test sets. The training set will be used to fit the model, the calibration set will be used to calibrate the probabilities, and the test set will be used to evaluate the performance. We use a 60-20-20 split (achived by first doing 80/20 and then splitting the 80 by 75/25)

``````

In [5]:

X_train_calib, X_test, y_train_calib, y_test = train_test_split(X, y, test_size=0.2, random_state=942)

``````
``````

In [6]:

X_train, X_calib, y_train, y_calib = train_test_split(X_train_calib, y_train_calib, test_size=0.25, random_state=942)

``````
``````

In [7]:

X_train.shape, X_calib.shape, X_test.shape

``````
``````

Out[7]:

((35835, 51), (11945, 51), (11946, 51))

``````

Next, we fit a Random Forest model to our training data. Then we use that model to predict "probabilities" on our validation and test sets.

I use quotes on "probabilities" because these numbers, which are the percentage of trees that voted "yes" are better understood as mere scores. A higher value should generally indicate a higher probability of mortality. However, there is no reason to expect these to be well-calibrated probabilities. The fact that, say, 60% of the trees voted "yes" on a particular case does not mean that that case has a 60% probability of mortality.

We will demonstrate this empirically later.

``````

In [8]:

rfmodel1 = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942, n_jobs=-1 )
rfmodel1.fit(X_train,y_train)

``````
``````

Out[8]:

RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=-1, oob_score=False,
random_state=942, verbose=0, warm_start=False)

``````
``````

In [9]:

preds_test_uncalib = rfmodel1.predict_proba(X_test)[:,1]

``````
``````

In [10]:

preds_test_uncalib[:10]

``````
``````

Out[10]:

array([0.03 , 0.264, 0.002, 0.138, 0.076, 0.02 , 0.098, 0.026, 0.088,
0.03 ])

``````
``````

In [ ]:

``````
``````

In [11]:

roc_auc_score(y_test, preds_test_uncalib), roc_auc_score(y_test, .1*preds_test_uncalib)

``````
``````

Out[11]:

(0.8594278459278996, 0.8594278459278996)

``````

Model Evaluation

You are pretty happy with this model since it has an AUROC of .859. But someone asks you if the probability is well-calibrated. In other words, if we looked at all the time your model predicted a mortality probability of 40%, did around 40% of them actually die? Or was it 20%, or 80%? It turns our that AUROC just measures the ranking of cases and does not evaluate if the probabilities are meaningful. In fact, if you multiply all of your predicted probabilities by .1, you would still get the same AUROC.

Checking Calibration

How do we know if a model is well-calibrated? One way to check is to create a "Reliability Diagram". The idea behind the reliability diagram is the following:

• Bin the interval [0,1] into smaller subsets (e.g. [0, 0.05], [0.05, .1], ... [.95,1])
• Find the empirical probabilities when the probabilities fell into each bin (if there were 20 times, and 9 of them were "yes", the empirical probability is .45)
• Plot the predicted probability (average of predicted probabilities in each bin) (x-axis) vs the empirical probabilities(y-axis)
• put error bars based on the size of the bin
• When the dots are (significantly) above the line y=x, the model is under-predicting the true probability, if below the line, model is over-predicting the true probability.
``````

In [12]:

mli.plot_reliability_diagram(y_test, preds_test_uncalib, marker='.')

``````
``````

Out[12]:

(array([0.01861123, 0.07227181, 0.12277373, 0.17188113, 0.22222058,
0.27283966, 0.32543118, 0.37303394, 0.42274797, 0.47114629,
0.52441667, 0.57909123, 0.62548387, 0.67436364, 0.71878232,
0.78066667, 0.81781818, 0.86266667, 0.92428571]),
array([0.01173479, 0.05521706, 0.11971412, 0.20971302, 0.26256983,
0.29813665, 0.42523364, 0.49350649, 0.51219512, 0.68571429,
0.625     , 0.76666667, 0.83870968, 0.81818182, 0.94444444,
1.        , 0.90909091, 0.88888889, 1.        ]),
array([5113., 2626., 1679.,  906.,  537.,  322.,  214.,  154.,  123.,
70.,   48.,   30.,   31.,   33.,   18.,   15.,   11.,    9.,
7.]))

``````

Above, we see that the model is largely under-predicting the probability of mortality in the range .35 to .85. For example, when the model predicts a probability of between .6 and .65, more than 80% of those patients died. And the error bars indicate that this is not likely due to random error. In other words, our model is poorly calibrated.

Calibrating a Model

Since our current model is not well-calibrated, we would like to fix this. We want that when our model says 60% chance of mortality, it means 60% and not 40% or 80%. We will discuss two ways to fix this:

• Use an independent calibration set
• Using Cross-validation to generate scores from the training set.

The first method is simpler, but requires a separate data set, meaning that you will have less data to train your model with. It is good to use if you have plenty of data. It is also a useful approach if you think your distribution has "shifted" but the underlying signal in the model is fundamentally unchanged. In some cases it may make sense to "re-calibrate" a model on the "current" population without doing a full re-training.

The second approach takes more time, but is generally more data-efficient. We generate a set of cross-validated predictions on the training data. These predictions come from models that are close to, but not exactly identical to, your original model. However, this small disrepancy is usually minor and the calibration approach works well. For details, see the "Spline-Based Probability Calibration" paper referenced above.

Approach 1: Independent validation set

First let us demostrate how we would fix this using the independent validation set.

SplineCalib object

The SplineCalib object is similar in spirit to preprocessors / data transformations in scikit-learn. The two main operations are `fit` and `calibrate` (akin to `fit` and `transform` in sklearn).

To fit a calibration object, we give it a set of uncalibrated predictions from a model, and the corresponding truth set. The `fit` routine will learn the spline curve that best maps the uncalibrated scores to actual probabilities.

``````

In [13]:

# Define SplineCalib object

calib1 = mli.SplineCalib()

``````
``````

In [14]:

# Use the model to make predictions on the calibration set
preds_cset = rfmodel1.predict_proba(X_calib)[:,1]

``````
``````

In [15]:

# Fit the calibration object on the calibration set
calib1.fit(preds_cset, y_calib)

``````
``````

In [16]:

# Visually inspect the quality of the calibration on the calibration set

mli.plot_reliability_diagram(y_calib, preds_cset);
calib1.show_calibration_curve()

``````
``````

``````
``````

In [17]:

# Visually inspect the quality of the calibration on the test set

calib1.show_calibration_curve()
mli.plot_reliability_diagram(y_test, preds_test_uncalib);

``````
``````

``````
``````

In [18]:

calib1.show_spline_reg_plot()

``````
``````

``````
``````

In [19]:

# Calibrate the previously generated predictions from the model on the test set
preds_test_calib1 = calib1.calibrate(preds_test_uncalib)

``````
``````

In [20]:

# Visually inspect the calibration of the newly calibrated predictions
mli.plot_reliability_diagram(y_test, preds_test_calib1);

``````
``````

``````
``````

In [21]:

## Compare the log_loss values
log_loss(y_test, preds_test_uncalib),log_loss(y_test, preds_test_calib1)

``````
``````

Out[21]:

(0.2525331316238236, 0.24484468382050822)

``````

From the above, we see that not only do our reliability diagrams look better, but our log_loss values have substantially improved. Log_loss measures not only the discriminative power of the model but also how well-calibrated it is.

``````

In [ ]:

``````

Approach 2: Cross-validation on the training data

The reason to use an independent calibration set (rather than just the training data) is that how the model performs on the training data (that it has already seen) is not indicative of how it will behave on data it has not seen before. We want the calibration to correct how the model will behave on "new" data, not the training data.

Another approach is to take a cross-validation approach to generating calibration data. We divide the training data into k "folds", leave one fold out, train our model (i.e. the choice of model and hyperparameter settings) on the remaining k-1 folds, and then make predictions on the left-out fold. After doing this process k times, each time leaving out a different fold, we will have a set of predictions, each of which was generated by 1 of k slightly different models, but was always generated by a model that did not see that training point. Done properly (assuming no "leakage" across the folds), this set of predictions and answers will serve as an appropriate calibration set.

ML-Insights (the package containing SplineCalib, as well as other functionality) has a simple function to generate these cross-validated predictions. We demonstrate it below.

``````

In [22]:

# Get the cross validated predictions given a model and training data.
cv_preds_train = mli.cv_predictions(rfmodel1, X_train, y_train, clone_model=True)

``````
``````

In [23]:

calib2 = mli.SplineCalib()
calib2.fit(cv_preds_train, y_train)

``````
``````

In [24]:

# Show the reliability diagram for the cross-validated predictions, and the calibration curve
calib2.show_calibration_curve()
mli.plot_reliability_diagram(y_train, cv_preds_train[:,1]);

``````
``````

``````
``````

In [ ]:

``````
``````

In [25]:

mli.plot_reliability_diagram(y_test, calib2.calibrate(preds_test_uncalib));

``````
``````

``````
``````

In [26]:

preds_test_calib2 = calib2.calibrate(preds_test_uncalib)
log_loss(y_test, preds_test_uncalib), log_loss(y_test, preds_test_calib2)

``````
``````

Out[26]:

(0.2525331316238236, 0.24431169940349143)

``````

We see above that the cross-validated approach gives similar performance (slightly better in this case). Additionally, we did not use the 20% of data set aside for calibration at all in the second approach. We could use approach 2 on the entire training and calibration data and (presumably) get an even better model.

``````

In [27]:

rfmodel2 = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942, n_jobs=-1 )
rfmodel2.fit(X_train_calib,y_train_calib)
preds_test_2_uncalib = rfmodel2.predict_proba(X_test)[:,1]

``````
``````

In [28]:

# Get the cross validated predictions given a model and training data.
cv_preds_train_calib = mli.cv_predictions(rfmodel2, X_train_calib, y_train_calib, stratified=True, clone_model=True)

``````
``````

In [29]:

calib3 = mli.SplineCalib()
calib3.fit(cv_preds_train_calib, y_train_calib)

``````
``````

In [30]:

# Show the reliability diagram for the cross-validated predictions, and the calibration curve
calib3.show_calibration_curve()
mli.plot_reliability_diagram(y_train_calib, cv_preds_train_calib[:,1]);

``````
``````

``````
``````

In [31]:

preds_test_calib3 = calib3.calibrate(preds_test_2_uncalib)
log_loss(y_test, preds_test_2_uncalib), log_loss(y_test, preds_test_calib3)

``````
``````

Out[31]:

(0.25110679815106546, 0.2429890221514194)

``````
``````

In [32]:

roc_auc_score(y_test, preds_test_2_uncalib), roc_auc_score(y_test, preds_test_calib3)

``````
``````

Out[32]:

(0.8617756467135349, 0.8617717853200559)

``````

Indeed, we get a slightly better AUC and log_loss both before and after calibration, due to having a larger training set for our model to learn from

``````

In [ ]:

``````

Serializing Models

The SplineCalib object can be saved to disk easily with `joblib.dump()` and reloaded with `joblib.load()`

``````

In [33]:

import joblib
joblib.dump(calib3, 'calib3.pkl')

``````
``````

Out[33]:

['calib3.pkl']

``````
``````

In [34]:

``````
``````

In [35]:

``````
``````

``````
``````

In [36]:

``````
``````

``````
``````

In [37]:

``````
``````

Out[37]:

0.2429890221514194

``````
``````

In [ ]:

``````
``````

In [ ]:

``````

Comparison to Other Calibration Approaches

Here we compare SplineCalib to Isotonic Regression, Platt Scaling and Beta Calibration.

``````

In [38]:

from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from betacal import BetaCalibration

``````
``````

In [39]:

# Fit three-parameter beta calibration
bc = BetaCalibration(parameters="abm")
bc.fit(cv_preds_train_calib[:,1], y_train_calib)

# Fit Isotonic Regression
iso = IsotonicRegression()
iso.fit(cv_preds_train_calib[:,1], y_train_calib)

# Fit Platt scaling (logistic calibration)
lr = LogisticRegression(C=99999999999)
lr.fit(cv_preds_train_calib[:,1].reshape(-1,1), y_train_calib)

``````
``````

Out[39]:

LogisticRegression(C=99999999999, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, l1_ratio=None,
max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='warn', tol=0.0001, verbose=0,
warm_start=False)

``````
``````

In [40]:

tvec = np.linspace(0,1,1001)
bc_probs = bc.predict(tvec)
iso_probs = iso.predict(tvec)
platt_probs = lr.predict_proba(tvec.reshape(-1,1))[:,1]
splinecalib_probs = calib3.calibrate(tvec)
#calib3.show_calibration_curve()
mli.plot_reliability_diagram(y_train_calib, cv_preds_train_calib[:,1], error_bars=False);
plt.plot(tvec, splinecalib_probs, label='SplineCalib')
plt.plot(tvec, bc_probs, label='Beta')
plt.plot(tvec, iso_probs, label='Isotonic')
plt.plot(tvec, platt_probs, label='Platt')
plt.legend()
plt.title('Calibration Curves for different methods');

``````
``````

``````
``````

In [41]:

preds_test_bc = bc.predict(preds_test_2_uncalib)
preds_test_iso = iso.predict(preds_test_2_uncalib)
preds_test_platt = lr.predict_proba(preds_test_2_uncalib.reshape(-1,1))[:,1]
preds_test_splinecalib = calib3.calibrate(preds_test_2_uncalib)

``````
``````

In [42]:

bc_loss = log_loss(y_test, preds_test_bc)
iso_loss = log_loss(y_test, preds_test_iso)
platt_loss = log_loss(y_test, preds_test_platt)
splinecalib_loss = log_loss(y_test, preds_test_splinecalib)

``````
``````

In [43]:

print('Platt loss       = {}'.format(np.round(platt_loss,5)))
print('Beta Calib loss  = {}'.format(np.round(bc_loss,5)))
print('Isotonic loss    = {}'.format(np.round(iso_loss,5)))
print('SplineCalib loss = {}'.format(np.round(splinecalib_loss,5)))

``````
``````

Platt loss       = 0.25346
Beta Calib loss  = 0.24726
Isotonic loss    = 0.2458
SplineCalib loss = 0.24299

``````
``````

In [ ]:

``````
``````

In [ ]:

``````