Probability Calibration using ML-Insights

An Example of a Mortality Model Using MIMIC ICU Data*

This workbook is intended to demonstrate why probability calibration may be useful, and how to do it using the prob_calibration_function in the ML-Insights package.

We build a random forest model, demonstrate that using the vote percentage as a probability is not well-calibrated, and then show how to use an independent validation set and the prob_calibration_function to properly calibrate it so that accurate probabilities can be obtained.

*MIMIC-III, a freely accessible critical care database. Johnson AEW, Pollard TJ, Shen L, Lehman L, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, and Mark RG. Scientific Data (2016). https://mimic.physionet.org


In [1]:
# "pip install ml_insights" in terminal if needed

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ml_insights as mli
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss
from sklearn import clone

In the next few cells, we load in some data, inspect it, select columns for our features and outcome (mortality) and fill in missing values with the median of that column.


In [2]:
# Load dataset derived from the MIMIC database

lab_aug_df = pd.read_csv("data/lab_vital_icu_table.csv")

In [3]:
lab_aug_df.head(10)


Out[3]:
subject_id hadm_id icustay_id aniongap_min aniongap_max albumin_min albumin_max bicarbonate_min bicarbonate_max bilirubin_min ... meanbp_mean resprate_min resprate_max resprate_mean tempc_min tempc_max tempc_mean spo2_min spo2_max spo2_mean
0 9 150750 220597 13.0 13.0 NaN NaN 26.0 30.0 0.4 ... 98.850000 14.0 19.0 14.369565 35.500001 37.888887 37.049383 95.0 100.0 97.650000
1 13 143045 263738 10.0 14.0 3.9 3.9 23.0 24.0 0.4 ... 93.772727 11.0 25.0 15.320000 35.944443 37.400002 36.653534 94.0 100.0 97.700000
2 20 157681 264490 12.0 12.0 NaN NaN 21.0 21.0 NaN ... 75.058333 10.0 27.0 15.404762 35.900002 37.299999 36.545714 95.0 100.0 98.435897
3 28 162569 225559 13.0 13.0 NaN NaN 23.0 23.0 NaN ... 69.133333 9.0 32.0 16.677419 35.900002 37.700001 37.033333 92.0 100.0 96.419355
4 37 188670 213503 9.0 10.0 NaN NaN 33.0 35.0 NaN ... 73.297610 15.0 30.0 22.241379 36.833335 38.055556 37.333334 89.0 99.0 96.533333
5 71 111944 211832 13.0 30.0 3.6 4.7 17.0 26.0 0.4 ... 79.222208 13.0 25.0 17.130435 35.722224 37.833332 37.351852 99.0 100.0 99.862069
6 72 156857 239612 18.0 18.0 NaN NaN 20.0 20.0 3.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 78 100536 233150 9.0 9.0 2.7 3.1 26.0 26.0 0.8 ... 121.129705 11.0 24.0 16.764706 36.333334 36.833335 36.577778 96.0 100.0 98.470588
8 88 123010 297289 13.0 18.0 NaN NaN 19.0 26.0 NaN ... 91.884615 9.0 45.0 20.352941 35.722224 39.111112 37.810185 99.0 100.0 99.962963
9 95 160891 216431 13.0 17.0 NaN NaN 23.0 26.0 NaN ... 93.952386 14.0 20.0 16.363636 35.722224 36.666667 36.305556 96.0 100.0 98.071429

10 rows × 79 columns


In [4]:
X = lab_aug_df.loc[:,['aniongap_min', 'aniongap_max',
       'albumin_min', 'albumin_max', 'bicarbonate_min', 'bicarbonate_max',
       'bilirubin_min', 'bilirubin_max', 'creatinine_min', 'creatinine_max',
       'chloride_min', 'chloride_max', 
       'hematocrit_min', 'hematocrit_max', 'hemoglobin_min', 'hemoglobin_max',
       'lactate_min', 'lactate_max', 'platelet_min', 'platelet_max',
       'potassium_min', 'potassium_max', 'ptt_min', 'ptt_max', 'inr_min',
       'inr_max', 'pt_min', 'pt_max', 'sodium_min', 'sodium_max', 'bun_min',
       'bun_max', 'wbc_min', 'wbc_max','sysbp_max', 'sysbp_mean', 'diasbp_min', 'diasbp_max', 'diasbp_mean',
       'meanbp_min', 'meanbp_max', 'meanbp_mean', 'resprate_min',
       'resprate_max', 'resprate_mean', 'tempc_min', 'tempc_max', 'tempc_mean',
       'spo2_min', 'spo2_max', 'spo2_mean']]
y = lab_aug_df['hospital_expire_flag']

In [5]:
# Impute the median of each column to replace NAs

median_vec = [X.iloc[:,i].median() for i in range(len(X.columns))]

for i in range(len(X.columns)):
    X.iloc[:,i].fillna(median_vec[i],inplace=True)
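An equivalent, more idiomatic one-liner (a sketch, assuming you do not need median_vec separately):

# Alternative sketch: fill each column's NAs with that column's own median in one step
X = X.fillna(X.median())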

Now we divide the data into training, validation, and test sets. The training set will be used to fit the model, the validation set to calibrate the probabilities, and the test set to evaluate performance. We use a 60-20-20 split (achieved by first doing an 80/20 split and then splitting the 80 as 75/25).


In [6]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=942)

In [7]:
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=942)

Next, we fit a Random Forest model to our training data. Then we use that model to predict "probabilities" on our validation and test sets.

I use quotes on "probabilities" because these numbers, which are the percentage of trees that voted "yes", are better understood as mere scores. A higher value should generally indicate a higher probability of mortality. However, there is no reason to expect these to be well-calibrated probabilities. The fact that, say, 60% of the trees voted "yes" on a particular case does not mean that that case has a 60% probability of mortality.

We will demonstrate this empirically later.


In [8]:
rfmodel1 = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942, n_jobs=-1 )
rfmodel1.fit(X_train,y_train)


Out[8]:
RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=942, verbose=0, warm_start=False)

In [9]:
val_res = rfmodel1.predict_proba(X_val)[:,1]
test_res_uncalib = rfmodel1.predict_proba(X_test)[:,1]
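As an aside, the "vote percentage" interpretation can be checked by hand: a random forest's predict_proba is the average of the per-tree class probability estimates (with deep trees these are mostly 0/1 votes, though with class_weight='balanced_subsample' they are weighted). A minimal sketch using sklearn's estimators_ attribute:

# Sketch: average the per-tree probability estimates directly; this should match
# val_res (rfmodel1.predict_proba(X_val)[:,1]) up to floating point error.
tree_probs = np.stack([tree.predict_proba(X_val.values)[:, 1] for tree in rfmodel1.estimators_])
manual_scores = tree_probs.mean(axis=0)
print(np.max(np.abs(manual_scores - val_res)))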

What follows are histograms, for the validation set (top row) and the test set (bottom row), of the random forest vote percentage for the negative cases (survivors, left side) and the positive cases (mortalities, right side). As we would expect, the distribution for mortalities is shifted to the right. We store the "count vectors" so that we can calculate some empirical probabilities. We use 20 equal-width bins.


In [10]:
# Side by side histograms showing scores of positive vs negative cases

fig, axis = plt.subplots(2,2, figsize = (8,8))

ax=axis.flatten()
countvec0_val = ax[0].hist(val_res[np.where(y_val==0)],bins=20,range=[0,1]);
countvec1_val = ax[1].hist(val_res[np.where(y_val==1)],bins=20,range=[0,1]);
countvec0_test = ax[2].hist(test_res_uncalib[np.where(y_test==0)],bins=20,range=[0,1]);
countvec1_test = ax[3].hist(test_res_uncalib[np.where(y_test==1)],bins=20,range=[0,1]);


Next, we see the numerical counts in each bin for the validation set.

For example, in the first entry, we see that for the entries that had a vote percentage between 0 and .05, there were 4968 survivors, and 62 mortalities. For those with a vote percentage between .05 and .10, there were 2450 survivors and 138 mortalities, and so on. This allows us to calculate some empirical probabilities and visually see how well (or poorly) calibrated the vote percentage is.


In [11]:
list(zip(countvec0_val[0],countvec1_val[0]))


Out[11]:
[(4968.0, 62.0),
 (2450.0, 138.0),
 (1543.0, 204.0),
 (722.0, 193.0),
 (403.0, 185.0),
 (223.0, 104.0),
 (130.0, 98.0),
 (57.0, 70.0),
 (44.0, 58.0),
 (19.0, 59.0),
 (14.0, 33.0),
 (12.0, 44.0),
 (6.0, 20.0),
 (1.0, 22.0),
 (3.0, 14.0),
 (0.0, 12.0),
 (2.0, 13.0),
 (0.0, 12.0),
 (0.0, 5.0),
 (0.0, 2.0)]

Using the counts above, we can calculate empirical probabilities for each bin. Below, we can see the endpoints of each bin and the empirical probability of mortality for entries in that bin. So for example, for entries with a vote percentage between .45 and .5, we can see that the empirical probability of mortality was .756 (59/78), which suggests that the vote percentage is not a well-calibrated probability.


In [12]:
emp_prob_vec_val = countvec1_val[0]/(countvec0_val[0]+countvec1_val[0])
list(zip([(i/20, (i+1)/20) for i in range(20)],emp_prob_vec_val))


Out[12]:
[((0.0, 0.05), 0.012326043737574552),
 ((0.05, 0.1), 0.053323029366306028),
 ((0.1, 0.15), 0.11677160847166572),
 ((0.15, 0.2), 0.21092896174863388),
 ((0.2, 0.25), 0.31462585034013607),
 ((0.25, 0.3), 0.31804281345565749),
 ((0.3, 0.35), 0.42982456140350878),
 ((0.35, 0.4), 0.55118110236220474),
 ((0.4, 0.45), 0.56862745098039214),
 ((0.45, 0.5), 0.75641025641025639),
 ((0.5, 0.55), 0.7021276595744681),
 ((0.55, 0.6), 0.7857142857142857),
 ((0.6, 0.65), 0.76923076923076927),
 ((0.65, 0.7), 0.95652173913043481),
 ((0.7, 0.75), 0.82352941176470584),
 ((0.75, 0.8), 1.0),
 ((0.8, 0.85), 0.8666666666666667),
 ((0.85, 0.9), 1.0),
 ((0.9, 0.95), 1.0),
 ((0.95, 1.0), 1.0)]

To see this more visually: a well-calibrated probability should have the points falling (roughly) on the line y = x. However, as we see in the plot below, most of the points lie above that line. This suggests that (in particular for entries with a vote percentage above .4) the true probability is considerably higher than the vote percentage.


In [13]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(0.05,1,20)-.025,emp_prob_vec_val,'ro')


Out[13]:
[<matplotlib.lines.Line2D at 0x120991940>]

So how do we fix this?! We can use the validation set to create a calibration function, which maps the uncalibrated score to a proper probability. The ml-insights package has a function called prob_calibration_function which does just that: give it the true 0/1 values and the scores, and it will return a calibration function. This calibration function can then be applied to the model output to yield well-calibrated probabilities.


In [14]:
calibrate_rf = mli.prob_calibration_function(y_val,val_res)


Originally there were 496 knots.  Reducing to 200 while preserving first and last knot.
Trying 43 values of C between 0.0001 and 10000000000.0
Best value found C = [  2.15443469e+09]

The function takes a little bit of time to run, primarily because it is doing cross-validation on the regularization parameter C (to balance fitting the current data with generalizing well). There are additional settings on the function that can be applied to speed up this process, if desired.
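For example (a sketch using the reg_param_vec option demonstrated later in this notebook), passing a coarser grid of regularization values shortens the cross-validated search:

# Sketch: 9 candidate C values instead of the default 43 makes the search faster,
# at the cost of a coarser choice of regularization strength.
coarse_C_vec = 10**np.linspace(-4, 10, 9)
calibrate_rf_fast = mli.prob_calibration_function(y_val, val_res, reg_param_vec=coarse_C_vec)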

As you can see below, the resulting function is a nice smooth spline curve that fits the data well without over-fitting.


In [15]:
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(0.05,1,20)-.025,emp_prob_vec_val,'ro')


Out[15]:
[<matplotlib.lines.Line2D at 0x1181f4a90>]

Now let's demonstrate that we actually have more accurate probabilities after calibration than we did before. We will use the log_loss metric to determine which performs better on the test set, with or without calibration.
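For reference, log loss is the average of -[y*log(p) + (1-y)*log(1-p)] over the cases being scored. A quick manual version (a sketch, clipping probabilities to avoid log(0)) should agree closely with sklearn's log_loss:

def manual_log_loss(y_true, p, eps=1e-15):
    # average negative log-likelihood of the 0/1 outcomes under predicted probabilities p
    p = np.clip(p, eps, 1 - eps)
    y_true = np.asarray(y_true)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))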


In [16]:
log_loss(y_test,test_res_uncalib)


Out[16]:
0.25252730333409118

In [17]:
test_res_calib=calibrate_rf(test_res_uncalib)

In [18]:
log_loss(y_test,test_res_calib)


Out[18]:
0.24446136650832703

In [19]:
brier_score_loss(y_test,test_res_uncalib)


Out[19]:
0.073175005512208458

In [20]:
brier_score_loss(y_test,test_res_calib)


Out[20]:
0.071984829479211476

As you can see, there is a significant improvement after calibration (lower loss is better).

This next plot shows the empirical probabilities of the test set data (which was not seen by the calibration process).

You can see that the curve fits these data pretty well, though not quite as well as it fits the validation set (since the function was trained on the validation set).

(Note: the RuntimeWarning below comes from bins in which both counts are 0, producing a 0/0 division; those points simply drop out of the plot.)
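If you prefer to avoid the warning entirely, one option (a sketch, not what is done below) is to mask the empty bins before dividing:

# Sketch: compute empirical probabilities only where a bin contains at least one case
total_test = countvec0_test[0] + countvec1_test[0]
emp_prob_masked = np.where(total_test > 0, countvec1_test[0] / np.maximum(total_test, 1), np.nan)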



In [21]:
emp_prob_vec_test = countvec1_test[0]/(countvec0_test[0]+countvec1_test[0])
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
#plt.plot(np.linspace(.025,.975,20),emp_prob_vec_val,'ro')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'y+')
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')


/Applications/anaconda/envs/py3env/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: invalid value encountered in true_divide
  if __name__ == '__main__':
Out[21]:
[<matplotlib.lines.Line2D at 0x1183ab2b0>]

The ml-insights package also provides a SplineCalibratedClassifierCV class, which wraps a prefit model together with the calibration step in a single object with the usual fit/predict_proba interface. Fitting it on the validation set (with cv='prefit') should give the same calibrated probabilities as the two-step approach above.


In [22]:
rfm_calib = mli.SplineCalibratedClassifierCV(rfmodel1, cv='prefit')
rfm_calib.fit(X_val,y_val)


Determining Calibration Function
Originally there were 496 knots.  Reducing to 200 while preserving first and last knot.
Trying 43 values of C between 0.0001 and 10000000000.0
Best value found C = [  2.15443469e+09]
Out[22]:
SplineCalibratedClassifierCV(base_estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=942, verbose=0, warm_start=False),
               cv='prefit', method='logistic')

In [23]:
test_res_calib_alt = rfm_calib.predict_proba(X_test)[:,1]

In [24]:
plt.hist(test_res_calib_alt - test_res_calib)


Out[24]:
(array([     0.,      0.,      0.,      0.,      0.,  11946.,      0.,
             0.,      0.,      0.]),
 array([-0.5, -0.4, -0.3, -0.2, -0.1,  0. ,  0.1,  0.2,  0.3,  0.4,  0.5]),
 <a list of 10 Patch objects>)

The histogram of differences above confirms that the wrapper produces exactly the same calibrated test-set probabilities as applying the calibration function directly.

The default calibration uses a logistic regression fitting procedure, which is best if your metric of interest is log loss. If the Brier score is the metric of interest, it is better to use method='ridge', since that uses an L2 (mean squared error) loss as the calibration objective.
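For reference, the Brier score is simply the mean squared error between the predicted probability and the 0/1 outcome; a minimal manual version (a sketch) would be:

def manual_brier(y_true, p):
    # mean squared difference between predicted probability and the actual 0/1 outcome
    return np.mean((np.asarray(y_true) - p) ** 2)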


In [25]:
calibrate_rf_L2 = mli.prob_calibration_function(y_val,val_res,method='ridge')


Originally there were 496 knots.  Reducing to 200 while preserving first and last knot.
Trying 43 values of alpha between 1e-07 and 10000000.0
Best value found alpha = 0.001

In [26]:
test_res_calib_L2=calibrate_rf_L2(test_res_uncalib)

In [27]:
log_loss(y_test,test_res_calib_L2)


Out[27]:
0.24851897496515921

In [28]:
brier_score_loss(y_test,test_res_calib_L2)


Out[28]:
0.071981483302756707

As you can see, the L2 (ridge) calibration improves slightly on the Brier score loss, while doing significantly worse on the log loss. In general, the default logistic method is recommended for calibrating probabilities.
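A compact way to see the trade-off, using the arrays already computed above:

# Compare the logistic-calibrated and ridge-calibrated test-set probabilities on both metrics
for name, preds in [('logistic', test_res_calib), ('ridge', test_res_calib_L2)]:
    print(name, 'log_loss =', log_loss(y_test, preds), 'brier =', brier_score_loss(y_test, preds))

Below, we also refit the prefit SplineCalibratedClassifierCV wrapper (with the default logistic method) to confirm that it reproduces the logistic numbers on the test set.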


In [29]:
rfm_calib = mli.SplineCalibratedClassifierCV(rfmodel1, cv='prefit')
rfm_calib.fit(X_val,y_val)


Determining Calibration Function
Originally there were 496 knots.  Reducing to 200 while preserving first and last knot.
Trying 43 values of C between 0.0001 and 10000000000.0
Best value found C = [  2.15443469e+09]
Out[29]:
SplineCalibratedClassifierCV(base_estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=942, verbose=0, warm_start=False),
               cv='prefit', method='logistic')

In [30]:
test_res_calib_direct = rfm_calib.predict_proba(X_test)[:,1]

In [31]:
log_loss(y_test,test_res_calib_direct)


Out[31]:
0.24446136650832703

In [32]:
brier_score_loss(y_test,test_res_calib_direct)


Out[32]:
0.071984829479211476

At the heart of the spline-fitting component of the calibration is a penalized logistic regression, driven by LogisticRegressionCV in sklearn. It searches over a range of "C" values, where C is a parameter that controls the regularization (shrinkage) of the coefficients.

By default, prob_calibration_function searches over 43 values of C, logarithmically spaced between 10^-4 and 10^10 (in numpy terms, reg_param_vec = 10**np.linspace(-4,10,43)). However, you have the option to specify your own reg_param_vec vector of C values to try in the cross-validation.

Recall that when we ran

calibrate_rf = mli.prob_calibration_function(y_val, val_res)

earlier, we got a message of the form:

Trying 43 values of C between 0.0001 and 10000000000.0
Best value found C = [...]

If we wish, we can supply our own grid of C values to search a particular range more finely. Below we show how to do this with 101 points between 100 and 10000.


In [33]:
new_C_vec = np.linspace(100,10000,101)
new_C_vec


Out[33]:
array([   100.,    199.,    298.,    397.,    496.,    595.,    694.,
          793.,    892.,    991.,   1090.,   1189.,   1288.,   1387.,
         1486.,   1585.,   1684.,   1783.,   1882.,   1981.,   2080.,
         2179.,   2278.,   2377.,   2476.,   2575.,   2674.,   2773.,
         2872.,   2971.,   3070.,   3169.,   3268.,   3367.,   3466.,
         3565.,   3664.,   3763.,   3862.,   3961.,   4060.,   4159.,
         4258.,   4357.,   4456.,   4555.,   4654.,   4753.,   4852.,
         4951.,   5050.,   5149.,   5248.,   5347.,   5446.,   5545.,
         5644.,   5743.,   5842.,   5941.,   6040.,   6139.,   6238.,
         6337.,   6436.,   6535.,   6634.,   6733.,   6832.,   6931.,
         7030.,   7129.,   7228.,   7327.,   7426.,   7525.,   7624.,
         7723.,   7822.,   7921.,   8020.,   8119.,   8218.,   8317.,
         8416.,   8515.,   8614.,   8713.,   8812.,   8911.,   9010.,
         9109.,   9208.,   9307.,   9406.,   9505.,   9604.,   9703.,
         9802.,   9901.,  10000.])

In [34]:
calibrate_rf_new = mli.prob_calibration_function(y_val, val_res, reg_param_vec=new_C_vec)


Originally there were 496 knots.  Reducing to 200 while preserving first and last knot.
Trying 101 values of C between 100.0 and 10000.0
Best value found C = [ 4456.]

Note that the run-time of the calibration depends on the number of C values that are tried. In general, if you want to save some time, you can specify a range with fewer points in the reg_param_vec. For a more thorough search (which takes more time) you can specify more points.


In [35]:
test_res_calib_new=calibrate_rf_new(test_res_uncalib)

In [36]:
log_loss(y_test,test_res_calib_new)


Out[36]:
0.24468146139296634

We see that, in this case, the log loss is essentially unchanged from our previous calibration (in fact very slightly higher), so the finer search did not change the result materially.


Existing Sklearn Calibration Functionality

Note that sklearn has a CalibratedClassifierCV class, but it does not seem to work as well. Here we will use it in the same fashion as above: we assume we have a model that is already fit, and then use the additional data to calibrate it (this is why we use the option cv='prefit'). Unlike ml-insights, the sklearn functionality creates a new model that integrates the calibration, rather than keeping the model and the calibration function separate.


In [37]:
from sklearn.calibration import CalibratedClassifierCV

In [38]:
clf_isotonic = CalibratedClassifierCV(rfmodel1, method='isotonic',cv="prefit")
clf_isotonic.fit(X_val, y_val)
prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]

In [39]:
log_loss(y_test,prob_pos_isotonic)


Out[39]:
0.26663934423877206

In [40]:
brier_score_loss(y_test,prob_pos_isotonic)


Out[40]:
0.074458906233875938

The log loss and Brier score loss are actually worse (higher) than for the uncalibrated version, and considerably worse than for the calibrated approach using ml-insights.


In [41]:
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_val,'ro')
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib,prob_pos_isotonic,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')


Out[41]:
[<matplotlib.lines.Line2D at 0x12c1cac50>]

From the plot above, the isotonic approach gives more of a "step function" calibration than a smooth, spline-like curve. It also appears to be off by quite a bit in places, predicting a probability of about 0.2 where the actual (empirical) probability is closer to 0.05.


In [42]:
clf_sigmoid = CalibratedClassifierCV(rfmodel1, method='sigmoid',cv="prefit")
clf_sigmoid.fit(X_val, y_val)
prob_pos_sigmoid = clf_sigmoid.predict_proba(X_test)[:, 1]

In [43]:
log_loss(y_test,prob_pos_sigmoid)


Out[43]:
0.25396992513972455

In [44]:
brier_score_loss(y_test,prob_pos_sigmoid)


Out[44]:
0.073394487887474241

Again, the results are not very reassuring.


In [45]:
#plt.plot(np.linspace(.025,.975,20),emp_prob_vec_val,'ro')
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib,prob_pos_sigmoid,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))


Out[45]:
[<matplotlib.lines.Line2D at 0x129bd2fd0>]

Above, we see that the sigmoid option assumes a rather strict parametric form, and thus fits badly if the reality is not of that shape.
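(For context, the sigmoid method is Platt scaling: it fits a single logistic curve in the raw score s, so the calibration map is constrained to the form sketched below regardless of the shape of the empirical relationship.)

# Platt scaling's parametric form: p = 1 / (1 + exp(A*s + B)) with two fitted scalars A and B
def platt_curve(s, A, B):
    return 1.0 / (1.0 + np.exp(A * s + B))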


Cross Validated Calibration

One disadvantage of the above approach is that the model uses one set of data for training and a separate set for calibration, so neither step benefits from the full amount of available data. Below we use the ML-Insights capability to do cross-validated training and calibration, and compare it to the existing sklearn functionality.
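Conceptually, the cross-validated approach works roughly as follows (a sketch of the idea, not the package's exact implementation, and assuming sklearn.model_selection is available): out-of-fold predictions supply uncalibrated scores for every training point, the calibration function is fit to those scores, and the base model is then refit on all of the data.

from sklearn.model_selection import cross_val_predict

# Sketch only -- mli.SplineCalibratedClassifierCV (used below) handles all of this internally.
rf_sketch = RandomForestClassifier(n_estimators=500, class_weight='balanced_subsample',
                                   random_state=942, n_jobs=-1)

# 1) Out-of-fold scores for every point in the combined train+validation data
oof_scores = cross_val_predict(rf_sketch, X_train_val, y_train_val, cv=5,
                               method='predict_proba')[:, 1]

# 2) Fit the calibration function to those out-of-fold scores
calib_func_sketch = mli.prob_calibration_function(y_train_val, oof_scores)

# 3) Refit the model on all of the data and calibrate its test-set scores
rf_sketch.fit(X_train_val, y_train_val)
test_probs_sketch = calib_func_sketch(rf_sketch.predict_proba(X_test)[:, 1])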



In [46]:
rfm = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942, n_jobs=-1 )
rfm_cv = mli.SplineCalibratedClassifierCV(rfm)
rfm_cv.fit(X_train_val,y_train_val)


training fold 1 of 5
training fold 2 of 5
training fold 3 of 5
training fold 4 of 5
training fold 5 of 5
Training Full Model
Determining Calibration Function
Originally there were 959 knots.  Reducing to 200 while preserving first and last knot.
Trying 43 values of C between 0.0001 and 10000000000.0
Best value found C = [  1.00000000e+10]
Out[46]:
SplineCalibratedClassifierCV(base_estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=-1,
            oob_score=False, random_state=942, verbose=0, warm_start=False),
               cv=5, method='logistic')

In [47]:
test_res_uncalib_cv = rfm_cv.uncalibrated_classifier.predict_proba(X_test)[:,1]

In [48]:
#test_res_calib_cv = calibration_function(test_res_uncalib_cv)
test_res_calib_cv = rfm_cv.predict_proba(X_test)[:,1]

In [49]:
#log_loss(y_test,test_res_uncalib_cv)

In [50]:
#brier_score_loss(y_test,test_res_uncalib_cv)

In [51]:
log_loss(y_test,test_res_calib_cv)


Out[51]:
0.24330933810106203

In [52]:
brier_score_loss(y_test,test_res_calib_cv)


Out[52]:
0.071831705602529008

In [53]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
plt.plot(np.linspace(0,1,101),rfm_cv.calib_func(np.linspace(0,1,101)))


Out[53]:
[<matplotlib.lines.Line2D at 0x13161fb00>]


In [54]:
rfm = RandomForestClassifier(n_estimators = 500, class_weight='balanced_subsample', random_state=942)
rfm_cv_L2 = mli.SplineCalibratedClassifierCV(rfm, method='ridge')
rfm_cv_L2.fit(X_train_val,y_train_val)


training fold 1 of 5
training fold 2 of 5
training fold 3 of 5
training fold 4 of 5
training fold 5 of 5
Training Full Model
Determining Calibration Function
Originally there were 977 knots.  Reducing to 200 while preserving first and last knot.
Trying 43 values of alpha between 1e-07 and 10000000.0
Best value found alpha = 0.004641588833612773
Out[54]:
SplineCalibratedClassifierCV(base_estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,
            oob_score=False, random_state=942, verbose=0, warm_start=False),
               cv=5, method='ridge')

In [55]:
test_res_calib_cv_L2 = rfm_cv_L2.predict_proba(X_test)[:,1]

In [56]:
log_loss(y_test,test_res_calib_cv_L2)


Out[56]:
0.24740431124716544

In [57]:
brier_score_loss(y_test,test_res_calib_cv_L2)


Out[57]:
0.071814391744741871

In [58]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
plt.plot(np.linspace(0,1,101),rfm_cv.calib_func(np.linspace(0,1,101)),'g')
plt.plot(np.linspace(0,1,101),rfm_cv_L2.calib_func(np.linspace(0,1,101)),'r')


Out[58]:
[<matplotlib.lines.Line2D at 0x12b9c7f60>]

In [59]:
clf_isotonic_xval = CalibratedClassifierCV(rfm, method='isotonic', cv=5)
clf_isotonic_xval.fit(X_train_val,y_train_val)
prob_pos_isotonic_xval = clf_isotonic_xval.predict_proba(X_test)[:, 1]

In [60]:
log_loss(y_test,prob_pos_isotonic_xval)


Out[60]:
0.24809833644158982

In [61]:
brier_score_loss(y_test,prob_pos_isotonic_xval)


Out[61]:
0.072170301163490103

In [62]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib_cv,prob_pos_isotonic_xval,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
#plt.plot(np.linspace(0,1,101),calibration_function(np.linspace(0,1,101)),'g')


Out[62]:
[<matplotlib.lines.Line2D at 0x131c17f28>]

In [63]:
clf_sigmoid_xval = CalibratedClassifierCV(rfm, method='sigmoid', cv=5)
clf_sigmoid_xval.fit(X_train_val,y_train_val)
prob_pos_sigmoid_xval = clf_sigmoid_xval.predict_proba(X_test)[:, 1]

In [64]:
plt.plot(np.linspace(0,1,101),np.linspace(0,1,101),'k')
plt.plot(test_res_uncalib_cv,prob_pos_sigmoid_xval,'c.')
plt.plot(np.linspace(.025,.975,20),emp_prob_vec_test,'g+')
#plt.plot(np.linspace(0,1,101),calibrate_rf(np.linspace(0,1,101)))
#plt.plot(np.linspace(0,1,101),calibration_function(np.linspace(0,1,101)),'g')


Out[64]:
[<matplotlib.lines.Line2D at 0x131a62ef0>]

In [65]:
log_loss(y_test,prob_pos_sigmoid_xval)


Out[65]:
0.25331592414895782

In [66]:
brier_score_loss(y_test,prob_pos_sigmoid_xval)


Out[66]:
0.073210966238481959

As with the prefit approach, the cross-validated ML-Insights calibration achieves a lower log loss and Brier score on the test set than either of the sklearn CalibratedClassifierCV options.