In [1]:
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))

from formats import load_style
load_style(css_style='custom2.css', plot_style=False)


Out[1]:

In [2]:
os.chdir(path)

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics as metrics
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split

# prevent scientific notations
pd.set_option('display.float_format', lambda x: '%.3f' % x)

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,matplotlib,xgboost,lightgbm


Ethen 2019-12-20 10:06:34 

CPython 3.6.4
IPython 7.9.0

numpy 1.16.5
pandas 0.25.0
sklearn 0.21.2
matplotlib 3.1.1
xgboost 0.81
lightgbm 2.2.4

Probability Calibration

Well calibrated classifiers are classifiers for which the output probability (for sklearn, this is the output of the predict_proba method) can be directly interpreted as a confidence level. A well calibrated (binary) classifier should classify samples such that, among the samples to which the model gave a predicted probability close to 0.8, approximately 80% actually belong to the positive class.
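
To make this definition concrete, here is a tiny sketch with made-up predicted probabilities and labels (not produced by any model in this notebook): for a well calibrated model, the observed positive rate among samples scored around 0.8 should itself be around 0.8.

import numpy as np

# hypothetical predicted probabilities and true labels, purely for illustration
y_prob = np.array([0.82, 0.78, 0.81, 0.79, 0.80, 0.83, 0.77, 0.80, 0.79, 0.81])
y_true = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])

# select the samples whose predicted probability is close to 0.8 and
# compare the observed positive rate against that predicted value
mask = (y_prob >= 0.75) & (y_prob <= 0.85)
print(y_true[mask].mean())  # 0.8 in this toy example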

Some examples of why checking whether probabilities are calibrated matters:

When looking up the weather forecast, we usually get a precipitation probability. e.g. If the weather forecast says there's an 80% chance of rain, how trustworthy is that probability? If we take 100 days that were claimed to have an 80% chance of rain and around 80 of them actually turned out rainy, then we can trust that particular rain forecast.

Another example: after training, say, a binary classifier, the output we get from predict_proba might just be a ranking score instead of a well calibrated probability. A ranking score evaluates how well the model scores positive examples above negative ones, whereas a calibrated probability evaluates how closely the scores generated by our model resemble an actual probability. Obtaining a well calibrated probability becomes important when we wish to do something along the lines of:

  • Use the probability threshold to inform some action. e.g. We'll reject the loan application if the predicted default probability is higher than 50%.
  • Use a ranking formula that is not solely based on the original model's score. In some cases, we may wish to combine the score/probability with additional factors for ranking purposes. e.g. In an advertising cost per click model, we rank the ads by their expected value, the probability of clicking on the ad multiplied by the ad fee for the click (a small sketch of this follows the list).
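
As a small illustration of the second point, a sketch of ranking ads by expected value; the click probabilities and fees here are made-up numbers:

import pandas as pd

# hypothetical (ideally well calibrated) click probabilities and per-click fees
df_ads = pd.DataFrame({
    'ad': ['a', 'b', 'c'],
    'p_click': [0.02, 0.05, 0.01],
    'fee_per_click': [2.5, 0.8, 6.0]
})

# expected value = probability of a click times the fee collected for that click;
# a mis-calibrated p_click would distort this ranking
df_ads['expected_value'] = df_ads['p_click'] * df_ads['fee_per_click']
print(df_ads.sort_values('expected_value', ascending=False))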

Data Preparation

We'll be using the credit card default dataset from UCI; the same dataset can also be downloaded from Kaggle.


In [3]:
input_path = 'UCI_Credit_Card.csv'
df = pd.read_csv(input_path)
print(df.shape)
df.head()


(30000, 25)
Out[3]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
0 1 20000.000 2 2 1 24 2 2 -1 -1 ... 0.000 0.000 0.000 0.000 689.000 0.000 0.000 0.000 0.000 1
1 2 120000.000 2 2 2 26 -1 2 0 0 ... 3272.000 3455.000 3261.000 0.000 1000.000 1000.000 1000.000 0.000 2000.000 1
2 3 90000.000 2 2 2 34 0 0 0 0 ... 14331.000 14948.000 15549.000 1518.000 1500.000 1000.000 1000.000 1000.000 5000.000 0
3 4 50000.000 2 2 1 37 0 0 0 0 ... 28314.000 28959.000 29547.000 2000.000 2019.000 1200.000 1100.000 1069.000 1000.000 0
4 5 50000.000 1 2 1 57 -1 0 -1 0 ... 20940.000 19146.000 19131.000 2000.000 36681.000 10000.000 9000.000 689.000 679.000 0

5 rows × 25 columns


In [4]:
id_cols = ['ID']
cat_cols = ['EDUCATION', 'SEX', 'MARRIAGE']
num_cols = [
    'LIMIT_BAL', 'AGE',
    'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
    'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
    'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'
]
label_col = 'default.payment.next.month'

input_cols = num_cols + cat_cols

In [5]:
label_distr = np.round(np.bincount(df[label_col]) / df.shape[0], 3)
print('label distribution: ', label_distr)


label distribution:  [0.779 0.221]

We'll generate a three-way train/validation/test split. The validation set created here is mainly used to calibrate our model. As per good practice, we should not use the same dataset for both training and calibration.


In [6]:
test_size = 0.1
val_size = 0.3
random_state = 1234

df_train, df_test = train_test_split(
    df,
    test_size=test_size,
    random_state=random_state,
    stratify=df[label_col])

df_train, df_val = train_test_split(
    df_train,
    test_size=val_size,
    random_state=random_state,
    stratify=df_train[label_col])

print('train shape: ', df_train.shape)
print('validation shape: ', df_val.shape)
print('test shape: ', df_test.shape)

df_train.head()


train shape:  (18900, 25)
validation shape:  (8100, 25)
test shape:  (3000, 25)
Out[6]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
9256 9257 20000.000 2 3 1 23 1 2 2 -2 ... 0.000 0.000 0.000 480.000 0.000 0.000 0.000 0.000 0.000 1
23220 23221 150000.000 2 3 2 35 -1 2 -1 2 ... 1143.000 163.000 2036.000 0.000 2264.000 0.000 163.000 2036.000 0.000 0
11074 11075 260000.000 2 2 1 43 2 2 2 2 ... 2500.000 2500.000 2500.000 0.000 0.000 0.000 0.000 0.000 0.000 1
1583 1584 50000.000 2 1 2 70 2 2 0 0 ... 17793.000 18224.000 18612.000 0.000 2200.000 700.000 700.000 674.000 608.000 0
8623 8624 390000.000 2 2 1 45 1 -2 -2 -2 ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 3971.000 1

5 rows × 25 columns

Model Training

We'll train a binary classifier to predict default payment and evaluate it using some common evaluation metrics. In this example, we'll focus only on the widely used boosted tree implementations, xgboost and lightgbm.


In [7]:
# parameters chosen in an ad hoc manner
xgb_params = {
    'n_estimators': 300,
    'colsample_bytree': 0.7,
    'subsample': 0.8,
    'max_depth': 3,
    'learning_rate': 0.1
}
xgb = XGBClassifier(**xgb_params)
xgb.fit(df_train[input_cols], df_train[label_col])


Out[7]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bytree=0.7, gamma=0, learning_rate=0.1,
              max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
              n_estimators=300, n_jobs=1, nthread=None,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
              subsample=0.8)

In [8]:
lgb_params = {
    'boosting_type': 'gbdt',
    'n_estimators': 300,
    'learning_rate': 0.1,
    'min_child_samples': 51,
    'max_depth': 3,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'num_leaves': 76
}

# configuring the model
lgb = LGBMClassifier(**lgb_params)
lgb.fit(df_train[input_cols], df_train[label_col])


Out[8]:
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.7,
               importance_type='split', learning_rate=0.1, max_depth=3,
               min_child_samples=51, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=300, n_jobs=-1, num_leaves=76, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=0.8, subsample_for_bin=200000, subsample_freq=0)

In [9]:
def compute_score(model, X, y, verbose=True, round_digits=3):
    """
    The following blog contains introduction to various evaluation metrics
    for binary classifier.
    
    https://neptune.ml/blog/evaluation-metrics-binary-classification
    """
    y_score = model.predict_proba(X)[:, 1]

    auc = round(metrics.roc_auc_score(y, y_score), round_digits)
    log_loss = round(metrics.log_loss(y, y_score), round_digits)
    brier = round(metrics.brier_score_loss(y, y_score), round_digits)

    precision, recall, threshold = metrics.precision_recall_curve(y, y_score)
    f1 = 2 * (precision * recall) / (precision + recall)

    mask = ~np.isnan(f1)
    f1 = f1[mask]
    precision = precision[mask]
    recall = recall[mask]

    best_index = np.argmax(f1)
    precision = round(precision[best_index], round_digits)
    recall = round(recall[best_index], round_digits)
    f1 = round(f1[best_index], round_digits)

    if verbose:
        print('auc: ', auc)
        print('precision: ', precision)
        print('recall: ', recall)
        print('f1: ', f1)
        print('brier: ', brier)
        print('log_loss: ', log_loss)

    return {
        'auc': auc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'log_loss': log_loss,
        'brier': brier
    }

In [10]:
# evaluate the metrics for training and test set
estimators_names = [
    (xgb, 'xgb'),
    (lgb, 'lgb')
]
df_groups = (df_train, df_test), ('train', 'test')

estimator_metrics = []
for estimator, name in estimators_names:
    for df_group, df_name in zip(*df_groups):
        metric_dict = compute_score(estimator, df_group[input_cols], df_group[label_col], verbose=False)
        metric_dict['name'] = name + '_' + df_name
        estimator_metrics.append(metric_dict)
    
df_metrics = pd.DataFrame(estimator_metrics)
df_metrics


Out[10]:
auc precision recall f1 log_loss brier name
0 0.843 0.566 0.627 0.595 0.386 0.121 xgb_train
1 0.771 0.561 0.520 0.539 0.437 0.137 xgb_test
2 0.832 0.516 0.671 0.583 0.395 0.124 lgb_train
3 0.772 0.518 0.554 0.535 0.437 0.137 lgb_test

Calibration Curve

A calibration curve (sklearn's calibration_curve), also known as a reliability diagram, is a common way to visualize whether the scores output by a model are well calibrated.

To construct the graph, we discretize the predicted probability into $n$ bins ($n$ is configurable, usually 10). Instances with a predicted value between 0 and 0.1 are placed in the first bin, those between 0.1 and 0.2 in the second bin, and so on. Then for each bin, the mean predicted probability is plotted against the fraction of positive cases in that bin. If the model is well calibrated, the points will fall near the diagonal line.
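
As a rough sketch of what this binning boils down to (the exact edge handling in scikit-learn's calibration_curve may differ slightly):

import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Mean predicted probability vs. observed positive rate for each bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each prediction to a bin using the interior bin edges
    bin_ids = np.digitize(y_prob, bins[1:-1])

    prob_pred, prob_true = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        # skip empty bins, similar to what calibration_curve does
        if mask.sum() > 0:
            prob_pred.append(y_prob[mask].mean())
            prob_true.append(y_true[mask].mean())

    return np.array(prob_pred), np.array(prob_true)

The cell below relies on scikit-learn's calibration_curve instead of a hand-rolled version like this.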


In [11]:
def plot_calibration_curve(estimators_names, X, y):

    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    ax2 = plt.subplot2grid((3, 1), (2, 0))

    # estimator_metrics stores list of dict, e.g.
    #     [{'auc': 0.776,
    #       'precision': 0.566,
    #       'recall': 0.523,
    #       'f1': 0.543,
    #       'name': 'xgb'}]
    estimator_metrics = []
    for estimator, name in estimators_names:
        proba = estimator.predict_proba(X)[:, 1]
        prob_true, prob_pred = calibration_curve(y, proba, n_bins=10)

        metric_dict = compute_score(estimator, X, y, verbose=False)
        metric_dict['name'] = name
        estimator_metrics.append(metric_dict)

        ax1.plot(prob_pred, prob_true, 's-', label=name)
        ax2.hist(proba, range=(0, 1), bins=10, label=name, histtype='step', lw=2)

    ax1.plot([0, 1], [0, 1], 'k:', label='Perfectly calibrated')

    ax1.set_xlabel('Fraction of positives (Predicted)')
    ax1.set_ylabel('Fraction of positives (Actual)')
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend(loc='lower right')
    ax1.set_title('Calibration Plots (Reliability Curve)')

    ax2.set_xlabel('Predicted scores')
    ax2.set_ylabel('Count')
    ax2.set_title('Histogram of Predicted Scores')
    ax2.legend(loc='upper center', ncol=2)

    plt.tight_layout()
    df_metrics = pd.DataFrame(estimator_metrics)
    return df_metrics

In [12]:
# change default style figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12

# calibration curve for the training set
estimators_names = [
    (xgb, 'xgb'),
    (lgb, 'lgb')
]
df_metrics = plot_calibration_curve(estimators_names, df_train[input_cols], df_train[label_col])
df_metrics


Out[12]:
auc precision recall f1 log_loss brier name
0 0.843 0.566 0.627 0.595 0.386 0.121 xgb
1 0.832 0.516 0.671 0.583 0.395 0.124 lgb

If we look at the calibration plot, we can see there are some points that fall above and below the diagonal line.

  • Below the diagonal: The model has over-forecast; the probabilities are too large.
  • Above the diagonal: The model has under-forecast; the probabilities are too small.

But overall, the predicted scores look fairly well calibrated, as the points fall close to the diagonal line.

Model Calibration

Calibration of prediction scores is a rescaling operation applied after the predictions have been made by a predictive model. The key requirement when training the calibration model is that the data used to fit the original estimator and the data used for calibration do not overlap. e.g.

  • We can split the data into a training and a validation set. After our base model is trained on the training set, its predictions on the validation set are used to fit the calibration model.
  • Or do it in a cross validation fashion, where the data is split into $C$ folds. For each fold, one part is held aside as a validation set while the training is performed on the other $C-1$ folds. After repeating the process for all $C$ folds, the final probability is the arithmetic mean of the calibrated classifiers' predictions (a short sketch of this option follows the list).
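
For reference, the cross validation flavor is available out of the box in scikit-learn. A minimal sketch, not used in the rest of this notebook: the cv value of 5 is an arbitrary choice for illustration, and .values is passed to sidestep the dataframe-to-array conversion quirks discussed later on.

from sklearn.calibration import CalibratedClassifierCV

# clones of the base estimator are fit on C-1 folds and calibrated on the
# held-out fold; predict_proba then averages the calibrated predictions
lgb_cv_sigmoid = CalibratedClassifierCV(LGBMClassifier(**lgb_params), method='sigmoid', cv=5)
lgb_cv_sigmoid.fit(df_train[input_cols].values, df_train[label_col].values)
cv_proba = lgb_cv_sigmoid.predict_proba(df_test[input_cols].values)[:, 1]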

As for the calibration model, two popular ones are Platt Scaling and Isotonic Regression.

  • Platt Scaling: At a high level, Platt Scaling amounts to training a logistic regression on the original classifier's output with respect to the true class labels.
  • Isotonic Regression: With this approach, the idea is to fit a piecewise constant, non-decreasing function, i.e. to merge similar scores into bins such that the mapping becomes monotonically increasing. e.g. The first bin may have the range [0, 0.2] and probability 0.15, meaning that any instance with a score between 0 and 0.2 should be assigned a probability estimate of 0.15. A short sketch of both options follows this list.
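
To make the two options concrete, a minimal sketch of what each calibrator boils down to, fit on the held-out validation scores of the xgb model trained above. This is for illustration only; the actual calibration below goes through the CalibratedClassifierCV-style wrapper.

from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# uncalibrated scores from the base model on the held-out validation set
val_score = xgb.predict_proba(df_val[input_cols])[:, 1]
val_label = df_val[label_col].values

# Platt scaling: a logistic regression on the raw score versus the true label
platt = LogisticRegression(solver='lbfgs')
platt.fit(val_score.reshape(-1, 1), val_label)

# Isotonic regression: a piecewise constant, non-decreasing mapping of the score;
# the float64 cast mirrors the dtype fix discussed later in this notebook
isotonic = IsotonicRegression(out_of_bounds='clip')
isotonic.fit(val_score.astype(np.float64), val_label)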

To evaluate whether we successfully calibrated our model, we should check various evaluation metrics for the calibrated model. Ranking metrics such as AUC should remain roughly the same, whereas probability-related metrics such as log-loss should improve.

In general, Platt scaling is preferable if the calibration curve has a sigmoid shape and when there is little calibration data, while isotonic regression, being a non-parametric method, is preferable for non-sigmoid calibration curves and in situations where a large amount of additional data is available for calibration. But again, it doesn't hurt to try both approaches.

The following code chunk shows how to perform probability calibration for scikit-learn like estimators. Other machine learning packages such as h2o also provide this capability.

Customizing Scikit-learn Calibrated Classifier

Here, we'll be making some modifications to scikit-learn's CalibratedClassifierCV class to make it suitable for our use case, for two reasons.

The first: if we use the prefit option, where our base estimator has already been fitted and all of the data passed in is used for calibration, we run into a feature name mismatch error. This is because inside the class, the check_X_y method converts our input dataframe into a numpy array, so the feature names from the dataframe no longer line up with what the fitted base estimator expects.

# https://github.com/dmlc/xgboost/issues/2334#issuecomment-406282203
# Calibrated with sigmoid calibration
from sklearn.calibration import CalibratedClassifierCV

xgb_sigmoid = CalibratedClassifierCV(xgb, method='sigmoid', cv='prefit')
xgb_sigmoid.fit(df_val[input_cols], df_val[label_col])
# doing this will throw feature name mismatch error at the time of writing

The second customization concerns the _CalibratedClassifier that is used within CalibratedClassifierCV; this is the class that actually fits the calibration model. When we try the isotonic option for the calibration model, a dtype mismatch error occurs. This is because the predicted score from our boosted tree model is of type np.float32 (float), whereas scikit-learn's isotonic regression expects np.float64 (double). Hence the fix is to coerce the predicted score to a double.

from sklearn.calibration import _CalibratedClassifier

calibrated_classifier = _CalibratedClassifier(xgb, method='isotonic')
calibrated_classifier.fit(df_val[input_cols], df_val[label_col])
# ValueError: Buffer dtype mismatch, expected 'float' but got 'double'

In [13]:
from sklearn.utils.validation import check_is_fitted
from sklearn.calibration import (CalibratedClassifierCV,
                                 _CalibratedClassifier,
                                 _SigmoidCalibration)
from sklearn.preprocessing import LabelEncoder, LabelBinarizer, label_binarize
from sklearn.isotonic import IsotonicRegression


class CalibratedPrefitClassifier(CalibratedClassifierCV):

    def __init__(self, base_estimator, method):
        super().__init__(base_estimator, method, cv='prefit')

    def fit(self, X, y, sample_weight=None):
        """
        Fit the calibrated model

        Parameters
        ----------
        X : pd.DataFrame, shape (n_samples, n_features)
            Training data.

        y : array-like, shape (n_samples,)
            Target values.

        sample_weight : array-like of shape (n_samples,), default=None
            Sample weights. If None, then samples are equally weighted.

        Returns
        -------
        self : object
            Returns an instance of self.
        """
        # scikit-learn has a check_X_y here that will convert our dataframe
        # to a numpy array, remove that part to prevent the conversion

        le = LabelBinarizer().fit(y)
        self.classes_ = le.classes_

        # the main difference from scikit-learn is that our cv parameter will
        # always be 'prefit', hence we don't need to handle the cross validation case,
        # we still honor that original attribute list type, even though it isn't necessary
        self.calibrated_classifiers_ = []
        calibrated_classifier = CalibratedClassifier(self.base_estimator, method=self.method)
        if sample_weight is not None:
            calibrated_classifier.fit(X, y, sample_weight)
        else:
            calibrated_classifier.fit(X, y)

        self.calibrated_classifiers_.append(calibrated_classifier)
        return self

    def predict_proba(self, X):
        """
        Posterior probabilities of classification

        This function returns posterior probabilities of classification
        according to each class on an array of test vectors X.

        Parameters
        ----------
        X : pd.DataFrame, shape (n_samples, n_features)
            The samples.

        Returns
        -------
        proba : array, shape (n_samples, n_classes)
            The predicted probas.
        """
        # the check_is_fitted method might change to not needing to pass
        # the attribute that we're checking for in the future
        check_is_fitted(self, ['classes_', 'calibrated_classifiers_'])

        # Compute the arithmetic mean of the predictions of the calibrated
        # classifiers
        mean_proba = np.zeros((X.shape[0], len(self.classes_)))
        for calibrated_classifier in self.calibrated_classifiers_:
            proba = calibrated_classifier.predict_proba(X)
            mean_proba += proba

        mean_proba /= len(self.calibrated_classifiers_)
        return mean_proba


class CalibratedClassifier(_CalibratedClassifier):

    def fit(self, X, y, sample_weight=None):
        """
        Calibrate the fitted model

        Parameters
        ----------
        X : pd.DataFrame, shape (n_samples, n_features)
            Training data.

        y : array-like, shape (n_samples,)
            Target values.

        sample_weight : array-like of shape (n_samples,), default=None
            Sample weights. If None, then samples are equally weighted.

        Returns
        -------
        self : object
            Returns an instance of self.
        """
        self.label_encoder_ = LabelEncoder()
        if self.classes is None:
            self.label_encoder_.fit(y)
        else:
            self.label_encoder_.fit(self.classes)

        self.classes_ = self.label_encoder_.classes_
        Y = label_binarize(y, self.classes_)

        # df is the predicted score/probability of the base_estimator
        df, idx_pos_class = self._preproc(X)

        self.calibrators_ = []
        for k, this_df in zip(idx_pos_class, df.T):
            if self.method == 'isotonic':
                # coerce the predicted score to a double to prevent type error
                this_df = this_df.astype(np.float64)
                calibrator = IsotonicRegression(out_of_bounds='clip')
            elif self.method == 'sigmoid':
                calibrator = _SigmoidCalibration()
            else:
                raise ValueError('method should be "sigmoid" or '
                                 '"isotonic". Got %s.' % self.method)
            calibrator.fit(this_df, Y[:, k], sample_weight)
            self.calibrators_.append(calibrator)

        return self

We calibrate the probabilities using both Platt Scaling and Isotonic Regression on the validation set and evaluate the resulting calibration curves and evaluation metrics on the test set.


In [14]:
xgb_sigmoid = CalibratedPrefitClassifier(xgb, method='sigmoid')
xgb_sigmoid.fit(df_val[input_cols], df_val[label_col])


Out[14]:
CalibratedPrefitClassifier(base_estimator=XGBClassifier(base_score=0.5,
                                                        booster='gbtree',
                                                        colsample_bylevel=1,
                                                        colsample_bytree=0.7,
                                                        gamma=0,
                                                        learning_rate=0.1,
                                                        max_delta_step=0,
                                                        max_depth=3,
                                                        min_child_weight=1,
                                                        missing=None,
                                                        n_estimators=300,
                                                        n_jobs=1, nthread=None,
                                                        objective='binary:logistic',
                                                        random_state=0,
                                                        reg_alpha=0,
                                                        reg_lambda=1,
                                                        scale_pos_weight=1,
                                                        seed=None, silent=True,
                                                        subsample=0.8),
                           method='sigmoid')

In [15]:
xgb_isotonic = CalibratedPrefitClassifier(xgb, method='isotonic')
xgb_isotonic.fit(df_val[input_cols], df_val[label_col])


Out[15]:
CalibratedPrefitClassifier(base_estimator=XGBClassifier(base_score=0.5,
                                                        booster='gbtree',
                                                        colsample_bylevel=1,
                                                        colsample_bytree=0.7,
                                                        gamma=0,
                                                        learning_rate=0.1,
                                                        max_delta_step=0,
                                                        max_depth=3,
                                                        min_child_weight=1,
                                                        missing=None,
                                                        n_estimators=300,
                                                        n_jobs=1, nthread=None,
                                                        objective='binary:logistic',
                                                        random_state=0,
                                                        reg_alpha=0,
                                                        reg_lambda=1,
                                                        scale_pos_weight=1,
                                                        seed=None, silent=True,
                                                        subsample=0.8),
                           method='isotonic')

In [16]:
estimators_names = [
    (xgb, 'xgb'),
    (xgb_sigmoid, 'xgb_sigmoid'),
    (xgb_isotonic, 'xgb_isotonic')
]
df_metrics = plot_calibration_curve(estimators_names, df_test[input_cols], df_test[label_col])
df_metrics


Out[16]:
auc precision recall f1 log_loss brier name
0 0.771 0.561 0.520 0.539 0.437 0.137 xgb
1 0.771 0.561 0.520 0.539 0.440 0.138 xgb_sigmoid
2 0.771 0.567 0.512 0.538 0.435 0.137 xgb_isotonic

From the looks of things, the original boosted tree model doesn't gain much from calibration. The log-loss of the original model versus the log-loss of the calibrated models (sigmoid and isotonic being the two different ways of performing the calibration) remains roughly the same, and the fact that the plotted calibration curves also look similar confirms this.

The sklearn documentation also contains a page that looks at how well various classifiers' predicted scores are calibrated.

Reference