Title: Calibrate Predicted Probabilities
Slug: calibrate_predicted_probabilities
Summary: How to calibrate the predicted probabilities of a naive Bayes classifier in scikit-learn
Date: 2017-09-22 12:00
Category: Machine Learning
Tags: Naive Bayes
Authors: Chris Albon
Class probabilities are a common and useful part of machine learning models. In scikit-learn, most learning algorithms allow us to see the predicted probabilities of class membership using predict_proba. This can be extremely useful if, for instance, we want to predict a certain class only when the model's predicted probability for that class is over 90%. However, some models, including naive Bayes classifiers, output probabilities that are poorly calibrated, meaning they do not match the frequencies observed in the real world. That is, predict_proba might predict an observation has a 0.70 chance of being a certain class, when the reality is that it is 0.10 or 0.99. Specifically in naive Bayes, while the ranking of predicted probabilities for the different target classes is valid, the raw predicted probabilities tend to take on extreme values close to 0 and 1.
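The thresholding idea described above can be sketched as follows. This is only an illustration: the 0.90 cutoff comes from the example in the text, while the variable names and the use of -1 as an "abstain" label are assumptions made for this sketch and not scikit-learn conventions.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load the iris data and fit an (uncalibrated) Gaussian naive Bayes model
iris = datasets.load_iris()
X, y = iris.data, iris.target
model = GaussianNB().fit(X, y)

# One row per observation, one column per class; each row sums to 1
proba = model.predict_proba(X)

# Only commit to a class when its predicted probability exceeds the cutoff
threshold = 0.90  # cutoff taken from the example in the text
confident = proba.max(axis=1) > threshold

# -1 is a hypothetical placeholder meaning "no confident prediction"
predictions = np.where(confident, model.classes_[proba.argmax(axis=1)], -1)

print("Observations above the cutoff:", int(confident.sum()), "of", len(X))
```

A rule like this is only as trustworthy as the probabilities behind it, which is why calibration matters for models such as naive Bayes.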
To obtain meaningful predicted probabilities we need to conduct what is called calibration. In scikit-learn we can use the CalibratedClassifierCV class to create well-calibrated predicted probabilities using k-fold cross-validation. In CalibratedClassifierCV, the training sets are used to train the model and the test sets are used to calibrate the predicted probabilities. The returned predicted probabilities are the average of the k folds.
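The averaging across folds can be checked by hand. The sketch below assumes the default ensembling behavior of CalibratedClassifierCV (newer scikit-learn releases expose an ensemble parameter that can change it) and uses the fitted calibrated_classifiers_ attribute, which holds one calibrated model per fold. The variable names are chosen for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Fit one naive Bayes model plus one sigmoid calibrator per fold (k = 2)
calibrated = CalibratedClassifierCV(GaussianNB(), cv=2, method='sigmoid')
calibrated.fit(X, y)

# Calibrated probabilities from each fold's model
per_fold = [fold_model.predict_proba(X)
            for fold_model in calibrated.calibrated_classifiers_]

# Average the per-fold probabilities and compare with predict_proba
manual_average = np.mean(per_fold, axis=0)
matches = np.allclose(manual_average, calibrated.predict_proba(X))
print("Manual average matches predict_proba:", matches)
```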
In [1]:
# Load libraries
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
In [2]:
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target
In [3]:
# Create Gaussian Naive Bayes object
clf = GaussianNB()
In [4]:
# Create calibrated cross-validation with sigmoid calibration
clf_sigmoid = CalibratedClassifierCV(clf, cv=2, method='sigmoid')
In [5]:
# Train the model and calibrate its probabilities on each fold
clf_sigmoid.fit(X, y)
Out[5]:
In [6]:
# Create new observation
new_observation = [[2.6, 2.6, 2.6, 0.4]]
In [7]:
# View calibrated probabilities
clf_sigmoid.predict_proba(new_observation)
Out[7]:
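To see what calibration changes for this observation, one option is to print the uncalibrated and calibrated probabilities side by side. This is a minimal sketch that refits both models from scratch so it can run on its own. The names raw_model, raw_proba, and calibrated_proba are illustrative, and no particular output values are claimed here.

```python
from sklearn import datasets
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

# Load data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Uncalibrated Gaussian naive Bayes
raw_model = GaussianNB().fit(X, y)

# Same model type wrapped in sigmoid calibration with 2-fold cross-validation
calibrated_model = CalibratedClassifierCV(GaussianNB(), cv=2, method='sigmoid')
calibrated_model.fit(X, y)

# Same observation as above
new_observation = [[2.6, 2.6, 2.6, 0.4]]

raw_proba = raw_model.predict_proba(new_observation)
calibrated_proba = calibrated_model.predict_proba(new_observation)

print("Raw probabilities:       ", raw_proba)
print("Calibrated probabilities:", calibrated_proba)
```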