Precision, Recall, F-1 Score and more

In this notebook we're going to cover the concepts of Precision and Recall and how we can use them to measure the overall performance of our model. These measures are useful beyond the traditional measure of accuracy.

Precision

The precision of our model is the ratio of True Positive values over all examples the model predicted as positive (True Positives plus False Positives).

$$ \frac{T_{p}}{T_{p} + F_{p}} $$

This is not to be confused with precision in the statistical sense, which is the inverse variance of a distribution, $\frac{1}{\sigma^2}$. That is a different concept altogether; the precision discussed here is a measure of our model's overall performance.

Precision asks the question "What proportion of positive identifications was actually correct?"
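
To make this concrete, here is a small sketch using made-up labels (not the dataset we load later in this notebook) that computes precision both by hand and with scikit-learn's precision_score:

# Toy example with made-up labels (not the dataset used later in this notebook).
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

# Count True Positives and False Positives by hand:
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)   # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)   # 2
print(tp / (tp + fp))                   # 0.6
print(precision_score(y_true, y_pred))  # 0.6 as well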

Recall

Recall is another ratio: this time, the number of True Positives over the sum of the True Positives and False Negatives.

$$ \frac{T_{p}}{T_{p} + F_{n}} $$

A system with high recall but low precision returns many results, but many of the predicted labels are incorrect when compared to the true labels.

Recall attempts to answer the question "What proportion of actual positives was identified correctly?"
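
Continuing the toy example from the precision section, recall can be computed the same way (recall_score also lives in scikit-learn's metrics module):

# Same made-up labels as in the precision example above.
from sklearn.metrics import recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# Count True Positives and False Negatives by hand:
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)   # 3
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)   # 1
print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # 0.75 as well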

Model Performance Interpretation

High Precision

A system with high precision returns relatively few results, but the results it does return agree closely with the labels provided with the evaluated examples. High precision typically corresponds to a low False Positive rate overall.

Perfect Precision and Recall

A system that achieves a value of 1.0 (100%) for both precision and recall is said to be a perfect predictor of the underlying data generating distribution. In reality, there is likely enough noise in the underlying data generating process that we will eventually run up against the Bayes Error Rate, which prevents our model from improving further on either measure.
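
Having both quantities high at once is exactly what the F-1 score (computed at the end of this notebook with scikit-learn's f1_score) summarizes: it is the harmonic mean of precision and recall, so it is only close to 1.0 when precision and recall are both high.

$$ F_{1} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2 T_{p}}{2 T_{p} + F_{p} + F_{n}} $$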

Initial Investigative Setup

Before we dive into what these metrics look like in practice, we need to set up our experiment by loading our dataset and splitting it into training and test sets (cross-validation will provide the validation folds) so that we can measure the overall performance of our model. We will lean on functionality built into the scikit-learn library to ease this burden.

Breast Cancer Evaluation

Using the Breast Cancer Wisconsin dataset (originally from the UCI Machine Learning Repository and bundled with scikit-learn), we can run a quick GridSearch over a simple Support Vector Machine (SVM) model to find the best performing hyperparameters for two specific SVM kernels: a linear kernel and a radial basis function (RBF) kernel.


In [39]:
# import and split the dataset
# here we're using the Breast Cancer Wisconsin dataset bundled with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn import datasets

# load the breast cancer dataset provided with scikit-learn
breast_cancer = datasets.load_breast_cancer()

# The targets are already integer-encoded binary labels:
# 0 - malignant (cancer)
# 1 - benign (no cancer)

# break the dataset into data (X) and targets (y)
X = breast_cancer['data']
y = breast_cancer['target']

# split the dataset into training and test sets
# (cross-validation below will carve validation folds out of the training set)
X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [40]:
# these examples are borrowed from Python Machine Learning by Sebastian Raschka
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

pipe_svc = Pipeline([('scl', scaler),
                     ('clf', SVC(random_state=1))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=100,
                  n_jobs=-1)

gs.fit(X_train, y_train)
print(f"Best Score --> {gs.best_score_}")
print(f"Best Parameters --> {gs.best_params_}")

clf = gs.best_estimator_
clf.fit(X_train, y_train)
print(f"Test Accuracy : {clf.score(x_test, y_test)}")


Best Score --> 0.9849246231155779
Best Parameters --> {'clf__C': 0.1, 'clf__kernel': 'linear'}
Test Accuracy : 0.9590643274853801

Monitoring Model Performance

We can monitor the performance of our model as it trains and see whether we are reaching a point at which our model starts to underfit or overfit, a consequence of the bias-variance trade-off made when selecting models. A model is said to have high bias when it underfits our data: it lacks the complexity needed to capture the variance in the dataset collected from the data generating distribution. A model is said to have high variance when it overfits our dataset: its complexity is high enough that it can fit the training data almost perfectly, noise included. You might also hear terms such as a high number of degrees of freedom or a high parameter count used to describe this phenomenon.

As a side note, many researchers and practitioners in the field of Machine Learning are trying to understand why parametric models such as neural networks, with parameter counts far exceeding the dimensionality of the input, tend to generalize well. This is an open problem, and many people are working diligently to explain this phenomenon.

Below we will graph learning and validation curves for a (relatively) simple logistic regression pipeline trained on the same data.


In [41]:
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
import numpy as np
from sklearn.linear_model import LogisticRegression

lreg = LogisticRegression(penalty='l2', random_state=0, solver='newton-cg')

scaler = StandardScaler()

pipe_lr = Pipeline([
                    ('scl', scaler),
                    ('clf', lreg)])


train_sizes, train_scores, test_scores = learning_curve(estimator=pipe_lr,
                                                        X=X_train,
                                                        y=y_train,
                                                        train_sizes=np.linspace(0.1, 1.0, 10),
                                                        cv=10,
                                                        n_jobs=-1)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

%matplotlib widget

fig, axes = plt.subplots()

axes.plot(train_sizes, train_mean, color='blue', 
         marker='o', markersize=5, label='training_accuracy')

axes.fill_between(train_sizes, train_mean + train_std, train_mean - train_std,
                 alpha=0.15, color='blue')

axes.plot(train_sizes, test_mean, color='green',
         linestyle="--", marker='s', markersize=5,
         label='validation_accuracy')

axes.fill_between(train_sizes, test_mean + test_std, test_mean - test_std,
                 alpha = 0.15, color = 'green')

axes.grid()
axes.set_xlabel("Number of training examples")
axes.set_ylabel("Accuracy")
axes.legend(loc='lower right')
axes.set_ylim([0.8, 1.0])


Out[41]:
(0.8, 1.0)

In [42]:
# Forced overfitting: sweep the regularization parameter C to expose under- and overfitting
from sklearn.model_selection import validation_curve

param_range = [0.001, 0.01, 0.1, 10.0, 100.0, 1000.0]
train_scores, test_scores = validation_curve(
                                             estimator=pipe_lr,
                                             X=X_train,
                                             y=y_train,
                                             param_name='clf__C',
                                             param_range=param_range,
                                             cv=10)

train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

fig1, axes1 = plt.subplots()

axes1.plot(param_range, train_mean, color='blue',
         marker='o', markersize=5, label="training_accuracy")

axes1.fill_between(param_range, train_mean + train_std, train_mean - train_std,
                 alpha=0.15, color='blue')

axes1.plot(param_range, test_mean, color='green',
         linestyle='--', marker='s', markersize=5,
         label='validation_accuracy')

axes1.fill_between(param_range, test_mean + test_std, test_mean - test_std,
                 alpha=0.15, color='green')
axes1.arrow(param_range[-1],train_mean[-1], (0.0), (0.1), fc='k', ec='k', head_width=0.05, length_includes_head=True)
axes1.grid()
axes1.set_xscale('log')
axes1.legend(loc='lower right')
axes1.set_xlabel("Parameter C")
axes1.set_ylabel("Accuracy")
axes1.set_ylim([0.8, 1.0])


Out[42]:
(0.8, 1.0)

We can see above that accuracy on the validation folds starts to decline while training accuracy keeps increasing as C grows. This divergence is a sign that the model is overfitting the training dataset.

Once we have a trained model we can now start to investigate how to interpret the Precision, Recall, and F-1 score of our model.


In [43]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from inspect import signature

#We'll reuse our best classifier from the GridSearch we used above

y_score = clf.decision_function(x_test)

average_precision = average_precision_score(y_test, y_score)

precision, recall, _ = precision_recall_curve(y_test, y_score)

step_kwargs = ({'step': 'post'}
              if 'step' in signature(plt.fill_between).parameters
              else {})

pr_fig, pr_plot = plt.subplots()
pr_plot.step(recall, precision, color='b', alpha=0.2,
         where='post')

pr_plot.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)

pr_plot.set_xlabel("Recall")
pr_plot.set_ylabel("Precision")
pr_plot.set_ylim([0.0, 1.05])
pr_plot.set_xlim([0.0, 1.0])
pr_plot.set_title(f"2-Class Precision-Recall curve : AP={average_precision}")


Out[43]:
Text(0.5, 1.0, '2-Class Precision-Recall curve : AP=0.9977039610004896')
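
One practical use of the curve above is choosing a decision threshold for the classifier. The sketch below is an illustration rather than part of the original analysis: it recomputes the curve to keep the thresholds array and looks for the first threshold that reaches an arbitrarily chosen precision target of 0.99, reusing y_test and y_score from the cell above.

# Illustrative follow-up: pick an operating threshold from the precision-recall curve.
# The 0.99 precision target is an assumption made for this example only.
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_score)

target_precision = 0.99
# precision has one more entry than thresholds; np.argmax returns the first index
# at which the target is met.
idx = np.argmax(precision[:-1] >= target_precision)
print(f"threshold = {thresholds[idx]:.3f}, "
      f"precision = {precision[idx]:.3f}, recall = {recall[idx]:.3f}")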

In [50]:
from sklearn.metrics import f1_score

y_pred = clf.predict(x_test)
# F-1 score for the positive class (label 1 = benign)
f1 = f1_score(y_test, y_pred)

from sklearn.metrics import classification_report
# class 0 is malignant and class 1 is benign in this dataset
print(classification_report(y_test, y_pred, target_names=["Malignant", "Benign"]))


              precision    recall  f1-score   support

   Malignant       0.95      0.94      0.94        63
      Benign       0.96      0.97      0.97       108

    accuracy                           0.96       171
   macro avg       0.96      0.95      0.96       171
weighted avg       0.96      0.96      0.96       171

Above we can see that the classification_report function provided by scikit-learn reports the Precision, Recall, and F1-Score for each class in a readable format.
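
As a quick sanity check (a sketch reusing y_test and y_pred from the cell above), the same per-class numbers can be recovered from the individual metric functions; by default these treat class 1, benign, as the positive class, so they should match the Benign row of the report.

# Sanity check: recover the Benign (class 1) row of the report above.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

print(confusion_matrix(y_test, y_pred))   # rows = true class, columns = predicted class
print(precision_score(y_test, y_pred))    # precision for class 1 (Benign)
print(recall_score(y_test, y_pred))       # recall for class 1 (Benign)
print(f1_score(y_test, y_pred))           # same value as the f1 variable computed above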