Learning Unit 11
By: Hugo Lopes
Scikit-learn covers the classification validation metrics extensively [here]. The ones presented in this notebook are: accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve, and roc_auc_score.
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, \
recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix
%matplotlib inline
In [ ]:
df_results = pd.read_csv('../data/classifier_prediction_scores.csv')
print('Number of rows:', df_results.shape[0])
df_results.head()
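The scores column is the kind of output a binary classifier's predict_proba() produces. As a purely hypothetical sketch (this is not the model that generated this CSV), such a score column could be obtained as follows:
In [ ]:
# Hypothetical sketch: how scores like these are typically produced
# (NOT the model behind classifier_prediction_scores.csv)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, weights=[0.8], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
demo_scores = clf.predict_proba(X_demo)[:, 1]  # probability of the positive class, in [0, 1]
print(demo_scores[:5])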
Let's take a look at the distribution of the scores. Since these scores are the output of predict_proba(), they range between 0 and 1.
In [ ]:
df_results['scores'].hist(bins=50)
plt.ylabel('Frequency')
plt.xlabel('Scores')
plt.title('Distribution of Scores')
plt.xlim(0, 1)
plt.show()
The accuracy_score is the fraction (default) or the count (normalize=False) of correct predictions. It is given by:
$$ A = \frac{TP + TN}{TP + TN + FP + FN} $$
where TP is the number of True Positives, TN the True Negatives, FP the False Positives, and FN the False Negatives.
Disadvantages: accuracy can be misleading on imbalanced datasets, where always predicting the majority class already yields a high score.
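To make this concrete, here is a small synthetic illustration (the data below is made up and unrelated to the dataset loaded above): a model that always predicts the majority class on a 90/10 split still reaches 90% accuracy.
In [ ]:
# Synthetic illustration: 90 negatives and 10 positives
y_imbalanced = np.array([0] * 90 + [1] * 10)
# A "classifier" that always predicts the majority class
y_always_zero = np.zeros_like(y_imbalanced)
# High accuracy despite never detecting a single positive
print('Accuracy of always predicting 0: %1.2f' % accuracy_score(y_imbalanced, y_always_zero))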
In [ ]:
# Specifying the threshold above which the predicted label is considered 1:
threshold = 0.50
# Generate the predicted labels (above threshold = 1, below = 0)
predicted_outcome = [0 if k <= threshold else 1 for k in df_results['scores']]
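Equivalently, the same labels can be produced with a vectorized comparison (a minor style alternative; the list comprehension above is the one used in the rest of the notebook):
In [ ]:
# Vectorized equivalent of the list comprehension above
predicted_outcome_vec = (df_results['scores'] > threshold).astype(int)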
In [ ]:
print('Accuracy = %2.3f' % accuracy_score(df_results['target'], predicted_outcome))
The confusion_matrix C provides several performance indicators at once. For binary classification, scikit-learn's convention is that C[0, 0] is the number of True Negatives, C[0, 1] the False Positives, C[1, 0] the False Negatives, and C[1, 1] the True Positives.
In [ ]:
# Get the confusion matrix:
confmat = confusion_matrix(y_true=df_results['target'], y_pred=predicted_outcome)
# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.4)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i,
                s=confmat[i, j],
                va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.title('Confusion Matrix')
plt.show()
As can be seen, the number of False Negatives is very high, which, depending on the business context, could be harmful.
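As a quick follow-up, the individual counts can be unpacked from the confusion matrix and used to quantify this, for example through the false negative rate (a sketch that reuses confmat from the cell above):
In [ ]:
# Unpack the binary confusion matrix (rows = true labels, columns = predicted labels)
tn, fp, fn, tp = confmat.ravel()
# Fraction of actual positives that the classifier misses
fnr = fn / (fn + tp)
print('False negative rate = %1.3f' % fnr)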
The precision, recall, and F1 score are given by:
$$ P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \qquad F_1 = 2\,\frac{P \cdot R}{P + R} $$
where $TP$ is the number of true positives, $FP$ the false positives, and $FN$ the false negatives. Further information on precision, recall and f1-score.
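As a sanity check, these formulas can be evaluated directly from the confusion-matrix counts (a sketch that reuses confmat from the cell above); the values should match scikit-learn's precision_score, recall_score, and f1_score computed below.
In [ ]:
# Evaluate the formulas above directly from the confusion-matrix counts
tn, fp, fn, tp = confmat.ravel()
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
print('Manual precision = %1.3f, recall = %1.3f, F1 = %1.3f'
      % (precision_manual, recall_manual, f1_manual))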
First, let's check if our dataset has class imbalance:
In [ ]:
df_results['target'].value_counts(normalize=True)
Rather imbalanced! Approximately 83% of the labels are 0. Let's take a look at other metrics that are more appropriate for this type of dataset:
In [ ]:
print('Precision score = %1.3f' % precision_score(df_results['target'], predicted_outcome))
print('Recall score = %1.3f' % recall_score(df_results['target'], predicted_outcome))
print('F1 score = %1.3f' % f1_score(df_results['target'], predicted_outcome))
As you can see, the results are actually not as good as the accuracy metric would suggest.
The ROC curve is very common for binary classification problems. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings.
The roc_curve function computes the Receiver Operating Characteristic (ROC) curve, and the roc_auc_score function computes the area under the ROC curve (AUROC), summarizing the curve in a single number. Unlike the previous metrics, these functions require the actual scores/probabilities rather than the predicted labels. Further information on roc_curve and roc_auc_score. This metric is particularly useful for imbalanced datasets.
In [ ]:
# Data to compute the ROC curve (FPR and TPR):
fpr, tpr, thresholds = roc_curve(df_results['target'], df_results['scores'])
# The Area Under the ROC curve:
roc_auc = roc_auc_score(df_results['target'], df_results['scores'])
# Plot ROC Curve
plt.figure(figsize=(8,6))
lw = 2
plt.plot(fpr, tpr, color='orange', lw=lw, label='ROC curve (AUROC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--', label='random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
As we can see, the AUROC is 0.70. A value of 0.50 means that the classifier is no better than random.
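The arrays returned by roc_curve can also be used to choose an operating threshold. As a sketch (reusing fpr, tpr, and thresholds from the cell above), this picks the threshold whose true positive rate is closest to a target of 0.8; the target is an arbitrary illustration, not a recommendation.
In [ ]:
# Pick the threshold whose TPR is closest to an (illustrative) target of 0.8
target_tpr = 0.8
idx = np.argmin(np.abs(tpr - target_tpr))
print('Threshold = %1.3f gives TPR = %1.3f at FPR = %1.3f'
      % (thresholds[idx], tpr[idx], fpr[idx]))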