A popular way to evaluate the performance of a machine learning classifier is to use a confusion matrix. For a binary classifier, this is a table with two rows and two columns that displays the number of true positives, false positives, false negatives and true negatives.
In [ ]:
import sklearn
import pandas as pd
import numpy as np
Confusion matrix
In [ ]:
index_names = ['predicted condition positive', 'predicted condition negative']
column_names = ['true condition positive', 'true condition negative']
The table below shows an example confusion matrix for a hypothetical test for a rare disease that only 2 people out of 100 have. This is an unbalanced data set, since a much larger number, 98 out of 100, do not have the disease. The first named row holds the people who test positive and the second named row holds the people who test negative. The first named column holds the people who have the disease and the second named column holds the people who do not have the disease.
This leads to four numeric cells: the top left contains the true positive count, the top right the false positive count, the bottom left the false negative count and the bottom right the true negative count.
A simple way to create a very accurate test for this unbalanced example is to predict that everyone is negative for the disease. This misses all the people who actually have the disease and results in two false negative cases. However, it correctly predicts 98 true negative cases, which makes the test 98% accurate. But this test cannot distinguish between people who have the disease and people who don't, so accuracy alone may not be a useful measure of the quality of a test.
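As a rough illustration of this point, the sketch below builds hypothetical label lists for the 100 people (`y_true` and `y_pred` are illustrative names, with 1 meaning "has the disease") and scores a test that calls everyone negative.

```python
# Hypothetical labels for the rare-disease example: 1 = has the disease, 0 = does not.
y_true = [1] * 2 + [0] * 98
# A "test" that declares every person negative.
y_pred = [0] * 100

# Accuracy: fraction of people whose predicted label matches the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Number of sick people the test actually finds.
detected = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)  # 0.98
print(detected)  # 0
```

The accuracy is high only because the negative class dominates; the count of detected cases shows that the test finds no one who has the disease.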
Two useful measures are precision and recall. Precision is a measure of how many of the selected items are relevant, and recall is a measure of how many of the relevant items are selected.
precision = (true positives)/(true positives + false positives)
recall = (true positives)/(true positives + false negatives)
In the example below the precision is undefined, because the test selects no one (0/0), while the recall is zero.
In [ ]:
pd.DataFrame.from_records(
    np.array([[0, 2], [0, 98]]).T, columns=column_names, index=index_names)
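The two formulas can be written as a small helper. This is only a sketch: `precision_recall` is a hypothetical function, not part of the original notebook, and it returns `None` where a measure is undefined because its denominator is zero.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a value is None when its denominator is zero."""
    selected = tp + fp   # everyone the test flags as positive
    relevant = tp + fn   # everyone who truly is positive
    precision = tp / selected if selected > 0 else None
    recall = tp / relevant if relevant > 0 else None
    return precision, recall

# Counts for the "everyone is negative" test shown above.
precision, recall = precision_recall(tp=0, fp=0, fn=2)
print(precision)  # None
print(recall)     # 0.0
```

By default, scikit-learn's `precision_score` handles the same situation differently: it reports 0 and emits a warning rather than returning an undefined value.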
An alternative test for the same rare disease, where 2 out of 100 have the disease, is shown below. Now there is 1 true positive, 2 false positives, 1 false negative and 96 true negatives.
This test has a lower accuracy than the previous one, as it correctly predicts 97 out of 100 cases. However, it has a defined precision of 1/3 (about 0.33) and a recall of 0.5.
This test correctly identifies 1 out of the 2 people who have the disease.
In [ ]:
pd.DataFrame.from_records(
    np.array([[1, 1], [2, 96]]).T, columns=column_names, index=index_names)
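The figures quoted for this second test can be read straight off the table. The sketch below rebuilds the same table (using the plain `pd.DataFrame` constructor, and `table`, `pred_pos` and similar names as illustrative assumptions) and looks up each cell by its row and column label.

```python
import numpy as np
import pandas as pd

index_names = ['predicted condition positive', 'predicted condition negative']
column_names = ['true condition positive', 'true condition negative']

table = pd.DataFrame(
    np.array([[1, 1], [2, 96]]).T, columns=column_names, index=index_names)

pred_pos, pred_neg = index_names
true_pos, true_neg = column_names

# Each cell is addressed as (predicted row, true column).
tp = table.loc[pred_pos, true_pos]
fp = table.loc[pred_pos, true_neg]
fn = table.loc[pred_neg, true_pos]
tn = table.loc[pred_neg, true_neg]

accuracy = (tp + tn) / table.values.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

Looking cells up by label rather than by position avoids mixing up the false positive and false negative cells.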
To demonstrate the use of accuracy, precision and recall when measuring the performance of a classifier, we use the "Wisconsin Breast Cancer" data set.
In [ ]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()
This data set has 569 samples, of which 357 are benign and 212 are malignant.
In [ ]:
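target = pd.Series(dataset.target, dtype='category')
# In this data set, label 0 is malignant and label 1 is benign.
# rename_categories returns a new Series; the inplace argument is gone in recent pandas.
target = target.cat.rename_categories(['malignant', 'benign'])
target.value_counts()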
We predict whether the cancer is benign or malignant using the mean values of ten factors: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.
In [ ]:
column_names = [
    'radius', 'texture', 'perimeter', 'area',
    'smoothness', 'compactness', 'concavity', 'concave_points',
    'symmetry', 'fractal_dimension']
# Keep only the first ten feature columns.
df = pd.DataFrame(data=dataset.data[:, :10], columns=column_names)
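One way to check that the first ten columns match the ten factors listed above is to inspect the feature names shipped with scikit-learn's copy of the data set. This is a small sketch; `first_ten` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# The data set has 30 feature columns; the first ten are the per-sample means.
first_ten = list(dataset.feature_names[:10])
print(dataset.data.shape)
print(first_ten)
```

The remaining twenty columns hold the standard error and the "worst" value of the same ten factors, which this notebook does not use.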
In [ ]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def get_metrics(target, predict, name):
    # Precision and recall use scikit-learn's default positive label, 1 (benign here).
    return {
        'classifier': name,
        'accuracy': accuracy_score(target, predict),
        'precision': precision_score(target, predict),
        'recall': recall_score(target, predict)
    }
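By default, `precision_score` and `recall_score` treat label 1 as the positive class, which in this data set is benign. If the malignant class is the one of interest, the `pos_label` argument can select it instead. The sketch below uses a small set of made-up labels (not taken from the data set) to show the difference.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only: 0 = malignant, 1 = benign.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]

# Default behaviour: label 1 (benign) is the positive class.
precision_benign = precision_score(y_true, y_pred)
recall_benign = recall_score(y_true, y_pred)

# Score the malignant class instead by setting pos_label=0.
precision_malignant = precision_score(y_true, y_pred, pos_label=0)
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print(precision_benign, recall_benign)        # 0.75 1.0
print(precision_malignant, recall_malignant)  # 1.0 0.5
```

The same predictions give quite different precision and recall depending on which class is called positive, so it is worth stating the choice alongside the results.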
In [ ]:
from sklearn import linear_model
# C is the inverse of the regularization strength (smaller values mean stronger regularization)
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(df.values, dataset.target)
predict = logreg.predict(df.values)
result1 = get_metrics(dataset.target, predict, 'logistic regression')
In [ ]:
from sklearn.svm import SVC
clf = SVC(kernel='rbf')
clf.fit(df.values, dataset.target)
predict = clf.predict(df.values)
result2 = get_metrics(dataset.target, predict, 'support vector (radial basis)')
In [ ]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=10)
clf.fit(df.values, dataset.target)
predict = clf.predict(df.values)
result3 = get_metrics(dataset.target, predict, 'decision tree')
In [ ]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50)
clf.fit(df.values, dataset.target)
predict = clf.predict(df.values)
result4 = get_metrics(dataset.target, predict, 'random forest')
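We compare four classifiers (logistic regression, support vector, decision tree and random forest) on three measures: accuracy, precision and recall. The decision tree and random forest classifiers fit the data so closely that they correctly classify 100% of the samples in this data set. Note that every classifier is scored on the same samples it was trained on, so these figures show how well each model fits the training data rather than how well it would perform on unseen cases.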
In [ ]:
pd.DataFrame([result1, result2, result3, result4], columns=['classifier', 'accuracy', 'precision', 'recall'])
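A minimal sketch of scoring one of these classifiers on samples it was not fitted on follows. The 70/30 stratified split, the `random_state` values and the `score` helper are assumptions made for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
X = dataset.data[:, :10]   # the same ten mean-value features used above
y = dataset.target

# Hold out 30% of the samples, keeping the benign/malignant ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

def score(features, labels):
    """Accuracy, precision and recall of clf on the given samples."""
    predicted = clf.predict(features)
    return {
        'accuracy': accuracy_score(labels, predicted),
        'precision': precision_score(labels, predicted),
        'recall': recall_score(labels, predicted),
    }

train_scores = score(X_train, y_train)
test_scores = score(X_test, y_test)
print('train:', train_scores)
print('test: ', test_scores)
```

Comparing the two dictionaries gives a rough sense of how much of a near-perfect training score carries over to unseen samples.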