Model evaluation strategies come in many different forms and shapes. In the following sections, we will highlight three of the most commonly used techniques for comparing models against each other.
In principle, model evaluation is simple: after training a model on some data, we can estimate its effectiveness by comparing model predictions to some ground truth values. We learned early on that we should split the data into a training and a test set, and we tried to follow this instruction whenever possible. But why exactly did we do that again?
The reason we never evaluate a model on the training set is that, in principle, any dataset can be learned if we throw a strong enough model at it.
A quick demonstration of this can be given with the help of the Iris dataset, which we talked about extensively in Chapter 3, First Steps in Supervised Learning. There, the goal was to classify species of Iris flowers based on their physical dimensions. We can load the Iris dataset using scikit-learn:
In [1]:
from sklearn.datasets import load_iris
iris = load_iris()
An innocent approach to this problem would be to store all data points in a matrix X and all class labels in a vector y:
In [2]:
import numpy as np
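# OpenCV's ml module expects the feature matrix to contain 32-bit floating-point numbers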
X = iris.data.astype(np.float32)
y = iris.target
Next, we choose a model and its hyperparameters. For example, let's use the $k$-NN algorithm from Chapter 3, First Steps in Supervised Learning, which provides only a single hyperparameter: the number of neighbors, $k$. With $k=1$, we get a very simple model that assigns an unknown point the same label as its closest neighbor in the training set.
In OpenCV, a $k$-NN classifier is instantiated as follows:
In [3]:
import cv2
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
Then we train the model and use it to predict labels for the data that we already know:
In [4]:
knn.train(X, cv2.ml.ROW_SAMPLE, y)
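# predict returns a (return value, results) pair; results holds one predicted label per row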
_, y_hat = knn.predict(X)
Finally, we compute the fraction of correctly labeled points:
In [5]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_hat)
Out[5]:
1.0
We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model!
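As a quick sanity check (this snippet is not part of the original notebook), we could compute the same fraction by hand, assuming y and y_hat still hold the ground truth labels and the (N, 1) prediction array from the cells above:
import numpy as np
# flatten OpenCV's (N, 1) output and average the element-wise matches
np.mean(y == y_hat.ravel())   # likewise evaluates to 1.0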
But is this truly measuring the expected accuracy? Have we really come up with a model that we expect to be correct 100% of the time?
As you may have gathered, the answer is no. This example shows that even a simple algorithm is capable of memorizing a real-world dataset. Imagine how easy this task would have been for a deep neural network! Usually, the more parameters a model has, the more powerful it is. We will come back to this shortly.
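One way to see why a 1-NN model is guaranteed to score 100% on its own training data is to check that every training point's nearest neighbor within the training set lies at distance zero, so the model merely reads back the label it has stored. Here is a minimal sketch of that check, using scikit-learn's NearestNeighbors instead of the OpenCV model:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data

# for every training point, find its single closest neighbor among the
# training points themselves
nn = NearestNeighbors(n_neighbors=1).fit(X)
dist, _ = nn.kneighbors(X)

# every distance is (numerically) zero: each point, or an identical duplicate,
# is its own nearest neighbor, so 1-NN evaluated on the training data simply
# looks up the stored label
print(np.allclose(dist, 0))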
A better sense of a model's performance can be found using what's known as a test set, but you already knew this. When presented with data held out from the training procedure, we can check whether a model has learned some dependencies in the data that hold across the board or whether it just memorized the training set.
We can split the data into training and test sets using the familiar train_test_split from scikit-learn's model_selection module:
In [6]:
from sklearn.model_selection import train_test_split
But how do we choose the right train-test ratio? Is there even such a thing as a right ratio? Or is this considered another hyperparameter of the model?
There are two competing concerns here: if the training set is too small, the model might not get to see enough examples to learn anything that generalizes; if the test set is too small, our estimate of the model's performance becomes noisy and unreliable.
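One way to get a feeling for this trade-off is to repeat the split many times at different ratios and see how much the resulting accuracy estimates fluctuate. The following sketch uses scikit-learn's KNeighborsClassifier as a stand-in for the OpenCV model; the particular ratios and the number of repetitions are arbitrary choices:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

for train_size in (0.5, 0.8, 0.95):
    scores = []
    for seed in range(25):
        # repeat the split with different seeds to expose the variance
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=train_size,
                                                   random_state=seed)
        knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
        scores.append(knn.score(X_te, y_te))
    print(train_size, np.mean(scores), np.std(scores))
Typically, the larger the training fraction, the fewer points are left for testing and the more the accuracy estimate jumps around from split to split; this is exactly the tension described above.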
A good starting point is usually an 80-20 training-test split. However, it all depends on the amount of data available; for relatively small datasets, a 50-50 split might be more suitable. Here, we stick with an 80-20 split:
In [7]:
# fixing random_state makes the (otherwise random) split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37,
                                                     train_size=0.8)
Then we retrain the preceding model on the training set:
In [8]:
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train);
When we evaluate the model on the test set, we suddenly get a different result:
In [9]:
_, y_test_hat = knn.predict(X_test)
accuracy_score(y_test, y_test_hat)
Out[9]:
0.9666666666666667
We see a more reasonable result here, although 97% accuracy is still formidable. But is this the best possible result, and how can we know for sure?
To answer this question, we have to dig a little deeper.