Model evaluation strategies come in many different forms and shapes. In the following sections, we will highlight three of the most commonly used techniques for comparing models against each other.
In principle, model evaluation is simple: after training a model on some data, we can estimate its effectiveness by comparing model predictions to some ground truth values. We learned early on that we should split the data into a training and a test set, and we tried to follow this instruction whenever possible. But why exactly did we do that again?
The reason we never evaluate a model on the training set is that, in principle, any dataset can be learned if we throw a strong enough model at it.
A quick demonstration of this can be given with the help of the Iris dataset, which we talked about extensively in Chapter 3, First Steps in Supervised Learning. There, the goal was to classify species of Iris flowers based on their physical dimensions. We can load the Iris dataset using scikit-learn:
In [1]:
from sklearn.datasets import load_iris
iris = load_iris()
An innocent approach to this problem would be to store all data points in a matrix X and all class labels in a vector y:
In [2]:
import numpy as np
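# OpenCV's ml module expects the feature matrix to contain 32-bit floating-point numbers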
X = iris.data.astype(np.float32)
y = iris.target
Next, we choose a model and its hyperparameters. For example, let's use the $k$-NN algorithm from Chapter 3, First Steps in Supervised Learning, which provides only a single hyperparameter: the number of neighbors, $k$. With $k=1$, we get a very simple model that assigns an unknown point the same label as its closest neighbor in the training set.
In OpenCV, a $k$-NN classifier is instantiated as follows:
In [3]:
import cv2
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
Then we train the model and use it to predict labels for the data that we already know:
In [4]:
knn.train(X, cv2.ml.ROW_SAMPLE, y)
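# predict returns a (return value, results) pair; results holds one predicted label per row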
_, y_hat = knn.predict(X)
Finally, we compute the fraction of correctly labeled points:
In [5]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_hat)
Out[5]:
1.0
We see an accuracy score of 1.0, which indicates that 100% of points were correctly labeled by our model!
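As a quick sanity check (this snippet is not part of the original notebook), we could compute the same fraction by hand, assuming y and y_hat still hold the ground truth labels and the (N, 1) prediction array from the cells above:
import numpy as np
# flatten OpenCV's (N, 1) output and average the element-wise matches
np.mean(y == y_hat.ravel())   # likewise evaluates to 1.0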
But is this truly measuring the expected accuracy? Have we really come up with a model that we expect to be correct 100% of the time?
As you may have gathered, the answer is no. This example shows that even a simple algorithm is capable of memorizing a real-world dataset. Imagine how easy this task would have been for a deep neural network! Usually, the more parameters a model has, the more powerful it is. We will come back to this shortly.
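One way to see why a 1-NN model is guaranteed to score 100% on its own training data is to check that every training point's nearest neighbor within the training set lies at distance zero, so the model merely reads back the label it has stored. Here is a minimal sketch of that check, using scikit-learn's NearestNeighbors instead of the OpenCV model:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

X = load_iris().data

# for every training point, find its single closest neighbor among the
# training points themselves
nn = NearestNeighbors(n_neighbors=1).fit(X)
dist, _ = nn.kneighbors(X)

# every distance is (numerically) zero: each point, or an identical duplicate,
# is its own nearest neighbor, so 1-NN evaluated on the training data simply
# looks up the stored label
print(np.allclose(dist, 0))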
A better sense of a model's performance can be found using what's known as a test set, but you already knew this. When presented with data held out from the training procedure, we can check whether a model has learned some dependencies in the data that hold across the board or whether it just memorized the training set.
We can split the data into training and test sets using the familiar train_test_split from scikit-learn's model_selection module:
In [6]:
from sklearn.model_selection import train_test_split
But how do we choose the right train-test ratio? Is there even such a thing as a right ratio? Or is this considered another hyperparameter of the model?
There are two competing concerns here: if the training set is too small, the model might not get to see enough examples to learn anything that generalizes; if the test set is too small, our estimate of the model's performance becomes noisy and unreliable.
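One way to get a feeling for this trade-off is to repeat the split many times at different ratios and see how much the resulting accuracy estimates fluctuate. The following sketch uses scikit-learn's KNeighborsClassifier as a stand-in for the OpenCV model; the particular ratios and the number of repetitions are arbitrary choices:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

for train_size in (0.5, 0.8, 0.95):
    scores = []
    for seed in range(25):
        # repeat the split with different seeds to expose the variance
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=train_size,
                                                   random_state=seed)
        knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
        scores.append(knn.score(X_te, y_te))
    print(train_size, np.mean(scores), np.std(scores))
Typically, the larger the training fraction, the fewer points are left for testing and the more the accuracy estimate jumps around from split to split; this is exactly the tension described above.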
A good starting point is usually an 80-20 training-test split. However, it all depends on the amount of data available; for relatively small datasets, a 50-50 split might be more suitable. Here, we stick with an 80-20 split:
In [7]:
# fixing random_state makes the (otherwise random) split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=37,
                                                     train_size=0.8)
Then we retrain the preceding model on the training set:
In [8]:
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)
knn.train(X_train, cv2.ml.ROW_SAMPLE, y_train);
When we evaluate the model on the test set, we suddenly get a different result:
In [9]:
_, y_test_hat = knn.predict(X_test)
accuracy_score(y_test, y_test_hat)
Out[9]:
0.9666666666666667
We see a more reasonable result here, although 97% accuracy is still formidable. But is this the best possible result, and how can we know for sure?
To answer this question, we have to dig a little deeper.