This notebook contains an excerpt from the book Machine Learning for OpenCV by Michael Beyeler. The code is released under the MIT license, and is available on GitHub.

Note that this excerpt contains only the raw code - the book is rich with additional explanations and illustrations. If you find this content useful, please consider supporting the work by buying the book!

Understanding Cross-Validation

Cross-validation is a method of evaluating the generalization performance of a model that is generally more stable and thorough than splitting the dataset into training and test sets.

The most commonly used version of cross-validation is $k$-fold cross-validation, where $k$ is a number specified by the user (usually five or ten). Here, the dataset is partitioned into $k$ parts of more or less equal size, called folds. For a dataset that contains $N$ data points, each fold should thus have approximately $N / k$ samples. Then a series of models is trained on the data, using $k - 1$ folds for training and the remaining fold for testing. The procedure is repeated for $k$ iterations, each time choosing a different fold for testing, until every fold has served as a test set once.
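
To make the bookkeeping concrete, here is a minimal sketch (not taken from the book) of how the indices of a hypothetical dataset could be partitioned into $k$ folds with NumPy; the variable names are made up for illustration:

import numpy as np

N, k = 150, 5                            # hypothetical dataset size and number of folds
indices = np.random.permutation(N)       # shuffle the data point indices
folds = np.array_split(indices, k)       # k folds of roughly N / k indices each

# in iteration i, fold i serves as the test set and the remaining k - 1 folds
# are merged into the training set
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    print(len(train_idx), len(test_idx))   # 120 training and 30 test samples per iteration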

Refer to the book for an illustration of $k$-fold cross-validation for different values of $k$. Do you know what makes cross-validation different from just splitting the data into training and test sets?

Manually implementing cross-validation in OpenCV

The easiest way to perform cross-validation in OpenCV is to do the data splits by hand. For example, in order to implement two-fold cross-validation, we would proceed as follows.

Load the dataset:


In [1]:
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris.data.astype(np.float32)
y = iris.target

Split the data into two equally sized parts:


In [2]:
from sklearn.model_selection import train_test_split
X_fold1, X_fold2, y_fold1, y_fold2 = train_test_split(
    X, y, random_state=37, train_size=0.5
)

Instantiate the classifier:


In [3]:
import cv2
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)

Train the classifier on the first fold, then predict the labels of the second fold:


In [4]:
knn.train(X_fold1, cv2.ml.ROW_SAMPLE, y_fold1)
_, y_hat_fold2 = knn.predict(X_fold2)

Train the classifier on the second fold, then predict the labels of the first fold:


In [5]:
knn.train(X_fold2, cv2.ml.ROW_SAMPLE, y_fold2)
_, y_hat_fold1 = knn.predict(X_fold1)

Compute accuracy scores for both folds:


In [6]:
from sklearn.metrics import accuracy_score
accuracy_score(y_fold1, y_hat_fold1)


Out[6]:
0.92000000000000004

In [7]:
accuracy_score(y_fold2, y_hat_fold2)


Out[7]:
0.88

This procedure will yield two accuracy scores, one for the first fold (92% accuracy), and one for the second fold (88% accuracy). On average, our classifier thus achieved 90% accuracy on unseen data.
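
As a quick follow-up sketch (not part of the original notebook), the two fold accuracies computed above can be averaged into a single cross-validated estimate:

acc_fold1 = accuracy_score(y_fold1, y_hat_fold1)   # ~0.92
acc_fold2 = accuracy_score(y_fold2, y_hat_fold2)   # ~0.88
np.mean([acc_fold1, acc_fold2])                    # ~0.90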

Automating cross-validation using scikit-learn

Instantiate the classifier:


In [8]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)

Perform cross-validation with the cross_val_score function. This function takes as input a model, the full dataset (X), the target labels (y) and an integer value for the number of folds (cv). It is not necessary to split the data by hand—the function will do that automatically depending on the number of folds. After the cross-validation is completed, the function returns the test scores:


In [9]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
scores


Out[9]:
array([ 0.96666667,  0.96666667,  0.93333333,  0.93333333,  1.        ])

In order to get a sense of how the model did on average, we can look at the mean and standard deviation of the five scores:


In [10]:
scores.mean(), scores.std()


Out[10]:
(0.95999999999999996, 0.024944382578492935)

With five folds, we have a much better idea of how robust the classifier is. We see that $k$-NN with $k=1$ achieves 96% accuracy on average, and this value fluctuates from fold to fold with a standard deviation of roughly 2.5%.

Implementing leave-one-out cross-validation

Another popular way to implement cross-validation is to choose the number of folds equal to the number of data points in the dataset. In other words, if there are $N$ data points, we set $k=N$. This means that we will end up having to do $N$ iterations of cross-validation, but in every iteration, the test set will consist of only a single data point. The advantage of this procedure is that we get to use all-but-one data point for training. Hence, this procedure is also known as leave-one-out cross-validation.
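
To see what this means in practice, here is a minimal hand-rolled sketch of leave-one-out evaluation (an illustration only, reusing the scikit-learn classifier model from the previous section); the built-in helper introduced next does the same bookkeeping for us:

# illustration: train on all but one point, then test on the single held-out point
loo_scores = []
for i in range(len(X)):
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False                                      # hold out data point i
    model.fit(X[mask], y[mask])                          # train on the other N - 1 points
    loo_scores.append(model.score(X[~mask], y[~mask]))   # 1.0 if correct, 0.0 if not
np.mean(loo_scores)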

In scikit-learn, this functionality is provided by the LeaveOneOut method from the model_selection module:


In [11]:
from sklearn.model_selection import LeaveOneOut

This object can be passed directly to the cross_val_score function in the following way:


In [12]:
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

Because every test set now contains a single data point, we would expect the scorer to return 150 values, one for each data point in the dataset. Each of these points can be classified either correctly or incorrectly. Thus, we expect scores to be a list of ones (1) and zeros (0), which correspond to correct and incorrect classifications, respectively:


In [13]:
scores


Out[13]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.])

If we want to know the average performance of the classifier, we would still compute the mean and standard deviation of the scores:


In [14]:
scores.mean(), scores.std()


Out[14]:
(0.95999999999999996, 0.19595917942265423)

We can see that this scoring scheme returns a mean accuracy very similar to that of five-fold cross-validation (the standard deviation is larger here because every individual score is now either 0 or 1).

Estimating robustness using bootstrapping

An alternative procedure to $k$-fold cross-validation is bootstrapping.

Instead of splitting the data into folds, bootstrapping builds a training set by drawing samples randomly from the dataset. Typically, a bootstrap is formed by drawing samples with replacement. Imagine putting all of the data points into a bag and then drawing randomly from the bag. After drawing a sample, we would put it back in the bag. This allows for some samples to show up multiple times in the training set, which is something cross-validation does not allow.

The classifier is then tested on all samples that are not part of the bootstrap (the so-called out-of-bag examples), and the procedure is repeated a large number of times (say, 10,000 times). Thus, we get a distribution of the model's score that allows us to estimate the robustness of the model.
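
As a side note (not from the book), the out-of-bag set is usually sizeable: the probability that a given sample is never drawn in $N$ draws with replacement is $(1 - 1/N)^N$, which approaches $1/e \approx 0.368$ for large $N$, so roughly a third of the data ends up out-of-bag on average. A one-line check:

N = 150
(1 - 1.0 / N) ** N    # ~0.368: expected fraction of out-of-bag samples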

Bootstrapping can be implemented with the following procedure.

Instantiate the classifier:


In [15]:
knn = cv2.ml.KNearest_create()
knn.setDefaultK(1)

From our dataset with $N$ samples, randomly choose $N$ samples with replacement to form a bootstrap. This can be done most easily with the choice function from NumPy's random module. We tell the function to draw len(X) samples in the range [0, len(X)-1] with replacement (replace=True). The function then returns a list of indices, from which we form our bootstrap:


In [16]:
idx_boot = np.random.choice(len(X), size=len(X), replace=True)
X_boot = X[idx_boot, :]
y_boot = y[idx_boot]

Put all samples that did not show up in the bootstrap into the out-of-bag set:


In [17]:
idx_oob = np.array([x not in idx_boot
                    for x in np.arange(len(X))], dtype=bool)
X_oob = X[idx_oob, :]
y_oob = y[idx_oob]

Train the classifier on the bootstrap samples:


In [18]:
knn.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)


Out[18]:
True

Test the classifier on the out-of-bag samples:


In [19]:
_, y_hat = knn.predict(X_oob)
accuracy_score(y_oob, y_hat)


Out[19]:
0.92727272727272725

We then repeat these steps up to 10,000 times to obtain 10,000 accuracy scores, and average the scores to get an idea of the classifier's mean performance.

For convenience, we can build a function that makes it easy to run the procedure for some number of iterations, n_iter. We also pass a model (our $k$-NN classifier, model), the feature matrix (X), and the vector with all class labels (y):


In [20]:
def yield_bootstrap(model, X, y, n_iter=10000):
    for _ in range(n_iter):
        # train the classifier on bootstrap
        idx_boot = np.random.choice(len(X), size=len(X),
                                    replace=True)
        X_boot = X[idx_boot, :]
        y_boot = y[idx_boot]
        model.train(X_boot, cv2.ml.ROW_SAMPLE, y_boot)
        
        # test classifier on out-of-bag examples
        idx_oob = np.array([x not in idx_boot
                            for x in np.arange(len(X))],
                           dtype=bool)
        X_oob = X[idx_oob, :]
        y_oob = y[idx_oob]
        _, y_hat = model.predict(X_oob)
        
        # return accuracy
        yield accuracy_score(y_oob, y_hat)

To make sure we all get the same result, let's fix the seed of the random number generator:


In [21]:
np.random.seed(42)

Now, let's run the procedure for n_iter=10 iterations by converting the generator output to a list:


In [22]:
list(yield_bootstrap(knn, X, y, n_iter=10))


Out[22]:
[0.98333333333333328,
 0.93650793650793651,
 0.92452830188679247,
 0.92307692307692313,
 0.94545454545454544,
 0.94736842105263153,
 0.98148148148148151,
 0.96078431372549022,
 0.93220338983050843,
 0.96610169491525422]

As you can see, for this small sample we get accuracy scores anywhere between 92% and 98%. To get a more reliable estimate of the model's performance, we repeat the procedure 1,000 times and calculate both mean and standard deviation of the resulting scores:


In [23]:
acc = list(yield_bootstrap(knn, X, y, n_iter=1000))
np.mean(acc), np.std(acc)


Out[23]:
(0.95524155136419198, 0.022040380995646654)

You are welcome to increase the number of repetitions further, but once n_iter is large enough, the procedure becomes robust to the randomness of the sampling. In this case, we do not expect the distribution of score values to change much as we increase n_iter to, for example, 10,000 iterations:


In [24]:
acc = list(yield_bootstrap(knn, X, y, n_iter=10000))
np.mean(acc), np.std(acc)


Out[24]:
(0.95501528733009422, 0.021778543317079499)

Typically, the scores obtained with bootstrapping would be used in a statistical test to assess the significance of our result. Let's have a look at how that is done.

Implementing Student's t-test

One of the most famous statistical tests is Student's $t$-test. You might have heard of it before: it allows us to determine whether two sets of data are significantly different from one another. This was a really important test for William Sealy Gosset, the inventor of the test, who worked at the Guinness brewery and wanted to know whether two batches of stout differed in quality.

In practice, the $t$-test allows us to determine whether two data samples come from underlying distributions with the same mean or expected value.

For our purposes, this means that we can use the $t$-test to determine whether the test scores of two independent classifiers have the same mean value. We start by hypothesizing that the two sets of test scores are identical. We call this the null hypothesis because this is the hypothesis we want to nullify, that is, we are looking for evidence to reject the hypothesis because we want to show that one classifier is significantly better than the other.

We accept or reject the null hypothesis based on a parameter known as the $p$-value that the $t$-test returns. The $p$-value takes on values between 0 and 1. A $p$-value of 0.05 would mean that, if the null hypothesis were true, we would observe a result at least this extreme only 5 times out of 100. A small $p$-value thus indicates strong evidence against the null hypothesis, which can then be safely rejected. It is customary to use $p=0.05$ as a cut-off value below which we reject the null hypothesis.

If this is all too confusing, think of it this way: when we run a $t$-test for the purpose of comparing classifier test scores, we are looking to obtain a small $p$-value because that means that the two classifiers give significantly different results.

We can implement Student's $t$-test with SciPy's ttest_ind function from the stats module:


In [25]:
from scipy.stats import ttest_ind

Let's start with a simple example. Assume we ran five-fold cross-validation on two classifiers and obtained the following scores:


In [26]:
scores_a = [1, 1, 1, 1, 1]
scores_b = [0, 0, 0, 0, 0]

This means that Model A achieved 100% accuracy in all five folds, whereas Model B got 0% accuracy. In this case, it is clear that the two results are significantly different. If we run the $t$-test on this data, we should thus find a really small $p$-value:


In [27]:
ttest_ind(scores_a, scores_b)


Out[27]:
Ttest_indResult(statistic=inf, pvalue=0.0)

And we do! We actually get the smallest possible $p$-value, $p=0.0$.

On the other hand, what if the two classifiers got exactly the same numbers, just in different folds? In this case, we would expect the two classifiers to be equivalent, which is indicated by a really large $p$-value:


In [28]:
scores_a = [0.9, 0.9, 0.9, 0.8, 0.8]
scores_b = [0.8, 0.8, 0.9, 0.9, 0.9]
ttest_ind(scores_a, scores_b)


Out[28]:
Ttest_indResult(statistic=0.0, pvalue=1.0)

As expected, we get the largest possible $p$-value, $p=1.0$.
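
As a quick sanity check (a sketch only, assuming the equal-variance form that ttest_ind uses by default), the same statistic and $p$-value can be reproduced by hand from the pooled-variance formula:

from scipy import stats

a, b = np.asarray(scores_a), np.asarray(scores_b)
n_a, n_b = len(a), len(b)

# pooled sample variance and the corresponding two-sample t-statistic
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))

# two-sided p-value from the t-distribution with n_a + n_b - 2 degrees of freedom
p_val = 2 * stats.t.sf(abs(t_stat), df=n_a + n_b - 2)
t_stat, p_val    # should agree with the ttest_ind output above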

To see what happens in a more realistic example, let's return to our $k$-NN classifier from the earlier example. Using test scores obtained from ten-fold cross-validation, we can compare two different $k$-NN classifiers as follows.

Obtain a set of test scores for Model A. We choose Model A to be the $k$-NN classifier from earlier ($k=1$):


In [29]:
k1 = KNeighborsClassifier(n_neighbors=1)
scores_k1 = cross_val_score(k1, X, y, cv=10)
np.mean(scores_k1), np.std(scores_k1)


Out[29]:
(0.95999999999999996, 0.053333333333333323)

Obtain a set of test scores for Model B. Let's choose Model B to be a $k$-NN classifier with $k=3$:


In [30]:
k3 = KNeighborsClassifier(n_neighbors=3)
scores_k3 = cross_val_score(k3, X, y, cv=10)
np.mean(scores_k3), np.std(scores_k3)


Out[30]:
(0.96666666666666656, 0.044721359549995787)

Apply the $t$-test to both sets of scores:


In [31]:
ttest_ind(scores_k1, scores_k3)


Out[31]:
Ttest_indResult(statistic=-0.2873478855663425, pvalue=0.77712784875052965)

As you can see, this is a good example of two classifiers giving different cross-validation scores (96.0% and 96.7%) that turn out not to be significantly different! Because we get a large $p$-value ($p \approx 0.78$), we cannot reject the null hypothesis that the two classifiers perform equally well.

Implementing McNemar's test

A more advanced statistical technique is McNemar's test. This test can be used on paired data to determine whether there are any differences between the two samples. As in the case of the $t$-test, we can use McNemar's test to determine whether two models give significantly different classification results.

McNemar's test operates on pairs of data points. This means that we need to know, for both classifiers, how they classified each data point. Based on the number of data points that the first classifier got right but the second got wrong and vice versa, we can determine whether the two classifiers are equivalent.
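
To make the pairing explicit, the two counts the test needs can be read off a 2x2 contingency table of per-sample outcomes. Here is a small illustrative sketch with made-up 0/1 vectors (1 = classified correctly, 0 = misclassified):

correct_1 = np.array([1, 1, 0, 1, 0, 1])   # hypothetical outcomes for classifier 1
correct_2 = np.array([1, 0, 0, 1, 1, 1])   # hypothetical outcomes for classifier 2

# 2x2 contingency table of paired outcomes
both_right  = np.sum((correct_1 == 1) & (correct_2 == 1))
only_first  = np.sum((correct_1 == 1) & (correct_2 == 0))   # "b" in the function below
only_second = np.sum((correct_1 == 0) & (correct_2 == 1))   # "c" in the function below
both_wrong  = np.sum((correct_1 == 0) & (correct_2 == 0))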


In [32]:
from scipy.stats import binom
def mcnemar_midp(b, c):
    """
    Compute McNemar's test using the "mid-p" variant suggested by:
    
    M.W. Fagerland, S. Lydersen, P. Laake. 2013. The McNemar test for 
    binary matched-pairs data: Mid-p and asymptotic are better than exact 
    conditional. BMC Medical Research Methodology 13: 91.
    
    `b` is the number of observations correctly labeled by the first---but 
    not the second---system; `c` is the number of observations correctly 
    labeled by the second---but not the first---system.
    """
    n = b + c
    x = min(b, c)
    dist = binom(n, .5)
    p = 2. * dist.cdf(x)
    midp = p - dist.pmf(x)
    return midp

Let's assume the preceding Model A and Model B were applied to the same five data points. Whereas Model A classified every data point correctly (denoted with a 1), Model B got all of them wrong (denoted with a 0):


In [33]:
scores_a = np.array([1, 1, 1, 1, 1])
scores_b = np.array([0, 0, 0, 0, 0])

McNemar's test wants to know two things:

  • How many data points did Model A get right but Model B get wrong?
  • How many data points did Model A get wrong but Model B get right?

We can check which data points Model A got right but Model B got wrong as follows:


In [34]:
a1_b0 = scores_a * (1 - scores_b)
a1_b0


Out[34]:
array([1, 1, 1, 1, 1])

Of course, this applies to all of the data points. The opposite is true for the data points that Model B got right and Model A got wrong:


In [35]:
a0_b1 = (1 - scores_a) * scores_b
a0_b1


Out[35]:
array([0, 0, 0, 0, 0])

Feeding these numbers to McNemar's test should return a small $p$-value because the two classifiers are obviously different:


In [36]:
mcnemar_midp(a1_b0.sum(), a0_b1.sum())


Out[36]:
0.03125

And it does!

We can apply McNemar's test to a more complicated example, but we can no longer operate on cross-validation scores. The reason is that we need to know the classification result for every single data point, not just an average. Hence, it makes more sense to apply McNemar's test to the results of leave-one-out cross-validation.

Going back to $k$-NN with $k=1$ and $k=3$, we can calculate their scores as follows:


In [37]:
scores_k1 = cross_val_score(k1, X, y, cv=LeaveOneOut())
scores_k3 = cross_val_score(k3, X, y, cv=LeaveOneOut())

The numbers of data points that one classifier got right but the other got wrong are as follows:


In [38]:
np.sum(scores_k1 * (1 - scores_k3))


Out[38]:
0.0

In [39]:
np.sum((1 - scores_k1) * scores_k3)


Out[39]:
0.0

We got no differences whatsoever! Now it becomes clear why the $t$-test led us to believe that the two classifiers are identical. As a result, if we feed the two sums into McNemar's test function, we get the largest possible $p$-value, $p=1.0$:


In [40]:
mcnemar_midp(np.sum(scores_k1 * (1 - scores_k3)),
             np.sum((1 - scores_k1) * scores_k3))


Out[40]:
1.0