Messy modelling: overfitting, cross-validation, and the bias-variance trade-off

Introduction

In this post you will get to grips with perhaps the most essential concept in machine learning: the bias-variance trade-off. The main idea is that you want to create models that are as good at prediction as possible while still being applicable to new data (i.e. generalizable). The danger is that you can easily create models that overfit to the local noise in your specific dataset; since that noise is random and different in every dataset, this leads to poor generalizability. Essentially, you want models that capture only the useful components of the dataset. Models that generalize very well but are too inflexible to generate good predictions are the other extreme we want to avoid (underfitting).

We discuss and demonstrate these concepts using the k-nearest neighbours (kNN) algorithm, which has a single parameter, k, that can be varied to cleanly demonstrate underfitting, overfitting and generalization. Together, this bundle of concepts describing the balance between underfitting and overfitting is referred to as the bias-variance trade-off. Here is a table summarizing these different, related aspects of models, which you can refer to throughout this post.

We will explain what all of these terms mean and how they are inter-related. We will also discuss cross-validation, which is a good way of estimating the accuracy and generalizability of your models.

You will encounter all of these concepts in the next few blog posts in this series, which will cover model optimization, random forests, Naive Bayes, logistic regression and combining different models into an ensembled meta-model.

Generating the dataset

Let's start off by building an artificial dataset to play with. We can do this easily with the make_moons() function from the sklearn.datasets module (the related make_classification() function generates roughly linearly separable data, if you want something simpler). Specifically, we will generate a relatively simple binary classification problem. To make it a bit more interesting, let's make the data crescent-shaped and add some random noise. This should make it more realistic and increase the difficulty of classifying observations.


In [2]:
# Creating the dataset
# e.g. make_moons generates crescent-shaped data
# Check out make_classification, which generates ~linearly-separable data
from sklearn.datasets import make_moons

X, y = make_moons(
    n_samples=500,  # the number of observations
    random_state=1,
    noise=0.3  # standard deviation of the Gaussian noise added to the data
)

# Take a peek
print(X[:10,])
print(y[:10])


[[ 0.50316464  0.11135559]
 [ 1.06597837 -0.63035547]
 [ 0.95663377  0.58199637]
 [ 0.33961202  0.40713937]
 [ 2.17952333 -0.08488181]
 [ 2.00520942  0.7817976 ]
 [ 0.12531776 -0.14925731]
 [ 1.06990641  0.36447753]
 [-0.76391099 -0.6136396 ]
 [ 0.55678871  0.8810501 ]]
[1 1 0 0 1 1 1 0 0 0]

The dataset we just generated looks a bit like this:


In [42]:
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

%matplotlib inline

# Plot the first feature against the other, color by class
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="#EE3D34", marker="x")
plt.scatter(X[y == 0, 0], X[y == 0, 1], color="#4458A7", marker="o")


Out[42]:
<matplotlib.collections.PathCollection at 0x10af82f10>

Next up, let's split the dataset into a training and test set. The training set will be used to develop and tune our models. The test set will be completely left alone until the very end, at which point you'll run your finished models on it. Having a test set will allow you to get a good estimate of how well your models would perform out in the wild on unseen data.


In [43]:
from sklearn.model_selection import train_test_split

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

We are going to try to predict the classes with a k Nearest Neighbor (kNN) classifier. Chapter 2 of the Introduction to Statistical Learning book provides a great intro to the theory behind kNN. We are huge fans of the ISLR book, so definitely check it out if you have the time. You could also have a look at this previous post that teaches you how to implement the algorithm from scratch in Python.

Introducing the k hyperparameter in kNN

The kNN algorithm classifies a new data point by using information about its k nearest neighbours. It simply looks at the classes of the data points most similar to it (its 'nearest neighbours') and assigns the new data point to the most common class among those neighbours. When using kNN, you have to set the value of k ahead of time, and it is not trivial to know which value to use.
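
To make the mechanics concrete, here is a minimal sketch of the kNN classification rule written from scratch (this is just an illustration, not scikit-learn's implementation; the helper name knn_predict is made up for this example):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # compute the Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # look up the classes of the k closest training points
    nearest_classes = y_train[np.argsort(distances)[:k]]
    # assign the most common class among those neighbours
    return Counter(nearest_classes).most_common(1)[0][0]

# e.g. classify the first test point using its 5 nearest neighbours:
# knn_predict(XTrain, yTrain, XTest[0], k=5)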

If the value of k is high (e.g. k=99), then the model considers a large number of neighbours when making a decision about the class of an unknown data point. This means that the model is quite constrained, since it has to take a large amount of information into account when classifying instances. In other words, a high value of k gives rise to relatively "rigid" model behaviour.

By contrast, if the value of k is low (e.g. k=1 or k=2), then only a few neighbours are taken into account when making a classification decision. This gives a very flexible, complex model - one that fits very closely to the precise shape of the dataset. Hence, the model's predictions are much more dependent on the local tendencies of the data (crucially, this includes the noise!).

Take a look at how the kNN algorithm separates the training cases when k=99 compared to when k=1. The green line is the decision boundary on the training data (i.e. where the algorithm switches between classifying a data point as blue and classifying it as red).

In a minute, you'll learn to generate these plots yourself.

When k=99 (on the left), it looks like the model fit might be a bit too smooth and could stand to fit the data a bit closer. The model has low flexibility and low complexity. It paints the decision boundary with a broad brush. It has relatively high bias because we can tell it is not modelling the data as well as it could - it models the underlying generative process of the data as something too simple, and this is highly biased away from the ground truth. But, the decision boundary would probably look very similar if we redrew it on a slightly different dataset. It is a stable model that won't vary a lot - it has low variance.

When k=1 (on the right), you can see that the model is massively overfitting to the noise. It is technically perfectly correct on the training set (the error in the bottom right hand corner is equal to 0.00!), but hopefully you can see how this fit is way too sensitive to individual data points. Keep in mind we added noise to the dataset - it looks like this model fit is taking the noise too seriously and is fitting very closely to it. We can say that the k=1 model has high flexibility and high complexity. It tunes very tightly to the data. It also has low bias - if nothing else, the decision boundary certainly fits the trends in the data. But, the fitted boundary would drastically change on even slightly different data - it would vary significantly, i.e. the k=1 model has high variance.

But how well do these models generalize, i.e. how well would they perform on new data?

We have so far only looked at the training data, but quantifying training error isn't that useful. We want to know how well the models are modelling the underlying generative process of the data, not how well they recapitulate what they just learned on the training set. Let's take a look at how they perform on test data, since that gives a better impression of whether our models are actually good or not.


In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn99 = KNeighborsClassifier(n_neighbors = 99)
knn99.fit(XTrain, yTrain)
yPredK99 = knn99.predict(XTest)
print "Overall Error of k=99 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK99), 2)

knn1 = KNeighborsClassifier(n_neighbors = 1)
knn1.fit(XTrain, yTrain)
yPredK1 = knn1.predict(XTest)
print "Overall Error of k=1 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK1), 2)


Overall Error of k=99 Model: 0.14
Overall Error of k=1 Model: 0.14

Actually, it looks like these models perform approximately equally well on the test data. Here are the decision boundaries we learned on the training set, applied to the test set. See if you can figure out where the two models are making their mistakes.

It seems that the k=99 model isn't doing a good job at capturing the crescent shape of the data (it is underfitting), while the k=1 model is making mistakes from being horribly overfitted. The hallmark of overfitting is good training performance and bad testing performance, which is what we observe here.
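
To see this hallmark in numbers, we can compare training and test error directly for the two models we fitted above (a quick check, reusing knn1 and knn99 from the earlier cell):

# compare training error vs test error for the k=99 and k=1 models
for name, model in [("k=99", knn99), ("k=1", knn1)]:
    train_error = 1 - metrics.accuracy_score(yTrain, model.predict(XTrain))
    test_error = 1 - metrics.accuracy_score(yTest, model.predict(XTest))
    print("{}: training error = {:.2f}, test error = {:.2f}".format(name, train_error, test_error))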

Maybe intermediate values of k are where we want to be? Let's give it a shot:


In [44]:
knn50 = KNeighborsClassifier(n_neighbors = 50)
knn50.fit(XTrain, yTrain)
yPredK50 = knn50.predict(XTest)
print "Overall Error of k=50 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK50), 2)


Overall Error of k=50 Model: 0.11

Looking better! Let's check out the decision boundary for the k=50 model.

Much better - the model fit is similar to the actual trend in the dataset and this improvement is reflected in a lower test set error.

The bias-variance trade-off: concluding comments

Hopefully you now have a good intuition over what it means for models to underfit and overfit. See if all of the terms in the beginning of this post now make sense. Before we throw tons of code at you, let's finish up talking about the bias-variance trade-off.

To recap, when we train machine learning algorithms on a dataset, what we are really interested in is how our model will perform on an independent data set. It is not enough to do a good job classifying instances on the training set. Essentially, we are only interested in building models that are generalizable - getting 100% accuracy on the training set is not impressive, and is simply an indicator of overfitting. Overfitting is the situation in which we have fitted our model too closely to the data, and have tuned to the noise instead of just to the signal.

To be clear: strictly speaking, we are not trying to model the trends in the dataset. We try to model the underlying generative process that has created the data. The specific dataset we happen to be working with is just a small set of instances (i.e. a sample) of the ground truth, which brings with it its own noise and peculiarities.

Here is a summary figure showing how under-fitting (high bias, low variance), properly fitting, and over-fitting (low bias, high variance) models fare on the training compared to the test sets:

This idea of building generalizable models is the motivation behind splitting your dataset into a training set (on which models can be trained) and a test set (which is held out until the very end of your analysis, and provides an accurate measure of model performance).

But - big warning! It's also possible to overfit to the test set. If we were to try lots of different models and keep changing them in order to chase accuracy points on the test set, then information from the test set can inadvertently leak into our model creation phase, which is a big no-no. We need a way around this.

Estimating model performance using k-fold cross validation

Enter k-fold cross-validation, a handy technique for measuring a model's performance using only the training set (note that the k here is just the conventional name for the number of folds - it has nothing to do with the k in kNN). Say that we want to do 10-fold cross-validation. The process is as follows: we randomly partition the training set into 10 equal sections. We then train an algorithm on 9/10ths of the training set (i.e. 9 of the 10 sections) and evaluate its performance on the remaining section. This gives us some measure of the model's performance (e.g. overall accuracy). Next, we train the algorithm on a different 9/10ths of the training set and evaluate on the section that is now held out. We repeat this 10 times in total, obtain 10 different measures of model performance, and average these values to get an overall measure of performance. Of course, we could have chosen some number of folds other than 10. To keep on with the example, the process behind 10-fold CV looks like this:

We can use k-fold cross validation to get an estimate of model accuracy, and we can use these estimates to tweak our model until we are happy. This lets us leave the test data alone until the very end, thus side-stepping the danger of overfitting to it. k-fold cross validation is extremely popular and very useful, especially if you're trying out lots and lots of different models (e.g. if you want to test how well a load of differently parameterized models perform).
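
To make the procedure concrete, here is a sketch of 10-fold cross-validation done "by hand" with scikit-learn's KFold, using the k=50 model from above as the model being evaluated (in the next section we will instead use the cross_val_score helper, which wraps exactly this kind of loop):

from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=1)
fold_accuracies = []

for train_idx, val_idx in kf.split(XTrain):
    model = KNeighborsClassifier(n_neighbors=50)
    model.fit(XTrain[train_idx], yTrain[train_idx])  # train on 9/10ths of the training set
    fold_accuracies.append(model.score(XTrain[val_idx], yTrain[val_idx]))  # evaluate on the held-out section

print("Mean 10-fold CV accuracy:", sum(fold_accuracies) / len(fold_accuracies))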

Comparing training error, cross-validation error, and test error

Let's try out different values for k and see which ones fare best on the training set, in k-fold cross validation, and on the test set!


In [46]:
import numpy as np
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier()

# the range of number of neighbours we want to test
n_neighbors = np.arange(1, 141, 2)

# here we store the results of each model
train_scores = list()
test_scores = list()
cv_scores = list()

# loop through possible n_neighbours and try them out
for n in n_neighbors:
    knn.n_neighbors = n
    knn.fit(XTrain, yTrain)
    train_scores.append(1 - metrics.accuracy_score(yTrain, knn.predict(XTrain)))  # training error will under-estimate the true error
    test_scores.append(1 - metrics.accuracy_score(yTest, knn.predict(XTest)))
    cv_scores.append(1 - cross_val_score(knn, XTrain, yTrain, cv = 5).mean())  # we take the mean of the CV scores

# what do these different sources think is the best value of k?
print('The best values of k are: \n{} According to the Training Set\n{} According to the Test Set and\n{} According to Cross-Validation'.format(
        n_neighbors[np.array(train_scores) == min(train_scores)],
        n_neighbors[np.array(test_scores) == min(test_scores)],
        n_neighbors[np.array(cv_scores) == min(cv_scores)]
    )
)

# let's plot the error we get with different values of k 
plt.plot(n_neighbors, train_scores, c = "grey", label = "Training Set")
plt.plot(n_neighbors, test_scores, c = "orange", label = "Test Set")
plt.plot(n_neighbors, cv_scores, c = "green", label = "Cross-Validation")
plt.xlabel('Number of K Nearest Neighbors')
plt.ylabel('Classification Error')
plt.gca().invert_xaxis()
plt.legend(loc = "lower left")
plt.show()


The best values of k are: 
[1] According to the Training Set
[45 47 49 55 57 59 61 63 65 67 75 77 79] According to the Test Set and
[15] According to Cross-Validation

This plot really reinforces the point that low values of k tend to produce very little error (i.e. high accuracy) on the training set, but much more error on the test set and in cross-validation. We can also see that cross-validation is a reasonable estimator of test error.

Wait, what about datasets without lots of noise?

Whew, lots of theory! Just a bit more. We have been talking all along about how overfitting occurs when models start tuning to the noise. But what if we had a dataset without a lot of noise? Would overfitting really be so bad then? What would "overfitting" even mean in this context? For example, say we had this dataset:


In [47]:
# Let's tone down the noise in the dataset
X_no_noise, y_no_noise = make_moons(
    n_samples=500,  # the number of observations
    random_state=1,
    noise=0.1
)

# Plot the first feature against the other, color by class
plt.scatter(X_no_noise[y_no_noise == 1, 0], X_no_noise[y_no_noise == 1, 1], color="#EE3D34", marker="x")
plt.scatter(X_no_noise[y_no_noise == 0, 0], X_no_noise[y_no_noise == 0, 1], color="#4458A7", marker="o")


Out[47]:
<matplotlib.collections.PathCollection at 0x10a5f6590>

Running the kNN code (with k=1) on the original noisy dataset (left) and the new less noisy dataset (right) leads to this result:

Obviously this is an extreme example, but in this case it turned out that the less noisy data (right) was actually well served by a very flexible model (k=1), whereas the same model previously overfitted on the noisier dataset (left). Most real-life datasets will be full of noise, so overfitting is always a danger. However, what exactly constitutes "overfitting" will differ tremendously between datasets, and you have to decide what course of action is optimal. The best guidance is your model's performance on the test set and in cross-validation. If a model that "should" be overfitting is actually generalizing well, then there is no need to worry.
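
If you want to check this yourself, one way is to compare the cross-validated error of a k=1 model on the noisy and the less noisy training data (a sketch; the XTrainNN/yTrainNN and knn_flex names are made up for this illustration):

# split the less noisy data the same way we split the original data
XTrainNN, XTestNN, yTrainNN, yTestNN = train_test_split(X_no_noise, y_no_noise, random_state=1)

# 5-fold CV error of a k=1 model on each training set
knn_flex = KNeighborsClassifier(n_neighbors=1)
print("k=1 CV error, noisy data:     ", 1 - cross_val_score(knn_flex, XTrain, yTrain, cv=5).mean())
print("k=1 CV error, less noisy data:", 1 - cross_val_score(knn_flex, XTrainNN, yTrainNN, cv=5).mean())

If the cleaner data really does support such a flexible fit, the k=1 cross-validation error should stay low there.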

Just show me the code!

Aha, so you've made it this far. Here is our code for generating the above plots and doing the training and testing of the different kNN models. This code is largely a simplified version of this scikit-learn example, and most of it deals with the finicky details of making the plots look nice. The meaty machine learning parts - splitting the dataset, fitting the algorithm, and testing it - were covered above.


In [10]:
import numpy as np

def detect_plot_dimension(X, h=0.02, b=0.05):
    x_min, x_max = X[:, 0].min() - b, X[:, 0].max() + b
    y_min, y_max = X[:, 1].min() - b, X[:, 1].max() + b
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    dimension = xx, yy
    return dimension
    
def detect_decision_boundary(dimension, model):
    xx, yy = dimension  # unpack the dimensions
    boundary = model.predict(np.c_[xx.ravel(), yy.ravel()])
    boundary = boundary.reshape(xx.shape)  # reshape the predictions to match the grid
    return boundary

def plot_decision_boundary(panel, dimension, boundary, colors=['#DADDED', '#FBD8D8']):
    xx, yy = dimension  # unpack the dimensions
    panel.contourf(xx, yy, boundary, cmap=ListedColormap(colors), alpha=1)
    panel.contour(xx, yy, boundary, colors="g", alpha=1, linewidths=0.5)  # the decision boundary in green

def plot_dataset(panel, X, y, colors=["#EE3D34", "#4458A7"], markers=["x", "o"]):
    panel.scatter(X[y == 1, 0], X[y == 1, 1], color=colors[0], marker=markers[0])
    panel.scatter(X[y == 0, 0], X[y == 0, 1], color=colors[1], marker=markers[1])

def calculate_prediction_error(model, X, y):
    yPred = model.predict(X)
    score = 1 - round(metrics.accuracy_score(y, yPred), 2)
    return score

def plot_prediction_error(panel, dimension, score, b=.3):
    xx, yy = dimension  # unpack the dimensions
    panel.text(xx.max() - b, yy.min() + b, ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')

In [23]:
def explore_fitting_boundaries(model, n_neighbors, datasets, width):  
    # determine the height of the plot, given that the aspect ratio of each panel should be equal
    height = float(width)/len(n_neighbors) * len(datasets.keys())

    nrows = len(datasets.keys())
    ncols = len(n_neighbors)
    
    # set up the plot
    figure, axes = plt.subplots(
        nrows,
        ncols,
        figsize=(width, height),
        sharex=True,
        sharey=True
    )

    dimension = detect_plot_dimension(X, h=0.02)  # the plotting grid for each panel, based on the (global) dataset X

    # Plotting the dataset and decision boundaries
    i = 0
    for n in n_neighbors:
        model.n_neighbors = n
        model.fit(datasets["Training Set"][0], datasets["Training Set"][1])
        boundary = detect_decision_boundary(dimension, model)
        j = 0
        for d in datasets.keys():
            try:
                panel = axes[j, i]
            except (TypeError, IndexError):
                if (nrows * ncols) == 1:
                    panel = axes
                elif nrows == 1:  # if we only have one dataset
                    panel = axes[i]
                elif ncols == 1:  # if we only try one number of neighbors
                    panel = axes[j]
            plot_decision_boundary(panel, dimension, boundary)  # plot the decision boundary
            plot_dataset(panel, X=datasets[d][0], y=datasets[d][1])  # plot the observations
            score = calculate_prediction_error(model, X=datasets[d][0], y=datasets[d][1])
            plot_prediction_error(panel, dimension, score, b=0.2)  # plot the score
            
            # make compacted layout
            panel.set_frame_on(False)
            panel.set_xticks([])
            panel.set_yticks([])
            
            # format the axis labels
            if i == 0:
                panel.set_ylabel(d)
            if j == 0:
                panel.set_title('k={}'.format(n))
            j += 1
        i += 1    

    plt.subplots_adjust(hspace=0, wspace=0)  # make compacted layout

We then run the code like this:


In [26]:
# specify the model and settings
model = KNeighborsClassifier()
n_neighbors = [200, 100, 20, 5, 1]
datasets = {
    "Training Set": [XTrain, yTrain],
    "Test Set": [XTest, yTest]
}
width = 15

explore_fitting_boundaries(model=model, n_neighbors=n_neighbors, datasets=datasets, width=width)


Conclusion

The bias-variance trade-off rears its head in a lot of different areas of machine learning. All algorithms can be considered to have a certain degree of flexibility, and this is by no means specific to kNN. The goal of finding the sweet spot of flexibility - enough to describe the patterns in the data well, but not so much that the model stops generalizing to new data - applies to basically all algorithms.

In case you're interested, read a bit more about the bias-variance trade-off here. On a related note, there has been a lot of work on the topic of playing with model flexibility to improve performance - a good place to start is by reading about different methods of regularizing linear regression.
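
If you want a small taste of that before diving in, here is a minimal sketch (with made-up toy data, purely for illustration) of how a regularization strength controls model flexibility in much the same way that k does for kNN: a heavily penalized polynomial fit is rigid (higher bias, lower variance), while a barely penalized one is very flexible (lower bias, higher variance).

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# toy 1D regression data, purely for illustration
rng = np.random.RandomState(1)
x = rng.uniform(-3, 3, size=(100, 1))
y_toy = np.sin(x).ravel() + rng.normal(scale=0.3, size=100)

# a degree-10 polynomial is very flexible; the Ridge penalty alpha reins it in
# (larger alpha -> more constrained model, much like larger k in kNN)
for alpha in [1e-6, 1e-2, 1e2]:
    model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=alpha))
    print("alpha = {:g}: mean 5-fold CV R^2 = {:.2f}".format(
        alpha, cross_val_score(model, x, y_toy, cv=5).mean()))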