Messy modelling: overfitting, cross-validation, and the bias-variance trade-off

Introduction

In the next blog post, you will learn how to tune models. Other posts in this series will cover random forests, naive Bayes, logistic regression, and combining different models into an ensemble meta-model.

Loading and exploring the dataset

We start off by collecting the dataset. It can be found both online and (in a slightly nicer form) in our GitHub repository, so we just fetch it via wget (note: make sure you first run pip install wget in your Terminal, since wget is not a preinstalled Python library). This downloads a copy of the dataset to your current working directory.


In [1]:
import wget
import pandas as pd

# Import the dataset
data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/spam/spam_dataset.csv'
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=",")

# Take a peek at the data
dataset.head()


Out[1]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total is_spam
0 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 ... 0.00 0.000 0 0.778 0.000 0.000 3.756 61 278 1
1 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0 0.372 0.180 0.048 5.114 101 1028 1
2 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0 0.276 0.184 0.010 9.821 485 2259 1
3 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.137 0 0.137 0.000 0.000 3.537 40 191 1
4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.135 0 0.135 0.000 0.000 3.537 40 191 1

5 rows × 58 columns

Introducing kNN

Before we get to random forests, let's use a simple k-nearest-neighbours (kNN) classifier to illustrate how the choice of a hyperparameter affects performance. The cell below (which assumes the feature matrix X and class label vector y that we extract from the dataset further down) compares a small and a large number of neighbours using 5-fold cross-validation, then scans a whole range of k values and plots the training, test, and cross-validated accuracies so we can see over- and underfitting in action.


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn import metrics

# Split the data into training and test sets (seed 1 for reproducibility)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Compare a small and a large number of neighbours with 5-fold CV
knn3 = KNeighborsClassifier(n_neighbors=3)
knn3scores = cross_val_score(knn3, XTrain, yTrain, cv=5)
print (knn3scores)
print ("Mean of scores KNN3:", knn3scores.mean())

knn99 = KNeighborsClassifier(n_neighbors=99)
knn99scores = cross_val_score(knn99, XTrain, yTrain, cv=5)
print (knn99scores)
print ("Mean of scores KNN99:", knn99scores.mean())

# Grid search over a range of k values, scoring each with 10-fold CV
knn = KNeighborsClassifier()
n_neighbors = np.arange(3, 151, 2)

grid = GridSearchCV(knn, [{'n_neighbors': n_neighbors}], cv=10)
grid.fit(XTrain, yTrain)
cv_scores = [x[1] for x in grid.grid_scores_]

# Also record plain training and test set accuracies for each k
train_scores = list()
test_scores = list()

for n in n_neighbors:
    knn.n_neighbors = n
    knn.fit(XTrain, yTrain)
    train_scores.append(metrics.accuracy_score(yTrain, knn.predict(XTrain)))
    test_scores.append(metrics.accuracy_score(yTest, knn.predict(XTest)))

plt.plot(n_neighbors, train_scores, c="blue", label="Training Scores")
plt.plot(n_neighbors, test_scores, c="brown", label="Test Scores")
plt.plot(n_neighbors, cv_scores, c="black", label="CV Scores")
plt.xlabel('Number of K nearest neighbors')
plt.ylabel('Classification Accuracy')
plt.gca().invert_xaxis()
plt.legend(loc="upper left")
plt.show()

Let's examine the shape of the dataset (the number of rows and columns), the types of features it contains, and some summary statistics for each feature.


In [206]:
# Examine shape of dataset and some column names
print (dataset.shape)
print (dataset.columns.values)

# Summarise feature values
dataset.describe()


(4601, 58)
['word_freq_make' 'word_freq_address' 'word_freq_all' 'word_freq_3d'
 'word_freq_our' 'word_freq_over' 'word_freq_remove' 'word_freq_internet'
 'word_freq_order' 'word_freq_mail' 'word_freq_receive' 'word_freq_will'
 'word_freq_people' 'word_freq_report' 'word_freq_addresses'
 'word_freq_free' 'word_freq_business' 'word_freq_email' 'word_freq_you'
 'word_freq_credit' 'word_freq_your' 'word_freq_font' 'word_freq_000'
 'word_freq_money' 'word_freq_hp' 'word_freq_hpl' 'word_freq_george'
 'word_freq_650' 'word_freq_lab' 'word_freq_labs' 'word_freq_telnet'
 'word_freq_857' 'word_freq_data' 'word_freq_415' 'word_freq_85'
 'word_freq_technology' 'word_freq_1999' 'word_freq_parts' 'word_freq_pm'
 'word_freq_direct' 'word_freq_cs' 'word_freq_meeting' 'word_freq_original'
 'word_freq_project' 'word_freq_re' 'word_freq_edu' 'word_freq_table'
 'word_freq_conference' 'char_freq_;' 'char_freq_(' 'char_freq_['
 'char_freq_!' 'char_freq_$' 'char_freq_#' 'capital_run_length_average'
 'capital_run_length_longest' 'capital_run_length_total' 'is_spam']
Out[206]:
word_freq_make word_freq_address word_freq_all word_freq_3d word_freq_our word_freq_over word_freq_remove word_freq_internet word_freq_order word_freq_mail ... char_freq_; char_freq_( char_freq_[ char_freq_! char_freq_$ char_freq_# capital_run_length_average capital_run_length_longest capital_run_length_total is_spam
count 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 ... 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000 4601.000000
mean 0.104553 0.213015 0.280656 0.065425 0.312223 0.095901 0.114208 0.105295 0.090067 0.239413 ... 0.038575 0.139030 0.016976 0.269071 0.075811 0.044238 5.191515 52.172789 283.289285 0.394045
std 0.305358 1.290575 0.504143 1.395151 0.672513 0.273824 0.391441 0.401071 0.278616 0.644755 ... 0.243471 0.270355 0.109394 0.815672 0.245882 0.429342 31.729449 194.891310 606.347851 0.488698
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.588000 6.000000 35.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.065000 0.000000 0.000000 0.000000 0.000000 2.276000 15.000000 95.000000 0.000000
75% 0.000000 0.000000 0.420000 0.000000 0.380000 0.000000 0.000000 0.000000 0.000000 0.160000 ... 0.000000 0.188000 0.000000 0.315000 0.052000 0.000000 3.706000 43.000000 266.000000 1.000000
max 4.540000 14.280000 5.100000 42.810000 10.000000 5.880000 7.270000 11.110000 5.260000 18.180000 ... 4.385000 9.752000 4.081000 32.478000 6.003000 19.829000 1102.500000 9989.000000 15841.000000 1.000000

8 rows × 58 columns

Next up, let's convert the pandas dataframe into a numpy array and isolate the outcome variable we'd like to predict (here, 0 means 'non-spam', 1 means 'spam'):


In [208]:
import numpy as np

# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]

Next up, let's split the dataset into a training and test set. The training set will be used to develop and tune our predictive models. The test set will be completely left alone until the very end, at which point we'll run our finished models on it. Having a test set allows us to get a good estimate of how well our models would perform out in the wild on unseen data.


In [209]:
from sklearn.cross_validation import train_test_split

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

We are first going to try to predict spam emails with a random forest classifier. Chapter 8 of the Introduction to Statistical Learning book provides a truly excellent introduction to the theory behind random forests. Briefly, random forests build a collection of classification trees, each of which tries to predict the class by recursively splitting the data on the features (and feature values) that separate the classes best. Each tree is trained on bootstrapped data, and each split is only allowed to consider a random subset of the features. So, an element of randomness is introduced, a variety of different trees are built, and the 'random forest' ensembles these base learners together.
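To make that mechanism concrete, here is a minimal, purely illustrative sketch (not the code we use in this post) of the idea: each tree sees a bootstrap resample of the training rows and is restricted in the features it may consider at each split, and the forest combines the trees by majority vote. scikit-learn's RandomForestClassifier handles all of this (and more) internally.


In [ ]:
# Purely illustrative: a tiny hand-rolled "forest" of decision trees,
# each trained on a bootstrap sample of the training data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

n_trees = 10
trees = []

for i in range(n_trees):
    # Bootstrap: sample training rows with replacement
    rows = np.random.choice(len(XTrain), size=len(XTrain), replace=True)
    # max_features limits the features considered at each split
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(XTrain[rows], yTrain[rows])
    trees.append(tree)

# Ensemble prediction: majority vote across the individual trees
votes = np.array([tree.predict(XTest) for tree in trees])
forest_predictions = (votes.mean(axis=0) > 0.5).astype(float)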

Out of the box, scikit's random forest classifier already performs quite well on the spam dataset:


In [211]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

rf = RandomForestClassifier()
rf.fit(XTrain, yTrain)

rf_predictions = rf.predict(XTest)

print (metrics.classification_report(yTest, rf_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, rf_predictions),2))


             precision    recall  f1-score   support

        0.0       0.95      0.97      0.96       701
        1.0       0.95      0.92      0.94       450

avg / total       0.95      0.95      0.95      1151

Overall Accuracy: 0.95

An overall accuracy of 0.95 is very good for a start, but keep in mind that this is a heavily idealized dataset. Next up, we are going to learn how to pick the best parameters for the random forest algorithm (as well as for an SVM and logistic regression classifier) in order to get better models with (hopefully!) improved accuracy.

The perils of overfitting

In order to build the best possible model that does a good job at describing the underlying trends in a dataset, we need to pick the right hyperparameter (HP) values. In what follows, we will introduce different strategies for searching for the set of HPs that defines the best model, but first we need to make a slight detour to explain how to avoid a major pitfall when it comes to tuning models: overfitting.

The hallmark of overfitting is good training performance and bad testing performance.

As we mentioned above, HPs are not optimised while a learning algorithm is learning, so we need other strategies to optimise them. The most basic one is simply to try out different possible HP values and see how the model performs. In a random forest, some hyperparameters we can optimise are n_estimators and max_features. n_estimators controls the number of trees in the forest - generally the more the better, but more trees come at the expense of longer training time. max_features controls the size of the random selection of features the algorithm is allowed to consider when splitting a node.

Let's try out some HP values.


In [212]:
n_estimators = np.array([5, 100])
max_features  = np.array([10, 50])

We can manually write a small loop to test out how well the different combinations of these fare (later, we'll find out better ways to do this):


In [117]:
from itertools import product

# Get a grid of all possible combinations of HP values
hp_combinations = list(product(n_estimators, max_features))

for n_est, max_feat in hp_combinations:
    
    print ((n_est, max_feat))
    
    # Train and output accuracies
    rf = RandomForestClassifier(n_estimators=n_est, 
                                max_features=max_feat)
    rf.fit(XTrain, yTrain)
    RF_predictions = rf.predict(XTest)
    print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))


(5, 10)
Overall Accuracy: 0.94
(5, 50)
Overall Accuracy: 0.94
(100, 10)
Overall Accuracy: 0.96
(100, 50)
Overall Accuracy: 0.95

Looks like the higher value of n_estimators and the lower value of max_features did better. However, manually searching for the best HPs in this way is not efficient, and it could potentially lead to models that perform well on the training data but do not generalise well to a new dataset, which is what we are actually interested in. This phenomenon of building models that do not generalise well, because they fit too closely to the training dataset, is called overfitting. It is a very important concept in machine learning, and it is very much worth getting a better understanding of what it is. Let's briefly discuss the bias-variance trade-off.

The bias-variance trade-off

When we train machine learning algorithms on a dataset, what we are really interested in is how our model will perform on an independent data set. It is not enough to predict whether emails in our training set are spam - how well would the model fare when predicting if a completely new, previously unseen datapoint is spam or not?

This is the idea behind splitting your dataset into a training set (on which models can be trained) and a test set (which is held out until the very end of your analysis, and provides an accurate measure of model performance). Essentially, we are only interested in building models that are generalizable - getting 100% accuracy on the training set is not impressive, and is simply an indicator of overfitting. Overfitting is the situation in which we have fitted our model too closely to the data, and have tuned to the noise instead of just to the signal.

In fact, this concept of wanting to fit algorithms to the training data well, but not so tightly that the model doesn't generalize, is a pervasive problem in machine learning. A common term for this balancing act is the bias-variance trade-off. Here is a nice introductory article on the topic that goes into more depth.

Have a look at how underfitting (high bias, low variance), properly fitting, and overfitting (low bias, high variance) models fare on the training compared to the test sets.

"Bias" and "variance" have got to be some of the least helpful terms in machine learning. One way to think of them is: a model that underfits (e.g. the straight line) is quite a bit wrong - it models the underlying generative process of the data as something too simple, and this is highly biased away from the truth. But, the straight line fit is not going to change very much across different datasets. The opposite trend applies to overfitted models.

Hence, we never try to optimize the model's performance on the training data, because it is a misguided and fruitless endeavour. But wait, didn't we also say that the test set is not meant to be touched until we are completely done training our model? How are we meant to optimize our hyperparameters then?

k-fold cross-validation

Enter k-fold cross-validation, which is a handy technique for measuring a model's performance using only the training set. k-fold CV is a general method, and is not specific to hyperparameter optimization, but it is very useful for that purpose. Say that we want to do, for example, 10-fold cross-validation. The process is as follows: we randomly partition the training set into 10 equal sections. Then we train an algorithm on 9/10ths (i.e. 9 out of the 10 sections) of the training set and evaluate its performance on the remaining section. This gives us some measure of the model's performance (e.g. overall accuracy). We then train the algorithm on a different 9/10ths of the training set and evaluate on the remaining (different from before) section. We repeat this process 10 times in total, get 10 different measures of model performance, and average these values to get an overall measure of performance. Of course, we could have chosen some number of folds other than 10. To keep on with the example, the process behind 10-fold CV looks like this:
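To make the mechanics concrete, here is a rough, hand-rolled sketch of 10-fold CV on our training set (for illustration only; the fold splitting below is deliberately naive and assumes the rows are in random order, which they are after train_test_split, and in practice we just call scikit-learn's cross_val_score):


In [ ]:
# Hand-rolled 10-fold cross-validation on the training set (illustration only)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

k = 10
folds = np.array_split(np.arange(len(XTrain)), k)

fold_accuracies = []
for i in range(k):
    val_idx = folds[i]                                         # held-out fold
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    model = RandomForestClassifier(n_estimators=100)
    model.fit(XTrain[train_idx], yTrain[train_idx])
    preds = model.predict(XTrain[val_idx])
    fold_accuracies.append(metrics.accuracy_score(yTrain[val_idx], preds))

print ("Per-fold accuracies:", np.round(fold_accuracies, 3))
print ("Mean CV accuracy:", round(np.mean(fold_accuracies), 3))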

We can use k-fold cross-validation to optimize HPs. Say we are deciding whether to use 1, 3 or 5 nearest neighbours in our nearest-neighbours classifier. We can start by setting the n_neighbors HP in our classifier object to 1, running 10-fold CV, and getting a measurement of the model's performance. Repeating the process with the other HP values will lead to different levels of performance, and we can simply choose the n_neighbors value that worked best.
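As a quick sketch of what that looks like in code (reusing cross_val_score from sklearn.cross_validation, the module layout of the scikit-learn version used in this post; newer versions expose it in sklearn.model_selection):


In [ ]:
# Use 10-fold CV to compare candidate values of n_neighbors
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score

candidate_k = [1, 3, 5]
mean_cv_scores = []

for k in candidate_k:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, XTrain, yTrain, cv=10)
    mean_cv_scores.append(scores.mean())
    print ("n_neighbors =", k, "mean CV accuracy:", round(scores.mean(), 3))

print ("Best n_neighbors:", candidate_k[int(np.argmax(mean_cv_scores))])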

In the context of HP optimization, we perform k-fold cross validation together with grid search or randomized search to get a more robust estimate of the model performance associated with specific HP values.
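For example, the manual loop over n_estimators and max_features from earlier can be replaced with a grid search that runs 10-fold CV for every HP combination and keeps the best one. A rough sketch (GridSearchCV imported from the older sklearn.grid_search module to match the scikit-learn version used here; newer versions provide it in sklearn.model_selection):


In [ ]:
# Grid search over random forest hyperparameters with 10-fold CV
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

param_grid = {'n_estimators': [5, 100],
              'max_features': [10, 50]}

grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=10)
grid.fit(XTrain, yTrain)

print ("Best parameters:", grid.best_params_)
print ("Best mean CV accuracy:", round(grid.best_score_, 3))

# Only now do we touch the test set, once, with the refitted best model
test_predictions = grid.best_estimator_.predict(XTest)
print ("Test set accuracy:", round(metrics.accuracy_score(yTest, test_predictions), 3))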

Conclusion

In this post, we started with the motivation for tuning machine learning algorithms (i.e. nicer, bigger numbers in your models' performance reports!). We used different methods of searching over hyperparameter space, and evaluated candidate models at different points using k-fold cross validation. Tuned models were then run on the test set. Note that the concepts of training/test sets, cross-validation, and overfitting extend beyond the topic of tuning hyperparameters, though it is a nice application to demonstrate these ideas.

In this post, we tried to maximize the accuracy of our models, but there are problems for which you might want to maximize something else, like the model's specificity or sensitivity. For example, if we were doing medical diagnostics and trying to detect a deadly illness, it would be very bad to accidentally label a sick person as healthy (this would be called a "false negative" in the lingo). Maybe it's not so bad if we misclassify healthy people as sick ("false positives"), since in the worst case we would just annoy people by having them retake the diagnostic test. Hence, we might want our diagnostic model to be weighted towards optimizing sensitivity. Here is a good introduction to sensitivity and specificity which continues with the example of diagnostic tests.
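As a quick illustration, here is a sketch computing both quantities from the confusion matrix of our random forest's test set predictions (sensitivity is what scikit-learn's classification report calls recall for the positive class):


In [ ]:
# Sensitivity and specificity from the random forest's test predictions
from sklearn import metrics

# For labels (0, 1), the confusion matrix is laid out as [[TN, FP], [FN, TP]]
conf = metrics.confusion_matrix(yTest, rf_predictions)
TN, FP, FN, TP = conf[0, 0], conf[0, 1], conf[1, 0], conf[1, 1]

sensitivity = TP / float(TP + FN)   # proportion of spam correctly flagged
specificity = TN / float(TN + FP)   # proportion of real email correctly kept

print ("Sensitivity (recall on spam):", round(sensitivity, 3))
print ("Specificity:", round(specificity, 3))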

Arguably, in spam detection, it is worse to misclassify real email as spam (false positive) than to let a few spam emails pass through your filter (false negative) and show up in people's mailboxes. In this case, we might aim to maximize specificity. Of course, we cannot be so focused on improving the specificity of our classifier that we completely bomb our sensitivity. There is a natural trade-off between these quantities (see this primer on ROC curves), and part of our job as statistical modellers is to practice the dark art of deciding where to draw the line.
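To visualise that trade-off for our own classifier, here is a rough sketch of an ROC curve built from the random forest's predicted spam probabilities (assuming the rf model fitted earlier and matplotlib imported as plt):


In [ ]:
# ROC curve: the trade-off between true and false positive rates
import matplotlib.pyplot as plt
from sklearn import metrics

# Predicted probability of the positive (spam) class
spam_probs = rf.predict_proba(XTest)[:, 1]

fpr, tpr, thresholds = metrics.roc_curve(yTest, spam_probs)
auc = metrics.roc_auc_score(yTest, spam_probs)

plt.plot(fpr, tpr, c="blue", label="Random forest (AUC = %0.2f)" % auc)
plt.plot([0, 1], [0, 1], c="grey", linestyle="--", label="Chance")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend(loc="lower right")
plt.show()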

Sometimes there is no tuning to be done. For example, a Naive Bayes (NB) classifier just operates by calculating conditional probabilities, and there is no real hyperparameter optimization stage. NB is actually a very interesting algorithm that is famous for classifying text documents (and the spam dataset in particular), so if you have time, check out a great overview and Python implementation here. It's a "naive" classifier because it rests on the assumption that the features in your dataset are independent, which is often not strictly true. In our spam dataset, you can imagine that the occurrence of the strings "win", "money", and "!!!" is probably not independent. Despite this, NB often still does a decent job at classification tasks.
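As a quick taste, here is a sketch of running scikit-learn's Gaussian Naive Bayes on the same train/test split (GaussianNB is a reasonable default for these continuous frequency features; other NB variants exist and may behave differently):


In [ ]:
# Naive Bayes on the spam dataset - nothing to tune, just fit and predict
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

nb = GaussianNB()
nb.fit(XTrain, yTrain)

nb_predictions = nb.predict(XTest)
print ("Naive Bayes test accuracy:", round(metrics.accuracy_score(yTest, nb_predictions), 2))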

In our next post, we will take these different tuned models and build them up into an ensemble model to increase our predictive performance even more. Stay... tuned! Cue groans.