Scanning hyperspace: how to tune machine learning models


When doing machine learning using Python's scikit-learn library, we can often get reasonable model performance by using out-of-the-box settings. However, the payoff can be huge if you invest at least some time into tuning models to your specific problem and dataset. In the previous post, we explored the concepts of overfitting, cross-validation, and the bias-variance tradeoff. These ideas turn out to be central to doing a good job at optimizing the hyperparameters (roughly, the settings) of algorithms. In this post, we will explore the concepts behind hyperparameter optimization and demonstrate the process of tuning and training a random forest classifier.

You'll be working with the famous (well, machine learning famous!) wine dataset, which contains features of different quality wines, like the acidity and sugar content. Our goal is to tune and apply a random forest to these features in order to predict whether a given wine is nice or not.

The steps we'll cover in this blog post can be summarized as follows:

  1. Load and explore the wine dataset
  2. Train a baseline random forest classifier with default settings
  3. Tune hyperparameters, first by hand and then with grid search and k-fold cross-validation
  4. Evaluate the tuned model on the held-out test set

In the next two posts, you will learn about different strategies for model optimization and how to tune a support vector machine and logistic regression classifier. You will also find out how to take several different tuned models and combine them to build an ensemble model, which is a type of aggregated meta-model that often has higher accuracy and lower overfitting than its constituents.

Let's get cracking.

Loading and exploring the dataset

We start off by collecting the dataset. It can be found both online and in our GitHub repository, so we can also just fetch it via wget (note: make sure you first type pip install wget into your terminal since wget is not a preinstalled Python library). It will download a copy of the dataset to your current working directory.

In [16]:
import wget
import pandas as pd

# Import the dataset
data_url = ''
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=";")

# Take a peek at the first few columns of the data
first_5_columns = dataset.columns[0:5]
dataset[first_5_columns].head()

fixed acidity volatile acidity citric acid residual sugar chlorides
0 7.4 0.70 0.00 1.9 0.076
1 7.8 0.88 0.00 2.6 0.098
2 7.8 0.76 0.04 2.3 0.092
3 11.2 0.28 0.56 1.9 0.075
4 7.4 0.70 0.00 1.9 0.076

You can examine the dimensions of the dataset and the column names:

In [17]:
# Examine shape of dataset and the column names
print (dataset.shape)
print (dataset.columns.values)

(1599, 12)
['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol' 'quality']

So, it looks like you have a dozen columns to play with (11 features plus the quality score), and just under 1600 data points. Get some summary statistics on the features using describe():

In [30]:
# Summarise feature values
dataset[first_5_columns].describe()
fixed acidity volatile acidity citric acid residual sugar chlorides
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467
std 1.741096 0.179060 0.194801 1.409928 0.047065
min 4.600000 0.120000 0.000000 0.900000 0.012000
25% 7.100000 0.390000 0.090000 1.900000 0.070000
50% 7.900000 0.520000 0.260000 2.200000 0.079000
75% 9.200000 0.640000 0.420000 2.600000 0.090000
max 15.900000 1.580000 1.000000 15.500000 0.611000

The distribution of the outcome variable quality is a bit funky - the values are mostly 5 and 6 (how would you check this?). This could get a bit irksome later on, so go ahead and recode the quality scores into something more convenient. One idea would be to label wines as being high quality (e.g. if their score is 6 or higher) and low quality otherwise. You could encode this with a 1 representing high quality and 0 representing low quality, like so:

In [19]:
# using a lambda function to bin quality scores
dataset['quality_is_high'] = dataset.quality.apply(lambda x: 1 if x >= 6 else 0)
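To answer the parenthetical question above: with pandas you could simply run dataset.quality.value_counts(). Here is a minimal library-free sketch of the same idea, using a hypothetical handful of quality scores rather than the real data:

```python
from collections import Counter

# Hypothetical sample of wine quality scores (not the real dataset)
quality_scores = [5, 6, 5, 7, 6, 5, 4, 6, 5, 8]

# Tally how often each score occurs, like pandas' value_counts()
score_counts = Counter(quality_scores)
print(score_counts)

# Recode into a binary high/low quality label, as in the cell above
quality_is_high = [1 if score >= 6 else 0 for score in quality_scores]
print(quality_is_high)  # [0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
```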

If you're interested in getting to know the wine dataset further, check out Charles' previous post on using the plotly library to make interactive plots of the wine features here.

Now convert the pandas dataframe into a numpy array and isolate the outcome variable you'd like to predict ('quality_is_high'). This conversion is needed to feed the data into a machine learning pipeline:

In [31]:
import numpy as np

# Convert the dataframe to a numpy array and split the
# data into an input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-2].astype(float)
y = npArray[:,-1]

Next up, split the dataset into a training and test set. The training set will be used to develop and tune our predictive models. The test will be completely left alone until the very end, at which point you'll run your finished models on it. Having a test set will allow you to get a good estimate of how well your models would perform out in the wild on unseen data, which is what you're actually interested in when you model data (see previous post).

In [32]:
from sklearn.cross_validation import train_test_split

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

You are now going to try to predict high quality wine with a random forest classifier. Chapter 8 of the Introduction to Statistical Learning book provides a truly excellent introduction to the theory behind classification trees, bagged trees, and random forests. It's worth a read if you have time.

Briefly, random forests build a collection of classification trees, which each try to classify data points by recursively splitting the data on the features (and feature values) that separate the classes best. Each tree is trained on bootstrapped data, and each bifurcation point is only allowed to 'see' a subset of the available variables when deciding on the best split. So, an element of randomness is introduced when constructing each tree, which means that a variety of different trees are built. The random forest ensembles these base learners together, i.e. it combines these trees into an aggregated model. When making a new prediction, the individual trees each make their individual predictions, and the random forest surveys these opinions and accepts the majority position. This often leads to improved accuracy, generalizability, and stability in the predictions.
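The majority-vote step at the heart of a random forest is easy to sketch. Here is a minimal, library-free illustration with made-up per-tree predictions (not actual scikit-learn output):

```python
from collections import Counter

def majority_vote(per_tree_predictions):
    """Combine predictions from several trees: for each sample,
    take the class predicted by the most trees."""
    final = []
    # zip(*...) groups together each tree's prediction for the same sample
    for sample_votes in zip(*per_tree_predictions):
        winner = Counter(sample_votes).most_common(1)[0][0]
        final.append(winner)
    return final

# Three hypothetical trees, each predicting labels for four samples
tree_preds = [
    [1, 0, 1, 1],  # tree 1
    [1, 1, 0, 1],  # tree 2
    [0, 0, 1, 1],  # tree 3
]
print(majority_vote(tree_preds))  # [1, 0, 1, 1]
```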

Out of the box, scikit's random forest classifier performs reasonably well on the wine dataset:

In [190]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

rf = RandomForestClassifier()
rf.fit(XTrain, yTrain)

rf_predictions = rf.predict(XTest)

print (metrics.classification_report(yTest, rf_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, rf_predictions),4))

             precision    recall  f1-score   support

        0.0       0.74      0.85      0.79       188
        1.0       0.85      0.73      0.78       212

avg / total       0.80      0.79      0.79       400

('Overall Accuracy:', 0.7875)

We get an overall accuracy around the 0.77-0.79 mark (try running the code block a few times, or experiment with setting different seeds using random_state). Next up, you are going to learn how to pick the best values for the hyperparameters of the random forest algorithm in order to get better models with (hopefully!) even higher accuracy than this baseline.

Better modelling through hyperparameter optimization

We've glossed over what a hyperparameter actually is. Let's explore the topic now. Often, when setting out to train a machine learning algorithm on your dataset of interest, you must first specify a number of arguments or hyperparameters (HPs). An HP is just a variable that influences the performance of your model, but isn't directly tuned during model training. For example, when using the k-nearest neighbours algorithm to do classification (see these two previous posts), the value of k (the number of nearest neighbours the model considers) is a hyperparameter that must be supplied in advance. As another example, when building a neural network, the number of layers in the network and the number of neurons per layer are both hyperparameters that must be specified before training commences. By contrast, the weights and biases in a neural network are parameters (not hyperparameters) because they are explicitly tuned during training.
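To make the parameter/hyperparameter distinction concrete, here is a toy sketch (not part of the wine analysis): fitting y ≈ w·x by gradient descent. The weight w is a parameter, updated automatically during training; the learning rate and number of iterations are hyperparameters, fixed in advance.

```python
# Toy data generated from y = 2x; the model must learn w ~= 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Hyperparameters: chosen before training, never updated by it
learning_rate = 0.01
n_iterations = 500

# Parameter: tuned during training by gradient descent
w = 0.0
for _ in range(n_iterations):
    # Gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 3))  # converges close to 2.0
```

Changing the hyperparameters (say, a much smaller learning_rate or fewer iterations) changes how well the parameter w is learned, which is exactly why HPs are worth tuning.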

It turns out that scikit-learn generally provides reasonable hyperparameter default values, such that it is possible to quickly build, for example, a kNN classifier by simply typing KNeighborsClassifier() and then fitting it to your data. You can check the documentation to find out which hyperparameter values a classifier assumes by default, but you can also examine a model directly using get_params:

In [35]:
from sklearn.neighbors import KNeighborsClassifier

# Create a default kNN classifer and print params
knn_default = KNeighborsClassifier()
print (knn_default.get_params)

<bound method KNeighborsClassifier.get_params of KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')>

So you see that the default kNN classifier has the number of nearest neighbours it considers set to 5 (n_neighbors=5) and gives all datapoints equal importance (weights=uniform), and so on.

Often, the default hyperparameter values will do a decent job (as we saw above with the random forest example), so it may be tempting to skip the topic of model tuning completely. However, it is basically always a good idea to do some level of hyperparameter optimization, due to the potential for substantial improvements in your learning algorithm's performance.

But how do you know what values to set the hyperparameters to in order to get the best performance from your learning algorithms?

You optimize hyperparameters in exactly the way that you might expect - you try different values and see what works best. However, some care is needed when deciding how exactly to measure if certain values work well, and which strategy to use to systematically explore hyperparameter space. In a later post, we will introduce model ensembling, in which individual models can be considered 'hyper-hyper parameters' (™; ©; ®; patent pending; T-shirts printing).

Tuning your random forest

In order to build the best possible model that does a good job at describing the underlying trends in a dataset, we need to pick the right HP values. As we mentioned above, HPs are not optimised while an algorithm is learning. Hence, we need other strategies to optimise them. The most basic way to do this would be just to test different possible values for the HPs and see how the model performs.

In a random forest, some hyperparameters we can optimise are n_estimators and max_features. n_estimators controls the number of trees in the forest - the more the better (with diminishing returns), but more trees come at the expense of longer training time. max_features controls the size of the random selection of features the algorithm is allowed to consider when splitting a node. Larger values help if the individual predictors aren't that great. Smaller values can be helpful if the features in the dataset are decent and/or highly correlated.

Let's try out some random HP values.

In [167]:
# manually specifying some HP values
parameter_combinations = [
    {"n_estimators": 2, "max_features": 10},  # parameter combination 1...
    {"n_estimators": 5, "max_features": 3},   # 2
    {"n_estimators": 9, "max_features": 7}    # ...and 3
]

We can manually write a small loop to test out how well the different combinations of these potential HP values fare (later, we'll find out better ways to do this):

In [193]:
# test out different HP combinations
for hp_combo in parameter_combinations:
    # Train and output accuracies
    rf = RandomForestClassifier(n_estimators=hp_combo["n_estimators"],
                                max_features=hp_combo["max_features"])
    rf.fit(XTrain, yTrain)
    RF_predictions = rf.predict(XTest)
    print ('When n_estimators is {} and max_features is {}, test set accuracy is {}'.format(
            hp_combo["n_estimators"], hp_combo["max_features"],
            round(metrics.accuracy_score(yTest, RF_predictions), 4)))

When n_estimators is 2 and max_features is 10, test set accuracy is 0.72
When n_estimators is 5 and max_features is 3, test set accuracy is 0.755
When n_estimators is 9 and max_features is 7, test set accuracy is 0.785

Looks like the last combination of HPs might be doing better. However, manually searching for the best HPs in this way is inefficient, somewhat haphazard and liable to miss good combinations, and could lead to models that perform well on this specific dataset but do not generalise well to new data. This phenomenon of building models that fit too closely to the particular dataset, and hence do not generalise well, is called overfitting.

Here, you trained different models on the training dataset using manually selected HP values, and then evaluated each one on the test dataset. This is not as bad as training a model and evaluating it on the training set, but it is still bad - since you repeatedly evaluated on the test dataset, knowledge of the test set can leak into the model building phase. You are at risk of inadvertently learning something about the test set, and hence are susceptible to overfitting.

k-fold cross validation for hyperparameter tuning

So, you have to be careful not to overfit to your data. But wait, didn't we also say that the test set is not meant to be touched until you are completely done training your model? How are you meant to optimize your hyperparameters then?

Enter k-fold cross-validation, which is a handy technique for measuring a model's performance using only the training set. k-fold CV is a general method (see an explanation here), and is not specific to hyperparameter optimization, but is very useful for that purpose. We simply try out different HP values, get several different estimates of model performance for each HP value (or combination of HP values), and choose the model with the lowest CV error. The process looks like this:
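To see what k-fold CV is doing under the hood, here is a minimal library-free sketch of how the fold indices are generated (scikit-learn's KFold handles this, plus shuffling and more, for you):

```python
def kfold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k (train, validation) pairs.
    Each sample appears in exactly one validation fold."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in val]
        folds.append((train, val))
        start += size
    return folds

# 10 samples, 5 folds: each validation fold holds 2 samples
for train_idx, val_idx in kfold_indices(10, 5):
    print("train:", train_idx, "validate:", val_idx)
```

For each candidate HP combination, you would fit a model on each train split, score it on the corresponding validation split, and average the k scores - all without ever touching the test set.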

In the context of HP optimization, we perform k-fold cross validation together with grid search or randomized search to get a more robust estimate of the model performance associated with specific HP values.

Traditionally, and perhaps most intuitively, scanning for good HP values can be done with a grid search (also called a parameter sweep). This strategy exhaustively searches through manually prespecified HP values and reports the best option. It is common to try to optimize multiple HPs simultaneously - grid search tries each combination of HPs in turn, hence the name. This is a more convenient and complete way of searching through hyperparameter space than manually specifying combinations.
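The 'grid' in grid search is just the Cartesian product of the candidate values for each HP. A quick sketch with itertools, using the same candidate values we'll hand to scikit-learn shortly:

```python
import itertools

# Candidate values for each hyperparameter
hyperparameters = {'n_estimators': [10, 30, 50], 'max_features': [2, 5, 8]}

# Grid search evaluates every combination: 3 x 3 = 9 candidate models
names = sorted(hyperparameters)
grid = [dict(zip(names, combo))
        for combo in itertools.product(*(hyperparameters[n] for n in names))]

print(len(grid))  # 9
print(grid[0])    # {'max_features': 2, 'n_estimators': 10}
```

This also shows why grid search gets expensive quickly: the number of models to fit is the product of the number of candidate values per HP (times k, once cross-validation is involved).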

The combination of grid search and k-fold cross validation is very popular for finding models with good performance and generalisability. So, in HP optimisation we are actually trying to do two things: (i) find the best possible combination of HPs that defines a model and (ii) make sure that this choice generalises well to new data. To address the second concern, CV is often the method of choice. Scikit-learn makes this process very easy and slick, and even supports parallelising the search (via the n_jobs argument).

You use grid search to tune a random forest like this:

In [178]:
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV

# Search for good hyperparameter values
# Specify values to grid search over
n_estimators = list(np.arange(10, 60, 20))
max_features  = list(np.arange(2, X.shape[1], 3))

hyperparameters   = {'n_estimators': n_estimators, 
                     'max_features': max_features}

print (hyperparameters)

{'n_estimators': [10, 30, 50], 'max_features': [2, 5, 8]}

In [196]:
# Grid search using cross-validation
gridCV = GridSearchCV(RandomForestClassifier(), param_grid=hyperparameters, cv=10, n_jobs=4)
gridCV.fit(XTrain, yTrain)

# Identify optimal hyperparameter values
best_n_estim      = gridCV.best_params_['n_estimators']
best_max_features = gridCV.best_params_['max_features']  

print("The best performing n_estimators value is: {:5.1f}".format(best_n_estim))
print("The best performing max_features value is: {:5.1f}".format(best_max_features))

# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from gridCV.best_estimator_
rf = RandomForestClassifier(n_estimators=best_n_estim,
                            max_features=best_max_features), yTrain)
RF_predictions = rf.predict(XTest)

print (metrics.classification_report(yTest, RF_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))

The best performing n_estimators value is:  50.0
The best performing max_features value is:   5.0
             precision    recall  f1-score   support

        0.0       0.79      0.83      0.81       188
        1.0       0.84      0.80      0.82       212

avg / total       0.82      0.81      0.82       400

('Overall Accuracy:', 0.81)

We now get ~0.81 accuracy. This is quite a bit better than the poorly parameterized models, which yielded ~0.7 accuracy, and is a noticeable improvement over the default 0.77-0.79 range. Importantly, it was not too difficult to get this boost. How could you try to improve on this result?

Note that grid search with k-fold CV simply returns the best HP values out of the available options, and is therefore not guaranteed to return a global optimum. It makes sense to choose a diverse collection of possible values that is somewhat centred around an empirically sensible default.

You tuned your random forest classifier!

So, that was an overview of the concepts and practicalities involved in tuning a random forest classifier. We could also choose to tune various other hyperparameters, like max_depth (the maximum depth of a tree, which controls how tall we grow our trees and influences overfitting) and the choice of purity criterion (the specific formula for calculating how good or 'pure' our chosen splits are, as judged by how well they separate the classes in our dataset). The two HPs we chose to tune are regarded as the most important. Have a look at tuning more than just the n_estimators and max_features HPs and see what happens.

Quick quiz:

  1. How do you think that altering the n_estimators and max_depth HPs would affect the bias and variance of the random forest classifier?

  2. It is interesting that the random forest performs better with quite low values of max_features on this dataset. What do you think this says about the features in the dataset?

  3. Try max_features=1. What does this force the trees in the random forest to do?

  4. To get more of an intuition of how random forests operate, play around with printing the importance of the features with print (rf.feature_importances_) under different conditions and experiment with setting low values of max_depth (e.g. max_depth=1).


In this post, we started with the motivation for tuning machine learning algorithms (i.e. nicer, bigger numbers in your models' performance reports!). You evaluated different candidate models by simple trial and error, as well as by using k-fold cross validation. You then ran your tuned models on the test set in order to predict wine quality from a set of chemical characteristics.

In this post, you were keeping an eye on the accuracy of models in order to optimize hyperparameters, but there are problems for which you might want to maximize something else, like the model's specificity or sensitivity. For example, if you were doing medical diagnostics and trying to detect a deadly illness, it would be very bad to accidentally label a sick person as healthy (this would be called a "false negative" in the classification lingo). Maybe it's not so bad if you misclassify healthy people as sick people ("false positive"), since in the worst case you would just annoy people by having them retake the diagnostic test. Hence, you might want your diagnostic model to be weighted towards optimizing sensitivity. Here is a good introduction to sensitivity and specificity which continues with the example of diagnostic tests.

On the other hand, if you were doing spam email detection, it would arguably be worse to misclassify real email as spam (false positive) than to let a few spam emails pass through your filter (false negative) and show up in people's mailboxes. In this case, you might aim to maximize specificity. Of course, you cannot be so focused on improving the specificity of your classifier that you completely tank your sensitivity. There is a natural trade-off between these quantities (see this primer on ROC curves), and part of our job as statistical modellers is to practice the dark art of deciding where to draw the line.
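Sensitivity and specificity fall straight out of the confusion matrix counts. A minimal sketch, with made-up counts for a hypothetical diagnostic test:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity: fraction of actual positives correctly flagged.
    Specificity: fraction of actual negatives correctly cleared."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical diagnostic test results: 90 true positives, 10 false
# negatives, 80 true negatives, 20 false positives
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(sens, spec)  # 0.9 0.8
```

In scikit-learn you could steer a grid search towards one of these quantities by passing a different scoring argument to GridSearchCV instead of relying on the default accuracy.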

Sometimes there is no model tuning to be done. For example, a Naive Bayes (NB) classifier just operates by calculating conditional probabilities, and there is no real hyperparameter optimization stage. NB is actually a very interesting algorithm that is famous for classifying text documents (and the spam dataset in particular), so if you have time, check out a great overview and Python implementation here. It's a "naive" classifier because it rests on the assumption that the features in your dataset are independent, which is often not strictly true. In the spam dataset, you can imagine that the occurrence of the strings "win", "money", and "!!!" is probably not independent. Despite this, NB often still does a decent job at classification tasks.
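To see why the 'naive' independence assumption makes the arithmetic so easy, here is a toy sketch of the core NB calculation (the word probabilities and priors are made-up numbers, not a real spam model): under independence, the score for each class is just the prior times the product of per-word probabilities.

```python
# Made-up per-word probabilities for each class (toy numbers)
p_word_given_spam = {'win': 0.6, 'money': 0.5, '!!!': 0.4}
p_word_given_ham  = {'win': 0.1, 'money': 0.2, '!!!': 0.05}
p_spam, p_ham = 0.3, 0.7  # prior class probabilities

email = ['win', 'money', '!!!']

# Naive independence assumption: multiply the per-word likelihoods
score_spam = p_spam
score_ham = p_ham
for word in email:
    score_spam *= p_word_given_spam[word]
    score_ham *= p_word_given_ham[word]

prediction = 'spam' if score_spam > score_ham else 'ham'
print(prediction)  # 'spam'
```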

In our next post, we will explore different ways to tune models and optimise a support vector machine and logistic regression classifier. Stay... tuned! Cue groans.
