When doing machine learning using Python's scikit-learn library, we can often get reasonable performance out of the algorithms just by using the default settings for hyperparameters. However, it is a much better idea to do at least some level of tuning of an algorithm to your specific problem and dataset. In this post, we will cover the concepts behind hyperparameter optimization and cross-validation, and demonstrate the process of tuning and training three algorithms - a random forest, a support vector machine and a basic feedforward neural network.
You'll be working with the famous (well, machine learning famous!) spam dataset, which contains loads of NLP-mined features of spam and non-spam emails. The goal is to apply different algorithms to these features in order to predict whether a given email is spam. The overall process looks like this: load and examine the data, split it into training and test sets, use grid search with cross-validation to tune each algorithm's hyperparameters, and evaluate the tuned models on the held-out test set.
In the next blog post, you will learn how to take the different tuned machine learning algorithms and combine them to build an ensemble model, which is a type of aggregated, hybrid model that often has higher accuracy than its individual constituents.
The topics we'll cover in this post are: what hyperparameters are and why they need tuning, how k-fold cross-validation lets us measure performance using only the training set, how grid search and randomized search explore hyperparameter space, and how to put all of this together to tune and evaluate our models on the spam dataset.
Let's get cracking.
Often, when setting out to train a machine learning algorithm on your dataset of interest, you must first specify a number of arguments or hyperparameters. A hyperparameter is just a variable that influences the performance of your model, but isn't directly tuned during model training. For example, when building a neural network, the number of layers in the network and the number of neurons per layer are both hyperparameters that must be specified before training commences. By contrast, the weights and biases in a neural network are parameters (not hyperparameters) because they are explicitly tuned during training. As another example, when using the k-nearest neighbours algorithm to do classification, the value of k (the number of nearest neighbours the model considers) is a hyperparameter that must be supplied in advance. But how can we know what values to set the hyperparameters to in order to get the best performance from our learning algorithms?
Actually, scikit-learn generally provides reasonable hyperparameter default values, such that it is possible to quickly build, say, a kNN classifier by simply typing clfr = sklearn.neighbors.KNeighborsClassifier() and then fitting it to your data. Behind the scenes, we can see the hyperparameter values that the classifier has automatically assumed, such as setting the number of nearest neighbours to 5 (n_neighbors=5), giving all datapoints equal importance (weights='uniform'), and so on. Often, the default hyperparameter values will do a decent job, so it may be tempting to skip the topic of model tuning completely. However, it is basically always a good idea to do at least some level of hyperparameter optimization, due to the potential for substantial improvements in a learning algorithm's performance.
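For example, we can list all of the assumed default values in one go with the estimator's get_params() method (a quick sketch; the exact defaults depend on your scikit-learn version):
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
# Build a kNN classifier with all-default hyperparameters and inspect them
clfr = KNeighborsClassifier()
print(clfr.get_params())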
We optimize hyperparameters in exactly the way that you might expect - we try different values and see what works best. However, some care is needed when deciding how we will measure if certain values work well, and which strategy to use to systematically explore hyperparameter space.
When we train machine learning algorithms on a dataset, what we are really interested in is how our model will perform on an independent data set. It is not enough to predict whether emails in our training set are spam - how well would the model fare when predicting if a completely new, previously unseen datapoint is spam or not?
This is the idea behind splitting your dataset into a training set (on which models can be trained) and a test set (which is held out until the very end of your analysis, and provides an accurate measure of model performance). Essentially, we are only interested in building models that are generalizable - getting 100% accuracy on the training set is not impressive, and is simply an indicator of overfitting. Overfitting is the situation in which we have fitted our model too closely to the data, and have tuned to the noise instead of just to the signal.
In fact, this concept of wanting to fit algorithms to the training data well, but not so tightly that the model doesn't generalize, is a pervasive problem in machine learning. A common (but perhaps not enormously helpful) term for this balancing act is the bias-variance trade-off.
[Figure: overfitting, underfitting and the bias-variance trade-off]
Hence, we never try to optimize the model's performance on the training data - it is a misguided and fruitless endeavour. But wait, didn't we also say that the test set is not meant to be touched until we are completely done training our model? How are we meant to optimize our hyperparameters then?
Enter k-fold cross-validation, which is a handy technique for measuring a model's performance using only the training set. k-fold CV is a general method, and is not specific to hyperparameter optimization, but it is very useful for that purpose. Say that we want to do 10-fold cross-validation. The process is as follows: we randomly partition the training set into 10 equal sections. We then train an algorithm on 9/10ths (i.e. 9 out of the 10 sections) of that training set and evaluate its performance on the remaining section. This gives us some measure of the model's performance (e.g. accuracy). Next, we train the algorithm on a different 9/10ths of the training set and evaluate on the section that was held out this time. We repeat this process 10 times, get 10 different measures of model performance, and average these values to get an overall measure of performance. The process looks like this:
[k-fold CV illustration]
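As a minimal sketch of what this looks like in scikit-learn - using a small synthetic dataset here (X_demo, y_demo), since we only load the spam data further down - cross_val_score handles the partitioning, training and scoring for us:
In [ ]:
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# A small synthetic classification problem, purely for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
# 10-fold CV: train on 9/10ths of the data, score on the held-out tenth, ten times over
scores = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=10)
print(scores)          # ten separate accuracy estimates
print(scores.mean())   # the overall performance estimate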
We can use k-fold cross-validation to optimize hyperparameters. Say we are deciding whether to use 1, 3 or 5 nearest neighbours in our nearest-neighbours classifier. We can start by setting the n_neighbors hyperparameter in our classifier object to 1, running 10-fold CV, and getting a measurement of the model's performance. Repeating the process with the other hyperparameter values will lead to different measures of performance, and we choose the n_neighbors value that worked best.
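In code, this amounts to a small loop over the candidate values (a sketch, reusing the synthetic X_demo and y_demo from the cross-validation example above):
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
# Cross-validate a kNN classifier for each candidate number of neighbours
for k in [1, 3, 5]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=10)
    print(k, round(scores.mean(), 3))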
In the above example, we were actually doing a grid search (also called a parameter sweep), which exhaustively searches through a set of manually prespecified hyperparameter values, calculates model performance for each value using k-fold cross-validation, and reports the best option. It is common to try to optimize multiple hyperparameters simultaneously - grid search tries each combination in turn, hence the name.
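To make "each combination" concrete, scikit-learn's ParameterGrid utility enumerates exactly the settings that a grid search over, say, two kNN hyperparameters would visit (a small sketch):
In [ ]:
from sklearn.model_selection import ParameterGrid
# Two hyperparameters with 3 and 2 candidate values give 3 x 2 = 6 combinations
for params in ParameterGrid({'n_neighbors': [1, 3, 5],
                             'weights': ['uniform', 'distance']}):
    print(params)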
Note that grid search with k-fold CV simply returns the best hyperparameter values out of the available options, and is therefore not guaranteed to find a global optimum. It makes sense to choose a reasonably diverse collection of candidate values that is roughly centred around an empirically sensible default.
Another way to search through hyperparameter space is using randomized search.
From Wikipedia: "Since grid searching is an exhaustive and therefore potentially expensive method, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search."
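In scikit-learn, this is available as RandomizedSearchCV: instead of a grid of values, you supply distributions to sample from and a budget of candidate settings to try. A small sketch, again on the synthetic kNN example from above:
In [ ]:
from scipy.stats import randint
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
# Sample 10 candidate settings from the distribution instead of sweeping a full grid
random_search = RandomizedSearchCV(KNeighborsClassifier(),
                                   param_distributions={'n_neighbors': randint(1, 30)},
                                   n_iter=10, cv=10, random_state=1)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)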
In [5]:
import pandas as pd
import numpy as np
import wget
# Import the dataset
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data'
columns_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names'  # companion file describing the features
dataset = wget.download(data_url)
columns = wget.download(columns_url)
dataset = pd.read_csv(dataset, sep=",")
# Take a peek at the data
dataset.head()
# Reorder the data columns and drop email_id
cols = dataset.columns.tolist()
cols = cols[2:] + [cols[1]]
dataset = dataset[cols]
# Examine shape of dataset and some column names
print(dataset.shape)
print(dataset.columns.values[0:10])
# Summarise feature values
dataset.describe()
# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]
In [7]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
# Scale and split dataset
X_scaled = preprocessing.scale(X)
# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X_scaled, y, random_state=1)
The main hyperparameters to adjust for random forests are n_estimators and max_features. n_estimators controls the number of trees in the forest - the more the better, but more trees come at the expense of longer training time. max_features controls the size of the random selection of features the algorithm is allowed to consider when splitting a node.
We could also choose to tune various other hyperparameters, like max_depth (the maximum depth of a tree, which controls how tall we grow our trees and influences overfitting) and the choice of the purity criterion (the formula used to calculate how good or 'pure' our splits make the resulting nodes).
We could have also tuned our hyperparameters using randomized search, which samples some values from a distribution rather than trying out all given values. Either is probably fine.
The following code block takes about a minute to run.
In [6]:
from sklearn import metrics
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Search for good hyperparameter values
# Specify values to grid search over
n_estimators = np.arange(1, 30, 5)
max_features = np.arange(1, X.shape[1], 10)
max_depth = np.arange(1, 100, 10)
hyperparameters = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth}
# Grid search using cross-validation
gridCV = GridSearchCV(RandomForestClassifier(), param_grid=hyperparameters, cv=10, n_jobs=4)
gridCV.fit(XTrain, yTrain)
# Identify optimal hyperparameter values
best_n_estim = gridCV.best_params_['n_estimators']
best_max_features = gridCV.best_params_['max_features']
best_max_depth = gridCV.best_params_['max_depth']
# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from gridCV.best_estimator_
clfRDF = RandomForestClassifier(n_estimators=best_n_estim, max_features=best_max_features, max_depth=best_max_depth)
clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)
print (metrics.classification_report(yTest, RF_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))
93-95% accuracy, not too shabby! Have a look and see how random forests with suboptimal hyperparameters fare; we got around 91-92% accuracy with out-of-the-box (untuned) random forests, which actually isn't terrible.
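For comparison, checking an out-of-the-box forest only takes a couple of lines (a quick sketch, reusing the imports and train/test split from above):
In [ ]:
# Untuned random forest, using scikit-learn's default hyperparameters
untunedRDF = RandomForestClassifier(random_state=1)
untunedRDF.fit(XTrain, yTrain)
untuned_predictions = untunedRDF.predict(XTest)
print("Untuned Accuracy:", round(metrics.accuracy_score(yTest, untuned_predictions), 2))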
Let's train our second algorithm, a support vector machine (SVM), to do the same prediction task. A great introduction to the theory behind SVMs can be read here. Briefly, SVMs search for hyperplanes in the feature space which best divide the different classes in your dataset. Crucially, SVMs can find non-linear decision boundaries between classes using a process called kernelling, which projects the data into a higher-dimensional space. This sounds a bit abstract, but if you've ever fit a linear regression to power-transformed variables (e.g. using x^2, x^3 as features), you're already familiar with the concept.
SVMs can use different types of kernel, like the Gaussian (radial basis function) kernel, to throw the data into a different space. The main hyperparameters we must tune for SVMs are gamma (a kernel parameter, controlling how far we 'throw' the data into the new feature space) and C (which controls the bias-variance trade-off of the model).
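As a quick illustration of the kernel choice, we can cross-validate an SVM with a linear kernel and one with an RBF kernel on the training set and see how they compare (a sketch, leaving C and gamma at their defaults):
In [ ]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Compare a linear decision boundary with a non-linear (RBF) one, both untuned
for kernel in ['linear', 'rbf']:
    scores = cross_val_score(SVC(kernel=kernel), XTrain, yTrain, cv=10)
    print(kernel, round(scores.mean(), 3))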
In [3]:
from sklearn.svm import SVC
# Search for good hyperparameter values
# Specify values to grid search over
g_range = 2. ** np.arange(-15, 5, step=2)
C_range = 2. ** np.arange(-5, 15, step=2)
hyperparameters = [{'gamma': g_range,
                    'C': C_range}]
# Grid search using cross-validation
grid = GridSearchCV(SVC(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)
bestG = grid.best_params_['gamma']
bestC = grid.best_params_['C']
# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=bestC, gamma=bestG)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)
print(metrics.classification_report(yTest, SVM_predictions))
print("Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions), 2))
How does this compare to an untuned classifier? What about one with especially badly tuned hyperparameters?
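One quick way to check is sketched in the cell below: train an SVM with scikit-learn's default settings, and one with deliberately poor values of C and gamma (taken from the extremes of the ranges we searched over), and compare their test accuracies.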
In [ ]:
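# Untuned SVM (default C and gamma) vs. a deliberately badly tuned one
for label, model in [("default", SVC(kernel='rbf')),
                     ("badly tuned", SVC(kernel='rbf', C=2.**-5, gamma=2.**3))]:
    model.fit(XTrain, yTrain)
    predictions = model.predict(XTest)
    print(label, "accuracy:", round(metrics.accuracy_score(yTest, predictions), 2))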