When doing machine learning with Python's scikit-learn library, we can often get reasonable performance by using out-of-the-box algorithms with default values for their hyperparameters. However, it is a much better idea to do at least some level of tuning of an algorithm to your specific problem and dataset. In this post, we will explore the concepts behind hyperparameter optimization and cross-validation, and demonstrate the process of tuning and training three algorithms - a random forest, a support vector machine and a logistic regression classifier.
You'll be working with the famous (well, machine learning famous!) spam dataset, which contains loads of NLP-mined features of spam and non-spam emails, like the frequencies of the words "money", "free" and "viagra". The goal is to apply different algorithms to these features in order to predict whether a given email is spam. The overall process looks like this:
In the next blog post, you will learn how to take different tuned machine learning algorithms and combine them to build an ensemble model, which is a type of aggregated model that often has higher accuracy than its constituent algorithms.
The topics we'll cover in this post are:
Let's get cracking.
Often, when setting out to train a machine learning algorithm on your dataset of interest, you must first specify a number of arguments or hyperparameters. A hyperparameter is just a variable that influences the performance of your model, but isn't directly tuned during model training. For example, when building a neural network, the number of layers in the network and the number of neurons per layer are both hyperparameters that must be specified before training commences. By contrast, the weights and biases in a neural network are parameters (not hyperparameters) because they are explicitly tuned during training. As another example, when using the k-nearest neighbours algorithm to do classification, the value of k (the number of nearest neighbours the model considers) is a hyperparameter that must be supplied in advance. But how can we know what values to set the hyperparameters to in order to get the best performance from our learning algorithms?
Actually, scikit-learn generally provides reasonable hyperparameter default values, such that it is possible to quickly build, say, a kNN classifier by simply typing clfr = sklearn.neighbors.KNeighborsClassifier() and then fitting it to your data. Behind the scenes, we can see the hyperparameter values that the classifier has automatically assumed, such as setting the number of nearest neighbours to 5 (n_neighbors=5), giving all datapoints equal importance (weights='uniform'), and so on. Often, the default hyperparameter values will do a decent job, so it may be tempting to skip the topic of model tuning completely. However, it is basically always a good idea to do at least some level of hyperparameter optimization, due to the potential for substantial improvements in a learning algorithm's performance.
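As a quick illustration (a minimal sketch using the kNN classifier from the example above), we can inspect the default hyperparameter values that a freshly built model has assumed:

from sklearn.neighbors import KNeighborsClassifier

# Build a kNN classifier without specifying any hyperparameters
clfr = KNeighborsClassifier()

# Inspect the default hyperparameter values it has assumed
print clfr.get_params()
# e.g. {'n_neighbors': 5, 'weights': 'uniform', ...}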
We optimize hyperparameters in exactly the way that you might expect - we try different values and see what works best. However, some care is needed when deciding how we will measure if certain values work well, and which strategy to use to systematically explore hyperparameter space.
When we train machine learning algorithms on a dataset, what we are really interested in is how our model will perform on an independent data set. It is not enough to predict whether emails in our training set are spam - how well would the model fare when predicting if a completely new, previously unseen datapoint is spam or not?
This is the idea behind splitting your dataset into a training set (on which models can be trained) and a test set (which is held out until the very end of your analysis, and provides an accurate measure of model performance). Essentially, we are only interested in building models that are generalizable - getting 100% accuracy on the training set is not impressive, and is simply an indicator of overfitting. Overfitting is the situation in which we have fitted our model too closely to the data, and have tuned to the noise instead of just to the signal.
In fact, this concept of wanting to fit algorithms to the training data well, but not so tightly that the model doesn't generalize, is a pervasive problem in machine learning. A common (but perhaps not enormously helpful) term for this balancing act is the bias-variance trade-off.
Hence, we never try to optimize the model's performance on the training data because it is a misguided and fruitless endeavour. But wait, didn't we also say that the test set is not meant to be touched until we are completely done training our model? How are we meant to optimize our hyperparameters then?
Enter k-fold cross-validation, which is a handy technique for measuring a model's performance using only the training set. k-fold CV is a general method, and is not specific to hyperparameter optimization, but is very useful for that purpose. Say that we want to do e.g. 10-fold cross-validation. The process is as follows: we randomly partition the training set into 10 equal sections. Then, we train an algorithm on 9/10ths (i.e. 9 out of the 10 sections) of that training set. We then evaluate its performance on the remaining 1 section. This gives us some measure of the model's performance (e.g. accuracy). We then train the algorithm on a different 9/10ths of the training set, and evaluate on the other (different from before) remaining 1 section. We continue the process 10 times, get 10 different measures of model performance, and average these values to get an overall measure of performance. Of course, we could have chosen some number other than 10. To keep on with the example, the process behind 10-fold CV looks like this:
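To make the splitting concrete, here is a minimal sketch using scikit-learn's KFold helper on 100 hypothetical datapoints (only the fold indices matter here):

from sklearn.cross_validation import KFold

# Partition 100 datapoints into 10 folds; each iteration holds out a
# different tenth of the data for evaluation
kf = KFold(100, n_folds=10, shuffle=True, random_state=1)

for train_indices, validation_indices in kf:
    print len(train_indices), len(validation_indices)   # prints "90 10" ten times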
We can use k-fold cross-validation to optimize hyperparameters. Say we are deciding whether to use 1, 3 or 5 nearest neighbours in our nearest-neighbours classifier. We can start by setting the n_neighbors hyperparameter in our classifier object to 1, running 10-fold CV, and getting a measurement of the model's performance. Repeating the process with the other hyperparameter values will lead to different measures of performance, and we choose the n_neighbors value that worked best.
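In code, this might look something like the following sketch, which assumes a training set XTrain, yTrain like the one we build later in this post:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score

# Evaluate each candidate number of neighbours with 10-fold CV
# on the training data, then pick whichever value scores best
for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, XTrain, yTrain, cv=10)
    print k, scores.mean()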
In the above example, we were actually doing a grid search (also called a parameter sweep), which exhaustively searches through some manually prespecified hyperparameter values, calculates model performance using k-fold cross-validation (CV) for each value, and reports the best option. It is common to try to optimize multiple hyperparameters simultaneously - grid search tries each combination of values in turn, hence the name "grid".
Note that grid search with k-fold CV simply returns the best hyperparameter values out of the available options, and is therefore certainly not guaranteed to return a global optimum. It makes sense to choose a collection of possible values that is somewhat centred around an empirically sensible default. With enough values in the grid, you are likely to find a near-optimum, but this can be computationally expensive. Grid search could be a good option if your algorithm can be trained and tested very quickly, but in general there is a trade-off between searching hyperparameter space more thoroughly and computational time.
Grid search is quite commonly used, but another way to search through hyperparameter space and find good values is randomized search. In randomized search, we sample hyperparameter values a specified number of times from distributions that we prespecify. There is evidence that randomized search is more efficient than grid search, because not all hyperparameters are equally important to tune, and grid search wastes time by exhaustively checking every option along each dimension when it might not be necessary. By contrast, random experiments explore the important dimensions of hyperparameter space with more coverage, while not devoting too many trials to the dimensions that matter less.
We start off by collecting the dataset. It can be found in the github repository, and we download it via wget (note: make sure you pip install wget, as it is not a preinstalled Python library). This will download a copy of the dataset to your current working directory.
We often want our input data to be a matrix (X) and the vector of instance labels as a separate vector (y).
In [1]:
import pandas as pd
import numpy as np
import wget
# Import the dataset
data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/spam/spam_dataset.csv'
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=",")
# Take a peek at the data
dataset.head()
Out[1]:
In [5]:
# Examine shape of dataset and the column names
print dataset.shape
print dataset.columns.values
# Summarise feature values
dataset.describe()
Out[5]:
In [11]:
# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]
In [12]:
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
# Split the dataset into training and test sets
# (scaling the features first is optional; uncomment to standardize them)
# X_scaled = preprocessing.scale(X)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)
The main hyperparameters to adjust for random forests are n_estimators and max_features. n_estimators controls the number of trees in the forest - the more the better, but more trees come at the expense of longer training time. max_features controls the size of the random selection of features the algorithm is allowed to consider when splitting a node.
We could also choose to tune various other hyperparameters, like max_depth (the maximum depth of a tree, which controls how tall we grow our trees and influences overfitting) and the choice of the purity criterion (the specific formula for calculating how good or 'pure' our splits make the terminal nodes).
In [18]:
from sklearn import metrics
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# Search for good hyperparameter values
# Specify values to grid search over
n_estimators = np.arange(10, 30, 5)
max_features = np.arange(1, X.shape[1], 10)
hyperparameters = {'n_estimators': n_estimators,
'max_features': max_features}
print hyperparameters
In [19]:
# Grid search using cross-validation
gridCV = GridSearchCV(RandomForestClassifier(), param_grid=hyperparameters, cv=10, n_jobs=4)
gridCV.fit(XTrain, yTrain)
Out[19]:
In [21]:
# Identify optimal hyperparameter values
best_n_estim = gridCV.best_params_['n_estimators']
best_max_features = gridCV.best_params_['max_features']
# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from gridCV.best_estimator_
clfRDF = RandomForestClassifier(
n_estimators=best_n_estim,
max_features=best_max_features
)
clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)
print (metrics.classification_report(yTest, RF_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))
We could also have used randomized search to tune the hyperparameters. The difference is that we sample hyperparameter values from prespecified distributions rather than searching over a fixed grid of values, which would have looked like this:
In [38]:
from scipy.stats import randint as sp_randint
from sklearn.grid_search import RandomizedSearchCV
# Designate distributions to sample hyperparameters from
# (n_estimators and max_features must be integers, so we sample
# them from discrete uniform distributions)
hyperparameters = {'n_estimators': sp_randint(10, 30),
                   'max_features': sp_randint(1, X.shape[1])}
In [40]:
# Run randomized search using cross-validation,
# sampling 20 candidate hyperparameter combinations
randomCV = RandomizedSearchCV(RandomForestClassifier(),
                              param_distributions=hyperparameters,
                              n_iter=20, cv=10)
randomCV.fit(XTrain, yTrain)
print randomCV.best_params_
Out[40]:
93-95% accuracy - not too shabby! For comparison, we got around 91-92% accuracy with an out-of-the-box (untuned) random forest, which actually isn't terrible. Have a look and see for yourself how random forests with suboptimal hyperparameters fare.
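If you want to check the untuned baseline yourself, a quick sketch (reusing the classes and data already loaded above) might look like this:

# Fit a random forest with all-default hyperparameter values
untuned_rf = RandomForestClassifier()
untuned_rf.fit(XTrain, yTrain)
untuned_predictions = untuned_rf.predict(XTest)
print ("Untuned accuracy:", round(metrics.accuracy_score(yTest, untuned_predictions), 2))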
Let's train our second algorithm, support vector machines (SVMs) to do the same exact prediction task. A great introduction to the theory behind SVMs can be read here. Briefly, SVMs search for hyperplanes in the feature space which best divide the different classes in your dataset. Crucially, SVMs can find non-linear decision boundaries between classes using a process called kernelling, which projects the data into a higher-dimensional space. This sounds a bit abstract, but if you've ever fit a linear regression to power-transformed variables (e.g. maybe you used x^2, x^3 as features), you're already familiar with the concept.
SVMs can use different types of kernel functions, like the Gaussian (radial basis function) kernel or polynomial kernels, to throw the data into a different space. The main hyperparameters we must tune for SVMs are gamma (a kernel parameter, controlling how far we 'throw' the data into the new feature space) and C (a regularization parameter which controls the bias-variance trade-off of the model).
In [22]:
from sklearn.svm import SVC
# Search for good hyperparameter values
# Specify values to grid search over
g_range = 2. ** np.arange(-15, 5, step=2)
C_range = 2. ** np.arange(-5, 15, step=2)
hyperparameters = [{'gamma': g_range,
'C': C_range}]
print hyperparameters
In [72]:
# Grid search using cross-validation
grid = GridSearchCV(SVC(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)
bestG = grid.best_params_['gamma']
bestC = grid.best_params_['C']
# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=bestC, gamma=bestG)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)
print metrics.classification_report(yTest, SVM_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions),2)
How does this performance compare to the untuned classifier? What about one with especially badly chosen hyperparameter values?
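Here is one way you might check - a sketch, reusing the objects above, of an SVM with default hyperparameters and one with deliberately poor values from the edges of our grid:

# Untuned SVM with default hyperparameter values
default_SVM = SVC()
default_SVM.fit(XTrain, yTrain)
print "Untuned accuracy:", round(metrics.accuracy_score(yTest, default_SVM.predict(XTest)), 2)

# SVM with deliberately badly chosen hyperparameter values
bad_SVM = SVC(kernel='rbf', C=2. ** -5, gamma=2. ** 5)
bad_SVM.fit(XTrain, yTrain)
print "Badly tuned accuracy:", round(metrics.accuracy_score(yTest, bad_SVM.predict(XTest)), 2)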
The last algorithm we'll tune and apply to predict spam emails is a logistic regression classifier. This is a type of regression model which is used for predicting binary outcomes (like spam/non-spam). We fit a straight line through our transformed data, where the features remain the same but the dependent variable is the log odds of a datapoint belonging to one of the two classes. So essentially, logistic regression is just a transformed version of linear regression. Check out Charles' explanation and implementation of logistic regression [here].
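To make the 'log odds' idea concrete, here is a tiny sketch of the transformation that maps the model's linear output back to a probability:

import numpy as np

# Logistic regression models the log odds of being spam as a linear
# function of the features; the logistic function maps this back to [0, 1]
def log_odds_to_probability(log_odds):
    return 1.0 / (1.0 + np.exp(-log_odds))

print log_odds_to_probability(0.0)   # 0.5 - log odds of 0 means a coin flip
print log_odds_to_probability(2.0)   # ~0.88 - higher log odds, more likely spam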
One topic you will often encounter in machine learning is regularization, which is a class of techniques to reduce overfitting. The idea is that we often don't just want to maximize model fit, but also to penalize the model for e.g. using too many parameters, or assigning coefficients or weights that are too big. Read more about regularized regression [here]. We can adjust just how much regularization we want by adjusting regularization hyperparameters, and scikit-learn comes with some models that can very efficiently fit data over a whole range of regularization hyperparameter values. This is the case for regularized linear regression models like lasso regression and ridge regression (e.g. LassoCV and RidgeCV), which allow fast cross-validation for model selection.
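For example, here is a sketch on made-up regression data (since our spam problem is a classification task) of how RidgeCV picks a regularization strength by cross-validation in a single fit call:

import numpy as np
from sklearn.linear_model import RidgeCV

# Made-up regression problem purely to illustrate the built-in CV
X_reg = np.random.rand(100, 5)
y_reg = X_reg.dot([1.5, -2.0, 0.0, 3.0, 0.5]) + np.random.randn(100) * 0.1

# RidgeCV tries each candidate regularization strength (alpha) internally
ridge = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
ridge.fit(X_reg, y_reg)
print ridge.alpha_   # the alpha chosen by cross-validation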
But we can also optimize how much regularization we want ourselves, in the same manner as we've been doing.
In [71]:
from sklearn.linear_model import LogisticRegression
# Search for good hyperparameter values
# Specify values to grid search over
penalty = ["l1", "l2"]
C_range = np.arange(0.1, 1.1, 0.1)
hyperparameters = [{'penalty': penalty,
'C': C_range}]
# Grid search using cross-validation
grid = GridSearchCV(LogisticRegression(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)
bestPenalty = grid.best_params_['penalty']
bestC = grid.best_params_['C']
print bestPenalty
print bestC
In [74]:
# Train logistic regression and output predictions
classifier_logistic = LogisticRegression(penalty=bestPenalty, C=bestC)
classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain)
logistic_predictions = classifier_logistic_fit.predict(XTest)
print metrics.classification_report(yTest, logistic_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),2)
In this post, we started with the motivation for tuning machine learning algorithms by optimizing hyperparameter values. We used different methods of searching over hyperparameter space, evaluating the model at each point using k-fold cross-validation. These techniques allowed us to get nicer, bigger numbers in our models' performance reports. However, the concepts of training/test sets and cross-validation extend well beyond the topic of tuning hyperparameters - hyperparameter optimization is just one nice application.
In this post, we tried to maximize the accuracy of our models, but there are problems for which you might want to maximize something else, like the model's specificity or the sensitivity. For example, if we were doing medical diagnostics and trying to detect a deadly illness, it would be very bad to accidentally label a sick person as healthy (this would be called a "false negative" in the lingo). Maybe it's not so bad if we misclassify healthy people as sick people ("false positive"), since in the worst case we would maybe just annoy people by having them redo the diagnostic test. Hence, we might want our diagnostic model to be weighted towards optimizing sensitivity. Here is a good introduction to sensitivity and specificity in diagnostics.
Arguably, in spam detection, it is worse to misclassify real email as spam (false positive) than to let a few spam emails pass through your filter (false negative) and show up in people's mailboxes. In this case, we might aim to maximize specificity. Of course, we cannot be so focused on improving the specificity of our classifier that we completely tank our sensitivity. There is a natural trade-off between these quantities (see this primer on ROC curves), and part of our job as modellers is to decide where to draw the line.
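If you wanted to report these quantities for one of our classifiers, a sketch (assuming spam is labelled 1 and non-spam 0, and reusing the SVM predictions from above) could look like this:

# Compute sensitivity and specificity from the confusion matrix
conf = metrics.confusion_matrix(yTest, SVM_predictions)
TN, FP, FN, TP = conf.ravel()

sensitivity = TP / float(TP + FN)   # proportion of spam correctly caught
specificity = TN / float(TN + FP)   # proportion of real email correctly let through
print "Sensitivity:", round(sensitivity, 2)
print "Specificity:", round(specificity, 2)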
Sometimes there is no tuning to be done. For example, a Naive Bayes classifier just operates by calculating conditional probabilities, and there is no real hyperparameter optimization stage. It's a very interesting algorithm that is quite famous for classifying text documents (and the spam dataset in particular); check out a great overview and implementation here. It's a "naive" classifier because it rests on the assumption that the features in your dataset are independent, which is often not strictly true. In our spam dataset, you can imagine that the occurrence of the strings "win", "money", and "!!!" is probably not independent. Despite this, NB often still does a decent job at classification tasks.
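As a quick sketch (using Gaussian Naive Bayes, one of scikit-learn's NB variants, on the same training/test split as above), fitting NB really is this simple:

from sklearn.naive_bayes import GaussianNB

# No hyperparameter search needed - just fit and predict
nb = GaussianNB()
nb.fit(XTrain, yTrain)
nb_predictions = nb.predict(XTest)
print "Naive Bayes accuracy:", round(metrics.accuracy_score(yTest, nb_predictions), 2)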
In our next post, we will take these different tuned models and build them up into an ensemble model to increase our predictive performance even more. Stay... tuned!