When doing machine learning using Python's scikit-learn library, we can often get reasonable predictions by using out-of-the-box algorithms with default settings. However, it is a much better idea to do at least some tuning of the algorithms to your specific problem and dataset. In this post, we will explore how hyperparameter optimization can be used to tune models, different strategies for traversing hyperparameter space, and several related concepts such as overfitting and cross-validation. We also demonstrate the process of tuning, training, and testing three different algorithms - a random forest, a support vector machine and a logistic regression classifier.

You'll be working with the famous (well, machine learning famous!) spam dataset, which contains loads of NLP-mined features of spam and non-spam emails, like the frequencies of the words "money", "free" and "viagra". Our goal is to tune and apply different algorithms to these features in order to predict whether a given email is spam.

The steps we'll cover in this blog post can be summarized as follows:

In the next blog post, you will learn how to take different tuned machine learning algorithms and combine them to build an ensemble model, which is a type of aggregated, meta-model that often has higher accuracy and less overfitting than its individual constituent models.

Let's get cracking.

We start off by retrieving the dataset. It can be found both online and (in a slightly nicer form) in our GitHub repository, so we can just fetch it via `wget`

(note: make sure you first install wget by typing `pip install wget`

into your Terminal since it is not a preinstalled Python library). The command will download a copy of the dataset to your current working directory.

```
In [2]:
```import wget
import pandas as pd
# Import the dataset
data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/spam/spam_dataset.csv'
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=",")
# Take a peak at the data
dataset.head()

```
Out[2]:
```

```
In [3]:
```# Examine shape of dataset and some column names
print (dataset.shape)
print (dataset.columns.values)
# Summarise feature values
dataset.describe()

```
Out[3]:
```

```
In [4]:
```import numpy as np
# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]

```
In [5]:
```from sklearn.cross_validation import train_test_split
# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

First, we are first going to try to predict spam emails with a random forest classifier. Chapter 8 of the Introduction to Statistical Learning book provides a truly excellent introduction to the theory behind random forests. Briefly, random forests build a collection of classification trees, where each tree tries to predict classes by recursively splitting the data on the features (and feature values) that split the classes 'best'. Each tree is trained on bootstrapped data, and each split is only allowed to use certain variables. So, an element of randomness is introduced when constructing a variety of different trees, and the random forest ensembles these base learners together. Have a read of Chapter 8 of the ISLR book if you're interested in the inner workings of random forests, but you don't need to know the theory to understand the rest of this blog post.

Out of the box, scikit's random forest classifier already performs quite well on the spam dataset:

```
In [12]:
```from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# Create and train random forest classifer
rf = RandomForestClassifier()
rf.fit(XTrain, yTrain)
# Predict classes on the test set
rf_predictions = rf.predict(XTest)
# Output performance metrics of the model
print (metrics.classification_report(yTest, rf_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, rf_predictions),2))

```
```

We've glossed over what a hyperparameter actually is. Let's explore the topic now. Often, when setting out to train a machine learning algorithm on your dataset, you must first set a number of arguments or **hyperparameters** (HPs). A hyperparameter is just a variable that influences the performance of your model, but isn't directly tuned during the training phase. Most machine learning algorithms have hyperparameters. For example, when using the k-nearest neighbours algorithm to do classification, the value of k (the number of nearest neighbours the model considers) is a hyperparameter that must be supplied in advance. As another example, when building a neural network, the number of layers in the network and the number of neurons per layer are both hyperparameters that must be specified before training commences. By contrast, the weights and biases in a neural network are **parameters** (not hyperparameters) because they *are* explicitly tuned during training. But how can we know what values to set the hyperparameters to in order to get the best performance from our learning algorithms?

Actually, scikit-learn generally provides reasonable HP default values, such that it is possible to quickly build an e.g. kNN classifier by simply typing `kNN_clfr=sklearn.neighbors.KNeighborsClassifier()`

and then fitting it to your data. Behind the scenes, we can see the hyperparameter values that the classifier has automatically assigned, such as setting the number of nearest neighbours hyperparameter to 5 (`n_neighbors=5`

), giving all datapoints equal importance (`weights=uniform`

), and so on. Often, the default HP values will do a decent job (as we saw above), so it may be tempting to skip the topic of model tuning completely. However, it is basically always a good idea to do at least some level of hyperparameter optimization, due to the potential for substantial improvements in a learning algorithm's performance.

We optimize hyperparameters in exactly the way that you might expect - we try different values and see what works best. However, some care is needed when deciding how exactly to measure if certain values work well, and which strategy to use to systematically explore hyperparameter space.

In a later post, we will introduce model ensembling, in which individual models can in a sense be considered 'hyper-hyper parameters' (™; ©; ®; patent pending; T-shirts printing).

In order to build the best possible model that does a good job at describing the underlying trends in the dataset, we need to pick the right HP values. In the following examples, we will introduce different strategies of searching for the set of HPs that define the best model, but we will first need to make a slight detour to explain how to avoid a major pitfall when it comes to tuning models - **overfitting**.

As we mentioned above, HPs are not optimised while a learning algorithm is learning. Hence, we need other strategies to optimise them. The most basic way would just be to test different possible values for the HPs and see how the model performs. In a random forest, the key HPs we need to optimise are `n_estimators`

and `max_features`

. `n_estimators`

controls the number of trees in the forest - generally, the more the better, but more trees comes at the expense of longer training time. `max_features`

controls the size of the random selection of features the algorithm is allowed to consider when splitting a node.

Let's try out some HP values for a random forest.

```
In [13]:
```# Defining a couple of HP options
n_estimators = np.array([5, 100])
max_features = np.array([10, 50])

```
In [15]:
```import itertools
# Get grid of all possible combinations of hp values
hp_combinations = list(itertools.product(n_estimators, max_features))
for hp_combo in range(len(hp_combinations)):
print (hp_combinations[hp_combo])
# Train and output accuracies
rf = RandomForestClassifier(n_estimators=hp_combinations[hp_combo][0],
max_features=hp_combinations[hp_combo][1])
rf.fit(XTrain, yTrain)
RF_predictions = rf.predict(XTest)
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))

```
```

`n_estimators`

and the lower value of `max_features`

did better. However, manually searching for the best HPs in this way is not efficient and could potentially lead to models that perform well on this specific data, but do not generalise well to a new dataset, which is what we are actually interested in. This phenomenon of building models that do not generalise well, or that are fitting too closely to the current dataset, is called **overfitting**. This is a key concept in machine learning and it is very much worth getting a better understanding of what it is. Let's briefly discuss the **bias-variance tradeoff**.

When we train machine learning algorithms, what we are really interested in is how our model will perform on an independent dataset. It is not enough to predict whether emails in our training set are spam - how well would the model fare when predicting if a completely new, previously unseen datapoint is spam or not?

This is the idea behind splitting your dataset into a **training set** (on which models can be trained) and a **test set** (which is held out until the very end of your analysis, and provides an accurate measure of model performance). Essentially, we are only interested in building models that are **generalizable** - getting 100% accuracy on the training set is not impressive, and is simply an indicator of **overfitting**. Overfitting is the situation in which we have fitted our model too closely to the data, and have tuned to the noise instead of just to the signal.

In fact, this concept of wanting to fit algorithms to the training data well, but not so tightly that the model doesn't generalize, is a pervasive problem in machine learning. A common term for this balancing act is **the bias-variance trade-off**. Here is a nice introductory article on the topic that goes into more depth.

Have a look at how underfitting (high bias, low variance), properly fitting, and overfitting (low bias, high variance) models fare on the training compared to the test sets.

"Bias" and "variance" have got to be some of the least helpful terms in machine learning. One way to think of them is: a model that underfits (e.g. the straight line) is quite a bit wrong - it models the underlying generative process too simply, and this simple model is highly biased away from the ground truth (high bias). But, the straight line fit is not going to change very much across different datasets (low variance). The opposite trend applies to overfitted models.

Hence, we never try to optimize the model's perfomance on the training data because it is a misguided endeavour. But wait, didn't we also say that the test set is not meant to be touched until we are completely done training our model? We often don't have the luxury of lots of extra data we can use just to fit loads of models. So how are we meant to optimize our hyperparameters?

Enter **k-fold cross-validation**, which is a handy technique for simulating having an abundance of test datasets available, and is used to measure a model's performance using *only* the training set. k-fold CV is a general method, and is not specific to hyperparameter optimization, but is very useful for that purpose. Say that we want to do e.g. 10-fold cross-validation. The process is as follows: we randomly partition the training set into 10 equal sections. Then, we train an algorithm on 9/10ths (i.e. 9 out of the 10 sections) of that training set. We then evaluate its performance on the remaining 1 section. This gives us some measure of the model's performance (e.g. overall accuracy). We then train the same algorithm on a *different* 9/10ths of the training set, and evaluate on the other (different from before) remaining 1 section. We continue the process 10 times, get 10 different measures of model performance, and average these values to get an overall measure. Of course, we could have chosen some number other than 10. To keep on with this example, the process behind 10-fold CV looks like this:

We can use k-fold cross validation to optimize HPs. Say we are deciding whether to use 1, 3 or 5 nearest-neighbours in our nearest-neighbours classifier. We can start by setting the `n_neighbours`

HP in our classifier object to 1, running 10-fold CV, and getting a measurement of the model's performance. Repeating the process with the other HP values will lead to different levels of performance, and we can simply choose the `n_neighbours`

value that worked best.

In the context of HP optimization, we perform k-fold cross validation together with **grid search** or **randomized search** to get a more robust estimate of the model performance associated with specific HP values.

Traditionally and perhaps most intuitively, scanning for HPs is done using **grid search** (also called parameter sweep). This strategy exhaustively searches through some manually prespecified HP values and reports the best option. It is common to try to optimize multiple HPs simultaneously - grid search tries each combination in turn, hence the name. It works like this:

The combination of grid search and k-fold cross validation is very popular for finding the models with the best possible performance and generalisability. So, in HP optimisation we are actually trying to do two things: (i) find the best possible combination of HPs that define a model and (ii) make sure that the model generalises well to new data. In order to address the second concern, CV is often the method of choice. Scikit-learn makes this process of HP optimization using k-fold CV very easy and slick, and even supports parallel distributing of the HP search (via the `n_jobs`

argument).

Note that grid search with k-fold CV simply returns the best HP values out of the available options, and is therefore not guaranteed to return a global optimum. It makes sense to choose a somewhat different collection of possible values that is centred around an empirically sensible default.

```
In [18]:
```from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
# Search for good hyperparameter values
# Specify values to grid search over
n_estimators = list(np.arange(25, 45, 5))
max_features = list(np.arange(10, X.shape[1], 20))
hyperparameters = {'n_estimators': n_estimators,
'max_features': max_features}
print (hyperparameters)

```
```

```
In [19]:
```# Grid search using cross-validation
gridCV = GridSearchCV(RandomForestClassifier(), param_grid=hyperparameters, cv=10, n_jobs=4)
gridCV.fit(XTrain, yTrain)
# Identify optimal hyperparameter values
best_n_estim = gridCV.best_params_['n_estimators']
best_max_features = gridCV.best_params_['max_features']
print("The best performing n_estimators value is: {:5.1f}".format(best_n_estim))
print("The best performing max_features value is: {:5.1f}".format(best_max_features))
# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from gridCV.best_estimator_
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
max_features=best_max_features)
clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)
print (metrics.classification_report(yTest, RF_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))

```
```

Grid search is quite commonly used, but another way to search through hyperparameter space to find optima is by using **randomized search**. With randomized search, we don't build a grid of HP values we want to try out. Instead, we specify the regions of hyperparameter space we are interested in by specifying distributions that we want to sample from. Importantly, we also specify a computational budget (`n_iter`

) which controls the number of HP combinations we will test out using CV. So, rather than the brute force approach of grid search, where we are forced to try out every combination, we can trade off between the computational resources we want to spend and the chances that we'll find the optimal combination of HP values. But because we prespecify regions where we think the correct HP values will lie, we are likely to cover some good possibilities even with a modest computational budget.

There is evidence that randomized search is more efficient than grid search, because grid search effectively wastes time by exhaustively checking each option when it might not be necessary. By contrast, the random experiments utilized by randomized search can explore the important dimensions of hyperparameter space with more coverage. If we were to use uniform distributions for all of our HPs and allow as many sampling iterations as there are combinations of the HPs, then randomized search just becomes grid search.

To use randomized search to tune random forests, we first specify the HP distributions:

```
In [20]:
```from scipy.stats import uniform
from scipy.stats import norm
from sklearn.grid_search import RandomizedSearchCV
# Designate distributions to sample hyperparameters from
# Uniform distribution for n_estimators, Gaussian for max_features
n_estimators = np.random.uniform(25, 45, 5).astype(int)
max_features = np.random.normal(20, 10, 5).astype(int)
hyperparameters = {'n_estimators': list(n_estimators),
'max_features': list(max_features)}
print hyperparameters

```
```

We then run the randomized search:

```
In [21]:
```# Run randomized search
# n_iter controls the number of HP combinations we try out
randomCV = RandomizedSearchCV(RandomForestClassifier(), param_distributions=hyperparameters, n_iter=10)
randomCV.fit(XTrain, yTrain)
# Identify optimal hyperparameter values
best_n_estim = randomCV.best_params_['n_estimators']
best_max_features = randomCV.best_params_['max_features']
print("The best performing n_estimators value is: {:5.1f}".format(best_n_estim))
print("The best performing max_features value is: {:5.1f}".format(best_max_features))
# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from randomCV.best_estimator_
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
max_features=best_max_features)
clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)
print (metrics.classification_report(yTest, RF_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))

```
```

So, that was an overview of the concepts and practicalities involved when tuning a random forest classifer. We could also choose to tune various other hyperpramaters, like `max_depth`

(the maximum depth of a tree, which controls how tall we grow our trees and influences overfitting) and the choice of the purity `criterion`

(which are specific formulas for calculating how good or 'pure' the bifurcations we choose are, as judged by how well they separate the classes in our dataset). The two we chose to tune are generally regarded as the most important. Either grid search or randomized search is probably fine for tuning random forests.

Fancier techniques for hyperparameter optimization include methods based on gradient descent, grad student descent, and Bayesian approaches which make smart decisions about what part of hyperparameter space to try next based on the performance of previous combinations (see Spearmint and hyperopt).

Note that the toy spam dataset we were working on is unusually straightforward, clean, and easy, and we were getting very high accuracies. It is rare to encounter such simple datasets in real life. Also, in real life we would expect to see a much bigger performance boost as our tuning strategy improves.

Let's finish off by showing how to tune two other predictors.

Let's train our second algorithm, support vector machines (SVMs) to do the same exact prediction task. A great introduction to the theory behind SVMs can be read here. Briefly, SVMs search for hyperplanes in the feature space which best divide the different classes in your dataset. Crucially, SVMs can find non-linear decision boundaries between classes using a process called kernelling, which projects the data into a higher-dimensional space. This sounds a bit abstract, but if you've ever fit a linear regression to power-transformed variables (e.g. maybe you used x^2, x^3 as features), you're already familiar with the concept. Do take a look at the guide we linked above.

SVMs can use different types of kernels, like Gaussian or radial ones, to throw the data into a different space. Let's use the latter. The main hyperparameters we must tune for SVMs are `gamma`

(a kernel parameter, controlling how far we 'throw' the data into the new feature space) and `C`

(which controls the bias-variance tradeoff of the model). Change the step size in the hyperparameter range specification to do better; we just chose a couple so that the code block runs quickly.

```
In [29]:
```from sklearn.svm import SVC
# Search for good hyperparameter values
# Specify values to grid search over
gamma_range = 2. ** np.arange(-15, 5, step=10)
C_range = 2. ** np.arange(-5, 15, step=10)
hyperparameters = [{'gamma': gamma_range,
'C': C_range}]
print hyperparameters

```
```

```
In [31]:
```from sklearn.svm import SVC
# Grid search using cross-validation
gridCV = GridSearchCV(SVC(), param_grid=hyperparameters, cv=5)
gridCV.fit(XTrain, yTrain)
best_gamma = gridCV.best_params_['gamma']
best_C = gridCV.best_params_['C']
# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', gamma=best_gamma, C=best_C)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)
print metrics.classification_report(yTest, SVM_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions),2)

```
```

How does this compare an untuned SVM? What about an SVM with especially badly tuned hyperparams?

The last algorithm we'll tune and apply to predict spam emails is a logistic regression classifier. This is a type of regression model which is used for predicting binary outcomes (like spam/non-spam). We fit a straight line through our transformed data, where the x axis remain the same but the dependent variable is the log odds of data points being one of the two classes. So essentialy, logistic regression is just a transformed version of linear regression.

One topic you will often encounter in machine learning is **regularization**, which is a class of techniques to reduce overfitting. The idea is that we often don't just want to maximize model fit, but also penalize the model for e.g. using too many parameters, or assigning coefficients or weights that are too big. Read more about regularized regression here. We can adjust just how much regularization we want by adjusting regularization hyperparameters, and scikit-learn comes with some models that can very efficiently fit data for a range of regulatization hyperparameter values. This is the case for regularized linear regression models like Lasso regression and ridge regression. These classes are shortcuts to doing cross-validated selection of models with different level of regularization.

But we can also optimize how much regularization we want ourselves, as well as tuning the values of other hyperparameters, in the same manner as we've been doing.

```
In [34]:
```from sklearn.linear_model import LogisticRegression
# Search for good hyperparameter values
# Specify values to grid search over
penalty = ["l1", "l2"]
C_range = np.arange(0.1, 1.1, 0.1)
hyperparameters = [{'penalty': penalty,
'C': C_range}]
# Grid search using cross-validation
grid = GridSearchCV(LogisticRegression(), param_grid=hyperparameters, cv=10)
grid.fit(XTrain, yTrain)
bestPenalty = grid.best_params_['penalty']
bestC = grid.best_params_['C']
print bestPenalty
print bestC
# Train model and output predictions
classifier_logistic = LogisticRegression(penalty=bestPenalty, C=bestC)
classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain)
logistic_predictions = classifier_logistic_fit.predict(XTest)
print metrics.classification_report(yTest, logistic_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),2)

```
```

In this post, we started with the motivation for tuning machine learning algorithms (i.e. nicer, bigger numbers in your models' performance reports!). We used different methods of searching over hyperparameter space, and evaluated candidate models at different points using k-fold cross validation. Tuned models were then run on the test set. Note that the concepts of training/test sets, cross-validation, and overfitting extend beyond the topic of tuning hyperparameters, though it is a nice application to demonstrate these ideas.

Here, we tried to maximize the accuracy of our models, but there are problems for which you might want to maximize something else, like the model's specificity or the sensitivity. For example, if we were doing medical diagnostics and trying to detect a deadly illness, it would be very bad to accidentally label a sick person as healthy (this would be called a "false negative" in the lingo). Maybe it's not so bad if we misclassify healthy people as sick people ("false positive"), since in the worst case we would just annoy people by having them retake the diagnostic test. Hence, we might want our diagnostic model to be weighted towards optimizing sensitivity. Here is a good introduction to sensitivity and specificity which continues with the example of diagnostic tests.

Arguably, in spam detection, it is worse to misclassify real email as spam (false positive) than to let a few spam emails pass through your filter (false negative) and show up in people's mailboxes. In this case, we might aim to maximize specificity. Of course, we cannot be so focused on improving the specificity of our classifier that we completely bomb our sensitivity. There is a natural trade-off between these quantities (see this primer on ROC curves), and part of our job as statistical modellers is to practice the dark art of deciding where to draw the line.

Sometimes there is no tuning to be done. For example, a Naive Bayes (NB) classifier just operates by calculating conditional probabilities, and there is no real hyperparameter optimization stage. NB is actually a very interesting algorithm that is famous for classifying text documents (and the `spam`

dataset in particular), so if you have time, check out a great overview and Python implementation here). It's a "naive" classifier because it rests on the assumption that the features in your dataset are independent, which is often not strictly true. In our spam dataset, you can image that the occurence of the strings "win", "money", and "!!!" is probably not independent. Despite this, NB often still does a decent job at classification tasks.

In our next post, we will take these different tuned models and build them up into an ensemble model to increase our predictive performance even more. Stay... tuned! *Cue groans*.