Tricks of the trade: expanding your machine learning toolkit

Introduction

Previously, we learned about the importance of tuning our machine learning algorithms in order to improve prediction accuracy. We demonstrated tuning a random forest classifier using grid search, and showed how cross-validation can help avoid overfitting when tuning hyperparameters (HPs).

Here, we introduce a different strategy for traversing hyperparameter space - randomized search. We also demonstrate the process of tuning and training two other algorithms - a support vector machine and a logistic regression classifier.

We'll keep working with the spam dataset, which contains features relating to the frequency of words (like "money" and "viagra") and symbols (like "!!!") in spam and non-spam emails. Our goal is to tune and apply different algorithms to these features in order to predict whether a given email is spam.

In this blog post, we'll cover tuning a random forest classifier with randomized search, and then tuning and training a support vector machine and a logistic regression classifier with grid search.

In the next blog post, you will learn how to take different tuned machine learning algorithms and combine them to build different types of ensemble models, which are aggregated models that frequently achieve higher accuracy and suffer less from overfitting.

Loading and train/test splitting the dataset

We start off by collecting the dataset. We have covered the data loading, conversion and train/test split previously, so we won't repeat the explanations here.


In [2]:
import wget
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split

# Import the dataset
data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/spam/spam_dataset.csv'
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=",")

# Convert dataframe to numpy array and split
# data into input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-1].astype(float)
y = npArray[:,-1]

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

We have already built a random forest classifier, tuned using grid search, to predict spam emails (here). Grid search is quite commonly used: it exhaustively searches through a set of manually prespecified HP values and reports the best option. Another way to search through hyperparameter space for the optimum is randomized search. In randomized search, we sample HP values a certain number of times from distributions that we prespecify. There is evidence that randomized search is more efficient than grid search, because not all HPs are equally important to tune, and grid search effectively wastes time by exhaustively checking every combination when doing so might not be necessary. By contrast, the random trials used by randomized search explore the important dimensions of hyperparameter space with more coverage, while not devoting too many trials to dimensions that matter less. This makes randomized search particularly useful when the hyperparameter space is high-dimensional.

To use randomized search to tune random forests, we first specify the distributions we want to sample from.

If we were to sample values uniformly over the same grid of candidate points and run a comparable number of n_iter trials, randomized search would behave much like grid search.


In [4]:
from sklearn.grid_search import RandomizedSearchCV
from sklearn import metrics

# Draw candidate hyperparameter values: n_estimators from a uniform
# distribution, max_features from a normal distribution (clipped so the
# values stay within the valid range of 1 to n_features)
n_estimators = np.random.uniform(25, 45, 5).astype(int)
max_features = np.clip(np.random.normal(20, 10, 5), 1, XTrain.shape[1]).astype(int)

hyperparameters = {'n_estimators': list(n_estimators),
                   'max_features': list(max_features)}

print hyperparameters


{'n_estimators': [38, 33, 40, 32, 40], 'max_features': [8, 31, 23, 13, 25]}
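
Note that here we pre-sampled a handful of candidate values and passed them to the search as plain lists, so RandomizedSearchCV will pick from those fixed options. It can also accept scipy.stats distributions directly (anything with an rvs method) and draw a fresh value on every iteration. A minimal sketch of that variant, with illustrative ranges:


In [ ]:
from scipy.stats import randint

# Distributions instead of fixed lists; RandomizedSearchCV calls .rvs()
# on each one to draw a new candidate value at every iteration
hyperparameters_dist = {'n_estimators': randint(25, 45),
                        'max_features': randint(5, 30)}

For the rest of this post, we'll stick with the list-based version defined above.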

We then run the random search:


In [5]:
from sklearn.ensemble import RandomForestClassifier

# Run randomized search
randomCV = RandomizedSearchCV(RandomForestClassifier(), param_distributions=hyperparameters, n_iter=10)
randomCV.fit(XTrain, yTrain)

# Identify optimal hyperparameter values
best_n_estim      = randomCV.best_params_['n_estimators']
best_max_features = randomCV.best_params_['max_features']  

print("The best performing n_estimators value is: {:5.1f}".format(best_n_estim))
print("The best performing max_features value is: {:5.1f}".format(best_max_features))

# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from randomCV.best_estimator_
clfRDF = RandomForestClassifier(n_estimators=best_n_estim,
                                max_features=best_max_features)

clfRDF.fit(XTrain, yTrain)
RF_predictions = clfRDF.predict(XTest)

print (metrics.classification_report(yTest, RF_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, RF_predictions),2))


The best performing n_estimators value is:  40.0
The best performing max_features value is:  13.0
             precision    recall  f1-score   support

        0.0       0.96      0.98      0.97       701
        1.0       0.96      0.94      0.95       450

avg / total       0.96      0.96      0.96      1151

('Overall Accuracy:', 0.96)

Either grid search or randomized search is probably fine for tuning random forests.

Fancier techniques for hyperparameter optimization include methods based on gradient descent, grad student descent, and Bayesian approaches which update prior beliefs about likely values of hyperparameters based on the data (see Spearmint and hyperopt).
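
To give a flavour of the Bayesian approach, here is a minimal sketch using hyperopt's tree-structured Parzen estimator to tune the same two random forest HPs. This assumes hyperopt is installed, and the search ranges and number of evaluations are purely illustrative:


In [ ]:
from hyperopt import fmin, tpe, hp
from sklearn.cross_validation import cross_val_score

# hyperopt minimises its objective, so return negative cross-validated accuracy
def rf_objective(params):
    clf = RandomForestClassifier(n_estimators=int(params['n_estimators']),
                                 max_features=int(params['max_features']))
    return -cross_val_score(clf, XTrain, yTrain, cv=3).mean()

# Search space: quantised uniform distributions over plausible ranges
space = {'n_estimators': hp.quniform('n_estimators', 25, 45, 1),
         'max_features': hp.quniform('max_features', 5, 30, 1)}

best = fmin(rf_objective, space, algo=tpe.suggest, max_evals=25)
print best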

Let's look at how to tune our two other predictors. For simplicity, let's revert to grid search.

Tuning a support vector machine

Let's train our second algorithm, a support vector machine (SVM), to do the same prediction task. A great introduction to the theory behind SVMs can be read here. Briefly, SVMs search for hyperplanes in the feature space which best divide the different classes in your dataset. Crucially, SVMs can find non-linear decision boundaries between classes using a process called kernelling, which projects the data into a higher-dimensional space. This sounds a bit abstract, but if you've ever fit a linear regression to power-transformed variables (e.g. using x^2 and x^3 as features), you're already familiar with the concept. Do have a read of the guide we linked above.

SVMs can use different types of kernel functions, like polynomial or radial basis function (RBF, also known as Gaussian) kernels, to throw the data into a different space. Let's use an RBF kernel here. The main hyperparameters we must tune for SVMs are gamma (a kernel parameter, controlling how far we 'throw' the data into the new feature space) and C (a regularization parameter, which controls the bias-variance tradeoff of the model).


In [6]:
from sklearn.svm import SVC

# Search for good hyperparameter values
# Specify values to grid search over
g_range = 2. ** np.arange(-15, 5, step=5)
C_range = 2. ** np.arange(-5, 15, step=5)

hyperparameters = [{'gamma': g_range, 
                    'C': C_range}] 

print hyperparameters


[{'C': array([  3.12500000e-02,   1.00000000e+00,   3.20000000e+01,
         1.02400000e+03]), 'gamma': array([  3.05175781e-05,   9.76562500e-04,   3.12500000e-02,
         1.00000000e+00])}]

In [ ]:
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

# Grid search using cross-validation
grid = GridSearchCV(SVC(), param_grid=hyperparameters, cv=10)  
grid.fit(XTrain, yTrain)

bestG = grid.best_params_['gamma']
bestC = grid.best_params_['C']

# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=bestC, gamma=bestG)
rbfSVM.fit(XTrain, yTrain)
SVM_predictions = rbfSVM.predict(XTest)

print metrics.classification_report(yTest, SVM_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, SVM_predictions),2)

How does this compare to an untuned SVM? What about an SVM with especially badly tuned hyperparameters?
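
One quick way to answer that is to fit a default SVC alongside a deliberately mis-tuned one and compare their test accuracies. A rough sketch (the 'bad' gamma and C values below are extreme on purpose, chosen only for illustration):


In [ ]:
# Untuned SVM: scikit-learn's default hyperparameter values
default_SVM = SVC(kernel='rbf')
default_SVM.fit(XTrain, yTrain)
print "Default SVM accuracy:", round(metrics.accuracy_score(yTest, default_SVM.predict(XTest)), 2)

# Deliberately badly tuned SVM: a huge gamma and a tiny C
bad_SVM = SVC(kernel='rbf', gamma=10, C=0.001)
bad_SVM.fit(XTrain, yTrain)
print "Badly tuned SVM accuracy:", round(metrics.accuracy_score(yTest, bad_SVM.predict(XTest)), 2)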

Tuning a logistic regression classifier

The last algorithm you'll tune and apply to predict spam emails is a logistic regression classifier. This is a type of regression model used for predicting binary outcomes (like spam/non-spam). We fit a straight line through our transformed data, where the x axes remain the same but the dependent variable is the log odds of a data point belonging to one of the two classes. So essentially, logistic regression is just a transformed version of linear regression. Check out Charles' explanation and implementation of logistic regression here.

One topic you will often encounter in machine learning is regularization, which is a class of techniques for reducing overfitting. The idea is that we often don't just want to maximize model fit; we also want to penalize the model for, e.g., using too many parameters or assigning coefficients or weights that are too big. Read more about regularized regression here. We can adjust just how much regularization we want by tuning the regularization hyperparameters, and scikit-learn comes with models that can very efficiently fit data for a whole range of regularization hyperparameter values. This is the case for regularized linear regression models like Lasso regression and ridge regression; their cross-validated versions are shortcuts to selecting models with different levels of regularization.
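
As a quick aside, here is a minimal sketch of what one of these shortcut classes looks like in use. Since Lasso and ridge are regression models rather than classifiers, this example uses a small synthetic regression dataset instead of our spam data, and the alpha range is purely illustrative:


In [ ]:
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

# Toy regression problem, purely for illustration
X_toy, y_toy = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

# LassoCV selects the regularization strength (alpha) by cross-validation in one fit
lasso = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5)
lasso.fit(X_toy, y_toy)

print "Cross-validated choice of alpha:", lasso.alpha_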

But we can also tune how much regularization we want ourselves, along with the values of other hyperparameters, in the same manner as we've been doing.


In [ ]:
from sklearn.linear_model import LogisticRegression

# Search for good hyperparameter values
# Specify values to grid search over
penalty = ["l1", "l2"]
C_range = np.arange(0.1, 1.1, 0.1)

hyperparameters = [{'penalty': penalty, 
                    'C': C_range}] 

# Grid search using cross-validation
grid = GridSearchCV(LogisticRegression(), param_grid=hyperparameters, cv=10)  
grid.fit(XTrain, yTrain)

bestPenalty = grid.best_params_['penalty']
bestC = grid.best_params_['C']

print bestPenalty
print bestC

In [ ]:
# Train model and output predictions
classifier_logistic = LogisticRegression(penalty=bestPenalty, C=bestC)
classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain)
logistic_predictions = classifier_logistic_fit.predict(XTest)

print metrics.classification_report(yTest, logistic_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),2)

Conclusion

In this post, we started with the motivation for tuning machine learning algorithms (i.e. nicer, bigger numbers in your models' performance reports!). We used different methods of searching over hyperparameter space, and evaluated candidate models at different points using k-fold cross-validation. The tuned models were then run on the test set. Note that the concepts of training/test sets, cross-validation, and overfitting extend beyond the topic of tuning hyperparameters, though hyperparameter tuning is a nice application for demonstrating these ideas.

In this post, we tried to maximize the accuracy of our models, but there are problems for which you might want to maximize something else, like the model's specificity or sensitivity. For example, if we were doing medical diagnostics and trying to detect a deadly illness, it would be very bad to accidentally label a sick person as healthy (a "false negative" in the lingo). Maybe it's not so bad if we misclassify healthy people as sick ("false positives"), since in the worst case we would just annoy people by having them retake the diagnostic test. Hence, we might want our diagnostic model to be weighted towards optimizing sensitivity. Here is a good introduction to sensitivity and specificity which continues with the example of diagnostic tests.

Arguably, in spam detection, it is worse to misclassify real email as spam (false positive) than to let a few spam emails pass through your filter (false negative) and show up in people's mailboxes. In this case, we might aim to maximize specificity. Of course, we cannot be so focused on improving the specificity of our classifier that we completely bomb our sensitivity. There is a natural trade-off between these quantities (see this primer on ROC curves), and part of our job as statistical modellers is to practice the dark art of deciding where to draw the line.
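
As a small illustration of computing these quantities, here is a sketch using the random forest predictions from earlier, treating spam (class 1) as the positive class:


In [ ]:
# Confusion matrix rows/columns are ordered by class label: 0 (non-spam), 1 (spam)
conf_mat = metrics.confusion_matrix(yTest, RF_predictions)
TN, FP = conf_mat[0, 0], conf_mat[0, 1]
FN, TP = conf_mat[1, 0], conf_mat[1, 1]

print "Sensitivity (spam caught):    ", round(float(TP) / (TP + FN), 2)
print "Specificity (real email kept):", round(float(TN) / (TN + FP), 2)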

Sometimes there is no tuning to be done. For example, a Naive Bayes (NB) classifier just operates by calculating conditional probabilities, and there is no real hyperparameter optimization stage. NB is actually a very interesting algorithm that is famous for classifying text documents (and the spam dataset in particular), so if you have time, check out a great overview and Python implementation here. It's a "naive" classifier because it rests on the assumption that the features in your dataset are independent, which is often not strictly true. In our spam dataset, you can imagine that the occurrence of the strings "win", "money", and "!!!" is probably not independent. Despite this, NB often still does a decent job at classification tasks.
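
As a quick sketch of that, here is a Gaussian Naive Bayes classifier (one of several NB variants in scikit-learn) applied to our spam features, with no tuning step at all:


In [ ]:
from sklearn.naive_bayes import GaussianNB

# No hyperparameter search: just fit and predict
nb = GaussianNB()
nb.fit(XTrain, yTrain)
NB_predictions = nb.predict(XTest)

print "Naive Bayes accuracy:", round(metrics.accuracy_score(yTest, NB_predictions), 2)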

In our next post, we will take these different tuned models and build them up into an ensemble model to increase our predictive performance even more. Stay... tuned! Cue groans.