Expanding your machine learning toolkit: Randomized search, computational budgets, and new algorithms

Introduction

Previously, we wrote about some common trade-offs in machine learning and the importance of tuning models to your specific dataset. We demonstrated how to tune a random forest classifier using grid search, and how cross-validation can help avoid overfitting when tuning hyperparameters (HPs).

In this follow-up post, you'll beef up your machine learning toolbox by trying out some new, broadly-applicable tools. You'll learn a different strategy for traversing hyperparameter space - randomized search - and how to use it to tune two other classification algorithms - a support vector machine and a regularized logistic regression classifier.

We'll keep working with the wine dataset, which contains chemical characteristics of wines of varying quality. As before, our goal is to try to predict a wine's quality from these features.

Here are the things we'll cover in this blog post:

In the next blog post, you will learn how to take these three different tuned machine learning algorithms and combine them to build an aggregate model ensemble. Building ensembles often leads to improved model performance and generalizability. Stay tuned!

Loading and train/test splitting the dataset

You start off by collecting the dataset. We have covered the data loading, preprocessing, and train/test splitting previously, so we won't repeat ourselves here. Also check out this post on using plotly to create exploratory, interactive graphics of the wine dataset features.

You can fetch and format the data as follows:



In [2]:

    
import wget
import pandas as pd
import numpy as np
from sklearn.cross_validation import train_test_split

# Import the dataset
data_url = 'https://raw.githubusercontent.com/nslatysheva/data_science_blogging/master/datasets/wine/winequality-red.csv'
dataset = wget.download(data_url)
dataset = pd.read_csv(dataset, sep=";")

# Using a lambda function to bin quality scores
dataset['quality_is_high'] = dataset.quality.apply(lambda x: 1 if x >= 6 else 0)

# Convert the dataframe to a numpy array and split the
# data into an input matrix X and class label vector y
npArray = np.array(dataset)
X = npArray[:,:-2].astype(float)
y = npArray[:,-1]

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

Introducing randomized search: comparison with grid search

You have already built a random forest classifier and tuned it using grid search to predict wine quality (here). Grid search is quite commonly used, and is essentially just a method that exhaustively tries out all combinations of manually prespecified HP values and reports the best option (i.e. the one leading to the highest test accuracy). The benefit of this approach is that you thoroughly test out various combinations, but this is of course very computationally expensive. For grid search to be tractable, you often have to restrict the number of combinations, which can severely limit how well you explore hyperparameter space and lead to you overlooking regions where accuracy would be highest.

Another way to search through hyperparameter space to find optima is via randomized search. In randomized search, you sample HP values a certain number of times from some distribution which you prespecify in advance. So unlike grid search, in which you specify particular numbers to combinatorially try out, you instead specify distributions that cover the HP space you want to explore. For example, you might specify a standard normal distribution over an HP if you think reasonable values are roughly centered around 0, or a uniform distribution over some range if you think values within that range are about as likely to be "good". In randomized search, you also specify a n_iter parameter, which acts as a computational budget, controlling how many different parameter settings are tried out in total.

We can visually summarize the grid search (grey boxes) and randomized search (purple boxes) strategies like so:

Here, both approaches are constrained by the same computational budget - they can only try out 9 different HP settings (i.e. certain values for HP1 and HP2). Randomized search tries out HP values from two normal distributions (purple bell curves), repeating the process 9 times and thus getting 9 different values of both HP1 and HP2. Most combinations fall into the meaty portion of the normal distributions, but occasionally the tails are sampled from as well - this means you have at least some chance of trying out distant regions that could potentially strike gold (i.e. the hypothetically optimal HP space leading to high accuracy, bottom right).

Meanwhile, grid search tries out 3 values each of HP1 and HP2. Of course, these values do not have to be as close to each other as we have drawn (and one could indeed hit the gold space with grid search), but the idea is that since you are constrained to trying out all combinations of prespecified HP values, this intrinsically limits how much of the HP space can be explored. Specifically, here randomized search has searched a space that is 16 times bigger (we drew a 3x3 box for the grid search and a 12x12 box for the larger grid). The n_iter argument controlling the number of HP combinations to try out gives you access to a tradeoff between computational resources invested and the HP space you can explore.

Check out this paper outlining the efficiency of randomized search compared to grid search, especially in high-dimensional HP spaces. You can imagine that if you already have 12x12=144 possible combinations of 2 HPs, adding another HP increases the number of possibilities to search through to 12x12x12=1728. This becomes very demanding very quickly and randomized search is the only feasible practical approach. However intuitively, were computational resources and patience infinite, grid search would become the better choice.

Using randomized search

Scikit makes using randomized search easy with RandomizedSearchCV. You can feed distributions of HPs to the RandomizedSearchCV object in two (fairly similar) ways:

You can either define distributions over HPs, without immediately sampling from them, and pass these distributions to RandomizedSearchCV, which will proceed to sample n_iter number of times with replacement from the distributions to generate candidate HP combinations.
You can sample from distributions immediately and pass a list of possible HP values to RandomizedSearchCV, and it will sample from these possible values n_iter number of times without replacement.

Both approaches lead to the same outcome and you will be using the second one here as it allows you to have a peak at the HP values that were sampled beforehand. Here is what this looks like with random forests:



In [3]:

    
from scipy.stats import uniform
from scipy.stats import norm

from sklearn.grid_search import RandomizedSearchCV
from sklearn import metrics

# Designate distributions to sample hyperparameters from 
n_estimators = np.random.uniform(70, 80, 5).astype(int)
max_features = np.random.normal(6, 3, 5).astype(int)

# Check max_features>0 & max_features<=total number of features
max_features[max_features <= 0] = 1
max_features[max_features > X.shape[1]] = X.shape[1]

hyperparameters = {'n_estimators': list(n_estimators),
                   'max_features': list(max_features)}

print (hyperparameters)









    



{'n_estimators': [79, 79, 71, 71, 76], 'max_features': [3, 11, 8, 1, 5]}

You then run the random search:



In [4]:

    
from sklearn.ensemble import RandomForestClassifier

# Run randomized search
randomCV = RandomizedSearchCV(RandomForestClassifier(), param_distributions=hyperparameters, n_iter=20)
randomCV.fit(XTrain, yTrain)

# Identify optimal hyperparameter values
best_n_estim      = randomCV.best_params_['n_estimators']
best_max_features = randomCV.best_params_['max_features']  

print("The best performing n_estimators value is: {:5d}".format(best_n_estim))
print("The best performing max_features value is: {:5d}".format(best_max_features))

# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from randomCV.best_estimator_
rf = RandomForestClassifier(n_estimators=best_n_estim,
                            max_features=best_max_features)

rf.fit(XTrain, yTrain)
rf_predictions = rf.predict(XTest)

print (metrics.classification_report(yTest, rf_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, rf_predictions), 3))









    



The best performing n_estimators value is:    71
The best performing max_features value is:     3
             precision    recall  f1-score   support

        0.0       0.79      0.84      0.82       188
        1.0       0.85      0.81      0.83       212

avg / total       0.82      0.82      0.82       400

('Overall Accuracy:', 0.823)

Let's compare this performance to the default random forest:



In [5]:

    
# Create default rf
rf = RandomForestClassifier()
print(rf.get_params)

# Fit and predict with default rf
rf.fit(XTrain, yTrain)
rf_predictions = rf.predict(XTest)

print (metrics.classification_report(yTest, rf_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, rf_predictions),3))









    



<bound method RandomForestClassifier.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)>
             precision    recall  f1-score   support

        0.0       0.73      0.85      0.79       188
        1.0       0.85      0.73      0.78       212

avg / total       0.79      0.79      0.78       400

('Overall Accuracy:', 0.785)

Looks like the default performance is slightly lower, which is generally what you might expect. Either grid search or randomized search are good options for tuning random forests.

Let's look at how to tune the two other predictors.

Tuning a support vector machine

Let's train the second algorithm, a support vector machine (SVM) classifier, to do the same wine quality prediction task. A great introduction to the theory behind SVMs can be found in Chapter 9 of the Introduction to Statistical Learning book or in this nice blog post. Briefly, SVMs search for separating hyperplanes in the feature space which best divide the different classes in your dataset. If you had 2 features, SVMs would search for the best dividing line; if you had 3 features, SVMs search for the best dividing 2d plane, etc. Crucially, SVMs can construct complex, non-linear decision boundaries between classes by making use of a process called kernelling, which projects the data into a higher-dimensional space and facilitates the identification of a good boundary.

SVMs can use different types of kernel functions, like linear, polynomial, Gaussian or radial kernels, to throw the data into a different space. Let's use the popular radial basis function kernel (RBF kernel). In the case of RBF SVMs, the hyperparameters to tune include:

gamma - it controls how influential a single observation can be when being selected as a support vector in the model. Low values for gamma lead to large influence of individual observations and high values to less influence.
C - it controls the 'softness' of the classification boundary margin and hence the bias-variance tradeoff of the model. Lower values for C will draw smoother decision boundaries (less flexible), whereas higher values will give more rugged boundaries that can fit the training data better (more flexible)

Examine the default HP settings and performance:



In [6]:

    
from sklearn import svm

default_SVC = svm.SVC()
print ("Default SVC parameters are: \n{}".format(default_SVC.get_params))









    



Default SVC parameters are: 
<bound method SVC.get_params of SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)>

Using the default settings, the SVM performance:



In [7]:

    
from sklearn.svm import SVC

# Create, fit, and test default SVM
rbfSVM = SVC(kernel='rbf')
rbfSVM.fit(XTrain, yTrain)
svm_predictions = rbfSVM.predict(XTest)

print (metrics.classification_report(yTest, svm_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, svm_predictions),3))









    



             precision    recall  f1-score   support

        0.0       0.67      0.75      0.71       188
        1.0       0.75      0.67      0.71       212

avg / total       0.71      0.71      0.71       400

('Overall Accuracy:', 0.71)

Now use randomized search to try to improve on this accuracy. First, define distributions you want to sample HP values from and create a dictionary of possible values:



In [8]:

    
# Designate distributions to sample hyperparameters from 
np.random.seed(123)
g_range = np.random.uniform(0.0, 0.3, 5).astype(float)
C_range = np.random.normal(1, 0.1, 5).astype(float)

# Check that gamma>0 and C>0 
C_range[C_range < 0] = 0.0001

hyperparameters = {'gamma': list(g_range), 
                    'C': list(C_range)}

print (hyperparameters)









    



{'C': [1.0322106068339623, 0.99484822790606153, 0.97957990353611057, 1.197934843277785, 0.83806999349632538], 'gamma': [0.20894075567935849, 0.085841800485113834, 0.068055436069260927, 0.16539443072486737, 0.21584069093566891]}

Now pass this dictionary to param_distributions argument of RandomizedSearchCV:



In [9]:

    
# Run randomized search
randomCV = RandomizedSearchCV(SVC(kernel='rbf', ), param_distributions=hyperparameters, n_iter=20)
randomCV.fit(XTrain, yTrain)

# Identify optimal hyperparameter values
best_gamma  = randomCV.best_params_['gamma']
best_C      = randomCV.best_params_['C']

print("The best performing gamma value is: {:5.2f}".format(best_gamma))
print("The best performing C value is: {:5.2f}".format(best_C))









    



The best performing gamma value is:  0.07
The best performing C value is:  0.84

We can examine the scores of e.g. the first 5 tested HP combinations:



In [10]:

    
print (randomCV.grid_scores_[0:5])









    



[mean: 0.67473, std: 0.01398, params: {'C': 1.197934843277785, 'gamma': 0.20894075567935849}, mean: 0.67556, std: 0.01087, params: {'C': 1.197934843277785, 'gamma': 0.21584069093566891}, mean: 0.67056, std: 0.00857, params: {'C': 0.97957990353611057, 'gamma': 0.21584069093566891}, mean: 0.67223, std: 0.01004, params: {'C': 1.0322106068339623, 'gamma': 0.21584069093566891}, mean: 0.67306, std: 0.01285, params: {'C': 1.0322106068339623, 'gamma': 0.20894075567935849}]



In [11]:

    
# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=best_C, gamma=best_gamma)
rbfSVM.fit(XTrain, yTrain)
svm_predictions = rbfSVM.predict(XTest)

print (metrics.classification_report(yTest, svm_predictions))
print ("Overall Accuracy:", round(metrics.accuracy_score(yTest, svm_predictions),4))









    



             precision    recall  f1-score   support

        0.0       0.65      0.75      0.70       188
        1.0       0.74      0.64      0.69       212

avg / total       0.70      0.69      0.69       400

('Overall Accuracy:', 0.6925)

Looks like we get similar accuracy as the default model in this case. This is fine, but in general you might think about doing things like casting the net wider to try out quite different HP values, adding new HPs to the tuning process, try a different learning algorithm, etc.

Tuning a logistic regression classifier

The final model you'll tune and apply to predict wine quality is a logistic regression classifier (LogisticRegression). This is a type of regression model which is used for predicting binary outcomes (like good wine/not good wine). Logistic regression fits a sigmoidal (S-shaped) curve through the data, but can be viewed as just a transformed version of linear regression - a straight line predicting the log odds of data points being in one of the two classes. A nice explanation of logistic regression can be found here.

One topic you will often encounter in machine learning is regularization, which is a class of techniques to reduce overfitting. The idea behind regularization is that you do not only want to maximize a model's fit to your data, since this is susceptible to overfitting. Regularization techniques try to cut down on overfitting by penalizing models, for example if they use too many parameters, or if they assign coefficients or weights that are "too big". Regularization means that models have to learn from the data under a series of constraints, which often leads to robust representations of the data.

You can adjust just how much regularization you want by adjusting regularization hyperparameters, and since this is something you might want to do often, scikit-learn comes with some pre-built models that can very efficiently fit data for a range of regularization hyperparameter values. This is the case for regularized linear regression models like Lasso regression and ridge regression, which use l1 and l2 penalties, respectively, to shrink the size of the regression coefficients. These scikit modules offer a shortcut to performing cross-validated selection of the regularization hyperparameter.

But you can also optimize how much regularization you want yourself, while at the same time tuning other hyperparameters (like the choice between l1 and l2 penalty), in the same manner as you've been doing.

Let's examine default HP settings and performance for a logistic regression model:



In [12]:

    
# Tuning a regularized logistic regression model
from sklearn.linear_model import LogisticRegression

# Examine defaults
default_lr = LogisticRegression()
print ("Default logistic regression parameters are: {}".format(default_lr.get_params))
       
# Train model and output predictions
classifier_logistic = LogisticRegression()
classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain)
logistic_predictions = classifier_logistic_fit.predict(XTest)

print metrics.classification_report(yTest, logistic_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),3)









    



Default logistic regression parameters are: <bound method LogisticRegression.get_params of LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)>
             precision    recall  f1-score   support

        0.0       0.73      0.74      0.73       188
        1.0       0.77      0.75      0.76       212

avg / total       0.75      0.75      0.75       400

Overall Accuracy: 0.748

Now to optimise the HPs:



In [13]:

    
# Specify HP distributions
penalty = ["l1", "l2"]
np.random.seed(123)
C_range = np.random.normal(1, 0.2, 10).astype(float)

# Check that C>0 
C_range[C_range < 0] = 0.0001

hyperparameters = {'penalty': penalty, 
                    'C': C_range}

print (hyperparameters)









    



{'penalty': ['l1', 'l2'], 'C': array([ 0.78287388,  1.19946909,  1.0565957 ,  0.69874106,  0.88427995,
        1.33028731,  0.51466415,  0.91421747,  1.25318725,  0.82665192])}

And feeding these values into RandomizedSearchCV:



In [14]:

    
# Randomized search using cross-validation
randomCV = RandomizedSearchCV(LogisticRegression(), param_distributions=hyperparameters, cv=20)  
randomCV.fit(XTrain, yTrain)

best_penalty = randomCV.best_params_['penalty']
best_C       = randomCV.best_params_['C']

print ("The best performing penalty is: {}".format(best_penalty))
print ("The best performing C value is: {:5.2f}".format(best_C))









    



The best performing penalty is: l2
The best performing C value is:  0.70

We can now use these values to train a new, hopefully better model:



In [15]:

    
# Train model and output predictions
classifier_logistic = LogisticRegression(penalty=best_penalty, C=best_C)
classifier_logistic_fit = classifier_logistic.fit(XTrain, yTrain)
logistic_predictions = classifier_logistic_fit.predict(XTest)

print metrics.classification_report(yTest, logistic_predictions)
print "Overall Accuracy:", round(metrics.accuracy_score(yTest, logistic_predictions),3)









    



             precision    recall  f1-score   support

        0.0       0.72      0.73      0.73       188
        1.0       0.76      0.75      0.76       212

avg / total       0.74      0.74      0.74       400

Overall Accuracy: 0.743

This also actually looks quite similar to the default performance. It's always important to tune hyperparameters, but the effect might vary across learning algorithms and datasets.

Conclusion

In this tutorial, you have done randomized search to optimize a random forest, an SVM model, and logistic regression. Fancier techniques for hyperparameter optimization include methods based on gradient descent, grad student descent (not recommended), and Bayesian approaches which update prior beliefs about good values of hyperparameters based on observed performance (see Spearmint and hyperopt). The important thing is to try out different learning algorithms and HP settings on your dataset.

In the next post, we will try to leverage the contribution of different types of models by building them up into an ensemble model, with the aim of increasing predictive performance.

Quiz

Explain why the main advantages of randomized search over grid search are the following (as stated by scikit):
- A budget can be chosen independent of the number of parameters and possible values.
- Adding parameters that do not influence the performance does not decrease efficiency.
RandomizedSearchCV will throw an error if n_iter is greater than the number of possible HP combinations (e.g. ValueError: The total space of parameters 25 is smaller than n_iter=100.). Why is this a useful property? What happens if the number of combinations is indeed smaller than n_iter, but some of the combinations happen to be duplicates?
If the HP search space is quite small, the behaviour of randomized search can approach that of grid search. Specifically, if you were to sample values of two HPs from uniform distributions n times and set n_iter to $n^2$, then randomized search would be very similar to a grid search over the same HP ranges. But it wouldn't be quite identical - why?



In [ ]: