TechFestNW Interactive

Hyperparameter Optimization for Fun and Profit

Thunder Shiviah @TotalGood

github: ThunderShiviah

What is hyperparameter optimization (HPO)?

First things first

What are hyperparameters?

One answer:

A machine learning model's 'knobs'

Photograph by Jorge Royan, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported

Example: gradient descent - step size

Created by Joris Gillis using Maple 10, distributed as public domain.

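As a concrete sketch (an illustrative example of mine, not from the original slides), here is plain-vanilla gradient descent on a one-dimensional quadratic. The step size is a hyperparameter chosen before the algorithm runs: too small and convergence is slow, too large and the iterates diverge.

def gradient_descent(grad, x0, step_size, n_steps=50):
    # Minimize a function given its gradient, using a fixed step size.
    # The step size (learning rate) is a hyperparameter we pick by hand.
    x = x0
    for _ in range(n_steps):
        x = x - step_size * grad(x)
    return x

grad = lambda x: 2 * (x - 3)  # gradient of f(x) = (x - 3)**2, minimized at x = 3

for step_size in [0.01, 0.1, 1.1]:  # too small, reasonable, too large
    print(step_size, gradient_descent(grad, x0=0.0, step_size=step_size))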

Difference between parameters and hyperparameters

  • Parameters are what the machine learning algorithm learns from the data; hyperparameters are set by hand before training begins.

In some cases, hyperparameters control the complexity or flexibility of the model (e.g. regularization parameters), which helps prevent overfitting (high variance):

  • Decision trees
    • Desired depth and number of leaves in the tree
  • Support Vector Machines
    • Misclassification penalty term
  • Kernelized SVM
    • Kernel parameters

In other cases, hyperparameters affect how the algorithm learns (see the short sketch after this list):

  • Stochastic gradient descent optimization
    • Learning rate
  • Convergence thresholds
  • Random forests and boosted decision trees
    • Total number of trees
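
To make the distinction concrete, here is a minimal scikit-learn sketch (my own illustrative example): the constructor arguments are hyperparameters fixed before training, while the underscore-suffixed attributes are the parameters the model learns from the data.

from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

digits = load_digits()
X, y = digits.data, digits.target

# Hyperparameters: set before fitting (regularization strength, learning rate).
clf = SGDClassifier(alpha=1e-4, learning_rate="constant", eta0=0.01)

# Parameters: learned from the data during fit (note the trailing underscores).
clf.fit(X, y)
print(clf.coef_.shape, clf.intercept_.shape)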

What is an HPO?

[A hyperparameter optimizer is] a functional from data to classifier (taking classification problems as an example)...

-- James Bergstra, Algorithms for Hyper-Parameter Optimization [0]

Let's make an HPO!
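
Here is about the simplest HPO you can write: an exhaustive grid search in plain Python. This is a rough sketch of mine, where evaluate stands in for whatever cross-validated score you care about; it simply tries every combination of hyperparameter values and keeps the best.

from itertools import product

def grid_search(evaluate, param_grid):
    # Brute-force HPO: evaluate every combination of hyperparameter values
    # in param_grid and return the best-scoring setting.
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. cross-validated accuracy of a model built with params
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Hypothetical usage:
# grid_search(evaluate, {"max_depth": [3, None], "criterion": ["gini", "entropy"]})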

The good: It's parallelizable

The bad: It's the definition of brute-force

The ugly: Grid Search might not even find optimal values

Illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio.[1]

A solution?

Bergstra and Bengio's paper, Random Search for Hyper-Parameter Optimization [1], tells us that if we sample the hyperparameter space at random, we tend to get close to the global optimum using fewer samples than grid search requires. Furthermore, while grid search gets exponentially worse as the number of hyperparameters grows (see the curse of dimensionality), random search performs quite well in higher-dimensional spaces because many problems have a low effective dimensionality: even though the search space may have many hyperparameters, only a few of them account for most of the variation in performance. Hence randomized sampling of the search space is surprisingly effective.
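
The same idea as a rough plain-Python sketch (again, evaluate and the samplers are placeholders of mine): instead of enumerating a grid, draw each hyperparameter independently at random for a fixed budget of trials.

import random

def random_search(evaluate, param_samplers, n_iter=60):
    # Randomized HPO: draw n_iter hyperparameter settings at random
    # and return the best one found.
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {name: sample() for name, sample in param_samplers.items()}
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Hypothetical usage: each value is a zero-argument sampler, e.g.
# random_search(evaluate, {"max_depth": lambda: random.choice([3, None]),
#                          "max_features": lambda: random.randint(1, 10)})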

Random Search vs Grid Search

  • Number of necessary samples
  • Higher dimensions

How many times would we need to sample randomly to get reasonably close (>95% probability) to an optimal configuration?

To rephrase a bit more quantitatively: let's figure out how many points we should sample using random search in order to get a sample within 5% of the globally optimal hyper-parameterization of our sample space with at least 95% probability.

The probability of missing (not getting within 5% of) our global optimum on $n$ successive samples will be

$$ P_{miss} = (1 - 0.05)^n $$

Hence the probability of a hit will be

$$ P_{hit} = 1 - P_{miss} = 1 - (1 - 0.05)^n. $$

Setting the probability of a hit to 95% and solving for $n$ gives us $$ P_{hit} = 1 - (1 - 0.05)^n = 0.95 $$

$$ \implies (0.95)^n = 0.05 \implies n = \frac{\ln(0.05)}{\ln(0.95)} \approx 58.4, $$

which we round up to 60.

Hence, using only 60 samples, we have a 95% probability of getting a sample within 5% of the global optimum. Very nice!
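
We can sanity-check that number with a couple of lines of Python (a quick back-of-the-envelope check, not part of the original derivation):

import math

n = math.log(0.05) / math.log(0.95)  # smallest n with 1 - 0.95**n >= 0.95
print(n)               # ~58.4
print(1 - 0.95 ** 60)  # probability of a hit with 60 samples, ~0.954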

Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. [1]

Scikit-learn provides both grid and random search as well as an example benchmark comparing the two. Let's try it out.

First we'll import some relevant packages

  • Sidenote: I recommend the Anaconda Scientific Python distribution and conda package manager for creating virtual environments that come pre-installed with scikit-learn and (almost) everything else I need.

In [64]:
import numpy as np

from time import time
from operator import itemgetter
from scipy.stats import randint as sp_randint

from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
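# (note: in scikit-learn >= 0.18, GridSearchCV and RandomizedSearchCV live in sklearn.model_selection)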
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

For this example, we'll load the digits data set, a small example data set from scikit-learn containing 8x8 images of handwritten digits along with their labels.


In [65]:
digits = load_digits() # get some data
X, y = digits.data, digits.target

Next, we initialize our classifier (a random forest in this case).

Also, we'll make a short helper function for outputting nice grid scores from our HPOs.


In [66]:
clf = RandomForestClassifier(n_estimators=20) # build a classifier

def report(grid_scores, n_top=3):
    # Utility function to report best scores
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")

In order to run random search, we need to specify distributions to sample from. We'll use sp_randint from scipy.stats, which gives a discrete uniform distribution over a range of integers that RandomizedSearchCV draws samples from during the search.
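
For instance (a quick illustrative check of mine, not a cell from the original notebook), sp_randint(1, 11) is a distribution over the integers 1 through 10 (the upper bound is exclusive) with an rvs method used to draw samples:

dist = sp_randint(1, 11)   # sp_randint was imported above from scipy.stats
print(dist.rvs(size=5))    # five random draws, each an integer between 1 and 10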


In [67]:
# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

Finally, we'll run the random search over our random forest classifier.


In [68]:
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

In [69]:
start = time()
random_search.fit(X, y)
finish = (time() - start)

In [70]:
print("RandomizedSearchCV took {time} seconds for {candidate} candidates"
      " parameter settings.".format(time=finish, candidate=n_iter_search))
report(random_search.grid_scores_)


RandomizedSearchCV took 4.0490076541900635 seconds for 20 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.921 (std: 0.013)
Parameters: {'max_features': 8, 'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 5, 'bootstrap': True, 'min_samples_leaf': 3}

Model with rank: 2
Mean validation score: 0.920 (std: 0.007)
Parameters: {'max_features': 4, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 3, 'bootstrap': True, 'min_samples_leaf': 3}

Model with rank: 3
Mean validation score: 0.919 (std: 0.017)
Parameters: {'max_features': 8, 'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 10, 'bootstrap': False, 'min_samples_leaf': 5}

We'll now follow the same process for grid search; the only difference is that instead of sampling from distributions, we specify explicit lists of values to try, and every combination in the grid gets evaluated.


In [71]:
# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

In [72]:
# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)
finish = (time() - start)

In [73]:
print("GridSearchCV took {time} seconds for {candidates} candidate parameter settings.".format(
        time=finish, candidates=len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)


GridSearchCV took 37.768978118896484 seconds for 216 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.936 (std: 0.013)
Parameters: {'max_features': 10, 'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 3, 'bootstrap': True, 'min_samples_leaf': 1}

Model with rank: 2
Mean validation score: 0.932 (std: 0.009)
Parameters: {'max_features': 3, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 1, 'bootstrap': False, 'min_samples_leaf': 1}

Model with rank: 3
Mean validation score: 0.931 (std: 0.012)
Parameters: {'max_features': 10, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 3, 'bootstrap': False, 'min_samples_leaf': 3}

The results are clear

  • Random search took an order of magnitude less time than grid search (roughly 4 seconds vs. 38 seconds here) by sampling only 20 candidate parameter settings, compared to the 216 combinations (2 x 3 x 3 x 3 x 2 x 2) that grid search had to evaluate.
  • Despite the much smaller budget, random search achieved comparable, and sometimes better, performance depending on the random seed (in this run, a best mean validation score of 0.921 vs. 0.936).

How can I learn more about HPOs?

  • Explore other forms of optimization, such as derivative-free and Bayesian optimizers, using HPOLib

Feel free to check out the TotalGood Hyperparameter Optimization repo (HyperParameterOptimization/benchmarks) on GitHub to learn more about our current research (we love pull requests, by the way).

Bibliography

[0] Bergstra, James, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. "Algorithms for Hyper-Parameter Optimization." Advances in Neural Information Processing Systems (NIPS), 2011. Web. 21 Aug. 2015. http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf

[1] Bergstra, James, and Yoshua Bengio. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research 13 (2012): 281-305. Web. 20 Aug. 2015. http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

Zheng, Alice. "How to Evaluate Machine Learning Models: Hyperparameter Tuning." 27 May 2015. Web. 21 Aug. 2015.

"Automatic Hyperparameter Tuning Methods." John Myles White. Web. 21 Aug. 2015.

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Presentation made using Jupyter, IPython, and React.js