In some cases, hyperparameters determine the complexity or flexibility of the model (e.g. regularization parameters), and are used to keep it from overfitting (high variance).
In other cases, hyperparameters affect how the algorithm learns:
[A hyperparameter optimizer is] a functional from data to classifier (taking classification problems as an example)...
-- James Bergstra, Algorithms for Hyper-Parameter Optimization [0]
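To make the distinction concrete, here is a minimal scikit-learn sketch (the particular estimators and values are illustrative assumptions, not part of the example code below):
from sklearn.linear_model import Ridge, SGDClassifier

# `alpha` controls model complexity/flexibility (regularization strength),
# which helps limit overfitting (high variance).
ridge = Ridge(alpha=1.0)

# `eta0` with a constant learning-rate schedule controls how the algorithm
# learns, rather than what the model can represent.
sgd = SGDClassifier(learning_rate="constant", eta0=0.01)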
Bergstra and Bengio's paper, Random Search for Hyper-Parameter Optimization [1], tells us that if we randomly sample our search space, we approach the global optimum in fewer steps than grid search. Furthermore, while grid search scales exponentially badly as dimensions are added (see the Curse of Dimensionality), random search performs quite well in higher-dimensional spaces because many problems have low effective dimensionality: even though a search space may have many hyperparameters, only a few of them account for most of the variation in performance. Hence randomized sampling of the search space is surprisingly effective.
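To see why low effective dimensionality favors random search, here is a minimal sketch (illustrative only; the toy two-hyperparameter space is an assumption, not taken from the paper): with 9 trials, a 3x3 grid probes only 3 distinct values of the hyperparameter that actually matters, while 9 random trials probe 9 distinct values of it.
import numpy as np

np.random.seed(0)

# Toy search space where, by assumption, only the first of two
# hyperparameters really affects performance.

# 9 trials laid out as a 3x3 grid cover just 3 distinct values
# of the important hyperparameter...
grid_axis = [0.25, 0.50, 0.75]
grid_trials = [(a, b) for a in grid_axis for b in grid_axis]
print(len({a for a, b in grid_trials}))    # 3

# ...while 9 random trials cover 9 distinct values of it.
random_trials = np.random.uniform(0, 1, size=(9, 2))
print(len({a for a, b in random_trials}))  # 9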
Random Search vs Grid Search
How many times would we need to sample randomly to get reasonably close (>95% probability) to an optimal configuration?
To rephrase a bit more quantitatively: let's figure out how many points we should sample using random search in order to get a sample within 5% of the globally optimal hyper-parameterization of our sample space with at least 95% probability.
If we treat "within 5% of the global optimum" as a target region covering 5% of the search space, each random sample lands in it with probability 0.05, so the probability of missing it on all $n$ successive samples will be
$$ P_{miss} = (1 - 0.05)^n $$

Hence the probability of a hit will be

$$ P_{hit} = 1 - P_{miss} = 1 - (1 - 0.05)^n. $$

Setting the probability of a hit to at least 95% and solving for $n$ gives us

$$ 1 - (1 - 0.05)^n \geq 0.95 \iff (0.95)^n \leq 0.05 \iff n \geq \frac{\ln(0.05)}{\ln(0.95)} \approx 58.4. $$

Hence, using only 60 samples (59 is the exact minimum), we have at least a 95% probability of getting a sample within 5% of the global optimum. Very nice!
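As a quick sanity check (this snippet is not part of the original notebook), the same arithmetic in Python:
import math

# Smallest n with 1 - (1 - 0.05)**n >= 0.95, i.e. n >= log(0.05) / log(0.95)
n = math.ceil(math.log(0.05) / math.log(0.95))
print(n)                    # 59
print(1 - (1 - 0.05) ** n)  # ~0.9514, i.e. at least 95%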
Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. [1]
In [64]:
import numpy as np
from time import time
from operator import itemgetter
from scipy.stats import randint as sp_randint
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
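A note on versions: these imports match the scikit-learn API as of 2015. On modern releases (0.20 and later), sklearn.grid_search has been removed and grid_scores_ has been replaced by cv_results_; the equivalent import would be:
# Equivalent import on modern scikit-learn, where sklearn.grid_search no longer exists:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV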
In [65]:
digits = load_digits() # get some data
X, y = digits.data, digits.target
In [66]:
clf = RandomForestClassifier(n_estimators=20) # build a classifier
def report(grid_scores, n_top=3):
    # Utility function to report best scores
    top_scores = sorted(grid_scores, key=itemgetter(1), reverse=True)[:n_top]
    for i, score in enumerate(top_scores):
        print("Model with rank: {0}".format(i + 1))
        print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
              score.mean_validation_score,
              np.std(score.cv_validation_scores)))
        print("Parameters: {0}".format(score.parameters))
        print("")
In [67]:
# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(1, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
In [68]:
# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)
In [69]:
start = time()
random_search.fit(X, y)
finish = (time() - start)
In [70]:
print("RandomizedSearchCV took {time} seconds for {candidate} candidates"
" parameter settings.".format(time=finish, candidate=n_iter_search))
report(random_search.grid_scores_)
In [71]:
# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [1, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}
In [72]:
# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)
finish = (time() - start)
In [73]:
print("GridSearchCV took {time} seconds for {candidates} candidate parameter settings.".format(
    time=finish, candidates=len(grid_search.grid_scores_)))
report(grid_search.grid_scores_)
Feel free to check out the TotalGood Hyperparameter Optimization repo (HyperParameterOptimization/benchmarks) on GitHub to learn more about our current research (FYI, we love pull requests).
[0] Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems (NIPS). Retrieved August 21, 2015, from http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf
[1] Bergstra, James, and Yoshua Bengio. "Random Search for Hyper-Parameter Optimization." Journal of Machine Learning Research 13 (2012): 281-305. Web. 20 Aug. 2015. http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
Zheng, Alice. "How to Evaluate Machine Learning Models: Hyperparameter Tuning." 27 May 2015. Web. 21 Aug. 2015.
"Automatic Hyperparameter Tuning Methods." John Myles White. Web. 21 Aug. 2015.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
Presentation made using Jupyter, IPython, and React.js