Hyper-parameter Search

Most scikit-learn estimators have a set of hyper-parameters. These are parameters that are not learned during estimation; they must be set ahead of time.

The dask-searchcv is able to parallelize scikit-learn's hyper-parameter search classes cleverly. It's able to schedule computation using any of dask's schedulers.


In [1]:
%matplotlib inline

In [2]:
import numpy as np

from time import time
from scipy.stats import randint as sp_randint
from scipy import stats

from distributed import Client
import distributed.joblib

from sklearn.externals import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

from dask_searchcv import GridSearchCV, RandomizedSearchCV
from sklearn import model_selection as ms
import matplotlib.pyplot as plt

client = Client()

This example is based off this scikit-learn example.


In [3]:
# get some data
digits = load_digits()
X, y = digits.data, digits.target

We'll fit a LogisticRegression, and compare the GridSearchCV and RandomizedSearchCV implementations from scikit-learn and dask-searchcv.

Grid-search is the brute-force method of hyper-parameter optimization. It fits each combination of parameters, which can be time consuming if you have many hyper-parameters or if you have a fine grid.

To use grid search from scikit-learn, you create a dictionary mapping parameter names to lists of values to try. That param_grid is passed to GridSearchCV along with a classifier (LogisticRegression in this example). Notice that dask_searchcv.GridSearchCV is a drop-in replacement for sklearn.model_selection.GridSearchCV.


In [5]:
# use a full grid over all parameters
param_grid = {
    "C": [1e-5, 1e-3, 1e-1, 1],
    "fit_intercept": [True, False],
    "penalty": ["l1", "l2"]
}

clf = LogisticRegression()

# run grid search
dk_grid_search = GridSearchCV(clf, param_grid=param_grid, n_jobs=-1)
sk_grid_search = ms.GridSearchCV(clf, param_grid=param_grid, n_jobs=-1)

GridSearchCV objects are fit just like regular estimators: .fit(X, y).

First, we'll fit the scikit-learn version.


In [6]:
start = time()
sk_grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(sk_grid_search.cv_results_['params'])))


GridSearchCV took 2.93 seconds for 16 candidate parameter settings.

And now the dask-searchcv version.


In [7]:
start = time()

dk_grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(dk_grid_search.cv_results_['params'])))


GridSearchCV took 1.85 seconds for 16 candidate parameter settings.

Randomized search is similar in spirit to grid search, but the method of choosing parameters to evaluate differs. With grid search, you specify the parameters to try, and scikit-learn tries each possible combination. Randomized search, on the other hand, takes some distributions to sample from and a maximum number of iterations to try. This lets you focus your search on areas where the parameters should perform better.


In [8]:
param_dist = {
    "C": stats.beta(1, 3),
    "fit_intercept": [True, False],
    "penalty": ["l1", "l2"]
}
n_iter_search = 100
clf = LogisticRegression()

In [9]:
# scikit-learn
sk_random_search = ms.RandomizedSearchCV(clf, param_distributions=param_dist,
                                         n_iter=n_iter_search, n_jobs=-1)

# dask
dk_random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                      n_iter=n_iter_search, n_jobs=-1)

In [10]:
# run randomized search
start = time()
sk_random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))


RandomizedSearchCV took 7.64 seconds for 100 candidates parameter settings.

In [11]:
dk_random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))


RandomizedSearchCV took 17.11 seconds for 100 candidates parameter settings.

Avoid Repeated Work

dask works by building a task graph of computations on data. It's able to cache intermediate computations in the graph, to avoid unnescessarily computing something multiple times. This speeds up computations on scikit-learn Pipelines, since the early stages of a pipeline are used for each parameter search.


In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])

grid = {'vect__ngram_range': [(1, 1)],
        'tfidf__norm': ['l1', 'l2'],
        'clf__alpha': [1e-5, 1e-4, 1e-3, 1e-1]}

Using a regular sklearn.model_selection.GridSearchCV, we would need to evaluate the CountVectorizor(ngram_range=(1, 1)) 8 times (once for each of the tfidf__norm and clf__alpha combintions.

With dask, we need only compute it once and the intermediate result is cached and reused.


In [15]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train')

In [16]:
sk_grid_search = ms.GridSearchCV(pipeline, grid, n_jobs=-1)
dk_grid_search = GridSearchCV(pipeline, grid, n_jobs=-1)

In [17]:
start = time()

dk_grid_search.fit(data.data, data.target)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(dk_grid_search.cv_results_['params'])))


GridSearchCV took 34.44 seconds for 8 candidate parameter settings.

In [18]:
start = time()

sk_grid_search.fit(data.data, data.target)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(sk_grid_search.cv_results_['params'])))


GridSearchCV took 40.32 seconds for 8 candidate parameter settings.