Most scikit-learn estimators have a set of hyper-parameters. These are parameters that are not learned during estimation; they must be set ahead of time.
The dask-searchcv library cleverly parallelizes scikit-learn's hyper-parameter search classes. It can schedule computation using any of dask's schedulers.
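For instance, the search classes take a scheduler argument that can name one of dask's local schedulers or be a distributed Client. A minimal sketch (the toy grid here is just for illustration):
import dask_searchcv as dcv
from sklearn.linear_model import LogisticRegression

# `scheduler` may be a string naming a local scheduler ('threading',
# 'multiprocessing', 'synchronous'), a distributed Client, or None to
# use the global default.
search = dcv.GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0]},
                          scheduler='threading')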
In [1]:
%matplotlib inline
In [2]:
import numpy as np
from time import time
from scipy.stats import randint as sp_randint
from scipy import stats
from distributed import Client
import distributed.joblib
from sklearn.externals import joblib
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from dask_searchcv import GridSearchCV, RandomizedSearchCV
from sklearn import model_selection as ms
import matplotlib.pyplot as plt
# start a local distributed scheduler and workers
client = Client()
This example is based on this scikit-learn example.
In [3]:
# get some data
digits = load_digits()
X, y = digits.data, digits.target
We'll fit a LogisticRegression, and compare the GridSearchCV and RandomizedSearchCV implementations from scikit-learn and dask-searchcv.
Grid search is the brute-force method of hyper-parameter optimization. It fits the model once for each combination of parameters, which can be time-consuming if you have many hyper-parameters or a fine grid.
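To see how quickly a grid grows, sklearn.model_selection.ParameterGrid enumerates every combination; a quick illustration with a toy grid:
from sklearn.model_selection import ParameterGrid

toy_grid = {"C": [1e-5, 1e-3, 1e-1, 1],      # 4 values
            "fit_intercept": [True, False],  # 2 values
            "penalty": ["l1", "l2"]}         # 2 values
# 4 * 2 * 2 = 16 candidate settings, each fit once per CV split
print(len(ParameterGrid(toy_grid)))  # 16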
To use grid search from scikit-learn, you create a dictionary mapping parameter names to lists of values to try.
That param_grid is passed to GridSearchCV along with a classifier (LogisticRegression in this example). Notice that dask_searchcv.GridSearchCV is a drop-in replacement for sklearn.model_selection.GridSearchCV.
In [5]:
# use a full grid over all parameters
param_grid = {
    "C": [1e-5, 1e-3, 1e-1, 1],
    "fit_intercept": [True, False],
    "penalty": ["l1", "l2"]
}
clf = LogisticRegression()
# run grid search
dk_grid_search = GridSearchCV(clf, param_grid=param_grid, n_jobs=-1)
sk_grid_search = ms.GridSearchCV(clf, param_grid=param_grid, n_jobs=-1)
GridSearchCV objects are fit just like regular estimators: .fit(X, y).
First, we'll fit the scikit-learn version.
In [6]:
start = time()
sk_grid_search.fit(X, y)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(sk_grid_search.cv_results_['params'])))
And now the dask-searchcv version.
In [7]:
start = time()
dk_grid_search.fit(X, y)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(dk_grid_search.cv_results_['params'])))
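Because dask_searchcv.GridSearchCV mirrors the scikit-learn interface, the fitted object exposes the familiar result attributes. For example:
# inspect the winning parameters and their cross-validated score
print(dk_grid_search.best_params_)
print(dk_grid_search.best_score_)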
Randomized search is similar in spirit to grid search, but the method of choosing parameters to evaluate differs. With grid search, you specify the parameters to try, and scikit-learn tries each possible combination. Randomized search, on the other hand, takes distributions to sample from and a maximum number of iterations to try. This lets you focus the search on regions of parameter space you expect to perform better.
In [8]:
param_dist = {
    "C": stats.beta(1, 3),
    "fit_intercept": [True, False],
    "penalty": ["l1", "l2"]
}
n_iter_search = 100
clf = LogisticRegression()
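Under the hood, candidates are drawn by sampling .rvs() from the scipy distribution and choosing uniformly from the lists; sklearn.model_selection.ParameterSampler shows what those draws look like (a quick illustration):
from sklearn.model_selection import ParameterSampler

# draw three candidate settings from the same distributions
for params in ParameterSampler(param_dist, n_iter=3, random_state=0):
    print(params)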
In [9]:
# scikit-learn
sk_random_search = ms.RandomizedSearchCV(clf, param_distributions=param_dist,
                                         n_iter=n_iter_search, n_jobs=-1)
# dask
dk_random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                      n_iter=n_iter_search, n_jobs=-1)
In [10]:
# run randomized search
start = time()
sk_random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
" parameter settings." % ((time() - start), n_iter_search))
In [11]:
start = time()
dk_random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search))
dask works by building a task graph of computations on data. It's able to cache intermediate computations in the graph, to avoid unnecessarily computing something multiple times. This speeds up computations on scikit-learn Pipelines, since the early stages of a pipeline are shared across candidates in a parameter search.
In [14]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier())])

grid = {'vect__ngram_range': [(1, 1)],
        'tfidf__norm': ['l1', 'l2'],
        'clf__alpha': [1e-5, 1e-4, 1e-3, 1e-1]}
Using a regular sklearn.model_selection.GridSearchCV, we would need to fit the CountVectorizer(ngram_range=(1, 1)) 8 times (once for each of the tfidf__norm and clf__alpha combinations). With dask, we need only compute it once; the intermediate result is cached and reused.
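To make that count concrete, here's a quick check (the 3-fold multiplier is an assumption, matching GridSearchCV's default cv at the time):
from sklearn.model_selection import ParameterGrid

n_candidates = len(ParameterGrid(grid))  # 1 * 2 * 4 = 8
n_splits = 3                             # assumed default 3-fold CV
# scikit-learn refits the vectorizer for every candidate and split...
print("scikit-learn CountVectorizer fits:", n_candidates * n_splits)  # 24
# ...while dask-searchcv shares it, fitting once per training split
print("dask-searchcv CountVectorizer fits:", n_splits)  # 3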
In [15]:
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train')
In [16]:
sk_grid_search = ms.GridSearchCV(pipeline, grid, n_jobs=-1)
dk_grid_search = GridSearchCV(pipeline, grid, n_jobs=-1)
In [17]:
start = time()
dk_grid_search.fit(data.data, data.target)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(dk_grid_search.cv_results_['params'])))
In [18]:
start = time()
sk_grid_search.fit(data.data, data.target)
print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
% (time() - start, len(sk_grid_search.cv_results_['params'])))