In [ ]:
%matplotlib inline
This example demonstrates :class:wrappers.ParallelPostFit
. A
:class:sklearn.svm.SVC
is fit on a small dataset that easily fits in memory.
After training, we predict for successively larger datasets. We compare
SVC.predict
methodWe see that the parallel version is faster, especially for larger datasets.
Additionally, the parallel version from ParallelPostFit
scales out to
larger than memory datasets.
While only predict
is demonstrated here, :class:wrappers.ParallelPostFit
is equally useful for predict_proba
and transform
.
In [ ]:
from timeit import default_timer as tic
import pandas as pd
import seaborn as sns
import sklearn.datasets
from sklearn.svm import SVC
import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit
X, y = sklearn.datasets.make_classification(n_samples=1000)
clf = ParallelPostFit(SVC())
clf.fit(X, y)
Ns = [100_000, 200_000, 400_000, 800_000]
timings = []
for n in Ns:
X, y = dask_ml.datasets.make_classification(n_samples=n,
random_state=n,
chunks=n // 20)
t1 = tic()
# Serial scikit-learn version
clf.estimator.predict(X)
timings.append(('Scikit-Learn', n, tic() - t1))
t1 = tic()
# Parallelized scikit-learn version
clf.predict(X).compute()
timings.append(('dask-ml', n, tic() - t1))
df = pd.DataFrame(timings,
columns=['method', 'Number of Samples', 'Predict Time'])
ax = sns.factorplot(x='Number of Samples', y='Predict Time', hue='method',
data=df, aspect=1.5)