Out-of-core Prediction

For some estimators, additional data don't improve performance past a certain point. The learning curve levels off. You may have additional data, but using it in the fit step won't make any difference.

In these cases, you'll commonly fit a model on a dataset that fits in memory, and use it to predict for datasets that may not. Dask can make the prediction step easier and faster.

In [69]:
import numpy as np
import dask.array as da
from sklearn.datasets import make_classification

In [71]:
X_train, y_train = make_classification(
    n_features=2, n_redundant=0, n_informative=2,
    random_state=1, n_clusters_per_class=1, n_samples=1000)

In [72]:
N = 100
X = da.concatenate([da.from_array(X_train, chunks=X_train.shape)
                    for _ in range(N)])
y = da.concatenate([da.from_array(y_train, chunks=y_train.shape)
                    for _ in range(N)])

So X_train and y_train are regular numpy arrays that we'll use to fit the model. X and y are large dask arrays that may not fit in memory.

In [76]:
from sklearn.linear_model import LogisticRegressionCV

In [79]:
clf = LogisticRegressionCV()

clf.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

With the model, we can make predictions for each observation by mapping the clf.predict_proba method over each block. This can then be scheduled to run on your single machine or your cluster.

In [82]:
yhat = X.map_blocks(clf.predict_proba, dtype=np.float64)

dask.array<predict_proba, shape=(100000, 2), dtype=float64, chunksize=(1000, 2)>

In [83]:

array([[ 0.06448766,  0.93551234],
       [ 0.23144553,  0.76855447],
       [ 0.22557832,  0.77442168],
       [ 0.08833674,  0.91166326],
       [ 0.90324265,  0.09675735]])