For some estimators, additional data don't improve performance past a certain point: the learning curve levels off. You may have more data available, but including it in the fit step won't improve the model.
In these cases, a common pattern is to fit a model on a dataset that fits in memory and then use it to predict for datasets that may not. Dask can make the prediction step easier and faster.
In [69]:
import numpy as np
import dask.array as da
from sklearn.datasets import make_classification
In [71]:
X_train, y_train = make_classification(
    n_features=2, n_redundant=0, n_informative=2,
    random_state=1, n_clusters_per_class=1, n_samples=1000)
In [72]:
# Replicate the small training set N times to build dask arrays that
# stand in for a much larger dataset.
N = 100
X = da.concatenate([da.from_array(X_train, chunks=X_train.shape)
                    for _ in range(N)])
y = da.concatenate([da.from_array(y_train, chunks=y_train.shape)
                    for _ in range(N)])
So X_train and y_train are regular NumPy arrays that we'll use to fit the model, while X and y are large dask arrays that may not fit in memory.
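As a quick check (not part of the original walkthrough), we can inspect the dask arrays' metadata without materializing them; the shapes and chunk sizes below follow directly from n_samples=1000 and N=100 above.

# Inspect the dask arrays without loading them into memory.
print(X.shape, X.chunksize, X.numblocks)   # (100000, 2), (1000, 2), (100, 1)
print(y.shape, y.chunksize)                # (100000,), (1000,)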
In [76]:
from sklearn.linear_model import LogisticRegressionCV
In [79]:
clf = LogisticRegressionCV()
clf.fit(X_train, y_train)
Out[79]:
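As a sanity check before scaling predictions out (this step isn't in the original example), we can score the fitted classifier on the in-memory training data:

# Training accuracy of the fitted classifier on the small in-memory data.
clf.score(X_train, y_train)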
With the fitted model, we can make predictions for each observation by mapping the clf.predict_proba
method over each block of X. The resulting computation can then be scheduled to run on your single machine or on your cluster.
In [82]:
yhat = X.map_blocks(clf.predict_proba, dtype=np.float64)  # lazy: builds a task graph, nothing is computed yet
yhat
Out[82]:
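At this point yhat is still lazy: it's a dask array backed by a task graph, and nothing runs until we call compute. By default the work runs on the local scheduler; if you'd rather run it on a cluster, one option (a sketch, with a placeholder scheduler address) is to connect a dask.distributed Client first:

# Optional: run the prediction tasks on a distributed cluster instead of
# the default local scheduler.
from dask.distributed import Client

client = Client()                            # local "cluster" of worker processes
# client = Client("scheduler-address:8786")  # or point at a real cluster (placeholder address)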
In [83]:
yhat[:5].compute()
Out[83]:
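The same pattern works for hard class labels. Since clf.predict returns a one-dimensional array per block, we tell map_blocks that the feature axis disappears; this is a sketch, and the int64 label dtype is an assumption based on make_classification's default integer labels.

# Sketch: map clf.predict over each block to get predicted class labels.
# predict returns a 1-D array per block, so drop the feature axis.
labels = X.map_blocks(clf.predict, dtype=np.int64, drop_axis=1)
labels[:5].compute()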