For some estimators, additional training data don't improve performance past a certain point: the learning curve levels off. You may have more data available, but using it in the fit step won't make any difference.
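One way to check whether you are in this regime is to look at a learning curve. The snippet below is a minimal sketch, not part of the example that follows; it uses a synthetic dataset and a plain LogisticRegression as placeholders for your own data and estimator.

# Sketch: use scikit-learn's learning_curve to see whether the
# validation score has leveled off as the training set grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# If the mean validation score is flat for the larger training sizes,
# adding more data to the fit step is unlikely to help.
print(sizes)
print(valid_scores.mean(axis=1))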
In these cases, you'll commonly fit a model on a dataset that fits in memory and then use it to predict on datasets that may not. Dask can make the prediction step easier and faster.
In : import numpy as np
     import dask.array as da
     from sklearn.datasets import make_classification
In : X_train, y_train = make_classification(
         n_features=2, n_redundant=0, n_informative=2,
         random_state=1, n_clusters_per_class=1, n_samples=1000)
In : N = 100
     X = da.concatenate([da.from_array(X_train, chunks=X_train.shape)
                         for _ in range(N)])
     y = da.concatenate([da.from_array(y_train, chunks=y_train.shape)
                         for _ in range(N)])
X_train and y_train are regular NumPy arrays that we'll use to fit the model.
X and y are large dask arrays (the training data repeated N times) that may not fit in memory.
In : from sklearn.linear_model import LogisticRegressionCV

In : clf = LogisticRegressionCV()
     clf.fit(X_train, y_train)
Out: LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                          fit_intercept=True, intercept_scaling=1.0, max_iter=100,
                          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
                          refit=True, scoring=None, solver='lbfgs', tol=0.0001,
                          verbose=0)
With the fitted model, we can predict class probabilities for each observation by mapping clf.predict_proba over each block of the dask array. The resulting computation can then be scheduled on your single machine or on a cluster (see the sketch at the end of this section).
In : yhat = X.map_blocks(clf.predict_proba, dtype=np.float64)
     yhat
Out: dask.array<predict_proba, shape=(100000, 2), dtype=float64, chunksize=(1000, 2)>
yhat is a lazy dask array, so nothing has been computed yet. Computing a small slice pulls back just those predictions:

In : yhat[:5].compute()
Out: array([[ 0.06448766,  0.93551234],
            [ 0.23144553,  0.76855447],
            [ 0.22557832,  0.77442168],
            [ 0.08833674,  0.91166326],
            [ 0.90324265,  0.09675735]])
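To run the prediction step on a cluster rather than on your single machine, you would typically connect a dask.distributed client before computing. The following is a minimal sketch, not part of the original session; the scheduler address is a placeholder, and X and clf are the dask array and fitted classifier from above.

# Sketch: the same map_blocks prediction, scheduled on a distributed cluster.
from dask.distributed import Client

# Replace the address with your scheduler's, or use Client() for a local cluster.
client = Client("tcp://scheduler-address:8786")

yhat = X.map_blocks(clf.predict_proba, dtype=np.float64)
yhat[:5].compute()  # blocks are now computed on the cluster's workers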