Scikit-learn already parallelizes many algorithms internally using joblib. You can schedule these computations to run on a distributed cluster using dask.distributed, which registers a plugin with joblib.
Parts of this example are taken from https://github.com/ogrisel/parallel_ml_tutorial/blob/master/notebooks/06%20-%20Distributed%20Model%20Selection%20and%20Assessment.ipynb
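For context, joblib's usual single-machine parallelism looks like the following. This is a minimal sketch of joblib's own API, not part of the original notebook; it fans a generator of delayed calls out over local worker processes.

from math import sqrt
from sklearn.externals import joblib

# run ten independent sqrt calls across two local processes
results = joblib.Parallel(n_jobs=2)(joblib.delayed(sqrt)(i) for i in range(10))

The backend plugin imported below lets this same machinery dispatch work to a dask cluster instead of local processes.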
In [1]:
from time import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.externals import joblib
import distributed.joblib  # registers the 'dask.distributed' backend with joblib
from distributed import Client
As usual, we connect to our client. Typically you would pass the address of your scheduler here; calling Client() with no arguments starts a local cluster for this demo.
In [2]:
client = Client()
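Against a real deployment, the connection would instead look like this; the address below is a placeholder for your scheduler's host and port:

# hypothetical address; replace with your scheduler's host:port
client = Client('tcp://192.0.2.10:8786')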
Let's load the data as usual.
In [3]:
# Load the faces dataset
data = fetch_olivetti_faces()
X = data.images.reshape((len(data.images), -1))
y = data.target
mask = y < 5 # Limit to 5 classes
X = X[mask]
y = y[mask]
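As a quick sanity check: the Olivetti dataset holds 400 images of 64x64 pixels, ten per subject, so keeping five classes leaves 50 flattened rows of 4096 features.

print(X.shape)  # (50, 4096)
print(y.shape)  # (50,)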
And create the classifier.
In [4]:
forest = ExtraTreesClassifier(n_estimators=1000,
                              max_features=128,
                              n_jobs=-1,
                              random_state=0)
And now we fit the model. The actual fitting step is the usual forest.fit(X, y). To run it on our cluster, we wrap it in the joblib.parallel_backend context manager.
In [6]:
# Build a forest and compute the pixel importances
t0 = time()
with joblib.parallel_backend('dask.distributed', scheduler_host=client.scheduler.address):
    forest.fit(X, y)
print("done in %0.3fs" % (time() - t0))
In [7]:
importances = forest.feature_importances_
importances = importances.reshape(data.images[0].shape)
# Plot pixel importances
plt.matshow(importances, cmap=plt.cm.hot)
plt.title("Pixel importances with forests of trees")
plt.show()