author: Jacob Schreiber
contact: jmschreiber91@gmail.com
pomegranate supports parallelization through a set of built-in functions based on joblib. All computationally intensive functions in pomegranate are implemented in Cython with the global interpreter lock (GIL) released, which allows multithreading to be used for efficient parallel processing.
These functions can all be parallelized simply by passing n_jobs=X to the method calls. This tutorial will demonstrate how to use those calls. First we'll look at a simple multivariate Gaussian mixture model and compare its performance to sklearn. Then we'll look at a hidden Markov model with Gaussian emissions, and lastly at a mixture of Gaussian HMMs. All of these can use pomegranate's built-in parallelization.
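As a quick preview of the pattern (a hypothetical call just to show where the parameter goes; the cells below do the real timing):
# Hypothetical illustration: add n_jobs to spread fitting and prediction over four threads.
model = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 3, X, n_jobs=4)
y_hat = model.predict_proba(X, n_jobs=4)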
Let's dive right in!
In [1]:
%matplotlib inline
import numpy
import matplotlib.pyplot as plt
import seaborn; seaborn.set_style('whitegrid')
from pomegranate import *
numpy.random.seed(0)
numpy.set_printoptions(suppress=True)
%load_ext watermark
%watermark -m -n -p numpy,scipy,pomegranate
In [2]:
def create_dataset(n_samples, n_dim, n_classes, alpha=1):
"""Create a random dataset with n_samples in each class."""
X = numpy.concatenate([numpy.random.normal(i*alpha, 1, size=(n_samples, n_dim)) for i in range(n_classes)])
y = numpy.concatenate([numpy.zeros(n_samples) + i for i in range(n_classes)])
idx = numpy.arange(X.shape[0])
numpy.random.shuffle(idx)
return X[idx], y[idx]
In [3]:
from sklearn.naive_bayes import GaussianNB
n, d, k = 50000, 1000, 2
X, y = create_dataset(n, d, k)
print("sklearn GNB")
%timeit GaussianNB().fit(X, y)
print()
print("pomegranate GNB")
%timeit NaiveBayes.from_samples(NormalDistribution, X, y)
print()
print("pomegranate GNB (2 jobs)")
%timeit NaiveBayes.from_samples(NormalDistribution, X, y, n_jobs=2)
print()
print("pomegranate GNB (4 jobs)")
%timeit NaiveBayes.from_samples(NormalDistribution, X, y, n_jobs=4)
It looks like scikit-learn is faster on a single thread, but adding threads improves pomegranate's performance, typically past scikit-learn's single-threaded time. This is likely because pomegranate has to create one distribution object per feature per class, whereas scikit-learn just stores the means and variances in a matrix. That approach is efficient when you know which distribution the naive Bayes is using, but it is not as modular.
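One upside of that modularity: because each feature gets its own distribution object, pomegranate can model different features with different distribution types, which GaussianNB cannot. A minimal sketch, assuming your pomegranate version accepts a list of distribution classes (one per feature) in from_samples:
# Sketch only (not part of the benchmark): one distribution class per feature.
X_mixed = numpy.hstack([numpy.random.normal(0, 1, size=(500, 1)),
                        numpy.random.exponential(1, size=(500, 1))])
y_mixed = numpy.random.randint(2, size=500)
model = NaiveBayes.from_samples([NormalDistribution, ExponentialDistribution], X_mixed, y_mixed)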
In [4]:
print("pomegranate GBC")
%timeit BayesClassifier.from_samples(MultivariateGaussianDistribution, X, y)
print()
print("pomegranate GBC (2 jobs)")
%timeit BayesClassifier.from_samples(MultivariateGaussianDistribution, X, y, n_jobs=2)
print()
print("pomegranate GBC (4 jobs)")
%timeit BayesClassifier.from_samples(MultivariateGaussianDistribution, X, y, n_jobs=4)
We see a nice speed improvement here as well. There is no model to compare against in scikit-learn, because its naive Bayes model assumes that all features are independent.
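One way to see that a full covariance structure is being learned is to inspect the fitted class-conditional distributions. A small sketch, assuming the usual [mean, covariance] layout of MultivariateGaussianDistribution.parameters (worth checking against your pomegranate version):
# Sketch: fit on a small dataset and look at one class's covariance matrix.
X_small, y_small = create_dataset(1000, 5, 2)
model = BayesClassifier.from_samples(MultivariateGaussianDistribution, X_small, y_small)
mean, cov = model.distributions[0].parameters  # assumed [mean, covariance] layout
print(numpy.asarray(cov).shape)  # expected: (5, 5), a full covariance matrix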
In [5]:
from sklearn.mixture import GaussianMixture
n, d, k = 1000000, 5, 3
X, y = create_dataset(n, d, k)
print("sklearn GMM")
%timeit GaussianMixture(n_components=k, covariance_type='full', max_iter=15, tol=1e-10).fit(X)
print()
print("pomegranate GMM")
%timeit GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, k, X, max_iterations=15, stop_threshold=1e-10)
print()
print("pomegranate GMM (2 jobs)")
%timeit GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, k, X, n_jobs=2, max_iterations=15, stop_threshold=1e-10)
print()
print("pomegranate GMM (4 jobs)")
%timeit GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, k, X, n_jobs=4, max_iterations=15, stop_threshold=1e-10)
It looks like pomegranate can be faster than sklearn at performing 15 iterations of EM on this dataset, and parallelization speeds things up even further. We can ignore the convergence warnings; here we're just timing a fixed amount of computational work. Different models might converge after a different number of iterations, and we don't want the speedup to come just from that.
Let's now take a look at the time it takes to make predictions with a GMM. Let's fit the model to a small amount of data and then predict a larger amount of data drawn from the same underlying distributions.
In [6]:
X, _ = create_dataset(1000, 25, 2)
a = GaussianMixture(k, n_init=1, max_iter=25).fit(X)
b = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, k, X, max_iterations=25)
X, _ = create_dataset(1000000, 25, 2)
print("sklearn GMM")
%timeit -n 1 a.predict_proba(X)
print()
print("pomegranate GMM")
%timeit -n 1 b.predict_proba(X)
print()
print("pomegranate GMM (4 jobs)")
%timeit -n 1 b.predict_proba(X, n_jobs=4)
It looks like pomegranate can be slightly slower than sklearn when using a single processor, but it can be parallelized to get faster performance.
To ensure that we're getting exactly the same results, only faster, let's subtract the two sets of predictions from each other and make sure that the sum is equal to 0.
In [7]:
print((b.predict_proba(X) - b.predict_proba(X, n_jobs=4)).sum())
Great, no difference between the two.
Let's now make sure that pomegranate and sklearn are learning basically the same thing. Let's fit both models to some 2-dimensional, 2-component data and make sure that they both extract the underlying clusters by plotting them.
In [8]:
X, y = create_dataset(1000, 2, 2, alpha=2)
a = GaussianMixture(2, n_init=1, max_iter=25).fit(X)
b = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X, max_iterations=25)
y1, y2 = a.predict(X), b.predict(X)
plt.figure(figsize=(16,6))
plt.subplot(121)
plt.title("sklearn clusters", fontsize=14)
plt.scatter(X[y1==0, 0], X[y1==0, 1], color='m', edgecolor='m')
plt.scatter(X[y1==1, 0], X[y1==1, 1], color='c', edgecolor='c')
plt.subplot(122)
plt.title("pomegranate clusters", fontsize=14)
plt.scatter(X[y2==0, 0], X[y2==0, 1], color='m', edgecolor='m')
plt.scatter(X[y2==1, 0], X[y2==1, 1], color='c', edgecolor='c')
plt.show()
It looks like the two are finding the same basic clusters. The algorithms are initialized a bit differently, so their results can't be compared exactly, but they are recovering roughly the same structure.
In [9]:
X = numpy.random.randn(1000, 500, 50)
print("pomegranate Gaussian HMM (1 job)")
%timeit -n 1 -r 1 HiddenMarkovModel.from_samples(NormalDistribution, 5, X, max_iterations=5)
print()
print("pomegranate Gaussian HMM (2 jobs)")
%timeit -n 1 -r 1 HiddenMarkovModel.from_samples(NormalDistribution, 5, X, max_iterations=5, n_jobs=2)
All we had to do was pass the n_jobs parameter to the fit function to get a speed improvement, and a substantial one at that! This is mostly because the HMM algorithms perform many more operations than the other models, and so spend the vast majority of their time with the GIL released. You may not notice as strong a speedup when using a MultivariateGaussianDistribution, because BLAS already uses multithreaded operations internally, even when only one job is specified.
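If you want to separate pomegranate's own parallelism from BLAS's internal threading when benchmarking, one option is to pin the BLAS thread count before NumPy is imported. Which environment variable applies depends on the BLAS your NumPy was built against; these are the common ones:
# Run at the very top of the notebook, before importing numpy or pomegranate.
import os
os.environ['OMP_NUM_THREADS'] = '1'  # OpenBLAS / OpenMP builds
os.environ['MKL_NUM_THREADS'] = '1'  # MKL builds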
Now let's look at the prediction function to make sure we're getting speedups there as well. You'll have to use a wrapper function to parallelize HMM predictions, because an HMM returns an annotated sequence rather than a single value the way a classic machine learning model might.
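The wrapper itself isn't shown in this excerpt; a minimal sketch built directly on joblib might look like the following (the threading backend is an assumption, chosen because pomegranate releases the GIL during these calls):
from joblib import Parallel, delayed

def predict_proba(model, X, n_jobs=1):
    # Dispatch each sequence to a worker and run the model's own predict_proba
    # on it, collecting one matrix of per-state posteriors per sequence.
    parallel = Parallel(n_jobs=n_jobs, backend='threading')
    return parallel(delayed(model.predict_proba)(x) for x in X)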
In [10]:
model = HiddenMarkovModel.from_samples(NormalDistribution, 5, X, max_iterations=2, verbose=False)
print("pomegranate Gaussian HMM (1 job)")
%timeit predict_proba(model, X)
print()
print("pomegranate Gaussian HMM (2 jobs)")
%timeit predict_proba(model, X, n_jobs=2)
Great, we're getting a speedup on that as well, albeit a minor one.
Hopefully you'll find pomegranate useful in your work! Parallelization should allow you to train complex models faster than before. Keep in mind, though, that there is overhead to parallel processing, so on smaller examples it may not help as much. In general, the bigger the dataset, the closer to a linear speedup you'll get with pomegranate.
If you have any interesting examples of how you've used pomegranate in your work, I'd love to hear about them. In addition I'd like to hear any feedback you may have on features you'd like to see. Please shoot me an email. Good luck!
In [ ]: