emcee supports parallelization out of the box. The algorithmic details are given in the paper, but the implementation is very simple: the parallelization is applied across the walkers in the ensemble at each step, and it must therefore be synchronized after each iteration. This means that you will really only benefit from this feature when your probability function is relatively expensive to compute.
The recommended method is to use IPython's parallel feature, but it's possible to use other "mappers" like the Python standard library's multiprocessing.Pool. The only requirement of the mapper is that it exposes a map method.
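To make that requirement concrete, here is a minimal sketch of an object satisfying the interface. SerialPool is hypothetical (it is not part of emcee3) and simply runs the map serially, but anything exposing the same map signature, serial or parallel, could be passed in its place:
# A hypothetical mapper: the only contract is a map(function, iterable)
# method, which is called to evaluate the walkers' log probabilities.
class SerialPool:
    def map(self, func, iterable):
        # Evaluate func on each item, one at a time (no parallelism).
        return list(map(func, iterable))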
As mentioned above, it's possible to parallelize your model using the standard library's multiprocessing package. Instead, I would recommend the pools.InterruptiblePool that is included with emcee3 because it is a simple thin wrapper around multiprocessing.Pool with support for a keyboard interrupt (^C)... you'll thank me later! If we wanted to use this pool, the final few lines from the example on the front page would become the following:
In [1]:
import emcee3
import numpy as np

# The model: the log probability of an isotropic 10-dimensional Gaussian.
def log_prob(x):
    return -0.5 * np.sum(x ** 2)

ndim, nwalkers = 10, 100

# Passing the pool to the Ensemble parallelizes the log probability
# evaluations across the walkers at each step.
with emcee3.pools.InterruptiblePool() as pool:
    ensemble = emcee3.Ensemble(log_prob, np.random.randn(nwalkers, ndim), pool=pool)
    sampler = emcee3.Sampler()
    sampler.run(ensemble, 1000)
To distribute emcee3 across the nodes of a cluster, you'll need to use MPI. This can be done with the MPIPool from schwimmbad. To use it, you'll need to install the dependency mpi4py. Otherwise, the code is almost the same as the multiprocessing example above; the main change is the definition of the pool:
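Here is a sketch of what the full script might look like. The model setup is copied from the example above, and the MPIPool boilerplate follows the pattern recommended by schwimmbad; treat it as a starting point rather than a definitive implementation:
import sys

import emcee3
import numpy as np
from schwimmbad import MPIPool

def log_prob(x):
    return -0.5 * np.sum(x ** 2)

ndim, nwalkers = 10, 100

with MPIPool() as pool:
    # Worker processes block here, executing tasks sent by the master,
    # and exit when the master is finished.
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    # Only the master process sets up and runs the sampler.
    ensemble = emcee3.Ensemble(log_prob, np.random.randn(nwalkers, ndim), pool=pool)
    sampler = emcee3.Sampler()
    sampler.run(ensemble, 1000)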
The if not pool.is_master() block is crucial; otherwise, the code will hang at the end of execution. To run this code, you would execute something like the following:
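Assuming the script above is saved as script.py (the filename and the process count are placeholders to adapt to your setup):
mpiexec -n 4 python script.py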
ipyparallel is a flexible and powerful framework for running distributed computation in Python. It works on a single machine with multiple cores in the same way as it does on a huge compute cluster, and in both cases it is very efficient!
To use IPython parallel, make sure that you have a recent version of ipyparallel installed (see the ipyparallel docs) and start up the cluster by running:
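For example (the engine count is a placeholder; set it to the number of cores you want to use):
ipcluster start -n 4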
Then, run the following:
In [12]:
# Connect to the cluster.
from ipyparallel import Client
rc = Client()
dv = rc.direct_view()

# Run the imports on the cluster too.
with dv.sync_imports():
    import emcee3
    import numpy

# Define the model.
def log_prob(x):
    return -0.5 * numpy.sum(x ** 2)

# Distribute the model to the nodes of the cluster.
dv.push(dict(log_prob=log_prob), block=True)

# Set up the ensemble with the IPython "DirectView" as the pool.
ndim, nwalkers = 10, 100
ensemble = emcee3.Ensemble(log_prob, numpy.random.randn(nwalkers, ndim), pool=dv)

# Run the sampler in the same way as usual.
sampler = emcee3.Sampler()
ensemble = sampler.run(ensemble, 1000)
There is a significant overhead incurred when using any of these parallelization methods, so for this simple example the parallel version is actually slower. This effect will quickly be offset if your probability function is computationally expensive.
One major benefit of using ipyparallel is that it can also be used identically on a cluster with MPI if you have a really big problem. The Python code stays exactly the same; the only change that you would have to make is to start the cluster using:
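Something like the following (the engine count is again a placeholder, and some clusters will need a batch launcher instead; check the ipyparallel docs for the available options):
ipcluster start --engines=MPI -n 4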
Take a look at the documentation for more details of all of the features available in ipyparallel.