This post describes a simple yet flexible way to deploy an IPython.parallel cluster across multiple EC2 instances using salt and a little bit of Vagrant. The final output is one instance running the IPython notebook and ipcontroller, acting as the salt-master, plus three instances acting as salt-minions, each one running two ipengine processes (six engines in total); see the IPython.parallel docs for information on those commands.
In a previous post I created a one-liner for deploying an IPython notebook to the cloud. Since then I have been refactoring and advancing the concept, and it became my own datasciencebox, so it was natural to include the code for creating the ipcluster in the same project.
Let me begin by saying that I know about StarCluster: I have used it (for this same task) and I believe it is the best way to create a cluster for parallel computing, but it lacks a few features, one example being support for a requirements.txt.
Why not Hadoop? Again, I use Hadoop every week, but to me it seems more natural to do the analysis in your tool of choice (Python or R) and then scale it using the same (or a similar) tool. It doesn't make a lot of sense to do the analysis in a tool just because it scales if it lacks the capabilities you need, for example the excellent interactivity of IPython.
The setup is just a few steps:
1. git clone the datasciencebox repo
2. Put your AWS keypair in the repo as ipcluster.pem; this is so Vagrant can copy the keypair to the cloud
3. cd to that directory, open the Vagrantfile and fill in the values inside config.vm.provider, mainly the AWS credentials
4. vagrant up --provider=aws
Now you have an IPython notebook running on the new instance on port 8888.
First, ssh into the instance by running vagrant ssh.
You need to know the private address of the instance. Use this one-liner to get it: python -c 'import socket; print socket.gethostbyname(socket.gethostname())'
It should print something like: 10.2.182.86
Start the ipcontroller by running the ipcluster salt state; you need to pass a couple of values via the pillar, mainly the AWS credentials and the EC2 instance settings:
sudo salt-call state.sls ipcluster pillar='{ipcluster: {master: {MASTER_ADDRESS}, keyname: {AWS_KEYPAIR_NAME}, securitygroup: {EC2_SECURITY_GROUP_NAME}}, aws: {access_key: {AWS_ACCESS_KEY}, secret_key: {AWS_SECRET_KEY}}}'
The previous command will install the dependencies of salt-cloud and start the ipcontroller process.
Now you need to create the instances that are going to run the ipengines:
sudo salt-cloud -p base_ec2 ipython-minion-X
where X can be a number or a string; I ran it with 1 and 2 for this example. This command will create each new instance, install the salt-minion and connect it to the salt-master.
To test that the minions are running, run:
sudo salt '*' test.ping
ipython-minion-1:
True
ipython-minion-2:
True
Now start the ipengine on each minion by running the ipcluster-minion salt state; this will also provision the same Python environment using anaconda. Run:
sudo salt '*' saltutil.sync_all
and then:
sudo salt '*' state.sls ipcluster-minion
Everything should be in place now; go to the notebook and connect the client:
In [1]:
from IPython.parallel import Client
In [2]:
client = Client()
In [3]:
len(client)
Out[3]:
4
We can see that there are four engines, since by default there are 2 ipengines per instance. We can check the address and pid of each engine:
In [4]:
def where_am_i():
    import os
    import socket
    return "In process with pid {0} on host: '{1}'".format(os.getpid(), socket.gethostname())
In [5]:
where_am_i()
Out[5]:
In [6]:
direct_view = client.direct_view()
In [7]:
where_am_i_direct_results = direct_view.apply(where_am_i)
In [8]:
where_am_i_direct_results.get()
Out[8]:
We can see that there are two processes on each instance.
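So far we have used the direct view, which addresses all engines at once. IPython.parallel also provides a load-balanced view that sends each task to whichever engine is free; here is a minimal sketch, where the task function is just for illustration:
def task(x):
    # Runs on an engine: report which process handled this item
    import os
    return (x, os.getpid())

lbview = client.load_balanced_view()
# The scheduler spreads the 8 tasks across the available engines
results = lbview.map(task, range(8))
results.get()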
Now let's add one more machine by running:
sudo salt-cloud -p base_ec2 ipython-minion-3
sudo salt '*' saltutil.sync_all
sudo salt '*' state.sls ipcluster-minion
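The new engines can take a moment to register with the controller, so it is worth waiting before asking the client again. A minimal sketch, assuming we expect 6 engines (2 per each of the 3 minions):
import time

# Block until all 6 engines have registered with the controller
while len(client) < 6:
    time.sleep(5)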
In [9]:
len(client)
Out[9]:
6
In [10]:
where_am_i_direct_results = direct_view.apply(where_am_i)
where_am_i_direct_results.get()
Out[10]:
There was no need to restart the notebook or anything; IPython.parallel took care of everything.
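If you want to double check, the client exposes the ids of the registered engines; the values below are just illustrative:
client.ids  # e.g. [0, 1, 2, 3, 4, 5], two engines per instance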
Since we bootstrapped the same Python environment using anaconda, we can use the scientific libraries, for example numpy:
In [15]:
%%px
import numpy as np
A = np.random.random((2,2))
ev = np.linalg.eigvals(A)
ev.max()
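The direct view can also move numpy data between the notebook and the engines: scatter partitions an array across the engines and gather brings the pieces back together. A minimal sketch, where the array and the name 'a' are just for illustration:
import numpy as np

# Split the array into one chunk per engine, stored as 'a' on each engine
direct_view.scatter('a', np.arange(16.0), block=True)
# Work on each chunk in parallel
direct_view.execute('a = a * 2', block=True)
# Reassemble the chunks into a single array in the notebook
direct_view.gather('a', block=True)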
It was relatively simple to get 6 workers running on the cloud, and while it is not as simple as creating a config file and running one command as with StarCluster, most of the steps could easily be wrapped in a command line utility.
It was also simple to write, since it was almost no code, just salt states; I also learned about salt-cloud and like salt even more now.
Finally, it is worth saying that there are dozens of examples of what you can do with IPython.parallel on the web; one of my favorites is by Olivier Grisel, where he does machine learning with sklearn.