This post describes a simple yet flexible implementation on how to deploy an IPython.parallel cluster in multiple EC2 instances using salt and a little bit of Vagrant. The final output will be one instance running the IPython notebook, ipcontroller and acting as the salt-master, also 6 instances each one running ipengine and being salt-minions, see the IPython.parallel docs for information on those commands.

In a previous post I created a one-liner for deploying an ipython notebook to the cloud, after that I have been refactoring and advancing the concept and it became my own datasciencebox, it was natural to include the code for creating the ipcluster code in the same project.

Why

Let me begin saying that I know about starcluster I have used it (for this same task) and I believe is the best way to create cluster for doing parallel computing but it lacks of a few features:

  1. Only EC2 support: I use EC2 and a couple of AWS services everyday but what if I want to move to rackspace or any other cloud? So instead of relying on boto I use salt-cloud that relies on apache libcloud
  2. Dependency of installing the scientific libraries from pypi: I use anaconda on my mac and I wanted to use the same versions on the cloud, using the salt state I wrote previously this was an easy task. With this having the latest version is as simple as changing the requirements.txt
  3. Plugins are python classes (see the starcluster ipython plugin for example): While this is completely fine once you learn configuration management you don't want to come back, a salt state is much more easier to create and mantain

Why not hadoop? Again, I use hadoop every week, but for me it seams more natural to do the analysis on your tool of choice (python or R) and then scale it using the same (or similar) tool, it doesn't make a lot of sense to do the analysis on a tool just becuase it scales if it lacks the capabilities you need, for example the excelente interactivity of IPython.

How

Requirements

  1. Vagrant (+aws-plugin) to create the an EC2 instance that runs the ipython notebook. Note you could on run this on Rackspace by using vagrant-rackspace
  2. git clone the datasciencebox repo
  3. AWS account credentials
  4. Copy your EC2 keypair in the root of the repo of step 2 and name it ipcluster.pem, this is so Vagrant can copy the keypair to the cloud

Let's get started

  1. cd to that directory open the Vagrantfile and fill the values inside config.vm.provider, mainly the AWS credentials.
  2. vagrant up --provider=aws

Now you have an IPython notebook running in the new instance on port 8888

ipcluster

First, shh into the instance by running vagrant ssh.

You need to know the private address of the instance. Use this one-liner to get it: python -c 'import socket; print socket.gethostbyname(socket.gethostname())' it should print something like: 10.2.182.86

Start the ipcontroller by running the ipcluster salt state, you need to add a couple of values, mainly the aws and ec2 instances settings:

sudo salt-call state.sls ipcluster pillar='{ipcluster: {master: {MASTER_ADDRESS}, keyname: {AWS_KEYPAIR_NAME}, securitygroup: {EC2_SECURITY_GROUP_NAME}}, aws: {access_key: {AWS_ACCESS_KEY}, secret_key: {AWS_SECRET_KEY}}}'

The previous command will install the dependencies of salt-cloud and start the ipcontroller process.

Now you need to create the instances that are going to run the ipengine: sudo salt-cloud -p base_ec2 ipython-minion-X where X can be a number or string, i ran it with 1 and 2 for this example. This command will create the new instances, install the salt-minion and connect it to the salt-master.

To test that the minions are running run:

↪ sudo salt '*' test.ping
ipython-minion-1:
    True
ipython-minion-2:
    True

Now start the ipengine on each minion by running the ipcluster-minion salt state, this will also provision the same python environment using anaconda. Run sudo salt '*' saltutil.sync_all and after sudo salt '*' state.sls ipcluster-minion

Everything should be in place now, go to the notebook and connect to the clients:


In [1]:
from IPython.parallel import Client

In [2]:
client = Client()

In [3]:
len(client)


Out[3]:
4

We can see that there are four clients since by default it will be 2 ipengines per instance. We can check the address and pid of each engine:


In [4]:
def where_am_i():
    import os
    import socket
    
    return "In process with pid {0} on host: '{1}'".format(os.getpid(), socket.gethostname())

In [5]:
where_am_i()


Out[5]:
"In process with pid 15314 on host: 'ip-10-151-123-100'"

In [6]:
direct_view = client.direct_view()

In [7]:
where_am_i_direct_results = direct_view.apply(where_am_i)

In [8]:
where_am_i_direct_results.get()


Out[8]:
["In process with pid 14018 on host: 'domU-12-31-39-16-52-7E'",
 "In process with pid 14019 on host: 'domU-12-31-39-16-52-7E'",
 "In process with pid 13956 on host: 'ip-10-110-86-251'",
 "In process with pid 13955 on host: 'ip-10-110-86-251'"]

We can see that there are two process on each instance.

Now lets add one more machine by running:

  1. sudo salt-cloud -p base_ec2 ipython-minion-3
  2. sudo salt '*' saltutil.sync_all
  3. sudo salt '*' state.sls ipcluster-minion

In [9]:
len(client)


Out[9]:
6

In [10]:
where_am_i_direct_results = direct_view.apply(where_am_i)
where_am_i_direct_results.get()


Out[10]:
["In process with pid 14617 on host: 'domU-12-31-39-16-52-7E'",
 "In process with pid 14618 on host: 'domU-12-31-39-16-52-7E'",
 "In process with pid 14554 on host: 'ip-10-110-86-251'",
 "In process with pid 14553 on host: 'ip-10-110-86-251'",
 "In process with pid 14006 on host: 'ip-10-191-45-204'",
 "In process with pid 14007 on host: 'ip-10-191-45-204'"]

There was no need to restart the notebook or anything, IPython.parallel took care of everything.

Since we bootstrapped the same python environment using anaconda we can use the scientific libraries, for example numpy:


In [15]:
%%px
import numpy as np
A = np.random.random((2,2))
ev = numpy.linalg.eigvals(A)
ev.max()


Out[4:3]: 1.0222318748650969
Out[5:3]: 1.138421700991769
Out[6:3]: 1.0188265649715529
Out[7:3]: 0.74439354047097206
Out[8:3]: 1.468974252745251
Out[9:3]: 0.92031189875595965

Conclusions

It was relatively simple to get 6 workers running on the cloud and while it is not as simple as creating a config file and run one command as startcluster most of the stuff can be easily wrapped in a command line utility.

It was also simple to write this code since it was no code just salt states, I also learned about salt-cloud and like salt even more now.

Finally is worth saying that there are dozens of examples of what you can do with IPython.parallel on the web, one of my favorites is by Olivier Grisel where he does machine learning with sklearn.