IPython is a tool for interactive and exploratory computing. We have seen that IPython's kernel provides a mechanism for interactive remote computation, and we have extended this same mechanism for interactive remote parallel computation, simply by having multiple kernels.
At a high level, there are three basic components to parallel IPython:

- Engine: the Python interpreter that executes your code, essentially a remote IPython kernel.
- Controller: the Hub and Schedulers that connect clients to engines and route work between them.
- Client: the interface for connecting to a cluster and submitting work.

These components live in the IPython.parallel package and are installed with IPython.
This leads to a usage model where you can create as many engines as you want per compute node, and then control them all from your clients via a central 'controller' object that encapsulates the hub and schedulers:
The Controller is a collection of processes: one Hub and a set of Schedulers.
Together, these processes provide a single connection point for your clients and engines. Each Scheduler is a small GIL-less function in C provided by pyzmq (the Python load-balancing scheduler being an exception).
The Hub can be viewed as an über-logger, which monitors all communication between clients and engines, and can log to a database (e.g. SQLite or MongoDB) for later retrieval or resubmission. The Hub is not involved in execution in any way, and a slow Hub cannot slow down submission of tasks.
There is one primary object, the Client, for connecting to a cluster. For each execution model there is a corresponding View, and you determine how your work should be executed on the cluster by creating different views or manipulating attributes of views.
The two basic views:

- The DirectView class, for explicitly running code on particular engine(s).
- The LoadBalancedView class, for destination-agnostic scheduling.

You can use as many views of each kind as you like, all at the same time.
To follow along with this tutorial, you will need to start the IPython controller and some IPython engines. The quickest way is to visit the 'clusters' tab in the notebook dashboard and start some engines with the 'default' profile, or to use the ipcluster command:
$ ipcluster start -n 4
There isn't time to go into it here, but ipcluster can be used to start engines and the controller with various batch systems, including SGE, PBS, LSF, MPI, SSH, and Windows HPC Server.
More information on starting and configuring the IPython cluster can be found in the IPython.parallel docs.
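For example, assuming you have already created and configured a parallel profile for your batch system (the profile name 'sge' below is illustrative), the same command dispatches to that launcher:

```shell
# Create parallel config files for a new profile (one-time setup),
# then edit ipcluster_config.py to select your batch launcher.
ipython profile create sge --parallel

# Start 16 engines using that profile; 'sge' is an illustrative name.
ipcluster start -n 16 --profile=sge
```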
Once you have started the IPython controller and one or more engines, you are ready to use the engines to do something useful.
To make sure everything is working correctly, let's do a very simple demo:
In [1]:
from IPython import parallel
rc = parallel.Client()
rc.block = True
In [2]:
rc.ids
Out[2]:
[0, 1, 2, 3]
Let's define a simple function
In [3]:
def mul(a, b):
    return a * b
In [4]:
mul(5,6)
Out[4]:
30
In [5]:
dview = rc[:]
dview
Out[5]:
<DirectView [0, 1, 2, 3]>
In [6]:
e0 = rc[0]
e0
Out[6]:
<DirectView 0>
What does it look like to call this function remotely? Just turn f(*args, **kwargs) into view.apply(f, *args, **kwargs)!
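As a purely local sketch of that pattern (the LocalView class below is made up for illustration, not part of IPython.parallel), apply just forwards its arguments to the function; the real views have the same signature but run the function on one or more engines:

```python
# Hypothetical stand-in for a view: apply(f, *args, **kwargs)
# simply calls f(*args, **kwargs). IPython.parallel's real views
# accept the same arguments, but execute f remotely.
class LocalView:
    def apply(self, f, *args, **kwargs):
        return f(*args, **kwargs)

def mul(a, b):
    return a * b

view = LocalView()
print(view.apply(mul, 5, 6))  # same answer as mul(5, 6): 30
```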
In [7]:
e0.apply(mul, 5, 6)
Out[7]:
30
And the same thing in parallel?
In [8]:
dview.apply(mul, 5, 6)
Out[8]:
[30, 30, 30, 30]
Python has a builtin map for calling a function on a sequence of arguments:
In [9]:
map(mul, range(1,10), range(2,11))
Out[9]:
[2, 6, 12, 20, 30, 42, 56, 72, 90]
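One caveat if you are on Python 3: the builtin map returns a lazy iterator rather than a list, so you need to wrap it in list() to materialize the values:

```python
# In Python 3, map() is lazy; wrap it in list() to see the results.
def mul(a, b):
    return a * b

results = list(map(mul, range(1, 10), range(2, 11)))
print(results)  # [2, 6, 12, 20, 30, 42, 56, 72, 90]
```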
So how do we do this in parallel?
In [10]:
dview.map(mul, range(1,10), range(2,11))
Out[10]:
[2, 6, 12, 20, 30, 42, 56, 72, 90]
We can also run code in strings with execute:
In [11]:
dview.execute("import os")
dview.execute("a = os.getpid()")
Out[11]:
Treating the view as a dict lets you access the remote namespace:
In [12]:
dview['a']
Out[12]:
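A local sketch of what that dict access means, with plain dicts standing in for each engine's namespace (the pid values are invented for illustration):

```python
# Plain dicts standing in for four engines' namespaces after
# dview.execute("a = os.getpid()"); the pid values are made up.
engine_namespaces = [{'a': 1001}, {'a': 1002}, {'a': 1003}, {'a': 1004}]

# dview['a'] gathers 'a' from every engine's namespace:
a_values = [ns['a'] for ns in engine_namespaces]
print(a_values)  # [1001, 1002, 1003, 1004]
```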
When you do async execution, the calls return an AsyncResult object immediately.
In [13]:
def wait_here(t):
    import time, os
    time.sleep(t)
    return os.getpid()
In [14]:
ar = dview.apply_async(wait_here, 2)
In [15]:
ar.get()
Out[15]:
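The AsyncResult API here closely mirrors the one in the standard library's multiprocessing.pool, so you can get a feel for it without a cluster; this local sketch substitutes a ThreadPool for the remote engines (and returns t * 10 rather than a pid, purely for illustration):

```python
# Local analogy: multiprocessing.pool's AsyncResult has the same
# apply_async(...) -> ar; ar.get() shape as IPython.parallel's.
from multiprocessing.pool import ThreadPool
import time

def wait_here(t):
    time.sleep(t)
    return t * 10

pool = ThreadPool(2)
ar = pool.apply_async(wait_here, (0.1,))  # returns immediately
result = ar.get()                         # blocks until the call finishes
print(result)  # 1.0
pool.close()
pool.join()
```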