One of the real strengths of ipyrad is the advanced parallelization it uses to distribute work across arbitrarily large computing clusters, and to do so even while you are working interactively and remotely. This is accomplished through the ipyparallel package, which is tightly linked to ipython and jupyter. When you run the command-line ipyrad program all of the work of ipyparallel is hidden under the hood, which keeps the program user-friendly. When using the ipyrad API, however, we take the alternative approach of instructing users to become familiar with the ipyparallel library so they can better understand how work is being distributed on their system. This allows more flexible parallelization setups, and it also makes it easier to take advantage of ipyparallel for parallelizing downstream analyses, many examples of which appear in the analysis-tools section of the ipyrad documentation.
In [1]:
## conda install ipyrad -c ipyrad
Starting an ipcluster instance

The tricky aspect of using ipyparallel inside of Python (i.e., in a jupyter-notebook) is that you first need to start a cluster instance by running a command-line program called ipcluster (alternatively, you can install an extension that makes it possible to start ipcluster from a tab in jupyter notebooks, but the command-line tool is simpler). This command starts separate Python "kernels" (instances) running on the cluster/computer and ensures that they can all talk to each other. Using advanced options you can even connect kernels across multiple computers or nodes on an HPC cluster, which we'll demonstrate.
In [2]:
## ipcluster start --n=4
In [18]:
## jupyter-notebook
In [3]:
import ipyparallel as ipp
print(ipp.__version__)
In [4]:
## connect to ipcluster using default arguments
ipyclient = ipp.Client()
## count how many engines are connected
print(len(ipyclient), 'cores')
Profiles in ipcluster and how ipyrad uses them

Below we show an example of a common error raised when the Client cannot find the ipcluster instance, in this case because it has a different profile name. When you start an ipcluster instance it keeps track of itself using a specific name (its profile). The default profile is an empty string (""), so this is the profile that the ipp.Client() command will look for by default (and likewise the default profile that ipyrad will look for). If you change the name of the profile then you have to indicate this, like below.
In [22]:
## example connecting to a named profile
mpi = ipp.Client(profile="MPI")
Starting an ipcluster instance with a specific profile

By using separate profiles you can have multiple ipcluster instances running at the same time (possibly using different options or connected to different nodes of an HPC cluster) and ensure that you connect to each one distinctly. Start a new instance and then connect to it like below. Here we give it a profile name and we also tell it to initiate the engines with the MPI initiator by using the --engines flag.
In [5]:
## ipcluster start --n=4 --engines=MPI --profile=MPI
In [6]:
## now you should be able to connect to the MPI profile
mpi = ipp.Client(profile="MPI")
## print mpi info
print(len(mpi), 'cores')
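Once connected, the Client object can also report the IDs of the engines it sees, which is a quick sanity check. A small sketch using the standard Client.ids attribute:

In [ ]:
## quick sanity check: list the engine ids visible to this client
print(mpi.ids)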
Using the Client() object to distribute jobs

For full details of how this works you can read the ipyparallel documentation. Here I will focus on the tools in ipyrad that we have developed to utilize Client objects to distribute work. In general, all you have to do is provide the ipyclient object to the .run() function and ipyrad will take care of the rest.
In [7]:
## the ipyclient object is simply a view to the engines
ipyclient
Out[7]:
In [8]:
import ipyrad as ip
import ipyrad.analysis as ipa
Here we create an Assembly and when we call the .run()
command we provide a specific ipyclient
object as the target to distribute work on. If you do not provide this option then by default ipyrad
will look for an ipcluster
instance running on the default profile ("").
In [9]:
## run steps of an ipyrad assembly on a specific ipcluster instance
data = ip.Assembly("example")
data.set_params("sorted_fastq_path", "example_empirical_rad/*.fastq.gz")
data.run("1", ipyclient=mpi)
In [10]:
## run ipyrad analysis tools on a specific ipcluster instance
tet = ipa.tetrad(
    name="test",
    data="./analysis-ipyrad/pedic-full_outfiles/pedic-full.snps.phy",
    mapfile="./analysis-ipyrad/pedic-full_outfiles/pedic-full.snps.map",
    nboots=20);
tet.run(ipyclient=mpi)
In [18]:
## get a load-balanced view of our ipyclient
lbview = ipyclient.load_balanced_view()

## a dict to store results
res = {}

## define your func
def my_sum_func(x, y):
    return x, y, sum([x, y])

## submit jobs to cluster with arguments
import random
for job in range(10):
    x = random.randint(0, 10)
    y = random.randint(0, 10)
    ## submitting a job returns an async result object
    ## ('async' is a reserved word in Python 3, so we avoid it as a name)
    rasync = lbview.apply(my_sum_func, x, y)
    ## store results
    res[job] = rasync

## block until all jobs finish
ipyclient.wait()
Out[18]:
In [19]:
## the results objects
res
Out[19]:
In [20]:
for ridx in range(10):
    print("job: {}, result: {}".format(ridx, res[ridx].get()))
For a multi-node setup, see the HPC tunneling tutorial for instructions: http://ipyrad.readthedocs.io/HPC_Tunnel.html
In [ ]:
## an example command to connect to 80 cores across multiple nodes using MPI
## ipcluster start --n=80 --engines=MPI --ip='*' --profile='MPI80'
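Once the engines are connected you can confirm that they are actually spread across nodes by asking each engine for its hostname. A minimal sketch using a direct view over all engines, assuming the "MPI80" profile above has been started:

In [ ]:
## ask every engine which host it is running on
import socket
import ipyparallel as ipp

mpi80 = ipp.Client(profile="MPI80")
hosts = mpi80[:].apply_sync(socket.gethostname)
print(set(hosts))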