One of the real strengths of ipyrad is its advanced parallelization, which distributes work across arbitrarily large computing clusters and does so even when you are working interactively and remotely. This is done through the ipyparallel package, which is tightly linked to IPython and Jupyter. When you run the command-line ipyrad program all of the ipyparallel machinery is hidden under the hood, which keeps the program user-friendly. When using the ipyrad API, however, we take the alternative approach of asking users to become familiar with the ipyparallel library so they better understand how work is being distributed on their system. This allows more flexible parallelization setups, and it also makes it easier to use ipyparallel for parallelizing downstream analyses, many examples of which are in the analysis-tools section of the ipyrad documentation.
In [1]:
## conda install ipyrad -c ipyrad
ipcluster instance

The tricky aspect of using ipyparallel inside of Python (i.e., in a jupyter-notebook) is that you need to first start a cluster instance by running a command-line program called ipcluster (alternatively, you can install an extension that makes it possible to start ipcluster from a tab in jupyter notebooks, but I feel the command line tool is simpler). This command will start separate python "kernels" (instances) running on the cluster/computer and ensure that they can all talk to each other. Using advanced options you can even connect kernels across multiple computers or nodes on an HPC cluster, which we'll demonstrate.
In [2]:
## ipcluster start --n=4
In [18]:
## jupyter-notebook
In [3]:
import ipyparallel as ipp
print(ipp.__version__)
In [4]:
## connect to ipcluster using default arguments
ipyclient = ipp.Client()
## count how many engines are connected
print(len(ipyclient), 'cores')
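As a quick extra check (a small sketch, not in the original notebook), the Client object's .ids attribute lists the integer id of each connected engine, which is a convenient way to confirm the cluster is up before submitting work.

## sketch: list the ids of the connected engines
print(ipyclient.ids)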
Profiles in ipcluster and how ipyrad uses them

Below we show an example of a common error raised when the Client cannot find the ipcluster instance, in this case because it has a different profile name. When you start an ipcluster instance it keeps track of itself using a specific name (its profile). The default profile is an empty string (""), and this is the profile that the ipp.Client() command looks for by default (and likewise the default profile that ipyrad looks for). If you change the name of the profile then you have to indicate this, like below.
In [22]:
## example connecting to a named profile
mpi = ipp.Client(profile="MPI")
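The traceback from that failed call is not shown here. As a rough sketch (not from the original notebook), you could catch the failure like this; the exact exception type varies with your ipyparallel version, but OSError covers the common "connection file not found" case.

## sketch: handle the error raised when no ipcluster instance exists
## for the requested profile (exception type may vary by version)
try:
    mpi = ipp.Client(profile="MPI")
except OSError as err:
    print("no ipcluster instance found for profile 'MPI':", err)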
ipcluster instance with a specific profile

By using separate profiles you can have multiple ipcluster instances running at the same time (possibly with different options, or connected to different nodes of an HPC cluster) while ensuring that you connect to each one distinctly. Start a new instance and then connect to it like below. Here we give it a profile name and we also tell it to start the engines with the MPI launcher by using the --engines flag.
In [5]:
## ipcluster start --n=4 --engines=MPI --profile=MPI
In [6]:
## now you should be able to connect to the MPI profile
mpi = ipp.Client(profile="MPI")
## print mpi info
print(len(mpi), 'cores')
client() object to distribute jobs

For full details of how this works you can read the ipyparallel documentation. Here I will focus on the tools in ipyrad that we have developed to use Client objects to distribute work. In general, all you have to do is provide the ipyclient object to the .run() function and ipyrad will take care of the rest.
In [7]:
## the ipyclient object is simply a view to the engines
ipyclient
Out[7]:
In [8]:
import ipyrad as ip
import ipyrad.analysis as ipa
Here we create an Assembly, and when we call its .run() command we provide a specific ipyclient object as the target on which to distribute work. If you do not provide this option then by default ipyrad will look for an ipcluster instance running on the default profile ("").
In [9]:
## run steps of an ipyrad assembly on a specific ipcluster instance
data = ip.Assembly("example")
data.set_params("sorted_fastq_path", "example_empirical_rad/*.fastq.gz")
data.run("1", ipyclient=mpi)
In [10]:
## run ipyrad analysis tools on a specific ipcluster instance
tet = ipa.tetrad(
    name="test",
    data="./analysis-ipyrad/pedic-full_outfiles/pedic-full.snps.phy",
    mapfile="./analysis-ipyrad/pedic-full_outfiles/pedic-full.snps.map",
    nboots=20)
tet.run(ipyclient=mpi)
In [18]:
## get a load-balanced view of our ipyclient
lbview = ipyclient.load_balanced_view()

## a dict to store results
res = {}

## define your func
def my_sum_func(x, y):
    return x, y, sum([x, y])

## submit jobs to the cluster with arguments
import random
for job in range(10):
    x = random.randint(0, 10)
    y = random.randint(0, 10)
    ## submitting a job returns an AsyncResult object
    ## ('async' is a reserved word in Python 3, so we call it rasync)
    rasync = lbview.apply(my_sum_func, x, y)
    ## store the result object
    res[job] = rasync

## block until all jobs finish
ipyclient.wait()
Out[18]:
In [19]:
## the results objects
res
Out[19]:
In [20]:
for ridx in range(10):
    print("job: {}, result: {}".format(ridx, res[ridx].get()))
For instructions on the multi-node setup, see the HPC tunneling tutorial: http://ipyrad.readthedocs.io/HPC_Tunnel.html
In [ ]:
## an example command to connect to 80 cores across multiple nodes using MPI
## ipcluster start --n=80 --engines=MPI --ip='*' --profile='MPI80'
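Once that multi-node instance is running, a simple sketch (assuming the 'MPI80' profile name from the command above) for confirming that the engines really span several nodes is to ask each engine for its hostname.

## sketch: connect to the multi-node profile and check which hosts the engines are on
import socket
import ipyparallel as ipp
mpi80 = ipp.Client(profile="MPI80")
hostnames = mpi80[:].apply_sync(socket.gethostname)
print(len(mpi80), "engines on", len(set(hostnames)), "hosts")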