Welcome! This tutorial will introduce you to the basic and advanced features of working with the ipyrad API to assemble RADseq data in Python. The API offers many advantages over the command-line interface, but requires a little more work up front to learn the necessary tools for using it. This includes knowing some very rudimentary Python, and setting up a Jupyter notebook.
In [1]:
import ipyrad as ip
This tutorial is an example of a Jupyter notebook. If you've installed ipyrad then you already have jupyter installed as well, and you can start it from the command line (type jupyter-notebook) to launch an interactive notebook like this one. For some background on how Jupyter notebooks work I would recommend searching on Google or watching an introductory YouTube video. Once you have the hang of it, follow along with this code in your own notebook.
We have a separate tutorial about using Jupyter notebooks and connecting them to a computing cluster (see here). For this notebook I will assume that you are running this code in a Jupyter notebook, and that you have an ipcluster instance running either locally or remotely on a cluster. If an ipcluster instance is running with its default settings (default profile) then ipyrad will automatically use all of the cores available on that cluster instance.
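If you don't have an ipcluster instance running yet, a minimal way to start one from inside a notebook is sketched below; the engine count of 4 is arbitrary, and the same command (without the leading '!') can be run in a separate terminal instead. The --daemonize flag keeps the cluster running in the background.
## start a local ipcluster instance with 4 engines in the background
!ipcluster start --n=4 --daemonize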
The only library we need to import is ipyrad. The import command is usually the first code called in a Python script to load any necessary packages. In the code below we use a common Python convention to tell it that we want to refer to ipyrad simply as ip. This saves us a little typing, since we will use the name many times. Below that, we print the version number of ipyrad. This is good practice to keep a record of which software version we are using.
In [2]:
## this is a comment; it is not executed, but the code below it is.
import ipyrad as ip
## here we print the version
print(ip.__version__)
There are two main objects in ipyrad: Assembly class objects and Sample class objects. In fact, most users will only ever interact directly with Assembly class objects, since Sample objects are stored inside Assembly objects, and Assembly objects have functions, such as merge and branch, that are designed for manipulating and exchanging Samples between different Assemblies.
Assembly objects are a unique data structure that ipyrad uses to store and organize information about how to assemble RAD-seq data. They contain functions that can be applied to data, such as clustering and aligning sequences, and they store information about which settings (parameters) to use for assembly functions, and which Samples those functions should be applied to. You can think of an Assembly mostly as a container that has a set of rules associated with it.
To create a new Assembly object use the ip.Assembly() function and pass it the name of your new Assembly. Creating an object in this way has exactly the same effect as using the -n {name} argument in the ipyrad command line tool, except that in the API, instead of creating a params.txt file, we store the new Assembly information in a Python variable. This variable can be named anything you want. Below I name the variable data1 so it is easy to remember that the Assembly name is also data1.
In [3]:
## create an Assembly object named data1.
data1 = ip.Assembly("data1")
In [4]:
## setting/modifying parameters for this Assembly object
data1.set_params('project_dir', "pedicularis")
data1.set_params('sorted_fastq_path', "./example_empirical_rad/*.gz")
data1.set_params('filter_adapters', 2)
data1.set_params('datatype', 'rad')
## prints the parameters to the screen
data1.get_params()
A nice feature of the set_params() function in the ipyrad API is that it checks your parameter settings at the time that you change them to make sure they are compatible. By contrast, the ipyrad CLI does not check params until you try to run a step function. Below you can see that an error is raised when we try to set the clust_threshold parameter to 2.0, since it must be a proportion between 0 and 1. It's hard to catch every possible error, but we've tried to catch many of the most common mistakes in parameter settings.
In [5]:
## this should raise an error, since clust_threshold cannot be 2.0
data1.set_params("clust_threshold", 2.0)
Assembly objects have many attributes which you can access to learn more about your Assembly. To see the full list of options, type the name of your Assembly variable followed by a '.' and then press tab to use tab-completion.
In [6]:
print(data1.name)
In [7]:
## another example attribute, listing directories
## associated with this object. Most are empty b/c
## we haven't started creating files yet. But you
## can see that it shows the fastq directory.
print(data1.dirs)
Sample class objects correspond to individual samples in your study. They store the file paths pointing to the data saved on disk, and they store statistics about the results of each step of the assembly. Sample class objects are stored inside Assembly class objects, and can be added, removed, or merged with other Sample class objects between different Assemblies.
Samples are created during step 1 of the ipyrad assembly. This involves either demultiplexing raw data files or loading data files that are already demultiplexed. For this example we are loading demultiplexed data files. Because we've already entered the path to our data files in the sorted_fastq_path of our Assembly object, we can go ahead and run step 1 to create Sample objects that are linked to the data files.
In [8]:
## run step 1 to create Sample objects
data1.run("1")
The .run() command
The run function is equivalent to the -s argument in the ipyrad command line tool, and tells ipyrad which steps (1-7) of the assembly to run. If a step has already been run on your Samples it will be skipped and a warning will be printed. You can force overwriting of existing data using the force flag.
In [9]:
## The force flag allows you to re-run a step that is already finished
data1.run("1", force=True)
The run command will automatically parallelize work across all cores of a running ipcluster instance (remember, you should have started ipcluster outside of the notebook, or you can start it now). If ipcluster is running on the default profile then ipyrad will detect and use it when the run command is called. However, if you started an ipcluster instance with a specific profile name then you will need to connect to it using the ipyparallel library and pass the connection (Client) object to ipyrad. I'll show an example of that here.
In [10]:
## this is the explicit way to connect to ipcluster
import ipyparallel
## connect to a running ipcluster instance on the default profile
ipyclient = ipyparallel.Client()
## or, if you started ipcluster under a named profile, pass its name
ipyclient = ipyparallel.Client(profile="default")
In [11]:
## call the run function of ipyrad and pass it the ipyclient
## connection that the work should be distributed across.
data1.run("1", ipyclient=ipyclient, force=True)
You can see below that after step 1 has been run there will be a collection of Sample objects stored in the Assembly, which can be accessed from the .samples attribute. They are stored as a dictionary in which the keys are Sample names and the values are the Sample objects.
In [12]:
## Sample objects stored as a dictionary
data1.samples
Out[12]:
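Because .samples is an ordinary Python dictionary, you can also look up a single Sample object by its name. A minimal sketch, using whichever sample happens to come first in your own assembly:
## access a single Sample object from the dictionary by its name
firstname = list(data1.samples)[0]
print(data1.samples[firstname].name)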
As you can see, running a step of the analysis prints a progress bar similar to what you would see in the ipyrad command line tool. There are some differences, however: it shows "s1" on the far right to indicate that this was step 1 of the assembly, and it does not print information about our cluster setup (e.g., number of nodes and cores). This was a stylistic choice to provide cleaner output for analyses inside Jupyter notebooks. You can view the cluster information when running the step functions by adding the argument show_cluster=True.
In [13]:
## run step 1 again, this time printing cluster info
data1.run("1", show_cluster=True, force=True)
Results for each step are stored in Sample class objects; however, Assembly class objects have functions for summarizing the stats of all the Sample class objects they contain, which provides an easy way to view results. These include the .stats attribute and the .stats_dfs attributes for each step.
In [14]:
## print full stats summary
print(data1.stats)
In [15]:
## print full stats for step 1 (in this case it's the same, but for other
## steps the stats_dfs often contain more information.)
print(data1.stats_dfs.s1)
Branching in the ipyrad API works the same as in the CLI, but is in many ways easier to use because you can access attributes of Assembly objects directly, such as when you want to provide a list of Sample names to subsample (i.e., exclude samples) during the branching process. Below is an example.
In [16]:
## access all Sample names in data1
allsamples = data1.samples.keys()
print("Samples in data1:\n" + "\n".join(allsamples))
In [17]:
## Drop the two samples from this list that have "prz" in their names.
## This is a programmatic way to remove the outgroup samples.
subs = [i for i in allsamples if "prz" not in i]
## use branching to create new Assembly named 'data2'
## with only Samples whose name is in the subs list
data2 = data1.branch("data2", subsamples=subs)
print "Samples in data2:\n", "\n".join(data2.samples)
This is the real bread and butter of the ipyrad API.
You can write simple Python for-loops to apply a range of parameter settings to different branched assemblies. Furthermore, branching can be done in a way that greatly reduces the amount of computation needed to produce multiple data sets: essentially, branching allows you to recycle intermediate states that are shared between branched Assemblies. This is particularly useful when assemblies differ by only one or a few parameters that are applied late in the assembly process. Setting up efficient branching code in this way requires some prior knowledge about when (at which step) each parameter is applied in ipyrad. That information is available in the documentation (http://ipyrad.readthedocs.io/parameters.html).
When setting up for-loop routines like the one below it may be helpful to break the script up among multiple cells of a Jupyter notebook so that you can easily restart from one step or another. It may also be useful to subsample your data set to a small number of samples to test the code first, and if all goes well, then proceed with your full data set.
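For example, here is a minimal sketch of that second tip, assuming an Assembly object (here called data) that has already completed step 1: branch off a small test assembly and run your loop on it first.
## branch a small test assembly containing only the first three samples
testnames = list(data.samples)[:3]
testrun = data.branch("testrun", subsamples=testnames)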
In [18]:
## Start by creating an initial assembly, setting the path to your data,
## and running step 1. I set a project_dir so that all of our data sets
## will be grouped into a single directory called 'branch-test'.
data = ip.Assembly("base")
data.set_params("project_dir", "branch-test")
data.set_params("raw_fastq_path", "./ipsimdata/rad_example_R1_.fastq.gz")
data.set_params("barcodes_path", "./ipsimdata/rad_example_barcodes.txt")
## step 1: load in the data
data.run('1')
In [22]:
## let's create a dictionary to hold the finished assemblies
adict = {}

## iterate over parameter settings, creating a new named assembly
for filter_setting in [1, 2]:
    ## create a new name for the assembly and branch
    newname = data.name + "_f{}".format(filter_setting)
    child1 = data.branch(newname)
    child1.set_params("filter_adapters", filter_setting)
    child1.run("2")

    ## iterate over clust thresholds
    for clust_threshold in ['0.85', '0.90']:
        newname = child1.name + "_c{}".format(clust_threshold[2:])
        child2 = child1.branch(newname)
        child2.set_params("clust_threshold", clust_threshold)
        child2.run("3456")

        ## iterate over min_sample coverage
        for min_samples_locus in [4, 12]:
            newname = child2.name + "_m{}".format(min_samples_locus)
            child3 = child2.branch(newname)
            child3.set_params("min_samples_locus", min_samples_locus)
            child3.run("7")

            ## store the completed assembly in the dictionary by its name
            ## so it is easy to access and retrieve, since we wrote over
            ## the variable name 'child3' during the loop. You can do this
            ## using dictionaries, lists, etc., or, as you'll see below,
            ## use the 'load_json()' command to load a finished assembly
            ## from its saved file object.
            adict[newname] = child3
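Once the loop finishes, the dictionary makes it easy to compare the finished assemblies side by side. A minimal sketch (the columns available in .stats depend on which steps have been run):
## print summary stats for each finished assembly
for name in sorted(adict):
    print(name)
    print(adict[name].stats.head())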
A key benefit of using the ipyrad API is that all of the statistics of your analysis are more-or-less accessible through the Assembly and Sample objects. For example, if you want to examine how different minimum depth settings affect your heterozygosity estimates, you can create two separate branches with different parameter values and access the heterozygosity estimates from their .stats attributes.
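Here is a minimal sketch of that idea, assuming an Assembly named data that has already completed steps 1-3; the mindepth values chosen here are arbitrary, and hetero_est is the heterozygosity column that appears in .stats after step 4.
## branch two assemblies that differ only in their mindepth setting
lowdepth = data.branch("lowdepth")
lowdepth.set_params("mindepth_statistical", 6)
highdepth = data.branch("highdepth")
highdepth.set_params("mindepth_statistical", 12)
## run the base-calling steps on each branch
lowdepth.run("45")
highdepth.run("45")
## compare the heterozygosity estimates
print(lowdepth.stats.hetero_est)
print(highdepth.stats.hetero_est)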
If you want to apply a set of parameters to only a subset of your Samples during part of the assembly, you can do so easily with branching and merging. In the example below I create two new branches from the Assembly before the base-calling steps, where each branch selects a different subset of the samples. I then run steps 4 and 5 with a different set of parameters applied to each, so that one makes diploid base calls and the other makes haploid base calls. Finally, I merge the Assemblies back together so that all Samples are assembled together in steps 6 and 7.
In [23]:
## run an assembly up to step 3
data.run("123", force=True)
## select clade 1 from the sample names
subs = [i for i in data.samples if "1" in i]
## branch selecting only those samples
data1 = data.branch("data1", subs)
## select clade 2 from the sample names
subs = [i for i in data.samples if "2" in i]
## branch selecting only those samples
data2 = data.branch("data2", subs)
## make diploid base calls on 'data1' samples
data1.set_params("max_alleles_consens", 2)
## make haploid base calls on 'data2' samples
data2.set_params("max_alleles_consens", 1)
## run both assemblies through base-calling steps
data1.run("45", force=True)
data2.run("45", force=True)
## merge assemblies back together for across-sample steps
data3 = ip.merge("data3", [data1, data2])
data3.run("67")
You can easily make population assignments in ipyrad using Python dictionaries. This is useful for applying min_samples_locus filters to different groups of samples to maximize the data shared across all of your samples. For example, if we wanted to ensure that every locus had data shared across both clades in our data set, we would set a minimum value of 1 for each clade. You can see below that I use a list comprehension to select all samples in each clade based on the characters in their names that define them (i.e., the presence of "1" for all samples in clade 1). When possible, this makes group assignments much easier than writing every sample name by hand.
In [24]:
## create a branch for a population-filtered assembly
pops = data3.branch("populations")
## assign samples to populations
pops.populations = {
    "clade1": (1, [i for i in pops.samples if "1" in i]),
    "clade2": (1, [i for i in pops.samples if "2" in i]),
}
## print the population dictionary
pops.populations
Out[24]:
In [25]:
## run assembly
pops.run("7")
Assembly objects (and the Sample objects they contain) are automatically saved each time you call the .run() function. However, you can also save manually by calling the .save() function of an Assembly object, which updates its JSON file. Additionally, Assembly objects have a function called .write_params() which can be invoked to create a params file for use by the ipyrad command line tool.
In [28]:
## save assembly object (also auto-saves after every run() command)
data1.save()
## load assembly object
data1 = ip.load_json("pedicularis/data1.json")
## write params file for use by the CLI
data1.write_params(force=True)