Welcome! This tutorial will introduce you to the basics of working with ipyrad to assemble RADseq data.
Note: this tutorial was created in a Jupyter Notebook and assumes that you're following along in a notebook of your own. If you installed ipyrad then jupyter was installed by default as well. For a little background on how Jupyter notebooks work, see Notebooks. All of the code below is written in Python.
To begin, we're going to import the ipyrad module and name it ip so that we can refer to it more easily:
In [5]:
import ipyrad as ip
Assembly objects are used by ipyrad to access data stored on disk and to manipulate it. Each biological sample in a data set is represented as a Sample object, and a set of Samples is stored inside an Assembly object.
The Assembly object has functions to assemble data, and to view or plot the resulting assembled files and statistics. Assembly objects can be copied or merged to allow branching events where different parameters can subsequently be applied to different Assembly objects. Examples of this are shown in the workflow.
Every analysis begins by creating at least one Assembly object. Here we will name it "data1".
In [13]:
## create an Assembly object named data1.
data1 = ip.Assembly("data1")
The printout tells us that we created the object data1, and also that it found 4 engines on our system available for computation. An engine is simply a CPU. When working on a single machine it is usually easiest to let the Assembly object connect to all available local engines. However, on HPC clusters you may need to modify the controller or the number of engines, as shown below:
In [ ]:
## create an Assembly object linked to 4 engines using MPI
data1 = ip.Assembly("data1", N=4, controller="MPI")
For more information about connecting CPUs for parallelization see ipyparallel setup.
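If you prefer to manage the parallel cluster yourself, you can launch engines ahead of time with ipyparallel's ipcluster tool. A minimal sketch, assuming ipyparallel is installed (the engine count is arbitrary; run it from a notebook cell with the ! shell escape, or drop the ! in a terminal):
In [ ]:
## launch 4 MPI engines in the background (a sketch; adjust -n for your machine)
#!ipcluster start -n 4 --engines=MPI --daemonize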
In [16]:
## setting/modifying parameters for this Assembly object
data1.set_params('project_dir', "./test_rad")
data1.set_params('raw_fastq_path', "./data/sim_rad_test_R1_.fastq.gz")
data1.set_params('barcodes_path', "./data/sim_rad_test_barcodes.txt")
data1.set_params('filter_adapters', 0)
data1.set_params('datatype', 'rad')
## print the parameters for `data1`
data1.get_params()
To get more detailed information about each parameter use ip.get_params_info(), or look up their function in the documentation (Parameters). To quickly look up the proper formatting for a parameter, use ip.get_params_info(N), where N is the number of a parameter. Example:
In [17]:
ip.get_params_info(10)
Each biological sample in a data set is represented as a Sample object. Sample objects are created during step1() of an analysis, at which time they are linked to an Assembly object.
When getting started, raw data should be in one of two forms:

1. raw (non-demultiplexed) fastq files along with a barcodes file
2. demultiplexed fastq files, one file (or one pair of files) per sample

Note: For additional ways to add raw data files to a data set see link_fastqs.

If the data are already demultiplexed then fastq files can be linked directly to the Assembly object, which in turn will create new Sample objects from them, or link them to existing Sample objects based on the file names (or pairs of fastq files for paired-end data). The files may be gzip compressed. If the data are not demultiplexed then you will have to run the step1 function below to demultiplex the raw data.
In [5]:
## This would link fastq files from the 'sorted_fastq_path' if present
## Here it does nothing b/c there are no files in the sorted_fastq_path
data1.link_fastqs()
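For example, if your reads were already demultiplexed you could point the Assembly at the sorted files and link them directly. A minimal sketch (the glob path below is hypothetical):
In [ ]:
## link already-demultiplexed fastq files (hypothetical path)
#data1.set_params('sorted_fastq_path', "./data/demux/*.fastq.gz")
#data1.link_fastqs()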
Step 1 uses the barcode information to demultiplex the data files found in the 'raw_fastq_path' parameter, creating a Sample object for each barcoded sample. Below we use the step1() function to demultiplex. The stats attribute of an Assembly object is returned as a pandas DataFrame.
In [6]:
## run step 1 to demultiplex the data
data1.step1()
## print the results for each Sample in data1
print(data1.stats)
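Each Sample created during demultiplexing can also be inspected on its own: Samples are stored in a dict on the Assembly keyed by sample name. A minimal sketch (the name "1A_0" comes from the simulated barcodes file, and we assume each Sample exposes its statistics through a .stats attribute like the Assembly does):
In [ ]:
## Samples are stored in a dict keyed by sample name
print(list(data1.samples.keys()))
## view the stats of a single Sample
print(data1.samples["1A_0"].stats)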
In [ ]:
## remove the lane control sequence
#data1.samples.pop("FGXCONTROL")
In [7]:
## example of ways to run step 2 to filter and trim reads
#data1.step2(["1A_0"]) ## run on a single sample
#data1.step2(["1B_0", "1C_0"]) ## run on one or more samples
data1.step2(force=True) ## run on all samples, overwrite finished
## print the results
print(data1.stats)
In [ ]:
#data1.samples["veitchii"].files
Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds: 0.90 and 0.85. First we need to make a copy/branch of the Assembly object. The new branch inherits the locations of the data linked to the first object, but diverges in any future applications to the object. Thus, the two Assembly objects can share the same working directory and inherit shared files, but will create new files linked to only one or the other. You can view the directories linked to an Assembly object with the .dirs attribute, shown below. The prefix_outname (param 14) of the new object is automatically set to the new Assembly name.
In [8]:
## create a copy of our Assembly object
data2 = data1.branch(newname="data2")
## set the clustering threshold to 0.90
data2.set_params('clust_threshold', 0.90)
## look at inherited parameters
data2.get_params()
In [9]:
## reload the autosaved Assembly, e.g., if returning to a previous session
import ipyrad as ip
data1 = ip.load_assembly("test_rad/data1")
In [10]:
## run step 3 to cluster reads within samples using vsearch
data1.step3(force=True)
## print the results
print(data1.stats)
In [ ]:
## run step 3 to cluster reads in data2 at 0.90 sequence similarity
data2.step3(force=True)
## print the results
print(data2.stats)
In [ ]:
print "data1 directories:"
for (i,j) in data1.dirs.items():
print "{}\t{}".format(i, j)
print "\ndata2 directories:"
for (i,j) in data2.dirs.items():
print "{}\t{}".format(i, j)
In [ ]:
## view the files in which stats are stored
data1.statsfiles
In [ ]:
## write the stats table to external files (TSV and LaTeX)
data1.stats.to_csv("data1_results.csv", sep="\t")
data1.stats.to_latex("data1_results.tex")
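Because .stats is a regular pandas DataFrame, any pandas operation applies to it. A minimal sketch (the column name 'reads_raw' is an assumption; check data1.stats.columns for the exact names):
In [ ]:
## e.g., sort samples by raw read count (column name assumed)
print(data1.stats.sort_values(by="reads_raw", ascending=False).head())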
In [11]:
import ipyrad.plotting as iplot
## plot for one or more selected samples
#iplot.depthplot(data1, ["1A_0", "1B_0"])
## plot for all samples in data1
iplot.depthplot(data1)
## save plot as pdf and html
#iplot.depthplot(data1, outprefix="testfig")
In [ ]:
## run step 4
data1.step4()
## print the results
print(data1.stats)
In [ ]:
#import ipyrad as ip
## reload autosaved data, in case you quit and came back
#data1 = ip.load_assembly("test_rad/data1")
In [ ]:
## run step 5
#data1.step5()
## print the results
#print(data1.stats)
A common problem at the end of an analysis, or while troubleshooting one, is finding that you've forgotten which parameters you used at what point, and when you changed them. Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, ipyrad stores a log history that time-stamps all modifications to Assembly objects.
In [ ]:
for i in data1.log:
    print(i)

print("\ndata2 log includes its pre-branching history with data1")
for i in data2.log:
    print(i)
Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The format of these saved files is a serialized 'dill' object used by Python. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes the parameters and log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly step function is called, but you can also create your own checkpoints with the save command.
In [ ]:
## save assembly object
#ip.save_assembly("data1.p")
## load assembly object
#data = ip.load_assembly("data1.p")
#print(data.name)