Quick guide to the ipyrad API

Getting Started

Welcome! This tutorial will introduce you to the basics of working with ipyrad to assemble RADseq data.

Note: this tutorial was created in a Jupyter Notebook and assumes that you're following along in a notebook of your own. If you installed ipyrad then you will also have jupyter installed by default. For a little background on how jupyter notebooks work, see Notebooks. All of the code below is written in Python.

To begin, we're going to import the ipyrad module and name it ip so that we can refer to it more easily:


In [5]:
import ipyrad as ip

Assembly objects

Assembly objects are used by ipyrad to access data stored on disk and to manipulate it. Each biological sample in a data set is represented as a Sample object, and a set of Samples is stored inside an Assembly object.

The Assembly object has functions to assemble data, and to view or plot the resulting assembled files and statistics. Assembly objects can be copied or merged to allow branching events where different parameters can subsequently be applied to different Assembly objects. Examples of this are shown in the workflow.

Every analysis begins by creating at least one Assembly object. Here we will name it "data1".


In [13]:
## create an Assembly object named data1. 
data1 = ip.Assembly("data1")


INFO:ipyrad.core.assembly:New Assembly object `data1` created
DEBUG:ipyrad.core.parallel:ipcontroller already running.
New Assembly object `data1` created

The printout tells us that we created the object data1, and the debug message shows that an ipcontroller for parallel computation was already running. ipyrad distributes work across engines, where an engine is simply a CPU core. When working on a single machine it is usually easiest to let the Assembly object connect to all available local engines. On HPC clusters, however, you may need to modify the controller or the number of engines, as shown below:


In [ ]:
## create an Assembly object linked to 4 engines using MPI
data1 = ip.Assembly("data1", N=4, controller="MPI")

For more information about connecting CPUs for parallelization see ipyparallel setup.
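If you want to verify how many engines are available, you can query the running cluster directly with the ipyparallel library. A minimal sketch, assuming an ipcontroller and its engines are already running (as the debug message above indicates):


In [ ]:
## connect to the running ipcontroller
import ipyparallel as ipp
rc = ipp.Client()

## each id corresponds to one engine (a CPU core)
print("{} engines available".format(len(rc.ids)))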

Modifying Assembly parameters

The functions get_params() and set_params() are used to view and modify the parameter settings of an Assembly object, respectively.


In [16]:
## setting/modifying parameters for this Assembly object
data1.set_params('project_dir', "./test_rad")
data1.set_params('raw_fastq_path', "./data/sim_rad_test_R1_.fastq.gz")
data1.set_params('barcodes_path', "./data/sim_rad_test_barcodes.txt")
data1.set_params('filter_adapters', 0)
data1.set_params('datatype', 'rad')

## print the parameters for `data1`
data1.get_params()


  1   project_dir                 ./test_rad                                   
  2   raw_fastq_path              ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path               ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                        
  5   restriction_overhang        ('TGCAG', '')                                
  6   max_low_qual_bases          5                                            
  7   engines_per_job             4                                            
  8   mindepth_statistical        6                                            
  9   mindepth_majrule            6                                            
  10  datatype                    rad                                          
  11  clust_threshold             0.85                                         
  12  minsamp                     4                                            
  13  max_shared_heterozygosity   0.25                                         
  14  prefix_outname              data1                                        
  15  phred_Qscore_offset         33                                           
  16  max_barcode_mismatch        1                                            
  17  filter_adapters             0                                            
  18  filter_min_trim_len         35                                           
  19  ploidy                      2                                            
  20  max_stack_size              1000                                         
  21  max_Ns_consens              (5, 5)                                       
  22  max_Hs_consens              (8, 8)                                       
  23  max_SNPs_locus              (100, 100)                                   
  24  max_Indels_locus            (5, 99)                                      
  25  trim_overhang               (1, 2, 2, 1)                                 
  26  hierarchical_clustering     0                                            
  27  assembly_method             denovo                                       
  28  reference_sequence                                                       

To get more detailed information about each parameter use ip.get_params_info(), or look up their function in the documentation (Parameters). To quickly look up the proper formatting for a parameter, you can use ip.get_params_info(N), where N is the number of a parameter. Example:


In [17]:
ip.get_params_info(10)


        (10) clust_threshold -------------------------------------------------
        Clustering threshold. 
        Examples:
        ----------------------------------------------------------------------
        data.set_params(10, 0.85)         ## clustering similarity threshold
        data.set_params(10, 0.90)         ## clustering similarity threshold
        data.set_params(10, 0.95)         ## very high values not recommended
        data.set_params("clust_threshold", 0.83)  ## verbose
        ----------------------------------------------------------------------
        

Sample objects

Each biological sample in a data set is represented as a Sample object. Sample objects are created during step1() of an analysis, at which time they are linked to an Assembly object.

When getting started, raw data should be in one of two forms:

  • Non-demultiplexed data files (accompanied by a barcode file).
  • Demultiplexed data files.

Note: For additional ways to add raw data files to a data set see link_fastqs.

If the data are already demultiplexed then fastq files can be linked directly to the Assembly object. This creates new Sample objects from them, or links them to existing Sample objects, based on the file names (or the pair of fastq files for paired-end data). The files may be gzip compressed. If the data are not demultiplexed then you will need to run the step1 function below to demultiplex the raw data.


In [5]:
## This would link fastq files from the 'sorted_fastq_path' if present
## Here it does nothing b/c there are no files in the sorted_fastq_path
data1.link_fastqs()


0 new Samples created in `data1`.
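If your data were already demultiplexed, you would instead point the Assembly to those files before linking. A minimal sketch, not run in this tutorial; the wildcard path below is hypothetical and for illustration only:


In [ ]:
## point to already-demultiplexed fastq files (hypothetical path)
data1.set_params('sorted_fastq_path', "./data/demux/*.fastq.gz")

## link them to create (or update) Sample objects
data1.link_fastqs()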

Step 1: Demultiplexing raw data files

Step 1 uses barcode information to demultiplex the data files found in param 2 ['raw_fastq_path']. It will create a Sample object for each barcoded sample. Below we use the step1() function to demultiplex. The stats attribute of an Assembly object is a pandas DataFrame summarizing its Samples.


In [6]:
## run step 1 to demultiplex the data
data1.step1()

## print the results for each Sample in data1
print(data1.stats)


      state  reads_raw
1A_0      1      20099
1B_0      1      19977
1C_0      1      20114
1D_0      1      19895
2E_0      1      19928
2F_0      1      19934
2G_0      1      20026
2H_0      1      19936
3I_0      1      20084
3J_0      1      20011
3K_0      1      20117
3L_0      1      19901

In [ ]:
## remove the lane control sequence
#data1.samples.pop("FGXCONTROL")
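The Sample objects themselves live in the Assembly's .samples dictionary, keyed by sample name, so they are easy to inspect directly. A small sketch, assuming step 1 has completed:


In [ ]:
## names of all Samples linked to data1
print(sorted(data1.samples.keys()))

## grab a single Sample and view the files linked to it
print(data1.samples["1A_0"].files)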

Step 2: Filter reads

If we wanted to execute on just a subsample of our data, we could do so by passing a list of Sample names to the step2 function. Because step2 is a method of the Assembly object, it always executes with the parameters linked to that object.


In [7]:
## example of ways to run step 2 to filter and trim reads

#data1.step2(["1A_0"])              ## run on a single sample
#data1.step2(["1B_0", "1C_0"])      ## run on one or more samples
data1.step2(force=True)             ## run on all samples, overwrite finished

## print the results
print(data1.stats)


      state  reads_raw  reads_filtered
1A_0      2      20099           20099
1B_0      2      19977           19977
1C_0      2      20114           20114
1D_0      2      19895           19895
2E_0      2      19928           19928
2F_0      2      19934           19934
2G_0      2      20026           20026
2H_0      2      19936           19936
3I_0      2      20084           20084
3J_0      2      20011           20011
3K_0      2      20117           20117
3L_0      2      19901           19901

In [ ]:
#data1.samples["veitchii"].files

Branching Assembly objects

Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds: 0.90 and 0.85. First we need to make a copy/branch of the Assembly object. The new branch inherits the locations of the data linked in the first object, but the two objects will diverge in anything applied to them from this point on. Thus, the two Assembly objects can share the same working directory and inherit shared files, but each will create new files linked only to itself. You can view the directories linked to an Assembly object with the .dirs attribute, shown below. The prefix_outname (param 14) of the new object is automatically set to the Assembly object's name.


In [8]:
## create a copy of our Assembly object
data2 = data1.branch(newname="data2")

## set clustering threshold to 0.90
data2.set_params(11, 0.90)

## look at inherited parameters
data2.get_params()


  1   project_dir                 ./test_rad                                   
  2   raw_fastq_path              ./data/sim_rad_test_R1_.fastq.gz             
  3   barcodes_path               ./data/sim_rad_test_barcodes.txt             
  4   sorted_fastq_path                                                        
  5   restriction_overhang        ('TGCAG', '')                                
  6   max_low_qual_bases          5                                            
  7   engines_per_job             4                                            
  8   mindepth_statistical        6                                            
  9   mindepth_majrule            6                                            
  10  datatype                    rad                                          
  11  clust_threshold             0.9                                          
  12  minsamp                     4                                            
  13  max_shared_heterozygosity   0.25                                         
  14  prefix_outname              data2                                        
  15  phred_Qscore_offset         33                                           
  16  max_barcode_mismatch        1                                            
  17  filter_adapters             0                                            
  18  filter_min_trim_len         35                                           
  19  ploidy                      2                                            
  20  max_stack_size              1000                                         
  21  max_Ns_consens              (5, 5)                                       
  22  max_Hs_consens              (8, 8)                                       
  23  max_SNPs_locus              (100, 100)                                   
  24  max_Indels_locus            (5, 99)                                      
  25  trim_overhang               (1, 2, 2, 1)                                 
  26  hierarchical_clustering     0                                            
  27  assembly_method             denovo                                       
  28  reference_sequence                                                       

Step 3: Clustering within samples


In [9]:
## reload the autosaved Assembly object (in case we quit and came back)
import ipyrad as ip
data1 = ip.load_assembly("test_rad/data1")


DEBUG:ipyrad.core.parallel:ipcontroller already running.

In [10]:
## run step 3 to cluster reads within samples using vsearch
data1.step3(force=True)

## print the results
print(data1.stats)


clustering 12 samples using 4 engines per job
      state  reads_raw  reads_filtered  clusters_total  clusters_kept
1A_0      3      20099           20099            1000           1000
1B_0      3      19977           19977            1000           1000
1C_0      3      20114           20114            1000           1000
1D_0      3      19895           19895            1000           1000
2E_0      3      19928           19928            1000           1000
2F_0      3      19934           19934            1000           1000
2G_0      3      20026           20026            1000           1000
2H_0      3      19936           19936            1000           1000
3I_0      3      20084           20084            1000           1000
3J_0      3      20011           20011            1000           1000
3K_0      3      20117           20117            1000           1000
3L_0      3      19901           19901            1000           1000

In [ ]:
## run step 3 to cluster reads in data2 at 0.90 sequence similarity
data2.step3(force=True) 

## print the results
print(data2.stats)

Branched Assembly objects

As you can see below, the two Assembly objects now share several directories (working, fastq, edits) but have different clust directories (clust_0.85 and clust_0.9).


In [ ]:
print "data1 directories:"
for (i,j) in data1.dirs.items():
    print "{}\t{}".format(i, j)
    
print "\ndata2 directories:"
for (i,j) in data2.dirs.items():
    print "{}\t{}".format(i, j)

In [ ]:
## TODO, just make a [name]_stats directory in [work] for each data obj
data1.statsfiles

Saving stats outputs

Example: two simple ways to save the stats data frame to a file.


In [ ]:
## save the stats DataFrame as a tab-separated text file
data1.stats.to_csv("data1_results.csv", sep="\t")

## or save it as a LaTeX-formatted table
data1.stats.to_latex("data1_results.tex")
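Because .stats is an ordinary pandas DataFrame, all of the usual pandas operations also apply to it. For example, using the column names shown in the tables above:


In [ ]:
## select samples with fewer than 20,000 raw reads
print(data1.stats[data1.stats.reads_raw < 20000])

## summarize raw read counts across all samples
print(data1.stats.reads_raw.describe())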

Example of plotting with ipyrad

There are a few simple plotting functions in ipyrad that are useful for visualizing results. These are in the module ipyrad.plotting. Below is an interactive plot for visualizing the distributions of coverages across the 12 samples in the test data set.


In [11]:
import ipyrad.plotting as iplot

## plot for one or more selected samples
#iplot.depthplot(data1, ["1A_0", "1B_0"])

## plot for all samples in data1
iplot.depthplot(data1)

## save plot as pdf and html
#iplot.depthplot(data1, outprefix="testfig")


[depthplot output: read-depth distributions plotted for each of the 12 samples, 1A_0 through 3L_0]

Step 4: Joint estimation of heterozygosity and error rate


In [ ]:
## run step 4
data1.step4() 

## print the results
print(data1.stats)

Step 5: Consensus base calls


In [ ]:
#import ipyrad as ip

## reload autosaved data. In case you quit and came back 
#data1 = ip.load_dataobj("test_rad/data1.assembly")

In [ ]:
## run step 5
#data1.step5()

## print the results
#print(data1.stats)

Quick parameter explanations are always on-hand


In [ ]:
ip.get_params_info(10)

Log history

A common problem at the end of an analysis, or while troubleshooting one, is finding that you've completely forgotten which parameters you used at what point, and when you changed them. Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, ipyrad stores a log history that time-stamps all modifications to Assembly objects.


In [ ]:
for i in data1.log:
    print(i)

print("\ndata2 log includes its pre-branching history with data1")
for i in data2.log:
    print(i)

Saving Assembly objects

Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The saved files are serialized 'dill' objects, a pickle-like format used by Python. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information saved includes the parameters and log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly step function is called, but you can also create your own checkpoints with the save command.


In [ ]:
## save assembly object
#ip.save_assembly("data1.p")

## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
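Because the saved file is just a serialized object, it could in principle also be opened directly with the dill library. This is a sketch for illustration only; ip.load_assembly above is the supported route, and the filename shown assumes the autosave path used earlier in this tutorial:


In [ ]:
## load an autosaved Assembly object directly with dill (illustrative only)
import dill
with open("test_rad/data1.assembly", "rb") as infile:
    data1 = dill.load(infile)
print(data1.name)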