In [1]:
## import modules
import ipyrad as ip
print "version", ip.__version__ ## print version
In [2]:
## clear data from test directory if it already exists
import shutil
import os
if os.path.exists("./test_rad/"):
    shutil.rmtree("./test_rad/")
Assembly and Sample objects are used by ipyrad to access data stored on disk and to manipulate it. Each biological sample in a data set is represented in a Sample object, and a set of Samples is stored inside an Assembly object. The Assembly object has functions to assemble the data, and stores a log of all steps performed and the resulting statistics of those steps. Assembly objects can be copied or merged to allow branching events where different parameters can subsequently be applied to different Assemblies going forward. Examples of this are shown below.
We'll begin by creating a single Assembly object named "data1". It is created with a set of default assembly parameters and without any Samples linked to it. The name provided will be used in the output files that this Assembly creates.
In [3]:
## create an Assembly object named data1.
data1 = ip.Assembly("data1")
An Assembly object's parameter settings can be viewed with its get_params() function. To get more detailed information about all parameters use the function ip.get_params_info(), or select a single parameter with ip.get_params_info(N), where N is the number or string representation of a parameter. Assembly objects have a set_params() function that is used to modify parameters, as shown below.
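For example, the detailed description of a single parameter can be pulled up on its own. This is only a sketch: the helper is named as in the text above, though some ipyrad versions expose it as ip.paramsinfo(), which is used near the end of this notebook.
In [ ]:
## show the detailed description of parameter 10
## (helper name as described above; may be ip.paramsinfo in other versions)
# ip.get_params_info(10)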
In [4]:
## modify parameters for this Assembly object
data1.set_params('project_dir', "./test_rad")
data1.set_params('raw_fastq_path', "./data/sim_rad_test_R1_.fastq.gz")
data1.set_params('barcodes_path', "./data/sim_rad_test_barcodes.txt")
data1.set_params('filter_adapters', 0)
data1.set_params('datatype', 'rad')
## test on real data
#data1.set_params(2, "~/Dropbox/UO_C353_1.fastq.part-aa.gz")
#data1.set_params(3, "/home/deren/Dropbox/Viburnum_revised.barcodes")
## print the new parameters to screen
data1.get_params()
If the data are already demultiplexed then fastq files can be linked directly to the Assembly object, which in turn will create new Sample objects from them, or link them to existing Sample objects based on the file names (or pairs of fastq files for paired-end data). The files may be gzip compressed. If the data are not demultiplexed then you will have to run the step1 function below to demultiplex the raw data.
In [5]:
## This would link fastq files from the 'sorted_fastq_path' if present
## Here it raises an error because there are no files in the sorted_fastq_path
## data1.link_fastqs() #path="./test_rad/data1_fastqs/*")
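If the data really were already demultiplexed, the workflow would look roughly like the sketch below: point the 'sorted_fastq_path' parameter at the fastq files and then call link_fastqs(). The path shown is purely illustrative, so the cell is left commented.
In [ ]:
## hypothetical sketch for pre-demultiplexed data (path is illustrative only)
# data1.set_params('sorted_fastq_path', "./data/demultiplexed/*.fastq.gz")
# data1.link_fastqs()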
Step 1 uses barcode information to demultiplex data files found in param 2 ['raw_fastq_path']. It will create a Sample object for each barcoded sample. Below we use the step1() function to demultiplex. The stats attribute of an Assembly object is returned as a pandas data frame.
In [6]:
## run step 1 to demultiplex the data
data1.step1(force=True)
## print the results for each Sample in data1
print data1.stats
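Because the stats attribute is an ordinary pandas data frame, the usual pandas methods apply to it. A small sketch is shown below; the column name is illustrative and may differ between ipyrad versions, so the cell is left commented.
In [ ]:
## sketch: summarize or select from the stats DataFrame
## (column name below is illustrative only)
# data1.stats.describe()
# data1.stats["reads_raw"]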
In [7]:
## example of ways to run step 2 to filter and trim reads
data1.step2() #["1B_0", "1C_0"], force=True) ## run on one or more samples
## print the results
print data1.statsfiles.s2
Let's imagine at this point that we are interested in clustering our data at two different clustering thresholds. We will try 0.90 and 0.85. First we need to make a copy/branch of the Assembly object. This will inherit the locations of the data linked in the first object, but diverge in any steps applied to it from here on. Thus, the two Assembly objects can share the same working directory and inherit shared files, but will diverge in creating new files linked to only one or the other. You can view the directories linked to an Assembly object with the .dirs attribute, shown below. The prefix_outname (param 14) of the new object is automatically set to the Assembly object name.
In [8]:
## create a branch of our Assembly object
data2 = data1.branch(newname="data2")
## set clustering threshold to 0.90
data2.set_params("clust_threshold", 0.90)
## look at inherited parameters
data2.get_params()
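If you wanted a separate Assembly pinned explicitly to 0.85 (rather than relying on data1's current setting), the same branching pattern works. A sketch only, not run in this notebook:
In [ ]:
## optional sketch: an explicit 0.85 branch, following the same pattern as data2
# data3 = data1.branch(newname="data3")
# data3.set_params("clust_threshold", 0.85)
# data3.get_params()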
In [9]:
import ipyrad as ip
data1 = ip.load_json("test_rad/data1.json")
In [10]:
## run step 3 to cluster reads within samples using vsearch
data1.step3(force=True)
## print the results
print data1.statsfiles.s3
In [11]:
## run step 3 to cluster reads in data2 at 0.90 sequence similarity
data2.step3(force=True)
## print the results
print data2.stats
In [12]:
print "data1 directories:"
for (i, j) in data1.dirs.items():
    print "{}\t{}".format(i, j)
print "\ndata2 directories:"
for (i, j) in data2.dirs.items():
    print "{}\t{}".format(i, j)
In [13]:
data1.stats.to_csv("data1_results.csv", sep="\t")
data1.stats.to_latex("data1_results.tex")
In [14]:
import ipyrad.plotting as iplot
## plot for one or more selected samples
#iplot.depthplot(data1, ["1A_0", "1B_0"])
## plot for all samples in data1
iplot.depthplot(data1)
## save plot as pdf and html
#iplot.depthplot(data1, outprefix="testfig")
In [15]:
## run step 4
data1.step4(force=True)
## print the results
print data1.stats
In [16]:
## run step 5
data1.step5(force=True) #"1B_0")
## print the results
print data1.stats
In [17]:
## run step 6
data1.step6(force=True)
In [18]:
data1.step7(force=True)
In [ ]:
import h5py
ioh5 = h5py.File(data1.database, 'r')
superseqs = ioh5["seqs"][:]
superseqs[superseqs == ""] = "N"
In [30]:
AMBIGS = {"R":("G", "A"),
"K":("G", "T"),
"S":("G", "C"),
"Y":("T", "C"),
"W":("T", "A"),
"M":("C", "A")}
In [64]:
import numpy as np

def ucount(sitecol):
    """
    Used to count the number of unique bases in a site for the snpstring.
    Returns a snpstring character: " " for an invariant site, "-" for a
    variable site, or "*" when more than one base is observed more than once.
    """
    ## make the site into a set of observed bases
    site = set(sitecol)

    ## replace ambiguity codes with their two resolutions
    for iupac in "RSKYWM":
        if iupac in site:
            site.discard(iupac)
            site.update(AMBIGS[iupac])

    ## remove - and N
    site.discard("N")
    site.discard("-")

    ## if the site is invariant return " "
    if len(site) < 2:
        return " "
    else:
        ## count how many bases come up more than once
        ccx = np.sum(np.array([np.sum(sitecol == base) for base in site]) > 1)
        ## if more than one base came up more than once return *
        if ccx > 1:
            return "*"
        else:
            return "-"
In [65]:
snps = np.apply_along_axis(ucount, 1, superseqs)
In [86]:
snpsarr = np.zeros((superseqs.shape[0], superseqs.shape[2], 2), dtype=np.bool)
snpsarr[:, :, 0] = snps == "-"
snpsarr[:, :, 1] = snps == "*"
snpsarr.shape
Out[86]:
In [107]:
import numpy as np
import os
a = np.array(['a', 'b'])
set(a)
Out[107]:
In [112]:
path = os.path.join(data1.dirs.outfiles, data1.name+"_tmpchunks", "snpsarr.0.npy")
np.load(path).shape
Out[112]:
In [113]:
path = os.path.join(data1.dirs.outfiles, data1.name+"_tmpchunks", "hetf.0.npy")
np.load(path).shape
Out[113]:
In [5]:
ll test_rad/outfiles/
In [18]:
ip.paramsinfo(10)
A common problem at the end of an analysis, or while troubleshooting it, is that you find you've completely forgotten which parameters you used at what point, and when you changed them. Documenting or executing code inside Jupyter notebooks (like the one you're reading right now) is a great way to keep track of all of this. In addition, ipyrad also stores a log history which time stamps all modifications to Assembly objects.
In [ ]:
for i in data1.log:
    print i
print "\ndata2 log includes its pre-branching history with data1"
for i in data2.log:
    print i
Assembly objects can be saved and loaded so that interactive analyses can be started, stopped, and returned to quite easily. The format of these saved files is a serialized 'dill' object used by Python. Individual Sample objects are saved within Assembly objects. These objects do not contain the actual sequence data, but only link to it, and so are not very large. The information contained includes the parameters and log of Assembly objects, and the statistics and state of Sample objects. Assembly objects are autosaved each time an assembly step function is called, but you can also create your own checkpoints with the save command.
In [ ]:
## save assembly object
#ip.save_assembly("data1.p")
## load assembly object
#data = ip.load_assembly("data1.p")
#print data.name
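A hedged sketch of the checkpoint workflow used in this notebook: data1 was reloaded earlier with ip.load_json(), and a manual checkpoint would use the Assembly save method described above (exact call names may vary between ipyrad versions, so the cell is left commented).
In [ ]:
## manual checkpoint, then reload it later (a sketch; see ip.load_json used earlier)
# data1.save()
# data1 = ip.load_json("test_rad/data1.json")
# print data1.name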
In [60]:
isinstance(0.25, float)
Out[60]: