This is a Jupyter/IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell; otherwise it is IPython code. This notebook includes code to download, assemble, and analyze a published RADseq data set, and further code below to analyze missing data in that data set.
In [2]:
### Notebook 2
### Data set 2 (Phrynosomatidae)
### Authors: Leache et al. (2015)
### Data Location: NCBI SRA SRP063316
Sequence data for this study are archived on the NCBI Sequence Read Archive (SRA). Below I read in SraRunTable.txt, downloaded from the SRA website for this project, which contains the run accessions (ID numbers) needed to download the data for every sample.
In [83]:
%%bash
## make a new directory for this analysis
mkdir -p empirical_2/fastq/
In [20]:
## IPython code
## import libraries
import pandas as pd
import urllib2
import os
## read in the SRA run table from public github url
## as a pandas data frame
url = "https://raw.githubusercontent.com/"+\
"dereneaton/RADmissing/master/empirical_2_SraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")
## print first few rows
print indata.head()
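The two columns used below for downloading are the run accession (Run_s) and the sample name (Library_Name_s). As a quick sanity check (my addition, not in the original notebook) that both columns parsed correctly:
## added check, not in the original notebook:
## peek at the two columns used for downloading
print indata[["Run_s", "Library_Name_s"]].head()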
In [89]:
def wget_download(SRR, outdir, outname):
    """ Python function to get sra data from ncbi and write to
    outdir with a new name using a bash call to wget """
    ## get output name
    output = os.path.join(outdir, outname+".sra")
    ## create a call string
    call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
           "ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
           "sra/sra-instant/reads/ByRun/sra/SRR/"+\
           "{}/{}/{}.sra;".format(SRR[:6], SRR, SRR)
    ## call bash script
    ! $call
Here we pass the SRR number and the sample name to the wget_download
function so that the files are saved with their sample names.
In [90]:
for ID, SRR in zip(indata.Library_Name_s, indata.Run_s):
    wget_download(SRR, "empirical_2/fastq/", ID)
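Because wget is called with the -q (quiet) flag, the loop prints nothing, so a simple sanity check (an added sketch, not part of the original run) is to confirm that one .sra file was written per sample:
## added check: count downloaded .sra files; should match the rows in the run table
import glob
sra_files = glob.glob("empirical_2/fastq/*.sra")
print len(sra_files), "of", len(indata), "samples downloaded"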
In [93]:
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_2/fastq/ empirical_2/fastq/*.sra
## remove .sra files
rm empirical_2/fastq/*.sra
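To verify the conversion, a quick check (an added sketch, assuming the standard four-lines-per-read fastq layout) counts the reads in one of the gzipped files:
## added check: count reads in the first gzipped fastq file (4 lines per read)
import gzip, glob
fq = sorted(glob.glob("empirical_2/fastq/*.gz"))[0]
with gzip.open(fq) as infile:
    nreads = sum(1 for line in infile) / 4
print fq, "contains", nreads, "reads"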
In [97]:
%%bash
pyrad --version
In [104]:
%%bash
## create a new default params file
pyrad -n
In [28]:
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_2/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAGG ## 6. cutters ' params.txt
sed -i '/## 7. /c\30 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_2_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_2/fastq/*.gz ## 18. data location ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
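Each sed command above uses the c\ (change) command, which replaces any line matching the parameter's comment (e.g., '## 10.') with the new text, so the entire line in params.txt is rewritten in place. For readers less familiar with sed, a rough Python equivalent (a hypothetical helper, not used in this pipeline) would be:
## hypothetical Python analog of the sed '/pattern/c\...' edits above
def set_param(paramfile, pattern, newline):
    """ replace any line in paramfile containing pattern with newline """
    lines = open(paramfile).readlines()
    with open(paramfile, 'w') as out:
        for line in lines:
            out.write(newline+"\n" if pattern in line else line)

## e.g., set_param("params.txt", "## 10.", ".85 ## 10. clust threshold")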
In [29]:
cat params.txt
In [ ]:
%%bash
## run pyrad assembly steps 2-7 to produce the min4 (MinCov=4) data set
pyrad -p params.txt -s 234567 >> log.txt 2>&1
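This step runs for a long time. Since stdout and stderr are both redirected to log.txt, progress can be followed by printing the end of the log (a convenience check, not part of the original notebook):
## added check: print the last few lines of the pyrad log
print "".join(open("log.txt").readlines()[-5:])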
In [32]:
%%bash
## lower MinCov to 2 and set a new output name for the min2 data set
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_2_m2 ## 14. output name ' params.txt
In [33]:
%%bash
## re-run only the final step (7) to write output files with MinCov=2
pyrad -p params.txt -s 7 >> log.txt 2>&1
In [6]:
## read in the data
s2dat = pd.read_table("empirical_2/stats/s2.rawedit.txt", header=0, nrows=74)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
In [50]:
## read in the s3 results
s3dat = pd.read_table("empirical_2/stats/s3.clusters.txt", header=0, nrows=74)
## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s3dat["d>5.tot"]/s3dat["total"]).max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']/s3dat["total"]==max_hiprop]
In [24]:
## print mean and std of coverage for the highest coverage sample
import numpy as np
with open("empirical_2/clust.85/PHBR4.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
In [27]:
import toyplot
import toyplot.svg
import numpy as np
## read in the depth information for this sample
with open("empirical_2/clust.85/PHBR4.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
                   ylabel="N loci",
                   label="dataset2/sample=PHBR4")
## select the loci with depth > 5 (kept)
keeps = depths[depths>5]
## plot kept and discarded loci
edat = np.histogram(depths, range(30))  # density=True
kdat = np.histogram(keeps, range(30))   # density=True
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_2_depthplot.svg")
Out[27]: [toyplot barplot: N loci vs. depth of coverage for sample PHBR4, with kept loci (depth > 5) overlaid on all loci]
In [34]:
cat empirical_2/stats/empirical_2_m4.stats
In [35]:
%%bash
head -n 10 empirical_2/stats/empirical_2_m2.stats
In [ ]:
%%bash
## infer ML tree for the min4 data set in raxml with rapid bootstrapping (-f a),
## the GTRGAMMA model, 100 bootstrap replicates, and 20 threads
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_2/ \
-n empirical_2_m4 -s empirical_2/outfiles/empirical_2_m4.phy
In [ ]:
%%bash
## infer ML tree for the min2 data set in raxml with rapid bootstrapping (-f a),
## the GTRGAMMA model, 100 bootstrap replicates, and 20 threads
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_2/ \
-n empirical_2_m2 -s empirical_2/outfiles/empirical_2_m2.phy
In [227]:
%%bash
head -n 20 empirical_2/RAxML_info.empirical_2_m4
In [ ]:
%%bash
head -n 20 empirical_2/RAxML_info.empirical_2_m2
In [39]:
%load_ext rpy2.ipython
In [44]:
%%R -w 600 -h 800
library(ape)
tre <- read.tree("empirical_2/RAxML_bipartitions.empirical_2_m4")
ltre <- ladderize(tre)
plot(ltre, cex=0.8, edge.width=2)
#nodelabels(ltre$node.label)
In [51]:
%%R
mean(cophenetic.phylo(ltre))