Notebook 2:

This is a Jupyter/IPython notebook. Most of the code consists of bash scripts, indicated by %%bash at the top of the cell; the rest is IPython code. The notebook includes code to download, assemble, and analyze a published RADseq data set, followed by further code to analyze missing data in that data set.


In [2]:
### Notebook 2
### Data set 2 (Phrynosomatidae)
### Authors: Leache et al. (2015)
### Data Location: NCBI SRA SRP063316



Download the sequence data

Sequence data for this study are archived on the NCBI Sequence Read Archive (SRA). Below I read in SraRunTable.txt, downloaded from the SRA website for this project; it contains the accession numbers needed to download the data for every sample.

  • Project SRA: SRP063316
  • BioProject ID: PRJNA294316
  • Biosample numbers: SAMN04027506 -- SAMN04027579
  • Runs: SRR2240500 -- SRR2240573

In [83]:
%%bash
## make a new directory for this analysis
mkdir -p empirical_2/fastq/

In [20]:
## IPython code

## import libraries
import pandas as pd
import urllib2
import os

## read in the SRA run table from public github url
## as a pandas data frame
url = "https://raw.githubusercontent.com/"+\
      "dereneaton/RADmissing/master/empirical_2_SraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")

## print first few rows
print indata.head()


    BioSample_s Library_Name_s  MBases_l  MBytes_l            Organism_s  \
0  SAMN04027526          PHCE3        88        58  Phrynosoma cerroense   
1  SAMN04027532          PHCN3        67        44   Phrynosoma cornutum   
2  SAMN04027531          PHCN2        79        52   Phrynosoma cornutum   
3  SAMN04027530          PHCN1        47        31   Phrynosoma cornutum   
4  SAMN04027529          PHCE6        45        29  Phrynosoma cerroense   

        Run_s SRA_Sample_s Sample_Name_s  \
0  SRR2240520   SRS1054806         PHCE3   
1  SRR2240526   SRS1054807         PHCN3   
2  SRR2240525   SRS1054808         PHCN2   
3  SRR2240524   SRS1054809         PHCN1   
4  SRR2240523   SRS1054810         PHCE6   

                                      geo_loc_name_s          sex_s  \
0           Mexico:Colonia Guerrero, Baja California           male   
1  USA:3.4 mi (by gravel road) N Steins, Hidalgo ...  not collected   
2  USA:Hwy 180, 8.4 mi NW of Hwy 10 (at Deming), ...  not collected   
3          USA:0.5 mi E Portal, Cochise Co., Arizona  not collected   
4  Mexico:7.5 mi S Guerrero Negro, Baja Californi...         female   

          voucher_s Assay_Type_s  AssemblyName_s BioProject_s  \
0   MVZ:Herp:161187        OTHER  <not provided>  PRJNA294316   
1  AMCC:Herp:106073        OTHER  <not provided>  PRJNA294316   
2   CAS:Herp:228870        OTHER  <not provided>  PRJNA294316   
3   MVZ:Herp:238582        OTHER  <not provided>  PRJNA294316   
4   MVZ:Herp:161210        OTHER  <not provided>  PRJNA294316   

           BioSampleModel_s Center_Name_s Consent_s  InsertSize_l  \
0  Model organism or animal         NREMC    public             0   
1  Model organism or animal         NREMC    public             0   
2  Model organism or animal         NREMC    public             0   
3  Model organism or animal         NREMC    public             0   
4  Model organism or animal         NREMC    public             0   

  LibraryLayout_s    LoadDate_s      
0          SINGLE  Sep 04, 2015 ...  
1          SINGLE  Sep 04, 2015 ...  
2          SINGLE  Sep 04, 2015 ...  
3          SINGLE  Sep 04, 2015 ...  
4          SINGLE  Sep 04, 2015 ...  

[5 rows x 29 columns]

Download sequence data using the SRA information

This is a function that makes wget calls to download data based on SRA run IDs.


In [89]:
def wget_download(SRR, outdir, outname):
    """ Python function to get sra data from ncbi and write to
    outdir with a new name using bash call wget """
    
    ## get output name
    output = os.path.join(outdir, outname+".sra")
    
    ## create a call string 
    call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
           "ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
           "sra/sra-instant/reads/ByRun/sra/SRR/"+\
           "{}/{}/{}.sra;".format(SRR[:6], SRR, SRR)
        
    ## run the wget command through the IPython shell
    ! $call

Here we pass the SRR number and the sample name to the wget_download function so that the files are saved with their sample names.


In [90]:
for ID, SRR in zip(indata.Library_Name_s, indata.Run_s):
    wget_download(SRR, "empirical_2/fastq/", ID)
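
For reference, the FTP path that wget_download builds follows the SRA "ByRun" directory layout, in which the first six characters of the run accession name the parent directory. A minimal sketch of just the string construction, using the first run listed in the table above (it only prints the URL and does not contact the server):


In [ ]:
## illustrate the URL constructed for run SRR2240520 (sample PHCE3);
## SRR[:6] gives the "SRR224" parent directory in the ByRun FTP layout
SRR = "SRR2240520"
print "ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
      "sra/sra-instant/reads/ByRun/sra/SRR/"+\
      "{}/{}/{}.sra".format(SRR[:6], SRR, SRR)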

Convert from SRA format to FASTQ format


In [93]:
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_2/fastq/ empirical_2/fastq/*.sra

## remove .sra files
rm empirical_2/fastq/*.sra


Read 1883604 spots for empirical_2/fastq/CADR1.sra
Written 1883604 spots for empirical_2/fastq/CADR1.sra
Read 1168210 spots for empirical_2/fastq/CADR2.sra
Written 1168210 spots for empirical_2/fastq/CADR2.sra
Read 1452471 spots for empirical_2/fastq/COTE1.sra
Written 1452471 spots for empirical_2/fastq/COTE1.sra
Read 5406187 spots for empirical_2/fastq/GAWI1.sra
Written 5406187 spots for empirical_2/fastq/GAWI1.sra
Read 699921 spots for empirical_2/fastq/HOMA1.sra
Written 699921 spots for empirical_2/fastq/HOMA1.sra
Read 2590961 spots for empirical_2/fastq/PETH1.sra
Written 2590961 spots for empirical_2/fastq/PETH1.sra
Read 1200924 spots for empirical_2/fastq/PHAS1.sra
Written 1200924 spots for empirical_2/fastq/PHAS1.sra
Read 2698423 spots for empirical_2/fastq/PHAS2.sra
Written 2698423 spots for empirical_2/fastq/PHAS2.sra
Read 1622179 spots for empirical_2/fastq/PHAS3.sra
Written 1622179 spots for empirical_2/fastq/PHAS3.sra
Read 1230297 spots for empirical_2/fastq/PHAS4.sra
Written 1230297 spots for empirical_2/fastq/PHAS4.sra
Read 1723043 spots for empirical_2/fastq/PHBL1.sra
Written 1723043 spots for empirical_2/fastq/PHBL1.sra
Read 1753039 spots for empirical_2/fastq/PHBL2.sra
Written 1753039 spots for empirical_2/fastq/PHBL2.sra
Read 1598240 spots for empirical_2/fastq/PHBL3.sra
Written 1598240 spots for empirical_2/fastq/PHBL3.sra
Read 1744633 spots for empirical_2/fastq/PHBL4.sra
Written 1744633 spots for empirical_2/fastq/PHBL4.sra
Read 651033 spots for empirical_2/fastq/PHBR1.sra
Written 651033 spots for empirical_2/fastq/PHBR1.sra
Read 1047774 spots for empirical_2/fastq/PHBR2.sra
Written 1047774 spots for empirical_2/fastq/PHBR2.sra
Read 2913849 spots for empirical_2/fastq/PHBR3.sra
Written 2913849 spots for empirical_2/fastq/PHBR3.sra
Read 3075277 spots for empirical_2/fastq/PHBR4.sra
Written 3075277 spots for empirical_2/fastq/PHBR4.sra
Read 1560169 spots for empirical_2/fastq/PHCE1.sra
Written 1560169 spots for empirical_2/fastq/PHCE1.sra
Read 1655737 spots for empirical_2/fastq/PHCE2.sra
Written 1655737 spots for empirical_2/fastq/PHCE2.sra
Read 2066291 spots for empirical_2/fastq/PHCE3.sra
Written 2066291 spots for empirical_2/fastq/PHCE3.sra
Read 1385481 spots for empirical_2/fastq/PHCE4.sra
Written 1385481 spots for empirical_2/fastq/PHCE4.sra
Read 2873860 spots for empirical_2/fastq/PHCE5.sra
Written 2873860 spots for empirical_2/fastq/PHCE5.sra
Read 1068532 spots for empirical_2/fastq/PHCE6.sra
Written 1068532 spots for empirical_2/fastq/PHCE6.sra
Read 1112293 spots for empirical_2/fastq/PHCN1.sra
Written 1112293 spots for empirical_2/fastq/PHCN1.sra
Read 1854834 spots for empirical_2/fastq/PHCN2.sra
Written 1854834 spots for empirical_2/fastq/PHCN2.sra
Read 1572449 spots for empirical_2/fastq/PHCN3.sra
Written 1572449 spots for empirical_2/fastq/PHCN3.sra
Read 1434649 spots for empirical_2/fastq/PHCN4.sra
Written 1434649 spots for empirical_2/fastq/PHCN4.sra
Read 755638 spots for empirical_2/fastq/PHCO1.sra
Written 755638 spots for empirical_2/fastq/PHCO1.sra
Read 1333197 spots for empirical_2/fastq/PHDI1.sra
Written 1333197 spots for empirical_2/fastq/PHDI1.sra
Read 1461823 spots for empirical_2/fastq/PHDI2.sra
Written 1461823 spots for empirical_2/fastq/PHDI2.sra
Read 1625284 spots for empirical_2/fastq/PHDO1.sra
Written 1625284 spots for empirical_2/fastq/PHDO1.sra
Read 2479285 spots for empirical_2/fastq/PHDO2.sra
Written 2479285 spots for empirical_2/fastq/PHDO2.sra
Read 993112 spots for empirical_2/fastq/PHGO1.sra
Written 993112 spots for empirical_2/fastq/PHGO1.sra
Read 848505 spots for empirical_2/fastq/PHGO2.sra
Written 848505 spots for empirical_2/fastq/PHGO2.sra
Read 2298845 spots for empirical_2/fastq/PHGO3.sra
Written 2298845 spots for empirical_2/fastq/PHGO3.sra
Read 4136233 spots for empirical_2/fastq/PHGO4.sra
Written 4136233 spots for empirical_2/fastq/PHGO4.sra
Read 1130365 spots for empirical_2/fastq/PHHE1.sra
Written 1130365 spots for empirical_2/fastq/PHHE1.sra
Read 1828106 spots for empirical_2/fastq/PHHE2.sra
Written 1828106 spots for empirical_2/fastq/PHHE2.sra
Read 1782490 spots for empirical_2/fastq/PHHE3.sra
Written 1782490 spots for empirical_2/fastq/PHHE3.sra
Read 2137199 spots for empirical_2/fastq/PHHE4.sra
Written 2137199 spots for empirical_2/fastq/PHHE4.sra
Read 1326138 spots for empirical_2/fastq/PHHE5.sra
Written 1326138 spots for empirical_2/fastq/PHHE5.sra
Read 3215345 spots for empirical_2/fastq/PHMC1.sra
Written 3215345 spots for empirical_2/fastq/PHMC1.sra
Read 1083680 spots for empirical_2/fastq/PHMC2.sra
Written 1083680 spots for empirical_2/fastq/PHMC2.sra
Read 1119901 spots for empirical_2/fastq/PHMC3.sra
Written 1119901 spots for empirical_2/fastq/PHMC3.sra
Read 2292044 spots for empirical_2/fastq/PHMC4.sra
Written 2292044 spots for empirical_2/fastq/PHMC4.sra
Read 1769810 spots for empirical_2/fastq/PHMO1.sra
Written 1769810 spots for empirical_2/fastq/PHMO1.sra
Read 1780446 spots for empirical_2/fastq/PHMO2.sra
Written 1780446 spots for empirical_2/fastq/PHMO2.sra
Read 566270 spots for empirical_2/fastq/PHOR1.sra
Written 566270 spots for empirical_2/fastq/PHOR1.sra
Read 2065124 spots for empirical_2/fastq/PHOR2.sra
Written 2065124 spots for empirical_2/fastq/PHOR2.sra
Read 2024585 spots for empirical_2/fastq/PHOR3.sra
Written 2024585 spots for empirical_2/fastq/PHOR3.sra
Read 718036 spots for empirical_2/fastq/PHOR4.sra
Written 718036 spots for empirical_2/fastq/PHOR4.sra
Read 1307372 spots for empirical_2/fastq/PHPL1.sra
Written 1307372 spots for empirical_2/fastq/PHPL1.sra
Read 2900231 spots for empirical_2/fastq/PHPL2.sra
Written 2900231 spots for empirical_2/fastq/PHPL2.sra
Read 1960451 spots for empirical_2/fastq/PHPL3.sra
Written 1960451 spots for empirical_2/fastq/PHPL3.sra
Read 1154546 spots for empirical_2/fastq/PHSH1.sra
Written 1154546 spots for empirical_2/fastq/PHSH1.sra
Read 1650407 spots for empirical_2/fastq/PHSH2.sra
Written 1650407 spots for empirical_2/fastq/PHSH2.sra
Read 940706 spots for empirical_2/fastq/PHSH3.sra
Written 940706 spots for empirical_2/fastq/PHSH3.sra
Read 814375 spots for empirical_2/fastq/PHSH4.sra
Written 814375 spots for empirical_2/fastq/PHSH4.sra
Read 978350 spots for empirical_2/fastq/PHSO1.sra
Written 978350 spots for empirical_2/fastq/PHSO1.sra
Read 1523072 spots for empirical_2/fastq/PHSO2.sra
Written 1523072 spots for empirical_2/fastq/PHSO2.sra
Read 1668677 spots for empirical_2/fastq/PHSO3.sra
Written 1668677 spots for empirical_2/fastq/PHSO3.sra
Read 1311410 spots for empirical_2/fastq/PHTA1.sra
Written 1311410 spots for empirical_2/fastq/PHTA1.sra
Read 1685806 spots for empirical_2/fastq/PHTA2.sra
Written 1685806 spots for empirical_2/fastq/PHTA2.sra
Read 1740419 spots for empirical_2/fastq/PHTA3.sra
Written 1740419 spots for empirical_2/fastq/PHTA3.sra
Read 1522268 spots for empirical_2/fastq/PHTA4.sra
Written 1522268 spots for empirical_2/fastq/PHTA4.sra
Read 1898189 spots for empirical_2/fastq/SCAN1.sra
Written 1898189 spots for empirical_2/fastq/SCAN1.sra
Read 1742309 spots for empirical_2/fastq/SCGA1.sra
Written 1742309 spots for empirical_2/fastq/SCGA1.sra
Read 1595531 spots for empirical_2/fastq/SCMA1.sra
Written 1595531 spots for empirical_2/fastq/SCMA1.sra
Read 1404985 spots for empirical_2/fastq/SCOC1.sra
Written 1404985 spots for empirical_2/fastq/SCOC1.sra
Read 806846 spots for empirical_2/fastq/UMNO1.sra
Written 806846 spots for empirical_2/fastq/UMNO1.sra
Read 2923832 spots for empirical_2/fastq/URBI1.sra
Written 2923832 spots for empirical_2/fastq/URBI1.sra
Read 3465996 spots for empirical_2/fastq/UROR1.sra
Written 3465996 spots for empirical_2/fastq/UROR1.sra
Read 4818547 spots for empirical_2/fastq/UTST1.sra
Written 4818547 spots for empirical_2/fastq/UTST1.sra
Read 131630146 spots total
Written 131630146 spots total

Make a params file for pyrad analysis


In [97]:
%%bash
pyrad --version


pyRAD 3.0.63

In [104]:
%%bash
## create a new default params file
pyrad -n


	new params.txt file created

In [28]:
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_2/           ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAGG                 ## 6. cutters ' params.txt
sed -i '/## 7. /c\30                     ## 7. N processors      ' params.txt
sed -i '/## 9. /c\6                      ## 9. NQual             ' params.txt
sed -i '/## 10./c\.85                    ## 10. clust threshold  ' params.txt
sed -i '/## 12./c\4                      ## 12. MinCov           ' params.txt
sed -i '/## 13./c\10                     ## 13. maxSH            ' params.txt
sed -i '/## 14./c\empirical_2_m4         ## 14. output name      ' params.txt
sed -i '/## 18./c\empirical_2/fastq/*.gz ## 18. data location    ' params.txt
sed -i '/## 29./c\2,2                    ## 29. trim overhang    ' params.txt
sed -i '/## 30./c\p,n,s                  ## 30. output formats   ' params.txt

In [29]:
cat params.txt


==** parameter inputs for pyRAD version 3.0.63  **======================== affected step ==
empirical_2/           ## 1. working directory 
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAGG                 ## 6. cutters 
30                     ## 7. N processors      
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
6                      ## 9. NQual             
.85                    ## 10. clust threshold  
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                      ## 12. MinCov           
10                     ## 13. maxSH            
empirical_2_m4         ## 14. output name      
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
empirical_2/fastq/*.gz ## 18. data location    
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
2,2                    ## 29. trim overhang    
p,n,s                  ## 30. output formats   
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Assemble in pyrad

We assemble the data set at two minimum-coverage thresholds: MinCov=4 and MinCov=2. Because MinCov is applied only when the final loci are written in step 7, the MinCov=2 assembly below just edits the params file and reruns step 7.


In [ ]:
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1

In [32]:
%%bash
sed -i '/## 12./c\2                    ## 12. MinCov           ' params.txt
sed -i '/## 14./c\empirical_2_m2       ## 14. output name      ' params.txt

In [33]:
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1

Results

We are interested in the relationship between the amount of input (raw) data for any two samples, the average coverage they recover when clustered, and the phylogenetic distance separating them.

Raw data amounts

The average number of raw reads per sample is 1.5M.


In [6]:
## read in the data
s2dat = pd.read_table("empirical_2/stats/s2.rawedit.txt", header=0, nrows=74)

## print summary stats
print s2dat["passed.total"].describe()

## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]


count         74.000000
mean     1500455.675676
std       753933.061083
min       469134.000000
25%      1023280.500000
50%      1366480.000000
75%      1648418.250000
max      4436366.000000
Name: passed.total, dtype: float64

most raw data in sample:
3    GAWI1
Name: sample , dtype: object

Look at distributions of coverage

pyrad v.3.0.63 outputs depth information for each sample, which I read in and plot here. First, let's ask which sample has the highest depth of coverage. The std reported here is the standard deviation of the per-sample mean depths; the std of depths within individuals is much higher.


In [50]:
## read in the s3 results
s3dat = pd.read_table("empirical_2/stats/s3.clusters.txt", header=0, nrows=74)

## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()

## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()

## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()

## find which sample has the greatest depth of retained loci
max_hiprop = (s3dat["d>5.tot"]/s3dat["total"]).max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']/s3dat["total"]==max_hiprop]


summary of means
==================
count    74.000000
mean     38.482919
std      15.210132
min      13.655000
25%      27.747500
50%      36.933500
75%      45.540000
max      97.009000
Name: dpt.me, dtype: float64

summary of std
==================
count      74.000000
mean      339.613689
std       304.751343
min       108.546000
25%       172.106500
50%       235.372000
75%       349.702750
max      2090.841000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    74.000000
mean      0.716992
std       0.057099
min       0.618293
25%       0.677747
50%       0.702102
75%       0.756614
max       0.894946
dtype: float64

highest coverage in sample:
17    PHBR4
Name: taxa, dtype: object
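
Tying this back to the framing above, the per-sample raw read counts (s2dat) and mean clustering depths (s3dat) can be lined up directly. A minimal sketch, assuming the sample labels match between the two stats tables (note the trailing space in the s2 "sample " column header):


In [ ]:
## sketch: join raw-read counts (step 2 stats) to mean cluster depth
## (step 3 stats) by sample name, then report their correlation
merged = pd.merge(s2dat[["sample ", "passed.total"]],
                  s3dat[["taxa", "dpt.me"]],
                  left_on="sample ", right_on="taxa")
print merged[["passed.total", "dpt.me"]].corr()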

In [24]:
## print mean and std of coverage for the highest coverage sample
import numpy as np

with open("empirical_2/clust.85/PHBR4.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
    
print depths.mean(), depths.std()


57.1020689655 368.84812359
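
The same within-sample check can be extended to every sample, not just PHBR4. A minimal sketch, assuming each sample has a corresponding .depths file in empirical_2/clust.85/ (only PHBR4 is read explicitly in this notebook):


In [ ]:
## sketch: within-sample std of cluster depths for all samples
import glob
import os

stds = {}
for path in glob.glob("empirical_2/clust.85/*.depths"):
    with open(path, 'rb') as indat:
        depths = np.array(indat.read().strip().split(","), dtype=int)
    stds[os.path.basename(path).split(".")[0]] = depths.std()

print pd.Series(stds).describe()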

Plot the coverage for the sample with the highest mean coverage

Green shows the loci that were discarded and orange the loci that were retained. The majority of loci were discarded for having too low coverage.


In [27]:
import toyplot
import toyplot.svg
import numpy as np

## read in the depth information for this sample
with open("empirical_2/clust.85/PHBR4.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
    
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)", 
                   ylabel="N loci",
                   label="dataset2/sample=PHBR4")

## select the loci with depth > 5 (kept)
keeps = depths[depths>5]

## plot kept and discarded loci
edat = np.histogram(depths, range(30)) # density=True)
kdat = np.histogram(keeps, range(30)) #, density=True)
axes.bars(edat)
axes.bars(kdat)

#toyplot.svg.render(canvas, "empirical_2_depthplot.svg")


Out[27]:
<toyplot.mark.BarMagnitudes at 0x7fd4eee25c10>
[Bar plot "dataset2/sample=PHBR4": x = Depth of coverage (N reads), y = N loci]

In [34]:
cat empirical_2/stats/empirical_2_m4.stats



41086       ## loci with > minsp containing data
41033       ## loci with > minsp containing data & paralogs removed
41011       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
CADR1	2794
CADR2	2368
COTE1	2169
GAWI1	770
HOMA1	1473
PETH1	2020
PHAS1	4764
PHAS2	7858
PHAS3	6246
PHAS4	5945
PHBL1	9814
PHBL2	7591
PHBL3	6895
PHBL4	6951
PHBR1	4095
PHBR2	4235
PHBR3	8615
PHBR4	9529
PHCE1	8592
PHCE2	9942
PHCE3	8272
PHCE4	7031
PHCE5	9280
PHCE6	7123
PHCN1	5035
PHCN2	6544
PHCN3	5789
PHCN4	6183
PHCO1	5207
PHDI1	6054
PHDI2	6145
PHDO1	6949
PHDO2	8862
PHGO1	5427
PHGO2	4957
PHGO3	7282
PHGO4	7148
PHHE1	6596
PHHE2	8473
PHHE3	6519
PHHE4	8101
PHHE5	5674
PHMC1	8894
PHMC2	6284
PHMC3	6505
PHMC4	6808
PHMO1	5078
PHMO2	5500
PHOR1	3478
PHOR2	8272
PHOR3	8218
PHOR4	4160
PHPL1	5890
PHPL2	8079
PHPL3	6535
PHSH1	6346
PHSH2	7586
PHSH3	4994
PHSH4	5671
PHSO1	3819
PHSO2	5372
PHSO3	6552
PHTA1	5481
PHTA2	6787
PHTA3	6527
PHTA4	7901
SCAN1	1165
SCGA1	1383
SCMA1	1489
SCOC1	1651
UMNO1	1099
URBI1	1277
UROR1	1481
UTST1	1861


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	10107	*	41011
5	5693	*	30904
6	4159	*	25211
7	3133	*	21052
8	2510	*	17919
9	1924	*	15409
10	1545	*	13485
11	1310	*	11940
12	1056	*	10630
13	986	*	9574
14	781	*	8588
15	690	*	7807
16	584	*	7117
17	529	*	6533
18	492	*	6004
19	413	*	5512
20	391	*	5099
21	329	*	4708
22	319	*	4379
23	299	*	4060
24	289	*	3761
25	264	*	3472
26	255	*	3208
27	237	*	2953
28	211	*	2716
29	185	*	2505
30	192	*	2320
31	179	*	2128
32	138	*	1949
33	119	*	1811
34	168	*	1692
35	151	*	1524
36	106	*	1373
37	107	*	1267
38	109	*	1160
39	96	*	1051
40	90	*	955
41	85	*	865
42	93	*	780
43	71	*	687
44	49	*	616
45	56	*	567
46	45	*	511
47	47	*	466
48	46	*	419
49	50	*	373
50	42	*	323
51	30	*	281
52	33	*	251
53	31	*	218
54	18	*	187
55	26	*	169
56	16	*	143
57	16	*	127
58	11	*	111
59	15	*	100
60	11	*	85
61	11	*	74
62	8	*	63
63	11	*	55
64	6	*	44
65	10	*	38
66	7	*	28
67	7	*	21
68	6	*	14
69	0	*	8
70	4	*	8
71	2	*	4
72	1	*	2
73	0	*	1
74	1	*	1


## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	8048	0	16090	0
1	5782	5782	7148	7148
2	4644	15070	5054	17256
3	4018	27124	3757	28527
4	3561	41368	2869	40003
5	3270	57718	2249	51248
6	2957	75460	1492	60200
7	2201	90867	893	66451
8	1702	104483	608	71315
9	1332	116471	377	74708
10	1037	126841	211	76818
11	698	134519	128	78226
12	542	141023	71	79078
13	372	145859	36	79546
14	262	149527	14	79742
15	198	152497	8	79862
16	115	154337	2	79894
17	87	155816	2	79928
18	67	157022	1	79946
19	43	157839	1	79965
20	28	158399	0	79965
21	16	158735	0	79965
22	8	158911	0	79965
23	9	159118	0	79965
24	7	159286	0	79965
25	6	159436	0	79965
26	0	159436	0	79965
27	0	159436	0	79965
28	0	159436	0	79965
29	0	159436	0	79965
30	1	159466	0	79965
total var= 159466
total pis= 79965
sampled unlinked SNPs= 32963
sampled unlinked bi-allelic SNPs= 18577

In [35]:
%%bash 
head -n 10 empirical_2/stats/empirical_2_m2.stats



87440       ## loci with > minsp containing data
87387       ## loci with > minsp containing data & paralogs removed
87325       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
CADR1	7329
CADR2	6453

Infer ML phylogeny in raxml as an unrooted tree


In [ ]:
%%bash
## raxml rapid bootstrap + ML search (-f a), GTRGAMMA, 100 replicates, 20 threads (m4 assembly)
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
                      -w /home/deren/Documents/RADmissing/empirical_2/ \
                      -n empirical_2_m4 -s empirical_2/outfiles/empirical_2_m4.phy

In [ ]:
%%bash
## raxml rapid bootstrap + ML search (-f a), GTRGAMMA, 100 replicates, 20 threads (m2 assembly)
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
                      -w /home/deren/Documents/RADmissing/empirical_2/ \
                      -n empirical_2_m2 -s empirical_2/outfiles/empirical_2_m2.phy

In [227]:
%%bash 
head -n 20 empirical_2/RAxML_info.empirical_2_m4



This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao         (KIT)
Charlie Taylor    (UF)


Alignment has 296560 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 86.18%

RAxML rapid bootstrapping and subsequent ML search

In [ ]:
%%bash 
head -n 20 empirical_2/RAxML_info.empirical_2_m2

Plot the tree in R using ape


In [39]:
%load_ext rpy2.ipython

In [44]:
%%R -w 600 -h 800
library(ape)
tre <- read.tree("empirical_2/RAxML_bipartitions.empirical_2")
ltre <- ladderize(tre)
plot(ltre, cex=0.8, edge.width=2)
#nodelabels(ltre$node.label)


Get average phylo distances (GTRgamma distance)


In [51]:
%%R
mean(cophenetic.phylo(ltre))


[1] 0.06260378