In [2]:
### Notebook 5
### Data set 5 (Heliconius)
### Authors: Nadeau et al. (2013)
### Data Location: ERP000991
In [1]:
%%bash
## make a new directory for this analysis
mkdir -p empirical_5/fastq/
In [1]:
## ENA sample accessions included in this analysis
subsamples = ['ERS070268','ERS070269','ERS070270','ERS070271','ERS070272',
              'ERS070273','ERS070274','ERS070275','ERS070276','ERS070277',
              'ERS070246','ERS070247','ERS070248','ERS070249','ERS070245',
              'ERS070252','ERS070253','ERS070254','ERS070256','ERS070255',
              'ERS074398','ERS074399','ERS074400','ERS074401','ERS074402',
              'ERS074403','ERS074404','ERS074405','ERS074406','ERS074407',
              'ERS074408','ERS074409','ERS074410','ERS074411','ERS074412',
              'ERS074413','ERS074414','ERS074415','ERS074416','ERS074417',
              'ERS074418','ERS074419','ERS074420','ERS074421','ERS074422',
              'ERS074423','ERS074424','ERS074425','ERS074426','ERS074427',
              'ERS074428','ERS074429','ERS070250','ERS070251']
len(subsamples)
Out[1]:
54
In [82]:
## IPython code
import pandas as pd
import numpy as np
import urllib2
import os
## open the SRA run table from github url
url = "https://raw.githubusercontent.com/"+\
"dereneaton/RADmissing/master/empirical_9_EraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")
## print first few rows
print indata.head()
In [42]:
def wget_download_ERR(ERR, outdir, outname):
    """ Python function to fetch fastq data for a run accession from
    the EBI ftp site and write it to outdir under a new name, using
    a bash call to wget """
    ## get output name
    output = os.path.join(outdir, outname+".fastq.gz")
    ## create a call string
    call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
           "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"+\
           "{}/{}/{}_1.fastq.gz".format(ERR[:6], ERR, ERR)
    ## call wget through bash
    ! $call
Here we pass the ERR run accession and the sample name to the wget_download_ERR
function so that the files are saved under their sample names.
In [43]:
for ID, ERS, ERR in zip(indata.scientific_name,
                        indata.secondary_sample_accession,
                        indata.run_accession):
    if ERS in subsamples:
        name = ID.split()[1]
        name += "_"+ERS+"_"+ERR
        print "{:<35}\t{}\t{}".format(ID, ERS, ERR)
        wget_download_ERR(ERR, "empirical_5/fastq/", name)
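As a sanity check, the ENA fastq path can be composed by hand: the first six
characters of the run accession name its parent directory on the ftp site. A
minimal sketch with a hypothetical accession:
In [ ]:
## sketch only: show the ftp path the download function above builds.
## "ERR123456" is a hypothetical run accession, not one of our samples.
ERR = "ERR123456"
print "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/" +\
      "{}/{}/{}_1.fastq.gz".format(ERR[:6], ERR, ERR)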
In [59]:
%%bash
mkdir -p empirical_5/fastq/merged
In [62]:
## IPython code
import glob
import os

## get sample names (minus the run accession) from the fastq files
taxa = [i.split("/")[-1].rsplit('_', 1)[0] for i in \
        glob.glob("empirical_5/fastq/*.gz")]

## iterate over individuals and create a merged file for each
for taxon in set(taxa):
    print taxon, "merged"
    ## get replicate files for this individual
    reps = glob.glob("empirical_5/fastq/"+taxon+"*")
    if len(reps) > 1:
        ## concatenate the replicates into a single gzip file
        with open("empirical_5/fastq/merged/"+taxon+".fastq.gz", 'wb') as out:
            for rep in reps:
                with open(rep, 'rb') as tempin:
                    out.write(tempin.read())
    else:
        ## single file: just move it into the merged directory
        dirs, ff = os.path.split(reps[0])
        os.rename(os.path.join(dirs, ff),
                  os.path.join(dirs, "merged", taxon+".fastq.gz"))
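Concatenating the raw bytes of gzip files yields a valid multi-member gzip
stream, so the merged files decompress as a single fastq. A quick sketch to
check the read count in one merged file (the sample name is just an example
from this dataset):
In [ ]:
## sketch: count reads in a merged file to confirm the concatenation worked.
## Python's gzip module reads multi-member gzip streams transparently.
import gzip
with gzip.open("empirical_5/fastq/merged/timareta_ERS070251.fastq.gz") as infastq:
    nlines = sum(1 for line in infastq)
print nlines/4, "reads"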
In [64]:
%%bash
pyrad --version
In [65]:
%%bash
## remove old params file if it exists
rm -f params.txt
## create a new default params file
pyrad -n
In [68]:
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_5/ ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG ## 6. cutters ' params.txt
sed -i '/## 7. /c\20 ## 7. N processors ' params.txt
sed -i '/## 9. /c\6 ## 9. NQual ' params.txt
sed -i '/## 10./c\.85 ## 10. clust threshold ' params.txt
sed -i '/## 12./c\4 ## 12. MinCov ' params.txt
sed -i '/## 13./c\10 ## 13. maxSH ' params.txt
sed -i '/## 14./c\empirical_5_m4 ## 14. output name ' params.txt
sed -i '/## 18./c\empirical_5/fastq/merged/*.gz ## 18. data location ' params.txt
sed -i '/## 20./c\64 ## 20. phred offset ' params.txt
sed -i '/## 29./c\2,2 ## 29. trim overhang ' params.txt
sed -i '/## 30./c\p,n,s ## 30. output formats ' params.txt
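Each sed call above uses the `c\` (change) command: any line matching the
`## N.` comment pattern is replaced wholesale with the new parameter line. A
rough Python equivalent of one substitution, for illustration only:
In [ ]:
## sketch only: a Python equivalent of the sed '/## 10./c\...' call above;
## sed's c\ command replaces the entire matching line with new text.
lines = open("params.txt").readlines()
with open("params.txt", 'w') as out:
    for line in lines:
        if "## 10." in line:
            line = ".85                 ## 10. clust threshold \n"
        out.write(line)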
In [69]:
cat params.txt
In [ ]:
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1
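pyrad's -s flag selects which assembly steps to run: the call above runs
steps 2 through 7 under the MinCov=4 settings. Because MinCov is applied only
during the final filtering and output step, the MinCov=2 assembly below only
needs to re-run step 7.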
In [ ]:
%%bash
sed -i '/## 12./c\2 ## 12. MinCov ' params.txt
sed -i '/## 14./c\empirical_5_m2 ## 14. output name ' params.txt
In [ ]:
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1
In [3]:
## read in the data
s2dat = pd.read_table("empirical_5/stats/s2.rawedit.txt", header=0, nrows=55)
## print summary stats
print s2dat["passed.total"].describe()
## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]
In [10]:
## read in the s3 results
s5dat = pd.read_table("empirical_5/stats/s3.clusters.txt", header=0, nrows=54)
## print summary stats
print "summary of means\n=================="
print s5dat['dpt.me'].describe()
## print summary stats
print "\nsummary of std\n=================="
print s5dat['dpt.sd'].describe()
## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s5dat['d>5.tot']/s5dat["total"]).describe()
## find which sample has the greatest depth of retained loci
max_hiprop = (s5dat["d>5.tot"]/s5dat["total"]).max()
print "\nhighest coverage in sample:"
print s5dat['taxa'][s5dat['d>5.tot']/s5dat["total"]==max_hiprop]
In [13]:
## print mean and std of coverage for the highest coverage sample
with open("empirical_5/clust.85/timareta_ERS070251.depths", 'rb') as indat:
depths = np.array(indat.read().strip().split(","), dtype=int)
print depths.mean(), depths.std()
In [16]:
import toyplot
import toyplot.svg
import numpy as np

## read in the depth information for this sample
with open("empirical_5/clust.85/timareta_ERS070251.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)

## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)",
                   ylabel="N loci",
                   label="dataset5/sample=timareta_ERS070251")

## select the loci with depth > 5 (kept)
keeps = depths[depths > 5]

## plot kept and discarded loci
edat = np.histogram(depths, range(30))  # density=True
kdat = np.histogram(keeps, range(30))   # density=True
axes.bars(edat)
axes.bars(kdat)
#toyplot.svg.render(canvas, "empirical_5_depthplot.svg")
Out[16]:
[toyplot barplot: depth of coverage distribution for timareta_ERS070251]
In [94]:
cat empirical_5/stats/empirical_5_m4.stats
In [18]:
%%bash
head -n 20 empirical_5/stats/empirical_5_m2.stats
In [ ]:
%%bash
## run raxml on the m4 assembly: rapid bootstrapping plus a search for the
## best-scoring ML tree (-f a), 100 bootstrap replicates, 20 threads
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_5/ \
-n empirical_5_m4 -s empirical_5/outfiles/empirical_5_m4.phy
In [ ]:
%%bash
## run raxml on the m2 assembly with the same settings
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
-w /home/deren/Documents/RADmissing/empirical_5/ \
-n empirical_5_m2 -s empirical_5/outfiles/empirical_5_m2.phy
In [96]:
%%bash
head -n 20 empirical_5/RAxML_info.empirical_5_m4
In [19]:
%%bash
head -n 20 empirical_5/RAxML_info.empirical_5_m2
In [ ]:
%load_ext rpy2.ipython
In [27]:
%%R
library(ape)
ltre <- read.tree()
mean(cophenetic.phylo(ltre))
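Note that cophenetic.phylo() returns the full symmetric matrix of pairwise
patristic distances, so mean() here averages over all entries including the
zero diagonal; for n tips this equals the mean pairwise distance scaled by
(n-1)/n.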