Notebook 3:

This is an IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell, otherwise it is IPython code. This notebook includes code to download, assemble and analyze a published RADseq data set.



In [2]:

    
### Notebook 3
### Data set 3 (American oaks)
### Authors: Eaton et al. (2015)
### Data Location: NCBI SRA SRP055977

Download the sequence data

Sequence data for this study are archived on the NCBI sequence read archive (SRA). Below I read in SraRunTable.txt for this project which contains all of the information we need to download the data.

Project SRA: SRP055977
BioProject ID: PRJNA277574
Biosample numbers: SAMN03394519 - SAMN03394561
Runs: SRR1915524 -- SRR1915566
SRA link: http://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP055977



In [45]:

    
%%bash
## make a new directory for this analysis
mkdir -p empirical_3/fastq/



In [46]:

    
## IPython code
import pandas as pd
import urllib2
import os

## open the SRA run table from github url
url = "https://raw.githubusercontent.com/"+\
      "dereneaton/RADmissing/master/empirical_3_SraRunTable.txt"
intable = urllib2.urlopen(url)
indata = pd.read_table(intable, sep="\t")

## print first few rows
print indata.head()









    



    BioSample_s Library_Name_s  MBases_l  MBytes_l           Organism_s  \
0  SAMN03394519          AR_re       366       240    Quercus arizonica   
1  SAMN03394520      BJSL25_re       484       313   Quercus brandegeei   
2  SAMN03394521      BJVL19_re       427       275   Quercus brandegeei   
3  SAMN03394522          CH_re       411       268  Quercus chrysolepis   
4  SAMN03394523     CRL0001_re       339       220     Quercus oleoides   

  ReleaseDate_s       Run_s SRA_Sample_s Sample_Name_s  \
0  Mar 13, 2015  SRR1915524    SRS868426         AR_re   
1  Mar 13, 2015  SRR1915525    SRS874263     BJSL25_re   
2  Mar 13, 2015  SRR1915526    SRS874262     BJVL19_re   
3  Mar 13, 2015  SRR1915527    SRS874261         CH_re   
4  Mar 13, 2015  SRR1915528    SRS874260    CRL0001_re   

                                  geo_loc_name_s Assay_Type_s  AssemblyName_s  \
0              USA: UMN Greenhouse, St. Paul, MN          WGS  <not provided>   
1         Mexico: Sierra la Laguna, Baja del Sur          WGS  <not provided>   
2            Mexico: Valle Perdido, Baja del Sur          WGS  <not provided>   
3              USA: UMN Greenhouse, St. Paul, MN          WGS  <not provided>   
4  Costa Rica: RincÃ³n de la Vieja National Park          WGS  <not provided>   

  BioProject_s BioSampleModel_s            Center_Name_s Consent_s  \
0  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public   
1  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public   
2  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public   
3  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public   
4  PRJNA277574            Plant  UNIVERSITY OF MINNESOTA    public   

   InsertSize_l LibraryLayout_s    LoadDate_s Platform_s      
0             0          SINGLE  Mar 13, 2015   ILLUMINA ...  
1             0          SINGLE  Mar 13, 2015   ILLUMINA ...  
2             0          SINGLE  Mar 13, 2015   ILLUMINA ...  
3             0          SINGLE  Mar 13, 2015   ILLUMINA ...  
4             0          SINGLE  Mar 13, 2015   ILLUMINA ...  

[5 rows x 27 columns]



In [47]:

    
def wget_download(SRR, outdir, outname):
    """ Python function to get sra data from ncbi and write to
    outdir with a new name using bash call wget """
    
    ## get output name
    output = os.path.join(outdir, outname+".sra")
    
    ## create a call string 
    call = "wget -q -r -nH --cut-dirs=9 -O "+output+" "+\
           "ftp://ftp-trace.ncbi.nlm.nih.gov/"+\
           "sra/sra-instant/reads/ByRun/sra/SRR/"+\
           "{}/{}/{}.sra;".format(SRR[:6], SRR, SRR)
        
    ## call bash script
    ! $call

Here we pass the SRR number and the sample name to the wget_download function so that the files are saved with their sample names.



In [48]:

    
for ID, SRR in zip(indata.Library_Name_s, indata.Run_s):
    wget_download(SRR, "empirical_3/fastq/", ID)



In [49]:

    
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_3/fastq/  empirical_3/fastq/*.sra

## remove .sra files
rm empirical_3/fastq/*.sra









    



Read 4046890 spots for empirical_3/fastq/AR_re.sra
Written 4046890 spots for empirical_3/fastq/AR_re.sra
Read 931926 spots for empirical_3/fastq/BJSB3_v.sra
Written 931926 spots for empirical_3/fastq/BJSB3_v.sra
Read 5352627 spots for empirical_3/fastq/BJSL25_re.sra
Written 5352627 spots for empirical_3/fastq/BJSL25_re.sra
Read 969575 spots for empirical_3/fastq/BJSL25_v.sra
Written 969575 spots for empirical_3/fastq/BJSL25_v.sra
Read 4715624 spots for empirical_3/fastq/BJVL19_re.sra
Written 4715624 spots for empirical_3/fastq/BJVL19_re.sra
Read 817443 spots for empirical_3/fastq/BJVL19_v.sra
Written 817443 spots for empirical_3/fastq/BJVL19_v.sra
Read 849191 spots for empirical_3/fastq/BZBB1_v.sra
Written 849191 spots for empirical_3/fastq/BZBB1_v.sra
Read 4539385 spots for empirical_3/fastq/CH_re.sra
Written 4539385 spots for empirical_3/fastq/CH_re.sra
Read 3742953 spots for empirical_3/fastq/CRL0001_re.sra
Written 3742953 spots for empirical_3/fastq/CRL0001_re.sra
Read 1012884 spots for empirical_3/fastq/CRL0001_v.sra
Written 1012884 spots for empirical_3/fastq/CRL0001_v.sra
Read 3242503 spots for empirical_3/fastq/CRL0030_re.sra
Written 3242503 spots for empirical_3/fastq/CRL0030_re.sra
Read 1204952 spots for empirical_3/fastq/CRL0030_v.sra
Written 1204952 spots for empirical_3/fastq/CRL0030_v.sra
Read 570148 spots for empirical_3/fastq/CUCA4_v.sra
Written 570148 spots for empirical_3/fastq/CUCA4_v.sra
Read 90661 spots for empirical_3/fastq/CUMM5_v.sra
Written 90661 spots for empirical_3/fastq/CUMM5_v.sra
Read 775357 spots for empirical_3/fastq/CUSV6_v.sra
Written 775357 spots for empirical_3/fastq/CUSV6_v.sra
Read 1130911 spots for empirical_3/fastq/CUVN10_v.sra
Written 1130911 spots for empirical_3/fastq/CUVN10_v.sra
Read 1953553 spots for empirical_3/fastq/DO_re.sra
Written 1953553 spots for empirical_3/fastq/DO_re.sra
Read 3744449 spots for empirical_3/fastq/DU_re.sra
Written 3744449 spots for empirical_3/fastq/DU_re.sra
Read 743556 spots for empirical_3/fastq/EN_re.sra
Written 743556 spots for empirical_3/fastq/EN_re.sra
Read 3219505 spots for empirical_3/fastq/FLAB109_re.sra
Written 3219505 spots for empirical_3/fastq/FLAB109_re.sra
Read 347949 spots for empirical_3/fastq/FLAB109_v.sra
Written 347949 spots for empirical_3/fastq/FLAB109_v.sra
Read 2826213 spots for empirical_3/fastq/FLBA140_re.sra
Written 2826213 spots for empirical_3/fastq/FLBA140_re.sra
Read 975393 spots for empirical_3/fastq/FLBA140_v.sra
Written 975393 spots for empirical_3/fastq/FLBA140_v.sra
Read 941880 spots for empirical_3/fastq/FLCK18_v.sra
Written 941880 spots for empirical_3/fastq/FLCK18_v.sra
Read 791393 spots for empirical_3/fastq/FLCK216_v.sra
Written 791393 spots for empirical_3/fastq/FLCK216_v.sra
Read 905151 spots for empirical_3/fastq/FLMO62_v.sra
Written 905151 spots for empirical_3/fastq/FLMO62_v.sra
Read 3976506 spots for empirical_3/fastq/FLSA185_v.sra
Written 3976506 spots for empirical_3/fastq/FLSA185_v.sra
Read 857771 spots for empirical_3/fastq/FLSF33_v.sra
Written 857771 spots for empirical_3/fastq/FLSF33_v.sra
Read 908978 spots for empirical_3/fastq/FLSF47_v.sra
Written 908978 spots for empirical_3/fastq/FLSF47_v.sra
Read 2543324 spots for empirical_3/fastq/FLSF54_re.sra
Written 2543324 spots for empirical_3/fastq/FLSF54_re.sra
Read 1115926 spots for empirical_3/fastq/FLSF54_v.sra
Written 1115926 spots for empirical_3/fastq/FLSF54_v.sra
Read 1099920 spots for empirical_3/fastq/FLWO6_v.sra
Written 1099920 spots for empirical_3/fastq/FLWO6_v.sra
Read 2777404 spots for empirical_3/fastq/HE_re.sra
Written 2777404 spots for empirical_3/fastq/HE_re.sra
Read 1112292 spots for empirical_3/fastq/HNDA09_v.sra
Written 1112292 spots for empirical_3/fastq/HNDA09_v.sra
Read 1436228 spots for empirical_3/fastq/LALC2_v.sra
Written 1436228 spots for empirical_3/fastq/LALC2_v.sra
Read 1154841 spots for empirical_3/fastq/MXED8_v.sra
Written 1154841 spots for empirical_3/fastq/MXED8_v.sra
Read 1347187 spots for empirical_3/fastq/MXGT4_v.sra
Written 1347187 spots for empirical_3/fastq/MXGT4_v.sra
Read 1046574 spots for empirical_3/fastq/MXSA3017_v.sra
Written 1046574 spots for empirical_3/fastq/MXSA3017_v.sra
Read 4482353 spots for empirical_3/fastq/NI_re.sra
Written 4482353 spots for empirical_3/fastq/NI_re.sra
Read 505504 spots for empirical_3/fastq/SCCU3_v.sra
Written 505504 spots for empirical_3/fastq/SCCU3_v.sra
Read 879229 spots for empirical_3/fastq/TXGR3_v.sra
Written 879229 spots for empirical_3/fastq/TXGR3_v.sra
Read 934386 spots for empirical_3/fastq/TXMD3_v.sra
Written 934386 spots for empirical_3/fastq/TXMD3_v.sra
Read 163427 spots for empirical_3/fastq/TXWV2_v.sra
Written 163427 spots for empirical_3/fastq/TXWV2_v.sra
Read 76783922 spots total
Written 76783922 spots total

This study includes several re-sequenced individuals. We combine them before beginning the analysis.



In [50]:

    
##IPython code
import glob

taxa = [i.split("/")[-1].split('_')[0] for i in glob.glob("empirical_3/fastq/*.gz")]
for taxon in set(taxa):
    if taxa.count(taxon) > 1:
        print taxon, "merged"
        
        ## merge replicate files
        ! cat empirical_3/fastq/$taxon\_v.fastq.gz \
              empirical_3/fastq/$taxon\_re.fastq.gz \
            > empirical_3/fastq/$taxon\_me.fastq.gz
            
        ## remove ind replicate files
        ! rm empirical_3/fastq/$taxon\_v.fastq.gz
        ! rm empirical_3/fastq/$taxon\_re.fastq.gz









    



BJVL19 merged
CRL0001 merged
BJSL25 merged
FLBA140 merged
FLAB109 merged
CRL0030 merged
FLSF54 merged

Make a params file



In [51]:

    
%%bash
pyrad --version









    



pyRAD 3.0.63



In [52]:

    
%%bash
## remove old params file if it exists
rm params.txt 

## create a new default params file
pyrad -n









    



	new params.txt file created



In [21]:

    
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_3/           ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAG                  ## 6. cutters ' params.txt
sed -i '/## 7. /c\20                     ## 7. N processors      ' params.txt
sed -i '/## 9. /c\6                      ## 9. NQual             ' params.txt
sed -i '/## 10./c\.85                    ## 10. clust threshold  ' params.txt
sed -i '/## 12./c\4                      ## 12. MinCov           ' params.txt
sed -i '/## 13./c\10                     ## 13. maxSH            ' params.txt
sed -i '/## 14./c\empirical_3_m4         ## 14. output name      ' params.txt
sed -i '/## 18./c\empirical_3/fastq/*.gz ## 18. data location    ' params.txt
sed -i '/## 29./c\2,2                    ## 29. trim overhang    ' params.txt
sed -i '/## 30./c\p,n,s                  ## 30. output formats   ' params.txt



In [22]:

    
cat params.txt









    



==** parameter inputs for pyRAD version 3.0.63  **======================== affected step ==
empirical_3/           ## 1. working directory 
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAG                  ## 6. cutters 
20                     ## 7. N processors      
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
6                      ## 9. NQual             
.85                    ## 10. clust threshold  
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                      ## 12. MinCov           
10                     ## 13. maxSH            
empirical_3_m4         ## 14. output name      
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
empirical_3/fastq/*.gz ## 18. data location    
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
2,2                    ## 29. trim overhang    
p,n,s                  ## 30. output formats   
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Assemble in pyrad



In [ ]:

    
%%bash
pyrad -p params.txt -s 234567 >> log.txt 2>&1



In [23]:

    
%%bash
sed -i '/## 12./c\2                      ## 12. MinCov           ' params.txt
sed -i '/## 14./c\empirical_3_m2         ## 14. output name      ' params.txt



In [24]:

    
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1

Results

We are interested in the relationship between the amount of input (raw) data between any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating samples.

Raw data amounts

The average number of raw reads per sample is 1.36M.



In [2]:

    
## read in the data
s2dat = pd.read_table("empirical_3/stats/s2.rawedit.txt", header=0, nrows=36)

## print summary stats
print s2dat["passed.total"].describe()

## find which sample has the most raw data
maxraw = s2dat["passed.total"].max()
print "\nmost raw data in sample:"
print s2dat['sample '][s2dat['passed.total']==maxraw]









    



count         36.000000
mean     1955564.750000
std      1594388.931033
min        84505.000000
25%       806803.750000
50%      1050559.000000
75%      3425603.250000
max      5829065.000000
Name: passed.total, dtype: float64

most raw data in sample:
2    BJSL25_me
Name: sample , dtype: object

Look at distributions of coverage

pyrad v.3.0.63 outputs depth information for each sample which I read in here and plot. First let's ask which sample has the highest depth of coverage. The std of coverages is pretty low in this data set compared to several others.



In [9]:

    
## read in the s3 results
s3dat = pd.read_table("empirical_3/stats/s3.clusters.txt", header=0, nrows=39)

## print summary stats
print "summary of means\n=================="
print s3dat['dpt.me'].describe()

## print summary stats
print "\nsummary of std\n=================="
print s3dat['dpt.sd'].describe()

## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s3dat['d>5.tot']/s3dat["total"]).describe()

## find which sample has the greatest depth of retained loci
max_hiprop = (s3dat["d>5.tot"]/s3dat["total"]).max()
print "\nhighest coverage in sample:"
print s3dat['taxa'][s3dat['d>5.tot']/s3dat["total"]==max_hiprop]









    



summary of means
==================
count    36.00000
mean     17.80300
std      13.02276
min       3.31500
25%       9.02300
50%      11.61800
75%      25.41200
max      51.11900
Name: dpt.me, dtype: float64

summary of std
==================
count     36.000000
mean      70.155611
std       59.520107
min        3.718000
25%       28.191000
50%       47.720500
75%       98.996000
max      211.783000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    36.000000
mean      0.427849
std       0.157876
min       0.211693
25%       0.300563
50%       0.440962
75%       0.497383
max       0.860847
dtype: float64

highest coverage in sample:
0    AR_re
Name: taxa, dtype: object



In [11]:

    
import numpy as np
## print mean and std of coverage for the highest coverage sample
with open("empirical_3/clust.85/AR_re.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
    
print depths.mean(), depths.std()









    



37.41198396 122.321191848

Plot the coverage for the sample with highest mean coverage

Green shows the loci that were discarded and orange the loci that were retained. The majority of data were discarded for being too low of coverage.



In [19]:

    
import toyplot
import toyplot.svg
import numpy as np

## read in the depth information for this sample
with open("empirical_3/clust.85/AR_re.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
    
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)", 
                   ylabel="N loci", 
                   label="dataset3/sample=AR_re")

## select the loci with depth > 5 (kept)
keeps = depths[depths>5]

## plot kept and discarded loci
edat = np.histogram(depths, range(30)) # density=True)
kdat = np.histogram(keeps, range(30)) #, density=True)
axes.bars(edat)
axes.bars(kdat)

#toyplot.svg.render(canvas, "empirical_3_depthplot.svg")









    Out[19]:





<toyplot.mark.BarMagnitudes at 0x7f5de591dad0>






    




Save as .csv

Print final stats table



In [25]:

    
cat empirical_3/stats/empirical_3_m4.stats









    




90871       ## loci with > minsp containing data
87969       ## loci with > minsp containing data & paralogs removed
87969       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
AR_re     	54176
BJSB3_v   	38030
BJSL25_me 	61596
BJVL19_me 	60159
BZBB1_v   	35466
CH_re     	51058
CRL0001_me	62704
CRL0030_me	60181
CUCA4_v   	25875
CUMM5_v   	2520
CUSV6_v   	34898
CUVN10_v  	43516
DO_re     	46865
DU_re     	52925
EN_re     	28342
FLAB109_me	53770
FLBA140_me	59903
FLCK18_v  	35834
FLCK216_v 	33674
FLMO62_v  	37431
FLSA185_v 	29650
FLSF33_v  	35589
FLSF47_v  	37411
FLSF54_me 	60404
FLWO6_v   	35292
HE_re     	42687
HNDA09_v  	43262
LALC2_v   	48666
MXED8_v   	41094
MXGT4_v   	47152
MXSA3017_v	40169
NI_re     	44506
SCCU3_v   	18442
TXGR3_v   	34111
TXMD3_v   	37451
TXWV2_v   	5109


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	7085	*	87969
5	5234	*	80884
6	4497	*	75650
7	3579	*	71153
8	3074	*	67574
9	2755	*	64500
10	2629	*	61745
11	2561	*	59116
12	2409	*	56555
13	2434	*	54146
14	2477	*	51712
15	2302	*	49235
16	2464	*	46933
17	2394	*	44469
18	2444	*	42075
19	2517	*	39631
20	2580	*	37114
21	2652	*	34534
22	2775	*	31882
23	2944	*	29107
24	2830	*	26163
25	2960	*	23333
26	3060	*	20373
27	3228	*	17313
28	3093	*	14085
29	3047	*	10992
30	2766	*	7945
31	2285	*	5179
32	1581	*	2894
33	941	*	1313
34	310	*	372
35	59	*	62
36	3	*	3


## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	4738	0	14177	0
1	4963	4963	12086	12086
2	5663	16289	11896	35878
3	6227	34970	10819	68335
4	6540	61130	9289	105491
5	6974	96000	8034	145661
6	6661	135966	6292	183413
7	6482	181340	4703	216334
8	6305	231780	3517	244470
9	5622	282378	2420	266250
10	5119	333568	1656	282810
11	4321	381099	1104	294954
12	3815	426879	772	304218
13	3219	468726	507	310809
14	2572	504734	297	314967
15	2151	536999	165	317442
16	1660	563559	107	319154
17	1261	584996	61	320191
18	998	602960	33	320785
19	753	617267	11	320994
20	559	628447	10	321194
21	402	636889	4	321278
22	287	643203	5	321388
23	205	647918	3	321457
24	134	651134	0	321457
25	112	653934	1	321482
26	73	655832	0	321482
27	64	657560	0	321482
28	33	658484	0	321482
29	14	658890	0	321482
30	16	659370	0	321482
31	7	659587	0	321482
32	3	659683	0	321482
33	5	659848	0	321482
34	6	660052	0	321482
35	2	660122	0	321482
36	1	660158	0	321482
37	1	660195	0	321482
38	0	660195	0	321482
39	1	660234	0	321482
total var= 660234
total pis= 321482
sampled unlinked SNPs= 83231
sampled unlinked bi-allelic SNPs= 51854



In [26]:

    
%%bash
head -n 10 empirical_3/stats/empirical_3_m2.stats









    




145112      ## loci with > minsp containing data
142210      ## loci with > minsp containing data & paralogs removed
142210      ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
AR_re     	58912
BJSB3_v   	39297

Infer ML phylogeny in raxml as an unrooted tree



In [ ]:

    
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
                      -w /home/deren/Documents/RADmissing/empirical_3/ \
                      -n empirical_3_m4 -s empirical_3/outfiles/empirical_3_m4.phy



In [ ]:

    
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
                      -w /home/deren/Documents/RADmissing/empirical_3/ \
                      -n empirical_3_m2 -s empirical_3/outfiles/empirical_3_m2.phy



In [27]:

    
%%bash
head -n 20 empirical_3/RAxML_info.empirical_3_m4









    




This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao         (KIT)
Charlie Taylor    (UF)


Alignment has 612971 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 53.74%

RAxML rapid bootstrapping and subsequent ML search



In [28]:

    
%%bash
head -n 20 empirical_3/RAxML_info.empirical_3_m2









    



head: cannot open ‘empirical_3/RAxML_info.empirical_3_m2’ for reading: No such file or directory

Plot the tree in R using `ape`



In [29]:

    
%load_ext rpy2.ipython



In [30]:

    
%%R -w 400 -h 800
library(ape)
tre <- read.tree("empirical_3/RAxML_bipartitions.empirical_3")
ltre <- ladderize(tre)
plot(ltre, edge.width=2)

Meaure phylo distance (GTRgamma distance)



In [33]:

    
%%R
mean(cophenetic.phylo(ltre))









    





[1] 0.01793173