Notebook 4:

This is an IPython notebook. Most of the code is composed of bash scripts, indicated by %%bash at the top of the cell, otherwise it is IPython code. This notebook includes code to download, assemble and analyze a published RADseq data set.



In [34]:

    
### Notebook 4
### Data set 4 (Orestias)
### Authors: Takahashi & Moreno (2015)
### Data Location: DDBJ DRA DRA003595

Download the sequence data

Sequence data for this study are archived on the NCBI sequence read archive (SRA). Below I read in SraRunTable.txt for this project which contains all of the information we need to download the data.

Project DRA: DRA003595
Study: DRP002750
Experiments: DRX033006 -- DRX033069
Samples: DRS020928 -- DRS020991
SRA link: http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=DRA003595
Publication address: doi:10.1016/j.ympev.2015.08.012



In [4]:

    
%%bash
## make a new directory for this analysis
mkdir -p empirical_4/fastq/



In [16]:

    
import os

def wget_download_ddbj(SRR, outdir):
    """ Python function to get sra data from ncbi and write to
    outdir with a new name using bash call wget """
    
    ## create a call string 
    call = "wget -q -r -nH --cut-dirs=9 -P "+outdir+" "+\
           "ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sra/ByExp/"+\
           "sra/DRX/DRX033/DRX033{:03d}".format(SRR)
    
    ## run wget call 
    ! $call

Here we pass the SRR number and the sample name to the wget_download function so that the files are saved. In this case we do not have the sample names, just their SRR IDs.



In [17]:

    
for ID in range(6,70):
    wget_download_ddbj(ID, "empirical_4/fastq/")



In [18]:

    
%%bash
## convert sra files to fastq using fastq-dump tool
## output as gzipped into the fastq directory
fastq-dump --gzip -O empirical_4/fastq/  empirical_4/fastq/*.sra

## remove .sra files
rm empirical_4/fastq/*.sra









    



Read 3850280 spots for empirical_4/fastq/DRR036765.sra
Written 3850280 spots for empirical_4/fastq/DRR036765.sra
Read 5135835 spots for empirical_4/fastq/DRR036766.sra
Written 5135835 spots for empirical_4/fastq/DRR036766.sra
Read 3812927 spots for empirical_4/fastq/DRR036767.sra
Written 3812927 spots for empirical_4/fastq/DRR036767.sra
Read 4279789 spots for empirical_4/fastq/DRR036768.sra
Written 4279789 spots for empirical_4/fastq/DRR036768.sra
Read 4238415 spots for empirical_4/fastq/DRR036769.sra
Written 4238415 spots for empirical_4/fastq/DRR036769.sra
Read 4897736 spots for empirical_4/fastq/DRR036770.sra
Written 4897736 spots for empirical_4/fastq/DRR036770.sra
Read 4110217 spots for empirical_4/fastq/DRR036771.sra
Written 4110217 spots for empirical_4/fastq/DRR036771.sra
Read 4381798 spots for empirical_4/fastq/DRR036772.sra
Written 4381798 spots for empirical_4/fastq/DRR036772.sra
Read 3692455 spots for empirical_4/fastq/DRR036773.sra
Written 3692455 spots for empirical_4/fastq/DRR036773.sra
Read 4146148 spots for empirical_4/fastq/DRR036774.sra
Written 4146148 spots for empirical_4/fastq/DRR036774.sra
Read 5296764 spots for empirical_4/fastq/DRR036775.sra
Written 5296764 spots for empirical_4/fastq/DRR036775.sra
Read 3916368 spots for empirical_4/fastq/DRR036776.sra
Written 3916368 spots for empirical_4/fastq/DRR036776.sra
Read 5076822 spots for empirical_4/fastq/DRR036777.sra
Written 5076822 spots for empirical_4/fastq/DRR036777.sra
Read 3986174 spots for empirical_4/fastq/DRR036778.sra
Written 3986174 spots for empirical_4/fastq/DRR036778.sra
Read 3652015 spots for empirical_4/fastq/DRR036779.sra
Written 3652015 spots for empirical_4/fastq/DRR036779.sra
Read 4887011 spots for empirical_4/fastq/DRR036780.sra
Written 4887011 spots for empirical_4/fastq/DRR036780.sra
Read 4027193 spots for empirical_4/fastq/DRR036781.sra
Written 4027193 spots for empirical_4/fastq/DRR036781.sra
Read 5385172 spots for empirical_4/fastq/DRR036782.sra
Written 5385172 spots for empirical_4/fastq/DRR036782.sra
Read 5128952 spots for empirical_4/fastq/DRR036783.sra
Written 5128952 spots for empirical_4/fastq/DRR036783.sra
Read 4593914 spots for empirical_4/fastq/DRR036784.sra
Written 4593914 spots for empirical_4/fastq/DRR036784.sra
Read 2334669 spots for empirical_4/fastq/DRR036785.sra
Written 2334669 spots for empirical_4/fastq/DRR036785.sra
Read 4626890 spots for empirical_4/fastq/DRR036786.sra
Written 4626890 spots for empirical_4/fastq/DRR036786.sra
Read 3888993 spots for empirical_4/fastq/DRR036787.sra
Written 3888993 spots for empirical_4/fastq/DRR036787.sra
Read 4512205 spots for empirical_4/fastq/DRR036788.sra
Written 4512205 spots for empirical_4/fastq/DRR036788.sra
Read 4174220 spots for empirical_4/fastq/DRR036789.sra
Written 4174220 spots for empirical_4/fastq/DRR036789.sra
Read 767653 spots for empirical_4/fastq/DRR036790.sra
Written 767653 spots for empirical_4/fastq/DRR036790.sra
Read 4019423 spots for empirical_4/fastq/DRR036791.sra
Written 4019423 spots for empirical_4/fastq/DRR036791.sra
Read 3286802 spots for empirical_4/fastq/DRR036792.sra
Written 3286802 spots for empirical_4/fastq/DRR036792.sra
Read 4601860 spots for empirical_4/fastq/DRR036793.sra
Written 4601860 spots for empirical_4/fastq/DRR036793.sra
Read 5436565 spots for empirical_4/fastq/DRR036794.sra
Written 5436565 spots for empirical_4/fastq/DRR036794.sra
Read 4528702 spots for empirical_4/fastq/DRR036795.sra
Written 4528702 spots for empirical_4/fastq/DRR036795.sra
Read 4328453 spots for empirical_4/fastq/DRR036796.sra
Written 4328453 spots for empirical_4/fastq/DRR036796.sra
Read 4845789 spots for empirical_4/fastq/DRR036797.sra
Written 4845789 spots for empirical_4/fastq/DRR036797.sra
Read 4231660 spots for empirical_4/fastq/DRR036798.sra
Written 4231660 spots for empirical_4/fastq/DRR036798.sra
Read 4610513 spots for empirical_4/fastq/DRR036799.sra
Written 4610513 spots for empirical_4/fastq/DRR036799.sra
Read 5079930 spots for empirical_4/fastq/DRR036800.sra
Written 5079930 spots for empirical_4/fastq/DRR036800.sra
Read 4435752 spots for empirical_4/fastq/DRR036801.sra
Written 4435752 spots for empirical_4/fastq/DRR036801.sra
Read 3428835 spots for empirical_4/fastq/DRR036802.sra
Written 3428835 spots for empirical_4/fastq/DRR036802.sra
Read 1880356 spots for empirical_4/fastq/DRR036803.sra
Written 1880356 spots for empirical_4/fastq/DRR036803.sra
Read 5300107 spots for empirical_4/fastq/DRR036804.sra
Written 5300107 spots for empirical_4/fastq/DRR036804.sra
Read 2912768 spots for empirical_4/fastq/DRR036805.sra
Written 2912768 spots for empirical_4/fastq/DRR036805.sra
Read 4221013 spots for empirical_4/fastq/DRR036806.sra
Written 4221013 spots for empirical_4/fastq/DRR036806.sra
Read 4787167 spots for empirical_4/fastq/DRR036807.sra
Written 4787167 spots for empirical_4/fastq/DRR036807.sra
Read 4955937 spots for empirical_4/fastq/DRR036808.sra
Written 4955937 spots for empirical_4/fastq/DRR036808.sra
Read 4351758 spots for empirical_4/fastq/DRR036809.sra
Written 4351758 spots for empirical_4/fastq/DRR036809.sra
Read 2199538 spots for empirical_4/fastq/DRR036810.sra
Written 2199538 spots for empirical_4/fastq/DRR036810.sra
Read 4750069 spots for empirical_4/fastq/DRR036811.sra
Written 4750069 spots for empirical_4/fastq/DRR036811.sra
Read 4544226 spots for empirical_4/fastq/DRR036812.sra
Written 4544226 spots for empirical_4/fastq/DRR036812.sra
Read 4825518 spots for empirical_4/fastq/DRR036813.sra
Written 4825518 spots for empirical_4/fastq/DRR036813.sra
Read 5517492 spots for empirical_4/fastq/DRR036814.sra
Written 5517492 spots for empirical_4/fastq/DRR036814.sra
Read 3299739 spots for empirical_4/fastq/DRR036815.sra
Written 3299739 spots for empirical_4/fastq/DRR036815.sra
Read 4464593 spots for empirical_4/fastq/DRR036816.sra
Written 4464593 spots for empirical_4/fastq/DRR036816.sra
Read 4873706 spots for empirical_4/fastq/DRR036817.sra
Written 4873706 spots for empirical_4/fastq/DRR036817.sra
Read 3321824 spots for empirical_4/fastq/DRR036818.sra
Written 3321824 spots for empirical_4/fastq/DRR036818.sra
Read 3625781 spots for empirical_4/fastq/DRR036819.sra
Written 3625781 spots for empirical_4/fastq/DRR036819.sra
Read 3779872 spots for empirical_4/fastq/DRR036820.sra
Written 3779872 spots for empirical_4/fastq/DRR036820.sra
Read 5404611 spots for empirical_4/fastq/DRR036821.sra
Written 5404611 spots for empirical_4/fastq/DRR036821.sra
Read 5276329 spots for empirical_4/fastq/DRR036822.sra
Written 5276329 spots for empirical_4/fastq/DRR036822.sra
Read 3720213 spots for empirical_4/fastq/DRR036823.sra
Written 3720213 spots for empirical_4/fastq/DRR036823.sra
Read 4306471 spots for empirical_4/fastq/DRR036824.sra
Written 4306471 spots for empirical_4/fastq/DRR036824.sra
Read 4862198 spots for empirical_4/fastq/DRR036825.sra
Written 4862198 spots for empirical_4/fastq/DRR036825.sra
Read 4332795 spots for empirical_4/fastq/DRR036826.sra
Written 4332795 spots for empirical_4/fastq/DRR036826.sra
Read 4589896 spots for empirical_4/fastq/DRR036827.sra
Written 4589896 spots for empirical_4/fastq/DRR036827.sra
Read 4980042 spots for empirical_4/fastq/DRR036828.sra
Written 4980042 spots for empirical_4/fastq/DRR036828.sra
Read 272718918 spots total
Written 272718918 spots total

Make a params file



In [19]:

    
%%bash
pyrad --version









    



pyRAD 3.0.63



In [20]:

    
%%bash
## delete existing params file if it exits
rm params.txt

## create a new default params file
pyrad -n









    



	new params.txt file created



In [10]:

    
%%bash
## substitute new parameters into file
sed -i '/## 1. /c\empirical_4/           ## 1. working directory ' params.txt
sed -i '/## 6. /c\TGCAGG                 ## 6. cutters ' params.txt
sed -i '/## 7. /c\30                     ## 7. N processors      ' params.txt
sed -i '/## 9. /c\6                      ## 9. NQual             ' params.txt
sed -i '/## 10./c\.85                    ## 10. clust threshold  ' params.txt
sed -i '/## 12./c\4                      ## 12. MinCov           ' params.txt
sed -i '/## 13./c\10                     ## 13. maxSH            ' params.txt
sed -i '/## 14./c\empirical_4_m4         ## 14. output name      ' params.txt
sed -i '/## 18./c\empirical_4/fastq/*.gz ## 18. data location    ' params.txt
sed -i '/## 29./c\2,2                    ## 29. trim overhang    ' params.txt
sed -i '/## 30./c\p,n,s                  ## 30. output formats   ' params.txt



In [11]:

    
cat params.txt









    



==** parameter inputs for pyRAD version 3.0.63  **======================== affected step ==
empirical_4/           ## 1. working directory 
./*.fastq.gz              ## 2. Loc. of non-demultiplexed files (if not line 18)  (s1)
./*.barcodes              ## 3. Loc. of barcode file (if not line 18)             (s1)
vsearch                   ## 4. command (or path) to call vsearch (or usearch)    (s3,s6)
muscle                    ## 5. command (or path) to call muscle                  (s3,s7)
TGCAGG                 ## 6. cutters 
30                     ## 7. N processors      
6                         ## 8. Mindepth: min coverage for a cluster              (s4,s5)
6                      ## 9. NQual             
.85                    ## 10. clust threshold  
rad                       ## 11. Datatype: rad,gbs,pairgbs,pairddrad,(others:see docs)(all)
4                      ## 12. MinCov           
10                     ## 13. maxSH            
empirical_4_m4         ## 14. output name      
==== optional params below this line ===================================  affected step ==
                       ## 15.opt.: select subset (prefix* only selector)            (s2-s7)
                       ## 16.opt.: add-on (outgroup) taxa (list or prefix*)         (s6,s7)
                       ## 17.opt.: exclude taxa (list or prefix*)                   (s7)
empirical_4/fastq/*.gz ## 18. data location    
                       ## 19.opt.: maxM: N mismatches in barcodes (def= 1)          (s1)
                       ## 20.opt.: phred Qscore offset (def= 33)                    (s2)
                       ## 21.opt.: filter: def=0=NQual 1=NQual+adapters. 2=strict   (s2)
                       ## 22.opt.: a priori E,H (def= 0.001,0.01, if not estimated) (s5)
                       ## 23.opt.: maxN: max Ns in a cons seq (def=5)               (s5)
                       ## 24.opt.: maxH: max heterozyg. sites in cons seq (def=5)   (s5)
                       ## 25.opt.: ploidy: max alleles in cons seq (def=2;see docs) (s4,s5)
                       ## 26.opt.: maxSNPs: (def=100). Paired (def=100,100)         (s7)
                       ## 27.opt.: maxIndels: within-clust,across-clust (def. 3,99) (s3,s7)
                       ## 28.opt.: random number seed (def. 112233)              (s3,s6,s7)
2,2                    ## 29. trim overhang    
p,n,s                  ## 30. output formats   
                       ## 31.opt.: maj. base call at depth>x<mindepth (def.x=mindepth) (s5)
                       ## 32.opt.: keep trimmed reads (def=0). Enter min length.    (s2)
                       ## 33.opt.: max stack size (int), def= max(500,mean+2*SD)    (s3)
                       ## 34.opt.: minDerep: exclude dereps with <= N copies, def=1 (s3)
                       ## 35.opt.: use hierarchical clustering (def.=0, 1=yes)      (s6)
                       ## 36.opt.: repeat masking (def.=1='dust' method, 0=no)      (s3,s6)
                       ## 37.opt.: vsearch max threads per job (def.=6; see docs)   (s3,s6)
==== optional: list group/clade assignments below this line (see docs) ==================

Assemble in pyrad



In [ ]:

    
pyrad -p params.txt -s 234567 >> log.txt 2>&1



In [16]:

    
%%bash
sed -i '/## 12./c\2                      ## 12. MinCov           ' params.txt
sed -i '/## 14./c\empirical_4_m2         ## 14. output name      ' params.txt



In [17]:

    
%%bash
pyrad -p params.txt -s 7 >> log.txt 2>&1

Results

We are interested in the relationship between the amount of input (raw) data between any two samples, the average coverage they recover when clustered together, and the phylogenetic distances separating samples.

Raw data amounts

The average number of raw reads per sample is 1.36M.



In [15]:

    
## import data frame 
import pandas as pd

## read in the data
s4dat = pd.read_table("empirical_4/stats/s2.rawedit.txt", header=0, nrows=65)

## print summary stats
print s4dat["passed.total"].describe()

## find which sample has the most raw data
maxraw = s4dat["passed.total"].max()
print "\nmost raw data in sample:"
print s4dat['sample '][s4dat['passed.total']==maxraw]









    



count         64.000000
mean     3959338.812500
std       811579.026685
min       731712.000000
25%      3635737.000000
50%      4054918.000000
75%      4540098.750000
max      5074496.000000
Name: passed.total, dtype: float64

most raw data in sample:
10    DRR036775
Name: sample , dtype: object

Look at distributions of coverage

pyrad v.3.0.63 outputs depth information for each sample which I read in here and plot. First let's ask which sample has the highest depth of coverage.



In [18]:

    
## read in the s3 results
s4dat = pd.read_table("empirical_4/stats/s3.clusters.txt", header=0, nrows=65)

## print summary stats
print "summary of means\n=================="
print s4dat['dpt.me'].describe()

## print summary stats
print "\nsummary of std\n=================="
print s4dat['dpt.sd'].describe()

## print summary stats
print "\nsummary of proportion lowdepth\n=================="
print pd.Series(1-s4dat['d>5.tot']/s4dat["total"]).describe()

## find which sample has the greatest depth of retained loci
max_hiprop = (s4dat["d>5.tot"]/s4dat["total"]).max()
print "\nhighest coverage in sample:"
print s4dat['taxa'][s4dat['d>5.tot']/s4dat["total"]==max_hiprop]









    



summary of means
==================
count     64.000000
mean     106.000750
std       17.392284
min       40.235000
25%       98.422750
50%      107.650000
75%      118.500750
max      131.455000
Name: dpt.me, dtype: float64

summary of std
==================
count     64.000000
mean     289.564375
std      140.245059
min       95.184000
25%      205.293000
50%      235.668500
75%      325.144000
max      757.646000
Name: dpt.sd, dtype: float64

summary of proportion lowdepth
==================
count    64.000000
mean      0.053360
std       0.010158
min       0.038625
25%       0.046790
50%       0.051447
75%       0.058441
max       0.104888
dtype: float64

highest coverage in sample:
10    DRR036775
Name: taxa, dtype: object

Plot the coverage for the sample with highest mean coverage

Green shows the loci that were discarded and orange the loci that were retained. The majority of data were discarded for being too low of coverage.



In [9]:

    
import toyplot
import toyplot.svg
import numpy as np

## read in the depth information for this sample
with open("empirical_4/clust.85/DRR036775.depths", 'rb') as indat:
    depths = np.array(indat.read().strip().split(","), dtype=int)
    
## make a barplot in Toyplot
canvas = toyplot.Canvas(width=350, height=300)
axes = canvas.axes(xlabel="Depth of coverage (N reads)", 
                   ylabel="N loci", 
                   label="dataset4/sample=DRR036775")

## select the loci with depth > 5 (kept)
keeps = depths[depths>5]

## plot kept and discarded loci
edat = np.histogram(depths, range(30)) # density=True)
kdat = np.histogram(keeps, range(30)) #, density=True)
axes.bars(edat)
axes.bars(kdat)

#toyplot.svg.render(canvas, "empirical_4_depthplot.svg")









    Out[9]:





<toyplot.mark.BarMagnitudes at 0x7ff380951310>






    




Save as .csv

Print final stats table



In [30]:

    
cat empirical_4/stats/empirical_4_m4.stats









    




46277       ## loci with > minsp containing data
42877       ## loci with > minsp containing data & paralogs removed
42877       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
DRR036765	27950
DRR036766	30223
DRR036767	29273
DRR036768	28909
DRR036769	28879
DRR036770	29524
DRR036771	28375
DRR036772	28756
DRR036773	27906
DRR036774	28927
DRR036775	31633
DRR036776	27658
DRR036777	30722
DRR036778	28349
DRR036779	27420
DRR036780	29359
DRR036781	28123
DRR036782	29928
DRR036783	29606
DRR036784	28864
DRR036785	25495
DRR036786	28829
DRR036787	29177
DRR036788	28432
DRR036789	28598
DRR036790	13855
DRR036791	28826
DRR036792	27106
DRR036793	30117
DRR036794	30466
DRR036795	29279
DRR036796	28378
DRR036797	29160
DRR036798	28070
DRR036799	29059
DRR036800	29404
DRR036801	28522
DRR036802	27407
DRR036803	22502
DRR036804	29971
DRR036805	26894
DRR036806	28787
DRR036807	30322
DRR036808	29587
DRR036809	28472
DRR036810	24144
DRR036811	29457
DRR036812	29349
DRR036813	28553
DRR036814	29899
DRR036815	27081
DRR036816	27908
DRR036817	30888
DRR036818	27955
DRR036819	28459
DRR036820	28671
DRR036821	30179
DRR036822	30425
DRR036823	28337
DRR036824	28810
DRR036825	29466
DRR036826	28469
DRR036827	28876
DRR036828	29248


## nloci = number of loci with data for exactly ntaxa
## ntotal = number of loci for which at least ntaxa have data
ntaxa	nloci	saved	ntotal
1	-
2	-		-
3	-		-
4	562	*	42877
5	496	*	42315
6	485	*	41819
7	517	*	41334
8	627	*	40817
9	910	*	40190
10	1361	*	39280
11	1620	*	37919
12	1031	*	36299
13	343	*	35268
14	184	*	34925
15	160	*	34741
16	171	*	34581
17	212	*	34410
18	192	*	34198
19	192	*	34006
20	200	*	33814
21	187	*	33614
22	215	*	33427
23	199	*	33212
24	221	*	33013
25	248	*	32792
26	243	*	32544
27	225	*	32301
28	276	*	32076
29	233	*	31800
30	227	*	31567
31	245	*	31340
32	295	*	31095
33	273	*	30800
34	293	*	30527
35	314	*	30234
36	254	*	29920
37	322	*	29666
38	349	*	29344
39	352	*	28995
40	373	*	28643
41	397	*	28270
42	416	*	27873
43	446	*	27457
44	521	*	27011
45	558	*	26490
46	723	*	25932
47	828	*	25209
48	875	*	24381
49	1021	*	23506
50	1062	*	22485
51	1072	*	21423
52	957	*	20351
53	892	*	19394
54	1047	*	18502
55	1189	*	17455
56	1476	*	16266
57	1890	*	14790
58	2200	*	12900
59	2433	*	10700
60	2672	*	8267
61	2464	*	5595
62	1846	*	3131
63	1019	*	1285
64	266	*	266


## nvar = number of loci containing n variable sites (pis+autapomorphies).
## sumvar = sum of variable sites (SNPs).
## pis = number of loci containing n parsimony informative sites.
## sumpis = sum of parsimony informative sites.
	nvar	sumvar	PIS	sumPIS
0	4922	0	12420	0
1	5533	5533	8723	8723
2	5742	17017	7273	23269
3	5663	34006	5269	39076
4	5184	54742	3641	53640
5	4499	77237	2239	64835
6	3409	97691	1263	72413
7	2451	114848	733	77544
8	1691	128376	401	80752
9	1177	138969	279	83263
10	739	146359	199	85253
11	537	152266	138	86771
12	358	156562	88	87827
13	263	159981	70	88737
14	169	162347	46	89381
15	135	164372	28	89801
16	84	165716	24	90185
17	86	167178	7	90304
18	47	168024	14	90556
19	50	168974	3	90613
20	34	169654	2	90653
21	23	170137	6	90779
22	18	170533	8	90955
23	19	170970	1	90978
24	13	171282	1	91002
25	8	171482	0	91002
26	5	171612	0	91002
27	7	171801	0	91002
28	2	171857	0	91002
29	1	171886	1	91031
30	2	171946	0	91031
31	1	171977	0	91031
32	3	172073	0	91031
33	0	172073	0	91031
34	1	172107	0	91031
35	0	172107	0	91031
36	0	172107	0	91031
37	0	172107	0	91031
38	0	172107	0	91031
39	0	172107	0	91031
40	0	172107	0	91031
41	0	172107	0	91031
42	0	172107	0	91031
43	1	172150	0	91031
total var= 172150
total pis= 91031
sampled unlinked SNPs= 37955
sampled unlinked bi-allelic SNPs= -30031



In [19]:

    
%%bash
head -n 20 empirical_4/stats/empirical_4_m2.stats









    




48546       ## loci with > minsp containing data
45146       ## loci with > minsp containing data & paralogs removed
45146       ## loci with > minsp containing data & paralogs removed & final filtering

## number of loci recovered in final data set for each taxon.
taxon	nloci
DRR036765	28070
DRR036766	30376
DRR036767	29411
DRR036768	29051
DRR036769	29022
DRR036770	29587
DRR036771	28435
DRR036772	28812
DRR036773	27965
DRR036774	28996
DRR036775	31778
DRR036776	27701

Infer ML phylogeny in raxml as an unrooted tree



In [ ]:

    
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
                      -w /home/deren/Documents/RADmissing/empirical_4/ \
                      -n empirical_4_m4 -s empirical_4/outfiles/empirical_4_m4.phy



In [ ]:

    
%%bash
## raxml argumement w/ ...
raxmlHPC-PTHREADS-AVX -f a -m GTRGAMMA -N 100 -x 12345 -p 12345 -T 20 \
                      -w /home/deren/Documents/RADmissing/empirical_4/ \
                      -n empirical_4_m2 -s empirical_4/outfiles/empirical_4_m2.phy



In [2]:

    
%%bash
head -n 40 empirical_4/RAxML_info.empirical_4









    




This is RAxML version 8.0.16 released by Alexandros Stamatakis on March 21 2014.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)
David Dao         (KIT)
Charlie Taylor    (UF)


Alignment has 339249 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 33.82%

RAxML rapid bootstrapping and subsequent ML search

Using 1 distinct models/data partitions with joint branch length optimization



Executing 100 rapid bootstrap inferences and thereafter a thorough ML search 

All free model parameters will be estimated by RAxML
GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter

GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units

Partition: 0
Alignment Patterns: 339249
Name: No Name Provided
DataType: DNA
Substitution Matrix: GTR

Plot the tree in R using `ape`

The backbone of the ingroup taxa has very low support across the radiation. The same result was found in the original paper (see Fig. 4 and Supplemental Fig S1 of Takahashi et al). Below we plot the full tree and a zoomed in tree of just the ingroup taxa.



In [21]:

    
%load_ext rpy2.ipython



In [27]:

    
%%R -w 1000 -h 800
library(ape)
tre <- read.tree("empirical_4/RAxML_bipartitions.empirical_4")
ltre <- ladderize(tre)
outgroups = c("DRR036791", "DRR036765", "DRR036790", "DRR036767", "DRR036769",
              "DRR036766", "DRR036777", "DRR036793", "DRR036778", "DRR036778", 
              "DRR036792", "DRR036768", "DRR036775")
rtre <- root(ltre, outgroups)
ingrouptre <- drop.tip(ltre, outgroups)    
par(mfrow=c(1,2))

plot(ltre, edge.width=2)
nodelabels(ltre$node.label, cex=1)
plot(ingrouptre, edge.width=2)
nodelabels(ingrouptre$node.label, cex=1)



In [28]:

    
%%R
mean(cophenetic.phylo(ltre))









    





[1] 0.009676341

An ultrametric tree for plotting



In [33]:

    
%%R -h 700
utre <- ladderize(chronopl(rtre, 0.5, resolve.root=TRUE))
plot(utre, edge.width=2)