In [1]:
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
import matplotlib
import seaborn as sns
import scipy


from IPython.display import HTML

Assessing the relative impacts of constraints and calibrations on dating accuracy


In [2]:
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')


Out[2]:
The raw code for this IPython notebook is by default hidden for easier reading. To toggle on/off the raw code, click here.

Analysis of 1200 MCMC runs on simulated data.

The simulation was performed on 10 different trees, used to generate alignments. For each tree, we gathered up to 5 calibration points (0, 1, 3, 5), and up to 50 constraints (0, 1, 3, 5, 20, 50). We ran the MCMC in each combination of conditions (=16 combinations), in 5 replicates, changing the nature of the calibrations and constraints. So in total we have $10~trees*24~conditions*5~replicates=1200~points$.

Plot of the 10 chronograms and the 10 trees with altered branch lengths


In [3]:
from ete3 import Tree, TreeStyle, NodeStyle

def readTreeFromFile(file):
    try:
        f=open(file, 'r')
    except IOError:
        print ("Unknown file: "+file)
        sys.exit()

    line = ""
    for l in f:
        line += l.strip()
    
    f.close()
    t = Tree( line )
    return t

# chronograms:
files = ["AnalysisBEFORE06052019/FirstAttempt/extantTree_1_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_2_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_3_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_4_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_5_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_6_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_7_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_8_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_9_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_10_rescaled.dnd"]

ts = TreeStyle()
ts.min_leaf_separation= 10
ts.scale = 2020
ts.show_leaf_name = False
nstyle = NodeStyle()
nstyle["size"] = 0

thickness = 1
vt_line_width =  2
hz_line_width =  2

ts.scale =  thickness * 2400 # 120 pixels per branch length unit
#ts.min_leaf_separation= 10
ts.show_scale = False

trees = list()
i = 0
for f in files:
    trees.append(readTreeFromFile(f))
    for n in trees[i].traverse():
       n.set_style(nstyle)
    trees[i].render("extantTree"+str(i+1)+".png", tree_style=ts)
    i = i +1

    
    
# rescaled trees:
files = ["AnalysisBEFORE06052019/FirstAttempt/extantTree_1_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_2_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_3_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_4_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_5_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_6_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_7_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_8_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_9_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_10_rescaled_altered.dnd"]

ts = TreeStyle()
ts.min_leaf_separation= 10
ts.scale = 2020 
ts.show_leaf_name = False
nstyle = NodeStyle()
nstyle["size"] = 0

thickness = 1
vt_line_width =  2
hz_line_width =  2

ts.scale =  thickness * 2400 # 120 pixels per branch length unit
#ts.min_leaf_separation= 10
#ts.show_scale = False

trees = list()
i = 0
for f in files:
    trees.append(readTreeFromFile(f))
    for n in trees[i].traverse():
       n.set_style(nstyle)
    trees[i].render("extantTree"+str(i+1)+"_rescaled_altered.png", tree_style=ts)
    i = i +1

Reading the data


In [4]:
colNames = ["treeId","numCalib","numCons","numReplicate","correlation", "rmsd", "correlation_bl", "rmsd_bl", "numNodes","numInHPD","fracInHPD","percent0","percent25","percent50","percent75","percent100"]
d = pd.read_csv ("resultAllTrees.txt", sep="\t", header=None, names=colNames)

In [5]:
d.describe()


Out[5]:
treeId numCalib numCons numReplicate correlation rmsd correlation_bl rmsd_bl numNodes numInHPD fracInHPD percent0 percent25 percent50 percent75 percent100
count 1400.000000 1400.000000 1400.000000 1400.000000 1400.000000 1400.000000 1400.000000 1400.000000 1400.0 1400.000000 1400.000000 1390.000000 1390.000000 1390.000000 1390.000000 1390.000000
mean 5.500000 2.250000 12.714286 3.000000 0.896741 3.004010 0.671940 2.484593 29.0 16.212857 55.906404 0.922015 2.166397 3.452197 5.118951 7.323804
std 2.873308 1.920973 16.495840 1.414719 0.084683 2.324495 0.174765 1.350331 0.0 5.885042 20.293247 0.766357 1.168117 1.664224 2.480621 3.634441
min 1.000000 0.000000 0.000000 1.000000 0.499728 0.520120 -0.134308 0.403570 29.0 0.000000 0.000000 0.000010 0.180254 0.206052 0.206052 0.206052
25% 3.000000 0.750000 1.000000 2.000000 0.859657 1.674116 0.573120 1.744306 29.0 12.000000 41.379310 0.523212 1.349880 2.332442 3.466733 4.909215
50% 5.500000 2.000000 5.000000 3.000000 0.921131 2.226946 0.704012 2.139378 29.0 17.000000 58.620690 0.791527 1.994972 3.202436 4.579952 6.444696
75% 8.000000 3.500000 20.000000 4.000000 0.958706 3.262030 0.805392 2.694706 29.0 20.000000 68.965517 1.193830 2.783481 4.466631 6.438647 9.037210
max 10.000000 5.000000 50.000000 5.000000 0.996401 15.680831 0.988294 10.224202 29.0 28.000000 96.551724 11.319780 11.462858 11.870524 15.111232 22.906924

Analysis of the impact of constraints

In the following 6 plots, we investigate the impact of constraints, not controlling for the number of calibrations.


In [6]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["correlation_bl"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["correlation_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Correlation in branch lengths')


Out[6]:
[Text(0,0.5,'Correlation in branch lengths'),
 Text(0.5,0,'Number of constraints')]

In [7]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["rmsd_bl"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["rmsd_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='RMSD in branch lengths')


Out[7]:
[Text(0,0.5,'RMSD in branch lengths'), Text(0.5,0,'Number of constraints')]

In [8]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["rmsd"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["rmsd"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='RMSD in node ages')


Out[8]:
[Text(0,0.5,'RMSD in node ages'), Text(0.5,0,'Number of constraints')]

In [9]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["correlation"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["correlation"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Correlation with true dates')


Out[9]:
[Text(0,0.5,'Correlation with true dates'),
 Text(0.5,0,'Number of constraints')]

In [10]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["fracInHPD"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["fracInHPD"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Fraction of dates in 95% HPD')


Out[10]:
[Text(0,0.5,'Fraction of dates in 95% HPD'),
 Text(0.5,0,'Number of constraints')]

In [11]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["percent50"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["percent50"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Median size of the 95% HPD')


Out[11]:
[Text(0,0.5,'Median size of the 95% HPD'), Text(0.5,0,'Number of constraints')]

Partial conclusion on the above

As the number of constraints increases from 0 to 50, the accuracy of the reconstruction improves a little bit, despite smaller 95%HPD. Using all constraints available, we usually get better results, and smaller 95%HPD.

Analysis of the impact of calibrations

In the following 6 plots, we investigate the impact of calibrations, not controlling for the number of constraints.


In [12]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["rmsd_bl"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["rmsd_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='RMSD in branch lengths')


Out[12]:
[Text(0,0.5,'RMSD in branch lengths'), Text(0.5,0,'Number of calibrations')]

In [13]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["correlation_bl"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["correlation_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Correlation in branch lengths')


Out[13]:
[Text(0,0.5,'Correlation in branch lengths'),
 Text(0.5,0,'Number of calibrations')]

In [14]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["rmsd"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["rmsd"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='RMSD in node ages')


Out[14]:
[Text(0,0.5,'RMSD in node ages'), Text(0.5,0,'Number of calibrations')]

In [15]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["correlation"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["correlation"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Correlation with true dates')


Out[15]:
[Text(0,0.5,'Correlation with true dates'),
 Text(0.5,0,'Number of calibrations')]

In [16]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["fracInHPD"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["fracInHPD"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Fraction in 95% HPD')


Out[16]:
[Text(0,0.5,'Fraction in 95% HPD'), Text(0.5,0,'Number of calibrations')]

In [17]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["percent50"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["percent50"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Median size of the 95% HPD')


Out[17]:
[Text(0,0.5,'Median size of the 95% HPD'),
 Text(0.5,0,'Number of calibrations')]

Partial conclusion on the above

As the number of calibrations increases from 0 to 5, the accuracy of the reconstruction improves, despite smaller 95%HPD. The effect of calibrations seems stronger than that of constraints.

Let's look at constraints, controlling for calibrations

In the following 10 plots, we separate the impacts of constraints and calibrations.


In [18]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="rmsd_bl", x="numCalib", data=d)
#ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average rmsd in branch lengths')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[18]:
<matplotlib.legend.Legend at 0x7f0c3c455b00>

In [19]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="correlation_bl", x="numCalib", data=d)
#ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average correlation in branch lengths')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[19]:
<matplotlib.legend.Legend at 0x7f0c47c831d0>

In [20]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="rmsd", x="numCalib", data=d)
#ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average rmsd in node ages')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[20]:
<matplotlib.legend.Legend at 0x7f0c3c01c748>

In [21]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="correlation", x="numCalib", data=d)
ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average correlation in node ages')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[21]:
<matplotlib.legend.Legend at 0x7f0c3c064080>

In [22]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="fracInHPD", x="numCalib", data=d)
ax.set_ylim(50, 90)
ax.set(xlabel='Number of calibrations', ylabel='Average fraction in 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[22]:
<matplotlib.legend.Legend at 0x7f0c316a09b0>

In [23]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent0", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average smallest 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[23]:
<matplotlib.legend.Legend at 0x7f0c316a0cf8>

In [24]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent25", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average 1st quartile 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[24]:
<matplotlib.legend.Legend at 0x7f0c314f6128>

In [25]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent50", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average median 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[25]:
<matplotlib.legend.Legend at 0x7f0c3c397048>

In [26]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent75", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average 3rd quartile 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[26]:
<matplotlib.legend.Legend at 0x7f0c312f7cc0>

In [27]:
%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent100", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average maximum 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])


Out[27]:
<matplotlib.legend.Legend at 0x7f0c311cf160>

Partial conclusion on the above

Constraints and calibrations both improve accuracy. The number of constraints does not seem to have much of an effect on the size of the 95% HPD, contrary to calibrations, unless you get to lots of constraints (20, 5à or 100). However, when using large numbers of constraints, results are much improved, although with a lot of variance if there aren't any calibration. It is possible that on some trees convergence was difficult in the fully constrained runs.

Comparison between trees

In the following 6 plots, we look at the impact of the 10 different trees.

First let's order the trees from good to bad according to the correlation in node ages when all constraints are used:


In [28]:
d_sel = d.loc[d['numCons'] == 100]
print(d_sel.sort_values(['correlation'], ascending=False))


Empty DataFrame
Columns: [treeId, numCalib, numCons, numReplicate, correlation, rmsd, correlation_bl, rmsd_bl, numNodes, numInHPD, fracInHPD, percent0, percent25, percent50, percent75, percent100]
Index: []

In [29]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["treeId"], y=d["rmsd"], inner=None)
ax = sns.swarmplot(d["treeId"], y=d["rmsd"], color="white", edgecolor="gray")
ax.set(xlabel='Tree ID', ylabel='Rmsd in node ages')


Out[29]:
[Text(0,0.5,'Rmsd in node ages'), Text(0.5,0,'Tree ID')]

In [30]:
%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["treeId"], y=d["correlation"], inner=None)
ax = sns.swarmplot(d["treeId"], y=d["correlation"], color="white", edgecolor="gray")
ax.set(xlabel='Tree ID', ylabel='Correlation in node ages')


Out[30]:
[Text(0,0.5,'Correlation in node ages'), Text(0.5,0,'Tree ID')]