``````

In [1]:

import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
import matplotlib
import seaborn as sns
import scipy

from IPython.display import HTML

``````

# Assessing the relative impacts of constraints and calibrations on dating accuracy

``````

In [2]:

HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
\$('div.input').hide();
} else {
\$('div.input').show();
}
code_show = !code_show
}
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

``````
``````

Out[2]:

code_show=true;
function code_toggle() {
if (code_show){
\$('div.input').hide();
} else {
\$('div.input').show();
}
code_show = !code_show
}

The raw code for this IPython notebook is by default hidden for easier reading.

``````

## Analysis of 1200 MCMC runs on simulated data.

The simulation was performed on 10 different trees, used to generate alignments. For each tree, we gathered up to 5 calibration points (0, 1, 3, 5), and up to 50 constraints (0, 1, 3, 5, 20, 50). We ran the MCMC in each combination of conditions (=16 combinations), in 5 replicates, changing the nature of the calibrations and constraints. So in total we have \$10~trees*24~conditions*5~replicates=1200~points\$.

# Plot of the 10 chronograms and the 10 trees with altered branch lengths

``````

In [3]:

from ete3 import Tree, TreeStyle, NodeStyle

try:
f=open(file, 'r')
except IOError:
print ("Unknown file: "+file)
sys.exit()

line = ""
for l in f:
line += l.strip()

f.close()
t = Tree( line )
return t

# chronograms:
files = ["AnalysisBEFORE06052019/FirstAttempt/extantTree_1_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_2_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_3_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_4_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_5_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_6_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_7_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_8_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_9_rescaled.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_10_rescaled.dnd"]

ts = TreeStyle()
ts.min_leaf_separation= 10
ts.scale = 2020
ts.show_leaf_name = False
nstyle = NodeStyle()
nstyle["size"] = 0

thickness = 1
vt_line_width =  2
hz_line_width =  2

ts.scale =  thickness * 2400 # 120 pixels per branch length unit
#ts.min_leaf_separation= 10
ts.show_scale = False

trees = list()
i = 0
for f in files:
for n in trees[i].traverse():
n.set_style(nstyle)
trees[i].render("extantTree"+str(i+1)+".png", tree_style=ts)
i = i +1

# rescaled trees:
files = ["AnalysisBEFORE06052019/FirstAttempt/extantTree_1_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_2_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_3_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_4_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_5_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_6_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_7_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_8_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_9_rescaled_altered.dnd","AnalysisBEFORE06052019/FirstAttempt/extantTree_10_rescaled_altered.dnd"]

ts = TreeStyle()
ts.min_leaf_separation= 10
ts.scale = 2020
ts.show_leaf_name = False
nstyle = NodeStyle()
nstyle["size"] = 0

thickness = 1
vt_line_width =  2
hz_line_width =  2

ts.scale =  thickness * 2400 # 120 pixels per branch length unit
#ts.min_leaf_separation= 10
#ts.show_scale = False

trees = list()
i = 0
for f in files:
for n in trees[i].traverse():
n.set_style(nstyle)
trees[i].render("extantTree"+str(i+1)+"_rescaled_altered.png", tree_style=ts)
i = i +1

``````

``````

In [4]:

colNames = ["treeId","numCalib","numCons","numReplicate","correlation", "rmsd", "correlation_bl", "rmsd_bl", "numNodes","numInHPD","fracInHPD","percent0","percent25","percent50","percent75","percent100"]

``````
``````

In [5]:

d.describe()

``````
``````

Out[5]:

text-align: right;
}

text-align: left;
}

.dataframe tbody tr th {
vertical-align: top;
}

treeId
numCalib
numCons
numReplicate
correlation
rmsd
correlation_bl
rmsd_bl
numNodes
numInHPD
fracInHPD
percent0
percent25
percent50
percent75
percent100

count
1400.000000
1400.000000
1400.000000
1400.000000
1400.000000
1400.000000
1400.000000
1400.000000
1400.0
1400.000000
1400.000000
1390.000000
1390.000000
1390.000000
1390.000000
1390.000000

mean
5.500000
2.250000
12.714286
3.000000
0.896741
3.004010
0.671940
2.484593
29.0
16.212857
55.906404
0.922015
2.166397
3.452197
5.118951
7.323804

std
2.873308
1.920973
16.495840
1.414719
0.084683
2.324495
0.174765
1.350331
0.0
5.885042
20.293247
0.766357
1.168117
1.664224
2.480621
3.634441

min
1.000000
0.000000
0.000000
1.000000
0.499728
0.520120
-0.134308
0.403570
29.0
0.000000
0.000000
0.000010
0.180254
0.206052
0.206052
0.206052

25%
3.000000
0.750000
1.000000
2.000000
0.859657
1.674116
0.573120
1.744306
29.0
12.000000
41.379310
0.523212
1.349880
2.332442
3.466733
4.909215

50%
5.500000
2.000000
5.000000
3.000000
0.921131
2.226946
0.704012
2.139378
29.0
17.000000
58.620690
0.791527
1.994972
3.202436
4.579952
6.444696

75%
8.000000
3.500000
20.000000
4.000000
0.958706
3.262030
0.805392
2.694706
29.0
20.000000
68.965517
1.193830
2.783481
4.466631
6.438647
9.037210

max
10.000000
5.000000
50.000000
5.000000
0.996401
15.680831
0.988294
10.224202
29.0
28.000000
96.551724
11.319780
11.462858
11.870524
15.111232
22.906924

``````

## Analysis of the impact of constraints

In the following 6 plots, we investigate the impact of constraints, not controlling for the number of calibrations.

``````

In [6]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["correlation_bl"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["correlation_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Correlation in branch lengths')

``````
``````

Out[6]:

[Text(0,0.5,'Correlation in branch lengths'),
Text(0.5,0,'Number of constraints')]

``````
``````

In [7]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["rmsd_bl"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["rmsd_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='RMSD in branch lengths')

``````
``````

Out[7]:

[Text(0,0.5,'RMSD in branch lengths'), Text(0.5,0,'Number of constraints')]

``````
``````

In [8]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["rmsd"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["rmsd"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='RMSD in node ages')

``````
``````

Out[8]:

[Text(0,0.5,'RMSD in node ages'), Text(0.5,0,'Number of constraints')]

``````
``````

In [9]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["correlation"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["correlation"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Correlation with true dates')

``````
``````

Out[9]:

[Text(0,0.5,'Correlation with true dates'),
Text(0.5,0,'Number of constraints')]

``````
``````

In [10]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["fracInHPD"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["fracInHPD"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Fraction of dates in 95% HPD')

``````
``````

Out[10]:

[Text(0,0.5,'Fraction of dates in 95% HPD'),
Text(0.5,0,'Number of constraints')]

``````
``````

In [11]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCons"], y=d["percent50"], inner=None)
ax = sns.swarmplot(d["numCons"], y=d["percent50"], color="white", edgecolor="gray")
ax.set(xlabel='Number of constraints', ylabel='Median size of the 95% HPD')

``````
``````

Out[11]:

[Text(0,0.5,'Median size of the 95% HPD'), Text(0.5,0,'Number of constraints')]

``````

### Partial conclusion on the above

As the number of constraints increases from 0 to 50, the accuracy of the reconstruction improves a little bit, despite smaller 95%HPD. Using all constraints available, we usually get better results, and smaller 95%HPD.

## Analysis of the impact of calibrations

In the following 6 plots, we investigate the impact of calibrations, not controlling for the number of constraints.

``````

In [12]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["rmsd_bl"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["rmsd_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='RMSD in branch lengths')

``````
``````

Out[12]:

[Text(0,0.5,'RMSD in branch lengths'), Text(0.5,0,'Number of calibrations')]

``````
``````

In [13]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["correlation_bl"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["correlation_bl"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Correlation in branch lengths')

``````
``````

Out[13]:

[Text(0,0.5,'Correlation in branch lengths'),
Text(0.5,0,'Number of calibrations')]

``````
``````

In [14]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["rmsd"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["rmsd"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='RMSD in node ages')

``````
``````

Out[14]:

[Text(0,0.5,'RMSD in node ages'), Text(0.5,0,'Number of calibrations')]

``````
``````

In [15]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["correlation"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["correlation"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Correlation with true dates')

``````
``````

Out[15]:

[Text(0,0.5,'Correlation with true dates'),
Text(0.5,0,'Number of calibrations')]

``````
``````

In [16]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["fracInHPD"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["fracInHPD"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Fraction in 95% HPD')

``````
``````

Out[16]:

[Text(0,0.5,'Fraction in 95% HPD'), Text(0.5,0,'Number of calibrations')]

``````
``````

In [17]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["numCalib"], y=d["percent50"], inner=None)
ax = sns.swarmplot(d["numCalib"], y=d["percent50"], color="white", edgecolor="gray")
ax.set(xlabel='Number of calibrations', ylabel='Median size of the 95% HPD')

``````
``````

Out[17]:

[Text(0,0.5,'Median size of the 95% HPD'),
Text(0.5,0,'Number of calibrations')]

``````

### Partial conclusion on the above

As the number of calibrations increases from 0 to 5, the accuracy of the reconstruction improves, despite smaller 95%HPD. The effect of calibrations seems stronger than that of constraints.

## Let's look at constraints, controlling for calibrations

In the following 10 plots, we separate the impacts of constraints and calibrations.

``````

In [18]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="rmsd_bl", x="numCalib", data=d)
#ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average rmsd in branch lengths')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[18]:

<matplotlib.legend.Legend at 0x7f0c3c455b00>

``````
``````

In [19]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="correlation_bl", x="numCalib", data=d)
#ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average correlation in branch lengths')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[19]:

<matplotlib.legend.Legend at 0x7f0c47c831d0>

``````
``````

In [20]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="rmsd", x="numCalib", data=d)
#ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average rmsd in node ages')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[20]:

<matplotlib.legend.Legend at 0x7f0c3c01c748>

``````
``````

In [21]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="correlation", x="numCalib", data=d)
ax.set_ylim(0.8, 1.0)
ax.set(xlabel='Number of calibrations', ylabel='Average correlation in node ages')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[21]:

<matplotlib.legend.Legend at 0x7f0c3c064080>

``````
``````

In [22]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="fracInHPD", x="numCalib", data=d)
ax.set_ylim(50, 90)
ax.set(xlabel='Number of calibrations', ylabel='Average fraction in 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[22]:

<matplotlib.legend.Legend at 0x7f0c316a09b0>

``````
``````

In [23]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent0", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average smallest 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[23]:

<matplotlib.legend.Legend at 0x7f0c316a0cf8>

``````
``````

In [24]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent25", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average 1st quartile 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[24]:

<matplotlib.legend.Legend at 0x7f0c314f6128>

``````
``````

In [25]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent50", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average median 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[25]:

<matplotlib.legend.Legend at 0x7f0c3c397048>

``````
``````

In [26]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent75", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average 3rd quartile 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[26]:

<matplotlib.legend.Legend at 0x7f0c312f7cc0>

``````
``````

In [27]:

%matplotlib inline
ax = sns.barplot(hue="numCons", y="percent100", x="numCalib", data=d)
ax.set(xlabel='Number of calibrations', ylabel='Average maximum 95% HPD')
# Put the legend out of the figure
handles, labels = ax.get_legend_handles_labels()
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., handles=handles, labels=["0 constraint", "1 constraint", "3 constraints", "5 constraints","10 constraints", "20 constraints","50 constraints"])

``````
``````

Out[27]:

<matplotlib.legend.Legend at 0x7f0c311cf160>

``````

### Partial conclusion on the above

Constraints and calibrations both improve accuracy. The number of constraints does not seem to have much of an effect on the size of the 95% HPD, contrary to calibrations, unless you get to lots of constraints (20, 5à or 100). However, when using large numbers of constraints, results are much improved, although with a lot of variance if there aren't any calibration. It is possible that on some trees convergence was difficult in the fully constrained runs.

## Comparison between trees

In the following 6 plots, we look at the impact of the 10 different trees.

First let's order the trees from good to bad according to the correlation in node ages when all constraints are used:

``````

In [28]:

d_sel = d.loc[d['numCons'] == 100]
print(d_sel.sort_values(['correlation'], ascending=False))

``````
``````

Empty DataFrame
Columns: [treeId, numCalib, numCons, numReplicate, correlation, rmsd, correlation_bl, rmsd_bl, numNodes, numInHPD, fracInHPD, percent0, percent25, percent50, percent75, percent100]
Index: []

``````
``````

In [29]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["treeId"], y=d["rmsd"], inner=None)
ax = sns.swarmplot(d["treeId"], y=d["rmsd"], color="white", edgecolor="gray")
ax.set(xlabel='Tree ID', ylabel='Rmsd in node ages')

``````
``````

Out[29]:

[Text(0,0.5,'Rmsd in node ages'), Text(0.5,0,'Tree ID')]

``````
``````

In [30]:

%matplotlib inline
sns.set_style("whitegrid")
ax = sns.violinplot(d["treeId"], y=d["correlation"], inner=None)
ax = sns.swarmplot(d["treeId"], y=d["correlation"], color="white", edgecolor="gray")
ax.set(xlabel='Tree ID', ylabel='Correlation in node ages')

``````
``````

Out[30]:

[Text(0,0.5,'Correlation in node ages'), Text(0.5,0,'Tree ID')]

``````