In [1]:
import ete2
import numpy as np
import matplotlib.pyplot as plt
import re
%pylab inline
In [2]:
nw = ("((((((801940:0.16748,3825327:0.36897):0.07831,1928988:0.27799)'p__Crenarchaeota':0.04405,"
"(4455990:0.14403,1128285:0.14461):0.16704):0.01855,823009:0.3881):0.01587,"
"(3770699:0.60035,426860:0.28962)'c__[Parvarchaea]':0.07841)'k__Archaea':0.31772,"
"((3761685:0.18326,4423155:0.5605):0.07543,(((((4391683:0.26125,(4336814:0.35428,"
"(((3190878:0.36049,4452949:0.08994):0.12795,4251079:0.19473):0.02384,"
"(2107103:0.54203,4363563:0.46236):0.06044):0.20722):0.05703):0.08253,229854:0.24026):0.02393,"
"4459468:0.37727):0.01293,696036:0.63811):0.0212,"
"(3779572:0.1464,4363260:0.11356)0.081:0.30451):0.09868):0.15861);")
First, here are the branch lengths, which we pull out with a regular expression.
In [3]:
lengths = map(float,[e[1:] for e in re.findall(':\d\.\d+', nw)])
lengths.sort()
print lengths
print len(lengths), np.min(lengths), np.median(lengths), np.max(lengths)
_ = plt.hist(lengths)
Next, let's parse the newick file. Note that I'm purposefully using the default format here, because this is what I did in my experiment and it gave me misleading results.
In [4]:
t = ete2.Tree(nw)
t.describe()
list(t.traverse("postorder"))
Out[4]:
There are four "branch lengths" that ete2 can't interpret. These are the ones that contain named internal nodes such as 'p__Crenarchaeota':0.04405
. Note that 0.04405
isn't in the list of branch lengths below - it's instead replaced by 1.0
. This is a silent failure that can give very misleading results.
In [5]:
lengths = [n.dist for n in t.traverse("postorder")]
lengths.sort()
print lengths
print len(lengths), np.min(lengths), np.median(lengths), np.max(lengths)
_ = plt.hist(lengths)
If we instead use the correct format, the results are somewhat better...
In [6]:
t = ete2.Tree(nw, format=1)
list(t.traverse("postorder"))
Out[6]:
In [7]:
lengths = [n.dist for n in t.traverse("postorder")]
lengths.sort()
print lengths
print len(lengths), np.min(lengths), np.median(lengths), np.max(lengths)
_ = plt.hist(lengths)
But note that there are still two 1.0 branch lengths that are not in the input tree. There is a weird node name (0.081:0.30451
) but ete2 does seem to be getting those right. Again, the 1.0
branch lengths can give very misleading results.