Here we want to place n events of transition between an ancestral lifestyle and a convergent lifestyle in a phylogeny.

We want these n events to be independent, not nested. We return them in a format compatible with Bio++ input files for bppseqgen and bppml.

First, we generate a random topology:


In [1]:
from ete3 import Tree
import string
import scipy.stats as stats
import numpy as np

tl = Tree()
# We create a random tree topology
numTips = 20
candidateNames = list(string.ascii_lowercase)
tipNames = candidateNames[0:20]
tl.populate(numTips, names_library=tipNames)
print (tl)

#Alternatively we could read a tree from a file into a string "line", and then use: 
# tl =  Tree( line )


         /-b
      /-|
     |   \-a
   /-|
  |  |   /-t
  |   \-|
  |     |   /-s
  |      \-|
  |        |   /-r
  |         \-|
  |            \-q
--|
  |      /-p
  |   /-|
  |  |  |   /-o
  |  |   \-|
  |  |      \-n
  |  |
  |  |      /-m
  |  |   /-|
   \-|  |  |   /-l
     |  |   \-|
     |  |     |   /-k
     |  |      \-|
     |  |        |   /-j
     |  |         \-|
     |  |           |   /-i
      \-|            \-|
        |              |   /-h
        |               \-|
        |                  \-g
        |
        |      /-f
        |   /-|
        |  |   \-e
         \-|
           |   /-d
            \-|
               \-c

Now, let's number the nodes of the tree.


In [2]:
def reNumberNodes (tl):
    nodeId = 0
    for n in tl.traverse():
        n.add_features(ND=nodeId)
        if n.name=="":
            n.name = str(nodeId)
        nodeId = nodeId + 1
    return

reNumberNodes(tl)


#Writing in NHX format
tl.write(features=['ND'])
print(tl.get_ascii(show_internal=True))


         /-b
      /3|
     |   \-a
   /1|
  |  |   /-t
  |   \4|
  |     |   /-s
  |      \10
  |        |   /-r
  |         \16
  |            \-q
-0|
  |      /-p
  |   /5|
  |  |  |   /-o
  |  |   \12
  |  |      \-n
  |  |
  |  |      /-m
  |  |   /13
   \2|  |  |   /-l
     |  |   \20
     |  |     |   /-k
     |  |      \26
     |  |        |   /-j
     |  |         \32
     |  |           |   /-i
      \6|            \34
        |              |   /-h
        |               \36
        |                  \-g
        |
        |      /-f
        |   /21
        |  |   \-e
         \14
           |   /-d
            \22
               \-c

Now we create a function to place the n events of transition.

We don't care about branch lengths, meaning that we decide that a transition is not more likely on a long branch than on a short one.


In [3]:
# We could use some dynamic programming to be able to generate paths that yield n transitions exactly.
# INstead we randomly generate transitions on the tree until we get the desired number.
# We have two states: ancestral (0) and convergent (1).
# We count the numbers of transitions
def randomTransitions(numTransitions, tree):
    numberOfNodes = len(tree.get_tree_root().get_descendants()) + 1
    rate = float(numTransitions)/float(numberOfNodes)
    ancestralTransition=dict()
    totalNumberOfTransitions = 0
    nodesWithTransitions = list()
    for node in tree.traverse("levelorder"):
        if node.is_root() :
            ancestralTransition[node] = False
        elif ( ancestralTransition[node.up] == True):
            ancestralTransition[node] = True
        else :
            sisterHasAlreadyTransitioned=False
            if ancestralTransition.__contains__(node.get_sisters()[0]): #Here we assume binary trees!
                sisterHasAlreadyTransitioned=True
            #randomly draw whether we do a transition or not
            transitionBool = stats.bernoulli.rvs(rate, size=1)[0] == 1
            if (transitionBool and not sisterHasAlreadyTransitioned):
                ancestralTransition[node] = True
                nodesWithTransitions.append(node)
                totalNumberOfTransitions = totalNumberOfTransitions + 1
            else:
                ancestralTransition[node] = False
    return nodesWithTransitions, totalNumberOfTransitions, ancestralTransition
        
        
        
def placeNTransitionsInTree(numTransitions, tree):
    observedNumTransitions = 2*numTransitions
    nodesWithTransitions = list()
    numTries = 0
    convergentModel = dict()
    while observedNumTransitions != numTransitions and numTries < 100:
        observedNumTransitions = 0
        nodesWithTransitions, observedNumTransitions, convergentModel = randomTransitions(numTransitions, tree)
        print ("Observed Number of Transitions: "+ str(observedNumTransitions ) + " compared to "+ str(numTransitions) + " wanted")
        numTries = numTries + 1
    if numTries < 100:
        for n in nodesWithTransitions:
            print(n.get_ascii())
    else:
        print("It seems like it is too difficult to place "+ str(numTransitions) + " events in this tree.")
    return convergentModel

And now we place the transition events.


In [4]:
convergentModel = dict()
convergentModel = placeNTransitionsInTree(5, tl)


Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 3 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 3 compared to 5 wanted
Observed Number of Transitions: 3 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 3 compared to 5 wanted
Observed Number of Transitions: 3 compared to 5 wanted
Observed Number of Transitions: 0 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 4 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 4 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 1 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 2 compared to 5 wanted
Observed Number of Transitions: 5 compared to 5 wanted

--s

--o

--f

--j

--h

Now we want to output the tree and the command line for bppseqgen and bppml.


In [5]:
# convergentModel is the ouput from thefunction that places transitions
# C1 and C2 are two profile numbers
# Nch is the number of characters
def getBppSeqGenCommandFromNodesWithTransitions(convergentModel, C1, C2, Nch):
    #First, get the nodes with the convergent model and the nodes with the ancestral model.
    nodesWithConvergentModel = list()
    nodesWithAncestralModel = list()
    for k,v in convergentModel.items():
        if v == True:
            nodesWithConvergentModel.append(k.ND)
        if v == False:
            nodesWithAncestralModel.append(k.ND)

    n1="\""+ str(nodesWithAncestralModel[0])
    for n in nodesWithAncestralModel[1:len(nodesWithAncestralModel)-1]:
        n1 += "," + str(n) 
    n1 += "\""
    n2="\""+ str(nodesWithConvergentModel[0])
    for n in nodesWithConvergentModel[1:len(nodesWithConvergentModel)-1]:
        n2 += "," + str(n) 
    n2 += "\""
    #Dummy values:
    command="bppseqgen param=CATseq.bpp mod1Nodes=%s mod2Nodes=%s Nch=%d Ns1=%d Ns2=%d Ne1=%d Ne2=%d"%(n1,n2,Nch,C1,C2,C1,C2)
    return (command)


getBppSeqGenCommandFromNodesWithTransitions(convergentModel, 1, 2, 1000)


Out[5]:
'bppseqgen param=CATseq.bpp mod1Nodes="0,3,32,14,26,2,35,4,34,9,6,36,5,10,13,38,23,24,28,11,7,16,8,30,12,31,29,18,25,19,22,20,21" mod2Nodes="33,15,37,27" Nch=1000 Ns1=1 Ns2=2 Ne1=1 Ne2=2'

We have a problem: the random algorithm above does not work well for large numbers of convergent events: it needs to do a large number of trials and errors to get something that works, and often fails. Therefore we need to use another algorithm.

Improved algorithm


In [6]:
# we load a small tree, for testing purpose
t = Tree('((((H,K)D,(F,I)G)B,E)A,((L,(N,Q)O)J,(P,S)M)C);', format=1)

print(t.write(format=1))
print(t.get_ascii(show_internal=True))


#for node in t.traverse("levelorder"):
  # Do some analysis on node
  #print (node.name)


((((H:1,K:1)D:1,(F:1,I:1)G:1)B:1,E:1)A:1,((L:1,(N:1,Q:1)O:1)J:1,(P:1,S:1)M:1)C:1);

            /-H
         /D|
        |   \-K
      /B|
     |  |   /-F
   /A|   \G|
  |  |      \-I
  |  |
  |   \-E
--|
  |      /-L
  |   /J|
  |  |  |   /-N
  |  |   \O|
   \C|      \-Q
     |
     |   /-P
      \M|
         \-S

In [7]:
def getCherries(leaves, tree):
    length = len(leaves)
    dist=list()
    leaveslist = list(leaves)
    for i in range(length-1):
        for j in range(i+1,length):
            di = tree.get_distance(leaveslist[i].name, leaveslist[j].name, topology_only=True)
            dist.append([di, leaveslist[i].name, leaveslist[j].name])
    #print(dist)
    cherries = list()
    for d in dist:
        if (d[0] == 1.0):
            cherries.append(d)
    return (cherries)

def placeTransitionOnBranchIfSisterHasNotTransitioned (node, ancestralTransition, nodesWithTransitions):
    if node.is_leaf():
        if ancestralTransition.__contains__(node.get_sisters()[0]) and ancestralTransition[node.get_sisters()[0]]==True:
            print("Problem: sister of node "+node.name+" has transitioned!")
            return
        ancestralTransition[ node ] = True
        nodesWithTransitions.append(node)
        print ("Adding transition on leaf node "+node.name)
        return
    descen = node.get_descendants()
    SisterHasTransitioned = True
    converg = 0
    while (not SisterHasTransitioned):
        converg = np.random.randint( 0, high=len(descen) )
        if ancestralTransition.__contains__(descen[converg].get_sisters()[0] and ancestralTransition[descen[converg].get_sisters()[0]]==True):
            pass
        else:
            SisterHasTransitioned = False
    ancestralTransition[ descen[converg] ] = True
    nodesWithTransitions.append(descen[converg])
    print ("Adding transition on node "+node.name)
    return


def randomTransitionsWithHypergeometricDistribution(numTransitions, tree):
    numberOfNodes = len(tree.get_tree_root().get_descendants()) + 1
    rate = float(numTransitions)/float(numberOfNodes)
    node2leaves = tree.get_cached_content()
    #Computing the maximum number of transitions possible
    numLeaves = len(node2leaves[tree.get_tree_root()])
    cherries = getCherries(node2leaves[tree.get_tree_root()], tree)
    #print("cherries: "+str(cherries))
    maxNumTrans = numLeaves - len(cherries)
    print("Maximum number of transitions: "+str(maxNumTrans))
    if maxNumTrans < numTransitions:
        print ("Sorry, we cannot fit "+str(numTransitions)+ " in this tree, which can only accommodate "+str(maxNumTrans)+" transitions.")
    # Now, we want to annotate all nodes with the number of available underlying branches.
    listOfCherryPartners = list()
    for c in cherries:
        listOfCherryPartners.append( c[1] ) 
    #print("listOfCherryPartners: "+str(listOfCherryPartners))
    for node in tree.traverse("levelorder"):
        numLea = len(node2leaves[node])
        numCherries = 0
        for n in node2leaves[node]:
            if n.name in listOfCherryPartners:
                numCherries = numCherries +1
        node.add_feature("numberOfAvailableBranches", numLea - numCherries)
        #print(node.name + " : "+ str(node.numberOfAvailableBranches))
    ancestralTransition=dict()
    for node in tree.traverse("levelorder"):
        ancestralTransition[node] = False
    nodesWithTransitions = list()
    #Now we traverse the tree from the root, and at each node choose 
    #how many transitions we place in the right and left subtrees
    tree.get_tree_root().add_feature("underlyingNumTransitions", numTransitions)
    for node in tree.traverse("preorder"):
        if (not node.is_leaf()):
            rightChild = node.children[0]
            leftChild = node.children[1]
            if node.underlyingNumTransitions <= 1:
                    rightChild.add_feature("underlyingNumTransitions", 
                                           1)
                    leftChild.add_feature("underlyingNumTransitions", 
                                          1)
            else:
                numRight = np.random.hypergeometric(rightChild.numberOfAvailableBranches, 
                                       leftChild.numberOfAvailableBranches, 
                                       node.underlyingNumTransitions)
                numLeft = node.underlyingNumTransitions - numRight
                rightChild.add_feature("underlyingNumTransitions", 
                                       numRight)
                leftChild.add_feature("underlyingNumTransitions", 
                                      numLeft)
                if (numRight == 1):
                    #We randomly place the transition in one of the branches of the right subtree
                    placeTransitionOnBranchIfSisterHasNotTransitioned (rightChild, 
                                                                       ancestralTransition, 
                                                                       nodesWithTransitions)
                if (numLeft == 1):
                    #We randomly place the transition in one of the branches of the left subtree
                    placeTransitionOnBranchIfSisterHasNotTransitioned (leftChild, 
                                                                       ancestralTransition, 
                                                                       nodesWithTransitions)
    print ("\t\tTotal number of transitions placed in the tree: "+str(len(nodesWithTransitions)))
    return nodesWithTransitions, ancestralTransition

Test of the function: 10 times we try to insert 5 transition in tree t.


In [8]:
for i in range (10):
    n,a = randomTransitionsWithHypergeometricDistribution(5, t)


Maximum number of transitions: 6
Adding transition on node D
Adding transition on node G
Adding transition on node M
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on node B
Adding transition on leaf node E
Adding transition on node M
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on leaf node E
Adding transition on node D
Adding transition on node G
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on node B
Adding transition on leaf node E
Adding transition on node M
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on node B
Adding transition on leaf node E
Adding transition on node M
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on leaf node E
Adding transition on node D
Adding transition on node G
Adding transition on node J
Adding transition on node M
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on node D
Adding transition on node G
Adding transition on node M
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on node B
Adding transition on leaf node E
Adding transition on node M
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on leaf node E
Adding transition on node D
Adding transition on node G
Adding transition on node J
Adding transition on node M
		Total number of transitions placed in the tree: 5
Maximum number of transitions: 6
Adding transition on node B
Adding transition on leaf node E
Adding transition on node M
Adding transition on leaf node L
Adding transition on node O
		Total number of transitions placed in the tree: 5

Entire function to place convergent events in a tree and output the parameters for bppsuite.


In [9]:
def placeTransitionsAndGetBppSeqGenCommand(numTransitions, tree, C1, C2, Nch):
    reNumberNodes(tree) 
    nodesWithTransitions,convergentModel = randomTransitionsWithHypergeometricDistribution(numTransitions, tree)
    return( getBppSeqGenCommandFromNodesWithTransitions(convergentModel, C1, C2, Nch) )

# We try the function on the large tree simulated at the beginning
placeTransitionsAndGetBppSeqGenCommand(10, tl, 1, 2, 1000)


Maximum number of transitions: 14
Adding transition on node 3
Adding transition on leaf node t
Adding transition on node 10
Adding transition on node 5
Adding transition on node 14
Adding transition on leaf node m
Adding transition on leaf node l
Adding transition on leaf node j
Adding transition on leaf node i
Adding transition on node 36
		Total number of transitions placed in the tree: 10
Out[9]:
'bppseqgen param=CATseq.bpp mod1Nodes="0,3,32,14,26,2,4,34,6,36,5,10,13,38,23,27,24,28,16,8,17,30,12,31,29,18,22,20" mod2Nodes="33,35,9,15,37,11,7,25,19" Nch=1000 Ns1=1 Ns2=2 Ne1=1 Ne2=2'

In [ ]: