SuchLinkedTrees

In the last article, we saw how to use SuchTree to probe the topology of very large trees. In this article, we're going to look at the other component of the package, SuchLinkedTrees.

If you are interested in studying how two groups of organisms interact (or, rather, have interacted over evolutionary time), you will find yourself with two trees of distinct groups of taxa that are linked by a matrix of interaction observations. This is sometimes called a 'dueling trees' problem.

If the trees happen to have the same number of taxa, and the interaction matrix happens to be a unit matrix, then you can compute the distance matrix for each of your trees and use the Mantel test to compare them. However, this is a pretty special case. Hommola et al. describe a method extends the Mantel test in this paper here :

This is implemented in scikit-bio as hommola_cospeciation. Unfortunately, the version in scikit-bio does not scale to very large trees, and does not expose the computed distances for analysis. This is where SuchLinkedTrees can help.


In [1]:
%pylab inline
%config InlineBackend.figure_format='retina'

from SuchTree import SuchTree, SuchLinkedTrees
import seaborn
import pandas
from scipy.cluster.hierarchy import ClusterWarning
from scipy.stats import pearsonr

warnings.simplefilter( 'ignore', UserWarning )


Populating the interactive namespace from numpy and matplotlib

To get started, we need to initialize two trees and a table of observations linking the taxa on the two trees.


In [2]:
T1 = SuchTree( 'data/bigtrees/host.tree' )
T2 = SuchTree( 'data/bigtrees/guest.tree')
LK = pandas.read_csv( 'data/bigtrees/links.csv', index_col=0 )

print( 'host tree taxa     : %d' % T1.n_leafs )
print( 'guest tree taxa    : %d' % T2.n_leafs )
print( 'observation matrix : %d x %d' % LK.shape )


host tree taxa     : 14
guest tree taxa    : 103446
observation matrix : 14 x 103446

To create a SuchLinkedTrees instance, you need two SuchTrees and a pandas DataFrame, where the taxa in the first tree matches the DataFrame index, and the taxa in the second tree matches the DataFrame columns.

This is a pretty large dataset, so it takes a bit of time to load.


In [3]:
%time SLT = SuchLinkedTrees( T1, T2, LK )


CPU times: user 8min, sys: 340 ms, total: 8min 1s
Wall time: 8min 4s

In [4]:
n_links = sum(LK.apply(sum))
print( 'total observations : %d' % n_links )
print( 'observation pairs  : %d' % int( ( n_links * ( n_links - 1 ) ) / 2 ) )


total observations : 44904
observation pairs  : 1008162156

To test for cospeciation, Hommola's method does the following :

  • calculate the patristic distance between the host taxa from the two observations
  • calculate the patristic distance between the guest taxa from the two observations
  • calculate the Pearson's correlation of the distance measures

Then, to calculate the significance of the correlation, it randomly permutes the observation table and recalculates the distances and correlations. A significance measure (a $p$ value) is estimated based on how likely the correlation measure on unpermuted observations could belong to the set of correlation measures on permuted observations.

For each correlation measure, we'd have to do calculate 1,008,162,156 patristic distances through each of the two trees. To calculate the significance, we would then need to permute the observations and then repeat the process about 50 times. That's 100,816,215,600 tree traversals!

How long would that take? In our previous example, we benchmarked 1,000,000 distance calculations at about 14 seconds on a single thread. For this dataset, one correlation measure would require about a thousand times as many lookups, so it should have a run time of about four hours. With the significance test, that would be a little more than one CPU-week. I suppose that's not impossible, but for large datasets like this, we probably don't need an exhaustive search of every possible pair of observations to get a fairly accurate correlation measure.

So, we'ere going to use SuchLinkedTrees.sample_linked_distances(), which returns a representative sample of distances. It does this by filling a user-specified number of buckets (default : 64) with distances between randomly chosen observations. It stops when the standard deviation of the standard deviation of the buckets falls bellow sigma (default : 0.001).


In [5]:
%time result = SLT.sample_linked_distances( sigma=0.001, buckets=64, n=4096 )


CPU times: user 39.5 s, sys: 40 ms, total: 39.6 s
Wall time: 39.8 s

In [6]:
result


Out[6]:
{'TreeA': array([0.264332  , 0.21065001, 0.264911  , ..., 0.34968701, 0.265522  ,
        0.264911  ]),
 'TreeB': array([0.02564294, 0.31347001, 0.98371613, ..., 0.28150666, 0.17121513,
        0.20058692]),
 'n_pairs': 1008162156,
 'n_samples': 3145728,
 'deviation_a': 0.00039790052687749267,
 'deviation_b': 0.0009455526014789939}

In [7]:
print( 'sampled link pairs    : %d' % len(result['TreeA']) )
print( 'Pearson\'s correlation : r=%f, p=%f' % pearsonr( result['TreeA'],
                                                         result['TreeB'] ) )


sampled link pairs    : 3145728
Pearson's correlation : r=0.016714, p=0.000000

Not too bad. The algorithm went through ten iterations, placing ten blocks of 4096 pairs into each bucket before it converged on our stopping condition after testing 2,621,440 pairs (about 0.2% of the possible pairs). Note that the $p$-value we see here is not Hommola's $p$-value -- it doesn't include any information about the topologies of the trees.

Let's see what the distribution of sampled distances looks like.


In [8]:
df = pandas.DataFrame( { 'microbe tree distances' : result['TreeA'], 
                         'host tree distances'    : result['TreeB'] } )

seaborn.jointplot( 'microbe tree distances', 'host tree distances',
                   data=df, alpha=0.3, size=8 )


Out[8]:
<seaborn.axisgrid.JointGrid at 0x7f07801c86d8>

Well... that's... something? In this case, we are looking at the entire microbiome of a complex of 14 host species that's about 10-20 million years old. Because bacteria and archaea are older than that, we don't expect to see a meaningful pattern of coevolution at the scale of the whole microbial community.

If we're looking for coevolution, we want to examine clades within the microbial community. This is what SuchLinkedTrees really designed to do.

SuchLinkedTrees has two functions that allow you to examine individual clades. subset_a() takes the node id of an internal node of the first tree (usually, the host organisms), and masks the data within the SuchLinkedTrees instance so that that node behaves as the root. subset_b() does the same for the second tree (usually, the guest organisms).


In [9]:
SLT.subset_b_size


Out[9]:
103446

The observations are also masked so that distance calculations are constrained to within that clade. The masking operation is extremely efficient, even for very large datasets.


In [10]:
SLT.subset_b(121)
SLT.subset_b_leafs


Out[10]:
array([122, 124,   6,   4,   0,   2,  18, 118, 120,   8,  14,  16,  10,
        12,  32,  34, 116,  26,  28,  30,  36,  86, 102, 104, 110, 112,
       114,  24,  38,  40,  84,  88,  90,  92,  94, 100, 106, 108,  20,
        22,  42,  44,  78,  96,  98,  76,  80,  82,  52,  70,  46,  72,
        74,  48,  50,  54,  56,  58,  60,  62,  64,  66,  68])

So, all we need to do is iterate over the internal nodes of the microbe tree (which we can get from SuchTree's get_internal_nodes() function), subset the guest tree to that node, and apply Hommola's algorithm to the masked SuchLinkedTrees instance.

I'm going to put some simple constrains based on clade size. You could also use the average or total tree depth for each clade. It takes about an hour to finish all 103,445 clades, so let's look at a random sample of 10,000 of them.


In [11]:
from pyprind import ProgBar

warnings.simplefilter( 'ignore', RuntimeWarning )

N = len( T2.get_internal_nodes() )
progbar = ProgBar( N, title='Chugging through microbime data...' )

data = []
for n,nodeid in enumerate( T2.get_internal_nodes() ) :
    
    SLT.subset_b( nodeid )
    progbar.update()
    if SLT.subset_b_size < 10 :
        continue
    if SLT.subset_n_links > 2500 :
        continue

    d = {}
    d['name']    = 'clade_' + str(nodeid)
    d['n_links'] = SLT.subset_n_links
    d['n_leafs'] = SLT.subset_b_size
    
    ld = SLT.linked_distances()
    
    d['r'], d['p'] = pearsonr( ld['TreeA'], ld['TreeB'] ) 
    
    data.append( d )

data = pandas.DataFrame( data ).dropna()


Chugging through microbime data...
0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:59:38

Let's see what we've got!


In [12]:
data.head()


Out[12]:
n_leafs n_links name p r
0 105 46 clade_125 2.459121e-04 -0.113734
1 319 123 clade_159421 1.014772e-03 -0.037933
2 63 18 clade_121 8.836506e-02 -0.138239
3 42 28 clade_127 1.432212e-02 -0.125883
4 242 87 clade_158329 1.748205e-07 -0.085288

In [13]:
seaborn.jointplot( data.n_leafs, data.r, alpha=0.3, size=8 )


Out[13]:
<seaborn.axisgrid.JointGrid at 0x7f07803ff128>

Are there any clades that are big enough to to be interesting that show a significant correlation above 0.6?


In [14]:
data.loc[ ( data.r > 0.6      ) &
          ( data.n_leafs > 10 ) &
          ( data.n_links > 15 ) &
          ( data.p < 0.01     ) ]


Out[14]:
n_leafs n_links name p r
14799 108 22 clade_26971 5.982235e-33 0.681951
14999 103 22 clade_27163 5.982235e-33 0.681951
15196 96 20 clade_27145 3.844922e-27 0.680072
15395 87 17 clade_27061 1.621913e-19 0.676366

Cool. Let's go back and look at these in more detail.


In [15]:
SLT.subset_b( 26971 )

ld = SLT.linked_distances()

seaborn.jointplot( ld['TreeA'], ld['TreeB'] )


Out[15]:
<seaborn.axisgrid.JointGrid at 0x7f0780928160>

Huh. Well, that looks a lot less interesting than I hoped. This is the problem with correlation measures -- they don't test that the data obeys their assumptions. In this case, we're using Pierson's $r$, which assumes that the data from the two sources is normally distributed, which this clearly is not. If you haven't seen this before, check out Anscombe's quartet; the gist of his argument is that it's not a good idea to apply any statistic without examining the data graphically.

Let's have a look at the trees so we can get a better idea of why this is broken. Unfortunately, I don't have a great way of pulling out the subtree for plotting yet, so this will require some help from dendropy.


In [16]:
from dendropy import Tree
from tempfile import NamedTemporaryFile

tmpfile1 = NamedTemporaryFile()
tmpfile2 = NamedTemporaryFile()

# invert the taxa : node_id map
# FIXME : I need a better interface for this, suggestions welcome

sfeal = dict( zip( SLT.TreeB.leafs.values(), SLT.TreeB.leafs.keys() ) )

subset_taxa = [ sfeal[i] for i in SLT.subset_b_leafs ]

guest_tree = Tree.get_from_path( 'data/bigtrees/guest.tree',
                                 schema='newick',
                                 preserve_underscores=True ) # Newick is the worst

subset_tree = guest_tree.extract_tree_with_taxa_labels( subset_taxa )
subset_tree.write_to_path( tmpfile1.name, schema='newick' )

LK[ subset_taxa ].to_csv( tmpfile2.name )

In [17]:
%load_ext rpy2.ipython

cladepath = tmpfile1.name
linkpath = tmpfile2.name

outpath = 'clade_26971.svg'

In [18]:
%%R -i cladepath -i linkpath -i outpath -w 800 -h 800 -u px

library("phytools")
library("igraph")

tr1 <- read.tree( "data/bigtrees/host.tree" )
tr2 <- read.tree( cladepath )
links <- read.csv( linkpath, row.names=1, stringsAsFactors = F )
im <- graph_from_incidence_matrix( as.matrix( links ) )
assoc <- as_edgelist( im )
obj <- cophylo( tr1, tr2, assoc=assoc )
svg( outpath, width = 10, height = 12 )
plot( obj )


Rotating nodes to optimize matching...
Done.

Now that we've gotten through the plotting hampsterdance, we can have a look at the structure of this clade and its relationship with the host organisms :

If we're hoping to find an example of coevolution, this is an excellent example of what we are not looking for! The Hommola test is not really appropriate for this application. The Hommola test is really intended for cases where you have something that looks like it might be an example of coevolution, and you would like to measure how strong the effect is. We are abusing it somewhat by asking it to distinguish coevolution from a background of other kinds of interactions, to say nothing of ignoring multiple testing effects.

So, we'll need a more sophisticated way to test for coevolution. With SuchTree and SuchLinkedTrees handling the grunt work, we can focus on the those models.