In the last article, we saw how to use SuchTree to probe the topology of very large trees. In this article, we're going to look at the other component of the package, SuchLinkedTrees.
If you are interested in studying how two groups of organisms interact (or, rather, have interacted over evolutionary time), you will find yourself with two trees of distinct groups of taxa that are linked by a matrix of interaction observations. This is sometimes called a 'dueling trees' problem.
If the trees happen to have the same number of taxa, and the interaction matrix happens to be a unit matrix, then you can compute the distance matrix for each of your trees and use the Mantel test to compare them. However, this is a pretty special case. Hommola et al. describe a method that extends the Mantel test to arbitrary interaction matrixes in their 2009 paper, A Permutation Test of Host-Parasite Cospeciation.
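For the one-to-one special case, here's a minimal sketch of the Mantel approach, assuming you've already computed the two patristic distance matrixes as square arrays with rows and columns in the same taxon order (D1 and D2 are hypothetical names) :

from skbio.stats.distance import mantel

# D1 and D2 are hypothetical, precomputed patristic distance matrixes
# with matching taxon order ; mantel() permutes one of them to estimate
# the significance of the correlation between the two
r, p, n = mantel( D1, D2, method='pearson', permutations=999 )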
Hommola's method is implemented in scikit-bio as hommola_cospeciation. Unfortunately, the scikit-bio version does not scale to very large trees, and does not expose the computed distances for analysis. This is where SuchLinkedTrees can help.
In [1]:
%pylab inline
%config InlineBackend.figure_format='retina'
from SuchTree import SuchTree, SuchLinkedTrees
import seaborn
import pandas
from scipy.cluster.hierarchy import ClusterWarning
from scipy.stats import pearsonr
import warnings
# ClusterWarning is a subclass of UserWarning, so this silences it too
warnings.simplefilter( 'ignore', UserWarning )
To get started, we need to initialize two trees and a table of observations linking the taxa on the two trees.
In [2]:
T1 = SuchTree( 'data/bigtrees/host.tree' )
T2 = SuchTree( 'data/bigtrees/guest.tree')
LK = pandas.read_csv( 'data/bigtrees/links.csv', index_col=0 )
print( 'host tree taxa : %d' % T1.n_leafs )
print( 'guest tree taxa : %d' % T2.n_leafs )
print( 'observation matrix : %d x %d' % LK.shape )
To create a SuchLinkedTrees instance, you need two SuchTrees and a pandas DataFrame, where the taxa in the first tree match the DataFrame's index, and the taxa in the second tree match the DataFrame's columns.
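Here's a tiny, made-up illustration of the required layout (the taxon names are hypothetical, not part of this dataset) :

toy_links = pandas.DataFrame( [ [ 1, 0 ],
                                [ 0, 3 ] ],
                              index   = [ 'host_A', 'host_B' ],     # leaf names in the first tree
                              columns = [ 'guest_X', 'guest_Y' ] )  # leaf names in the second tree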
Our actual dataset is pretty large, though, so it takes a bit of time to load.
In [3]:
%time SLT = SuchLinkedTrees( T1, T2, LK )
In [4]:
n_links = LK.values.sum()
print( 'total observations : %d' % n_links )
print( 'observation pairs : %d' % int( ( n_links * ( n_links - 1 ) ) / 2 ) )
To test for cospeciation, Hommola's method does the following : for every pair of host-guest interactions, it calculates the patristic distance between the two hosts through the host tree and between the two guests through the guest tree, and then computes Pearson's correlation between the two resulting sets of distances.
Then, to calculate the significance of the correlation, it randomly permutes the observation table and recalculates the distances and correlations. A significance measure (a $p$ value) is estimated from how likely it is that the correlation measured on the unpermuted observations could have been drawn from the set of correlation measures on the permuted observations.
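Here's a rough sketch of that permutation scheme, assuming precomputed patristic distance matrixes (host_dist and guest_dist are hypothetical names, and this is not the scikit-bio implementation) :

import numpy
from scipy.stats import pearsonr

def hommola_sketch( links, host_dist, guest_dist, n_perms=50 ) :
    # links : matrix of observed interactions ( hosts x guests )
    hosts, guests = numpy.nonzero( links )
    def correlation( h, g ) :
        # patristic distances between every pair of observations
        i, j = numpy.triu_indices( len(h), k=1 )
        return pearsonr( host_dist[ h[i], h[j] ],
                         guest_dist[ g[i], g[j] ] )[0]
    r_obs = correlation( hosts, guests )
    # permute which taxa the observations attach to, and recalculate
    r_perm = [ correlation( numpy.random.permutation( hosts ),
                            numpy.random.permutation( guests ) )
               for _ in range( n_perms ) ]
    return r_obs, numpy.mean( numpy.array( r_perm ) >= r_obs )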
For each correlation measure, we'd have to calculate 1,008,162,156 patristic distances through each of the two trees. To calculate the significance, we would then need to permute the observations and repeat the process about 50 times. That's 100,816,215,600 tree traversals!
How long would that take? In our previous example, we benchmarked 1,000,000 distance calculations at about 14 seconds on a single thread. For this dataset, one correlation measure would require about a thousand times as many lookups, so it should have a run time of about four hours. With the significance test, that would be a little more than one CPU-week. I suppose that's not impossible, but for large datasets like this, we probably don't need an exhaustive search of every possible pair of observations to get a fairly accurate correlation measure.
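Here's the back-of-the-envelope arithmetic, using the 14 seconds per 1,000,000 distances benchmark from the previous article :

pairs = ( n_links * ( n_links - 1 ) ) / 2   # 1,008,162,156 observation pairs
hours = pairs / 1e6 * 14 / 3600             # about 3.9 hours per correlation
days  = hours * 50 / 24                     # about 8 CPU-days with 50 permutations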
So, we're going to use SuchLinkedTrees.sample_linked_distances(), which returns a representative sample of distances. It does this by filling a user-specified number of buckets (default : 64) with distances between randomly chosen observations. It stops when the standard deviation of the buckets' standard deviations falls below sigma (default : 0.001).
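To illustrate the idea, here's a toy sketch of that convergence criterion (the names here are illustrative, not SuchLinkedTrees internals) :

import numpy

def sample_until_stable( draw_block, buckets=64, n=4096, sigma=0.001 ) :
    # draw_block( n ) stands in for sampling n random pair distances
    samples = [ [] for _ in range( buckets ) ]
    while True :
        for bucket in samples :
            bucket.extend( draw_block( n ) )
        # stop when the buckets' standard deviations agree with each other
        if numpy.std( [ numpy.std( bucket ) for bucket in samples ] ) < sigma :
            return numpy.concatenate( samples )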
In [5]:
%time result = SLT.sample_linked_distances( sigma=0.001, buckets=64, n=4096 )
In [6]:
result
Out[6]:
In [7]:
print( 'sampled link pairs : %d' % len(result['TreeA']) )
print( 'Pearson\'s correlation : r=%f, p=%f' % pearsonr( result['TreeA'],
                                                         result['TreeB'] ) )
Not too bad. The algorithm went through ten iterations, placing ten blocks of 4096 pairs into each bucket before it converged on our stopping condition after testing 2,621,440 pairs (about 0.2% of the possible pairs). Note that the $p$-value we see here is not Hommola's $p$-value -- it doesn't include any information about the topologies of the trees.
Let's see what the distribution of sampled distances looks like.
In [8]:
df = pandas.DataFrame( { 'microbe tree distances' : result['TreeA'],
                         'host tree distances'    : result['TreeB'] } )
seaborn.jointplot( 'microbe tree distances', 'host tree distances',
                   data=df, alpha=0.3, size=8 )
Out[8]:
Well... that's... something? In this case, we are looking at the entire microbiome of a complex of 14 host species that's about 10-20 million years old. Because bacteria and archaea are older than that, we don't expect to see a meaningful pattern of coevolution at the scale of the whole microbial community.
If we're looking for coevolution, we want to examine clades within the microbial community. This is what SuchLinkedTrees is really designed to do.
SuchLinkedTrees has two functions that allow you to examine individual clades. subset_a() takes the node id of an internal node of the first tree (usually the host organisms), and masks the data within the SuchLinkedTrees instance so that that node behaves as the root. subset_b() does the same for the second tree (usually the guest organisms).
In [9]:
SLT.subset_b_size
Out[9]:
The observations are also masked so that distance calculations are constrained to within that clade. The masking operation is extremely efficient, even for very large datasets.
In [10]:
SLT.subset_b(121)
SLT.subset_b_leafs
Out[10]:
So, all we need to do is iterate over the internal nodes of the microbe tree
(which we can get from SuchTree
's get_internal_nodes()
function), subset
the guest tree to that node, and apply Hommola's algorithm to the masked
SuchLinkedTrees
instance.
I'm going to apply some simple constraints based on clade size. You could also use the average or total tree depth for each clade. It takes about an hour to finish all 103,445 clades, so let's look at a random sample of 10,000 of them.
In [11]:
from pyprind import ProgBar
from random import sample

warnings.simplefilter( 'ignore', RuntimeWarning )

# all 103,445 internal nodes would take about an hour, so we
# randomly sample 10,000 of them to look for promising clades
nodes = sample( list( T2.get_internal_nodes() ), 10000 )
progbar = ProgBar( len(nodes), title='Chugging through microbiome data...' )
data = []
for nodeid in nodes :
    SLT.subset_b( nodeid )
    progbar.update()
    if SLT.subset_b_size < 10 :
        continue
    if SLT.subset_n_links > 2500 :
        continue
    d = {}
    d['name']    = 'clade_' + str( nodeid )
    d['n_links'] = SLT.subset_n_links
    d['n_leafs'] = SLT.subset_b_size
    ld = SLT.linked_distances()
    d['r'], d['p'] = pearsonr( ld['TreeA'], ld['TreeB'] )
    data.append( d )
data = pandas.DataFrame( data ).dropna()
Let's see what we've got!
In [12]:
data.head()
Out[12]:
In [13]:
seaborn.jointplot( data.n_leafs, data.r, alpha=0.3, size=8 )
Out[13]:
Are there any clades that are big enough to be interesting that show a significant correlation above 0.6?
In [14]:
data.loc[ ( data.r > 0.6 ) &
          ( data.n_leafs > 10 ) &
          ( data.n_links > 15 ) &
          ( data.p < 0.01 ) ]
Out[14]:
Cool. Let's go back and look at these in more detail.
In [15]:
SLT.subset_b( 26971 )
ld = SLT.linked_distances()
seaborn.jointplot( ld['TreeA'], ld['TreeB'] )
Out[15]:
Huh. Well, that looks a lot less interesting than I hoped. This is the problem with correlation measures -- they don't test whether the data obeys their assumptions. In this case, we're using Pearson's $r$, which assumes that the data from the two sources is normally distributed, which this clearly is not. If you haven't seen it before, check out Anscombe's quartet; the gist of his argument is that it's not a good idea to apply any statistic without examining the data graphically.
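A quick sanity check here (just a suggestion, not part of the original analysis) is a rank correlation, which makes no normality assumption :

from scipy.stats import spearmanr

# Spearman's rank correlation doesn't assume normally distributed data
rho, p = spearmanr( ld['TreeA'], ld['TreeB'] )
print( 'Spearman\'s correlation : rho=%f, p=%f' % ( rho, p ) )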
Let's have a look at the trees so we can get a better idea of why this is broken. Unfortunately, I don't have a great way of pulling out the subtree for plotting yet, so this will require some help from dendropy.
In [16]:
from dendropy import Tree
from tempfile import NamedTemporaryFile
tmpfile1 = NamedTemporaryFile()
tmpfile2 = NamedTemporaryFile()
# invert the taxa : node_id map
# FIXME : I need a better interface for this, suggestions welcome
sfeal = { v : k for k,v in SLT.TreeB.leafs.items() }
subset_taxa = [ sfeal[i] for i in SLT.subset_b_leafs ]
guest_tree = Tree.get_from_path( 'data/bigtrees/guest.tree',
                                 schema='newick',
                                 preserve_underscores=True ) # Newick is the worst
subset_tree = guest_tree.extract_tree_with_taxa_labels( subset_taxa )
subset_tree.write_to_path( tmpfile1.name, schema='newick' )
LK[ subset_taxa ].to_csv( tmpfile2.name )
In [17]:
%load_ext rpy2.ipython
cladepath = tmpfile1.name
linkpath = tmpfile2.name
outpath = 'clade_26971.svg'
In [18]:
%%R -i cladepath -i linkpath -i outpath -w 800 -h 800 -u px
library("phytools")
library("igraph")
tr1 <- read.tree( "data/bigtrees/host.tree" )
tr2 <- read.tree( cladepath )
links <- read.csv( linkpath, row.names=1, stringsAsFactors = F )
im <- graph_from_incidence_matrix( as.matrix( links ) )
assoc <- as_edgelist( im )
obj <- cophylo( tr1, tr2, assoc=assoc )
svg( outpath, width = 10, height = 12 )
plot( obj )
dev.off()
Now that we've gotten through the plotting hampsterdance, we can have a look at the structure of this clade and its relationship with the host organisms :
If we're hoping to find an example of coevolution, this is an excellent example of what we are not looking for! The Hommola test is not really appropriate for this application. The Hommola test is really intended for cases where you have something that looks like it might be an example of coevolution, and you would like to measure how strong the effect is. We are abusing it somewhat by asking it to distinguish coevolution from a background of other kinds of interactions, to say nothing of ignoring multiple testing effects.
So, we'll need a more sophisticated way to test for coevolution. With SuchTree and SuchLinkedTrees handling the grunt work, we can focus on those models.