If you are interested in studying how two groups of organisms interact (or, rather, have interacted over evolutionary time), you will find yourself with two trees of distinct groups of taxa that are linked by a matrix of interaction observations. This is sometimes called a 'dueling trees' problem.
If the trees happen to have the same number of taxa, and the interaction matrix happens to be an identity matrix (each host linked to exactly one guest), then you can compute the distance matrix for each of your trees and use the Mantel test to compare them. However, this is a pretty special case. Hommola et al. describe a method that extends the Mantel test to the general case.
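For that special case, the Mantel test itself is simple enough to sketch by hand. Here is a minimal, self-contained illustration (not part of SuchTree; `D1` and `D2` are assumed to be the two patristic distance matrices, with rows and columns in matching taxon order):
In [ ]:
import numpy as np

def mantel_test( D1, D2, permutations=999 ) :
    'Correlation between two distance matrices, with a permutation test.'
    n = D1.shape[0]
    idx = np.triu_indices( n, k=1 )                 # upper triangle only
    r_obs = np.corrcoef( D1[idx], D2[idx] )[0,1]
    count = 0
    for _ in range( permutations ) :
        p = np.random.permutation( n )              # shuffle taxa in one matrix
        if np.corrcoef( D1[p][:,p][idx], D2[idx] )[0,1] >= r_obs :
            count += 1
    return r_obs, ( count + 1.0 ) / ( permutations + 1.0 )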
In [1]:
%load_ext Cython
%pylab inline
In [2]:
from SuchTree import SuchTree
import pandas as pd
import numpy as np
import seaborn
In [ ]:
from SuchTree import SuchLinkedTrees, pearson
In [3]:
T1 = SuchTree( 'SuchTree/tests/test.tree' )
T2 = SuchTree( 'http://edhar.genomecenter.ucdavis.edu/~russell/fishpoo/fishpoo2_p200_c2_unique_2_clustalo_fasttree.tree' )
links = pd.read_csv( 'http://edhar.genomecenter.ucdavis.edu/~russell/fishpoo/fishpoo2_p200_c2_host_count_table.tsv',
sep='\t', index_col='Host')
links.index = map( lambda x : x.replace(' ','_'), links.index )
In [4]:
%%time
SLT = SuchLinkedTrees( T1, T2, links )
In [5]:
SLT.TreeB.get_leafs( 7027 )
Out[5]:
This one looks good. Now, we tell our SuchLinkedTrees object to subset itself using that clade. (The default subset is the root node of the guest tree.)
In [53]:
SLT.subset( 7027 )
print 'subset size :', SLT.subset_size
print 'subset links :', SLT.subset_n_links
print 'link pairs :', ( SLT.subset_n_links * ( SLT.subset_n_links -1 ) ) / 2
print SLT.col_ids
print SLT.subset_columns
print SLT.subset_leafs
We can test whether or not the guest organisms are likely to have co-diversified with the host organisms by computing the distances between each possible pair of links in the link matrix through each of the two trees, and then testing how well those two sets of distances correlate with one another.
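Conceptually, that computation looks like the following pure-Python sketch, where the hypothetical `host_dist` and `guest_dist` functions stand in for patristic distance lookups through the host and guest trees (`SuchLinkedTrees` does the equivalent in compiled code):
In [ ]:
from itertools import combinations

def linked_distances_sketch( links, host_dist, guest_dist ) :
    'links is a list of (host_leaf, guest_leaf) pairs.'
    d_host, d_guest = [], []
    for (h1,g1),(h2,g2) in combinations( links, 2 ) :
        d_host.append(  host_dist(  h1, h2 ) )   # distance through the host tree
        d_guest.append( guest_dist( g1, g2 ) )   # distance through the guest tree
    return d_host, d_guest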
In [55]:
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[55]:
Yeesh. Lousy correlation, isn't it? Anyway, thanks to SuchTree, it only takes a tiny sliver of a second to compute the 29,890 distances through each of the two trees.
Unfortunately, for $n$ links, the number of link pairs is

$$\frac{n(n-1)}{2}$$

This is $\mathcal{O}(n^2)$ scaling, which is $bad$. The biggest clade has 54,327 links, which works out to

$$\frac{54327 \times 54326}{2} = 1{,}475{,}684{,}301$$

link pairs. And that's just one particularly big clade!
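A quick sanity check on that arithmetic:
In [ ]:
def link_pairs( n ) :
    'Number of link pairs for n links.'
    return n * ( n - 1 ) / 2

print link_pairs( 54327 )   # 1475684301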
In [42]:
result_sampled = SLT.sample_linked_distances(sigma=0.05, n=10000, buckets=10)
result_sampled
Out[42]:
Clade 7027 is not particularly big, so we've actually over-sampled it by a little more than $10\times$. It shouldn't look much different, though.
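If you want to see the oversampling factor directly, compare the number of sampled distances to the exhaustive pair count (this assumes the `result` and `result_sampled` dictionaries from the cells above are still in scope):
In [ ]:
n_sampled = len( result_sampled['TreeA'] )
n_pairs = ( SLT.subset_n_links * ( SLT.subset_n_links - 1 ) ) / 2
print 'oversampling :', float( n_sampled ) / n_pairs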
In [43]:
seaborn.jointplot( result_sampled['TreeA'], result_sampled['TreeB'] )
Out[43]:
But, of course, sampled distributions don't look exactly like the distributions from which they were taken, so let's have a look at 'em.
In [44]:
seaborn.kdeplot(result['TreeB'])
Out[44]:
In [45]:
seaborn.kdeplot(result_sampled['TreeB'][100000:400000])
Out[45]:
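Eyeballing KDEs works, but if you'd rather have a number, a two-sample Kolmogorov-Smirnov test compares the sampled and exhaustive distributions directly (just an illustration; a large $p$-value fails to reject that both are draws from the same distribution):
In [ ]:
from scipy.stats import ks_2samp

D, p = ks_2samp( result['TreeB'], result_sampled['TreeB'] )
print 'K-S statistic :', D
print 'p-value       :', p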
In [46]:
import pyprind
p = pyprind.ProgBar( len( list( SLT.TreeB.get_internal_nodes() ) ), monitor=True, title='sampling trees...' )
big_nodes = []
table = {}
for n,node in enumerate( SLT.TreeB.get_internal_nodes() ) :
    p.update()
    SLT.subset( node )
    if SLT.subset_n_links > 4000 :
        big_nodes.append( node )
        result = SLT.sample_linked_distances( sigma=0.05, n=1000, buckets=100 )
    else :
        result = SLT.linked_distances()
    table[node] = { 'n_leafs'     : SLT.subset_size,
                    'n_links'     : SLT.subset_n_links,
                    'n_pairs'     : result['n_pairs'],
                    'n_samples'   : result['n_samples'],
                    'deviation_a' : result['deviation_a'],
                    'deviation_b' : result['deviation_b'],
                    'r'           : pearson( result['TreeA'], result['TreeB'] ) }
If we exhaustively measure correlations for clades with fewer than 4,000 links, and use sampled distances for clades with more, it takes about six hours on one core.
It can be threaded, but I haven't gotten around to trying that yet.
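One untested possibility, sketched here with processes rather than threads, is to give each `multiprocessing` worker its own private `SuchLinkedTrees` instance, since the subset state is mutable and can't be shared safely (the pool size and the 4,000-link threshold are assumptions carried over from above):
In [ ]:
from multiprocessing import Pool

def init_worker() :
    # each worker process builds its own SuchLinkedTrees, once
    global worker_SLT
    worker_SLT = SuchLinkedTrees( T1, T2, links )

def correlate_node( node ) :
    worker_SLT.subset( node )
    if worker_SLT.subset_n_links > 4000 :
        result = worker_SLT.sample_linked_distances( sigma=0.05, n=1000, buckets=100 )
    else :
        result = worker_SLT.linked_distances()
    return node, pearson( result['TreeA'], result['TreeB'] )

pool = Pool( processes=4, initializer=init_worker )
correlations = dict( pool.map( correlate_node, list( SLT.TreeB.get_internal_nodes() ) ) )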
In [47]:
C = pd.DataFrame( table ).T
seaborn.jointplot( 'n_links', 'r', data=C )
Out[47]:
Most of this is garbage. Let's focus on the good bits.
In [16]:
seaborn.jointplot( 'n_links', 'r', data=C.query('n_leafs > 5 and r > 0.05') )
Out[16]:
In [20]:
CC = C.query('n_leafs > 5 and r > 0.05').sort_values('r', ascending=False)
print CC.shape
CC.head()
Out[20]:
Well, the only reason I'm using Pearson's $r$ is that it happens to be an $\mathcal{O}(n)$ algorithm. It also assumes that the distributions of the things you are testing are normal, and we've already seen that ours are not. We really should use a rank-order correlation test like Kendall's $\tau$, which makes no assumptions about the distributions, but those are $\mathcal{O}(n^2)$ algorithms in their usual implementations. Once again, the massive diversity of microbes forces us to really sweat the details. So, let's use Pearson's $r$ to scout out what the correlations probably are, and then compute Kendall's $\tau$ for the ones that look interesting.
In [35]:
from scipy.stats import kendalltau, pearsonr
pearson_p = {}
kendall_tau = {}
kendall_p = {}
for n,node in enumerate( CC.index ) :
    SLT.subset( node )
    result = SLT.linked_distances()
    p_r,p_p = pearsonr( result['TreeA'], result['TreeB'] )
    k_t,k_p = kendalltau( result['TreeA'], result['TreeB'] )
    pearson_p[node] = p_p
    kendall_tau[node] = k_t
    kendall_p[node] = k_p
CC['pearson_p'] = pd.Series(pearson_p)
CC['kendall_tau'] = pd.Series(kendall_tau)
CC['kendall_p'] = pd.Series(kendall_p)
CC.head()
Out[35]:
In [36]:
seaborn.jointplot( 'n_links', 'kendall_tau', data=CC )
Out[36]:
In [40]:
SLT.subset( 79047 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[40]: