In which I flagrantly parasitize Sarah's paper Comparative Gut Microbiota of 59 Neotropical Bird Species for my own nefarious purposes.
In [1]:
%pylab inline
from SuchTree import SuchTree, SuchLinkedTrees, pearson
import pandas as pd
import numpy as np
import seaborn
The data for Sarah's paper lives in a FigShare repository.
The FASTA files contain two duplicated entries, Autoch72780X186_005960 and Myitub72089X232_006148, which needed to be removed before the reads could be processed.
Sarah hasn't gotten back to me with the nice tree she built for this paper, so I cobbled one together from the Global Phylogeny of Birds. Three of the species names had to be switched to synonymous taxonomic listings.
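For the curious, the renaming amounts to a string substitution on the Newick file. Here's a minimal sketch; the synonym pairs below are placeholders, not the actual three species:

synonyms = { 'Genus_synonymA' : 'Genus_preferredA' }   # placeholder pairs, not the real ones
with open( 'birds.tree' ) as handle :
    newick = handle.read()
for old,new in synonyms.items() :
    newick = newick.replace( old, new )
with open( 'birds.tree', 'w' ) as handle :
    handle.write( newick )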
The raw reads, the host tree, and the mapping file were then run through Shand, which builds an alignment of unique reads using clustal-omega, a phylogeny using FastTree, and a count table.
In [7]:
T1 = SuchTree( 'birds.tree' )                                    # host tree
T2 = SuchTree( 'sarah_birds_unique_2_clustalo_fasttree.tree' )   # guest (microbe) tree
links = pd.read_csv( 'sarah_birds_host_count_table.tsv', sep='\t', index_col='Host' )
links.index = map( lambda x : x.replace(' ','_'), links.index )  # match the tree's leaf names
In [8]:
links.head()
Out[8]:
Now, we create a SuchLinkedTrees object, which connects the host and guest trees to the link matrix (True or False values derived from the count matrix). This pre-indexes the table for fast access later, so creating the object takes a little while.
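For intuition, the link matrix is presumably just the count table thresholded at zero; something like this one-liner (an assumption on my part, that any nonzero count means "linked"):

link_matrix = links > 0   # assumption : presence is any nonzero count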
In [9]:
%%time
SLT = SuchLinkedTrees( T1, T2, links )
Let's pick a random clade within the guest tree and have a look at it.
In [19]:
SLT.subset( 7063 )
print 'subset size :', SLT.subset_size
print 'subset links :', SLT.subset_n_links
print 'link pairs :', ( SLT.subset_n_links * ( SLT.subset_n_links -1 ) ) / 2
In [20]:
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[20]:
That's a pretty big clade, actually. There are 19,359,753 pairs of links for which distances must be computed in each of the trees.
In [21]:
result_sampled = SLT.sample_linked_distances(sigma=0.05, n=10000, buckets=10)
result_sampled
Out[21]:
The sampling function converged after sampling 3,000,000 pairs, out of a possible 19,359,753, or about 15%. One must wonder, of course, how well the sampled distribution of distances captures the full set.
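My understanding of the convergence test, in sketch form (this is a guess at the idea behind sample_linked_distances, not its actual internals; sample_pair_distance below is a hypothetical stand-in for drawing one random link pair and computing its distance):

import numpy as np

def sample_until_converged( sample_pair_distance, sigma=0.05, n=10000, buckets=10 ) :
    # fill buckets of n randomly sampled link-pair distances, and stop
    # when the last several bucket means agree to within sigma
    means = []
    while True :
        bucket = np.array( [ sample_pair_distance() for i in xrange(n) ] )
        means.append( bucket.mean() )
        if len(means) >= buckets and np.std( means[-buckets:] ) < sigma :
            return len(means) * n   # total pairs sampled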
So, here is the kernel density of the full set of distances...
In [22]:
seaborn.kdeplot(result['TreeB'])
Out[22]:
...and the sampled distances. Not terrible, I think.
In [23]:
seaborn.kdeplot(result_sampled['TreeB'][100000:400000])
Out[23]:
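If "not terrible" feels too squishy, a two-sample Kolmogorov-Smirnov test puts a number on the agreement between the full and sampled distance distributions. A small KS statistic means the sampled distribution tracks the full one; with this many points the p-value will be uselessly tiny, so I'd only look at D:

from scipy.stats import ks_2samp
D, p = ks_2samp( result['TreeB'], result_sampled['TreeB'] )
print 'KS statistic :', D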
So, let's crunch the numbers for all the clades.
Well, I'm in a hurry to make something for lab meeting, so let's just look at the clades that are smaller than 400 leafs.
In [48]:
import pyprind
p = pyprind.ProgBar( len( list( SLT.TreeB.get_internal_nodes() ) ),
                     monitor=True, title='sampling trees...' )
big_nodes = []
table = {}
for n,node in enumerate( SLT.TreeB.get_internal_nodes() ) :
    p.update()
    SLT.subset( node )
    if SLT.subset_size > 400 :
        # too big to compute exhaustively; set aside for now
        big_nodes.append( node )
        #result = SLT.sample_linked_distances( sigma=0.05, n=1000, buckets=100 )
        continue
    else :
        result = SLT.linked_distances()
    table[node] = { 'n_leafs'     : SLT.subset_size,
                    'n_links'     : SLT.subset_n_links,
                    'n_pairs'     : result['n_pairs'],
                    'n_samples'   : result['n_samples'],
                    'deviation_a' : result['deviation_a'],
                    'deviation_b' : result['deviation_b'],
                    'r'           : pearson( result['TreeA'], result['TreeB'] ) }
In [85]:
print 'computed', len(table), 'clades'
print len(big_nodes), 'clades were skipped because they were too big.'
print len(table) / float( len(table) + len(big_nodes) )   # fraction of clades computed
Computed correlations for 76,388 clades (97% of the clades).
1,927 clades were skipped because they were too big.
In [49]:
C = pd.DataFrame( table ).T
seaborn.jointplot( 'n_leafs', 'r', data=C )
Out[49]:
Let's zoom in on the good ones...
In [56]:
seaborn.jointplot( 'n_leafs', 'r', data=C.query('n_leafs > 10 and r > 0.05') )
Out[56]:
In [70]:
CC = C.query('n_leafs > 10 and r > 0.05').sort_values('r', ascending=False)
print CC.shape
CC.head()
Out[70]:
In [78]:
SLT.subset( 67463 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[78]:
In [79]:
SLT.subset( 67477 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[79]:
In [80]:
SLT.subset( 47675 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[80]:
Pearson's r measures linear correlation, and its significance test assumes normally distributed variables; we saw earlier that these distance distributions are not normal. So, we should use a rank-order test, like Kendall's $\tau$.
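As a quick reminder of what Kendall's $\tau$ measures, in its simplest (tie-free) form it just counts pairs:

$$\tau = \frac{n_c - n_d}{\binom{n}{2}}$$

where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs of observations. (scipy's kendalltau actually computes the tie-corrected $\tau_b$, which reduces to this when there are no ties.)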
In [71]:
from scipy.stats import kendalltau, pearsonr

p = pyprind.ProgBar( CC.shape[0], monitor=True, title='resampling trees...' )
pearson_p   = {}
kendall_tau = {}
kendall_p   = {}
for n,node in enumerate( CC.index ) :
    SLT.subset( node )
    result = SLT.linked_distances()
    p_r,p_p = pearsonr(   result['TreeA'], result['TreeB'] )
    k_t,k_p = kendalltau( result['TreeA'], result['TreeB'] )
    pearson_p[node]   = p_p
    kendall_tau[node] = k_t
    kendall_p[node]   = k_p
    p.update()
CC['pearson_p']   = pd.Series(pearson_p)
CC['kendall_tau'] = pd.Series(kendall_tau)
CC['kendall_p']   = pd.Series(kendall_p)
In [72]:
CC.sort_values( 'kendall_tau', ascending=False, inplace=True )
CC.head()
Out[72]:
In [73]:
seaborn.jointplot( 'n_leafs', 'kendall_tau', data=CC )
Out[73]:
In [74]:
SLT.subset( 102199 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[74]:
In [75]:
SLT.subset( 19427 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[75]:
In [76]:
SLT.subset( 59641 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[76]: