In which I flagrantly parasitize Sarah's paper Comparative Gut Microbiota of 59 Neotropical Bird Species for my own nefarious purposes.
In [1]:
%pylab inline
from SuchTree import SuchTree, SuchLinkedTrees, pearson
import pandas as pd
import numpy as np
import seaborn
The data for Sarah's paper lives in a FigShare repository.
The FASTA files contain two duplicated entries, Autoch72780X186_005960 and Myitub72089X232_006148, which needed to be removed before the reads could be processed.
Sarah hasn't gotten back to me with the nice tree she built for this paper, so I cobbled one together from the Global Phylogeny of Birds. Three of the species names had to be switched to synonymous taxonomic listings.
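For the curious, the renaming amounts to a string substitution on the Newick file. Here's a minimal sketch; the synonym pairs below are placeholders, not the actual three species:

synonyms = { 'Genus_synonymA' : 'Genus_preferredA' }   # placeholder pairs, not the real ones
with open( 'birds.tree' ) as handle :
    newick = handle.read()
for old,new in synonyms.items() :
    newick = newick.replace( old, new )
with open( 'birds.tree', 'w' ) as handle :
    handle.write( newick )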
The raw reads, the host tree, and the mapping file were then run through Shand, which builds an alignment of unique reads using clustal-omega, a phylogeny using FastTree, and a count table.
In [7]:
T1 = SuchTree( 'birds.tree' )                                    # host tree
T2 = SuchTree( 'sarah_birds_unique_2_clustalo_fasttree.tree' )   # guest (microbe) tree
links = pd.read_csv( 'sarah_birds_host_count_table.tsv', sep='\t', index_col='Host' )
links.index = map( lambda x : x.replace(' ','_'), links.index )  # match the tree's leaf names
In [8]:
links.head()
Out[8]:
Now, we create a SuchLinkedTrees object, which connects the host and guest trees to the link matrix (True or False values derived from the count matrix). This pre-indexes the table for fast access later, so creating the object takes a little while.
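For intuition, the link matrix is presumably just the count table thresholded at zero; something like this one-liner (an assumption on my part, that any nonzero count means "linked"):

link_matrix = links > 0   # assumption : presence is any nonzero count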
In [9]:
%%time
SLT = SuchLinkedTrees( T1, T2, links )
Let's pick a random clade within the guest tree and have a look at it.
In [19]:
SLT.subset( 7063 )
print 'subset size :', SLT.subset_size
print 'subset links :', SLT.subset_n_links
print 'link pairs :', ( SLT.subset_n_links * ( SLT.subset_n_links -1 ) ) / 2
In [20]:
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[20]:
That's a pretty big clade, actually. There are 19,359,753 pairs of links for which distances must be computed in each of the trees.
In [21]:
result_sampled = SLT.sample_linked_distances(sigma=0.05, n=10000, buckets=10)
result_sampled
Out[21]:
The sampling function converged after sampling 3,000,000 pairs, out of a possible 19,359,753, or about 15%. One must wonder, of course, how well the sampled distribution of distances captures the full set.
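My understanding of the convergence test, in sketch form (this is a guess at the idea behind sample_linked_distances, not its actual internals; sample_pair_distance below is a hypothetical stand-in for drawing one random link pair and computing its distance):

import numpy as np

def sample_until_converged( sample_pair_distance, sigma=0.05, n=10000, buckets=10 ) :
    # fill buckets of n randomly sampled link-pair distances, and stop
    # when the last several bucket means agree to within sigma
    means = []
    while True :
        bucket = np.array( [ sample_pair_distance() for i in xrange(n) ] )
        means.append( bucket.mean() )
        if len(means) >= buckets and np.std( means[-buckets:] ) < sigma :
            return len(means) * n   # total pairs sampled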
So, here is the kernel density of the full set of distances...
In [22]:
seaborn.kdeplot(result['TreeB'])
Out[22]:
...and the sampled distances. Not terrible, I think.
In [23]:
seaborn.kdeplot(result_sampled['TreeB'][100000:400000])
Out[23]:
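If "not terrible" feels too squishy, a two-sample Kolmogorov-Smirnov test puts a number on the agreement between the full and sampled distance distributions. A small KS statistic means the sampled distribution tracks the full one; with this many points the p-value will be uselessly tiny, so I'd only look at D:

from scipy.stats import ks_2samp
D, p = ks_2samp( result['TreeB'], result_sampled['TreeB'] )
print 'KS statistic :', D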
So, let's crunch the numbers for all the clades.
Well, I'm in a hurry to make something for lab meeting, so let's just look at the clades that are smaller than 400 leafs.
In [48]:
import pyprind
p = pyprind.ProgBar( len( list( SLT.TreeB.get_internal_nodes() ) ),
                     monitor=True, title='sampling trees...' )
big_nodes = []
table = {}
for n,node in enumerate( SLT.TreeB.get_internal_nodes() ) :
    p.update()
    SLT.subset( node )
    if SLT.subset_size > 400 :
        # too big to compute exhaustively; set aside for now
        big_nodes.append( node )
        #result = SLT.sample_linked_distances( sigma=0.05, n=1000, buckets=100 )
        continue
    else :
        result = SLT.linked_distances()
    table[node] = { 'n_leafs'     : SLT.subset_size,
                    'n_links'     : SLT.subset_n_links,
                    'n_pairs'     : result['n_pairs'],
                    'n_samples'   : result['n_samples'],
                    'deviation_a' : result['deviation_a'],
                    'deviation_b' : result['deviation_b'],
                    'r'           : pearson( result['TreeA'], result['TreeB'] ) }
In [85]:
print 'computed', len(table), 'clades'
print len(big_nodes), 'clades were skipped because they were too big.'
print len(table) / float( len(table) + len(big_nodes) )   # fraction of clades computed
Computed correlations for 76,388 clades (97% of the clades).
1,927 clades were skipped because they were too big.
In [49]:
C = pd.DataFrame( table ).T
seaborn.jointplot( 'n_leafs', 'r', data=C )
Out[49]:
Let's zoom in on the good ones...
In [56]:
seaborn.jointplot( 'n_leafs', 'r', data=C.query('n_leafs > 10 and r > 0.05') )
Out[56]:
In [70]:
CC = C.query('n_leafs > 10 and r > 0.05').sort_values('r', ascending=False)
print CC.shape
CC.head()
Out[70]:
In [78]:
SLT.subset( 67463 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[78]:
In [79]:
SLT.subset( 67477 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[79]:
In [80]:
SLT.subset( 47675 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[80]:
Pearson's r measures linear correlation, and its significance test assumes normally distributed variables; we saw earlier that these distance distributions are not normal. So, we should use a rank-order test, like Kendall's $\tau$.
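As a quick reminder of what Kendall's $\tau$ measures, in its simplest (tie-free) form it just counts pairs:

$$\tau = \frac{n_c - n_d}{\binom{n}{2}}$$

where $n_c$ and $n_d$ are the numbers of concordant and discordant pairs of observations. (scipy's kendalltau actually computes the tie-corrected $\tau_b$, which reduces to this when there are no ties.)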
In [71]:
from scipy.stats import kendalltau, pearsonr

p = pyprind.ProgBar( CC.shape[0], monitor=True, title='resampling trees...' )
pearson_p   = {}
kendall_tau = {}
kendall_p   = {}
for n,node in enumerate( CC.index ) :
    SLT.subset( node )
    result = SLT.linked_distances()
    p_r,p_p = pearsonr(   result['TreeA'], result['TreeB'] )
    k_t,k_p = kendalltau( result['TreeA'], result['TreeB'] )
    pearson_p[node]   = p_p
    kendall_tau[node] = k_t
    kendall_p[node]   = k_p
    p.update()
CC['pearson_p']   = pd.Series(pearson_p)
CC['kendall_tau'] = pd.Series(kendall_tau)
CC['kendall_p']   = pd.Series(kendall_p)
In [72]:
CC.sort_values( 'kendall_tau', ascending=False, inplace=True )
CC.head()
Out[72]:
In [73]:
seaborn.jointplot( 'n_leafs', 'kendall_tau', data=CC )
Out[73]:
In [74]:
SLT.subset( 102199 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[74]:
In [75]:
SLT.subset( 19427 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[75]:
In [76]:
SLT.subset( 59641 )
result = SLT.linked_distances()
seaborn.jointplot( result['TreeA'], result['TreeB'] )
Out[76]: