American Gut Beyond Bacteria summaries

  • Authors
    • Daniel McDonald (mcdonadt@colorado.edu)
    • Embriette Hyde (ehyde@ucsd.edu)
    • Luke Thompson (luthompson@ucsd.edu)
    • Justine Debelius (justine.debelius@colorado.edu)
  • License: BSD

Introduction

The American Gut Beyond Bacteria samples are a perk that go much deeper than the 16S prep used for the majority of the samples in the AG dataset. Specifically, in addition to 16S, these samples also undergo whole genome shotgun sequencing, fungal sequencing, 18S sequencing and viromics preps.

Whole genome shotgun sequencing allows us to gather information across the genomes represented within the sample, and to infer functional potential. The added sequence data also helps to refine specific observed species in the samples allowing for a higher level of specificity relative to 16S. The basic strategy is to take all of the small sequences, and compare them to a well-characterized reference database of previously observed genes.

Fungal sequencing is similar to 16S in that a single part of each genome is targeted, in this case, the internal transcribed spacer region of the ribosomal operon. Fungal sequencing is difficult as the ribosomal genes are not specific enough to resolve differences between fungi (e.g., they're evolving faster than the ribosomal genes). As a result, a common practice is to instead target the ITS region which evolves very fast. The downside is that, since the ITS region is not conserved and evolves so quickly, it is not feasible to align it and produce a phylogenetic tree which can hinder downstream analysis.

18S sequencing is like 16S, except that it will amplify microbial eukaryotes. The samples we've processed so far for Beyond Bacteria have not ampified 18S, however. In other words, we haven't been able to detect any microbial eukaryotes.

The virome is an analogous data set to the whole genome shotgun one except that additional sample prep work is performed to help isolate virus like particles. The challenge then is to infer what "species" of virus are represented using a characterized reference database.

Within this Notebook, we're operating on data that has already been preprocessed (e.g., sequences mapped to their respective references).


In [2]:
%matplotlib inline

First, let's load up the precomputed tables. We're only exploring WGS at this time as processing is still underway for ITS and virome. The 16S data will be incorporated into the regular 16S pipeline used for the rest of the American Gut samples. The tables being operated on were produced using Metaphlan and HUMAnN by our collaborator, Dr. Joe Petrosino.


In [39]:
from biom import load_table
from skbio import TreeNode

wgs_taxa = load_table('../data/AG/BeyondBacteria/AmericanGutMetaphlanWGS.biom').sort()
wgs_pathway = load_table('../data/AG/BeyondBacteria/PathwayTable.biom').sort()

# For consistency purposes, let's also update the sample IDs. The site that processed 
# the Beyond Bacteria samples stripped the leading zeros of the barcodes, however, 
# the rest of the American Gut data contain those zeros.

# IDs are of the form AG-1234
_ = wgs_taxa.update_ids({k: "%.9d" % int(k.split('-', 1)[1]) for k in wgs_taxa.ids()})
_ = wgs_pathway.update_ids({k: "%.9d" % int(k.split('-', 1)[1]) for k in wgs_pathway.ids()})

# Need to update the sample indices. This is a workaround for an cache invalidation bug
# https://github.com/biocore/biom-format/issues/599
wgs_taxa._index_ids()
wgs_pathway._index_ids()

Next, we're going to load a little more contextual information about what is contained in these tables. The taxonomy tree is just that: the bacterial taxonomy (though limited to just those organisms in the tree). This will let us order subsequent plots by taxonomy. We're also going to load up pathway names, which are more informative than just the raw IDs contained in the table.


In [40]:
wgs_taxa_tree = TreeNode.from_newick(open('../data/AG/BeyondBacteria/AmericanGutMetaphlanWGS.taxonomy.tree'))
taxa_order = [n.name for n in wgs_taxa_tree.tips()]

wgs_pathway_names = {}
for line in open('../data/AG/BeyondBacteria/PathwayTable.names.txt'):
    line = line.strip()
    module_id, _ = line.split(' ', 1)
    wgs_pathway_names[module_id] = line

Next, let's go ahead and order the WGS taxonomy table and "augment" the IDs in the pathways table.


In [41]:
wgs_taxa_ordered = wgs_taxa.sort_order(taxa_order, axis='observation')
_ = wgs_pathway.update_ids(wgs_pathway_names, axis='observation')

The precomputed pathway table is not in relative abundance, so we will quickly convert it.


In [42]:
wgs_pathway_rel = wgs_pathway.norm(inplace=False).transform(lambda v, i, md: v * 100)

Now, let's define a helper function to form a heatmap. We're going to use a log transform on the data as well to add a little "pop".


In [43]:
def heatmap(table, title, transform=lambda x: x):
    # Adapted from http://www.bertplot.com/visualization/?p=292

    fig, ax = subplots(figsize=(15,25))

    ax.pcolor(transform(table.matrix_data.toarray()), cmap=plt.cm.Reds)

    ax.set_xticks(arange(table.ids().size) + 0.5)
    ax.set_yticks(arange(table.ids(axis='observation').size) + 0.5)

    ax.set_xticklabels(table.ids())
    ax.set_yticklabels(table.ids(axis='observation'))

    ax.xaxis.tick_top()
    ax.yaxis.tick_left()

    plt.text(0.5, 1.02, title, fontsize=25, horizontalalignment='center', transform=ax.transAxes)

    fig.tight_layout()

And last (for setup), lets write a small method that can form hyperlinks to make it easy to track down additional information about any of the observations. We're only going to show the top few observations by relative abundance per sample.


In [53]:
from IPython.core.display import HTML

def linkify(url_fmt, query, name, abund):
    url = url_fmt.format(query.replace(' ', '+'))
    return '&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;%0.2f%% <a href=%s target="_blank">%s</a></br>' % (abund, url, name)

google_scholar_url_fmt = "http://scholar.google.com/scholar?hl=en&as_sdt=1,5&as_vis=1&q=%22{}%22"
kegg_module_url_fmt = "http://www.genome.jp/dbget-bin/www_bget?md:{}"

def table_links(table, url_fmt, name_transform=lambda x: x, id_tag=''):
    TOP_N_HITS = 10
    content = []
    for id_ in table.ids():
        content.append("<a id='%s_%s'></a>%s</br>" % (id_, id_tag, id_))
        abund_and_id = zip(table.data(id_), table.ids(axis='observation'))
        for v, name in sorted(abund_and_id, reverse=True)[:TOP_N_HITS]:
            if v > 0:
                query, name = name_transform(name)
                content.append(linkify(url_fmt, query, name, v))
    return ''.join(content)

Taxonomy Heatmap

We can now begin to produce some results! The next two cells will show a heatmap of the species observed within each sample, and then list out the relative abundance of those organisms and links to where more information can be found. Dark red indicates that the species is at a high relative abundance within the sample, while lighter colors indicate the organism is at low relative abundance within the sample.


In [45]:
# perform a log transform on the data to add pop on the display
heatmap(wgs_taxa_ordered, 'Beyond Bacteria WGS taxonomic summary, ordered by taxonomy', transform=lambda x: log(x + 1e-20))


Top species by relative abundance


In [54]:
HTML(table_links(wgs_taxa, google_scholar_url_fmt, name_transform=lambda x: [x.split('__', 1)[1].replace('_', ' ')] * 2, id_tag='taxa'))


Out[54]:
000004607        20.49% Methanobrevibacter smithii        16.60% Akkermansia muciniphila        7.40% Dorea longicatena        6.84% Eubacterium hallii        6.19% Faecalibacterium prausnitzii        4.65% Butyrivibrio crossotus        4.56% Clostridium leptum        4.46% Collinsella aerofaciens        3.94% Coprococcus comes        2.78% Coprococcus catus000009544        25.15% Roseburia intestinalis        18.96% Roseburia inulinivorans        12.13% Eubacterium rectale        11.33% Sutterella wadsworthensis        8.14% Faecalibacterium prausnitzii        3.73% Eubacterium eligens        3.00% Catenibacterium mitsuokai        2.79% Eubacterium hallii        2.36% Faecalibacterium unclassified        1.84% Ruminococcus torques000010538        14.21% Prevotella copri        13.35% Eubacterium eligens        11.75% Faecalibacterium prausnitzii        9.92% Alistipes putredinis        9.16% Akkermansia muciniphila        6.89% Ruminococcus bromii        5.94% Bacteroides plebeius        4.86% Bifidobacterium adolescentis        3.52% Eubacterium rectale        3.36% Faecalibacterium cf000011924        62.28% Lactobacillus gasseri        14.19% Bifidobacterium breve        6.63% Lactobacillus reuteri        6.26% Lactobacillus crispatus        2.38% Lactobacillus salivarius        1.63% Lactobacillus vaginalis        1.60% Bifidobacterium adolescentis        1.08% Lactobacillus delbrueckii        0.89% Lactobacillus coleohominis        0.69% Dialister invisus

Pathway heatmap

These next two cells will show pathway results. Same as before: dark red indicates a high relative abundance while ligher colors indicate low relative abundance within each sample.


In [47]:
heatmap(wgs_pathway_rel, 'Beyond Bacteria WGS pathway summary')


Top pathways by relative abundance


In [55]:
HTML(table_links(wgs_pathway_rel, kegg_module_url_fmt, name_transform=lambda x: x.split(' ', 1), id_tag='pathway'))


Out[55]:
000004607        6.32% Reductive citrate cycle (Arnon-Buchanan cycle)        5.09% Pyrimidine ribonucleotide biosynthesis, UMP => UDP/UTP,CDP/CTP [PATH:map00240]        4.57% Methionine degradation [PATH:map00270]        4.49% Methionine salvage pathway [PATH:map00270]        4.38% Cysteine biosynthesis, methionine => cysteine [PATH:map01230 map00270]        3.80% Gluconeogenesis, oxaloacetate => fructose-6P [PATH:map00010 map00020]        3.69% Dicarboxylate-hydroxybutyrate cycle [PATH:map01200 map00720]        3.09% Citrate cycle (TCA cycle, Krebs cycle) [PATH:map01200 map00020]        2.85% C1-unit interconversion, prokaryotes [PATH:map00670]        2.54% Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate [PATH:map01200 map00010]000009544        4.89% Methionine degradation [PATH:map00270]        4.77% Gluconeogenesis, oxaloacetate => fructose-6P [PATH:map00010 map00020]        4.72% Methionine salvage pathway [PATH:map00270]        4.71% Reductive citrate cycle (Arnon-Buchanan cycle)        4.64% Cysteine biosynthesis, methionine => cysteine [PATH:map01230 map00270]        4.28% Pyrimidine ribonucleotide biosynthesis, UMP => UDP/UTP,CDP/CTP [PATH:map00240]        3.55% C1-unit interconversion, prokaryotes [PATH:map00670]        3.17% Inosine monophosphate biosynthesis, PRPP + glutamine => IMP [PATH:map00230]        2.89% Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate [PATH:map01200 map00010]        2.83% Uridine monophosphate biosynthesis, glutamine (+ PRPP) => UMP [PATH:map00240]000010538        6.46% Reductive citrate cycle (Arnon-Buchanan cycle)        4.68% Pyrimidine ribonucleotide biosynthesis, UMP => UDP/UTP,CDP/CTP [PATH:map00240]        4.55% Methionine degradation [PATH:map00270]        4.45% Gluconeogenesis, oxaloacetate => fructose-6P [PATH:map00010 map00020]        4.34% Methionine salvage pathway [PATH:map00270]        4.30% Cysteine biosynthesis, methionine => cysteine [PATH:map01230 map00270]        3.60% Dicarboxylate-hydroxybutyrate cycle [PATH:map01200 map00720]        3.39% Citrate cycle (TCA cycle, Krebs cycle) [PATH:map01200 map00020]        3.25% C1-unit interconversion, prokaryotes [PATH:map00670]        2.75% Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate [PATH:map01200 map00010]000011924        5.23% Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate [PATH:map01200 map00010]        4.18% Pyrimidine deoxyribonuleotide biosynthesis, CDP/CTP => dCDP/dCTP,dTDP/dTTP [PATH:map00240]        3.87% Uridine monophosphate biosynthesis, glutamine (+ PRPP) => UMP [PATH:map00240]        3.65% Glycolysis, core module involving three-carbon compounds [PATH:map01200 map01230 map00010]        3.60% Gluconeogenesis, oxaloacetate => fructose-6P [PATH:map00010 map00020]        3.14% Guanine ribonucleotide biosynthesis IMP => GDP,GTP [PATH:map00230]        3.03% Reductive citrate cycle (Arnon-Buchanan cycle)        2.89% Cysteine biosynthesis, methionine => cysteine [PATH:map01230 map00270]        2.84% Formaldehyde assimilation, serine pathway [PATH:map01200 map00680]        2.65% Methionine salvage pathway [PATH:map00270]

In [ ]: