License: BSD
Copyright: Copyright American Gut Project, 2015


In [1]:
# This cell allows us to render the notebook in the way we wish no matter where the notebook is rendered.
from IPython.core.display import HTML
css_file = 'ag.css'
HTML(open(css_file, "r").read())


Out[1]:

Preprocessing and File Generation

The goal of this notebook is to take the OTU table and mapping files generated by the American Gut Primary Processing Pipeline Notebook and manipulate them to produce a uniform set of rarefied and filtered tables which can be used in downstream analyses, such as our Power Notebook, or with QIIME, EMPeror, or PICRUSt [1 - 3]. This processing is centralized because it can be computationally expensive, given the size of the tables involved, and because centralizing it removes the run-to-run variation introduced by the random number generation used in some steps.

Return to the top

Data Sets Generated by this Notebook

This notebook will generate American Gut datasets for three body sites. For each site, this will include a dataset of all the samples collected at that location, and a dataset containing a single sample from each donor at that body site. These are calculated on a per body site basis, so an individual who donated two fecal samples and an oral sample will have one sample in the single fecal sample set, and one sample in the single oral set.

Additionally, we decided to create a healthy subset for fecal samples. The definition is provided below.

The final directory structure will follow the following organization. Parent Directories are bolded.

  • sample_data/
    • all/
    • fecal/
      • all_participants_all_samples/
      • all_participants_one_sample/
      • sub_participants_all_samples/
      • sub_participants_one_sample/
    • oral/
      • all_participants_all_samples/
      • all_participants_one_sample/
    • skin/
      • all_participants_all_samples/
      • all_participants_one_sample/

You can choose to download the complete directory (rather than running this notebook) by setting the data_download variable.

Return to the top

Files and File Types

Our folders will contain three types of files: two metadata files, two Biom-format OTU table files and two distance matrices:

  • AGP_100nt_fecal.txt* (metadata)
  • AGP_100nt_fecal.biom* (otu table)
  • AGP_100nt_fecal_even10k.txt* (metadata)
  • AGP_100nt_fecal_even10k.biom* (otu table)
  • unweighted_unifrac_AGP_100nt_fecal_even10k.txt* (distance matrix)
  • weighted_unifrac_AGP_100nt_fecal_even10k.txt* (distance matrix)

*_oral and _skin may be substituted for _fecal in file names.

In addition, a text file called either single_samples.txt or subset_samples.txt may be generated. This is the list of samples in the data set. The file is primarily auxiliary, and is used by this notebook to filter Biom and distance matrix files into single sample data sets.

Let’s talk a bit more about what information is contained in each file type.

Metadata

A metadata file, sometimes called a mapping file, provides information about our samples which cannot be determined from 16S marker gene sequences alone. Metadata is as important as the bacterial sequence data for understanding the microbiome. It tells us about the sample itself, such as the body site where it was collected, the participant's age, or whether or not they have certain diseases. Since these factors can make a very large difference in the microbiome, it's important to have this information!

American Gut metadata is collected through the participant survey. The survey allows participants to skip any question they do not wish to answer, meaning that some samples are missing fields. The python library used to handle metadata in these notebooks will specially encode these fields. Within IPython analysis notebooks, missing data will be represented as a Numpy NaN. Printed notebooks are set to write empty fields for missing values, although this can be changed by altering the write_na variable. For certain QIIME scripts, leaving these fields blank will allow the script to ignore samples missing metadata.

The American Gut metadata is also de-identified. This means the metadata does not contain information which could be used to identify a participant, like their name, email address, or the kit ID. Instead, each participant is assigned a code. This allows us to identify multiple samples from the same individual. The samples are identified by the barcode number, which appears on the sample tube. This number connects the survey metadata to the sample data in the OTU table and distance matrix.

This notebook will add Alpha Diversity to the mapping file for rarefied data. Files which contain the alpha diversity will always have a rarefaction depth notation in the file name. The convention here is to include the word even, and then the rarefaction depth in the file name (i.e. AGP_100nt_even10k_fecal.txt).

OTU Tables

An OTU table describes the bacterial content in a group of samples. An OTU, or operational taxonomic unit, is a cluster of sequences which are identical to some degree of similarity. We use these as stand-ins for bacterial groups. More information about OTUs, the level of similarity used, and the methods we use to generate OTUs can be found in the Primary Processing Pipeline Notebook. The OTU table allows us to link the sequencing results from our 16S data to the sample ID in a usable way, and gives an easier platform for comparison across samples.
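As a toy illustration of this structure (the sample IDs, OTU IDs, and counts below are invented), an OTU table can be pictured as a small count matrix linking OTUs to samples:

```python
import pandas as pd

# A toy OTU table: rows are OTUs, columns are samples, and values are
# sequence counts. All IDs and counts here are made up for illustration.
toy_otu = pd.DataFrame(
    [[10,  0, 3],
     [ 5, 12, 0],
     [ 0,  8, 7]],
    index=['OTU_1', 'OTU_2', 'OTU_3'],
    columns=['sample_A', 'sample_B', 'sample_C'])

# Column sums give the total sequence count per sample
print(toy_otu.sum(axis=0))
```

Real American Gut tables have thousands of OTUs and samples, which is why they are stored in the Biom format described next rather than as plain text.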

Our OTU tables are saved as a special file format, the Biom format [4]. Unlike the other file types we will generate here, Biom files cannot be viewed using a normal spreadsheet program on your computer. The benefit of Biom format is that it allows us to save large amounts of data in a smaller amount of space. Biom data may be encoded as a JSON string, or using the same HDF5 compression NASA does. Because of the size of the American Gut data, we recommend the HDF5 format.

Distance Matrix

Finally, Distance Matrices describe the relationship between any two samples based on their community structure, measuring an ecological concept called Beta Diversity. In this notebook, we will measure the UniFrac Distance between all of our samples. This takes into account the evolutionary relationship between bacteria when communities are compared [5, 6]. Each cell in the distance matrix gives the distance between the sample in that row and the sample in that column. Distance matrices are symmetrical, which means that if we draw a line along the diagonal of the matrix, the distances on either side of that line mirror each other. Cells on the diagonal itself compare a sample with itself, and so have a distance of 0. We can use our distance matrix information directly, apply multidimensional scaling techniques like Principal Coordinates Analysis (PCoA), or build a UPGMA tree.
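The symmetry and zero diagonal described above can be checked directly; here is a minimal sketch with an invented 3 x 3 matrix:

```python
import numpy as np

# A toy distance matrix for three samples (values are illustrative only)
dm = np.array([[0.0, 0.3, 0.7],
               [0.3, 0.0, 0.5],
               [0.7, 0.5, 0.0]])

# Symmetric: the distance from i to j equals the distance from j to i
assert np.allclose(dm, dm.T)

# The diagonal compares each sample with itself, so it is always 0
assert np.allclose(np.diag(dm), 0.0)
```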

Metadata
  • File type: tab-delimited text file (ends in .txt)
  • Other viewing options: text editor (i.e. Emacs, TextEdit) or spreadsheet program (i.e. Excel)
  • Rows: samples
  • Columns: metadata categories like AGE, BMI or disease status

OTU Table
  • File type: Biom-format file [4] (ends in .biom)
  • Other viewing options: the Biom-format python package or the R BIOM package
  • Rows: OTU sequence clusters
  • Columns: samples

Distance Matrix
  • File type: tab-delimited text file (ends in .txt)
  • Other viewing options: text editor (i.e. Emacs, TextEdit) or spreadsheet program (i.e. Excel)
  • Rows: samples
  • Columns: samples

Return to the top

Notebook Requirements

$\LaTeX$ is also recommended for running this suite of analysis notebooks, although it is not required for this notebook. TeX Live offers one installation option.

Return to the top

Function Imports

We will start our analysis by importing the functions we need from python libraries.


In [2]:
import os
import shutil
import copy
import datetime

import numpy as np
import pandas as pd
import skbio
import biom

import americangut.diversity_analysis as div
import americangut.geography_lib as geo

from qiime_default_reference import get_reference_tree

Analysis Parameters

It is also important to define certain aspects of how we'll handle files and perform our analysis. It is easier to set all of these parameters in one place, so the settings stay consistent every time we repeat the process, rather than repeating them in multiple places. This way, we only have to change a parameter once.

File Saving Parameters

In the course of this analysis, a series of files can be generated. The File Saving Parameters determine whether new files are saved.

data_download
(boolean)
This will download a directory of precomputed data tables when True. data_download supersedes overwrite: if data_download is True, overwrite is forced to True so that any existing files are replaced by the downloaded versions.
overwrite
(boolean)

When overwrite is True, new files will be generated and saved during data processing. It is recommended that overwrite be set to False, in which case new files will only be generated when the file does not exist. This substantially decreases analysis time.


In [3]:
data_download = False
overwrite = False

# data_download supersedes overwrite: force overwrite when precomputed
# tables will be downloaded, so any existing files are replaced
if data_download:
    overwrite = True

Metadata and text file handling parameters

We'll start by defining how we'll handle certain files, especially metadata files. We will use the Pandas library to handle most of our text files. This library provides some spreadsheet like functionality.

txt_delim
(string)
txt_delim specifies the way columns are separated in the files. QIIME typically consumes and produces tab-delimited ("\t") text files (.txt) for metadata and results generation.
map_index
(string)
The name of the column containing the sample names. In QIIME, this column is called #SampleID.
map_nas
(list of strings)
It is possible a mapping file may be missing values, since American Gut participants are free to skip any question. The pandas package can recognize these encodings as missing values and omit them from analysis. In raw American Gut files, missing values are typically denoted as “NA”, “no_data”, “unknown”, and empty fields (“”).
write_na
(string)
The value to denote missing values when text files are written from Pandas data frames. Using an empty space, (“”) will allow certain QIIME scripts, like group_significance.py, to ignore the missing values.

In [4]:
txt_delim = '\t'
map_index = '#SampleID'
map_nas = ['NA', 'no_data', 'unknown', '']
write_na = ''
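To illustrate how these parameters work together (using a made-up, three-sample mapping file, not real American Gut data), pandas converts every listed encoding to NaN on read, and na_rep controls how NaN is rendered on write:

```python
import pandas as pd
from io import StringIO

# A toy mapping file mixing the missing-value encodings listed above
raw = '#SampleID\tAGE\tBMI\nS1\t33\tno_data\nS2\tNA\t24\nS3\tunknown\t\n'

toy_map = pd.read_csv(StringIO(raw), sep='\t',
                      na_values=['NA', 'no_data', 'unknown', ''],
                      index_col='#SampleID')

# All of the different encodings are now NaN
print(toy_map.isnull().sum())

# On write, na_rep plays the role of write_na; here missing values
# become empty fields, which certain QIIME scripts will ignore
out = toy_map.to_csv(sep='\t', na_rep='')
```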

Rarefaction parameters

We rarefy our data to set an even depth and allow head-to-head comparison of alpha and beta diversity. Rarefaction begins by removing samples from the table which do not have the minimum number of counts. Sequences are then drawn randomly out of a weighted pool until we reach the rarefaction depth.

rarefaction_depth
(int)
The rarefaction_depth specifies the number of sequences per sample to be used for analysis. A depth of 10,000 sequences/sample was selected because it balances a fuller picture of diversity against the number of samples retained.
num_rarefactions
(int)
The number of times we draw new rarefaction tables. This controls for bias from any single rarefaction instance. We selected 10 rarefactions to balance computational efficiency against adequately sampling the rarefaction variability.

In [5]:
rarefaction_depth = 10000
num_rarefactions = 10
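As a rough sketch of the rarefaction draw described above (using invented counts and numpy; this is not the QIIME implementation used later in the notebook):

```python
import numpy as np

# Hypothetical counts for one sample across five OTUs
counts = np.array([50, 30, 15, 4, 1])
depth = 20  # a toy stand-in for rarefaction_depth

rng = np.random.RandomState(42)  # fixed seed for reproducibility

# Expand the counts into a pool of individual sequence observations,
# draw `depth` of them without replacement, and re-tally per OTU
pool = np.repeat(np.arange(counts.size), counts)
drawn = rng.choice(pool, size=depth, replace=False)
rarefied = np.bincount(drawn, minlength=counts.size)

# Every rarefied sample now has exactly `depth` sequences
assert rarefied.sum() == depth
```

Because the draw is random, repeating it gives slightly different tables, which is why the notebook averages over num_rarefactions instances.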

Split Parameters

We will split our OTU tables by body site, since the collection site on the human body plays a large role in community compositions in healthy adults [7].

split_field
(string)
The metadata category which contains the body site information we will use to split the OTU table and mapping file.
split_prefix
(string)
Under the standards used to format the American Gut metadata, a constant prefix is used to denote body site. This is used for string formatting and file naming.

In [6]:
split_field = 'BODY_HABITAT'
split_prefix = 'UBERON:'

Alpha Diversity Parameters

Alpha diversity looks at the variety of species within a sample. In this notebook, we calculate alpha diversity using a variety of metrics, and append these to the mapping file. A more complete discussion of alpha diversity can be found below.

alpha_metrics
(string)
There are multiple alpha diversity metrics which can be used. We will calculate four alpha diversity metrics here: PD Whole Tree [8], Observed Species, Chao1 [9], and Shannon [10] diversity. Among these metrics, PD whole tree diversity is unique in that it considers the evolutionary relationship between OTUs in a sample by calculating the branch length on the phylogenetic tree covered by a sample. A list of available metrics can be found on the scikit-bio website. Metric names should be separated by commas (with no spaces).

In [7]:
alpha_metrics = 'PD_whole_tree,observed_species,chao1,shannon'
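For intuition, the Shannon metric can be computed by hand for a toy sample. This sketch assumes the log base 2 convention used by QIIME's shannon metric, and is not the QIIME implementation itself:

```python
import numpy as np

def shannon(counts, base=2):
    """Shannon diversity for a vector of OTU counts.

    A sketch, assuming log base 2 as in QIIME's shannon metric.
    """
    counts = np.asarray(counts, dtype=float)
    freqs = counts[counts > 0] / counts.sum()
    return -(freqs * np.log(freqs)).sum() / np.log(base)

# Two equally abundant OTUs give exactly 1 bit of diversity
print(shannon([10, 10]))  # 1.0
```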

Beta Diversity Parameters

Beta Diversity compares ecological communities across multiple samples or sites. A further discussion of beta diversity can be found below. We will use QIIME to calculate beta diversity.

beta_metrics
(string)
As with alpha diversity, there are multiple ways we can calculate our beta diversity metrics. Here, we use weighted and unweighted UniFrac distances [5]. UniFrac distance determines the amount of the phylogenetic tree which does not overlap between two samples. Weighted UniFrac takes into account the abundance of taxa within a sample, while unweighted UniFrac distance only considers the presence or absence of a particular community member. Additional options are available in the documentation for beta_diversity.py.

In [8]:
beta_metrics = "unweighted_unifrac,weighted_unifrac"

Data Set Selection

We can select the datasets which we’ll generate using this notebook. The default for each body site is to generate a data set (OTU table, mapping file and distance matrices) for all participants and all samples. We may want to limit our samples to a single sample per individual. Or, we could choose only to work with a subset of the data (see Identification of a Healthy Subset of Adults).

habitat_sites
(list of strings)
A list of the body site names used by our American Gut metadata standards. Some of these are inconvenient for file naming, so we rename them using the corresponding all_bodysites name. For example, the standard name for a mouth sample labels it as an “oral cavity”; however, spaces in file paths make life difficult, so this is mapped to “oral” in our all_bodysites list.
all_bodysites
(list of strings)
A list of all the possible body sites which will be used to generate the datasets here. The order of body sites must correspond to the order in habitat_sites.
sub_part_sites
(set)
We may also want to generate data sets which exclude samples from participants with factors already known to significantly affect the microbiome. The subset currently selected focuses mainly on fecal samples.
one_samp_sites
(set)
For some types of analysis, there is an assumption that samples are independent, which in this context requires that there not be multiple samples per individual. To limit analysis to a single sample from each individual, we can select the body sites where we want to filter to one sample per individual. We recommend doing this for all body sites.

In [9]:
# Lists all bodysites to be analyzed
habitat_sites = ['feces', 'oral cavity', 'skin']
all_bodysites = ['fecal', 'oral', 'skin']

# Handles healthy subset OTU tables
sub_part_sites = {'fecal'}

# Handles single sample OTU tables
one_samp_sites = {'fecal', 'oral', 'skin'}

File paths and Directories

To help organize the results of this notebook, we'll start by setting up a series of directories where results can be saved. This will provide a common file structure (base_dir, working_dir, etc.) for the results. We're going to set up three primary directories here, and then nest additional directories inside.

As we set up directories, we'll make use of the check_dir function. This will create the directories we identify if they do not exist.

Base Directory

We need a general location to do all our analysis; this is the base_dir. All our other directories will exist within the base_dir.

base_dir
(string)
The file path for the directory where any files associated with the analysis should be saved. It is suggested this be a directory called "agp_analysis", located in the same directory as the IPython notebooks.

In [10]:
base_dir = os.path.join(os.path.abspath('.'), 'agp_analysis')
div.check_dir(base_dir)

Reference Directories and Files

Some of the steps in our diversity calculations will require a phylogenetic tree. This contains information about the evolutionary relationship between OTUs which can be leveraged in calculating PD Whole Tree Diversity and UniFrac Distance. While there are multiple ways to pick OTUs, this table was generated using a reference-based technique. Therefore, we can simply download the phylogenetic tree file for the reference set. Our reference for this dataset was the Greengenes version 13_8 at 97% similarity [11]. Please refer to the Primary Processing Pipeline Notebook for more information about how OTUs are picked.

tree_fp
(string)
This gives the location of the correct tree file inside your Greengenes 13_8 directory. The default tree from your QIIME installation is used here.

In [11]:
tree_fp = get_reference_tree()

Working Directories and Files

The working directories will be used to save files we generate while cleaning the data. These may include items like downloaded OTU tables, and rarefaction instances.

working_dir
(string)
The file path for a directory where intermediate files (i.e. rarefaction instances) generated during the run of this notebook can be stored. It is recommended this be located within the base analysis directory.

In [12]:
# Sets up a directory to save intermediate files and downloads
working_dir = os.path.join(base_dir, 'intermediate_files')
div.check_dir(working_dir)

Download Directories and Files

Our first analysis step is to locate the American Gut OTU tables generated by the Primary Processing Pipeline Notebook. The tables may be generated locally, located in a local GitHub repository, or downloaded directly from GitHub.

download_dir
(string)
The file path where downloaded files should be saved. The download_dir may be located within the working_dir; it may also be located outside the base_dir.
download_otu_fp
(string)
The uncompressed OTU table from GitHub is located at this file path. This should be a .biom file, with no compression.
download_map_fp
(string)
The location of the American Gut mapping file, downloaded from GitHub.

In [13]:
# Creates a directory for unprocessed file downloads
download_dir = os.path.join(working_dir, 'downloads')
div.check_dir(download_dir)

# Sets the filepaths for downloaded files
download_otu_fp = os.path.join(download_dir, 'AG_100nt.biom')
download_map_fp = os.path.join(download_dir, 'AG_100nt.txt')

Rarefaction Directory and Files

When we perform rarefaction on our data, we will generate several similarly named files. To help keep these organized, we will create a directory for each set of rarefaction files.

rare_dir
(string)
The file path to the directory where we should save all of our rarefaction files. This should be located in the working_dir.
rare_pattern
(string)
This describes the way rarefied OTU tables will be named in our output directories. The “%(rare_depth)i”-style blanks allow us to substitute integers by name; here, we specify the rarefaction depth and the rarefaction instance for each table.

In [14]:
# Creates a parent directory for rarefaction instances
rare_dir = os.path.join(working_dir, 'rarefaction')
div.check_dir(rare_dir)

# Sets a pattern for the filenames of the rarefaction files
rare_pattern = 'rarefaction_%(rare_depth)i_%(rare_instance)i.biom'
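The named blanks in rare_pattern are filled with ordinary Python %-formatting. For example, the full set of file names for 10 rarefaction instances at a depth of 10,000:

```python
# The pattern defined above, filled in for each rarefaction instance
rare_pattern = 'rarefaction_%(rare_depth)i_%(rare_instance)i.biom'

rare_files = [rare_pattern % {'rare_depth': 10000, 'rare_instance': i}
              for i in range(10)]

print(rare_files[0])   # rarefaction_10000_0.biom
print(rare_files[-1])  # rarefaction_10000_9.biom
```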

Alpha Diversity Directories and Files

When we calculate alpha diversity on our data, we will consume the similarly named rarefaction tables and generate several similarly named alpha diversity files. To help keep these organized, we will create a directory for each set of alpha diversity files.

alpha_dir
(string)
The file path to the directory where we should save all of our alpha diversity files. This should be located in the working_dir.
alpha_pattern
(string)
This describes the way alpha diversity files will be named in our output directories. The “%(rare_instance)i”-style blanks allow us to substitute integers by name; here, we specify the rarefaction depth and the rarefaction instance for each table. The rarefaction instances are numbered sequentially, from 0 to one less than the number of rarefactions.

In [15]:
# Creates a parent directory for the alpha diversity files
alpha_dir = os.path.join(working_dir, 'alpha')
div.check_dir(alpha_dir)

# Sets a pattern for the filenames of alpha diversity tables
alpha_pattern = 'alpha_rarefaction_%(rare_depth)i_%(rare_instance)i.txt'

Split Directories and Files

At some point in our analysis, it will become necessary to split our OTU table by body site. The directory we create next will specify the location, while the file patterns will allow us to iterate or select one of many possible files using a simple string substitution.

split_raw_dir
(string)
The file path where we should put the unrarefied OTU table after splitting by body site. This should be located in the working_dir.
split_rare_dir
(string)
The file path where we should put the rarefied OTU table after splitting by body site. This should be located in the working_dir.
split_fn
(string)
The files generated by OTU splitting will follow this naming convention. The blanks (rare_suffix, split_field, split_prefix, split_group and extension) specify the level of rarefaction, the field and prefix used for splitting the data, the group within that split, and the type of file generated. Here, we expect the split_group to be a body site.

In [16]:
# Creates a directory for splitting the OTU table by bodysite
split_raw_dir = os.path.join(working_dir, 'split_by_bodysite_raw')
div.check_dir(split_raw_dir)
split_rare_dir = os.path.join(working_dir, 'split_by_bodysite_rare')
div.check_dir(split_rare_dir)

# Sets a pattern for filenames in the split directory
split_fn = 'AGP_100nt%(rare_suffix)s__%(split_field)s_%(split_prefix)s%(split_group)s__.%(extension)s'

Output Directories and Files

Finally, we need to set up the directories and filenames where we will save our results.

Data Directory

We’ll start by creating an output directory where our results should be located.

data_dir
(string)
The file path for a directory where the results of this notebook (OTU tables, mapping files, and distance matrices) should be saved. This should be a directory in base_dir.

In [17]:
# Sets up a directory where all results should be saved
data_dir = os.path.join(base_dir, 'sample_data')
if not data_download:
    div.check_dir(data_dir)

Body Site Directories

Next, we'll create directories specific to the body sites for which we have samples. Each will store one of the result sets from this notebook; the folders are described above.

all_dir
(string)
The all data directory will contain the full American Gut results.
bodysite directories
(string)
The bodysite directories can be identified by adjusting the all_bodysites variable.

In [18]:
# Sets up an all sample directory
all_dir = os.path.join(data_dir, 'all')
if not data_download:
    div.check_dir(all_dir)

# Creates body-site specific directories
if not data_download:
    for site in all_bodysites:
        div.check_dir(os.path.join(data_dir, site))

Data Set Directories

We will set up file names for the output directories where we’ll save our final files.

The “%(site)s” blank allows us to insert each body site into the output directory path. We expect the final file paths built from these patterns to be located in the data_dir.

asab_pattern
(string)
A file pattern for the directory where we’ll save the data from all samples from all participants.
assb_pattern
(string)
A file pattern for the directory where data from a single sample for each individual at each body site is stored. Note that it's possible to have multiple samples for the same individual, as long as the individual contributed samples at multiple body sites. However, the two samples from the same individual will be represented in different tables.
ssab_pattern
(string)
A file pattern for the directory where we’ll save the data from all samples from a subset of participants.
sssb_pattern
(string)
A file pattern for the directory where we'll save the data from a single sample from each participant in the healthy subset.

In [19]:
# Sets up the file path pattern for the possible sets of sample at each body
# site
asab_pattern = os.path.join(data_dir, '%(site)s/all_participants_all_samples')
assb_pattern = os.path.join(data_dir, '%(site)s/all_participants_one_sample')
ssab_pattern = os.path.join(data_dir, '%(site)s/sub_participants_all_samples')
sssb_pattern = os.path.join(data_dir, '%(site)s/sub_participants_one_sample')

# Checks the file paths
if not data_download:
    for site in all_bodysites:
        site_blank = {'site': site}
        div.check_dir(asab_pattern % site_blank)
        if site in one_samp_sites:
            div.check_dir(assb_pattern % site_blank)
        if site in sub_part_sites:
            div.check_dir(ssab_pattern % site_blank)
        if site in sub_part_sites and site in one_samp_sites:
            div.check_dir(sssb_pattern % site_blank)

Data File Names

Finally, we will set up the file names we'll put in the output directories. We can combine these with our directories to get our final file paths.

Each file name begins with AGP_100nt (the data comes from the American Gut Project, and sequences were trimmed to 100 nucleotides in length). The rarefaction depth and body site are described by the blanks.

otu_fn
(string)
A pattern for the filename for output OTU table files.
map_fn
(string)
A pattern for the filename for output mapping files.
dm_fn
(string)
A pattern for the file names used by the distance matrix files generated for each of the beta_metrics.


In [20]:
otu_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.biom'
map_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt'
dm_fn = ['%(metric)s_AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt' 
         % {'metric': m, 'rare_depth': '%(rare_depth)s', 
            'spacer': '%(spacer)s', 'site': '%(site)s'}
         for m in beta_metrics.split(',')]

sin_fn = 'single_samples.txt'
sub_fn = 'subset_samples.txt'

File Extensions

There are a set of blanks which are filled in for each file name. Some of these blanks will follow consistent patterns, which we can set before the files are used.

map_extension
(string)
The file extension for mapping files. Mapping files are typically tab-delimited text files.
otu_extension
(string)
The file extension for OTU table files. OTU tables are typically Biom-formatted files.
rare_suffix
(string)
This is added to file names to denote that rarefaction has occurred. Typically, this is “even” followed by the rarefaction depth.
raw_suffix
(string)
This is added to files in which rarefaction has not been performed. Usually, this will be an empty string. This is required to maintain appropriate string formatting.
site_pad,
all_
(string)
These are spacers used to keep name formats clean and correct.

In [21]:
map_extension = 'txt'
otu_extension = 'biom'

site_pad = '_'
all_ = ''

rare_suffix = '_even10k'
raw_suffix = ''

File Blanks

We can set up the values to fill in the file blanks, using the body site names and the file paths we’ve provided. We’ll generate some of these using substitutions.

In cases where body site must be specified, this is handled in two ways. When dealing with splitting data by body site, the body site will be given by the split_group, and we'll use nan as a placeholder.

After the tables are split, we generate body site-specific information as a list.

last_rare
(dict)
Fills in the blanks for a rarefaction table or alpha diversity file. This is used to check whether all rarefaction or alpha diversity files have been generated for an analysis.
all_raw_blanks
(dict)
Fills in the blanks for the unrarefied, all-sample files.
all_rare_blanks
(dict)
Fills in the blanks for the rarefied, all-sample files.
otu_raw_split_blanks
(dict)
Fills in the blanks for the unrarefied split file patterns to identify the split OTU table.
map_raw_split_blanks
(dict)
Fills in the blanks for the unrarefied split file patterns to identify the split metadata file. The nan value for split_group is a place holder.
otu_rare_split_blanks
(dict)
Fills in the blanks for the rarefied split file patterns to identify the split OTU table.
map_rare_split_blanks
(dict)
Fills in the blanks for the rarefied split file patterns to identify the split metadata file.
raw_sample_blanks
(dict)
Fills in the blanks for the unrarefied files at each body site.
rare_sample_blanks
(dict)
Fills in the blanks for the rarefied files at each body site.

In [22]:
last_rare = {'rare_depth': rarefaction_depth,
             'rare_instance': num_rarefactions - 1}

all_raw_blanks = {'rare_depth': raw_suffix,
                  'spacer': all_,
                  'site': all_}
all_rare_blanks = {'rare_depth': rare_suffix,
                   'spacer': all_,
                   'site': all_}

otu_raw_split_blanks = {'rare_suffix': raw_suffix,
                        'split_field': split_field,
                        'split_prefix': split_prefix,
                        'split_group': np.nan,
                        'extension': otu_extension}

map_raw_split_blanks = {'rare_suffix': raw_suffix,
                        'split_field': split_field,
                        'split_prefix': split_prefix,
                        'split_group': np.nan,
                        'extension': map_extension}

otu_rare_split_blanks = {'rare_suffix': rare_suffix,
                         'split_field': split_field,
                         'split_prefix': split_prefix,
                         'split_group': np.nan,
                         'extension': otu_extension}

map_rare_split_blanks = {'rare_suffix': rare_suffix,
                         'split_field': split_field,
                         'split_prefix': split_prefix,
                         'split_group': np.nan,
                         'extension': map_extension}
raw_sample_blanks = {'site': np.nan,
                     'rare_depth': raw_suffix,
                     'spacer': site_pad}

rare_sample_blanks = {'site': np.nan,
                      'rare_depth': rare_suffix,
                      'spacer': site_pad}
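Combining a file-name pattern with a set of blanks is again plain %-formatting. For example, using the otu_fn pattern and the rarefied per-site blanks from above (with 'fecal' substituted for the nan placeholder once the body site is known):

```python
# Values copied from the cells above; 'fecal' stands in for the
# nan placeholder in rare_sample_blanks
otu_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.biom'
rare_sample_blanks = {'site': 'fecal',
                      'rare_depth': '_even10k',
                      'spacer': '_'}

fn = otu_fn % rare_sample_blanks
print(fn)  # AGP_100nt_even10k_fecal.biom
```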

Data Download

We can now start our analysis by downloading the American Gut mapping file and OTU tables. If they are not already located in the download_dir, they will be downloaded to this location. If the files exist, new versions will be downloaded only if overwrite is set to True.

Note that this step requires an internet connection.


In [23]:
if data_download:
    !curl -OL ftp://ftp.microbio.me/pub/American-Gut-precomputed/r1-15/sample_data.tgz
    !tar -xzf sample_data.tgz
    shutil.move('./sample_data', base_dir)

In [24]:
# Gets the biom file
if not data_download and (not os.path.exists(download_otu_fp) or overwrite):
    # Downloads the compressed biom file
    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.biom?raw=true
    # Moves the biom file to its final location
    shutil.move(os.path.join(os.path.abspath('.'), 'AG_100nt.biom?raw=true'), download_otu_fp)

# Gets the mapping file
if not data_download and (not os.path.exists(download_map_fp) or overwrite):
    # Downloads the mapping files
    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.txt?raw=true
    # Moves the file to the download file path
    shutil.move(os.path.join('.', 'AG_100nt.txt?raw=true'), download_map_fp)

Mapping File Clean up

We will start by adjusting the metadata. This will correct errors and provide a uniform format for derived columns we may wish to use later or in downstream analyses.


In [25]:
# Loads the mapping file
raw_map = pd.read_csv(download_map_fp,
                      sep=txt_delim, 
                      na_values=map_nas,
                      index_col=False,
                      dtype={map_index: str},
                      low_memory=False)
raw_map.index = raw_map[map_index]
del raw_map[map_index]

# Loads the OTU table
raw_otu = biom.load_table(download_otu_fp)

# Filters the raw map to remove any samples that are not present in the biom table
raw_map = raw_map.loc[raw_otu.ids()]

Age

There is also a set of columns which are not included in the map, but which may be useful for downstream analyses. These include age binned by decade (AGE_CAT). While there are QIIME analyses which can handle continuous metadata, binning can help reduce some of the noise. Here, we bin age by decade, with the exception of people under the age of 20. The gut develops in the first two years of life, and the guts of young children differ significantly from those of older children and adults [12, 13]. We will also combine individuals over the age of 70 into their own category, due to the low sample counts of people over 80 as of round 14 (n < 20).


In [26]:
# Bins age by decade (with the exception of young children)
def categorize_age(x):
    if np.isnan(x):
        return x
    elif x < 3:
        return "baby"
    elif x < 13:
        return "child"
    elif x < 20:
        return "teen"
    elif x < 30:
        return "20s"
    elif x < 40:
        return "30s"
    elif x < 50:
        return "40s"
    elif x < 60:
        return "50s"
    elif x < 70:
        return "60s"
    else:
        return "70+"
raw_map['AGE_CAT'] = raw_map.AGE.apply(categorize_age)

Alcohol Consumption

In addition to considering the frequency with which people use alcohol (Never, Rarely, Occasionally, Regularly, or Daily), it may be helpful to simply look for an effect associated with any alcohol consumption.


In [27]:
def categorize_etoh(x):
    if x == 'Never':
        return "No"
    elif isinstance(x, str):
        return "Yes"
    elif np.isnan(x):
        return x
    
raw_map['ALCOHOL_CONSUMPTION'] = raw_map.ALCOHOL_FREQUENCY.apply(categorize_etoh)

Body Mass Index

Body Mass Index (BMI) can be stratified into categories which give an approximate idea of body shape. It is worth noting that these stratifications do not hold well for growing children, for whom BMI categories are based on age and gender.


In [28]:
# Categorizes the BMI into groups
def categorize_bmi(x):
    if np.isnan(x):
        return x
    elif x < 18.5:
        return "Underweight"
    elif x < 25:
        return "Normal"
    elif x < 30:
        return "Overweight"
    else:
        return "Obese"
raw_map['BMI_CAT'] = raw_map.BMI.apply(categorize_bmi)

Collection Season

American Gut samples have been collected since December of 2012. To look for patterns associated with the time of year samples were collected, we bin this date information into month and season.

We currently define our seasons according to the calendar in the Northern Hemisphere, because as of round fifteen, 99% of our samples were collected north of the equator. Additionally, rather than defining our seasons by the solar calendar, we have elected to use the first day of the month the solstice or equinox occurs in as the start of our season. So, while Winter technically begins on December 20th or 21st, according to the solar calendar, we consider December 1st as the first day of our Winter.


In [29]:
def convert_date(x):
    """Converts strings to a date object"""
    if isinstance(x, str) and "/" in x:
        try:
            return pd.to_datetime(x)
        except (ValueError, TypeError):
            return np.nan
    else:
        return x

raw_map.COLLECTION_DATE = raw_map.COLLECTION_DATE.apply(convert_date)

In [30]:
# Categorizes data by collection month and collection season
month_map = {-1: [np.nan, np.nan],
             np.nan: [np.nan, np.nan],
             1: ['January', 'Winter'],
             2: ['February', 'Winter'],
             3: ['March', 'Spring'],
             4: ['April', 'Spring'],
             5: ['May', 'Spring'],
             6: ['June', 'Summer'],
             7: ['July', 'Summer'],
             8: ['August', 'Summer'],
             9: ['September', 'Fall'],
             10: ['October', 'Fall'],
             11: ['November', 'Fall'],
             12: ['December', 'Winter']}

def map_month(x):
    try:
        return month_map[x.month][0]
    except (AttributeError, KeyError):
        return np.nan

def map_season(x):
    try:
        return month_map[x.month][1]
    except (AttributeError, KeyError):
        return np.nan

# Maps the data as a month
raw_map['COLLECTION_MONTH'] = \
    raw_map.COLLECTION_DATE.apply(map_month)

# Maps the data as a season
raw_map['COLLECTION_SEASON'] = \
    raw_map.COLLECTION_DATE.apply(map_season)

Collection Location

The American Gut Project includes some geographical information about where samples were collected. While the data may be leveraged as-is, it can also be helpful to clean up the data.

We'll start by checking for uniform country mapping. This will allow us to combine samples from countries or groups of countries with multiple descriptive names, such as Great Britain and the United Kingdom.


In [31]:
def map_countries(x):
    return geo.country_map.get(x, x)

raw_map.COUNTRY = raw_map.COUNTRY.apply(map_countries)

In Rounds 1-15, participants come predominantly from the US, UK, Belgium, and Canada. Since the areas occupied by Belgium and the UK are smaller than many states in the contiguous US (including some of the most represented states in the American Gut), we have elected to consider the STATE field only for American and Canadian samples.

This does not alter the information provided by the zip (postal) code or Longitude/Latitude information.


In [32]:
# Removes state information for any sample collected outside the US and Canada
# (This may change as additional countries are added.)
countries = raw_map.groupby('COUNTRY').count().STATE.index.values
for country in countries:
    if country not in {'GAZ:United States of America', 'GAZ:Canada'}:
        raw_map.loc[raw_map.COUNTRY == country, 'STATE'] = np.nan

# Handles regional mapping, cleaning up states so that only American and
# Canadian states are included 
def check_state(x):
    # The membership checks and lookups both use the uppercased name,
    # so mixed-case entries map correctly
    if not isinstance(x, str):
        return np.nan
    elif x.upper() in geo.us_state_map:
        return geo.us_state_map[x.upper()]
    elif x.upper() in geo.canadian_map_english:
        return geo.canadian_map_english[x.upper()]
    else:
        return np.nan
raw_map['STATE'] = raw_map.STATE.apply(check_state)

We may also choose to use predefined regions to further bin our location data, and allow us to look for social or economic trends. To this end, we can apply regions defined by the US Census Bureau and Economic Regions defined by the US Bureau of Economic Analysis, which is part of the Department of Commerce.


In [33]:
# Bins data by census region
def census_f(x):
    if isinstance(x, str) and x in geo.regions_by_state:
        return geo.regions_by_state[x]['Census_1']
    else:
        return np.nan
raw_map['CENSUS_REGION'] = raw_map.STATE.apply(census_f)


# Bins data by economic region
def economic_f(x):
    if isinstance(x, str) and x in geo.regions_by_state:
        return geo.regions_by_state[x]['Economic']
    else:
        return np.nan
raw_map['ECONOMIC_REGION'] = raw_map.STATE.apply(economic_f)

Sleep Duration

As of round 15, there are 36 participants who report sleeping less than five hours a night. To allow for a larger sample size, we will pool these with the individuals who report sleeping between five and six hours a night, creating a group who report sleeping less than six hours a night.


In [34]:
raw_map.loc[raw_map.SLEEP_DURATION == 'Less than 5 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'
raw_map.loc[raw_map.SLEEP_DURATION == '5-6 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'

Identification of a Healthy Subset of Adults

Certain health states are known to influence the microbiome in extreme ways. For some analyses we will do later, it may be useful to limit the noise associated with these conditions to allow us to look for new patterns. We have identified five metadata categories which we will use to limit our “healthy” subset.

First, we limit based on age for several reasons. We chose to omit anyone under the age of twenty. The microbiome of very young children is not yet stable, and differs greatly from that of adults [12, 13]. Additionally, BMI limits are not easily assigned in people who are still growing. Without stratifying by gender, we assumed that growth will be complete in most people by the age of 20, and set our limit there. The limit at seventy was based on the number of individuals over that age, and on differences in the microbiome seen in older individuals [14, 15].

We also used Body Mass Index as an exclusion criterion, considering only people in the “normal” and “overweight” categories (BMI 18.5 to 30). It has been suggested that obesity changes the gut microbiome, although the effect is not consistent across all studies [16]. Additionally, we noticed alterations in our sample of underweight individuals.

Recent antibiotic use decreases alpha diversity and affects the microbiome [17]. We chose to define “recent” as any time within the last year. We also excluded anyone who reported having Inflammatory Bowel Disease [16], Type I Diabetes [18-21], or Type II Diabetes [22], since all three conditions are known to affect the microbiome.


In [35]:
# Creates the subset if it's not already in the mapping file
if 'SUBSET' not in raw_map.columns:
    subset_f = {'AGE': lambda x: 19 < x < 70 and not np.isnan(x),
                'DIABETES': lambda x: x == 'I do not have diabetes',
                'IBD': lambda x: x == 'I do not have IBD',
                'ANTIBIOTIC_SELECT': lambda x: x == 'Not in the last year',
                'BMI': lambda x: 18.5 <= x < 30 and not np.isnan(x)}

    # Determines which samples meet the requirements of the categories
    new_bin = {}
    for cat, f in subset_f.iteritems():
        new_bin[cat] = raw_map[cat].apply(f)

    # Builds up the new binary dataframe
    bin_frame = pd.DataFrame(new_bin)

    # Adds a column to the current dataframe to look at the subset
    bin_series = pd.DataFrame(new_bin).all(1)
    bin_series.name = 'SUBSET'

    raw_map = raw_map.join(bin_series)

Whole Table Rarefaction

We will start by rarefying the whole body table using the rarefaction parameters we set earlier. Rarefaction discards samples below a certain sequencing depth, then randomly subsamples the sequences in each remaining sample so that all samples have an even depth. To control for the bias which might occur with a single random subsampling of the data, we use multiple rounds of rarefaction to more accurately estimate the alpha diversity.
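Conceptually, a single rarefaction round can be sketched with numpy: drop any sample below the target depth, then draw sequences without replacement from what remains. This is a toy illustration (the rarefy function and its arguments are hypothetical), not the QIIME implementation:

```python
import numpy as np

def rarefy(counts, depth, seed=None):
    """Randomly subsamples a vector of OTU counts to an even depth.

    Returns None for samples below the target depth, mirroring the way
    rarefaction discards under-sequenced samples.
    """
    counts = np.asarray(counts)
    if counts.sum() < depth:
        return None
    rng = np.random.RandomState(seed)
    # Expands the counts into one entry per observed sequence, then draws
    # `depth` sequences without replacement
    pool = np.repeat(np.arange(counts.size), counts)
    draw = rng.choice(pool, size=depth, replace=False)
    return np.bincount(draw, minlength=counts.size)

sample = [120, 30, 0, 50]               # counts for four OTUs
rare = rarefy(sample, depth=100, seed=42)
assert rare.sum() == 100                # always sums to the requested depth
assert rarefy([5, 5], depth=100) is None  # dropped: too shallow
```

Because each round draws a different random subsample, repeating this many times (as multiple_rarefactions_even_depth.py does) gives a more stable picture of the diversity than any single draw.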

Rarefaction is important to make intra sample diversity (alpha diversity) comparisons possible. Below is a panel from Figure 1 of Human Gut Microbiome and Risk of Colorectal Cancer[23]. The figure compares Shannon Diversity between individuals with colorectal cancer (n=47, red circles) and healthy controls (n=94, empty triangles) over several rarefaction depths, or sequence counts per sample.

The figure also illustrates the importance of even sampling depth. If a control sample with 500 sequences per sample were compared with a cancer sample at a depth of 2500 sequences per sample, the cancer sample would appear more diverse. Comparisons at the same depth reveal the true pattern in the data: cancer samples are less diverse than controls.

To perform multiple rarefactions, we will use the QIIME script, multiple_rarefactions_even_depth.py.


In [36]:
if not data_download and (not os.path.exists(os.path.join(rare_dir, rare_pattern) % last_rare) or overwrite):
    !multiple_rarefactions_even_depth.py -i $download_otu_fp -o $rare_dir -n $num_rarefactions -d $rarefaction_depth --lineages_included

Whole Table Alpha Diversity

We will use our rarefaction tables to calculate the alpha diversity associated with each rarefaction. Alpha diversity is a measure of intra-sample diversity. Imagine that we could put up an imaginary wall around a 100 ft x 100 ft x 10 ft box in Yellowstone National Park, trapping all the vertebrate animals in that area for a short period of time. Imagine that we then made a very careful list (or took photographs) so that we could document all the life we found in the area. We could count all the different types of animals we found. This would be one measure of alpha diversity.

Instead of just considering each type of animal to be equally similar, we might want to account for the evolutionary relationships between the animals. So, if our area contained a mouse, a squirrel, and a rabbit, we might say these animals are more similar (and therefore less diverse) than if we found a mouse, a squirrel, and a sagebrush lizard in the same area. Even though we’ve found three species in each case, the community whose third species is a reptile is more diverse than the one whose third species is a rodent.

A diversity metric which accounts for shared evolutionary history between species is called a phylogenetic metric. This often uses a phylogenetic tree to provide information about that shared history. PD Whole Tree Diversity is a commonly used phylogenetic alpha diversity metric in microbiome research [8]. A taxonomic metric assumes all species are equally different. Common taxonomic metrics for alpha diversity used in microbiome research include Observed Species Diversity and Chao1 Diversity [9, 10].

Depending on what information we’re looking for, we might also want to weight our diversity by the number of animals of each species we see. So, if in our little area of Yellowstone, 90% of the animals we see are mice, while 5% are rabbits and 5% are trout, we would consider this less diverse than if 40% of the animals were mice, 30% were rabbits, and 30% were trout. A metric which takes the counts of each species into account is a quantitative metric, while a qualitative metric looks only at the presence or absence of a species.
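The qualitative/quantitative distinction can be made concrete with a toy example. Here, observed species richness stands in for a qualitative metric and Shannon entropy for a quantitative one; these are simplified sketches, not the QIIME implementations:

```python
import numpy as np

def observed_species(counts):
    # Qualitative: counts only the presence or absence of each species
    return int((np.asarray(counts) > 0).sum())

def shannon(counts):
    # Quantitative: weights each species by its relative abundance
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

even   = [40, 30, 30]   # mice, rabbits, and trout in similar numbers
skewed = [90, 5, 5]     # mostly mice

# Both communities contain three species...
assert observed_species(even) == observed_species(skewed) == 3
# ...but the even community carries higher Shannon diversity
assert shannon(even) > shannon(skewed)
```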

While alpha diversity is calculated independently for each sample, comparisons of alpha diversity may provide clues about environmental changes. For example, pollution or an algal bloom may be associated with lower alpha diversity, and indicate a potential change in the health of the ecosystem. We’ll start our work with alpha diversity by calculating the diversity for our rarefied American Gut tables using the four metrics we selected in the alpha diversity parameters: the phylogenetic PD Whole Tree Diversity, and the taxonomic metrics Observed Species Diversity, Chao1 Diversity, and Shannon Diversity. The first three are qualitative metrics; Shannon Diversity also weights each species by its abundance, making it a quantitative metric.


In [37]:
if not data_download and (not os.path.exists(os.path.join(alpha_dir, alpha_pattern) % last_rare) or overwrite):
    !alpha_diversity.py -i $rare_dir -o $alpha_dir -m $alpha_metrics -t $tree_fp

The alpha diversity results from QIIME are loaded into the notebook. To identify the best rarefaction instance, which we'll use as our OTU table moving forward, we find the rarefaction instance whose alpha diversity is closest to the mean alpha diversity across all instances. We define "closest" using the normalized Euclidean distance.
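As a simplified sketch of this selection criterion (assuming a samples-by-rounds array of alpha diversity values), we compute each round's mean normalized deviation from the per-sample mean and keep the round that minimizes it:

```python
import numpy as np

# Toy alpha diversity values: rows are samples, columns are rarefaction rounds
rounds = np.array([[10.0, 12.0, 11.0],
                   [20.0, 19.0, 21.0]])

# Per-sample mean diversity across all rounds
mean = rounds.mean(axis=1, keepdims=True)

# Normalized absolute deviation of each round from the mean,
# averaged over samples
deviation = (np.abs(rounds - mean) / mean).mean(axis=0)

# The "best" round is the one whose diversity profile sits closest
# to the mean profile
best = int(deviation.argmin())
assert best == 2
```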


In [38]:
if not data_download:
    # Preallocates an output object for alpha diversity
    alpha_rounds = {m: {} for m in alpha_metrics.split(',')}
    div_metric = alpha_metrics.split(',')[0]

    # Loops through the rarefaction instances
    for ri in range(num_rarefactions):
        a_file_blanks = {'rare_depth': rarefaction_depth,
                         'rare_instance': ri}

        # Sets the alpha diversity file path
        alpha_fp = os.path.join(alpha_dir, alpha_pattern) % a_file_blanks

        # Loads the alpha diversity table
        alpha = pd.read_csv(alpha_fp,
                            sep=txt_delim,
                            index_col=False)
        alpha.index = alpha['Unnamed: 0']
        del alpha['Unnamed: 0']

        # Extracts the alpha diversity metrics
        for col in alpha_rounds:
            alpha_rounds[col]['%i' %ri] = alpha[col]
            alpha_rounds[col]['%i' %ri].name = '%i' % ri

In [39]:
if not data_download:
    # Compiles the alpha diversity results into a single table
    alpha_df = pd.DataFrame({'%s_mean' % metric: pd.DataFrame(alpha_rounds[metric]).mean(1)
                             for metric in alpha_metrics.split(',')})

    # Adds the alpha diversity results to the rarefied table
    rare_map = raw_map.copy()
    rare_map = rare_map.join(alpha_df)
    rare_check = ~np.isnan(rare_map['%s_mean' % div_metric])
    rare_map = rare_map.loc[rare_check]

    # Draws the data associated with each of the alpha diversity rounds
    all_rounds = pd.DataFrame(alpha_rounds[div_metric])

    # Lines up the data so the indices match (as a precaution)
    all_rounds = all_rounds.sort_index()
    alpha_df = alpha_df.sort_index()

    # Calculates the normalized distance between each round and the mean
    mean_rounds = np.tile(alpha_df['%s_mean' % div_metric].values[:, np.newaxis],
                          (1, num_rarefactions))
    diff = np.abs(all_rounds.values - mean_rounds) / mean_rounds

    # Determines the round with the minimum mean distance from the mean
    round_labels = np.arange(num_rarefactions)
    round_avg = diff.mean(0)
    best_rarefaction = round_labels[round_avg == min(round_avg)][0]
    best_blanks = {'rare_depth': rarefaction_depth,
                   'rare_instance': best_rarefaction}

We’ll save the whole body tables in their own directory, and the modified mapping files. We’ll also copy the raw OTU table and the rarefaction instance closest to the mean alpha diversity.


In [40]:
# Saves the unrarefied mapping file
if not data_download:
    raw_map.to_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,
                   sep=txt_delim,
                   na_rep=write_na,
                   index_label=map_index)

    # Saves the rarefied mapping file
    rare_map.to_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,
                    sep=txt_delim,
                    na_rep=write_na,
                    index_label=map_index)

    # Copies the raw OTU table
    shutil.copy2(download_otu_fp, 
                 os.path.join(all_dir, otu_fn) % all_raw_blanks)

    # Copies the rarefied OTU table
    shutil.copy2(os.path.join(rare_dir, rare_pattern) % best_blanks,
                 os.path.join(all_dir, otu_fn) % all_rare_blanks)

raw_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,
                      sep=txt_delim,
                      na_values=map_nas,
                      index_col=False,
                      low_memory=False,
                      dtype={map_index: str})
raw_map.index = raw_map[map_index]
del raw_map[map_index]

rare_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,
                       sep=txt_delim,
                       na_values=map_nas,
                       index_col=False,
                       low_memory=False,
                       dtype={map_index: str})
rare_map.index = rare_map[map_index]
del rare_map[map_index]

Whole Table Beta Diversity

Beta Diversity allows us to make comparisons between samples and environments. Let’s go back to our 100ft x 100ft x 10ft cube in Yellowstone where we catalogued all the vertebrates. Let’s imagine that we’ve set up the same type of cube in New York City’s Central Park and cataloged all the vertebrates in that area as well.

We could compare the two communities by counting how many species are shared between the two, or by using a measure that accounts for how similar the species are. We might expect some overlap: depending on where we selected our regions, it would be unsurprising to encounter chipmunks in both Central Park and Yellowstone National Park. However, there should also be some differences. Unless our Central Park location includes the zoo, it’s unlikely we’d find a buffalo in New York City!

If we use a taxonomic metric, based only on the species we find in the two locations, we might get very little overlap. While we might expect to find a squirrel in both Central Park and Yellowstone, the animals might be members of different genera! New York is home to the Eastern Grey Squirrel, Sciurus carolinensis, while we might find the American Red Squirrel, Tamiasciurus hudsonicus, in Yellowstone [24, 25]. In this case, a phylogenetic metric, which can account for some similarity between the two species of squirrels, may serve us much better.

When we compare microbial communities for beta diversity, we frequently select a phylogenetic metric called UniFrac distance [5]. This metric uses a phylogenetic tree, and determines what fraction of the tree is not shared between two communities.

If we consider only the presence and absence of each OTU in the samples, we have a qualitative metric, unweighted UniFrac distance. Unweighted UniFrac distance may take on values between 0 (everything the same) and 1 (everything different). Weighted UniFrac distance takes into account the abundance of the OTUs, and can take on values greater than 1.
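UniFrac itself requires a phylogenetic tree, but the qualitative/quantitative distinction can be sketched with two simpler, non-phylogenetic distances: Jaccard (presence/absence) and Bray-Curtis (abundance-weighted). These stand-ins are for illustration only, and are not the metrics beta_diversity.py computes below:

```python
import numpy as np

def jaccard_distance(a, b):
    # Qualitative: compares only which OTUs are present in each sample
    a, b = np.asarray(a) > 0, np.asarray(b) > 0
    return 1.0 - (a & b).sum() / float((a | b).sum())

def bray_curtis(a, b):
    # Quantitative: weights shared OTUs by their abundances
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.abs(a - b).sum() / (a + b).sum())

x = [10, 10, 0]   # counts for three OTUs in sample x
y = [1, 19, 0]    # same OTUs present, very different abundances

# Qualitatively, the communities are identical...
assert jaccard_distance(x, y) == 0.0
# ...but quantitatively they are distinct
assert bray_curtis(x, y) > 0.0
```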

The UniFrac distance for each pairwise sample is arranged into a Distance Matrix. We can visualize the distance matrix by many techniques, like making PCoA plots in Emperor, or UPGMA trees like the one shown in the figure below [2].

Since UniFrac distance is calculated for each sample pair in the table, this is one of the most computationally expensive steps we will perform. However, once the UniFrac distance has been calculated for all of our samples, we can simply filter the table to focus on the samples we want. We can leverage the QIIME script, beta_diversity.py, to perform our analysis.


In [41]:
# Sets up the filepaths for the all sample rarefied table
all_otu_rare_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks

check_dm_fp = np.array([os.path.exists(os.path.join(all_dir, fn_) 
                                       % all_rare_blanks) for fn_ in dm_fn])

# Calculates the beta diversity
if not data_download and (not check_dm_fp.all() or overwrite):
    !beta_diversity.py -i $all_otu_rare_fp -m $beta_metrics -t $tree_fp -o $all_dir

Body Site Split

Now that we’ve generated alpha and beta diversity results for all the body sites, we can start filtering the results. Body site has one of the largest impacts on the microbiome in adult humans [7]. As a result, many analyses will focus on a single body site, often fecal samples.

We’ll use the QIIME script, split_otu_table.py to split our rarefied and unrarefied OTU tables by body site. We’ll put the output files in intermediate directories, and then move them to the appropriate locations.


In [42]:
# Sets the raw location file names
all_raw_otu_fp = os.path.join(all_dir, otu_fn) % all_raw_blanks
all_raw_map_fp = os.path.join(all_dir, map_fn) % all_raw_blanks

# Sets the rarefied location file names
all_rare_otu_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks
all_rare_map_fp = os.path.join(all_dir, map_fn) % all_rare_blanks

In [43]:
# Checks that the raw and rarefied bodysite split tables exist
raw_split_check = np.array([])
rare_split_check = np.array([])

for site in all_bodysites:
    # Checks the unrarefied splits exists
    otu_raw_split_blanks['split_group'] = site
    map_raw_split_blanks['split_group'] = site
    raw_otu_exist = os.path.exists(os.path.join(split_raw_dir, split_fn) 
                                   % otu_raw_split_blanks)
    raw_map_exist = os.path.exists(os.path.join(split_raw_dir, split_fn) 
                                   % map_raw_split_blanks)
    raw_split_check = np.hstack((raw_split_check, raw_otu_exist, raw_map_exist))
    
    # Checks the rarefied splits exist
    otu_rare_split_blanks['split_group'] = site
    map_rare_split_blanks['split_group'] = site
    rare_otu_exist = os.path.exists(os.path.join(split_rare_dir, split_fn) 
                                   % otu_rare_split_blanks)
    rare_map_exist = os.path.exists(os.path.join(split_rare_dir, split_fn) 
                                   % map_rare_split_blanks)
    rare_split_check = np.hstack((rare_split_check, rare_otu_exist, rare_map_exist))

In [44]:
# Splits the unrarefied otu table and mapping file by bodysite
if not data_download and (not raw_split_check.all() or overwrite):
    !split_otu_table.py -i $all_raw_otu_fp -m $all_raw_map_fp -f BODY_HABITAT -o $split_raw_dir

# Splits the rarefied otu table and mapping file by bodysite
if not data_download and (not rare_split_check.all() or overwrite):
    !split_otu_table.py -i $all_rare_otu_fp -m $all_rare_map_fp -f BODY_HABITAT -o $split_rare_dir

We’ll move and rename our split files to their final location.


In [45]:
# Copies the files to their correct final folder
if not data_download:
    for idx, h_site in enumerate(habitat_sites):
        otu_raw_split_blanks['split_group'] = h_site
        map_raw_split_blanks['split_group'] = h_site
        otu_rare_split_blanks['split_group'] = h_site
        map_rare_split_blanks['split_group'] = h_site
        raw_sample_blanks['site'] = all_bodysites[idx]
        rare_sample_blanks['site'] = all_bodysites[idx]

        # Copies the unrarefied mapping file
        shutil.copy2(os.path.join(split_raw_dir, split_fn) 
                     % map_raw_split_blanks,
                     os.path.join(asab_pattern, map_fn) 
                     % raw_sample_blanks)

        # Copies the unrarefied OTU table
        shutil.copy2(os.path.join(split_raw_dir, split_fn) 
                     % otu_raw_split_blanks,
                     os.path.join(asab_pattern, otu_fn) 
                     % raw_sample_blanks)

        # Copies the rarefied mapping file
        shutil.copy2(os.path.join(split_rare_dir, split_fn) 
                     % map_rare_split_blanks,
                     os.path.join(asab_pattern, map_fn) 
                     % rare_sample_blanks)

        # Copies the rarefied OTU table
        shutil.copy2(os.path.join(split_rare_dir, split_fn) 
                     % otu_rare_split_blanks,
                     os.path.join(asab_pattern, otu_fn)
                     % rare_sample_blanks)

To get our distance matrices for each OTU table, we’ll use the QIIME script, filter_distance_matrix.py. We will use the mapping file for each body site to filter the distance matrices.


In [46]:
if not data_download:
    for idx, a_site in enumerate(all_bodysites):
        
        rare_sample_blanks['site'] = a_site

        # Gets the rarefied mapping file for the site
        map_in = os.path.join(asab_pattern, map_fn) % rare_sample_blanks

        for fn_ in dm_fn:
            dm_in = os.path.join(all_dir, fn_) % all_rare_blanks
            dm_out = os.path.join(asab_pattern, fn_) % rare_sample_blanks

            if not os.path.exists(dm_out) or overwrite:
                !filter_distance_matrix.py -i $dm_in -o $dm_out --sample_id_fp $map_in

Select a Single Sample for Each Participant

For some analyses we will choose to perform, it can be useful to work with a single sample for each participant at each body site. Many statistical tests assume sample independence. The microbiome among healthy adults is relatively stable across multiple samples within an individual; there is a higher correlation between your personal samples collected across several days than there is between your sample and another person’s sample collected at the same time [7].

We’re going to start defining our single sample data sets by writing a function which will allow us to randomly select a sample from each individual. This will take a pandas data frame as an input. We’ll group the data so we can look at each individual (given by the HOST_SUBJECT_ID), and then randomly select one sample id per individual.


In [47]:
def identify_single_samples(map_):
    """Selects a single sample for each participant

    Parameters
    ----------
    map_ : pandas DataFrame
        A mapping file for our set of samples. A single body site should be
        used with human samples.

    Returns
    -------
    single_ids : ndarray
        A list of ids which represent a single sample per individual

    """
    # Identifies a single sample per individual
    single_ids = np.hstack([np.random.choice(np.array(ids, dtype=str), 1)
                            for indv, ids in
                            map_.groupby('HOST_SUBJECT_ID').groups.iteritems()])
    return single_ids

We’ll apply our filtering function at each body site. To do this, we'll define another function that takes a filtering function we supply, like identify_single_samples, and applies it to a data set.

The function we're writing here, filter_dataset, will first identify the samples to be filtered using the function we pass in. It will use that list of samples to filter the rarefied and unrarefied (raw) mapping files. We will leverage the QIIME scripts, biom subset-table and filter_distance_matrix.py to filter our OTU table and distance matrices.


In [48]:
def filter_dataset(filter_fun, site, dir_in, dir_out, ids_fp):
    """Filters data set to create a subset
    
    Parameters
    ----------
    filter_fun : function
        A function which takes a pandas map and returns a list of sample 
        ids.
    site : str
        The body site for which the samples are being generated.
    dir_in : str
        The directory in which the input analysis files are located. The
        directory and files are assumed to exist.
    dir_out : str
        The directory where the filtered files should be put. The directory
        must exist.
    ids_fp : str
        The filepath where the list of ids in the subset is located.
    
    Returns
    -------
    There are no explicit python returns. Rarefied and unrarefied OTU tables,
    and their corresponding mapping files (the rarefied file includes alpha
    diversity) as well as distance matrices for all distance metrics used
    are saved in the `dir_out`.

    """
    rare_sample_blanks['site'] = site
    raw_sample_blanks['site'] = site

    # Sets up the file names for the original files
    rare_map_in_fp = os.path.join(dir_in, map_fn) % rare_sample_blanks
    rare_otu_in_fp = os.path.join(dir_in, otu_fn) % rare_sample_blanks

    raw_map_in_fp = os.path.join(dir_in, map_fn) % raw_sample_blanks
    raw_otu_in_fp = os.path.join(dir_in, otu_fn) % raw_sample_blanks

    rare_map_out_fp = os.path.join(dir_out, map_fn) % rare_sample_blanks
    rare_otu_out_fp = os.path.join(dir_out, otu_fn) % rare_sample_blanks

    raw_map_out_fp = os.path.join(dir_out, map_fn) % raw_sample_blanks
    raw_otu_out_fp = os.path.join(dir_out, otu_fn) % raw_sample_blanks

    # Checks if the single sample id filepath exists
    if not os.path.exists(ids_fp) or not os.path.exists(rare_map_out_fp) or overwrite:
        # Reads in the rarefied table
        rare_map_in = pd.read_csv(rare_map_in_fp,
                                  sep=txt_delim,
                                  na_values=map_nas,
                                  index_col=False,
                                  dtype={map_index: str})
        rare_map_in.index = rare_map_in[map_index]
        del rare_map_in[map_index]

        # Identifies the sample ids
        filt_ids = filter_fun(rare_map_in)
        
        rare_map_out = rare_map_in.loc[filt_ids]

        # Saves the list of filtered sample ids
        with open(ids_fp, 'w') as ids_file:
            ids_file.write('\n'.join(list(filt_ids)))
        
        # Saves the rarefied mapping file
        rare_map_out.to_csv(rare_map_out_fp,
                            sep=txt_delim,
                            na_rep=write_na,
                            index_label=map_index)
    else:
        with open(ids_fp, 'r') as ids_file:
            filt_ids = ids_file.read().split('\n')
        
    if not os.path.exists(raw_map_out_fp) or overwrite:
        raw_map_in = pd.read_csv(raw_map_in_fp,
                                 sep=txt_delim,
                                 na_values=map_nas,
                                 index_col=False,
                                 dtype={map_index: str})
        raw_map_in.index = raw_map_in[map_index]
        del raw_map_in[map_index]
        raw_map_out = raw_map_in.loc[filt_ids]
        raw_map_out.to_csv(raw_map_out_fp,
                           sep=txt_delim,
                           na_rep=write_na,
                           index_label=map_index)

    # Filters the OTU table down to single samples
    if not os.path.exists(rare_otu_out_fp) or overwrite:
        !biom subset-table -i $rare_otu_in_fp -o $rare_otu_out_fp -a sample -s $ids_fp

    if not os.path.exists(raw_otu_out_fp) or overwrite:
        !biom subset-table -i $raw_otu_in_fp -o $raw_otu_out_fp -a sample -s $ids_fp

    # Filters the distance matrices
    for dm_ in dm_fn:
        dm_in_fp = os.path.join(dir_in, dm_) % rare_sample_blanks
        dm_out_fp = os.path.join(dir_out, dm_) % rare_sample_blanks
        if not os.path.exists(dm_out_fp) or overwrite:
            !filter_distance_matrix.py -i $dm_in_fp -o $dm_out_fp --sample_id_fp $ids_fp

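The filepaths above are built by filling named blanks in a shared pattern with old-style `%`-formatting, so a single pattern serves every body site and rarefaction depth. A minimal sketch of that mechanism, using a hypothetical pattern name (the real `map_fn` and `otu_fn` patterns are defined earlier in the notebook):

```python
import os

# Hypothetical filename pattern; the real patterns are set earlier in the
# notebook. The %(site)s blank is filled from the blanks dictionary.
map_fn = 'AG_even10k_%(site)s.txt'
rare_sample_blanks = {'site': 'fecal'}

# os.path.join keeps the pattern intact; %-formatting then fills the blank.
map_fp = os.path.join('sample_data', map_fn) % rare_sample_blanks
# map_fp now ends with 'AG_even10k_fecal.txt'
```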
Let's apply our two functions to identify a single sample per individual at each site.


In [49]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site that is not part of the single-sample set
    if a_site not in one_samp_sites:
        continue
    filter_dataset(filter_fun=identify_single_samples, site=a_site, dir_in=asab_pattern, dir_out=assb_pattern,
                   ids_fp=os.path.join(assb_pattern, sin_fn) % {'site':a_site})

Filter the Table to the Healthy Subset

Finally, we may wish to have a healthy subset of individuals for certain analyses. The criteria we’ve used to define the healthy subset are described above.

We're going to define a quick function that makes use of the SUBSET column in the mapping file.


In [50]:
def identify_subset(map_):
    return map_.loc[map_.SUBSET == True].index.values
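For illustration only, here is how that filter behaves on a toy mapping frame (the sample ids and SUBSET values below are made up):

```python
import pandas as pd

# Toy mapping file: SUBSET flags samples meeting the healthy-subset criteria.
toy_map = pd.DataFrame({'SUBSET': [True, False, True]},
                       index=['000001', '000002', '000003'])

def identify_subset(map_):
    # Keeps only the rows where SUBSET is True and returns their sample ids
    return map_.loc[map_.SUBSET == True].index.values

kept = identify_subset(toy_map)
# kept contains '000001' and '000003'; '000002' is filtered out
```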

Now, we’ll use essentially the same pipeline we leveraged for filtering the single samples.


In [51]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site where the healthy subset criteria should not be applied
    if a_site not in sub_part_sites:
        continue
    filter_dataset(identify_subset, a_site, asab_pattern, ssab_pattern,
                   os.path.join(ssab_pattern, sub_fn) % {'site':a_site})

In [52]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site where either the healthy subset or single-sample
    # criteria should not be applied
        continue
    
    filter_dataset(identify_subset, a_site, assb_pattern, sssb_pattern,
                   os.path.join(sssb_pattern, sub_fn) % {'site':a_site})

We have generated body site-specific, rarefied OTU tables, mapping files with alpha diversity, and UniFrac distance matrices for our American Gut data, as well as focused datasets. We can choose to create further-filtered tables, or take the outputs of this notebook and use them for downstream analysis.

Return to the top

References

  1. Caporaso, J.G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F.D.; Costello, E.K.; Fierer, N.; Peña, A.G.; Goodrich, J.K.; Gordon, J.I.; Huttley, G.A.; Kelley, S.T.; Knights, D.; Koenig, J.E.; Ley, R.E.; Lozupone, C.A.; McDonald, D.; Muegge, B.D.; Pirrung, M.; Reeder, J.; Sevinsky, J.R.; Turnbaugh, P.J.; Walters, W.A.; Widmann, J.; Yatsunenko, T.; Zaneveld, J. and Knight, R. (2010) “QIIME allows analysis of high-throughput community sequence data.” Nature Methods. 7: 335-336.

  2. Vázquez-Baeza, Y.; Pirrung, M.; Gonzalez, A.; and Knight, R. (2013). “EMPeror: a tool for visualizing high-throughput microbial community data.” Gigascience. 2: 16.

  3. Langille, M.G.; Zaneveld, J.; Caporaso, J.G.; McDonald, D.; Knights, D.; Reyes, J.A.; Clemente, J.C.; Burkepile, D.E.; Vega Thurber, R.L.; Knight, R.; Beiko, R.G.; and Huttenhower, C. (2013). “Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences.” Nat Biotechnol. 31: 814-821.

  4. McDonald, D.; Clemente, J.C.; Kuczynski, J.; Rideout, J.R.; Stombaugh, J.; Wendel, D.; Wilke, A.; Huse, S.; Hufnagle, J.; Meyer, F.; Knight, R.; and Caporaso, J.G. (2012). “The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome.” Gigascience. 1: 7.

  5. Lozupone, C.; and Knight, R. (2005). “UniFrac: a new phylogenetic method for comparing microbial communities.” Appl Environ Microbiol. 71: 8228-8235.

  6. Lozupone, C.; Lladser, M.E.; Knights, D.; Stombaugh, J.; and Knight, R. (2011). “UniFrac: an effective distance metric for microbial community composition.” ISME J. 5: 169-172.

  7. The Human Microbiome Consortium. (2012) “Structure, function and diversity of the healthy human microbiome.” Nature. 486: 207-214.

  8. Eckburg, P.B.; Bik, E.M.; Bernstein, C.N.; Purdom, E.; Dethlefsen, L.; Sargent, M.; Gill, S.R.; Nelson, K.E.; Relman, D.A. (2005) “Diversity of the human intestinal microbial flora.” Science. 308: 1635-1638.

  9. Chao, A. (1984) “Nonparametric estimation of the number of classes in a population.” Scandinavian Journal of Statistics. 11: 265-270.

  10. Seaby, R.M.H. and Henderson, P.A. (2006). “Species Diversity and Richness 4.” http://www.pisces-conservation.com/sdrhelp/index.html.

  11. McDonald, D.; Price, N.M.; Goodrich, J.; Nawrocki, E.P.; DeSantis, T.Z.; Probst, A.; Andersen, G.L.; Knight, R. and Hugenholtz, P. (2012). “An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.” ISME J. 6: 610-618.

  12. Koenig, J.E.; Spor, A.; Scalfone, N.; Fricker, A.D.; Stombaugh, J.; Knight, R.; Angenent, L.T.; and Ley, R.E. (2011). “Succession of microbial consortia in the developing infant gut microbiome.” PNAS. 108 Suppl 1: 4578 - 4585.

  13. Yatsunenko, T.; Rey, F.E.; Manary, M.J.; Trehan, I.; Dominguez-Bello, M.G.; Contreras, M.; Magris, M.; Hidalgo, G.; Baldassano, R.N.; Anokhin, A.P.; Heath, A.C.; Warner, B.; Reeder, J.; Kuczynski, J.; Caporaso, J.G.; Lozupone, C.A.; Lauber, C.; Clemente, J.C.; Knights, D.; Knight, R. and Gordon, J.I. (2012) “Human gut microbiome viewed across age and geography.” Nature. 486: 222-227.

  14. Claesson, M.J.; Cusack, S.; O’Sullivan, O.; Greene-Diniz, R.; de Weerd, H.; Flannery, E.; Marchesi, J.R.; Falush, D.; Dinan, T.; Fitzgerald, G.; Stanton, C.; van Sinderen, D.; O’Connor, M.; Harnedy, N.; O’Connor, K.; Henry, C.; O’Mahony, D.; Fitzgerald, A.P.; Shanahan, F.; Twomey, C.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2011). “Composition, variability and temporal stability of the intestinal microbiota of the elderly.” PNAS. 108 Suppl 1: 4586-4591.

  15. Claesson, M.J.; Jeffery, I.B.; Conde, S.; Power, S.E.; O’Connor, E.M.; Cusack, S.; Harris, H.M.; Coakley, M.; Lakshminarayanan, B.; O’Sullivan, O.; Fitzgerald, G.F; Deane, J.; O’Connor, M.; Harnedy, N.; O’Connor, K.; O’Mahony, D.; van Sinderen, D.; Wallace, M.; Brennan, L.; Stanton, C.; Marchesi, J.R.; Fitzgerald, A.P.; Shanahan, F.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2012). “Gut microbiota composition correlates with diet and health in the elderly.” Nature. 488: 178-184.

  16. Walters, W.A.; Xu, Z.; and Knight, R. (2014) “Meta-analysis of human gut microbes associated with obesity and IBD.” FEBS Letters. 588: 4223-4233.

  17. Dethlefsen, L. and Relman, D.A. (2011) “Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation.” PNAS. 108 Suppl 1: 4554-4561.

  18. de Goffau, M.C.; Luopajärvi, K.; Knip, M.; Ilonen, J.; Ruohtula, T.; Härkönen, T.; Orivuori, L.; Hakala, S.; Welling, G.W.; Harmsen, H.J.; and Vaarala, O. (2013). “Fecal microbiota composition differs between children with β-cell autoimmunity and those without.” Diabetes. 62: 1238-1244.

  19. Giongo, A.; Gano, K.A.; Crabb, D.B.; Mukherjee, N.; Novelo, L.L.; Casella, G.; Drew, J.C.; Ilonen, J.; Knip, M.; Hyöty, H; Veijola, R.; Simell, T.; Simell, O.; Neu, J.; Wasserfall, C.H.; Schatz, D.; Atkinson, M.A.; and Triplett, E.W. (2011). “Toward defining the autoimmune microbiome for type 1 diabetes.” ISME J. 5: 82-91.

  20. Mejía-León, M.E.; Petrosino, J.F.; Ajami, N.J.; Domínguez-Bello, M.G.; and de la Barca, A.M. (2014). “Fecal microbiota imbalance in Mexican children with type 1 diabetes.” Scientific Reports. 4: 3814.

  21. Murri, M.; Leiva, I.; Gomez-Zumaquero, J.M.; Tinahones, F.J.; Cardona, F.; Soriguer, F.; and Queipo-Ortuño, M.I. (2013). “Gut microbiota in children with type 1 diabetes differs from that in healthy children: a case-control study.” BMC Med. 11: 46.

  22. Larsen, N.; Vogensen, F.K.; van den Berg, F.W.; Nielsen, D.S.; Andreasen, A.S.; Pedersen, B.K.; Al-Soud, W.A.; Sørensen, S.J.; Hansen, H.L. and Jakobsen, M. (2010). “Gut microbiota in human adults with type 2 diabetes differs from non-diabetic adults.” PLoS One. 5: e9085.

  23. Ahn, J.; Sinha, R.; Pei, Z.; Dominianni, C.; Wu, J.; Shi, J.; Goedert, J.J.; Hayes, R.B.; and Yang, L. (2013). “Human gut microbiome and risk for colorectal cancer.” J Natl Cancer Inst. 105: 1907-1911.

  24. National Park Service. (2015). “Mammal Checklist.” Yellowstone National Park.

  25. Milieris, V. (2011) “Biodiversity in Central Park”. Exploring Central Park. CUNY.

Return to the top