In [1]:
# This cell allows us to render the notebook in the way we wish no matter where the notebook is rendered.
from IPython.core.display import HTML
css_file = 'ag.css'
HTML(open(css_file, "r").read())
Out[1]:
The goal of this notebook is to take the OTU table and mapping files generated by the American Gut Primary Processing Pipeline Notebook and manipulate them to produce a uniform set of rarefied and filtered tables which can be used in downstream analyses, such as our Power Notebook, or using QIIME, EMPeror, or PICRUSt [1 - 3]. This processing is centralized since it can be computationally expensive, given the size of the tables involved, and because it removes some error associated with the random number generation used in some steps.
This notebook will generate American Gut datasets for three body sites. For each site, this will include a dataset of all the samples collected at that location, and a set containing a single sample from each donor at that body site. These are calculated on a per body site basis, so an individual who donated two fecal samples and an oral sample will have one sample in the single fecal sample set, and one sample in the single oral set.
Additionally, we decided to create a healthy subset for fecal samples. The definition is provided below.
The final directory structure is organized as follows. Parent directories are bolded. You can choose to download the complete directory (rather than running this notebook) by setting the data_download variable.
Our folders will contain three types of files: two metadata files, two Biom-format OTU table files and two distance matrices:
*_oral and _skin may be substituted for _fecal in file names.
In addition, a text file called either single_samples.txt or subset_samples.txt may be generated. This is the list of samples in the data set. The file is primarily auxiliary, and is used by this notebook to filter Biom and distance matrix files into single sample data sets.
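A sketch of how such a sample list can be used to filter a metadata table and a distance matrix with pandas (the sample IDs and distances below are invented, not the notebook's actual data):

```python
import pandas as pd

# Hypothetical contents of single_samples.txt: one sample ID per line
single_samples = ['000001', '000003']

# A toy mapping table indexed by sample ID
metadata = pd.DataFrame({'BODY_HABITAT': ['UBERON:feces'] * 4,
                         'AGE': [25, 31, 44, 52]},
                        index=['000001', '000002', '000003', '000004'])

# Keep only the rows whose IDs appear in the sample list
single_map = metadata.loc[single_samples]

# A distance matrix can be filtered the same way, on both axes
dm = pd.DataFrame([[0.0, 0.4, 0.6, 0.7],
                   [0.4, 0.0, 0.5, 0.3],
                   [0.6, 0.5, 0.0, 0.2],
                   [0.7, 0.3, 0.2, 0.0]],
                  index=metadata.index, columns=metadata.index)
single_dm = dm.loc[single_samples, single_samples]
```

The same list filters the Biom table, the mapping file, and the distance matrices, so the three file types stay in sync.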
Let’s talk a bit more about what information is contained in each file type.
A metadata file, sometimes called a mapping file, provides information about our samples which cannot be determined from 16S marker gene sequences alone. Metadata is as important as the bacterial sequence information for determining information about the microbiome. Metadata tells us information about the sample, such as the body site where it was collected, the participant’s age or whether or not they have certain diseases. Since these can make a very large difference in the microbiome, it’s important to have this information!
American Gut metadata is collected through the participant survey. The survey allows participants to skip any question they do not wish to answer, meaning that some samples are missing fields. The python library used to handle metadata in these notebooks will specially encode these fields. Within IPython analysis notebooks, missing data will be represented as a Numpy NaN. Printed notebooks are set to render missing values as empty spaces, although this can be changed by altering the write_na variable. For certain QIIME scripts, leaving these fields blank allows the script to ignore samples with missing metadata.
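A minimal demonstration of this encoding, using the same na_values list set later in this notebook (the sample IDs and fields here are made up):

```python
import io

import numpy as np
import pandas as pd

# A toy tab-delimited mapping file with a missing value encoded two ways
raw = "#SampleID\tAGE\tDIET\nS1\t33\tno_data\nS2\tNA\tomnivore\n"

# The na_values list collapses all missing-data encodings to NaN on load
df = pd.read_csv(io.StringIO(raw), sep='\t',
                 na_values=['NA', 'no_data', 'unknown', ''],
                 index_col='#SampleID')

# Missing values are now numpy NaNs...
assert np.isnan(df.loc['S2', 'AGE'])

# ...and can be written back out as blanks, mirroring write_na = ''
out = df.to_csv(sep='\t', na_rep='')
```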
The American Gut metadata is also de-identified. This means the metadata does not contain information which could be used to identify a participant, like their name, email address, or the kit ID. Instead, each participant is assigned a code. This allows us to identify multiple samples from the same individual. The samples are identified by the barcode number, which appears on the sample tube. This number connects the survey metadata to the sample data in the OTU table and distance matrix.
This notebook will add alpha diversity to the mapping file for rarefied data. Files which contain the alpha diversity will always have a rarefaction depth notation in the file name. The convention here is to include the word even, followed by the rarefaction depth (e.g., AGP_100nt_even10k_fecal.txt).
An OTU table describes the bacterial content in a group of samples. An OTU, or operational taxonomic unit, is a cluster of sequences which are identical to some degree of similarity. We use these as stand-ins for bacterial groups. More information about OTUs, the level of similarity used, and the methods we use to generate OTUs can be found in the Primary Processing Pipeline Notebook. The OTU table allows us to link the sequencing results from our 16S data to the sample ID in a usable way, and gives an easier platform for comparison across samples.
Our OTU tables are saved as a special file format, the Biom format [4]. Unlike the other file types we will generate here, Biom files cannot be viewed using a normal spreadsheet program on your computer. The benefit of Biom format is that it allows us to save large amounts of data in a smaller amount of space. Biom data may be encoded as a JSON string, or using the same HDF5 compression NASA does. Because of the size of the American Gut data, we recommend the HDF5 format.
Finally, Distance Matrices describe the relationship between any two samples based on their community structure, measuring an ecological concept called Beta Diversity. In this notebook, we will measure the UniFrac Distance between all of our samples. This takes into account the evolutionary relationship between bacteria when communities are compared [5, 6]. Each cell in the distance matrix tells the distance between the sample given by the row and the sample given by the column. Distance matrices are symmetrical, which means that we can draw a line across the diagonal of the distance matrix, and the distances on either side of this line will be equal. The distances along that line should come from the same sample, and will have a distance of 0. We can use our distance matrix information directly, or use multidimensional scaling techniques like Principal Coordinates Analysis (PCoA) or make a UPGMA tree.
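The symmetry and zero-diagonal properties can be checked on a small example (the distances below are invented, not real UniFrac values):

```python
import numpy as np

# A toy distance matrix for three samples, A, B, and C
dm = np.array([[0.0, 0.3, 0.7],
               [0.3, 0.0, 0.5],
               [0.7, 0.5, 0.0]])

# Symmetry: the distance from A to B equals the distance from B to A
assert np.allclose(dm, dm.T)

# The diagonal compares each sample to itself, so it is always zero
assert np.allclose(np.diag(dm), 0)
```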
$\LaTeX$ is also recommended for running this suite of analysis notebooks, although it is not required for this notebook. TeX Live offers one installation solution.
We will start our analysis by importing the functions we need from python libraries.
In [2]:
import os
import shutil
import copy
import datetime
import numpy as np
import pandas as pd
import skbio
import biom
import americangut.diversity_analysis as div
import americangut.geography_lib as geo
from qiime_default_reference import get_reference_tree
It is also important to define certain aspects of how we'll handle files and do our analysis. It is easier to set these all at the same time, so the settings are consistent every time we repeat the process, rather than repeating them in multiple places. This way, we only have to change a parameter once.
In the course of this analysis, a series of files can be generated. The File Saving Parameters determine whether new files are saved.
In [3]:
data_download = False
overwrite = False
# Checks the data download-overwrite relationship is valid
if data_download:
    overwrite = True
We'll start by defining how we'll handle certain files, especially metadata files. We will use the Pandas library to handle most of our text files. This library provides some spreadsheet-like functionality.
In [4]:
txt_delim = '\t'
map_index = '#SampleID'
map_nas = ['NA', 'no_data', 'unknown', '']
write_na = ''
We rarefy our data to set an even depth and allow head-to-head comparison of alpha and beta diversity. Rarefaction begins by removing samples from the table which do not have the minimum number of counts. Sequences are then drawn randomly out of a weighted pool until we reach the rarefaction depth.
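The procedure can be sketched with NumPy (a toy illustration with invented counts and a tiny depth; the actual rarefaction is performed later by QIIME's multiple_rarefactions_even_depth.py):

```python
import numpy as np

rng = np.random.RandomState(42)
rarefaction_depth = 10  # toy depth; the notebook uses 10,000

# Toy OTU counts for three samples (values are invented)
counts = {'S1': np.array([8, 3, 1]),     # 12 sequences: kept
          'S2': np.array([2, 2, 1]),     # 5 sequences: below depth, dropped
          'S3': np.array([20, 15, 5])}   # 40 sequences: kept

rarefied = {}
for sample, otus in counts.items():
    if otus.sum() < rarefaction_depth:
        continue  # samples below the minimum count are removed
    # Expand counts into a pool of individual sequences, then draw
    # without replacement so every kept sample has the same depth
    pool = np.repeat(np.arange(len(otus)), otus)
    draw = rng.choice(pool, size=rarefaction_depth, replace=False)
    rarefied[sample] = np.bincount(draw, minlength=len(otus))
```

After this step, every remaining sample contributes exactly rarefaction_depth sequences, which is what makes diversity values comparable across samples.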
In [5]:
rarefaction_depth = 10000
num_rarefactions = 10
We will split our OTU tables by body site, since the collection site on the human body plays a large role in community compositions in healthy adults [7].
In [6]:
split_field = 'BODY_HABITAT'
split_prefix = 'UBERON:'
Alpha diversity looks at the variety of species within a sample. In this notebook, we calculate alpha diversity using a variety of metrics, and append these to the mapping file. A more complete discussion of alpha diversity can be found below.
In [7]:
alpha_metrics = 'PD_whole_tree,observed_species,chao1,shannon'
Beta diversity compares ecological communities across samples. A further discussion of beta diversity can be found below. We will use QIIME to calculate beta diversity.
In [8]:
beta_metrics = "unweighted_unifrac,weighted_unifrac"
We can select the datasets which we’ll generate using this notebook. The default for each body site is to generate a data set (OTU table, mapping file and distance matrices) for all participants and all samples. We may want to limit our samples to a single sample per individual. Or, we could choose only to work with a subset of the data (see Identification of a Healthy Subset of Adults).
In [9]:
# Lists all bodysites to be analyzed
habitat_sites = ['feces', 'oral cavity', 'skin']
all_bodysites = ['fecal', 'oral', 'skin']
# Handles healthy subset OTU tables
sub_part_sites = {'fecal'}
# Handles single sample OTU tables
one_samp_sites = {'fecal', 'oral', 'skin'}
To help organize the results of this notebook, we'll start by setting up a series of directories where results can be saved. This will provide a common file structure (base_dir, working_dir, etc.) for the results. We're going to set up three primary directories here, and then nest additional directories inside.
As we set up directories, we'll make use of the check_dir function. This will create the directories we identify if they do not exist.
We need a general location to do all our analysis; this is the base_dir. All our other directories will exist within the base_dir.
In [10]:
base_dir = os.path.join(os.path.abspath('.'), 'agp_analysis')
div.check_dir(base_dir)
Some of the steps in our diversity calculations will require a phylogenetic tree. This contains information about the evolutionary relationship between OTUs which can be leveraged in calculating PD Whole Tree Diversity and UniFrac Distance. While there are multiple ways to pick OTUs, this table was generated using a reference-based technique. Therefore, we can simply download the phylogenetic tree file for the reference set. Our reference for this dataset was the Greengenes version 13_8 at 97% similarity [11]. Please refer to the Primary Processing Pipeline Notebook for more information about how OTUs are picked.
In [11]:
tree_fp = get_reference_tree()
The working directories will be used to save files we generate while cleaning the data. These may include items like downloaded OTU tables, and rarefaction instances.
In [12]:
# Sets up a directory to save intermediate files and downloads
working_dir = os.path.join(base_dir, 'intermediate_files')
div.check_dir(working_dir)
Our first analysis step is to locate the American Gut OTU tables generated by the Primary Processing Pipeline Notebook. The tables may have been generated locally, may be located in a local GitHub repository, or may be downloaded directly from GitHub.
In [13]:
# Creates a directory for unprocessed file downloads
download_dir = os.path.join(working_dir, 'downloads')
div.check_dir(download_dir)
# Sets the filepaths for downloaded files
download_otu_fp = os.path.join(download_dir, 'AG_100nt.biom')
download_map_fp = os.path.join(download_dir, 'AG_100nt.txt')
When we perform rarefaction on our data, we will generate several similarly named files. To help keep these organized, we will create a directory for each set of rarefaction files.
In [14]:
# Creates a parent directory for rarefaction instances
rare_dir = os.path.join(working_dir, 'rarefaction')
div.check_dir(rare_dir)
# Sets a pattern for the filenames of the rarefaction files
rare_pattern = 'rarefaction_%(rare_depth)i_%(rare_instance)i.biom'
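The %(name)s placeholders in this pattern are filled with a dictionary via Python's %-style string substitution. For example (the directory name is illustrative):

```python
import os

# Same pattern as in the cell above; the directory name is illustrative
rare_pattern = 'rarefaction_%(rare_depth)i_%(rare_instance)i.biom'
rare_dir = 'intermediate_files/rarefaction'

# Each (depth, instance) pair yields a distinct, predictable file path
fps = [os.path.join(rare_dir, rare_pattern) % {'rare_depth': 10000,
                                               'rare_instance': i}
       for i in range(3)]
```

This lets later cells iterate over every rarefaction instance, or select a single one, with a simple dictionary substitution.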
When we calculate the alpha diversity of our data, we will use the similarly named rarefaction tables and generate several similarly named alpha diversity files. To help keep these organized, we will create a directory for each set of alpha diversity files.
In [15]:
# Creates a parent directory for the alpha diversity files
alpha_dir = os.path.join(working_dir, 'alpha')
div.check_dir(alpha_dir)
# Sets a pattern for the filenames of alpha diversity tables
alpha_pattern = 'alpha_rarefaction_%(rare_depth)i_%(rare_instance)i.txt'
At some point in our analysis, it will become necessary to split our OTU table by body site. The directory we create next will specify the location, while the file patterns will allow us to iterate or select one of many possible files using a simple string substitution.
In [16]:
# Creates a directory for splitting the OTU table by bodysite
split_raw_dir = os.path.join(working_dir, 'split_by_bodysite_raw')
div.check_dir(split_raw_dir)
split_rare_dir = os.path.join(working_dir, 'split_by_bodysite_rare')
div.check_dir(split_rare_dir)
# Sets a pattern for filenames in the split directory
split_fn = 'AGP_100nt%(rare_suffix)s__%(split_field)s_%(split_prefix)s%(split_group)s__.%(extension)s'
Finally, we need to set up the directories and filenames where we will save our results.
We’ll start by creating an output directory where our results should be located.
In [17]:
# Sets up a directory where all results should be saved
data_dir = os.path.join(base_dir, 'sample_data')
if not data_download:
    div.check_dir(data_dir)
Next, we'll create directories specific to the body sites for which we have samples. Each of these directories will store one of the result sets from this notebook; the folders are described above.
In [18]:
# Sets up an all sample directory
all_dir = os.path.join(data_dir, 'all')
if not data_download:
    div.check_dir(all_dir)

# Creates body-site specific directories
if not data_download:
    for site in all_bodysites:
        div.check_dir(os.path.join(data_dir, site))
We will set up file name patterns for the output directories where we'll save our final files. The %(site)s placeholder will allow us to insert the body site into each output directory path. We expect the final file paths used with these directories to be located in the data_dir.
In [19]:
# Sets up the file path pattern for the possible sets of sample at each body
# site
asab_pattern = os.path.join(data_dir, '%(site)s/all_participants_all_samples')
assb_pattern = os.path.join(data_dir, '%(site)s/all_participants_one_sample')
ssab_pattern = os.path.join(data_dir, '%(site)s/sub_participants_all_samples')
sssb_pattern = os.path.join(data_dir, '%(site)s/sub_participants_one_sample')
# Checks the file paths
if not data_download:
    for site in all_bodysites:
        site_blank = {'site': site}
        div.check_dir(asab_pattern % site_blank)
        if site in one_samp_sites:
            div.check_dir(assb_pattern % site_blank)
        if site in sub_part_sites:
            div.check_dir(ssab_pattern % site_blank)
        if site in sub_part_sites and site in one_samp_sites:
            div.check_dir(sssb_pattern % site_blank)
Finally, we will set up patterns for the file names we'll put in the output directories. We can combine these with our directories to get our final file paths. The AGP_100nt prefix indicates that the data comes from the American Gut Project and that sequences were trimmed to 100 nucleotides in length. The remaining blanks describe the rarefaction depth and body site.
In [20]:
otu_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.biom'
map_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt'
dm_fn = ['%(metric)s_AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt'
% {'metric': m, 'rare_depth': '%(rare_depth)s',
'spacer': '%(spacer)s', 'site': '%(site)s'}
for m in beta_metrics.split(',')]
sin_fn = 'single_samples.txt'
sub_fn = 'subset_samples.txt'
There are a set of blanks which are filled in for each file name. Some of these blanks will follow consistent patterns, which we can set before the files are used.
In [21]:
map_extension = 'txt'
otu_extension = 'biom'
site_pad = '_'
all_ = ''
rare_suffix = '_even10k'
raw_suffix = ''
We can set up the values to fill in the file blanks, using the body site names and the file paths we've provided. We'll generate some of these using substitutions.
In cases where the body site must be specified, this is handled in two ways. When splitting data by body site, the body site will be given by the split_group, and we'll use nan as a placeholder. After the tables are split, we generate body site-specific information as a list.
In [22]:
last_rare = {'rare_depth': rarefaction_depth,
'rare_instance': num_rarefactions - 1}
all_raw_blanks = {'rare_depth': raw_suffix,
'spacer': all_,
'site': all_}
all_rare_blanks = {'rare_depth': rare_suffix,
'spacer': all_,
'site': all_}
otu_raw_split_blanks = {'rare_suffix': raw_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': otu_extension}
map_raw_split_blanks = {'rare_suffix': raw_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': map_extension}
otu_rare_split_blanks = {'rare_suffix': rare_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': otu_extension}
map_rare_split_blanks = {'rare_suffix': rare_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': map_extension}
raw_sample_blanks = {'site': np.nan,
'rare_depth': raw_suffix,
'spacer': site_pad}
rare_sample_blanks = {'site': np.nan,
'rare_depth': rare_suffix,
'spacer': site_pad}
We can now start our analysis by downloading the American Gut mapping file and OTU tables. If they're not already located in the download_dir, they will be downloaded to this location. If the files exist, new versions will be downloaded only if overwrite is set to True.
Note that this step requires an internet connection.
In [23]:
if data_download:
    !curl -OL ftp://ftp.microbio.me/pub/American-Gut-precomputed/r1-15/sample_data.tgz
    !tar -xzf sample_data.tgz
    shutil.move('./sample_data', base_dir)
In [24]:
# Gets the biom file
if not data_download and (not os.path.exists(download_otu_fp) or overwrite):
    # Downloads the compressed biom file
    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.biom?raw=true
    # Moves the biom file to its final location
    shutil.move(os.path.join(os.path.abspath('.'), 'AG_100nt.biom?raw=true'), download_otu_fp)

# Gets the mapping file
if not data_download and (not os.path.exists(download_map_fp) or overwrite):
    # Downloads the mapping files
    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.txt?raw=true
    # Moves the file to the download file path
    shutil.move(os.path.join('.', 'AG_100nt.txt?raw=true'), download_map_fp)
We will start by adjusting the metadata. This will correct errors and provide a uniform format for derived columns we may wish to use later or in downstream analyses.
In [25]:
# Loads the mapping file
raw_map = pd.read_csv(download_map_fp,
                      sep=txt_delim,
                      na_values=map_nas,
                      index_col=False,
                      dtype={map_index: str},
                      low_memory=False)
raw_map.index = raw_map[map_index]
del raw_map[map_index]

# Loads the OTU table
raw_otu = biom.load_table(download_otu_fp)

# Filters the raw map to remove any samples that are not present in the biom table
raw_map = raw_map.loc[raw_otu.ids()]
There is also a set of columns which are not included in the map, but may be useful for downstream analyses. These include age binned by decade (AGE_CAT). While there are QIIME analyses which can handle continuous metadata, binning can help reduce some of the noise.
Here, we bin age by decade, with the exception of people under the age of 20. The gut develops in the first two years of life, and the guts of young children are significantly different than older children or adults [12, 13]. We will also combine individuals over the age of 70 into their own category, due to the low sample counts of people over 80 as of round 14 (n < 20).
In [26]:
# Bins age by decade (with the exception of young children)
def categorize_age(x):
    if np.isnan(x):
        return x
    elif x < 3:
        return "baby"
    elif x < 13:
        return "child"
    elif x < 20:
        return "teen"
    elif x < 30:
        return "20s"
    elif x < 40:
        return "30s"
    elif x < 50:
        return "40s"
    elif x < 60:
        return "50s"
    elif x < 70:
        return "60s"
    else:
        return "70+"

raw_map['AGE_CAT'] = raw_map.AGE.apply(categorize_age)
In addition to considering the frequency with which people use alcohol (Never, Rarely, Occasionally, Regularly, or Daily), it may be helpful to simply look for an effect associated with any alcohol consumption.
In [27]:
def categorize_etoh(x):
    if x == 'Never':
        return "No"
    elif isinstance(x, str):
        return "Yes"
    elif np.isnan(x):
        return x

raw_map['ALCOHOL_CONSUMPTION'] = raw_map.ALCOHOL_FREQUENCY.apply(categorize_etoh)
Body Mass Index (BMI) can be stratified into categories which give an approximate idea of body shape. It is worth noting that these stratifications do not hold well for growing children, where the BMI qualification is based on age and gender.
In [28]:
# Categorizes the BMI into groups
def categorize_bmi(x):
    if np.isnan(x):
        return x
    elif x < 18.5:
        return "Underweight"
    elif x < 25:
        return "Normal"
    elif x < 30:
        return "Overweight"
    else:
        return "Obese"

raw_map['BMI_CAT'] = raw_map.BMI.apply(categorize_bmi)
American Gut samples have been collected since December of 2012. To look for patterns associated with the time of year samples were collected, we bin this date information into month and season.
We currently define our seasons according to the calendar in the Northern Hemisphere, because as of round fifteen, 99% of our samples were collected north of the equator. Additionally, rather than defining our seasons by the solar calendar, we have elected to use the first day of the month the solstice or equinox occurs in as the start of our season. So, while Winter technically begins on December 20th or 21st, according to the solar calendar, we consider December 1st as the first day of our Winter.
In [29]:
def convert_date(x):
    """Converts strings to a date object"""
    if isinstance(x, str) and "/" in x:
        try:
            return pd.tseries.tools.to_datetime(x)
        except:
            return np.nan
    else:
        return x

raw_map.COLLECTION_DATE = raw_map.COLLECTION_DATE.apply(convert_date)
In [30]:
# Categorizes data by collection month and collection season
month_map = {-1: [np.nan, np.nan],
             np.nan: [np.nan, np.nan],
             1: ['January', 'Winter'],
             2: ['February', 'Winter'],
             3: ['March', 'Spring'],
             4: ['April', 'Spring'],
             5: ['May', 'Spring'],
             6: ['June', 'Summer'],
             7: ['July', 'Summer'],
             8: ['August', 'Summer'],
             9: ['September', 'Fall'],
             10: ['October', 'Fall'],
             11: ['November', 'Fall'],
             12: ['December', 'Winter']}

def map_month(x):
    try:
        return month_map[x.month][0]
    except:
        return np.nan

def map_season(x):
    try:
        return month_map[x.month][1]
    except:
        return np.nan

# Maps the data as a month
raw_map['COLLECTION_MONTH'] = raw_map.COLLECTION_DATE.apply(map_month)

# Maps the data as a season
raw_map['COLLECTION_SEASON'] = raw_map.COLLECTION_DATE.apply(map_season)
The American Gut Project includes some geographical information about where samples were collected. While the data may be leveraged as-is, it can also be helpful to clean up the data.
We'll start by checking for uniform country mapping. This will allow us to combine samples from countries or groups of countries with multiple descriptive names, such as Great Britain and the United Kingdom.
In [31]:
def map_countries(x):
    return geo.country_map.get(x, x)

raw_map.COUNTRY = raw_map.COUNTRY.apply(map_countries)
In Rounds 1-15, participants come predominantly from the US, UK, Belgium and Canada. Since the areas occupied by Belgium and the UK are smaller than many states in the contiguous US (including some of the most represented states in the American Gut), we have elected to only consider the STATE field for American and Canadian samples.
This does not alter the information provided by the zip (postal) code or Longitude/Latitude information.
In [32]:
# Removes state information for any state not in the US or Canada
# (This may change as additional countries are added.)
countries = raw_map.groupby('COUNTRY').count().STATE.index.values
for country in countries:
    if country not in {'GAZ:United States of America', 'GAZ:Canada'}:
        raw_map.loc[raw_map.COUNTRY == country, 'STATE'] = np.nan

# Handles regional mapping, cleaning up states so that only American and
# Canadian states are included
def check_state(x):
    if isinstance(x, str) and x.upper() in geo.us_state_map:
        return geo.us_state_map[x.upper()]
    elif isinstance(x, str) and x.upper() in geo.canadian_map_english:
        return geo.canadian_map_english[x.upper()]
    else:
        return np.nan

raw_map['STATE'] = raw_map.STATE.apply(check_state)
We may also choose to use predefined regions to further bin our location data, and allow us to look for social or economic trends. To this end, we can apply regions defined by the US Census Bureau and Economic Regions defined by the US Bureau of Economic Analysis, which is part of the Department of Commerce.
In [33]:
# Bins data by census region
def census_f(x):
    if isinstance(x, str) and x in geo.regions_by_state:
        return geo.regions_by_state[x]['Census_1']
    else:
        return np.nan

raw_map['CENSUS_REGION'] = raw_map.STATE.apply(census_f)

# Bins data by economic region
def economic_f(x):
    if isinstance(x, str) and x in geo.regions_by_state:
        return geo.regions_by_state[x]['Economic']
    else:
        return np.nan

raw_map['ECONOMIC_REGION'] = raw_map.STATE.apply(economic_f)
As of round 15, there are 36 participants who report sleeping less than five hours a night. To allow for a larger sample size, we will pool these with the individuals who report sleeping between five and six hours a night, to create a group who report sleeping less than six hours a night.
In [34]:
raw_map.loc[raw_map.SLEEP_DURATION == 'Less than 5 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'
raw_map.loc[raw_map.SLEEP_DURATION == '5-6 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'
Certain health states are known to influence the microbiome in extreme ways. For some analyses we will do later, it may be useful to limit the noise associated with these conditions to allow us to look for new patterns. We have identified five metadata categories which we will use to limit our “healthy” subset.
First, we limit based on age for several reasons. We chose to omit anyone under the age of twenty. The microbiome of very young children is not yet stable, and differs greatly from that of adults [12, 13]. Additionally, BMI limits are not easily assigned in people who are still growing. Without stratifying by gender, we assumed that growth will be complete in most people by the age of 20, and set our limit there. The limit at seventy was based on the number of individuals over that age, and on differences in the microbiome seen in older individuals [14, 15].
We also used Body Mass Index as an exclusion criterion, considering only people in the "normal" and "overweight" categories (BMI 18.5 - 30). It has been suggested that obesity changes the gut microbiome, although the effect is not consistent across all studies [16]. Additionally, we noticed alterations in our sample of underweight individuals.
Recent antibiotic use decreases alpha diversity and affects the microbiome [17]. We chose to define "recent" as any time within the last year. We also excluded anyone who reported having Inflammatory Bowel Disease [16], Type I Diabetes [18-21], or Type II Diabetes [22], since all three conditions are known to affect the microbiome.
In [35]:
# Creates the subset if it's not already in the mapping file
if 'SUBSET' not in raw_map.columns:
    subset_f = {'AGE': lambda x: 19 < x < 70 and not np.isnan(x),
                'DIABETES': lambda x: x == 'I do not have diabetes',
                'IBD': lambda x: x == 'I do not have IBD',
                'ANTIBIOTIC_SELECT': lambda x: x == 'Not in the last year',
                'BMI': lambda x: 18.5 <= x < 30 and not np.isnan(x)}

    # Determines which samples meet the requirements of the categories
    new_bin = {}
    for cat, f in subset_f.iteritems():
        new_bin[cat] = raw_map[cat].apply(f)

    # Adds a column to the current dataframe to look at the subset
    bin_series = pd.DataFrame(new_bin).all(1)
    bin_series.name = 'SUBSET'
    raw_map = raw_map.join(bin_series)
We will start by rarefying the whole body table using the rarefaction parameters we set earlier. Rarefaction filters out samples below a certain sequencing depth; sequences are then randomly drawn from the remaining samples so that all samples have an even depth. To control for bias which might occur with a single, random subsampling of the data, we use multiple rounds of rarefaction to more accurately estimate the alpha diversity.
Rarefaction is important to make intra-sample diversity (alpha diversity) comparisons possible. Below is a panel from Figure 1 of Human Gut Microbiome and Risk of Colorectal Cancer [23]. The figure compares Shannon Diversity between individuals with colorectal cancer (n=47, red circles) and healthy controls (n=94, empty triangles) over several rarefaction depths, or sequence counts per sample.
The figure also illustrates the importance of even sampling depth. If a control sample with 500 sequences per sample were compared with a cancer sample at a depth of 2500 sequences per sample, the cancer sample would appear more diverse. Comparisons at the same depth reveal the true pattern in the data: cancer samples are less diverse than controls.
To perform multiple rarefactions, we will use the QIIME script, multiple_rarefactions_even_depth.py.
In [36]:
if not data_download and (not os.path.exists(os.path.join(rare_dir, rare_pattern) % last_rare) or overwrite):
    !multiple_rarefactions_even_depth.py -i $download_otu_fp -o $rare_dir -n $num_rarefactions -d $rarefaction_depth --lineages_included
We will use our rarefaction tables to calculate the alpha diversity associated with each rarefaction. Alpha diversity is a measure of intra sample diversity. Imagine that we could put up an imaginary wall around a 100ft x 100ft x 10 ft box in Yellowstone National Park, trapping all the vertebrate animals in that area for a short period of time. Imagine that we then made a very careful list (or took photographs) of the area so that we could document all the life we found in the area. We could count all the different types of animals we found in that area. This would be one measure of alpha diversity.
Suppose that instead of considering each type of animal to be equally similar, we want to include the evolutionary relationships between the animals. If our area contained a mouse, a squirrel and a rabbit, we might say these animals are more similar (and therefore less diverse) than a mouse, a squirrel, and a sagebrush lizard found in the same area. Even though we've found three species in each case, the third species being a reptile makes the community more diverse than if it were another rodent.
A diversity metric which accounts for shared evolutionary history between species is called a phylogenetic metric. This often uses a phylogenetic tree to provide information about that shared history. PD Whole Tree Diversity is a commonly used phylogenetic alpha diversity metric in microbiome research [8]. A taxonomic metric assumes all species are equally different. Common taxonomic metrics for alpha diversity used in microbiome research include Observed Species Diversity and Chao1 Diversity [9, 10].
Depending on what information we're looking for, we might also want to weight our diversity by the number of individuals belonging to each species. So, if in our little area of Yellowstone, 90% of the animals we see are mice, while 5% are rabbits and 5% are trout, we would consider this less diverse than if 40% of the animals were mice, 30% were rabbits and 30% were trout. A metric which takes into account the counts of each species is a quantitative metric, while a qualitative metric looks only at the presence or absence of a species.
While alpha diversity is calculated completely independently for each sample, comparisons of alpha diversity may provide clues about environmental changes. For example, pollution or an algal bloom may be associated with lower alpha diversity, and indicate a potential change in the health of the ecosystem. We’ll start our work with alpha diversity by calculating the diversity for our rarefied American Gut tables using the four metrics we selected in the alpha diversity parameters: the phylogenetic PD Whole Tree Diversity, and the taxonomic metrics Observed Species Diversity, Chao1 Diversity and Shannon Diversity. PD Whole Tree, Observed Species and Chao1 are qualitative (richness-based) metrics, while Shannon Diversity is a quantitative metric that also accounts for abundance.
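To make the qualitative/quantitative distinction concrete, here is a small sketch using made-up counts for the Yellowstone example above. It compares observed species (a qualitative metric) with Shannon diversity (a quantitative metric); the counts and the base-2 logarithm are illustrative choices, not part of the pipeline itself.

```python
import numpy as np

def shannon(counts):
    """Shannon diversity (base 2) from a vector of species counts."""
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Two hypothetical communities, each with three species observed
skewed = np.array([90., 5., 5.])   # 90% mice, 5% rabbits, 5% trout
even = np.array([40., 30., 30.])   # 40% mice, 30% rabbits, 30% trout

# A qualitative metric (observed species) cannot tell them apart...
print((skewed > 0).sum(), (even > 0).sum())  # 3 3

# ...but a quantitative metric like Shannon diversity can
print(round(shannon(skewed), 3), round(shannon(even), 3))  # 0.569 1.571
```

The more evenly distributed community scores higher on the quantitative metric, even though both contain exactly three species.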
In [37]:
if not data_download and (not os.path.exists(os.path.join(alpha_dir, alpha_pattern) % last_rare) or overwrite):
    !alpha_diversity.py -i $rare_dir -o $alpha_dir -m $alpha_metrics -t $tree_fp
The alpha diversity results from QIIME are loaded into the notebook. To identify the best rarefaction instance, which we'll use as our OTU table moving forward, we look for the rarefaction instance whose alpha diversity is closest to the mean alpha diversity across all instances. We define "closest" using the normalized Euclidean distance.
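As a sketch of this selection criterion (using made-up alpha diversity values; note that the element-wise square root of a squared difference reduces to an absolute value):

```python
import numpy as np

# Hypothetical alpha diversity values: rows are samples, columns are
# rarefaction instances (illustrative numbers only)
alpha = np.array([[10., 12., 11.],
                  [20., 19., 20.],
                  [30., 33., 30.]])

# Mean alpha diversity for each sample across all instances
mean_alpha = alpha.mean(axis=1, keepdims=True)

# Normalized distance of each instance from the per-sample mean
diff = np.abs(alpha - mean_alpha) / mean_alpha

# The "best" instance minimizes the average distance across samples
best = diff.mean(axis=0).argmin()
print(best)  # 2
```

Here the third instance (index 2) sits closest to the per-sample means, so it would be carried forward as the representative table.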
In [38]:
if not data_download:
    # Preallocates an output object for alpha diversity
    alpha_rounds = {m: {} for m in alpha_metrics.split(',')}
    div_metric = alpha_metrics.split(',')[0]
    # Loops through the rarefaction instances
    for ri in range(num_rarefactions):
        a_file_blanks = {'rare_depth': rarefaction_depth,
                         'rare_instance': ri}
        # Sets the alpha diversity file path
        alpha_fp = os.path.join(alpha_dir, alpha_pattern) % a_file_blanks
        # Loads the alpha diversity table
        alpha = pd.read_csv(alpha_fp,
                            sep=txt_delim,
                            index_col=False)
        alpha.index = alpha['Unnamed: 0']
        del alpha['Unnamed: 0']
        # Extracts the alpha diversity metrics
        for col in alpha_rounds:
            alpha_rounds[col]['%i' % ri] = alpha[col]
            alpha_rounds[col]['%i' % ri].name = '%i' % ri
In [39]:
if not data_download:
    # Compiles the alpha diversity results into a single table
    alpha_df = pd.DataFrame({'%s_mean' % metric: pd.DataFrame(alpha_rounds[metric]).mean(1)
                             for metric in alpha_metrics.split(',')})
    # Adds the alpha diversity results to the rarefied table
    rare_map = raw_map.copy()
    rare_map = rare_map.join(alpha_df)
    rare_check = np.isnan(rare_map['%s_mean' % div_metric]) == False
    rare_map = rare_map.loc[rare_check]
    # Draws the data associated with each of the alpha diversity rounds
    all_rounds = pd.DataFrame(alpha_rounds[div_metric])
    # Lines up the data so the indices match (as a precaution)
    all_rounds = all_rounds.sort_index()
    alpha_df = alpha_df.sort_index()
    # Calculates the normalized distance between each round and the mean
    mean_rounds = ([alpha_df['%s_mean' % div_metric].values] *
                   np.ones((num_rarefactions, 1))).transpose()
    diff = np.sqrt(np.square(all_rounds.values - mean_rounds)) / mean_rounds
    # Determines the round with the minimum distance from the mean
    round_labels = np.arange(0, num_rarefactions)
    round_avg = diff.mean(0)
    best_rarefaction = round_labels[round_avg == min(round_avg)][0]
    best_blanks = {'rare_depth': rarefaction_depth,
                   'rare_instance': best_rarefaction}
We’ll save the whole body tables in their own directory, and the modified mapping files. We’ll also copy the raw OTU table and the rarefaction instance closest to the mean alpha diversity.
In [40]:
# Saves the unrarefied mapping file
if not data_download:
    raw_map.to_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,
                   sep=txt_delim,
                   na_rep=write_na,
                   index_label=map_index)
    # Saves the rarefied mapping file
    rare_map.to_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,
                    sep=txt_delim,
                    na_rep=write_na,
                    index_label=map_index)
    # Copies the raw OTU table
    shutil.copy2(download_otu_fp,
                 os.path.join(all_dir, otu_fn) % all_raw_blanks)
    # Copies the rarefied OTU table
    shutil.copy2(os.path.join(rare_dir, rare_pattern) % best_blanks,
                 os.path.join(all_dir, otu_fn) % all_rare_blanks)
# Reloads the saved mapping files so the same objects are available
# whether the data was generated here or downloaded
raw_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,
                      sep=txt_delim,
                      na_values=map_nas,
                      index_col=False,
                      low_memory=False,
                      dtype={map_index: str})
raw_map.index = raw_map[map_index]
del raw_map[map_index]
rare_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,
                       sep=txt_delim,
                       na_values=map_nas,
                       index_col=False,
                       low_memory=False,
                       dtype={map_index: str})
rare_map.index = rare_map[map_index]
del rare_map[map_index]
Beta diversity allows us to make comparisons between samples and environments. Let’s go back to our 100 ft x 100 ft x 10 ft cube in Yellowstone where we cataloged all the vertebrates. Let’s imagine that we’ve set up the same type of cube in New York City’s Central Park and cataloged all the vertebrates in that area as well.
We could compare the two communities by seeing how many species are shared between the two, or by computing some measure that approximates the fraction of species shared. We might expect some overlap: depending on where we selected our regions, it would be unsurprising to encounter chipmunks in both Central Park and Yellowstone National Park. However, there should also be some differences. Unless our Central Park location includes the zoo, it’s unlikely we’d find a buffalo in New York City!
If we use a taxonomic metric, based only on the species we find in the two locations, we might get very little overlap. While we might expect to find a squirrel in both Central Park and Yellowstone, the animals might be members of different genera! New York is home to the Eastern Grey Squirrel, Sciurus carolinensis, while we might find the American Red Squirrel, Tamiasciurus hudsonicus, in Yellowstone [24, 25]. In this case, a phylogenetic metric, which can account for some similarity between the two species of squirrels, may serve us much better.
When we compare microbial communities for beta diversity, we frequently select a phylogenetic metric called UniFrac distance [5]. This metric uses a phylogenetic tree, and determines what fraction of the tree is not shared between two communities.
If we consider only the presence and absence of each OTU in the samples, we have a qualitative metric, unweighted UniFrac distance. Unweighted UniFrac distance may take on values between 0 (everything the same) and 1 (everything different). Weighted UniFrac distance takes into account the abundance of the OTUs, and can take on values greater than 1.
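As a sketch of the idea behind unweighted UniFrac (using a tiny hand-built tree with illustrative branch lengths, not QIIME's implementation), the distance is the fraction of branch length leading only to tips observed in one of the two samples:

```python
# Tiny hand-built tree, written as (tips below the branch, branch length).
# The topology and lengths are illustrative only.
#        root
#       /    \
#      b1     \
#     /  \     \
#    A    B     C
branches = [
    ({'A', 'B'}, 1.0),   # internal branch above A and B
    ({'A'}, 0.5),
    ({'B'}, 0.5),
    ({'C'}, 1.5),
]

def unweighted_unifrac(obs_x, obs_y):
    """Fraction of branch length leading to tips seen in only one sample."""
    shared = unique = 0.0
    for tips, length in branches:
        in_x = bool(tips & obs_x)
        in_y = bool(tips & obs_y)
        if in_x and in_y:
            shared += length
        elif in_x or in_y:
            unique += length
    return unique / (shared + unique)

# Sample 1 contains tips A and B; sample 2 contains B and C
print(unweighted_unifrac({'A', 'B'}, {'B', 'C'}))  # 2.0 / 3.5
```

Two identical communities share every branch and score 0; two communities with no shared branches score 1, matching the bounds described above.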
The UniFrac distance for each pair of samples is arranged into a distance matrix. We can visualize the distance matrix using many techniques, such as PCoA plots in EMPeror, or UPGMA trees like the one shown in the figure below [2].
Since UniFrac distance is calculated for each sample pair in the table, this is one of the most computationally expensive steps we will perform. However, once the UniFrac distance has been calculated for all of our samples, we can simply filter the table to focus on the samples we want. We can leverage the QIIME script, beta_diversity.py, to perform our analysis.
In [41]:
# Sets up the filepaths for the all sample rarefied table
all_otu_rare_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks
check_dm_fp = np.array([os.path.exists(os.path.join(all_dir, fn_)
                                       % all_rare_blanks)
                        for fn_ in dm_fn])
# Calculates the beta diversity
if not data_download and (not check_dm_fp.all() or overwrite):
    !beta_diversity.py -i $all_otu_rare_fp -m $beta_metrics -t $tree_fp -o $all_dir
Now that we’ve generated alpha and beta diversity results for all the body sites, we can start filtering the results. Body site has one of the largest impacts on the microbiome in adult humans [7]. As a result, many analyses will focus on a single body site, often fecal samples.
We’ll use the QIIME script, split_otu_table.py to split our rarefied and unrarefied OTU tables by body site. We’ll put the output files in intermediate directories, and then move them to the appropriate locations.
In [42]:
# Sets the raw location file names
all_raw_otu_fp = os.path.join(all_dir, otu_fn) % all_raw_blanks
all_raw_map_fp = os.path.join(all_dir, map_fn) % all_raw_blanks
# Sets the rarefied location file names
all_rare_otu_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks
all_rare_map_fp = os.path.join(all_dir, map_fn) % all_rare_blanks
In [43]:
# Checks that the raw and rarefied bodysite split tables exist
raw_split_check = np.array([])
rare_split_check = np.array([])
for site in all_bodysites:
    # Checks the unrarefied splits exist
    otu_raw_split_blanks['split_group'] = site
    map_raw_split_blanks['split_group'] = site
    raw_otu_exist = os.path.exists(os.path.join(split_raw_dir, split_fn)
                                   % otu_raw_split_blanks)
    raw_map_exist = os.path.exists(os.path.join(split_raw_dir, split_fn)
                                   % map_raw_split_blanks)
    raw_split_check = np.hstack((raw_split_check, raw_otu_exist, raw_map_exist))
    # Checks the rarefied splits exist
    otu_rare_split_blanks['split_group'] = site
    map_rare_split_blanks['split_group'] = site
    rare_otu_exist = os.path.exists(os.path.join(split_rare_dir, split_fn)
                                    % otu_rare_split_blanks)
    rare_map_exist = os.path.exists(os.path.join(split_rare_dir, split_fn)
                                    % map_rare_split_blanks)
    rare_split_check = np.hstack((rare_split_check, rare_otu_exist, rare_map_exist))
In [44]:
# Splits the OTU tables and mapping files by bodysite, unless all the
# split files already exist
if not data_download and (not raw_split_check.all() or overwrite):
    !split_otu_table.py -i $all_raw_otu_fp -m $all_raw_map_fp -f BODY_HABITAT -o $split_raw_dir
if not data_download and (not rare_split_check.all() or overwrite):
    !split_otu_table.py -i $all_rare_otu_fp -m $all_rare_map_fp -f BODY_HABITAT -o $split_rare_dir
We’ll move and rename our split files to their final location.
In [45]:
# Copies the files to their correct final folder
if not data_download:
    for idx, h_site in enumerate(habitat_sites):
        otu_raw_split_blanks['split_group'] = h_site
        map_raw_split_blanks['split_group'] = h_site
        otu_rare_split_blanks['split_group'] = h_site
        map_rare_split_blanks['split_group'] = h_site
        raw_sample_blanks['site'] = all_bodysites[idx]
        rare_sample_blanks['site'] = all_bodysites[idx]
        # Copies the unrarefied mapping file
        shutil.copy2(os.path.join(split_raw_dir, split_fn)
                     % map_raw_split_blanks,
                     os.path.join(asab_pattern, map_fn)
                     % raw_sample_blanks)
        # Copies the unrarefied OTU table
        shutil.copy2(os.path.join(split_raw_dir, split_fn)
                     % otu_raw_split_blanks,
                     os.path.join(asab_pattern, otu_fn)
                     % raw_sample_blanks)
        # Copies the rarefied mapping file
        shutil.copy2(os.path.join(split_rare_dir, split_fn)
                     % map_rare_split_blanks,
                     os.path.join(asab_pattern, map_fn)
                     % rare_sample_blanks)
        # Copies the rarefied OTU table
        shutil.copy2(os.path.join(split_rare_dir, split_fn)
                     % otu_rare_split_blanks,
                     os.path.join(asab_pattern, otu_fn)
                     % rare_sample_blanks)
To get our distance matrices for each OTU table, we’ll use the QIIME script, filter_distance_matrix.py. We will use the mapping file for each body site to filter the distance matrices.
In [46]:
if not data_download:
    for idx, a_site in enumerate(all_bodysites):
        rare_sample_blanks['site'] = a_site
        # Gets the rarefied mapping file for the site
        map_in = os.path.join(asab_pattern, map_fn) % rare_sample_blanks
        for fn_ in dm_fn:
            dm_in = os.path.join(all_dir, fn_) % all_rare_blanks
            dm_out = os.path.join(asab_pattern, fn_) % rare_sample_blanks
            if not os.path.exists(dm_out) or overwrite:
                !filter_distance_matrix.py -i $dm_in -o $dm_out --sample_id_fp $map_in
For some analyses we will choose to perform, it can be useful to work with a single sample for each participant at each body site. Many statistical tests assume sample independence. The microbiome among healthy adults is relatively stable across multiple samples within an individual; there is a higher correlation between your personal samples collected across several days than there is between your sample and another person’s sample collected at the same time [7].
We’re going to start defining our single sample data sets by writing a function which will allow us to randomly select a sample from each individual. This will take a pandas data frame as an input. We’ll group the data so we can look at each individual (given by the HOST_SUBJECT_ID), and then randomly select one sample id per individual.
In [47]:
def identify_single_samples(map_):
    """Selects a single sample for each participant

    Parameters
    ----------
    map_ : pandas DataFrame
        A mapping file for our set of samples. A single body site should be
        used with human samples.

    Returns
    -------
    single_ids : ndarray
        A list of ids which represent a single sample per individual
    """
    # Identifies a single sample per individual
    single_ids = np.hstack([np.random.choice(np.array(ids, dtype=str), 1)
                            for indv, ids in
                            map_.groupby('HOST_SUBJECT_ID').groups.items()])
    return single_ids
We’ll apply our filtering function at each body site. To do this, we’ll define another function, filter_dataset, which performs the filtering using any selection function we supply, like identify_single_samples.
filter_dataset will first identify the samples to keep using the function we pass in. It will then use that list of samples to filter the rarefied and unrarefied (raw) mapping files. We will leverage the biom subset-table command and the QIIME script filter_distance_matrix.py to filter our OTU tables and distance matrices.
In [48]:
def filter_dataset(filter_fun, site, dir_in, dir_out, ids_fp):
    """Filters a data set to create a subset

    Parameters
    ----------
    filter_fun : function
        A function which takes a pandas map and returns a list of sample
        ids.
    site : str
        The body site for which the samples are being generated.
    dir_in : str
        The directory in which the input analysis files are located. The
        directory and files are assumed to exist.
    dir_out : str
        The directory where the filtered files should be put. The directory
        must exist.
    ids_fp : str
        The filepath where the list of ids in the subset is located.

    Returns
    -------
    There are no explicit python returns. Rarefied and unrarefied OTU tables,
    and their corresponding mapping files (the rarefied file includes alpha
    diversity) as well as distance matrices for all distance metrics used
    are saved in `dir_out`.
    """
    rare_sample_blanks['site'] = site
    raw_sample_blanks['site'] = site
    # Sets up the file names for the original files
    rare_map_in_fp = os.path.join(dir_in, map_fn) % rare_sample_blanks
    rare_otu_in_fp = os.path.join(dir_in, otu_fn) % rare_sample_blanks
    raw_map_in_fp = os.path.join(dir_in, map_fn) % raw_sample_blanks
    raw_otu_in_fp = os.path.join(dir_in, otu_fn) % raw_sample_blanks
    # Sets up the file names for the filtered output files
    rare_map_out_fp = os.path.join(dir_out, map_fn) % rare_sample_blanks
    rare_otu_out_fp = os.path.join(dir_out, otu_fn) % rare_sample_blanks
    raw_map_out_fp = os.path.join(dir_out, map_fn) % raw_sample_blanks
    raw_otu_out_fp = os.path.join(dir_out, otu_fn) % raw_sample_blanks
    # Checks if the sample id list and filtered rarefied map already exist
    if not os.path.exists(ids_fp) or not os.path.exists(rare_map_out_fp) or overwrite:
        # Reads in the rarefied mapping file
        rare_map_in = pd.read_csv(rare_map_in_fp,
                                  sep=txt_delim,
                                  na_values=map_nas,
                                  index_col=False,
                                  dtype={map_index: str})
        rare_map_in.index = rare_map_in[map_index]
        del rare_map_in[map_index]
        # Identifies the sample ids
        filt_ids = filter_fun(rare_map_in)
        rare_map_out = rare_map_in.loc[filt_ids]
        # Saves the list of sample ids
        ids_file = open(ids_fp, 'w')
        ids_file.write('\n'.join(list(filt_ids)))
        ids_file.close()
        # Saves the rarefied mapping file
        rare_map_out.to_csv(rare_map_out_fp,
                            sep=txt_delim,
                            na_rep=write_na,
                            index_label=map_index)
    else:
        ids_file = open(ids_fp, 'r')
        filt_ids = ids_file.read().split('\n')
        ids_file.close()
    if not os.path.exists(raw_map_out_fp) or overwrite:
        raw_map_in = pd.read_csv(raw_map_in_fp,
                                 sep=txt_delim,
                                 na_values=map_nas,
                                 index_col=False,
                                 dtype={map_index: str})
        raw_map_in.index = raw_map_in[map_index]
        del raw_map_in[map_index]
        raw_map_out = raw_map_in.loc[filt_ids]
        raw_map_out.to_csv(raw_map_out_fp,
                           sep=txt_delim,
                           na_rep=write_na,
                           index_label=map_index)
    # Filters the OTU tables down to the subset of samples
    if not os.path.exists(rare_otu_out_fp) or overwrite:
        !biom subset-table -i $rare_otu_in_fp -o $rare_otu_out_fp -a sample -s $ids_fp
    if not os.path.exists(raw_otu_out_fp) or overwrite:
        !biom subset-table -i $raw_otu_in_fp -o $raw_otu_out_fp -a sample -s $ids_fp
    # Filters the distance matrices
    for dm_ in dm_fn:
        dm_in_fp = os.path.join(dir_in, dm_) % rare_sample_blanks
        dm_out_fp = os.path.join(dir_out, dm_) % rare_sample_blanks
        if not os.path.exists(dm_out_fp) or overwrite:
            !filter_distance_matrix.py -i $dm_in_fp -o $dm_out_fp --sample_id_fp $ids_fp
Let's apply our two functions to identify a single sample per individual at each site.
In [49]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site where a single sample set should not be generated
    if a_site not in one_samp_sites:
        continue
    filter_dataset(filter_fun=identify_single_samples, site=a_site,
                   dir_in=asab_pattern, dir_out=assb_pattern,
                   ids_fp=os.path.join(assb_pattern, sin_fn) % {'site': a_site})
Finally, we may wish to have a healthy subset of individuals for certain analyses. The criteria we’ve used to define the healthy subset are described above.
We're going to define a quick function that makes use of the SUBSET column in the mapping file.
In [50]:
def identify_subset(map_):
    return map_.loc[map_.SUBSET == True].index.values
Now, we’ll use essentially the same pipeline we leveraged for filtering the single samples.
In [51]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site where the healthy subset criteria should not be applied
    if a_site not in sub_part_sites:
        continue
    filter_dataset(identify_subset, a_site, asab_pattern, ssab_pattern,
                   os.path.join(ssab_pattern, sub_fn) % {'site': a_site})
In [52]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site without both a healthy subset and a single sample set
    if a_site not in sub_part_sites or a_site not in one_samp_sites:
        continue
    filter_dataset(identify_subset, a_site, assb_pattern, sssb_pattern,
                   os.path.join(sssb_pattern, sub_fn) % {'site': a_site})
We have generated body site specific, rarefied OTU tables, mapping files with alpha diversity, and UniFrac distance matrices for our American Gut data, as well as more focused single sample and healthy subset datasets. We can choose to create further-filtered tables, or we can take the outputs of this notebook and use them for downstream analysis.
1. Caporaso, J.G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F.D.; Costello, E.K.; Fierer, N.; Peña, A.G.; Goodrich, J.K.; Gordon, J.I.; Huttley, G.A.; Kelley, S.T.; Knights, D.; Koenig, J.E.; Ley, R.E.; Lozupone, C.A.; McDonald, D.; Muegge, B.D.; Pirrung, M.; Reeder, J.; Sevinsky, J.R.; Turnbaugh, P.J.; Walters, W.A.; Widmann, J.; Yatsunenko, T.; Zaneveld, J. and Knight, R. (2010). “QIIME allows analysis of high-throughput community sequence data.” Nature Methods. 7: 335-336.
2. Vázquez-Baeza, Y.; Pirrung, M.; Gonzalez, A.; and Knight, R. (2013). “EMPeror: a tool for visualizing high-throughput microbial community data.” Gigascience. 2: 16.
3. Langille, M.G.; Zaneveld, J.; Caporaso, J.G.; McDonald, D.; Knights, D.; Reyes, J.A.; Clemente, J.C.; Burkepile, D.E.; Vega Thurber, R.L.; Knight, R.; Beiko, R.G.; and Huttenhower, C. (2013). “Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences.” Nat Biotechnol. 31: 814-821.
4. McDonald, D.; Clemente, J.C.; Kuczynski, J.; Rideout, J.R.; Stombaugh, J.; Wendel, D.; Wilke, A.; Huse, S.; Hufnagle, J.; Meyer, F.; Knight, R.; and Caporaso, J.G. (2012). “The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome.” Gigascience. 1: 7.
5. Lozupone, C.; and Knight, R. (2005). “UniFrac: a new phylogenetic method for comparing microbial communities.” Appl Environ Microbiol. 71: 8228-8235.
6. Lozupone, C.; Lladser, M.E.; Knights, D.; Stombaugh, J.; and Knight, R. (2011). “UniFrac: an effective distance metric for microbial community composition.” ISME J. 5: 169-172.
7. The Human Microbiome Consortium. (2012). “Structure, function and diversity of the healthy human microbiome.” Nature. 486: 207-214.
8. Eckburg, P.B.; Bik, E.M.; Bernstein, C.N.; Purdom, E.; Dethlefsen, L.; Sargent, M.; Gill, S.R.; Nelson, K.E.; and Relman, D.A. (2005). “Diversity of the human intestinal microbial flora.” Science. 308: 1635-1638.
9. Chao, A. (1984). “Nonparametric estimation of the number of classes in a population.” Scandinavian J Stats. 11: 265-270.
10. Seaby, R.M.H. and Henderson, P.A. (2006). “Species Diversity and Richness 4.” http://www.pisces-conservation.com/sdrhelp/index.html.
11. McDonald, D.; Price, N.M.; Goodrich, J.; Nawrocki, E.P.; DeSantis, T.Z.; Probst, A.; Andersen, G.L.; Knight, R. and Hugenholtz, P. (2012). “An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.” ISME J. 6: 610-618.
12. Koenig, J.E.; Spor, A.; Scalfone, N.; Fricker, A.D.; Stombaugh, J.; Knight, R.; Angenent, L.T.; and Ley, R.E. (2011). “Succession of microbial consortia in the developing infant gut microbiome.” PNAS. 108 Suppl 1: 4578-4585.
13. Yatsunenko, T.; Rey, F.E.; Manary, M.J.; Trehan, I.; Dominguez-Bello, M.G.; Contreras, M.; Magris, M.; Hidalgo, G.; Baldassano, R.N.; Anokhin, A.P.; Heath, A.C.; Warner, B.; Reeder, J.; Kuczynski, J.; Caporaso, J.G.; Lozupone, C.A.; Lauber, C.; Clemente, J.C.; Knights, D.; Knight, R. and Gordon, J.I. (2012). “Human gut microbiome viewed across age and geography.” Nature. 486: 222-227.
14. Claesson, M.J.; Cusack, S.; O’Sullivan, O.; Greene-Diniz, R.; de Weerd, H.; Flannery, E.; Marchesi, J.R.; Falush, D.; Dinan, T.; Fitzgerald, G.; Stanton, C.; van Sinderen, D.; O’Connor, M.; Harnedy, N.; O’Connor, K.; Henry, C.; O’Mahony, D.; Fitzgerald, A.P.; Shanahan, F.; Twomey, C.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2011). “Composition, variability and temporal stability of the intestinal microbiota of the elderly.” PNAS. 108 Suppl 1: 4586-4591.
15. Claesson, M.J.; Jeffery, I.B.; Conde, S.; Power, S.E.; O’Connor, E.M.; Cusack, S.; Harris, H.M.; Coakley, M.; Lakshminarayanan, B.; O’Sullivan, O.; Fitzgerald, G.F.; Deane, J.; O’Connor, M.; Harnedy, N.; O’Connor, K.; O’Mahony, D.; van Sinderen, D.; Wallace, M.; Brennan, L.; Stanton, C.; Marchesi, J.R.; Fitzgerald, A.P.; Shanahan, F.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2012). “Gut microbiota composition correlates with diet and health in the elderly.” Nature. 488: 178-184.
16. Walters, W.A.; Xu, Z.; and Knight, R. (2014). “Meta-analysis of human gut microbes associated with obesity and IBD.” FEBS Letters. 588: 4223-4233.
17. Dethlefsen, L. and Relman, D.A. (2011). “Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation.” PNAS. 108 Suppl 1: 4554-4561.
18. de Goffau, M.C.; Luopajärvi, K.; Knip, M.; Ilonen, J.; Ruohtula, T.; Härkönen, T.; Orivuori, L.; Hakala, S.; Welling, G.W.; Harmsen, H.J.; and Vaarala, O. (2013). “Fecal microbiota composition differs between children with β-cell autoimmunity and those without.” Diabetes. 62: 1238-1244.
19. Giongo, A.; Gano, K.A.; Crabb, D.B.; Mukherjee, N.; Novelo, L.L.; Casella, G.; Drew, J.C.; Ilonen, J.; Knip, M.; Hyöty, H.; Veijola, R.; Simell, T.; Simell, O.; Neu, J.; Wasserfall, C.H.; Schatz, D.; Atkinson, M.A.; and Triplett, E.W. (2011). “Toward defining the autoimmune microbiome for type 1 diabetes.” ISME J. 5: 82-91.
20. Mejía-León, M.E.; Petrosino, J.F.; Ajami, N.J.; Domínguez-Bello, M.G.; and de la Barca, A.M. (2014). “Fecal microbiota imbalance in Mexican children with type 1 diabetes.” Scientific Reports. 4: 3814.
21. Murri, M.; Leiva, I.; Gomez-Zumaquero, J.M.; Tinahones, F.J.; Cardona, F.; Soriguer, F.; and Queipo-Ortuño, M.I. (2013). “Gut microbiota in children with type 1 diabetes differs from that in healthy children: a case-control study.” BMC Med. 11: 46.
22. Larsen, N.; Vogensen, F.K.; van den Berg, F.W.; Nielsen, D.S.; Andreasen, A.S.; Pedersen, B.K.; Al-Soud, W.A.; Sørensen, S.J.; Hansen, H.L. and Jakobsen, M. (2010). “Gut microbiota in human adults with type 2 diabetes differs from non-diabetic adults.” PLoS One. 5: e9085.
23. Ahn, J.; Sinha, R.; Pei, Z.; Dominianni, C.; Wu, J.; Shi, J.; Goedert, J.J.; Hayes, R.B.; and Yang, L. (2013). “Human gut microbiome and risk for colorectal cancer.” J Natl Cancer Inst. 105: 1907-1911.
24. National Park Service. (2015). “Mammal Checklist.” Yellowstone National Park.
25. Milieris, V. (2011). “Biodiversity in Central Park.” Exploring Central Park. CUNY.