In [1]:
# This cell allows us to render the notebook in the way we wish no matter where the notebook is rendered.
from IPython.core.display import HTML
css_file = 'ag.css'
HTML(open(css_file, "r").read())
Out[1]:
The goal of this notebook is to take the OTU table and mapping files generated by the American Gut Primary Processing Pipeline Notebook and manipulate them to produce a uniform set of rarefied and filtered tables which can be used in downstream analyses, such as our Power Notebook, or using QIIME, EMPeror, or PICRUSt [1 - 3]. This processing is centralized since it can be computationally expensive, given the size of the tables involved, and because it removes some error associated with the random number generation used in some steps.
This notebook will generate American Gut datasets for three body sites. For each site, this will include a dataset of all the samples collected at that location, and a set containing a single sample from each donor at that body site. These are calculated on a per body site basis, so an individual who donated two fecal samples and an oral sample will have one sample in the single fecal sample set, and one sample in the single oral set.
Additionally, we decided to create a healthy subset for fecal samples. The definition is provided below.
The final directory structure is organized as follows. Parent directories are bolded. You can choose to download the complete directory (rather than running this notebook) by setting the data_download variable.
Our folders will contain three types of files: two metadata files, two Biom-format OTU table files and two distance matrices:
*_oral and _skin may be substituted for _fecal in file names.
In addition, a text file called either single_samples.txt or subset_samples.txt may be generated. This is the list of samples in the data set. The file is primarily auxiliary, and is used by this notebook to filter Biom and distance matrix files into single sample data sets.
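A sketch of how such a sample list can be used to filter a metadata table and a distance matrix with pandas (the sample IDs and distances below are invented, not the notebook's actual data):

```python
import pandas as pd

# Hypothetical contents of single_samples.txt: one sample ID per line
single_samples = ['000001', '000003']

# A toy mapping table indexed by sample ID
metadata = pd.DataFrame({'BODY_HABITAT': ['UBERON:feces'] * 4,
                         'AGE': [25, 31, 44, 52]},
                        index=['000001', '000002', '000003', '000004'])

# Keep only the rows whose IDs appear in the sample list
single_map = metadata.loc[single_samples]

# A distance matrix can be filtered the same way, on both axes
dm = pd.DataFrame([[0.0, 0.4, 0.6, 0.7],
                   [0.4, 0.0, 0.5, 0.3],
                   [0.6, 0.5, 0.0, 0.2],
                   [0.7, 0.3, 0.2, 0.0]],
                  index=metadata.index, columns=metadata.index)
single_dm = dm.loc[single_samples, single_samples]
```

The same list filters the Biom table, the mapping file, and the distance matrices, so the three file types stay in sync.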
Let’s talk a bit more about what information is contained in each file type.
A metadata file, sometimes called a mapping file, provides information about our samples which cannot be determined from 16S marker gene sequences alone. Metadata is as important as the bacterial sequence information for determining information about the microbiome. Metadata tells us information about the sample, such as the body site where it was collected, the participant’s age or whether or not they have certain diseases. Since these can make a very large difference in the microbiome, it’s important to have this information!
American Gut metadata is collected through the participant survey. The survey allows participants to skip any question they do not wish to answer, meaning that some samples are missing fields. The python library used to handle metadata in these notebooks will specially encode these fields. Within IPython analysis notebooks, missing data will be represented as a Numpy NaN. Printed notebooks are set to render missing values as empty spaces, although this can be changed by altering the write_na variable. For certain QIIME scripts, leaving these fields blank allows the script to ignore samples with missing metadata.
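A minimal demonstration of this encoding, using the same na_values list set later in this notebook (the sample IDs and fields here are made up):

```python
import io

import numpy as np
import pandas as pd

# A toy tab-delimited mapping file with a missing value encoded two ways
raw = "#SampleID\tAGE\tDIET\nS1\t33\tno_data\nS2\tNA\tomnivore\n"

# The na_values list collapses all missing-data encodings to NaN on load
df = pd.read_csv(io.StringIO(raw), sep='\t',
                 na_values=['NA', 'no_data', 'unknown', ''],
                 index_col='#SampleID')

# Missing values are now numpy NaNs...
assert np.isnan(df.loc['S2', 'AGE'])

# ...and can be written back out as blanks, mirroring write_na = ''
out = df.to_csv(sep='\t', na_rep='')
```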
The American Gut metadata is also de-identified. This means the metadata does not contain information which could be used to identify a participant, like their name, email address, or the kit ID. Instead, each participant is assigned a code. This allows us to identify multiple samples from the same individual. The samples are identified by the barcode number, which appears on the sample tube. This number connects the survey metadata to the sample data in the OTU table and distance matrix.
This notebook will add alpha diversity to the mapping file for rarefied data. Files which contain the alpha diversity will always have a rarefaction depth notation in the file name. The convention here is to include the word even, followed by the rarefaction depth (e.g., AGP_100nt_even10k_fecal.txt).
An OTU table describes the bacterial content in a group of samples. An OTU, or operational taxonomic unit, is a cluster of sequences which are identical to some degree of similarity. We use these as stand-ins for bacterial groups. More information about OTUs, the level of similarity used, and the methods we use to generate OTUs can be found in the Primary Processing Pipeline Notebook. The OTU table allows us to link the sequencing results from our 16S data to the sample ID in a usable way, and gives an easier platform for comparison across samples.
Our OTU tables are saved as a special file format, the Biom format [4]. Unlike the other file types we will generate here, Biom files cannot be viewed using a normal spreadsheet program on your computer. The benefit of Biom format is that it allows us to save large amounts of data in a smaller amount of space. Biom data may be encoded as a JSON string, or using the same HDF5 compression NASA does. Because of the size of the American Gut data, we recommend the HDF5 format.
Finally, Distance Matrices describe the relationship between any two samples based on their community structure, measuring an ecological concept called Beta Diversity. In this notebook, we will measure the UniFrac Distance between all of our samples. This takes into account the evolutionary relationship between bacteria when communities are compared [5, 6]. Each cell in the distance matrix tells the distance between the sample given by the row and the sample given by the column. Distance matrices are symmetrical, which means that we can draw a line across the diagonal of the distance matrix, and the distances on either side of this line will be equal. The distances along that line should come from the same sample, and will have a distance of 0. We can use our distance matrix information directly, or use multidimensional scaling techniques like Principal Coordinates Analysis (PCoA) or make a UPGMA tree.
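The symmetry and zero-diagonal properties can be checked on a small example (the distances below are invented, not real UniFrac values):

```python
import numpy as np

# A toy distance matrix for three samples, A, B, and C
dm = np.array([[0.0, 0.3, 0.7],
               [0.3, 0.0, 0.5],
               [0.7, 0.5, 0.0]])

# Symmetry: the distance from A to B equals the distance from B to A
assert np.allclose(dm, dm.T)

# The diagonal compares each sample to itself, so it is always zero
assert np.allclose(np.diag(dm), 0)
```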
$\LaTeX$ is also recommended for running this suite of analysis notebooks, although it is not required for this notebook. TeX Live offers one installation solution.
We will start our analysis by importing the functions we need from python libraries.
In [2]:
import os
import shutil
import copy
import datetime
import numpy as np
import pandas as pd
import skbio
import biom
import americangut.diversity_analysis as div
import americangut.geography_lib as geo
from qiime_default_reference import get_reference_tree
It is also important to define certain aspects of how we'll handle files and do our analysis. It is easier to set these all at the same time, so the settings are consistent every time we repeat the process, rather than repeating them in multiple places. This way, we only have to change a parameter once.
In the course of this analysis, a series of files can be generated. The File Saving Parameters determine whether new files are saved.
In [3]:
data_download = False
overwrite = False
# Checks the data download-overwrite relationship is valid
if data_download:
    overwrite = True
We'll start by defining how we'll handle certain files, especially metadata files. We will use the Pandas library to handle most of our text files. This library provides some spreadsheet-like functionality.
In [4]:
txt_delim = '\t'
map_index = '#SampleID'
map_nas = ['NA', 'no_data', 'unknown', '']
write_na = ''
We rarefy our data to set an even depth and allow head-to-head comparison of alpha and beta diversity. Rarefaction begins by removing samples from the table which do not have the minimum number of counts. Sequences are then drawn randomly out of a weighted pool until we reach the rarefaction depth.
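The procedure can be sketched with NumPy (a toy illustration with invented counts and a tiny depth; the actual rarefaction is performed later by QIIME's multiple_rarefactions_even_depth.py):

```python
import numpy as np

rng = np.random.RandomState(42)
rarefaction_depth = 10  # toy depth; the notebook uses 10,000

# Toy OTU counts for three samples (values are invented)
counts = {'S1': np.array([8, 3, 1]),     # 12 sequences: kept
          'S2': np.array([2, 2, 1]),     # 5 sequences: below depth, dropped
          'S3': np.array([20, 15, 5])}   # 40 sequences: kept

rarefied = {}
for sample, otus in counts.items():
    if otus.sum() < rarefaction_depth:
        continue  # samples below the minimum count are removed
    # Expand counts into a pool of individual sequences, then draw
    # without replacement so every kept sample has the same depth
    pool = np.repeat(np.arange(len(otus)), otus)
    draw = rng.choice(pool, size=rarefaction_depth, replace=False)
    rarefied[sample] = np.bincount(draw, minlength=len(otus))
```

After this step, every remaining sample contributes exactly rarefaction_depth sequences, which is what makes diversity values comparable across samples.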
In [5]:
rarefaction_depth = 10000
num_rarefactions = 10
We will split our OTU tables by body site, since the collection site on the human body plays a large role in community compositions in healthy adults [7].
In [6]:
split_field = 'BODY_HABITAT'
split_prefix = 'UBERON:'
Alpha diversity looks at the variety of species within a sample. In this notebook, we calculate alpha diversity using a variety of metrics, and append these to the mapping file. A more complete discussion of alpha diversity can be found below.
In [7]:
alpha_metrics = 'PD_whole_tree,observed_species,chao1,shannon'
Beta diversity compares ecological communities across samples. A further discussion of beta diversity can be found below. We will use QIIME to calculate beta diversity.
In [8]:
beta_metrics = "unweighted_unifrac,weighted_unifrac"
We can select the datasets which we’ll generate using this notebook. The default for each body site is to generate a data set (OTU table, mapping file and distance matrices) for all participants and all samples. We may want to limit our samples to a single sample per individual. Or, we could choose only to work with a subset of the data (see Identification of a Healthy Subset of Adults).
In [9]:
# Lists all bodysites to be analyzed
habitat_sites = ['feces', 'oral cavity', 'skin']
all_bodysites = ['fecal', 'oral', 'skin']
# Handles healthy subset OTU tables
sub_part_sites = {'fecal'}
# Handles single sample OTU tables
one_samp_sites = {'fecal', 'oral', 'skin'}
To help organize the results of this notebook, we'll start by setting up a series of directories where results can be saved. This will provide a common file structure (base_dir, working_dir, etc.) for the results. We're going to set up three primary directories here, and then nest additional directories inside.
As we set up directories, we'll make use of the check_dir function. This will create the directories we identify if they do not exist.
We need a general location to do all our analysis; this is the base_dir. All our other directories will exist within the base_dir.
In [10]:
base_dir = os.path.join(os.path.abspath('.'), 'agp_analysis')
div.check_dir(base_dir)
Some of the steps in our diversity calculations will require a phylogenetic tree. This contains information about the evolutionary relationship between OTUs which can be leveraged in calculating PD Whole Tree Diversity and UniFrac Distance. While there are multiple ways to pick OTUs, this table was generated using a reference-based technique. Therefore, we can simply download the phylogenetic tree file for the reference set. Our reference for this dataset was the Greengenes version 13_8 at 97% similarity [11]. Please refer to the Primary Processing Pipeline Notebook for more information about how OTUs are picked.
In [11]:
tree_fp = get_reference_tree()
The working directories will be used to save files we generate while cleaning the data. These may include items like downloaded OTU tables, and rarefaction instances.
In [12]:
# Sets up a directory to save intermediate files and downloads
working_dir = os.path.join(base_dir, 'intermediate_files')
div.check_dir(working_dir)
Our first analysis step is to locate the American Gut OTU tables generated by the Primary Processing Pipeline Notebook. The tables may have been generated locally, may be located in a local GitHub repository, or may be downloaded directly from GitHub.
In [13]:
# Creates a directory for unprocessed file downloads
download_dir = os.path.join(working_dir, 'downloads')
div.check_dir(download_dir)
# Sets the filepaths for downloaded files
download_otu_fp = os.path.join(download_dir, 'AG_100nt.biom')
download_map_fp = os.path.join(download_dir, 'AG_100nt.txt')
When we perform rarefaction on our data, we will generate several similarly named files. To help keep these organized, we will create a directory for each set of rarefaction files.
In [14]:
# Creates a parent directory for rarefaction instances
rare_dir = os.path.join(working_dir, 'rarefaction')
div.check_dir(rare_dir)
# Sets a pattern for the filenames of the rarefaction files
rare_pattern = 'rarefaction_%(rare_depth)i_%(rare_instance)i.biom'
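The %(name)s placeholders in this pattern are filled with a dictionary via Python's %-style string substitution. For example (the directory name is illustrative):

```python
import os

# Same pattern as in the cell above; the directory name is illustrative
rare_pattern = 'rarefaction_%(rare_depth)i_%(rare_instance)i.biom'
rare_dir = 'intermediate_files/rarefaction'

# Each (depth, instance) pair yields a distinct, predictable file path
fps = [os.path.join(rare_dir, rare_pattern) % {'rare_depth': 10000,
                                               'rare_instance': i}
       for i in range(3)]
```

This lets later cells iterate over every rarefaction instance, or select a single one, with a simple dictionary substitution.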
When we calculate the alpha diversity of our data, we will use the similarly named rarefaction tables and generate several similarly named alpha diversity files. To help keep these organized, we will create a directory for each set of alpha diversity files.
In [15]:
# Creates a parent directory for the alpha diversity files
alpha_dir = os.path.join(working_dir, 'alpha')
div.check_dir(alpha_dir)
# Sets a pattern for the filenames of alpha diversity tables
alpha_pattern = 'alpha_rarefaction_%(rare_depth)i_%(rare_instance)i.txt'
At some point in our analysis, it will become necessary to split our OTU table by body site. The directory we create next will specify the location, while the file patterns will allow us to iterate or select one of many possible files using a simple string substitution.
In [16]:
# Creates a directory for splitting the OTU table by bodysite
split_raw_dir = os.path.join(working_dir, 'split_by_bodysite_raw')
div.check_dir(split_raw_dir)
split_rare_dir = os.path.join(working_dir, 'split_by_bodysite_rare')
div.check_dir(split_rare_dir)
# Sets a pattern for filenames in the split directory
split_fn = 'AGP_100nt%(rare_suffix)s__%(split_field)s_%(split_prefix)s%(split_group)s__.%(extension)s'
Finally, we need to set up the directories and filenames where we will save our results.
We’ll start by creating an output directory where our results should be located.
In [17]:
# Sets up a directory where all results should be saved
data_dir = os.path.join(base_dir, 'sample_data')
if not data_download:
    div.check_dir(data_dir)
Next, we'll create directories specific to the body sites for which we have samples. Each of these directories will store one of the result sets from this notebook; the folders are described above.
In [18]:
# Sets up an all sample directory
all_dir = os.path.join(data_dir, 'all')
if not data_download:
    div.check_dir(all_dir)

# Creates body-site specific directories
if not data_download:
    for site in all_bodysites:
        div.check_dir(os.path.join(data_dir, site))
We will set up file name patterns for the output directories where we'll save our final files. The %(site)s placeholder will allow us to insert the body site into each output directory path. We expect the final file paths used with these directories to be located in the data_dir.
In [19]:
# Sets up the file path pattern for the possible sets of sample at each body
# site
asab_pattern = os.path.join(data_dir, '%(site)s/all_participants_all_samples')
assb_pattern = os.path.join(data_dir, '%(site)s/all_participants_one_sample')
ssab_pattern = os.path.join(data_dir, '%(site)s/sub_participants_all_samples')
sssb_pattern = os.path.join(data_dir, '%(site)s/sub_participants_one_sample')
# Checks the file paths
if not data_download:
    for site in all_bodysites:
        site_blank = {'site': site}
        div.check_dir(asab_pattern % site_blank)
        if site in one_samp_sites:
            div.check_dir(assb_pattern % site_blank)
        if site in sub_part_sites:
            div.check_dir(ssab_pattern % site_blank)
        if site in sub_part_sites and site in one_samp_sites:
            div.check_dir(sssb_pattern % site_blank)
Finally, we will set up patterns for the file names we'll put in the output directories. We can combine these with our directories to get our final file paths. The AGP_100nt prefix indicates that the data comes from the American Gut Project and that sequences were trimmed to 100 nucleotides in length. The remaining blanks describe the rarefaction depth and body site.
In [20]:
otu_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.biom'
map_fn = 'AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt'
dm_fn = ['%(metric)s_AGP_100nt%(rare_depth)s%(spacer)s%(site)s.txt'
% {'metric': m, 'rare_depth': '%(rare_depth)s',
'spacer': '%(spacer)s', 'site': '%(site)s'}
for m in beta_metrics.split(',')]
sin_fn = 'single_samples.txt'
sub_fn = 'subset_samples.txt'
There are a set of blanks which are filled in for each file name. Some of these blanks will follow consistent patterns, which we can set before the files are used.
In [21]:
map_extension = 'txt'
otu_extension = 'biom'
site_pad = '_'
all_ = ''
rare_suffix = '_even10k'
raw_suffix = ''
We can set up the values to fill in the file blanks, using the body site names and the file paths we've provided. We'll generate some of these using substitutions.
In cases where the body site must be specified, this is handled in two ways. When splitting data by body site, the body site will be given by the split_group, and we'll use nan as a placeholder. After the tables are split, we generate body site-specific information as a list.
In [22]:
last_rare = {'rare_depth': rarefaction_depth,
'rare_instance': num_rarefactions - 1}
all_raw_blanks = {'rare_depth': raw_suffix,
'spacer': all_,
'site': all_}
all_rare_blanks = {'rare_depth': rare_suffix,
'spacer': all_,
'site': all_}
otu_raw_split_blanks = {'rare_suffix': raw_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': otu_extension}
map_raw_split_blanks = {'rare_suffix': raw_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': map_extension}
otu_rare_split_blanks = {'rare_suffix': rare_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': otu_extension}
map_rare_split_blanks = {'rare_suffix': rare_suffix,
'split_field': split_field,
'split_prefix': split_prefix,
'split_group': np.nan,
'extension': map_extension}
raw_sample_blanks = {'site': np.nan,
'rare_depth': raw_suffix,
'spacer': site_pad}
rare_sample_blanks = {'site': np.nan,
'rare_depth': rare_suffix,
'spacer': site_pad}
We can now start our analysis by downloading the American Gut mapping file and OTU tables. If they're not already located in the download_dir, they will be downloaded to this location. If the files exist, new versions will be downloaded only if overwrite is set to True.
Note that this step requires an internet connection.
In [23]:
if data_download:
    !curl -OL ftp://ftp.microbio.me/pub/American-Gut-precomputed/r1-15/sample_data.tgz
    !tar -xzf sample_data.tgz
    shutil.move('./sample_data', base_dir)
In [24]:
# Gets the biom file
if not data_download and (not os.path.exists(download_otu_fp) or overwrite):
    # Downloads the compressed biom file
    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.biom?raw=true
    # Moves the biom file to its final location
    shutil.move(os.path.join(os.path.abspath('.'), 'AG_100nt.biom?raw=true'), download_otu_fp)

# Gets the mapping file
if not data_download and (not os.path.exists(download_map_fp) or overwrite):
    # Downloads the mapping files
    !curl -OL https://github.com/biocore/American-Gut/blob/master/data/AG/AG_100nt.txt?raw=true
    # Moves the file to the download file path
    shutil.move(os.path.join('.', 'AG_100nt.txt?raw=true'), download_map_fp)
We will start by adjusting the metadata. This will correct errors and provide a uniform format for derived columns we may wish to use later or in downstream analyses.
In [25]:
# Loads the mapping file
raw_map = pd.read_csv(download_map_fp,
                      sep=txt_delim,
                      na_values=map_nas,
                      index_col=False,
                      dtype={map_index: str},
                      low_memory=False)
raw_map.index = raw_map[map_index]
del raw_map[map_index]

# Loads the OTU table
raw_otu = biom.load_table(download_otu_fp)

# Filters the raw map to remove any samples that are not present in the biom table
raw_map = raw_map.loc[raw_otu.ids()]
There is also a set of columns which are not included in the map, but may be useful for downstream analyses. These include age binned by decade (AGE_CAT). While there are QIIME analyses which can handle continuous metadata, binning can help reduce some of the noise.
Here, we bin age by decade, with the exception of people under the age of 20. The gut develops in the first two years of life, and the guts of young children are significantly different than older children or adults [12, 13]. We will also combine individuals over the age of 70 into their own category, due to the low sample counts of people over 80 as of round 14 (n < 20).
In [26]:
# Bins age by decade (with the exception of young children)
def categorize_age(x):
    if np.isnan(x):
        return x
    elif x < 3:
        return "baby"
    elif x < 13:
        return "child"
    elif x < 20:
        return "teen"
    elif x < 30:
        return "20s"
    elif x < 40:
        return "30s"
    elif x < 50:
        return "40s"
    elif x < 60:
        return "50s"
    elif x < 70:
        return "60s"
    else:
        return "70+"

raw_map['AGE_CAT'] = raw_map.AGE.apply(categorize_age)
In addition to considering the frequency with which people use alcohol (Never, Rarely, Occasionally, Regularly, or Daily), it may be helpful to simply look for an effect associated with any alcohol consumption.
In [27]:
def categorize_etoh(x):
    if x == 'Never':
        return "No"
    elif isinstance(x, str):
        return "Yes"
    elif np.isnan(x):
        return x

raw_map['ALCOHOL_CONSUMPTION'] = raw_map.ALCOHOL_FREQUENCY.apply(categorize_etoh)
Body Mass Index (BMI) can be stratified into categories which give an approximate idea of body shape. It is worth noting that these stratifications do not hold well for growing children, where the BMI qualification is based on age and gender.
In [28]:
# Categorizes the BMI into groups
def categorize_bmi(x):
    if np.isnan(x):
        return x
    elif x < 18.5:
        return "Underweight"
    elif x < 25:
        return "Normal"
    elif x < 30:
        return "Overweight"
    else:
        return "Obese"

raw_map['BMI_CAT'] = raw_map.BMI.apply(categorize_bmi)
American Gut samples have been collected since December of 2012. To look for patterns associated with the time of year samples were collected, we bin this date information into month and season.
We currently define our seasons according to the calendar in the Northern Hemisphere, because as of round fifteen, 99% of our samples were collected north of the equator. Additionally, rather than defining our seasons by the solar calendar, we have elected to use the first day of the month the solstice or equinox occurs in as the start of our season. So, while Winter technically begins on December 20th or 21st, according to the solar calendar, we consider December 1st as the first day of our Winter.
In [29]:
def convert_date(x):
    """Converts strings to a date object"""
    if isinstance(x, str) and "/" in x:
        try:
            return pd.tseries.tools.to_datetime(x)
        except:
            return np.nan
    else:
        return x

raw_map.COLLECTION_DATE = raw_map.COLLECTION_DATE.apply(convert_date)
In [30]:
# Categorizes data by collection month and collection season
month_map = {-1: [np.nan, np.nan],
             np.nan: [np.nan, np.nan],
             1: ['January', 'Winter'],
             2: ['February', 'Winter'],
             3: ['March', 'Spring'],
             4: ['April', 'Spring'],
             5: ['May', 'Spring'],
             6: ['June', 'Summer'],
             7: ['July', 'Summer'],
             8: ['August', 'Summer'],
             9: ['September', 'Fall'],
             10: ['October', 'Fall'],
             11: ['November', 'Fall'],
             12: ['December', 'Winter']}

def map_month(x):
    try:
        return month_map[x.month][0]
    except:
        return np.nan

def map_season(x):
    try:
        return month_map[x.month][1]
    except:
        return np.nan

# Maps the data as a month
raw_map['COLLECTION_MONTH'] = raw_map.COLLECTION_DATE.apply(map_month)

# Maps the data as a season
raw_map['COLLECTION_SEASON'] = raw_map.COLLECTION_DATE.apply(map_season)
The American Gut Project includes some geographical information about where samples were collected. While the data may be leveraged as-is, it can also be helpful to clean up the data.
We'll start by checking for uniform country mapping. This will allow us to combine samples from countries or groups of countries with multiple descriptive names, such as Great Britain and the United Kingdom.
In [31]:
def map_countries(x):
    return geo.country_map.get(x, x)

raw_map.COUNTRY = raw_map.COUNTRY.apply(map_countries)
In Rounds 1-15, participants come predominantly from the US, UK, Belgium and Canada. Since the areas occupied by Belgium and the UK are smaller than many states in the contiguous US (including some of the most represented states in the American Gut), we have elected to only consider the STATE field for American and Canadian samples.
This does not alter the information provided by the zip (postal) code or Longitude/Latitude information.
In [32]:
# Removes state information for any state not in the US or Canada
# (This may change as additional countries are added.)
countries = raw_map.groupby('COUNTRY').count().STATE.index.values
for country in countries:
    if country not in {'GAZ:United States of America', 'GAZ:Canada'}:
        raw_map.loc[raw_map.COUNTRY == country, 'STATE'] = np.nan

# Handles regional mapping, cleaning up states so that only American and
# Canadian states are included
def check_state(x):
    if isinstance(x, str) and x.upper() in geo.us_state_map:
        return geo.us_state_map[x.upper()]
    elif isinstance(x, str) and x.upper() in geo.canadian_map_english:
        return geo.canadian_map_english[x.upper()]
    else:
        return np.nan

raw_map['STATE'] = raw_map.STATE.apply(check_state)
We may also choose to use predefined regions to further bin our location data, and allow us to look for social or economic trends. To this end, we can apply regions defined by the US Census Bureau and Economic Regions defined by the US Bureau of Economic Analysis, which is part of the Department of Commerce.
In [33]:
# Bins data by census region
def census_f(x):
    if isinstance(x, str) and x in geo.regions_by_state:
        return geo.regions_by_state[x]['Census_1']
    else:
        return np.nan

raw_map['CENSUS_REGION'] = raw_map.STATE.apply(census_f)

# Bins data by economic region
def economic_f(x):
    if isinstance(x, str) and x in geo.regions_by_state:
        return geo.regions_by_state[x]['Economic']
    else:
        return np.nan

raw_map['ECONOMIC_REGION'] = raw_map.STATE.apply(economic_f)
As of round 15, there are 36 participants who report sleeping less than five hours a night. To allow for a larger sample size, we will pool these with the individuals who report sleeping between five and six hours a night, to create a group who report sleeping less than six hours a night.
In [34]:
raw_map.loc[raw_map.SLEEP_DURATION == 'Less than 5 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'
raw_map.loc[raw_map.SLEEP_DURATION == '5-6 hours', 'SLEEP_DURATION'] = 'Less than 6 hours'
Certain health states are known to influence the microbiome in extreme ways. For some analyses we will do later, it may be useful to limit the noise associated with these conditions to allow us to look for new patterns. We have identified five metadata categories which we will use to limit our “healthy” subset.
First, we limit based on age for several reasons. We chose to omit anyone under the age of twenty. The microbiome of very young children is not yet stable, and differs greatly from that of adults [12, 13]. Additionally, BMI limits are not easily assigned in people who are still growing. Without stratifying by gender, we assumed that growth will be complete in most people by the age of 20, and set our limit there. The limit at seventy was based on the number of individuals over that age, and on differences in the microbiome seen in older individuals [14, 15].
We also used Body Mass Index as an exclusion criterion, considering only people in the "normal" and "overweight" categories (BMI 18.5 - 30). It has been suggested that obesity changes the gut microbiome, although the effect is not consistent across all studies [16]. Additionally, we noticed alterations in our sample of underweight individuals.
Recent antibiotic use decreases alpha diversity and affects the microbiome [17]. We chose to define "recent" as any time within the last year. We also excluded anyone who reported having Inflammatory Bowel Disease [16], Type I Diabetes [18-21], or Type II Diabetes [22], since all three conditions are known to affect the microbiome.
In [35]:
# Creates the subset if it's not already in the mapping file
if 'SUBSET' not in raw_map.columns:
    subset_f = {'AGE': lambda x: 19 < x < 70 and not np.isnan(x),
                'DIABETES': lambda x: x == 'I do not have diabetes',
                'IBD': lambda x: x == 'I do not have IBD',
                'ANTIBIOTIC_SELECT': lambda x: x == 'Not in the last year',
                'BMI': lambda x: 18.5 <= x < 30 and not np.isnan(x)}

    # Determines which samples meet the requirements of the categories
    new_bin = {}
    for cat, f in subset_f.iteritems():
        new_bin[cat] = raw_map[cat].apply(f)

    # Adds a column to the current dataframe to look at the subset
    bin_series = pd.DataFrame(new_bin).all(1)
    bin_series.name = 'SUBSET'
    raw_map = raw_map.join(bin_series)
We will start by rarefying the whole body table using the rarefaction parameters we set earlier. Rarefaction filters out samples below a certain sequencing depth; sequences are then randomly drawn from the remaining samples so that all samples have an even depth. To control for bias which might occur with a single, random subsampling of the data, we use multiple rounds of rarefaction to more accurately estimate the alpha diversity.
Rarefaction is important to make intra-sample diversity (alpha diversity) comparisons possible. Below is a panel from Figure 1 of Human Gut Microbiome and Risk of Colorectal Cancer [23]. The figure compares Shannon Diversity between individuals with colorectal cancer (n=47, red circles) and healthy controls (n=94, empty triangles) over several rarefaction depths, or sequence counts per sample.
The figure also illustrates the importance of even sampling depth. If a control sample with 500 sequences per sample were compared with a cancer sample at a depth of 2500 sequences per sample, the cancer sample would appear more diverse. Comparisons at the same depth reveal the true pattern in the data: cancer samples are less diverse than controls.
To perform multiple rarefactions, we will use the QIIME script, multiple_rarefactions_even_depth.py.
In [36]:
if not data_download and (not os.path.exists(os.path.join(rare_dir, rare_pattern) % last_rare) or overwrite):
    !multiple_rarefactions_even_depth.py -i $download_otu_fp -o $rare_dir -n $num_rarefactions -d $rarefaction_depth --lineages_included
We will use our rarefaction tables to calculate the alpha diversity associated with each rarefaction. Alpha diversity is a measure of intra sample diversity. Imagine that we could put up an imaginary wall around a 100ft x 100ft x 10 ft box in Yellowstone National Park, trapping all the vertebrate animals in that area for a short period of time. Imagine that we then made a very careful list (or took photographs) of the area so that we could document all the life we found in the area. We could count all the different types of animals we found in that area. This would be one measure of alpha diversity.
Suppose that instead of considering each type of animal to be equally similar, we want to include the evolutionary relationships between the animals. If our area contained a mouse, a squirrel and a rabbit, we might say these animals are more similar (and therefore less diverse) than a mouse, a squirrel, and a sagebrush lizard found in the same area. Even though we've found three species in each case, the third species being a reptile makes the community more diverse than if it were another rodent.
A diversity metric which accounts for shared evolutionary history between species is called a phylogenetic metric. This often uses a phylogenetic tree to provide information about that shared history. PD Whole Tree Diversity is a commonly used phylogenetic alpha diversity metric in microbiome research [8]. A taxonomic metric assumes all species are equally different. Common taxonomic metrics for alpha diversity used in microbiome research include Observed Species Diversity and Chao1 Diversity [9, 10].
Depending on what information we're looking for, we might also want to weight our diversity by the number of individuals belonging to each species. So, if in our little area of Yellowstone, 90% of the animals we see are mice, while 5% are rabbits and 5% are trout, we would consider this less diverse than if 40% of the animals were mice, 30% were rabbits and 30% were trout. A metric which takes into account the counts of each species is a quantitative metric, while a qualitative metric looks only at the presence or absence of a species.
While alpha diversity is calculated completely independently for each sample, comparisons of alpha diversity may provide clues about environmental changes. For example, pollution or an algal bloom may be associated with lower alpha diversity, and indicate a potential change in the health of the ecosystem. We’ll start our work with alpha diversity by calculating the diversity for our rarefied American Gut tables using the four metrics we selected in the alpha diversity parameters: the phylogenetic PD Whole Tree Diversity, and the taxonomic metrics Observed Species Diversity, Chao1 Diversity and Shannon Diversity. PD Whole Tree, Observed Species and Chao1 are qualitative (richness-based) metrics, while Shannon Diversity is a quantitative metric that also accounts for abundance.
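To make the qualitative/quantitative distinction concrete, here is a small sketch using made-up counts for the Yellowstone example above. It compares observed species (a qualitative metric) with Shannon diversity (a quantitative metric); the counts and the base-2 logarithm are illustrative choices, not part of the pipeline itself.

```python
import numpy as np

def shannon(counts):
    """Shannon diversity (base 2) from a vector of species counts."""
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

# Two hypothetical communities, each with three species observed
skewed = np.array([90., 5., 5.])   # 90% mice, 5% rabbits, 5% trout
even = np.array([40., 30., 30.])   # 40% mice, 30% rabbits, 30% trout

# A qualitative metric (observed species) cannot tell them apart...
print((skewed > 0).sum(), (even > 0).sum())  # 3 3

# ...but a quantitative metric like Shannon diversity can
print(round(shannon(skewed), 3), round(shannon(even), 3))  # 0.569 1.571
```

The more evenly distributed community scores higher on the quantitative metric, even though both contain exactly three species.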
In [37]:
if not data_download and (not os.path.exists(os.path.join(alpha_dir, alpha_pattern) % last_rare) or overwrite):
    !alpha_diversity.py -i $rare_dir -o $alpha_dir -m $alpha_metrics -t $tree_fp
The alpha diversity results from QIIME are loaded into the notebook. To identify the best rarefaction instance, which we'll use as our OTU table moving forward, we look for the rarefaction instance whose alpha diversity is closest to the mean alpha diversity across all instances. We define "closest" using the normalized Euclidean distance.
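As a sketch of this selection criterion (using made-up alpha diversity values; note that the element-wise square root of a squared difference reduces to an absolute value):

```python
import numpy as np

# Hypothetical alpha diversity values: rows are samples, columns are
# rarefaction instances (illustrative numbers only)
alpha = np.array([[10., 12., 11.],
                  [20., 19., 20.],
                  [30., 33., 30.]])

# Mean alpha diversity for each sample across all instances
mean_alpha = alpha.mean(axis=1, keepdims=True)

# Normalized distance of each instance from the per-sample mean
diff = np.abs(alpha - mean_alpha) / mean_alpha

# The "best" instance minimizes the average distance across samples
best = diff.mean(axis=0).argmin()
print(best)  # 2
```

Here the third instance (index 2) sits closest to the per-sample means, so it would be carried forward as the representative table.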
In [38]:
if not data_download:
    # Preallocates an output object for alpha diversity
    alpha_rounds = {m: {} for m in alpha_metrics.split(',')}
    div_metric = alpha_metrics.split(',')[0]
    # Loops through the rarefaction instances
    for ri in range(num_rarefactions):
        a_file_blanks = {'rare_depth': rarefaction_depth,
                         'rare_instance': ri}
        # Sets the alpha diversity file path
        alpha_fp = os.path.join(alpha_dir, alpha_pattern) % a_file_blanks
        # Loads the alpha diversity table
        alpha = pd.read_csv(alpha_fp,
                            sep=txt_delim,
                            index_col=False)
        alpha.index = alpha['Unnamed: 0']
        del alpha['Unnamed: 0']
        # Extracts the alpha diversity metrics
        for col in alpha_rounds:
            alpha_rounds[col]['%i' % ri] = alpha[col]
            alpha_rounds[col]['%i' % ri].name = '%i' % ri
In [39]:
if not data_download:
    # Compiles the alpha diversity results into a single table
    alpha_df = pd.DataFrame({'%s_mean' % metric: pd.DataFrame(alpha_rounds[metric]).mean(1)
                             for metric in alpha_metrics.split(',')})
    # Adds the alpha diversity results to the rarefied table
    rare_map = raw_map.copy()
    rare_map = rare_map.join(alpha_df)
    rare_check = np.isnan(rare_map['%s_mean' % div_metric]) == False
    rare_map = rare_map.loc[rare_check]
    # Draws the data associated with each of the alpha diversity rounds
    all_rounds = pd.DataFrame(alpha_rounds[div_metric])
    # Lines up the data so the indices match (as a precaution)
    all_rounds = all_rounds.sort_index()
    alpha_df = alpha_df.sort_index()
    # Calculates the normalized distance between each round and the mean
    mean_rounds = ([alpha_df['%s_mean' % div_metric].values] *
                   np.ones((num_rarefactions, 1))).transpose()
    diff = np.sqrt(np.square(all_rounds.values - mean_rounds)) / mean_rounds
    # Determines the round with the minimum distance from the mean
    round_labels = np.arange(0, num_rarefactions)
    round_avg = diff.mean(0)
    best_rarefaction = round_labels[round_avg == min(round_avg)][0]
    best_blanks = {'rare_depth': rarefaction_depth,
                   'rare_instance': best_rarefaction}
We’ll save the whole body tables in their own directory, and the modified mapping files. We’ll also copy the raw OTU table and the rarefaction instance closest to the mean alpha diversity.
In [40]:
# Saves the unrarefied mapping file
if not data_download:
    raw_map.to_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,
                   sep=txt_delim,
                   na_rep=write_na,
                   index_label=map_index)
    # Saves the rarefied mapping file
    rare_map.to_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,
                    sep=txt_delim,
                    na_rep=write_na,
                    index_label=map_index)
    # Copies the raw OTU table
    shutil.copy2(download_otu_fp,
                 os.path.join(all_dir, otu_fn) % all_raw_blanks)
    # Copies the rarefied OTU table
    shutil.copy2(os.path.join(rare_dir, rare_pattern) % best_blanks,
                 os.path.join(all_dir, otu_fn) % all_rare_blanks)
# Reloads the saved mapping files so the same objects are available
# whether the data was generated here or downloaded
raw_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_raw_blanks,
                      sep=txt_delim,
                      na_values=map_nas,
                      index_col=False,
                      low_memory=False,
                      dtype={map_index: str})
raw_map.index = raw_map[map_index]
del raw_map[map_index]
rare_map = pd.read_csv(os.path.join(all_dir, map_fn) % all_rare_blanks,
                       sep=txt_delim,
                       na_values=map_nas,
                       index_col=False,
                       low_memory=False,
                       dtype={map_index: str})
rare_map.index = rare_map[map_index]
del rare_map[map_index]
Beta diversity allows us to make comparisons between samples and environments. Let’s go back to our 100 ft x 100 ft x 10 ft cube in Yellowstone where we cataloged all the vertebrates. Let’s imagine that we’ve set up the same type of cube in New York City’s Central Park and cataloged all the vertebrates in that area as well.
We could compare the two communities by seeing how many species are shared between the two, or by computing some measure that approximates the fraction of species shared. We might expect some overlap: depending on where we selected our regions, it would be unsurprising to encounter chipmunks in both Central Park and Yellowstone National Park. However, there should also be some differences. Unless our Central Park location includes the zoo, it’s unlikely we’d find a buffalo in New York City!
If we use a taxonomic metric, based only on the species we find in the two locations, we might get very little overlap. While we might expect to find a squirrel in both Central Park and Yellowstone, the animals might be members of different genera! New York is home to the Eastern Grey Squirrel, Sciurus carolinensis, while we might find the American Red Squirrel, Tamiasciurus hudsonicus, in Yellowstone [24, 25]. In this case, a phylogenetic metric, which can account for some similarity between the two species of squirrels, may serve us much better.
When we compare microbial communities for beta diversity, we frequently select a phylogenetic metric called UniFrac distance [5]. This metric uses a phylogenetic tree, and determines what fraction of the tree is not shared between two communities.
If we consider only the presence and absence of each OTU in the samples, we have a qualitative metric, unweighted UniFrac distance. Unweighted UniFrac distance may take on values between 0 (everything the same) and 1 (everything different). Weighted UniFrac distance takes into account the abundance of the OTUs, and can take on values greater than 1.
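As a sketch of the idea behind unweighted UniFrac (using a tiny hand-built tree with illustrative branch lengths, not QIIME's implementation), the distance is the fraction of branch length leading only to tips observed in one of the two samples:

```python
# Tiny hand-built tree, written as (tips below the branch, branch length).
# The topology and lengths are illustrative only.
#        root
#       /    \
#      b1     \
#     /  \     \
#    A    B     C
branches = [
    ({'A', 'B'}, 1.0),   # internal branch above A and B
    ({'A'}, 0.5),
    ({'B'}, 0.5),
    ({'C'}, 1.5),
]

def unweighted_unifrac(obs_x, obs_y):
    """Fraction of branch length leading to tips seen in only one sample."""
    shared = unique = 0.0
    for tips, length in branches:
        in_x = bool(tips & obs_x)
        in_y = bool(tips & obs_y)
        if in_x and in_y:
            shared += length
        elif in_x or in_y:
            unique += length
    return unique / (shared + unique)

# Sample 1 contains tips A and B; sample 2 contains B and C
print(unweighted_unifrac({'A', 'B'}, {'B', 'C'}))  # 2.0 / 3.5
```

Two identical communities share every branch and score 0; two communities with no shared branches score 1, matching the bounds described above.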
The UniFrac distance for each pair of samples is arranged into a distance matrix. We can visualize the distance matrix using many techniques, such as PCoA plots in EMPeror, or UPGMA trees like the one shown in the figure below [2].
Since UniFrac distance is calculated for each sample pair in the table, this is one of the most computationally expensive steps we will perform. However, once the UniFrac distance has been calculated for all of our samples, we can simply filter the table to focus on the samples we want. We can leverage the QIIME script, beta_diversity.py, to perform our analysis.
In [41]:
# Sets up the filepaths for the all sample rarefied table
all_otu_rare_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks
check_dm_fp = np.array([os.path.exists(os.path.join(all_dir, fn_)
                                       % all_rare_blanks)
                        for fn_ in dm_fn])
# Calculates the beta diversity
if not data_download and (not check_dm_fp.all() or overwrite):
    !beta_diversity.py -i $all_otu_rare_fp -m $beta_metrics -t $tree_fp -o $all_dir
Now that we’ve generated alpha and beta diversity results for all the body sites, we can start filtering the results. Body site has one of the largest impacts on the microbiome in adult humans [7]. As a result, many analyses will focus on a single body site, often fecal samples.
We’ll use the QIIME script, split_otu_table.py to split our rarefied and unrarefied OTU tables by body site. We’ll put the output files in intermediate directories, and then move them to the appropriate locations.
In [42]:
# Sets the raw location file names
all_raw_otu_fp = os.path.join(all_dir, otu_fn) % all_raw_blanks
all_raw_map_fp = os.path.join(all_dir, map_fn) % all_raw_blanks
# Sets the rarefied location file names
all_rare_otu_fp = os.path.join(all_dir, otu_fn) % all_rare_blanks
all_rare_map_fp = os.path.join(all_dir, map_fn) % all_rare_blanks
In [43]:
# Checks that the raw and rarefied bodysite split tables exist
raw_split_check = np.array([])
rare_split_check = np.array([])
for site in all_bodysites:
    # Checks the unrarefied splits exist
    otu_raw_split_blanks['split_group'] = site
    map_raw_split_blanks['split_group'] = site
    raw_otu_exist = os.path.exists(os.path.join(split_raw_dir, split_fn)
                                   % otu_raw_split_blanks)
    raw_map_exist = os.path.exists(os.path.join(split_raw_dir, split_fn)
                                   % map_raw_split_blanks)
    raw_split_check = np.hstack((raw_split_check, raw_otu_exist, raw_map_exist))
    # Checks the rarefied splits exist
    otu_rare_split_blanks['split_group'] = site
    map_rare_split_blanks['split_group'] = site
    rare_otu_exist = os.path.exists(os.path.join(split_rare_dir, split_fn)
                                    % otu_rare_split_blanks)
    rare_map_exist = os.path.exists(os.path.join(split_rare_dir, split_fn)
                                    % map_rare_split_blanks)
    rare_split_check = np.hstack((rare_split_check, rare_otu_exist, rare_map_exist))
In [44]:
# Splits the OTU tables and mapping files by bodysite, unless all the
# split files already exist
if not data_download and (not raw_split_check.all() or overwrite):
    !split_otu_table.py -i $all_raw_otu_fp -m $all_raw_map_fp -f BODY_HABITAT -o $split_raw_dir
if not data_download and (not rare_split_check.all() or overwrite):
    !split_otu_table.py -i $all_rare_otu_fp -m $all_rare_map_fp -f BODY_HABITAT -o $split_rare_dir
We’ll move and rename our split files to their final location.
In [45]:
# Copies the files to their correct final folder
if not data_download:
    for idx, h_site in enumerate(habitat_sites):
        otu_raw_split_blanks['split_group'] = h_site
        map_raw_split_blanks['split_group'] = h_site
        otu_rare_split_blanks['split_group'] = h_site
        map_rare_split_blanks['split_group'] = h_site
        raw_sample_blanks['site'] = all_bodysites[idx]
        rare_sample_blanks['site'] = all_bodysites[idx]
        # Copies the unrarefied mapping file
        shutil.copy2(os.path.join(split_raw_dir, split_fn)
                     % map_raw_split_blanks,
                     os.path.join(asab_pattern, map_fn)
                     % raw_sample_blanks)
        # Copies the unrarefied OTU table
        shutil.copy2(os.path.join(split_raw_dir, split_fn)
                     % otu_raw_split_blanks,
                     os.path.join(asab_pattern, otu_fn)
                     % raw_sample_blanks)
        # Copies the rarefied mapping file
        shutil.copy2(os.path.join(split_rare_dir, split_fn)
                     % map_rare_split_blanks,
                     os.path.join(asab_pattern, map_fn)
                     % rare_sample_blanks)
        # Copies the rarefied OTU table
        shutil.copy2(os.path.join(split_rare_dir, split_fn)
                     % otu_rare_split_blanks,
                     os.path.join(asab_pattern, otu_fn)
                     % rare_sample_blanks)
To get our distance matrices for each OTU table, we’ll use the QIIME script, filter_distance_matrix.py. We will use the mapping file for each body site to filter the distance matrices.
In [46]:
if not data_download:
    for idx, a_site in enumerate(all_bodysites):
        rare_sample_blanks['site'] = a_site
        # Gets the rarefied mapping file for the site
        map_in = os.path.join(asab_pattern, map_fn) % rare_sample_blanks
        for fn_ in dm_fn:
            dm_in = os.path.join(all_dir, fn_) % all_rare_blanks
            dm_out = os.path.join(asab_pattern, fn_) % rare_sample_blanks
            if not os.path.exists(dm_out) or overwrite:
                !filter_distance_matrix.py -i $dm_in -o $dm_out --sample_id_fp $map_in
For some analyses we will choose to perform, it can be useful to work with a single sample for each participant at each body site. Many statistical tests assume sample independence. The microbiome among healthy adults is relatively stable across multiple samples within an individual; there is a higher correlation between your personal samples collected across several days than there is between your sample and another person’s sample collected at the same time [7].
We’re going to start defining our single sample data sets by writing a function which will allow us to randomly select a sample from each individual. This will take a pandas data frame as an input. We’ll group the data so we can look at each individual (given by the HOST_SUBJECT_ID), and then randomly select one sample id per individual.
In [47]:
def identify_single_samples(map_):
    """Selects a single sample for each participant

    Parameters
    ----------
    map_ : pandas DataFrame
        A mapping file for our set of samples. A single body site should be
        used with human samples.

    Returns
    -------
    single_ids : ndarray
        A list of ids which represent a single sample per individual
    """
    # Identifies a single sample per individual
    single_ids = np.hstack([np.random.choice(np.array(ids, dtype=str), 1)
                            for indv, ids in
                            map_.groupby('HOST_SUBJECT_ID').groups.items()])
    return single_ids
We’ll apply our filtering function at each body site. To do this, we’ll define another function, filter_dataset, which performs the filtering using any selection function we supply, like identify_single_samples.
filter_dataset will first identify the samples to keep using the function we pass in. It will then use that list of samples to filter the rarefied and unrarefied (raw) mapping files. We will leverage the biom subset-table command and the QIIME script filter_distance_matrix.py to filter our OTU tables and distance matrices.
In [48]:
def filter_dataset(filter_fun, site, dir_in, dir_out, ids_fp):
    """Filters a data set to create a subset

    Parameters
    ----------
    filter_fun : function
        A function which takes a pandas map and returns a list of sample
        ids.
    site : str
        The body site for which the samples are being generated.
    dir_in : str
        The directory in which the input analysis files are located. The
        directory and files are assumed to exist.
    dir_out : str
        The directory where the filtered files should be put. The directory
        must exist.
    ids_fp : str
        The filepath where the list of ids in the subset is located.

    Returns
    -------
    There are no explicit python returns. Rarefied and unrarefied OTU tables,
    and their corresponding mapping files (the rarefied file includes alpha
    diversity) as well as distance matrices for all distance metrics used
    are saved in `dir_out`.
    """
    rare_sample_blanks['site'] = site
    raw_sample_blanks['site'] = site
    # Sets up the file names for the original files
    rare_map_in_fp = os.path.join(dir_in, map_fn) % rare_sample_blanks
    rare_otu_in_fp = os.path.join(dir_in, otu_fn) % rare_sample_blanks
    raw_map_in_fp = os.path.join(dir_in, map_fn) % raw_sample_blanks
    raw_otu_in_fp = os.path.join(dir_in, otu_fn) % raw_sample_blanks
    # Sets up the file names for the filtered output files
    rare_map_out_fp = os.path.join(dir_out, map_fn) % rare_sample_blanks
    rare_otu_out_fp = os.path.join(dir_out, otu_fn) % rare_sample_blanks
    raw_map_out_fp = os.path.join(dir_out, map_fn) % raw_sample_blanks
    raw_otu_out_fp = os.path.join(dir_out, otu_fn) % raw_sample_blanks
    # Checks if the sample id list and filtered rarefied map already exist
    if not os.path.exists(ids_fp) or not os.path.exists(rare_map_out_fp) or overwrite:
        # Reads in the rarefied mapping file
        rare_map_in = pd.read_csv(rare_map_in_fp,
                                  sep=txt_delim,
                                  na_values=map_nas,
                                  index_col=False,
                                  dtype={map_index: str})
        rare_map_in.index = rare_map_in[map_index]
        del rare_map_in[map_index]
        # Identifies the sample ids
        filt_ids = filter_fun(rare_map_in)
        rare_map_out = rare_map_in.loc[filt_ids]
        # Saves the list of sample ids
        ids_file = open(ids_fp, 'w')
        ids_file.write('\n'.join(list(filt_ids)))
        ids_file.close()
        # Saves the rarefied mapping file
        rare_map_out.to_csv(rare_map_out_fp,
                            sep=txt_delim,
                            na_rep=write_na,
                            index_label=map_index)
    else:
        ids_file = open(ids_fp, 'r')
        filt_ids = ids_file.read().split('\n')
        ids_file.close()
    if not os.path.exists(raw_map_out_fp) or overwrite:
        raw_map_in = pd.read_csv(raw_map_in_fp,
                                 sep=txt_delim,
                                 na_values=map_nas,
                                 index_col=False,
                                 dtype={map_index: str})
        raw_map_in.index = raw_map_in[map_index]
        del raw_map_in[map_index]
        raw_map_out = raw_map_in.loc[filt_ids]
        raw_map_out.to_csv(raw_map_out_fp,
                           sep=txt_delim,
                           na_rep=write_na,
                           index_label=map_index)
    # Filters the OTU tables down to the subset of samples
    if not os.path.exists(rare_otu_out_fp) or overwrite:
        !biom subset-table -i $rare_otu_in_fp -o $rare_otu_out_fp -a sample -s $ids_fp
    if not os.path.exists(raw_otu_out_fp) or overwrite:
        !biom subset-table -i $raw_otu_in_fp -o $raw_otu_out_fp -a sample -s $ids_fp
    # Filters the distance matrices
    for dm_ in dm_fn:
        dm_in_fp = os.path.join(dir_in, dm_) % rare_sample_blanks
        dm_out_fp = os.path.join(dir_out, dm_) % rare_sample_blanks
        if not os.path.exists(dm_out_fp) or overwrite:
            !filter_distance_matrix.py -i $dm_in_fp -o $dm_out_fp --sample_id_fp $ids_fp
Let's apply our two functions to identify a single sample per individual at each site.
In [49]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site where a single sample set should not be generated
    if a_site not in one_samp_sites:
        continue
    filter_dataset(filter_fun=identify_single_samples, site=a_site,
                   dir_in=asab_pattern, dir_out=assb_pattern,
                   ids_fp=os.path.join(assb_pattern, sin_fn) % {'site': a_site})
Finally, we may wish to have a healthy subset of individuals for certain analyses. The criteria we’ve used to define the healthy subset are described above.
We're going to define a quick function that makes use of the SUBSET column in the mapping file.
In [50]:
def identify_subset(map_):
    return map_.loc[map_.SUBSET == True].index.values
Now, we’ll use essentially the same pipeline we leveraged for filtering the single samples.
In [51]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site where the healthy subset criteria should not be applied
    if a_site not in sub_part_sites:
        continue
    filter_dataset(identify_subset, a_site, asab_pattern, ssab_pattern,
                   os.path.join(ssab_pattern, sub_fn) % {'site': a_site})
In [52]:
for idx, a_site in enumerate(all_bodysites):
    # Skips any site without both a healthy subset and a single sample set
    if a_site not in sub_part_sites or a_site not in one_samp_sites:
        continue
    filter_dataset(identify_subset, a_site, assb_pattern, sssb_pattern,
                   os.path.join(sssb_pattern, sub_fn) % {'site': a_site})
We have generated body site specific, rarefied OTU tables, mapping files with alpha diversity, and UniFrac distance matrices for our American Gut data, as well as more focused single sample and healthy subset datasets. We can choose to create further-filtered tables, or we can take the outputs of this notebook and use them for downstream analysis.
1. Caporaso, J.G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F.D.; Costello, E.K.; Fierer, N.; Peña, A.G.; Goodrich, J.K.; Gordon, J.I.; Huttley, G.A.; Kelley, S.T.; Knights, D.; Koenig, J.E.; Ley, R.E.; Lozupone, C.A.; McDonald, D.; Muegge, B.D.; Pirrung, M.; Reeder, J.; Sevinsky, J.R.; Turnbaugh, P.J.; Walters, W.A.; Widmann, J.; Yatsunenko, T.; Zaneveld, J. and Knight, R. (2010). “QIIME allows analysis of high-throughput community sequence data.” Nature Methods. 7: 335-336.
2. Vázquez-Baeza, Y.; Pirrung, M.; Gonzalez, A.; and Knight, R. (2013). “EMPeror: a tool for visualizing high-throughput microbial community data.” Gigascience. 2: 16.
3. Langille, M.G.; Zaneveld, J.; Caporaso, J.G.; McDonald, D.; Knights, D.; Reyes, J.A.; Clemente, J.C.; Burkepile, D.E.; Vega Thurber, R.L.; Knight, R.; Beiko, R.G.; and Huttenhower, C. (2013). “Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences.” Nat Biotechnol. 31: 814-821.
4. McDonald, D.; Clemente, J.C.; Kuczynski, J.; Rideout, J.R.; Stombaugh, J.; Wendel, D.; Wilke, A.; Huse, S.; Hufnagle, J.; Meyer, F.; Knight, R.; and Caporaso, J.G. (2012). “The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome.” Gigascience. 1: 7.
5. Lozupone, C.; and Knight, R. (2005). “UniFrac: a new phylogenetic method for comparing microbial communities.” Appl Environ Microbiol. 71: 8228-8235.
6. Lozupone, C.; Lladser, M.E.; Knights, D.; Stombaugh, J.; and Knight, R. (2011). “UniFrac: an effective distance metric for microbial community composition.” ISME J. 5: 169-172.
7. The Human Microbiome Consortium. (2012). “Structure, function and diversity of the healthy human microbiome.” Nature. 486: 207-214.
8. Eckburg, P.B.; Bik, E.M.; Bernstein, C.N.; Purdom, E.; Dethlefsen, L.; Sargent, M.; Gill, S.R.; Nelson, K.E.; and Relman, D.A. (2005). “Diversity of the human intestinal microbial flora.” Science. 308: 1635-1638.
9. Chao, A. (1984). “Nonparametric estimation of the number of classes in a population.” Scandinavian J Stats. 11: 265-270.
10. Seaby, R.M.H. and Henderson, P.A. (2006). “Species Diversity and Richness 4.” http://www.pisces-conservation.com/sdrhelp/index.html.
11. McDonald, D.; Price, N.M.; Goodrich, J.; Nawrocki, E.P.; DeSantis, T.Z.; Probst, A.; Andersen, G.L.; Knight, R. and Hugenholtz, P. (2012). “An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea.” ISME J. 6: 610-618.
12. Koenig, J.E.; Spor, A.; Scalfone, N.; Fricker, A.D.; Stombaugh, J.; Knight, R.; Angenent, L.T.; and Ley, R.E. (2011). “Succession of microbial consortia in the developing infant gut microbiome.” PNAS. 108 Suppl 1: 4578-4585.
13. Yatsunenko, T.; Rey, F.E.; Manary, M.J.; Trehan, I.; Dominguez-Bello, M.G.; Contreras, M.; Magris, M.; Hidalgo, G.; Baldassano, R.N.; Anokhin, A.P.; Heath, A.C.; Warner, B.; Reeder, J.; Kuczynski, J.; Caporaso, J.G.; Lozupone, C.A.; Lauber, C.; Clemente, J.C.; Knights, D.; Knight, R. and Gordon, J.I. (2012). “Human gut microbiome viewed across age and geography.” Nature. 486: 222-227.
14. Claesson, M.J.; Cusack, S.; O’Sullivan, O.; Greene-Diniz, R.; de Weerd, H.; Flannery, E.; Marchesi, J.R.; Falush, D.; Dinan, T.; Fitzgerald, G.; Stanton, C.; van Sinderen, D.; O’Connor, M.; Harnedy, N.; O’Connor, K.; Henry, C.; O’Mahony, D.; Fitzgerald, A.P.; Shanahan, F.; Twomey, C.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2011). “Composition, variability and temporal stability of the intestinal microbiota of the elderly.” PNAS. 108 Suppl 1: 4586-4591.
15. Claesson, M.J.; Jeffery, I.B.; Conde, S.; Power, S.E.; O’Connor, E.M.; Cusack, S.; Harris, H.M.; Coakley, M.; Lakshminarayanan, B.; O’Sullivan, O.; Fitzgerald, G.F.; Deane, J.; O’Connor, M.; Harnedy, N.; O’Connor, K.; O’Mahony, D.; van Sinderen, D.; Wallace, M.; Brennan, L.; Stanton, C.; Marchesi, J.R.; Fitzgerald, A.P.; Shanahan, F.; Hill, C.; Ross, R.P.; and O’Toole, P.W. (2012). “Gut microbiota composition correlates with diet and health in the elderly.” Nature. 488: 178-184.
16. Walters, W.A.; Xu, Z.; and Knight, R. (2014). “Meta-analysis of human gut microbes associated with obesity and IBD.” FEBS Letters. 588: 4223-4233.
17. Dethlefsen, L. and Relman, D.A. (2011). “Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation.” PNAS. 108 Suppl 1: 4554-4561.
18. de Goffau, M.C.; Luopajärvi, K.; Knip, M.; Ilonen, J.; Ruohtula, T.; Härkönen, T.; Orivuori, L.; Hakala, S.; Welling, G.W.; Harmsen, H.J.; and Vaarala, O. (2013). “Fecal microbiota composition differs between children with β-cell autoimmunity and those without.” Diabetes. 62: 1238-1244.
19. Giongo, A.; Gano, K.A.; Crabb, D.B.; Mukherjee, N.; Novelo, L.L.; Casella, G.; Drew, J.C.; Ilonen, J.; Knip, M.; Hyöty, H.; Veijola, R.; Simell, T.; Simell, O.; Neu, J.; Wasserfall, C.H.; Schatz, D.; Atkinson, M.A.; and Triplett, E.W. (2011). “Toward defining the autoimmune microbiome for type 1 diabetes.” ISME J. 5: 82-91.
20. Mejía-León, M.E.; Petrosino, J.F.; Ajami, N.J.; Domínguez-Bello, M.G.; and de la Barca, A.M. (2014). “Fecal microbiota imbalance in Mexican children with type 1 diabetes.” Scientific Reports. 4: 3814.
21. Murri, M.; Leiva, I.; Gomez-Zumaquero, J.M.; Tinahones, F.J.; Cardona, F.; Soriguer, F.; and Queipo-Ortuño, M.I. (2013). “Gut microbiota in children with type 1 diabetes differs from that in healthy children: a case-control study.” BMC Med. 11: 46.
22. Larsen, N.; Vogensen, F.K.; van den Berg, F.W.; Nielsen, D.S.; Andreasen, A.S.; Pedersen, B.K.; Al-Soud, W.A.; Sørensen, S.J.; Hansen, H.L. and Jakobsen, M. (2010). “Gut microbiota in human adults with type 2 diabetes differs from non-diabetic adults.” PLoS One. 5: e9085.
23. Ahn, J.; Sinha, R.; Pei, Z.; Dominianni, C.; Wu, J.; Shi, J.; Goedert, J.J.; Hayes, R.B.; and Yang, L. (2013). “Human gut microbiome and risk for colorectal cancer.” J Natl Cancer Inst. 105: 1907-1911.
24. National Park Service. (2015). “Mammal Checklist.” Yellowstone National Park.
25. Milieris, V. (2011). “Biodiversity in Central Park.” Exploring Central Park. CUNY.