The OmicExperiment object is the heart of the omicexperiment package. It has the ultimate goal of providing a pleasant API for rapid analysis of 'omic experiments' in an interactive environment.
The R bioinformatics community has already provided implementations of similar functionality. Examples include DESeqDataSet (from the package DESeq2), MRexperiment (from the package metagenomeSeq), and the phyloseq class (from the package phyloseq). To my knowledge, no similarly powerful functionality is available to users of python.
This package aims to enable powerful manipulation of large datasets from comparative omic experiments. Such experiments are numerous, and include amplicon microbiome studies, microbial metagenomics, transcriptomics/RNA-Seq, metabolomics, proteomics, and microarrays. As the Biological Observation Matrix (BIOM) format documentation puts it, "they all share an underlying, core data type: the 'sample by observation contingency table'".
The OmicExperiment object (perhaps intended to be subclassed for each of the specific use cases above) is the centrepiece of this functionality.
The philosophy of this package is to build upon the solid foundations of the python scientific stack and to avoid re-inventing the wheel. Packages such as numpy and pandas are powerful, optimized libraries for dealing with matrix and tabular data, respectively. This package's main dependency is pandas (along with the rest of the scientific python stack).
As of this date, I have started mainly with implementing functionality in the MicrobiomeExperiment subclass of the parent OmicExperiment class, as this is what I have recently been working on in my research.
Interactive scientific computing environments are an excellent way to apply and demonstrate the analysis of scientific datasets. The Jupyter notebook is perhaps the prime example of such an environment, perfectly suited for rapid iteration and exploratory analysis, as well as for documentation and literate programming.
In [1]:
%load_ext autoreload
%autoreload 2
from omicexperiment.experiment.microbiome import MicrobiomeExperiment
mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"
#the MicrobiomeExperiment constructor currently takes three parameters
exp = MicrobiomeExperiment(biom, mapping, tax)
#the first parameter provides the data (counts) DataFrame
#the second parameter provides the mapping dataframe
#the third parameter provides the taxonomy dataframe
The parameters passed to the MicrobiomeExperiment constructor can be either file paths, as above (with parsing handled internally), or pre-built pandas DataFrame objects (the latter is demonstrated for the taxonomy argument below); see the sketch after this paragraph.
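For illustration, here is a minimal sketch of constructing an experiment from pre-built pandas objects instead of file paths. The file name my_otu_table.tsv is hypothetical, and the DataFrame-accepting behaviour of the first two arguments is an assumption here (only the taxonomy argument is explicitly confirmed to accept a DataFrame later in this notebook):

import pandas as pd

#hypothetical: build the tables yourself, then hand them to the constructor
data_df = pd.read_csv("my_otu_table.tsv", sep="\t", index_col=0)    #counts table (assumed file)
mapping_df = pd.read_csv("example_map.tsv", sep="\t", index_col=0)  #sample metadata
exp2 = MicrobiomeExperiment(data_df, mapping_df, tax)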
The data DataFrame (attribute data_df) is the central dataframe of the Experiment object. It is what we refer to as the frequency table (the OTU table of microbiome studies, etc.).
data_df always has the 'observations' (or OTUs, in microbiome OTU tables) as the rows, so that the observation names/ids constitute the index of the dataframe.
The samples, on the other hand, always constitute the columns, with the sample names as the column names.
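To make this layout concrete, here is a minimal toy frequency table built by hand (the OTU ids and sample names are made up):

import pandas as pd

toy_data_df = pd.DataFrame(
    {'sample0': [120, 30, 0],
     'sample1': [5, 200, 45]},
    index=pd.Index(['otu_a', 'otu_b', 'otu_c'], name='otu'))
#rows (the index) are the observations/OTUs; columns are the samples
toy_data_df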
In [2]:
exp.data_df
Out[2]:
In [3]:
print(exp.data_df.columns)
#or equivalently, via the shorthand 'samples' attribute:
print(exp.samples)
The observations are obtained through the index of the data DataFrame.
Alternatively, the Experiment object exposes a shorthand 'observations' attribute.
In the example below, the otu ids are actually hashes (of the OTU representative sequences), which allows easy deduplication (i.e. clustering at the 100% identity level) whilst allowing for pooling of experiments, following the excellent gist by Greg Caporaso (https://gist.github.com/gregcaporaso/f3c042e5eb806349fa18).
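The idea behind such ids can be sketched with the standard library's hashlib; the sequence below is made up, and the choice of md5 here is illustrative rather than a claim about the package's implementation:

import hashlib

seq = "ACGTACGTACGTACGT"  #hypothetical OTU representative sequence
otu_id = hashlib.md5(seq.encode('utf-8')).hexdigest()
#identical sequences always map to the same id,
#so observation tables from different runs can be pooled safely
print(otu_id)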
In [4]:
print(exp.data_df.index)
#or equivalently, via the shorthand 'observations' attribute:
print(exp.observations)
The mapping dataframe (attribute mapping_df) is the dataframe which holds the sample metadata or "phenotypic" data.
mapping_df has the sample ids as the rows, so that they constitute the index of that dataframe.
The columns contain the various variables of interest to our experiment.
The mapping dataframe below was read from a tsv file formatted according to QIIME conventions; the first column holds the sample ids. You can see the various variables of interest in the columns. The group column, for example, contains the disease state of the samples in our example table: CRSwNP (i.e. Chronic Rhinosinusitis with Nasal Polyposis), CRSsNP (i.e. Chronic Rhinosinusitis sans Nasal Polyposis), and "healthy" controls.
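For reference, such a QIIME-style mapping file can also be read with plain pandas; a minimal sketch, assuming the QIIME convention of a '#SampleID' header on the first column:

import pandas as pd

mapping_df = pd.read_csv("example_map.tsv", sep="\t")
mapping_df = mapping_df.set_index("#SampleID")  #the sample ids become the index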
In [5]:
exp.mapping_df
Out[5]:
In [6]:
#An example of filtering the mapping dataframe,
#using the excellent pandas boolean-indexing feature
exp.mapping_df[exp.mapping_df.group == 'CRSwNP']
#Note:
#similar operations are available on the counts and taxonomy dataframes, since
#all of these are pandas DataFrame objects
Out[6]:
The taxonomy dataframe (attribute taxonomy_df) is the dataframe which holds the taxonomy assignment information for our OTUs.
This dataframe is only available on MicrobiomeExperiment objects; it is not available on the base Experiment class.
As of the time of writing this notebook, only taxonomy assignment txt files following the QIIME convention are supported. Alternatively, a manually constructed taxonomy pandas DataFrame object (e.g. imported from a different pipeline or program using your own in-house glue code) can be passed to the constructor of the MicrobiomeExperiment object.
The taxonomy dataframe's index contains the "OTU" ids. The index object thus has the name 'otu'.
The taxonomy dataframe below is such an example.
In the module omicexperiment.taxonomy, the function tax_as_dataframe is responsible for parsing a taxonomy assignment txt file formatted according to QIIME's conventions.
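A sketch of using it directly (the exact signature is assumed here to take just the path to the assignments file):

from omicexperiment.taxonomy import tax_as_dataframe

tax_df = tax_as_dataframe("blast_tax_assignments.txt")  #path assumed as the sole argument
tax_df.head()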
In [7]:
exp.taxonomy_df
Out[7]:
In [8]:
#The taxonomic levels are as follows:
from omicexperiment.taxonomy import TAX_RANKS
print(TAX_RANKS)
#The taxonomy dataframe also has extra columns
list(exp.taxonomy_df.columns)
Out[8]:
The extra columns in the taxonomy dataframe include the full assignment string (the tax column) and the taxonomic resolution of each assignment (the rank_resolution column), both demonstrated below.
In [9]:
#Rank resolution example
for otu_id, row in exp.taxonomy_df[['tax', 'rank_resolution']].iterrows():
    print(row['tax'])
    print("RESOLUTION: " + str(row['rank_resolution']))
    print("\n")
In [10]:
#Note that the rank_resolution column (a pandas Series) is actually of an ordered categorical dtype.
#It thus supports the comparison-based filtering shown in the following cells
exp.taxonomy_df.rank_resolution
Out[10]:
In [11]:
#only two OTUs are assigned at the species level (i.e. resolution > genus)
exp.taxonomy_df.rank_resolution > 'genus'
Out[11]:
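The mechanics of such ordered comparisons can be reproduced with plain pandas, independently of omicexperiment (the rank list below is written out by hand and only mirrors TAX_RANKS approximately):

import pandas as pd

ranks = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
res = pd.Series(['genus', 'species', 'family'],
                dtype=pd.CategoricalDtype(categories=ranks, ordered=True))
res > 'genus'  #True only for the 'species' entry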
It is good practice to start the experiment with a 'raw counts' matrix, since it is very easy to convert it to a relative-abundance (by sample) matrix. This is afforded through the to_relative_abundance method, which transforms the data so that each sample's counts add up to 100.
Note that the to_relative_abundance method actually instantiates a new experiment object holding the newly transformed data DataFrame.
This API is provided to mimic various pandas DataFrame methods, which usually return a new DataFrame object; this allows retaining the older frames, as well as "chaining of methods", which is very important in interactive environments where economical typing is desirable. This pattern is also called a "Fluent interface" (https://en.wikipedia.org/wiki/Fluent_interface).
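In plain pandas terms, the transformation is essentially a per-column normalization; a minimal sketch (not necessarily the package's actual implementation):

rel_df = exp.data_df / exp.data_df.sum(axis=0) * 100
rel_df.sum()  #every sample column now sums to 100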
In [12]:
new_exp = exp.to_relative_abundance()
new_exp.data_df
#OR, similarly:
#to demonstrate method chaining
exp.to_relative_abundance().data_df
Out[12]:
In [13]:
#a 'rarefy' (i.e. subsampling) method is also available
#for the MicrobiomeExperiment subclass
#it discards samples with total counts lower than a cutoff value,
#then subsamples the counts of each remaining sample down to that cutoff
#rarefaction has been shown to benefit various downstream diversity analyses,
#and the functionality is built into qiime and various other pipelines
print(exp.data_df.sum())
rarefied_df = exp.rarefy(90000).data_df  #this will discard sample0
rarefied_df
Out[13]:
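For intuition, subsampling a single sample's counts without replacement can be sketched with numpy; a toy illustration (not the package's implementation), and the 'sample1' column name is hypothetical:

import numpy as np
import pandas as pd

def rarefy_counts(counts, depth, seed=0):
    #expand the counts into a pool of individual observation labels,
    #draw 'depth' of them without replacement, then re-tally per observation
    rng = np.random.default_rng(seed)
    pool = np.repeat(counts.index.values, counts.values.astype(int))
    drawn = rng.choice(pool, size=depth, replace=False)
    return pd.Series(drawn).value_counts().reindex(counts.index, fill_value=0)

rarefy_counts(exp.data_df['sample1'], 90000)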
In [14]:
#you can also specify a num_reps argument,
#i.e. how many randomizations are performed before returning the final (rarefied) data
rarefied_df = exp.rarefy(90000, num_reps=50).data_df #this will discard sample0
rarefied_df
Out[14]:
In [15]:
#the with_data_df method replaces the data_df in the experiment object
#with the one passed as the parameter to the method
#and, importantly, constructs a new Experiment (this paradigm will keep repeating!)
exp.with_data_df(rarefied_df).data_df
#Note:
#similarly, there is a with_mapping_df method
#and perhaps in the future a with_taxonomy_df method
Out[15]:
In the next notebook/chapter, we will demonstrate the important 'filter' and 'efilter' methods of the Experiment objects.