The Experiment object

The OmicExperiment object is the heart of the omicexperiment package. It has the ultimate goal of providing a pleasant API for rapid analysis of 'omic experiments' in an interactive environment.

The R bioinformatics community has already provided similar implementations for similar functionality. Examples include DESeqDataSet (from the package DeSeq2), MRExperiment (from the package metagenomeSeq), phyloseq-class (from the package phyloseq). To my knowledge, there exists no similar powerful functionality available to users of python.

Powerful manipulation of large datasets in comparative omic experiment (and these are numerous, and include amplicon microbiome studies, microbial metagenomics, transcriptomics/RNA-Seq, metabolomics, and proteomics, and microarray). According to the Biological Observation Matrix (BIOM) format: "they all share an underlying, core data type: the “sample by observation contingency table”.

The OmicExperiment object (perhaps intended to be subclassed for each specific use-case of the above) is the centrepiece of this functionality.

The philosophy of this package is to build upon solid foundations of the python scientific stack and try not to re-invent the wheel. Packages such as numpy and pandas are powerful optimized libraries in dealing with matrix and tabular data, respectively. This package's main dependency is pandas objects (as well as the rest of the scientific python stack).

As of this date, I have started mainly with implementing functionality in the MicrbiomeExperiment subclass of the parent OmicExperiment class, as this is what I have been recently working on in my research.

Clever scientific interactive computing environments are an excellent way to apply and demonstrate the analysis of scientific datasets. The Jupyter notebook is perhaps the prime example of such an environment, perfectly suited for rapid iteration and exploratory analysis, as well as documentation and literate programming.

Instantiating a MicrobiomeExperiment object

Let us start by showing how we start an experiment; this time, a microbiome experiment.


In [1]:
%load_ext autoreload
%autoreload 2

from omicexperiment.experiment.microbiome import MicrobiomeExperiment

mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"

#the MicrobiomeExperiment constructor currently needs three parameters
exp = MicrobiomeExperiment(biom, mapping,tax)
#the first parameter is the _data_ DataFrame
#the second parameter is the mapping dataframe
#the third paramenter is the taxonomy dataframe

The parameters passed to MicrobiomeExperiment constructor could be:

  • pandas dataframes
  • a filepath (str) to a csv or tsv file
  • a filepath (str) to a biom file (in case of the data DataFrame)
  • a Table object from the biom python package (in case of the data DataFrame)
  • a filepath (str) to a Qiime taxonomy assignment txt file - other implementations will be provided in the future, with possible subclassing of MicrobiomeDataFrame into a QiimeExperiment to provide optimal and specific Qiime functionality, while allowing integration with other pipeline software conventions

The data DataFrame

The data DataFrame (attribute data_df) is the central dataframe of the Experiment object. It is what we refer to as the frequency table (OTU table in microbiome studies, etc.).

data_df always has the 'observations' (or OTUs in microbiome OTU tables) as the rows, so that the observatin names/ids constitute the index of the dataframe.

On the other hand, the samples always constitute the sample names.


In [2]:
exp.data_df


Out[2]:
1234 9876 sample0 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225872 0 2 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 2 1 0 91911 100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 133138 0 0 21 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 86870 0 0

To list the samples:

The sample names are easily obtained through looking at the columns of your data_df.

Alternatively, the Experiment object exposes a shorthand 'samples' attribute, which is essentially as list(exp.data_df.columns)


In [3]:
print(exp.data_df.columns)
#
print("#OR")
#
print(exp.samples)


Index(['1234', '9876', 'sample0', 'sample1', 'sample2'], dtype='object')
#OR
['1234', '9876', 'sample0', 'sample1', 'sample2']

To list the observations:

The observations are obtained through the index of the data DataFrame.

Alternatively, the Experiment object exposes a shorthand 'observations' attribute.

In the example below, the otu ids are actually hashes, which allows easy deduplication (i.e. clustering at 100% level) whilst allowing for pooling of experiments. Following the excellent gist by Greg Caporaso https://gist.github.com/gregcaporaso/f3c042e5eb806349fa18 .


In [4]:
print(exp.data_df.index)
#
print("#OR")
#
print(exp.observations)


Index(['2f328e48f4252bbade0dd7f66b0d5bf1b09617dd',
       'ae0ddda08027454fdb5db77c96b94691b8274cdd',
       '8f52abc02aed2ce6c63be04570a7e609f9cdac5f',
       '3cb3c2347cdbe128b645e432f4dcbca702e0e8e3',
       '8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0'],
      dtype='object')
#OR
['2f328e48f4252bbade0dd7f66b0d5bf1b09617dd', 'ae0ddda08027454fdb5db77c96b94691b8274cdd', '8f52abc02aed2ce6c63be04570a7e609f9cdac5f', '3cb3c2347cdbe128b645e432f4dcbca702e0e8e3', '8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0']

The Mapping DataFrame

The mapping dataframe (attribute mapping_df) is the dataframe which holds the sample metadata or "phenotypic" data.

mapping_df has the sample ids as the rows, as to constitute the index of that dataframe.

The columns contain the various variables of interest to our experiment.

The mapping dataframe below is an example of a tsv file, formatted according to QIIME conventions. The first column also has the sample ids. You can see various variables in the columns. The group column, for example, contains the disease state of the samples in our example table. CRSwNP (i.e. Chronic Rhinosinusitis with Nasal Polyposis), CRSsNP (i.e. Chronic Rhinosinusitis sans Nasal Polyposis) and "healthy" controls.


In [5]:
exp.mapping_df


Out[5]:
#SampleID BarcodeSequence LinkerPrimerSequence Description patient_id group asthma vas amplicon_conc
#SampleID
sample0 sample0 ACTGAGCG AAAA sample0 132 CRSsNP 0 49 4.3
sample1 sample1 AAGAGGCA AAAA sample1 315 CRSwNP 1 43 2.3
sample2 sample2 ATCTCAGG AAAA sample2 742 CRSsNP 0 23 3.2
1234 1234 ATGCGCAG AAAA 1234 927 control 1 87 1.0
9876 9876 TAGGCATG AAAA 9876 538 CRSwNP 1 12 1.3

In [6]:
#An example of filtering the mapping dataframe,
#as per the excellent pandas indexing filtration feature
exp.mapping_df[exp.mapping_df.group == 'CRSwNP']
#Note:
#similar operations are available on the counts and taxonomy dataframes, since
#all of these are pandas DataFrame objects


Out[6]:
#SampleID BarcodeSequence LinkerPrimerSequence Description patient_id group asthma vas amplicon_conc
#SampleID
sample1 sample1 AAGAGGCA AAAA sample1 315 CRSwNP 1 43 2.3
9876 9876 TAGGCATG AAAA 9876 538 CRSwNP 1 12 1.3

The Taxonomy DataFrame

The taxonomy dataframe (attribute taxonomy_df) is the dataframe which holds the taxonomy assignment information for our OTUs.

This dataframe is only available on MicrobiomeExperiment results, and is not available on the base class Experiment.

As of the time of writing this notebook, only the taxonomy assignment txt files according to the QIIME convention has been implemented. Otherwise, a manually constructed taxonomy pandas DataFrame object (e.g., manually imported from a different pipeline or program using your own in-house glue code) can be passed to the constructor of the Microbiome Experiment object.

The taxonomy dataframe's index contains the "OTU" ids. The index object thus has the name 'otu'.

The taxonomy dataframe below is such an example.

In the module omicexperiment.taxonomy, the function tax_as_dataframe is responsible for opening up a taxonomy assignment txt file, formatted according to QIIME's conventions.


In [7]:
exp.taxonomy_df


Out[7]:
kingdom phylum class order family genus species rank_resolution tax otu
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__Aspergillus bombycis species k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... 2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
ae0ddda08027454fdb5db77c96b94691b8274cdd k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__unidentified (g__Aspergillus) genus k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... ae0ddda08027454fdb5db77c96b94691b8274cdd
8f52abc02aed2ce6c63be04570a7e609f9cdac5f k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Pleosporaceae g__unidentified (f__Pleosporaceae) s__unidentified (f__Pleosporaceae) family k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P... 8f52abc02aed2ce6c63be04570a7e609f9cdac5f
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__Aspergillus flavus species k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... 3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Pleosporaceae g__Lewia s__unidentified (g__Lewia) genus k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P... 8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5 k__unidentified (Unassigned) p__unidentified (Unassigned) c__unidentified (Unassigned) o__unidentified (Unassigned) f__unidentified (Unassigned) g__unidentified (Unassigned) s__unidentified (Unassigned) NaN No blast hit 54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5

Taxonomic levels in taxonomy_df

What you can notice in the taxonomy dataframe above, is that it automatically runs code to separate the taxonomic assignment to various taxonomic levels.


In [8]:
#The taxonomic levels are as follows:
from omicexperiment.taxonomy import TAX_RANKS
print(TAX_RANKS)
#The taxonomy dataframe also has extra columns
list(exp.taxonomy_df.columns)


('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')
Out[8]:
['kingdom',
 'phylum',
 'class',
 'order',
 'family',
 'genus',
 'species',
 'rank_resolution',
 'tax',
 'otu']

The extra columns in the taxonomy dataframe include:

  • otu: the otu id (also in the index)
  • tax: the raw taxonomy assignment (includes all levels, conventionally separated by a semi-colon ';')
  • rank_resolution: the rank at highest resolution that the taxonomy assignment method was able to provide. For example, if assignment could not identify a 'species' level, this means that the highest resolution rank was 'genus'. Unidentified levels occurs when the taxon at that level is equal to 'unidentified' or '' (empty string) or 'Unassigned' or 'No blast hit'. These levels then are given the 'unidentified' label. Note that when the assignment was 'Unassigned' or 'No blast hit', or a kingdom level assignment was not available, the rank_resolution will be 'nan'.

In [9]:
#Rank Resolutions example
for row in exp.taxonomy_df[['tax', 'rank_resolution']].iterrows():
    print(str(row[1][0]))
    print("RESOLUTION: " + str(row[1][1]))
    print("\n")


k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Trichocomaceae;g__Aspergillus;s__Aspergillus bombycis
RESOLUTION: species


k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Trichocomaceae;g__Aspergillus
RESOLUTION: genus


k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae
RESOLUTION: family


k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Trichocomaceae;g__Aspergillus;s__Aspergillus flavus
RESOLUTION: species


k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Lewia
RESOLUTION: genus


No blast hit
RESOLUTION: nan



In [10]:
#Note that the rank_resolution column (a pandas Series) is actually of an ordered Category type.
#It thus support the filtration methods in the following cells
exp.taxonomy_df.rank_resolution


Out[10]:
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd    species
ae0ddda08027454fdb5db77c96b94691b8274cdd      genus
8f52abc02aed2ce6c63be04570a7e609f9cdac5f     family
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3    species
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0      genus
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5        NaN
Name: rank_resolution, dtype: category
Categories (7, object): [kingdom < phylum < class < order < family < genus < species]

In [11]:
#only two OTUs are assigned at the species level ( > genus )
exp.taxonomy_df.rank_resolution > 'genus'


Out[11]:
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd     True
ae0ddda08027454fdb5db77c96b94691b8274cdd    False
8f52abc02aed2ce6c63be04570a7e609f9cdac5f    False
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3     True
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0    False
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5    False
Name: rank_resolution, dtype: bool

Other OmicExperiment objects functionality

It is good practice to start the experiment with a 'raw counts' matrix. Since it is very easy to convert that to a relative abundance (by sample) type matrix. This is afforded through the to_relative_abundance method. This transforms the data such that each sample counts add up to a 100.

Note that the to_relative_abundance method actually instantiates a new experiment object with the newly transformed data DataFrame.

This API is provided to mimic various pandas DataFrame methods, which usually provides a new DataFrame object, and thus allows retaining older frames, as well as allow "chaining of methods", which is very important in interactive environments where economical typing is desirable. This pattern is also called "Fluent interface" (https://en.wikipedia.org/wiki/Fluent_interface).


In [12]:
new_exp = exp.to_relative_abundance()
new_exp.data_df

#OR, similarly:
#to demonstrate method chaining
exp.to_relative_abundance().data_df


Out[12]:
1234 9876 sample0 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0.000000 99.999557 0 0.002175 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 0.001502 0.000443 0 99.974982 100
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 99.998498 0.000000 0 0.022842 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0.000000 0.000000 0 0.000000 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0.000000 0.000000 100 0.000000 0

In [13]:
#a 'rarefy' (i.e. subsampling) method is also available
#for the MicrobiomeExperiment subclass
print(exp.data_df.sum())
rarefied_df = exp.rarefy(90000).data_df #this will discard sample0
rarefied_df
#This method will discard samples with counts lower than a cutoff value,
#then subsample the counts of each sample to that cutoff
#this method was shown to be of benefit for various downstream diversity analyses
#the functionality is built into qiime and various other pipelines


1234       133140
9876       225873
sample0     86870
sample1     91934
sample2    100428
dtype: float64
Out[13]:
1234 9876 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 90000 2 0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 89997 0 17 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 3 0 89981 90000

In [14]:
#you can also specify a num_reps argument
#i.e. (how many randomizations are done before returning the ultimate (rarefied) data
rarefied_df = exp.rarefy(90000, num_reps=50).data_df #this will discard sample0
rarefied_df


Out[14]:
1234 9876 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 89999 2 0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 90000 0 32 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 0 1 89966 90000

In [15]:
#the with_data_df method replaces the data_df in the experiment object
#with the one passed as the parameter to the method
#and!! constructs a new Experiment (this paradigm will keep repeating!)
exp.with_data_df(rarefied_df).data_df

#Note:
#similarly, there is a with_mapping_df method
#and perhaps in the future a with_taxonomy_df method


Out[15]:
1234 9876 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 89999 2 0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 90000 0 32 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 0 1 89966 90000

In the next notebook/chapter, we will demonstrate the important 'filter' and 'efilter' methods of the Experiment objects.

Missing MicrobiomeExperiment functionality

This is a list to keep track of missing functionality.

  • Implementation of a Phylogeny/Tree dataframe