The Experiment object

The OmicExperiment object is the heart of the omicexperiment package. It has the ultimate goal of providing a pleasant API for rapid analysis of 'omic experiments' in an interactive environment.

The R bioinformatics community has already provided similar implementations for similar functionality. Examples include DESeqDataSet (from the package DeSeq2), MRExperiment (from the package metagenomeSeq), phyloseq-class (from the package phyloseq). To my knowledge, there exists no similar powerful functionality available to users of python.

Powerful manipulation of large datasets in comparative omic experiment (and these are numerous, and include amplicon microbiome studies, microbial metagenomics, transcriptomics/RNA-Seq, metabolomics, and proteomics, and microarray). According to the Biological Observation Matrix (BIOM) format: "they all share an underlying, core data type: the “sample by observation contingency table”.

The OmicExperiment object (perhaps intended to be subclassed for each specific use-case of the above) is the centrepiece of this functionality.

The philosophy of this package is to build upon solid foundations of the python scientific stack and try not to re-invent the wheel. Packages such as numpy and pandas are powerful optimized libraries in dealing with matrix and tabular data, respectively. This package's main dependency is pandas objects (as well as the rest of the scientific python stack).

As of this date, I have started mainly with implementing functionality in the MicrbiomeExperiment subclass of the parent OmicExperiment class, as this is what I have been recently working on in my research.

Clever scientific interactive computing environments are an excellent way to apply and demonstrate the analysis of scientific datasets. The Jupyter notebook is perhaps the prime example of such an environment, perfectly suited for rapid iteration and exploratory analysis, as well as documentation and literate programming.

Instantiating a MicrobiomeExperiment object

Let us start by showing how we start an experiment; this time, a microbiome experiment.



In [1]:

    
%load_ext autoreload
%autoreload 2

from omicexperiment.experiment.microbiome import MicrobiomeExperiment

mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"

#the MicrobiomeExperiment constructor currently needs three parameters
exp = MicrobiomeExperiment(biom, mapping,tax)
#the first parameter is the _data_ DataFrame
#the second parameter is the mapping dataframe
#the third paramenter is the taxonomy dataframe

The parameters passed to MicrobiomeExperiment constructor could be:

pandas dataframes
a filepath (str) to a csv or tsv file
a filepath (str) to a biom file (in case of the data DataFrame)
a Table object from the biom python package (in case of the data DataFrame)
a filepath (str) to a Qiime taxonomy assignment txt file - other implementations will be provided in the future, with possible subclassing of MicrobiomeDataFrame into a QiimeExperiment to provide optimal and specific Qiime functionality, while allowing integration with other pipeline software conventions

The data DataFrame

The data DataFrame (attribute data_df) is the central dataframe of the Experiment object. It is what we refer to as the frequency table (OTU table in microbiome studies, etc.).

data_df always has the 'observations' (or OTUs in microbiome OTU tables) as the rows, so that the observatin names/ids constitute the index of the dataframe.

On the other hand, the samples always constitute the sample names.



In [2]:

    
exp.data_df









    Out[2]:






  
    
      
      1234
      9876
      sample0
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      225872
      0
      2
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      2
      1
      0
      91911
      100428
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      133138
      0
      0
      21
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0
      0
      86870
      0
      0

To list the samples:

The sample names are easily obtained through looking at the columns of your data_df.

Alternatively, the Experiment object exposes a shorthand 'samples' attribute, which is essentially as list(exp.data_df.columns)



In [3]:

    
print(exp.data_df.columns)
#
print("#OR")
#
print(exp.samples)









    



Index(['1234', '9876', 'sample0', 'sample1', 'sample2'], dtype='object')
#OR
['1234', '9876', 'sample0', 'sample1', 'sample2']

To list the observations:

The observations are obtained through the index of the data DataFrame.

Alternatively, the Experiment object exposes a shorthand 'observations' attribute.

In the example below, the otu ids are actually hashes, which allows easy deduplication (i.e. clustering at 100% level) whilst allowing for pooling of experiments. Following the excellent gist by Greg Caporaso https://gist.github.com/gregcaporaso/f3c042e5eb806349fa18 .



In [4]:

    
print(exp.data_df.index)
#
print("#OR")
#
print(exp.observations)









    



Index(['2f328e48f4252bbade0dd7f66b0d5bf1b09617dd',
       'ae0ddda08027454fdb5db77c96b94691b8274cdd',
       '8f52abc02aed2ce6c63be04570a7e609f9cdac5f',
       '3cb3c2347cdbe128b645e432f4dcbca702e0e8e3',
       '8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0'],
      dtype='object')
#OR
['2f328e48f4252bbade0dd7f66b0d5bf1b09617dd', 'ae0ddda08027454fdb5db77c96b94691b8274cdd', '8f52abc02aed2ce6c63be04570a7e609f9cdac5f', '3cb3c2347cdbe128b645e432f4dcbca702e0e8e3', '8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0']

The Mapping DataFrame

The mapping dataframe (attribute mapping_df) is the dataframe which holds the sample metadata or "phenotypic" data.

mapping_df has the sample ids as the rows, as to constitute the index of that dataframe.

The columns contain the various variables of interest to our experiment.

The mapping dataframe below is an example of a tsv file, formatted according to QIIME conventions. The first column also has the sample ids. You can see various variables in the columns. The group column, for example, contains the disease state of the samples in our example table. CRSwNP (i.e. Chronic Rhinosinusitis with Nasal Polyposis), CRSsNP (i.e. Chronic Rhinosinusitis sans Nasal Polyposis) and "healthy" controls.



In [5]:

    
exp.mapping_df









    Out[5]:






  
    
      
      #SampleID
      BarcodeSequence
      LinkerPrimerSequence
      Description
      patient_id
      group
      asthma
      vas
      amplicon_conc
    
    
      #SampleID
      
      
      
      
      
      
      
      
      
    
  
  
    
      sample0
      sample0
      ACTGAGCG
      AAAA
      sample0
      132
      CRSsNP
      0
      49
      4.3
    
    
      sample1
      sample1
      AAGAGGCA
      AAAA
      sample1
      315
      CRSwNP
      1
      43
      2.3
    
    
      sample2
      sample2
      ATCTCAGG
      AAAA
      sample2
      742
      CRSsNP
      0
      23
      3.2
    
    
      1234
      1234
      ATGCGCAG
      AAAA
      1234
      927
      control
      1
      87
      1.0
    
    
      9876
      9876
      TAGGCATG
      AAAA
      9876
      538
      CRSwNP
      1
      12
      1.3



In [6]:

    
#An example of filtering the mapping dataframe,
#as per the excellent pandas indexing filtration feature
exp.mapping_df[exp.mapping_df.group == 'CRSwNP']
#Note:
#similar operations are available on the counts and taxonomy dataframes, since
#all of these are pandas DataFrame objects









    Out[6]:






  
    
      
      #SampleID
      BarcodeSequence
      LinkerPrimerSequence
      Description
      patient_id
      group
      asthma
      vas
      amplicon_conc
    
    
      #SampleID
      
      
      
      
      
      
      
      
      
    
  
  
    
      sample1
      sample1
      AAGAGGCA
      AAAA
      sample1
      315
      CRSwNP
      1
      43
      2.3
    
    
      9876
      9876
      TAGGCATG
      AAAA
      9876
      538
      CRSwNP
      1
      12
      1.3

The Taxonomy DataFrame

The taxonomy dataframe (attribute taxonomy_df) is the dataframe which holds the taxonomy assignment information for our OTUs.

This dataframe is only available on MicrobiomeExperiment results, and is not available on the base class Experiment.

As of the time of writing this notebook, only the taxonomy assignment txt files according to the QIIME convention has been implemented. Otherwise, a manually constructed taxonomy pandas DataFrame object (e.g., manually imported from a different pipeline or program using your own in-house glue code) can be passed to the constructor of the Microbiome Experiment object.

The taxonomy dataframe's index contains the "OTU" ids. The index object thus has the name 'otu'.

The taxonomy dataframe below is such an example.

In the module omicexperiment.taxonomy, the function tax_as_dataframe is responsible for opening up a taxonomy assignment txt file, formatted according to QIIME's conventions.



In [7]:

    
exp.taxonomy_df









    Out[7]:






  
    
      
      kingdom
      phylum
      class
      order
      family
      genus
      species
      rank_resolution
      tax
      otu
    
    
      otu
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      k__Fungi
      p__Ascomycota
      c__Eurotiomycetes
      o__Eurotiales
      f__Trichocomaceae
      g__Aspergillus
      s__Aspergillus bombycis
      species
      k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      k__Fungi
      p__Ascomycota
      c__Eurotiomycetes
      o__Eurotiales
      f__Trichocomaceae
      g__Aspergillus
      s__unidentified (g__Aspergillus)
      genus
      k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...
      ae0ddda08027454fdb5db77c96b94691b8274cdd
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      k__Fungi
      p__Ascomycota
      c__Dothideomycetes
      o__Pleosporales
      f__Pleosporaceae
      g__unidentified (f__Pleosporaceae)
      s__unidentified (f__Pleosporaceae)
      family
      k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      k__Fungi
      p__Ascomycota
      c__Eurotiomycetes
      o__Eurotiales
      f__Trichocomaceae
      g__Aspergillus
      s__Aspergillus flavus
      species
      k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      k__Fungi
      p__Ascomycota
      c__Dothideomycetes
      o__Pleosporales
      f__Pleosporaceae
      g__Lewia
      s__unidentified (g__Lewia)
      genus
      k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
    
    
      54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5
      k__unidentified (Unassigned)
      p__unidentified (Unassigned)
      c__unidentified (Unassigned)
      o__unidentified (Unassigned)
      f__unidentified (Unassigned)
      g__unidentified (Unassigned)
      s__unidentified (Unassigned)
      NaN
      No blast hit
      54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5

Taxonomic levels in taxonomy_df

What you can notice in the taxonomy dataframe above, is that it automatically runs code to separate the taxonomic assignment to various taxonomic levels.



In [8]:

    
#The taxonomic levels are as follows:
from omicexperiment.taxonomy import TAX_RANKS
print(TAX_RANKS)
#The taxonomy dataframe also has extra columns
list(exp.taxonomy_df.columns)









    



('kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species')






    Out[8]:





['kingdom',
 'phylum',
 'class',
 'order',
 'family',
 'genus',
 'species',
 'rank_resolution',
 'tax',
 'otu']

The extra columns in the taxonomy dataframe include:

otu: the otu id (also in the index)
tax: the raw taxonomy assignment (includes all levels, conventionally separated by a semi-colon ';')
rank_resolution: the rank at highest resolution that the taxonomy assignment method was able to provide. For example, if assignment could not identify a 'species' level, this means that the highest resolution rank was 'genus'. Unidentified levels occurs when the taxon at that level is equal to 'unidentified' or '' (empty string) or 'Unassigned' or 'No blast hit'. These levels then are given the 'unidentified' label. Note that when the assignment was 'Unassigned' or 'No blast hit', or a kingdom level assignment was not available, the rank_resolution will be 'nan'.



In [9]:

    
#Rank Resolutions example
for row in exp.taxonomy_df[['tax', 'rank_resolution']].iterrows():
    print(str(row[1][0]))
    print("RESOLUTION: " + str(row[1][1]))
    print("\n")









    



k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Trichocomaceae;g__Aspergillus;s__Aspergillus bombycis
RESOLUTION: species


k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Trichocomaceae;g__Aspergillus
RESOLUTION: genus


k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae
RESOLUTION: family


k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eurotiales;f__Trichocomaceae;g__Aspergillus;s__Aspergillus flavus
RESOLUTION: species


k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Lewia
RESOLUTION: genus


No blast hit
RESOLUTION: nan



In [10]:

    
#Note that the rank_resolution column (a pandas Series) is actually of an ordered Category type.
#It thus support the filtration methods in the following cells
exp.taxonomy_df.rank_resolution









    Out[10]:





otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd    species
ae0ddda08027454fdb5db77c96b94691b8274cdd      genus
8f52abc02aed2ce6c63be04570a7e609f9cdac5f     family
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3    species
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0      genus
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5        NaN
Name: rank_resolution, dtype: category
Categories (7, object): [kingdom < phylum < class < order < family < genus < species]



In [11]:

    
#only two OTUs are assigned at the species level ( > genus )
exp.taxonomy_df.rank_resolution > 'genus'









    Out[11]:





otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd     True
ae0ddda08027454fdb5db77c96b94691b8274cdd    False
8f52abc02aed2ce6c63be04570a7e609f9cdac5f    False
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3     True
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0    False
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5    False
Name: rank_resolution, dtype: bool

Other OmicExperiment objects functionality

It is good practice to start the experiment with a 'raw counts' matrix. Since it is very easy to convert that to a relative abundance (by sample) type matrix. This is afforded through the to_relative_abundance method. This transforms the data such that each sample counts add up to a 100.

Note that the to_relative_abundance method actually instantiates a new experiment object with the newly transformed data DataFrame.

This API is provided to mimic various pandas DataFrame methods, which usually provides a new DataFrame object, and thus allows retaining older frames, as well as allow "chaining of methods", which is very important in interactive environments where economical typing is desirable. This pattern is also called "Fluent interface" (https://en.wikipedia.org/wiki/Fluent_interface).



In [12]:

    
new_exp = exp.to_relative_abundance()
new_exp.data_df

#OR, similarly:
#to demonstrate method chaining
exp.to_relative_abundance().data_df









    Out[12]:






  
    
      
      1234
      9876
      sample0
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0.000000
      99.999557
      0
      0.002175
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      0.001502
      0.000443
      0
      99.974982
      100
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      99.998498
      0.000000
      0
      0.022842
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0.000000
      0.000000
      0
      0.000000
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0.000000
      0.000000
      100
      0.000000
      0



In [13]:

    
#a 'rarefy' (i.e. subsampling) method is also available
#for the MicrobiomeExperiment subclass
print(exp.data_df.sum())
rarefied_df = exp.rarefy(90000).data_df #this will discard sample0
rarefied_df
#This method will discard samples with counts lower than a cutoff value,
#then subsample the counts of each sample to that cutoff
#this method was shown to be of benefit for various downstream diversity analyses
#the functionality is built into qiime and various other pipelines









    



1234       133140
9876       225873
sample0     86870
sample1     91934
sample2    100428
dtype: float64






    Out[13]:






  
    
      
      1234
      9876
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      90000
      2
      0
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      89997
      0
      17
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      3
      0
      89981
      90000



In [14]:

    
#you can also specify a num_reps argument
#i.e. (how many randomizations are done before returning the ultimate (rarefied) data
rarefied_df = exp.rarefy(90000, num_reps=50).data_df #this will discard sample0
rarefied_df









    Out[14]:






  
    
      
      1234
      9876
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      89999
      2
      0
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      90000
      0
      32
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      0
      1
      89966
      90000



In [15]:

    
#the with_data_df method replaces the data_df in the experiment object
#with the one passed as the parameter to the method
#and!! constructs a new Experiment (this paradigm will keep repeating!)
exp.with_data_df(rarefied_df).data_df

#Note:
#similarly, there is a with_mapping_df method
#and perhaps in the future a with_taxonomy_df method









    Out[15]:






  
    
      
      1234
      9876
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      89999
      2
      0
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      90000
      0
      32
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      0
      1
      89966
      90000

In the next notebook/chapter, we will demonstrate the important 'filter' and 'efilter' methods of the Experiment objects.

Missing MicrobiomeExperiment functionality

This is a list to keep track of missing functionality.

Implementation of a Phylogeny/Tree dataframe

	1234	9876	sample0	sample1	sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	0	225872	0	2	0
ae0ddda08027454fdb5db77c96b94691b8274cdd	2	1	0	91911	100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	133138	0	0	21	0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3	0	0	0	0	0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0	0	0	86870	0	0

	#SampleID	BarcodeSequence	LinkerPrimerSequence	Description	patient_id	group	asthma	vas	amplicon_conc
#SampleID
sample0	sample0	ACTGAGCG	AAAA	sample0	132	CRSsNP	0	49	4.3
sample1	sample1	AAGAGGCA	AAAA	sample1	315	CRSwNP	1	43	2.3
sample2	sample2	ATCTCAGG	AAAA	sample2	742	CRSsNP	0	23	3.2
1234	1234	ATGCGCAG	AAAA	1234	927	control	1	87	1.0
9876	9876	TAGGCATG	AAAA	9876	538	CRSwNP	1	12	1.3

	kingdom	phylum	class	order	family	genus	species	rank_resolution	tax	otu
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	k__Fungi	p__Ascomycota	c__Eurotiomycetes	o__Eurotiales	f__Trichocomaceae	g__Aspergillus	s__Aspergillus bombycis	species	k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...	2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
ae0ddda08027454fdb5db77c96b94691b8274cdd	k__Fungi	p__Ascomycota	c__Eurotiomycetes	o__Eurotiales	f__Trichocomaceae	g__Aspergillus	s__unidentified (g__Aspergillus)	genus	k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...	ae0ddda08027454fdb5db77c96b94691b8274cdd
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	k__Fungi	p__Ascomycota	c__Dothideomycetes	o__Pleosporales	f__Pleosporaceae	g__unidentified (f__Pleosporaceae)	s__unidentified (f__Pleosporaceae)	family	k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...	8f52abc02aed2ce6c63be04570a7e609f9cdac5f
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3	k__Fungi	p__Ascomycota	c__Eurotiomycetes	o__Eurotiales	f__Trichocomaceae	g__Aspergillus	s__Aspergillus flavus	species	k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...	3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0	k__Fungi	p__Ascomycota	c__Dothideomycetes	o__Pleosporales	f__Pleosporaceae	g__Lewia	s__unidentified (g__Lewia)	genus	k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...	8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5	k__unidentified (Unassigned)	p__unidentified (Unassigned)	c__unidentified (Unassigned)	o__unidentified (Unassigned)	f__unidentified (Unassigned)	g__unidentified (Unassigned)	s__unidentified (Unassigned)	NaN	No blast hit	54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5

	1234	9876	sample0	sample1	sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	0.000000	99.999557	0	0.002175	0
ae0ddda08027454fdb5db77c96b94691b8274cdd	0.001502	0.000443	0	99.974982	100
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	99.998498	0.000000	0	0.022842	0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3	0.000000	0.000000	0	0.000000	0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0	0.000000	0.000000	100	0.000000	0

	1234	9876	sample1	sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	0	90000	2	0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	89997	0	17	0
ae0ddda08027454fdb5db77c96b94691b8274cdd	3	0	89981	90000

	1234	9876	sample1	sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	0	89999	2	0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	90000	0	32	0
ae0ddda08027454fdb5db77c96b94691b8274cdd	0	1	89966	90000