A syntactic sugar API is provided for common filtering operations on the components of the experiment objects (for example, the three dataframes of the MicrobiomeExperiment object).
The rationale behind providing such syntactic sugar in the API is that working with three dataframes at the same time can be taxing, as together they describe multidimensional data (and metadata!).
Again, rapid analysis and economy of typing are the ultimate aspirations here: avoiding repetitive boilerplate, especially since nearly the same operations recur in the downstream analyses of many typical omic experiments.
This notebook/chapter provides various examples (currently for MicrobiomeExperiment) and can be regarded as a cookbook for operations commonly performed in a microbial amplicon metabarcoding experiment.
In [1]:
%load_ext autoreload
%autoreload 2
#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment
mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"
exp = MicrobiomeExperiment(biom, mapping, tax)
The basis of the filter functionality is two methods on the experiment objects, used throughout the examples below: dapply and apply.
The dapply method filters the data_df according to the filter expression passed, as we explain below. It is important to remember that the data_df is our main dataframe (our contingency table or matrix), and that dapply therefore only filters the data_df and not, for example, the mapping_df. The method returns a new pandas DataFrame object (the new "filtered" or "modified" data_df).
The apply method provides the same functionality as dapply. The only difference is that apply follows the paradigm of returning a whole new experiment object, rather than just a stand-alone data DataFrame object. As explained before, this paradigm is helpful as it allows method chaining.
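To make the distinction concrete, here is a minimal sketch using the example data loaded above (the Sample filter used here is properly imported and introduced just below):

filtered_df = exp.dapply(Sample.count > 90000)  #returns a plain pandas DataFrame (a new data_df)
filtered_exp = exp.apply(Sample.count > 90000)  #returns a new MicrobiomeExperiment object
#because apply returns an experiment, further filters can be chained onto it
filtered_exp.dapply(Sample.att.asthma == 1)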
From the filters subpackage, you can import the various filters:
from omicexperiment.transforms.filters import Sample
from omicexperiment.transforms.filters import Observation
from omicexperiment.transforms.filters import Taxonomy
These "filters" are also provided on the MicrobiomeExperiment object, as shortcuts.
Taxonomy = exp.Taxonomy
#OR
from omicexperiment.transforms.filters import Taxonomy
Filters are basically classes (or instances thereof) that hold attributes which are subclasses of the FilterExpression object. FilterExpressions can be considered fairly magical, as they use operator overloading to provide a shorthand API with a sugary syntax for applying various operations to the experiment dataframe objects.
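As a rough illustration of this idea (the classes below are purely illustrative and not the package's actual internals), an overloaded comparison does not return a boolean; it records the attribute, the operator and the value, and dapply/apply later interpret that recorded expression against the experiment dataframes:

#purely illustrative sketch, not the package's implementation
class IllustrativeCountExpression:
    def __gt__(self, value):
        #record the comparison instead of evaluating it immediately
        return ('count', '>', value)

class IllustrativeSample:
    count = IllustrativeCountExpression()

expr = IllustrativeSample.count > 90000
print(expr)  #('count', '>', 90000) -- a description for dapply/apply to interpret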
The three filters
Perhaps the best way to get the gist of how these work is to view the code examples.
In [2]:
exp.data_df
Out[2]:
In [3]:
exp.mapping_df
Out[3]:
In [4]:
Sample = exp.Sample
#OR
from omicexperiment.transforms.filters import Sample
In [5]:
#1. the count filter
exp.dapply(Sample.count > 90000) #note sample0 was filtered out, as its count = 86870
# this filter implements various operators
Out[5]:
In [6]:
#the count filter implements various operators (due to the FlexibleOperator mixin)
#here we try the __eq__ (==) operator; in the cell above we tried the > operator
exp.dapply(Sample.count == 100428)
Out[6]:
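The two cells above use the > and == operators; assuming the FlexibleOperator mixin also overloads the remaining rich comparisons, analogous expressions should behave the same way (an untested sketch):

exp.dapply(Sample.count >= 90000)  #keep samples with at least 90000 reads
exp.dapply(Sample.count < 100000)  #keep samples with fewer than 100000 reads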
In [7]:
#2. the att (attribute) filter
# this filters on the "attributes" (i.e. metadata) of the samples
# present in the mapping dataframe
# this uses an attribute access (dotted) syntax
#here we only select samples in the 'control' group
exp.dapply(Sample.att.group == 'control') #only one sample in this group
Out[7]:
In [8]:
#select only samples of asthmatic patients
exp.dapply(Sample.att.asthma == 1) #only three asthma-positive samples
Out[8]:
In [9]:
#another alias for the att filter is the c attribute on the Sample Filter
#(c is short for "column", as per sqlalchemy convention)
exp.dapply(Sample.c.asthma == 1) #only three asthma-positive samples
Out[9]:
In [10]:
#some columns may not be legal python attribute names,
#so for these we allow the [] (__getitem__) syntax
exp.dapply(Sample.att['#SampleID'] == 'sample0')
Out[10]:
In [11]:
#the apply method returns a new experiment object, so filters can be chained
exp.apply(Sample.c.asthma == 1).dapply(Sample.count > 100000) #two samples
Out[11]:
In [12]:
# the Sample groupby filter
#the aggregate function here is the mean,
#which is then normalized to 100 (mean relative abundance)
exp.dapply(Sample.groupby("group"))
Out[12]:
In [13]:
# the Sample groupby_sum filter
#the aggregate function here is the sum -- no normalization is applied
exp.dapply(Sample.groupby_sum("group"))
Out[13]:
In [14]:
Taxonomy = exp.Taxonomy #OR from omicexperiment.transforms.filters import Taxonomy
exp.taxonomy_df
Out[14]:
In [15]:
#1. the taxonomy groupby filter
#this filter is very important, as it is used to collapse OTUs by their taxonomies
#at the requested taxonomic rank
exp.dapply(Taxonomy.groupby('genus')) #any taxonomic rank can be passed
Out[15]:
In [16]:
'''
We noticed above that one of the assignments was resolved only to the
family level.
We can use 2. the taxonomy attribute filters to remove OTUs that were
classified at a resolution lower than genus, before continuing with
downstream analyses.
'''
genus_or_higher = exp.apply(Taxonomy.rank_resolution >= 'genus') #note apply returns a new experiment
genus_or_higher.apply(Taxonomy.groupby('genus')).data_df
Out[16]:
In [17]:
#Another example of the various Taxonomy attribute filters
exp.dapply(Taxonomy.genus == 'g__Aspergillus')
#only three OTUs had their genus assigned as 'g__Aspergillus'
Out[17]:
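Finally, because apply always returns a new experiment object, the Sample and Taxonomy filters can be freely combined in a single chain. The following is a sketch (not part of the original notebook output) that collapses the genus-or-higher OTUs to the genus level and then keeps only the asthma-positive samples:

genus_or_higher = exp.apply(Taxonomy.rank_resolution >= 'genus')
genus_or_higher.apply(Taxonomy.groupby('genus')).dapply(Sample.att.asthma == 1)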