A common class of transformations on experiment objects filters or subsets the data. The API therefore provides syntactic sugar for the common filtering operations on the components of an experiment object (for example, the three dataframes of the MicrobiomeExperiment object).
The rationale behind providing such syntactic sugar in the API is that working with three dataframes at the same time can be taxing.
Again, the goals are rapid analysis and economy of typing: moving quickly from one step to the next without repetitive boilerplate, especially since typical omic experiments require almost the same operations in their downstream analyses.
This notebook/chapter will provide various examples (currently for MicrobiomeExperiment), and can be regarded as a cookbook for various operations performed in a microbial amplicon metabarcoding experiment.
In [1]:
%load_ext autoreload
%autoreload 2
#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment
mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"
exp = MicrobiomeExperiment(biom, mapping, tax)
A filter applied to an experiment/dataset is essentially a kind of Transform: the Filter class itself inherits from the Transform class. Filters are therefore applied like any other transform, using the apply and dapply methods.
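The relationship between the two entry points can be pictured with a toy model. This is only a sketch, not the omicexperiment implementation: `ToyExperiment`, `ToyTransform`, `ToyMinCount`, and `apply_transform` are hypothetical names, the counts are made up, and it assumes (as the later cells in this chapter suggest) that `apply` returns a new experiment while `dapply` returns only the transformed data.

```python
class ToyExperiment:
    """Hypothetical stand-in for an experiment: holds per-sample counts."""

    def __init__(self, data):
        self.data = dict(data)  # sample name -> total count

    def apply(self, transform):
        # Assumed behaviour: wrap the transformed data in a new experiment,
        # so that further transforms can be chained.
        return ToyExperiment(transform.apply_transform(self))

    def dapply(self, transform):
        # Assumed behaviour: return only the transformed data.
        return self.apply(transform).data


class ToyTransform:
    """Hypothetical base class: maps an experiment to new data."""

    def apply_transform(self, experiment):
        raise NotImplementedError


class ToyMinCount(ToyTransform):
    """A filter is a transform that drops part of the data."""

    def __init__(self, threshold):
        self.threshold = threshold

    def apply_transform(self, experiment):
        return {
            sample: count
            for sample, count in experiment.data.items()
            if count > self.threshold
        }


# Made-up toy counts, unrelated to the example dataset.
toy = ToyExperiment({"a": 50, "b": 120, "c": 200})
kept = toy.dapply(ToyMinCount(100))
chained = toy.apply(ToyMinCount(100)).dapply(ToyMinCount(150))
print(kept)
print(chained)
```

Under these assumptions, `apply` is the method to use in the middle of a chain and `dapply` is the one to use at the end, when only the resulting table is of interest.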
From the transforms.filters subpackage, you can import the various filters:
from omicexperiment.transforms.filters import Sample
from omicexperiment.transforms.filters import Observation
from omicexperiment.transforms.filters import Taxonomy
These filters are also available as shortcuts on the MicrobiomeExperiment object. This particular API may change, however, as I am still considering a better interface to filters.
Taxonomy = exp.Taxonomy
#OR
from omicexperiment.transforms.filters import Taxonomy
Filters are subclasses of the Filter class. They can seem fairly magical, because they use operator overloading to provide a shorthand, sugary syntax for applying filters to the experiment's dataframes.
The three filters
The easiest way to get the gist of how these work is probably to look at the code examples.
In [2]:
exp.data_df
Out[2]:
In [3]:
exp.mapping_df
Out[3]:
In [4]:
Sample = exp.Sample
#OR
from omicexperiment.transforms.filters import Sample
In [5]:
#1. the count filter
exp.dapply(Sample.count > 90000) #note: sample0 was filtered out, as its count is 86870
Out[5]:
In [6]:
#note the operator overloading here: the expression evaluates to
#a new Filter instance that holds this information
#if you have worked with the SQLAlchemy ORM, its filters use a very similar technique
(Sample.count > 90000)
Out[6]:
In [7]:
#the count filter implements other operators as well (via the FlexibleOperator mixin)
#here we try the == (__eq__) operator; in the cell above we tried the > operator
exp.dapply(Sample.count == 100428)
#it only selected the sample with exact count of 100428
Out[7]:
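The operator-overloading idea used in the last few cells can be sketched in plain Python. This is a minimal illustration, not the library's code: `CountAttribute` and `CountFilter` are hypothetical names, the counts are made up, and the real FlexibleOperator mixin may be organised quite differently.

```python
import operator


class CountFilter:
    """Hypothetical filter object produced by a comparison expression."""

    def __init__(self, op, value):
        self.op = op        # comparison function, e.g. operator.gt
        self.value = value  # right-hand side of the comparison

    def __repr__(self):
        return f"CountFilter({self.op.__name__}, {self.value!r})"

    def keep(self, counts):
        # Keep only the samples whose count satisfies the stored comparison.
        return {s: c for s, c in counts.items() if self.op(c, self.value)}


class CountAttribute:
    """Comparisons on this object build filters instead of returning bools."""

    def __gt__(self, value):
        return CountFilter(operator.gt, value)

    def __ge__(self, value):
        return CountFilter(operator.ge, value)

    def __lt__(self, value):
        return CountFilter(operator.lt, value)

    def __le__(self, value):
        return CountFilter(operator.le, value)

    def __eq__(self, value):
        return CountFilter(operator.eq, value)

    def __ne__(self, value):
        return CountFilter(operator.ne, value)


count = CountAttribute()
toy_counts = {"a": 50, "b": 120, "c": 200}  # made-up counts

greater = count > 100  # a CountFilter, not True/False
exact = count == 120   # __eq__ is overloaded in the same way
print(greater)
print(greater.keep(toy_counts))
print(exact.keep(toy_counts))
```

The comparison itself does no filtering; it only records which operator and value to use, and the filtering happens later, when the filter object is handed to the data.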
In [8]:
#2. the att (attribute) filter
# this filters on the "attributes" (i.e. metadata) of the samples
# present in the mapping dataframe
# this uses an attribute access (dotted) syntax
#here we only select samples in the 'control' group
exp.dapply(Sample.att.group == 'control') #only one sample in this group
Out[8]:
In [9]:
(Sample.att.group == 'control')
Out[9]:
In [10]:
#select only samples of asthmatic patients
exp.dapply(Sample.att.asthma == 1) #only three asthma-positive samples
Out[10]:
In [11]:
#another alias for the att filter is the c attribute on the Sample Filter
#(c is short for "column", as per sqlalchemy convention)
exp.dapply(Sample.c.asthma == 1) #only three asthma-positive samples
Out[11]:
In [12]:
#some columns may not be legal python attribute names,
#so for these we allow the [] (__getitem__) syntax
exp.dapply(Sample.att['#SampleID'] == 'sample0')
Out[12]:
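How one accessor can support both the dotted and the bracket syntax is sketched below. The names `ColumnAccessor`, `Column`, and `ColumnFilter` are hypothetical and the metadata rows are made up; the sketch only shows the general `__getattr__` / `__getitem__` technique and is not taken from the library.

```python
class ColumnFilter:
    """Hypothetical filter: keeps rows whose metadata column equals a value."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def keep(self, rows):
        return [row for row in rows if row.get(self.column) == self.value]


class Column:
    """Hypothetical handle on one metadata column."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Build a filter instead of returning a bool.
        return ColumnFilter(self.name, value)


class ColumnAccessor:
    """Resolves att.group and att['#SampleID'] to Column objects."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("__"):
            raise AttributeError(name)  # leave dunder lookups alone
        return Column(name)

    def __getitem__(self, name):
        # Bracket access works for names that are not valid identifiers.
        return Column(name)


att = ColumnAccessor()
c = att  # an alias, mirroring the att / c pair shown above

rows = [  # made-up sample metadata
    {"#SampleID": "s0", "group": "control"},
    {"#SampleID": "s1", "group": "treated"},
]
by_attr = (att.group == "control").keep(rows)
by_item = (c["#SampleID"] == "s1").keep(rows)
print(by_attr)
print(by_item)
```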
In [13]:
exp.apply(Sample.c.asthma == 1).dapply(Sample.count > 100000) #two samples
Out[13]:
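For comparison, here is roughly the boilerplate that such a chained filter replaces when written directly against two pandas dataframes. The data is made up, and the layout (observations as rows, samples as columns, metadata indexed by sample name) is an assumption about how the experiment's dataframes are organised.

```python
import pandas as pd

# Made-up counts: rows are OTUs, columns are samples (assumed layout).
data_df = pd.DataFrame(
    {"s0": [60, 30], "s1": [80, 40], "s2": [90, 50]},
    index=["otu1", "otu2"],
)
# Made-up sample metadata, indexed by sample name.
mapping_df = pd.DataFrame({"asthma": [1, 1, 0]}, index=["s0", "s1", "s2"])

# Step 1: keep only the samples whose metadata says asthma == 1.
asthma_samples = mapping_df.index[mapping_df["asthma"] == 1]
step1 = data_df.loc[:, asthma_samples]

# Step 2: of those, keep the samples whose total count exceeds a threshold.
totals = step1.sum(axis=0)
step2 = step1.loc[:, totals > 100]
print(step2)
```

Every step has to keep the sample names of the two tables in sync by hand, which is the bookkeeping the filter syntax is meant to hide.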
In [14]:
#the Sample groupby Transform
#the default aggregate function here is the mean
exp.dapply(Sample.groupby("group"))
Out[14]:
In [15]:
# we can also change the aggfunc from mean to sum
import numpy as np
exp.dapply(Sample.groupby("group", aggfunc=np.sum))
Out[15]:
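Conceptually, grouping samples by a metadata column resembles the following plain pandas operation. This is a sketch on made-up data, assuming that observations are rows and samples are columns; it is not necessarily how `Sample.groupby` is implemented.

```python
import pandas as pd

# Made-up counts: rows are OTUs, columns are samples (assumed layout).
data_df = pd.DataFrame(
    {"s0": [10, 0], "s1": [20, 4], "s2": [30, 8]},
    index=["otu1", "otu2"],
)
# Made-up sample metadata, indexed by sample name.
mapping_df = pd.DataFrame(
    {"group": ["control", "treated", "treated"]},
    index=["s0", "s1", "s2"],
)

# Transpose so that samples become rows, group the rows by the metadata
# column, aggregate, then transpose back to the OTU-by-group layout.
groups = mapping_df["group"]
mean_df = data_df.T.groupby(groups).mean().T
sum_df = data_df.T.groupby(groups).sum().T
print(mean_df)
print(sum_df)
```

Each column of the result is one group rather than one sample, so the choice of aggregate (mean or sum) determines whether groups of different sizes remain comparable.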
In [16]:
Taxonomy = exp.Taxonomy #OR from omicexperiment.transforms.filters import Taxonomy
exp.taxonomy_df
Out[16]:
In [17]:
'''
We noticed above that one of the assignments was resolved
only to the family level.
We can use the taxonomy attribute filters to remove the OTUs that
were classified at a lower resolution than genus.
'''
genus_or_higher = exp.apply(Taxonomy.rank_resolution >= 'genus')
genus_or_higher.data_df
Out[17]:
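A comparison such as `>= 'genus'` only makes sense if the ranks are ordered from coarse to fine. The sketch below shows one way such an ordering could work. The function names are hypothetical, the lineage strings are made up, and the prefixed `k__...;s__...` format is an assumption based on the `g__Aspergillus` label used in this chapter.

```python
# Ranks ordered from coarsest to finest resolution.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
PREFIX_TO_RANK = {rank[0]: rank for rank in RANKS}  # "g" -> "genus", ...


def finest_rank_index(lineage):
    """Index in RANKS of the finest rank that carries a name, or -1."""
    finest = -1
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if name and prefix in PREFIX_TO_RANK:
            finest = max(finest, RANKS.index(PREFIX_TO_RANK[prefix]))
    return finest


def resolved_at_least(lineage, rank):
    """True if the lineage is named at `rank` or at a finer rank."""
    return finest_rank_index(lineage) >= RANKS.index(rank)


# Made-up lineages: otu3 is only resolved to the family level.
toy_taxonomy = {
    "otu1": "k__K1;p__P1;c__C1;o__O1;f__F1;g__G1;s__S1",
    "otu2": "k__K1;p__P1;c__C1;o__O1;f__F1;g__G2;s__",
    "otu3": "k__K1;p__P1;c__C1;o__O1;f__F2;g__;s__",
}
kept = [
    otu for otu, lineage in toy_taxonomy.items()
    if resolved_at_least(lineage, "genus")
]
print(kept)
```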
In [18]:
#The TaxonomyGroupBy Transform
genus_or_higher.apply(Taxonomy.groupby('genus')).data_df
#the Taxonomy.groupby is a shortcut for the TaxonomyGroupBy transform in the transforms.taxonomy module.
Out[18]:
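In plain pandas terms, collapsing OTUs to a taxonomic rank can be pictured as a row-wise groupby on the taxonomy labels. The sketch uses made-up data and assumes that counts of OTUs sharing a genus are summed; the actual TaxonomyGroupBy transform may behave differently.

```python
import pandas as pd

# Made-up counts: rows are OTUs, columns are samples (assumed layout).
data_df = pd.DataFrame(
    {"s0": [10, 5, 1], "s1": [20, 0, 3]},
    index=["otu1", "otu2", "otu3"],
)
# Made-up genus labels, indexed by OTU.
taxonomy_df = pd.DataFrame(
    {"genus": ["g__A", "g__A", "g__B"]},
    index=["otu1", "otu2", "otu3"],
)

# Rows that share a genus label are summed into a single row.
genus_df = data_df.groupby(taxonomy_df["genus"]).sum()
print(genus_df)
```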
In [19]:
#Another example of the various Taxonomy attribute filters
exp.dapply(Taxonomy.genus == 'g__Aspergillus')
#only three OTUs had a genus assigned as 'g__Aspergillus'
Out[19]: