Experiment objects filters - the rationale

A common transformation on experiment objects are those that apply some sort of filtering of subsetting of data. A syntactic sugar API is thus provided for the various common filtering operations on the components of the experiment objects (for example the three dataframes of the MicrobiomeExperiment object).

The rationale behind providing such syntactic sugar in the API is that working with three dataframes at the same time can be taxing.

Again, rapid analysis and economy in typing, enabling quick workflow from one step to the next, are the ultimate aspirations here, avoiding repetitive boilerplate, especially knowing that almost the same operations are required in particular downstream analyses in various typical omic experiments.

This notebook/chapter will provide various examples (currently for MicrobiomeExperiment), and can be regarded as a cookbook for various operations performed in a microbial amplicon metabarcoding experiment.


In [1]:
%load_ext autoreload
%autoreload 2

#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment

mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"

exp = MicrobiomeExperiment(biom, mapping,tax)

Experiment filters

A filter applied to a experiment/dataset is basically a sort of Transform. The Filter class itself inherits from Transform class. As such, filters are applied as other transforms, using the apply and dapply methods.

The filter subpackage

From the transforms.filters subpackage, you can import the various filters:

from omicexperiment.transforms.filters import Sample
from omicexperiment.transforms.filters import Observation 
from omicexperiment.transforms.filters import Taxonomy

These "filters" are also provided on the MicrobiomeExperiment object, as shortcuts. However, I am still considering a better interface to filters so I am re-considering this particular API.

Taxonomy = exp.Taxonomy
#OR
from omicexperiment.transforms.filters import Taxonomy

What are filters?

Filters are subclasses of the Filter class. Filters can be considered fairly magical, as they utilize operator overloading in an attempt to provide a shorthand API with a sugary syntax for applying various filtering on the experiment dataframe objects.

The three filters

  • Taxonomy filter: apply various operations on the taxonomy
  • Sample filter: apply various operations on samples/sample metadata
  • Observation filter: apply various operations on observations (i.e. OTUs in a microbiome context)

The only way to get the gist of how these work is perhaps to view the code examples.


In [2]:
exp.data_df


Out[2]:
1234 9876 sample0 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225872 0 2 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 2 1 0 91911 100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 133138 0 0 21 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 86870 0 0

In [3]:
exp.mapping_df


Out[3]:
#SampleID BarcodeSequence LinkerPrimerSequence Description patient_id group asthma vas amplicon_conc
#SampleID
sample0 sample0 ACTGAGCG AAAA sample0 132 CRSsNP 0 49 4.3
sample1 sample1 AAGAGGCA AAAA sample1 315 CRSwNP 1 43 2.3
sample2 sample2 ATCTCAGG AAAA sample2 742 CRSsNP 0 23 3.2
1234 1234 ATGCGCAG AAAA 1234 927 control 1 87 1.0
9876 9876 TAGGCATG AAAA 9876 538 CRSwNP 1 12 1.3

Sample Filter examples


In [4]:
Sample = exp.Sample
#OR
from omicexperiment.transforms.filters import Sample

In [5]:
#1. the count filter
exp.dapply(Sample.count > 90000) #note sample0 was filtered off as its count is = 86870


Out[5]:
1234 9876 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225872 2 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 2 1 91911 100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 133138 0 21 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 0 0

In [6]:
#note the use of the operator overloading here so that the expression equals to
#a new Filter instance that holds this information 
#if you have worked with SQLAlchemy ORM, a very similar technique is used by sqlalchemy filters
(Sample.count > 90000)


Out[6]:
<omicexperiment.transforms.filters.sample.SampleCount object at 0x7f81c77749b0 - operator:__gt__; value:90000;>

In [7]:
#the count filter actually implements other operators as well (due to the FlexibleOperator mixin)
#here we try the __eq__ (==) operator, the cell above we tried the > operator
exp.dapply(Sample.count == 100428)
#it only selected the sample with exact count of 100428


Out[7]:
sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0

In [8]:
#2. the att (attribute) filter
#   this filters on the "attributes" (i.e. metadata) of the samples
#   present in the mapping dataframe
#   this uses an attribute access (dotted) syntax
#here we only select samples in the 'control' group
exp.dapply(Sample.att.group == 'control') #only one sample in this group


Out[8]:
#SampleID 1234
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0

In [9]:
(Sample.att.group == 'control')


Out[9]:
<omicexperiment.transforms.filters.sample.SampleAttributeFilter object at 0x7f81c6a94c18 - attribute:group; operator:__eq__; value:control;>

In [10]:
#select only samples of asthmatic patients
exp.dapply(Sample.att.asthma == 1) #only three asthma-positive samples


Out[10]:
#SampleID sample1 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 2 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 91911 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 21 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 0

In [11]:
#another alias for the att filter is the c attribute on the Sample Filter
#(c is short for "column", as per sqlalchemy convention)
exp.dapply(Sample.c.asthma == 1) #only three asthma-positive samples


Out[11]:
#SampleID sample1 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 2 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 91911 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 21 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 0

In [12]:
#some columns may not be legal python attribute names,
#so for these we allow the [] (__getitem__) syntax
exp.dapply(Sample.att['#SampleID'] == 'sample0')


Out[12]:
#SampleID sample0
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 86870
An example of method chaining using efilter instead of filter

In [13]:
exp.apply(Sample.c.asthma == 1).dapply(Sample.count > 100000) #two samples


Out[13]:
#SampleID 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0

In [14]:
# the Sample groupby Transform
#the aggregate function here is the mean,
exp.dapply(Sample.groupby("group"))


Out[14]:
group CRSsNP CRSwNP control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 112937.0 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 50214 45956.0 2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0 10.5 133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0.0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 43435 0.0 0

In [15]:
# we can also change the aggfunc from mean to sum
import numpy as np
exp.dapply(Sample.groupby("group", aggfunc=np.sum))


Out[15]:
group CRSsNP CRSwNP control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225874 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 100428 91912 2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0 21 133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 86870 0 0

Taxonomy filters

Taxonomy filters allows common operations done on the taxonomy metadata of the Observations/OTUs.


In [16]:
Taxonomy = exp.Taxonomy #OR from omicexperiment.transforms.filters import Taxonomy

exp.taxonomy_df


Out[16]:
kingdom phylum class order family genus species rank_resolution tax otu
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__Aspergillus bombycis species k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... 2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
ae0ddda08027454fdb5db77c96b94691b8274cdd k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__unidentified (g__Aspergillus) genus k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... ae0ddda08027454fdb5db77c96b94691b8274cdd
8f52abc02aed2ce6c63be04570a7e609f9cdac5f k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Pleosporaceae g__unidentified (f__Pleosporaceae) s__unidentified (f__Pleosporaceae) family k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P... 8f52abc02aed2ce6c63be04570a7e609f9cdac5f
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__Aspergillus flavus species k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... 3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Pleosporaceae g__Lewia s__unidentified (g__Lewia) genus k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P... 8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5 k__unidentified (Unassigned) p__unidentified (Unassigned) c__unidentified (Unassigned) o__unidentified (Unassigned) f__unidentified (Unassigned) g__unidentified (Unassigned) s__unidentified (Unassigned) NaN No blast hit 54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5

In [17]:
'''
We noticed above that one of the assignments was identified at a highest
resolution only at the family level.
We can utilize the taxonomy attribute filters to remove these OTUs that
were classified at a lower resolution than a genus. 
'''
genus_or_higher = exp.apply(Taxonomy.rank_resolution >= 'genus')
genus_or_higher.data_df


Out[17]:
1234 9876 sample0 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225872 0 2 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 2 1 0 91911 100428
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 86870 0 0

In [18]:
#The TaxonomyGroupBy Transform
genus_or_higher.apply(Taxonomy.groupby('genus')).data_df
#the Taxonomy.groupby is a shortcut for the TaxonomyGroupBy transform in the transforms.taxonomy module.


Out[18]:
1234 9876 sample0 sample1 sample2
genus
g__Aspergillus 2 225873 0 91913 100428
g__Lewia 0 0 86870 0 0

In [19]:
#Another example of the various Taxonomy attribute filters
exp.dapply(Taxonomy.genus == 'g__Aspergillus')
#only three otus had a genus assigned as 'g__Aspergillus"


Out[19]:
1234 9876 sample0 sample1 sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225872 0 2 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 2 1 0 91911 100428
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0 0