Experiment object filters - the rationale

A syntactic-sugar API is provided for common filtering operations on the components of the experiment objects (for example, the three dataframes of the MicrobiomeExperiment object).

The rationale for this syntactic sugar is that working with three dataframes at once can be taxing, since together they describe multidimensional data (and metadata!).

Again, the ultimate aspiration is rapid analysis and economy of typing: avoiding repetitive boilerplate, especially since nearly the same operations recur in the downstream analyses of many typical omic experiments.

This notebook/chapter provides various examples (currently for MicrobiomeExperiment only) and can be regarded as a cookbook of operations commonly performed in a microbial amplicon metabarcoding experiment.


In [1]:
%load_ext autoreload
%autoreload 2

#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment

mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"

exp = MicrobiomeExperiment(biom, mapping, tax)

Experiment filters

The filter functionality rests on two methods of the experiment objects. The first is called 'filter'; the second is called 'efilter' (short for "experiment filter"). In the code cells of this notebook, these appear as dapply and apply, respectively.

The filter method filters the data_df according to the parameters passed, as explained below. Remember that the data_df is our main dataframe (our contingency table, or matrix), and that the filter method only filters the data_df, not, for example, the mapping_df. The method returns a new pandas DataFrame object (the new "filtered" or "modified" data_df).

The efilter method provides the same functionality as filter. The only difference is that efilter returns a whole new experiment object rather than a stand-alone DataFrame. As explained before, this paradigm is helpful because it allows method chaining.
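
The difference between the two return types can be illustrated with a toy class. This is a minimal sketch, not the library's implementation: ToyExperiment, its filter/efilter methods, and the predicate functions are hypothetical names, and plain dicts stand in for the dataframes. The counts and asthma flags are taken from the example data used in this notebook.

```python
class ToyExperiment:
    """Hypothetical miniature experiment: count data plus sample metadata."""

    def __init__(self, data, mapping):
        self.data = data          # sample name -> list of OTU counts
        self.mapping = mapping    # sample name -> metadata dict

    def filter(self, predicate):
        """Return only the filtered data (the analogue of a new data_df)."""
        return {s: c for s, c in self.data.items() if predicate(s, self)}

    def efilter(self, predicate):
        """Return a new experiment wrapping the filtered data, so calls chain."""
        return ToyExperiment(self.filter(predicate), self.mapping)


def is_asthmatic(sample, exp):
    return exp.mapping[sample]["asthma"] == 1


def over_100k(sample, exp):
    return sum(exp.data[sample]) > 100000


toy = ToyExperiment(
    data={
        "sample0": [0, 0, 0, 0, 86870],
        "sample1": [2, 91911, 21, 0, 0],
        "sample2": [0, 100428, 0, 0, 0],
        "1234": [0, 2, 133138, 0, 0],
        "9876": [225872, 1, 0, 0, 0],
    },
    mapping={
        "sample0": {"asthma": 0},
        "sample1": {"asthma": 1},
        "sample2": {"asthma": 0},
        "1234": {"asthma": 1},
        "9876": {"asthma": 1},
    },
)

# efilter returns an experiment, so a second filter can be chained onto it;
# the final filter call returns plain data.
chained = toy.efilter(is_asthmatic).filter(over_100k)
print(list(chained))
```

Because filter returns bare data, it can only end a chain; every intermediate step has to be an efilter call.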

The filters subpackage

From the filters subpackage, you can import the various filters:

from omicexperiment.transforms.filters import Sample
from omicexperiment.transforms.filters import Observation
from omicexperiment.transforms.filters import Taxonomy

These "filters" are also provided on the MicrobiomeExperiment object, as shortcuts.

Taxonomy = exp.Taxonomy
#OR
from omicexperiment.transforms.filters import Taxonomy

What are filters?

Filters are classes (they can also be used as objects/instances) whose attributes derive from the FilterExpression class. FilterExpressions can be considered fairly magical: they use operator overloading to provide a shorthand, sugary syntax for applying operations to the experiment's dataframes.
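
To make the operator-overloading idea concrete, here is a minimal sketch in plain Python. It is not the library's code: CountExpression, ToySample, and the keep method are hypothetical names chosen for illustration, and a dict of per-sample counts (taken from the example data_df in this notebook) stands in for the dataframe. The point is only that a comparison such as `count > 90000` can return an object describing the comparison rather than a boolean.

```python
import operator


class CountExpression:
    """Toy stand-in for a filter expression (hypothetical, not the library's class).

    Comparing it with a number does not return True/False; it returns a new
    expression that remembers the comparison so it can be applied later.
    """

    def __init__(self, op=None, threshold=None):
        self.op = op
        self.threshold = threshold

    def __gt__(self, threshold):
        return CountExpression(operator.gt, threshold)

    def __lt__(self, threshold):
        return CountExpression(operator.lt, threshold)

    def __eq__(self, threshold):
        return CountExpression(operator.eq, threshold)

    def keep(self, table):
        """Return the samples whose total count satisfies the stored comparison."""
        return [name for name, counts in table.items()
                if self.op(sum(counts), self.threshold)]


class ToySample:
    """Toy filter class: its attributes are expression objects."""
    count = CountExpression()


# Per-sample OTU counts, one list per column of the example data_df.
table = {
    "sample0": [0, 0, 0, 0, 86870],
    "sample1": [2, 91911, 21, 0, 0],
    "sample2": [0, 100428, 0, 0, 0],
    "1234": [0, 2, 133138, 0, 0],
    "9876": [225872, 1, 0, 0, 0],
}

over_90k = (ToySample.count > 90000).keep(table)
exactly = (ToySample.count == 100428).keep(table)
print(over_90k)
print(exactly)
```

Deferring the comparison this way is what lets an expression be built first and handed to a method afterwards.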

The three filters

  • Taxonomy filter: apply various operations on the taxonomy
  • Sample filter: apply various operations on samples/sample metadata
  • Observation filter: apply various operations on observations (i.e. OTUs in a microbiome context)

The easiest way to get the gist of how these work is probably to look at the code examples.


In [2]:
exp.data_df


Out[2]:
sample0 sample1 sample2 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 2 0 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 0 91911 100428 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0 21 0 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 86870 0 0 0 0

In [3]:
exp.mapping_df


Out[3]:
#SampleID BarcodeSequence LinkerPrimerSequence Description patient_id group asthma vas amplicon_conc
#SampleID
sample0 sample0 ACTGAGCG AAAA sample0 132 CRSsNP 0 49 4.3
sample1 sample1 AAGAGGCA AAAA sample1 315 CRSwNP 1 43 2.3
sample2 sample2 ATCTCAGG AAAA sample2 742 CRSsNP 0 23 3.2
1234 1234 ATGCGCAG AAAA 1234 927 control 1 87 1.0
9876 9876 TAGGCATG AAAA 9876 538 CRSwNP 1 12 1.3

Sample Filter examples


In [4]:
Sample = exp.Sample
#OR
from omicexperiment.transforms.filters import Sample

In [5]:
#1. the count filter
exp.dapply(Sample.count > 90000) #note sample0 was filtered out as its count = 86870
# this filter implements various operators


Out[5]:
sample1 sample2 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 2 0 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 91911 100428 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 21 0 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 0 0

In [6]:
#the count filter implements various operators (due to the FlexibleOperator mixin)
#here we try the __eq__ (==) operator; in the cell above we tried the > operator
exp.dapply(Sample.count == 100428)


Out[6]:
sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0

In [7]:
#2. the att (attribute) filter
#   this filters on the "attributes" (i.e. metadata) of the samples
#   present in the mapping dataframe
#   this uses an attribute access (dotted) syntax
#here we only select samples in the 'control' group
exp.dapply(Sample.att.group == 'control') #only one sample in this group


Out[7]:
#SampleID 1234
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0

In [8]:
#select only samples of asthmatic patients
exp.dapply(Sample.att.asthma == 1) #only three asthma-positive samples


Out[8]:
#SampleID sample1 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 2 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 91911 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 21 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 0

In [9]:
#another alias for the att filter is the c attribute on the Sample Filter
#(c is short for "column", as per sqlalchemy convention)
exp.dapply(Sample.c.asthma == 1) #only three asthma-positive samples


Out[9]:
#SampleID sample1 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 2 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 91911 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 21 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0 0

In [10]:
#some columns may not be legal python attribute names,
#so for these we allow the [] (__getitem__) syntax
exp.dapply(Sample.att['#SampleID'] == 'sample0')


Out[10]:
#SampleID sample0
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 86870

An example of method chaining using efilter instead of filter

In [11]:
exp.apply(Sample.c.asthma == 1).dapply(Sample.count > 100000) #two samples


Out[11]:
#SampleID 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 2 1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 133138 0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 0 0

In [12]:
# the Sample groupby filter
#the aggregate function here is the mean;
#the result is then normalized to 100 (mean relative abundance)
exp.dapply(Sample.groupby("group"))


Out[12]:
group CRSsNP CRSwNP control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0.000000 71.072695 0.000000
ae0ddda08027454fdb5db77c96b94691b8274cdd 53.619366 28.920697 0.001502
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0.000000 0.006608 99.998498
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0.000000 0.000000 0.000000
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 46.380634 0.000000 0.000000
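
The CRSsNP column above can be checked by hand. The following sketch assumes, as the cell's comments describe, that each OTU is averaged across the samples of a group and that the resulting column is then rescaled to sum to 100. It uses plain lists rather than the library, with the counts of the two CRSsNP samples (sample0 and sample2) copied from the data_df.

```python
# OTU counts for the two CRSsNP samples, in data_df row order.
sample0 = [0, 0, 0, 0, 86870]
sample2 = [0, 100428, 0, 0, 0]

# Step 1: per-OTU mean across the samples in the group.
means = [(a + b) / 2 for a, b in zip(sample0, sample2)]

# Step 2: rescale the column of means so that it sums to 100.
total = sum(means)
relative = [100 * m / total for m in means]

for value in relative:
    print(f"{value:.6f}")
```

The second and fifth values agree with the 53.619366 and 46.380634 shown in the CRSsNP column.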

In [13]:
# the Sample groupby_sum filter
#the aggregate function here is the sum -- no normalization is applied
exp.dapply(Sample.groupby_sum("group"))


Out[13]:
group CRSsNP CRSwNP control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 225874 0
ae0ddda08027454fdb5db77c96b94691b8274cdd 100428 91912 2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f 0 21 133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 86870 0 0

Taxonomy filters

Taxonomy filters allow common operations on the taxonomy metadata of the Observations/OTUs.


In [14]:
Taxonomy = exp.Taxonomy #OR from omicexperiment.transforms.filters import Taxonomy

exp.taxonomy_df


Out[14]:
kingdom phylum class order family genus species rank_resolution tax otu
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__Aspergillus bombycis species k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... 2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
ae0ddda08027454fdb5db77c96b94691b8274cdd k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__unidentified (g__Aspergillus) genus k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... ae0ddda08027454fdb5db77c96b94691b8274cdd
8f52abc02aed2ce6c63be04570a7e609f9cdac5f k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Pleosporaceae g__unidentified (f__Pleosporaceae) s__unidentified (f__Pleosporaceae) family k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P... 8f52abc02aed2ce6c63be04570a7e609f9cdac5f
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Trichocomaceae g__Aspergillus s__Aspergillus flavus species k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu... 3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0 k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Pleosporaceae g__Lewia s__unidentified (g__Lewia) genus k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P... 8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5 No blast hit p__unidentified (Unassigned) c__unidentified (Unassigned) o__unidentified (Unassigned) f__unidentified (Unassigned) g__unidentified (Unassigned) s__unidentified (Unassigned) NaN No blast hit 54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5

In [15]:
#1. the taxonomy groupby filter
#this filter is very important, as it is used to collapse otus by their taxonomies
#according to the taxonomic rank asked for
exp.dapply(Taxonomy.groupby('genus')) #any taxonomic rank can be passed


Out[15]:
sample0 sample1 sample2 1234 9876
genus
g__Aspergillus 0 91913 100428 2 225873
g__Lewia 86870 0 0 0 0
g__unidentified (f__Pleosporaceae) 0 21 0 133138 0
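
What "collapsing OTUs by taxonomy" means can be sketched without the library: rows that share a genus label are summed column by column. The snippet below is an illustration under that assumption, not the library's implementation; the OTU identifiers are abbreviated to their first eight characters for readability.

```python
from collections import defaultdict

# OTU -> counts for sample0, sample1, sample2, 1234, 9876 (from the data_df).
counts = {
    "2f328e48": [0, 2, 0, 0, 225872],
    "ae0ddda0": [0, 91911, 100428, 2, 1],
    "8f52abc0": [0, 21, 0, 133138, 0],
    "3cb3c234": [0, 0, 0, 0, 0],
    "8e9a3b9a": [86870, 0, 0, 0, 0],
}

# OTU -> genus label (from the taxonomy_df).
genus = {
    "2f328e48": "g__Aspergillus",
    "ae0ddda0": "g__Aspergillus",
    "8f52abc0": "g__unidentified (f__Pleosporaceae)",
    "3cb3c234": "g__Aspergillus",
    "8e9a3b9a": "g__Lewia",
}

# Sum the rows of all OTUs that share a genus.
collapsed = defaultdict(lambda: [0] * 5)
for otu, row in counts.items():
    label = genus[otu]
    collapsed[label] = [a + b for a, b in zip(collapsed[label], row)]

for label in sorted(collapsed):
    print(label, collapsed[label])
```

The three Aspergillus OTUs merge into a single row, which is why the collapsed table has three rows instead of five.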

In [16]:
'''
We noticed above that one of the assignments was resolved only to the
family level.
We can use 2. the taxonomy attribute filters to remove OTUs that
were classified at a lower resolution than genus before continuing with
downstream analyses.
'''
genus_or_higher = exp.apply(Taxonomy.rank_resolution >= 'genus') #note efilter
genus_or_higher.apply(Taxonomy.groupby('genus')).data_df


Out[16]:
sample0 sample1 sample2 1234 9876
genus
g__Aspergillus 0 91913 100428 2 225873
g__Lewia 86870 0 0 0 0

In [17]:
#Another example of the various Taxonomy attribute filters
exp.dapply(Taxonomy.genus == 'g__Aspergillus')
#only three otus had a genus assigned as 'g__Aspergillus'


Out[17]:
sample0 sample1 sample2 1234 9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd 0 2 0 0 225872
ae0ddda08027454fdb5db77c96b94691b8274cdd 0 91911 100428 2 1
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3 0 0 0 0 0

Missing Taxonomic operations functionality

In future versions, diversity

