Experiment objects filters - the rationale

A common transformation on experiment objects are those that apply some sort of filtering of subsetting of data. A syntactic sugar API is thus provided for the various common filtering operations on the components of the experiment objects (for example the three dataframes of the MicrobiomeExperiment object).

The rationale behind providing such syntactic sugar in the API is that working with three dataframes at the same time can be taxing.

Again, rapid analysis and economy in typing, enabling quick workflow from one step to the next, are the ultimate aspirations here, avoiding repetitive boilerplate, especially knowing that almost the same operations are required in particular downstream analyses in various typical omic experiments.

This notebook/chapter will provide various examples (currently for MicrobiomeExperiment), and can be regarded as a cookbook for various operations performed in a microbial amplicon metabarcoding experiment.



In [1]:

    
%load_ext autoreload
%autoreload 2

#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment

mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"

exp = MicrobiomeExperiment(biom, mapping,tax)

Experiment filters

A filter applied to a experiment/dataset is basically a sort of Transform. The Filter class itself inherits from Transform class. As such, filters are applied as other transforms, using the apply and dapply methods.

The filter subpackage

From the transforms.filters subpackage, you can import the various filters:

from omicexperiment.transforms.filters import Sample
from omicexperiment.transforms.filters import Observation 
from omicexperiment.transforms.filters import Taxonomy

These "filters" are also provided on the MicrobiomeExperiment object, as shortcuts. However, I am still considering a better interface to filters so I am re-considering this particular API.

Taxonomy = exp.Taxonomy
#OR
from omicexperiment.transforms.filters import Taxonomy

What are filters?

Filters are subclasses of the Filter class. Filters can be considered fairly magical, as they utilize operator overloading in an attempt to provide a shorthand API with a sugary syntax for applying various filtering on the experiment dataframe objects.

The three filters

Taxonomy filter: apply various operations on the taxonomy
Sample filter: apply various operations on samples/sample metadata
Observation filter: apply various operations on observations (i.e. OTUs in a microbiome context)

The only way to get the gist of how these work is perhaps to view the code examples.



In [2]:

    
exp.data_df









    Out[2]:






  
    
      
      1234
      9876
      sample0
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      225872
      0
      2
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      2
      1
      0
      91911
      100428
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      133138
      0
      0
      21
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0
      0
      86870
      0
      0



In [3]:

    
exp.mapping_df









    Out[3]:






  
    
      
      #SampleID
      BarcodeSequence
      LinkerPrimerSequence
      Description
      patient_id
      group
      asthma
      vas
      amplicon_conc
    
    
      #SampleID
      
      
      
      
      
      
      
      
      
    
  
  
    
      sample0
      sample0
      ACTGAGCG
      AAAA
      sample0
      132
      CRSsNP
      0
      49
      4.3
    
    
      sample1
      sample1
      AAGAGGCA
      AAAA
      sample1
      315
      CRSwNP
      1
      43
      2.3
    
    
      sample2
      sample2
      ATCTCAGG
      AAAA
      sample2
      742
      CRSsNP
      0
      23
      3.2
    
    
      1234
      1234
      ATGCGCAG
      AAAA
      1234
      927
      control
      1
      87
      1.0
    
    
      9876
      9876
      TAGGCATG
      AAAA
      9876
      538
      CRSwNP
      1
      12
      1.3

Sample Filter examples



In [4]:

    
Sample = exp.Sample
#OR
from omicexperiment.transforms.filters import Sample



In [5]:

    
#1. the count filter
exp.dapply(Sample.count > 90000) #note sample0 was filtered off as its count is = 86870









    Out[5]:






  
    
      
      1234
      9876
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      225872
      2
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      2
      1
      91911
      100428
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      133138
      0
      21
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0
      0
      0
      0



In [6]:

    
#note the use of the operator overloading here so that the expression equals to
#a new Filter instance that holds this information 
#if you have worked with SQLAlchemy ORM, a very similar technique is used by sqlalchemy filters
(Sample.count > 90000)









    Out[6]:





<omicexperiment.transforms.filters.sample.SampleCount object at 0x7f81c77749b0 - operator:__gt__; value:90000;>



In [7]:

    
#the count filter actually implements other operators as well (due to the FlexibleOperator mixin)
#here we try the __eq__ (==) operator, the cell above we tried the > operator
exp.dapply(Sample.count == 100428)
#it only selected the sample with exact count of 100428









    Out[7]:






  
    
      
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      100428
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0



In [8]:

    
#2. the att (attribute) filter
#   this filters on the "attributes" (i.e. metadata) of the samples
#   present in the mapping dataframe
#   this uses an attribute access (dotted) syntax
#here we only select samples in the 'control' group
exp.dapply(Sample.att.group == 'control') #only one sample in this group









    Out[8]:






  
    
      #SampleID
      1234
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      2
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      133138
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0



In [9]:

    
(Sample.att.group == 'control')









    Out[9]:





<omicexperiment.transforms.filters.sample.SampleAttributeFilter object at 0x7f81c6a94c18 - attribute:group; operator:__eq__; value:control;>



In [10]:

    
#select only samples of asthmatic patients
exp.dapply(Sample.att.asthma == 1) #only three asthma-positive samples









    Out[10]:






  
    
      #SampleID
      sample1
      1234
      9876
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      2
      0
      225872
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      91911
      2
      1
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      21
      133138
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0
      0
      0



In [11]:

    
#another alias for the att filter is the c attribute on the Sample Filter
#(c is short for "column", as per sqlalchemy convention)
exp.dapply(Sample.c.asthma == 1) #only three asthma-positive samples









    Out[11]:






  
    
      #SampleID
      sample1
      1234
      9876
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      2
      0
      225872
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      91911
      2
      1
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      21
      133138
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0
      0
      0



In [12]:

    
#some columns may not be legal python attribute names,
#so for these we allow the [] (__getitem__) syntax
exp.dapply(Sample.att['#SampleID'] == 'sample0')









    Out[12]:






  
    
      #SampleID
      sample0
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      0
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      86870

An example of method chaining using efilter instead of filter



In [13]:

    
exp.apply(Sample.c.asthma == 1).dapply(Sample.count > 100000) #two samples









    Out[13]:






  
    
      #SampleID
      1234
      9876
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      225872
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      2
      1
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      133138
      0
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0
      0



In [14]:

    
# the Sample groupby Transform
#the aggregate function here is the mean,
exp.dapply(Sample.groupby("group"))









    Out[14]:






  
    
      group
      CRSsNP
      CRSwNP
      control
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      112937.0
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      50214
      45956.0
      2
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      0
      10.5
      133138
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0.0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      43435
      0.0
      0



In [15]:

    
# we can also change the aggfunc from mean to sum
import numpy as np
exp.dapply(Sample.groupby("group", aggfunc=np.sum))









    Out[15]:






  
    
      group
      CRSsNP
      CRSwNP
      control
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      225874
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      100428
      91912
      2
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      0
      21
      133138
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      86870
      0
      0

Taxonomy filters

Taxonomy filters allows common operations done on the taxonomy metadata of the Observations/OTUs.



In [16]:

    
Taxonomy = exp.Taxonomy #OR from omicexperiment.transforms.filters import Taxonomy

exp.taxonomy_df









    Out[16]:






  
    
      
      kingdom
      phylum
      class
      order
      family
      genus
      species
      rank_resolution
      tax
      otu
    
    
      otu
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      k__Fungi
      p__Ascomycota
      c__Eurotiomycetes
      o__Eurotiales
      f__Trichocomaceae
      g__Aspergillus
      s__Aspergillus bombycis
      species
      k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      k__Fungi
      p__Ascomycota
      c__Eurotiomycetes
      o__Eurotiales
      f__Trichocomaceae
      g__Aspergillus
      s__unidentified (g__Aspergillus)
      genus
      k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...
      ae0ddda08027454fdb5db77c96b94691b8274cdd
    
    
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
      k__Fungi
      p__Ascomycota
      c__Dothideomycetes
      o__Pleosporales
      f__Pleosporaceae
      g__unidentified (f__Pleosporaceae)
      s__unidentified (f__Pleosporaceae)
      family
      k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...
      8f52abc02aed2ce6c63be04570a7e609f9cdac5f
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      k__Fungi
      p__Ascomycota
      c__Eurotiomycetes
      o__Eurotiales
      f__Trichocomaceae
      g__Aspergillus
      s__Aspergillus flavus
      species
      k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      k__Fungi
      p__Ascomycota
      c__Dothideomycetes
      o__Pleosporales
      f__Pleosporaceae
      g__Lewia
      s__unidentified (g__Lewia)
      genus
      k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
    
    
      54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5
      k__unidentified (Unassigned)
      p__unidentified (Unassigned)
      c__unidentified (Unassigned)
      o__unidentified (Unassigned)
      f__unidentified (Unassigned)
      g__unidentified (Unassigned)
      s__unidentified (Unassigned)
      NaN
      No blast hit
      54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5



In [17]:

    
'''
We noticed above that one of the assignments was identified at a highest
resolution only at the family level.
We can utilize the taxonomy attribute filters to remove these OTUs that
were classified at a lower resolution than a genus. 
'''
genus_or_higher = exp.apply(Taxonomy.rank_resolution >= 'genus')
genus_or_higher.data_df









    Out[17]:






  
    
      
      1234
      9876
      sample0
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      225872
      0
      2
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      2
      1
      0
      91911
      100428
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
      0
      0
    
    
      8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
      0
      0
      86870
      0
      0



In [18]:

    
#The TaxonomyGroupBy Transform
genus_or_higher.apply(Taxonomy.groupby('genus')).data_df
#the Taxonomy.groupby is a shortcut for the TaxonomyGroupBy transform in the transforms.taxonomy module.









    Out[18]:






  
    
      
      1234
      9876
      sample0
      sample1
      sample2
    
    
      genus
      
      
      
      
      
    
  
  
    
      g__Aspergillus
      2
      225873
      0
      91913
      100428
    
    
      g__Lewia
      0
      0
      86870
      0
      0



In [19]:

    
#Another example of the various Taxonomy attribute filters
exp.dapply(Taxonomy.genus == 'g__Aspergillus')
#only three otus had a genus assigned as 'g__Aspergillus"









    Out[19]:






  
    
      
      1234
      9876
      sample0
      sample1
      sample2
    
  
  
    
      2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
      0
      225872
      0
      2
      0
    
    
      ae0ddda08027454fdb5db77c96b94691b8274cdd
      2
      1
      0
      91911
      100428
    
    
      3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
      0
      0
      0
      0
      0

	1234	9876	sample0	sample1	sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	0	225872	0	2	0
ae0ddda08027454fdb5db77c96b94691b8274cdd	2	1	0	91911	100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	133138	0	0	21	0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3	0	0	0	0	0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0	0	0	86870	0	0

	#SampleID	BarcodeSequence	LinkerPrimerSequence	Description	patient_id	group	asthma	vas	amplicon_conc
#SampleID
sample0	sample0	ACTGAGCG	AAAA	sample0	132	CRSsNP	0	49	4.3
sample1	sample1	AAGAGGCA	AAAA	sample1	315	CRSwNP	1	43	2.3
sample2	sample2	ATCTCAGG	AAAA	sample2	742	CRSsNP	0	23	3.2
1234	1234	ATGCGCAG	AAAA	1234	927	control	1	87	1.0
9876	9876	TAGGCATG	AAAA	9876	538	CRSwNP	1	12	1.3

group	CRSsNP	CRSwNP	control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	0	112937.0	0
ae0ddda08027454fdb5db77c96b94691b8274cdd	50214	45956.0	2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	0	10.5	133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3	0	0.0	0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0	43435	0.0	0

group	CRSsNP	CRSwNP	control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	0	225874	0
ae0ddda08027454fdb5db77c96b94691b8274cdd	100428	91912	2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	0	21	133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3	0	0	0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0	86870	0	0

	kingdom	phylum	class	order	family	genus	species	rank_resolution	tax	otu
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd	k__Fungi	p__Ascomycota	c__Eurotiomycetes	o__Eurotiales	f__Trichocomaceae	g__Aspergillus	s__Aspergillus bombycis	species	k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...	2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
ae0ddda08027454fdb5db77c96b94691b8274cdd	k__Fungi	p__Ascomycota	c__Eurotiomycetes	o__Eurotiales	f__Trichocomaceae	g__Aspergillus	s__unidentified (g__Aspergillus)	genus	k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...	ae0ddda08027454fdb5db77c96b94691b8274cdd
8f52abc02aed2ce6c63be04570a7e609f9cdac5f	k__Fungi	p__Ascomycota	c__Dothideomycetes	o__Pleosporales	f__Pleosporaceae	g__unidentified (f__Pleosporaceae)	s__unidentified (f__Pleosporaceae)	family	k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...	8f52abc02aed2ce6c63be04570a7e609f9cdac5f
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3	k__Fungi	p__Ascomycota	c__Eurotiomycetes	o__Eurotiales	f__Trichocomaceae	g__Aspergillus	s__Aspergillus flavus	species	k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...	3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0	k__Fungi	p__Ascomycota	c__Dothideomycetes	o__Pleosporales	f__Pleosporaceae	g__Lewia	s__unidentified (g__Lewia)	genus	k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...	8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5	k__unidentified (Unassigned)	p__unidentified (Unassigned)	c__unidentified (Unassigned)	o__unidentified (Unassigned)	f__unidentified (Unassigned)	g__unidentified (Unassigned)	s__unidentified (Unassigned)	NaN	No blast hit	54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5