Experiment object filters: the rationale

A syntactic sugar API is provided for various common filtering operations on the components of the experiment objects (for example the three dataframes of the MicrobiomeExperiment object).

The rationale for providing such syntactic sugar in the API is that working with three dataframes at the same time can be taxing, as together they describe multidimensional data (and metadata!).

Again, rapid analysis and economy of typing are the ultimate aspiration here: avoiding repetitive boilerplate, especially since almost the same operations are required in downstream analyses across typical omic experiments.

This notebook/chapter will provide various examples (currently for MicrobiomeExperiment), and can be regarded as a cookbook for various operations performed in a microbial amplicon metabarcoding experiment.



In [1]:

    
%load_ext autoreload
%autoreload 2

#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment

mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"

exp = MicrobiomeExperiment(biom, mapping, tax)

Experiment filters

The filter functionality rests on two methods of the experiment objects. The first method is called 'filter'. The second method is called 'efilter' (short for "experiment filter").

The filter method filters the data_df according to the parameters passed, as explained below. Remember that the data_df is our main dataframe (our contingency table or matrix): filter acts only on the data_df and not, for example, on the mapping_df. The method returns a new pandas DataFrame object (the new "filtered" or "modified" data_df).

The efilter method provides the same functionality as filter. The only difference is that efilter returns a whole new experiment object rather than a stand-alone data DataFrame object. As explained before, this paradigm is helpful because it allows method chaining. Note that the code cells in this notebook call dapply where this text says filter, and apply where it says efilter.
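The difference between the two return styles can be sketched with a toy class. Everything here is hypothetical (the `ToyExperiment` class, plain dicts standing in for DataFrames, and predicates written as plain functions); it only illustrates the paradigm and is not the library's implementation.

```python
class ToyExperiment:
    """Hypothetical stand-in: data maps sample -> {otu: count},
    mapping maps sample -> metadata dict."""

    def __init__(self, data, mapping):
        self.data = data
        self.mapping = mapping

    def filter(self, keep_sample):
        # Returns only the filtered data table; nothing else is touched.
        return {s: c for s, c in self.data.items() if keep_sample(self, s)}

    def efilter(self, keep_sample):
        # Wraps the filtered data in a new experiment, so calls can be chained.
        return ToyExperiment(self.filter(keep_sample), self.mapping)


def is_asthmatic(exp, sample):
    return exp.mapping[sample]["asthma"] == 1


def is_deep(exp, sample):
    return sum(exp.data[sample].values()) > 100


exp_toy = ToyExperiment(
    data={"s1": {"otu_a": 5}, "s2": {"otu_a": 50}, "s3": {"otu_a": 500}},
    mapping={"s1": {"asthma": 1}, "s2": {"asthma": 0}, "s3": {"asthma": 1}},
)

# efilter returns an experiment, so a second filter can follow directly.
chained = exp_toy.efilter(is_asthmatic).filter(is_deep)
print(chained)  # {'s3': {'otu_a': 500}}
```

The original object is left unmodified in both cases; only the type of the return value differs.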

The filters subpackage

From the filters subpackage, you can import the various filters:

from omicexperiment.transforms.filters import Sample
from omicexperiment.transforms.filters import Observation 
from omicexperiment.transforms.filters import Taxonomy

These "filters" are also provided on the MicrobiomeExperiment object, as shortcuts.

Taxonomy = exp.Taxonomy
#OR
from omicexperiment.transforms.filters import Taxonomy

What are filters?

Filters are classes (they can also be objects/instances) that hold attributes which are subclasses of FilterExpression. FilterExpressions can be considered fairly magical, as they use operator overloading to provide a shorthand, sugary syntax for applying various operations to the experiment dataframes.
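To make the operator-overloading idea concrete, here is a minimal sketch of a deferred comparison. The `CountExpression` class and its `evaluate` method are invented for illustration and are not the library's FilterExpression; the counts are the per-sample columns of the example data_df.

```python
import operator


class CountExpression:
    """Hypothetical deferred comparison on per-sample total counts."""

    def __init__(self, op=None, value=None):
        self.op = op
        self.value = value

    # Comparing does not compute anything yet; it records operator and operand.
    def __gt__(self, value):
        return CountExpression(operator.gt, value)

    def __lt__(self, value):
        return CountExpression(operator.lt, value)

    def __eq__(self, value):
        return CountExpression(operator.eq, value)

    __hash__ = None  # overloading __eq__ this way makes instances unhashable

    def evaluate(self, table):
        # table maps sample -> list of OTU counts; keep samples whose total passes.
        return [s for s, counts in table.items() if self.op(sum(counts), self.value)]


table = {
    "sample0": [0, 0, 0, 0, 86870],
    "sample1": [2, 91911, 21, 0, 0],
    "sample2": [0, 100428, 0, 0, 0],
    "1234": [0, 2, 133138, 0, 0],
    "9876": [225872, 1, 0, 0, 0],
}

count = CountExpression()
print((count > 90000).evaluate(table))    # ['sample1', 'sample2', '1234', '9876']
print((count == 100428).evaluate(table))  # ['sample2']
```

The expression `count > 90000` builds an object describing the test, which is applied to the data only later. That separation is what lets a filter be written once and handed to a method on the experiment.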

The three filters

Taxonomy filter: apply various operations on the taxonomy
Sample filter: apply various operations on samples/sample metadata
Observation filter: apply various operations on observations (i.e. OTUs in a microbiome context)

The only way to get the gist of how these work is perhaps to view the code examples.



In [2]:

    
exp.data_df









    Out[2]:

                                          sample0  sample1  sample2    1234    9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd        0        2        0       0  225872
ae0ddda08027454fdb5db77c96b94691b8274cdd        0    91911   100428       2       1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f        0       21        0  133138       0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3        0        0        0       0       0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0    86870        0        0       0       0



In [3]:

    
exp.mapping_df









    Out[3]:

           #SampleID  BarcodeSequence  LinkerPrimerSequence  Description  patient_id  group    asthma  vas  amplicon_conc
#SampleID
sample0    sample0    ACTGAGCG         AAAA                  sample0      132         CRSsNP   0       49   4.3
sample1    sample1    AAGAGGCA         AAAA                  sample1      315         CRSwNP   1       43   2.3
sample2    sample2    ATCTCAGG         AAAA                  sample2      742         CRSsNP   0       23   3.2
1234       1234       ATGCGCAG         AAAA                  1234         927         control  1       87   1.0
9876       9876       TAGGCATG         AAAA                  9876         538         CRSwNP   1       12   1.3

Sample Filter examples



In [4]:

    
Sample = exp.Sample
#OR
from omicexperiment.transforms.filters import Sample



In [5]:

    
#1. the count filter
exp.dapply(Sample.count > 90000) #note sample0 was filtered off as count = 86870
# this filter implements various operators









    Out[5]:

                                          sample1  sample2    1234    9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd        2        0       0  225872
ae0ddda08027454fdb5db77c96b94691b8274cdd    91911   100428       2       1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f       21        0  133138       0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3        0        0       0       0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0        0        0       0       0



In [6]:

    
#the count filter implements various operators (due to the FlexibleOperator mixin)
#here we try the __eq__ (==) operator; in the cell above we tried the > operator
exp.dapply(Sample.count == 100428)









    Out[6]:

                                          sample2
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd        0
ae0ddda08027454fdb5db77c96b94691b8274cdd   100428
8f52abc02aed2ce6c63be04570a7e609f9cdac5f        0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3        0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0        0



In [7]:

    
#2. the att (attribute) filter
#   this filters on the "attributes" (i.e. metadata) of the samples
#   present in the mapping dataframe
#   this uses an attribute access (dotted) syntax
#here we only select samples in the 'control' group
exp.dapply(Sample.att.group == 'control') #only one sample in this group









    Out[7]:

#SampleID                                   1234
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd       0
ae0ddda08027454fdb5db77c96b94691b8274cdd       2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f  133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3       0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0       0



In [8]:

    
#select only samples of asthmatic patients
exp.dapply(Sample.att.asthma == 1) #only three asthma-positive samples









    Out[8]:

#SampleID                                 sample1    1234    9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd        2       0  225872
ae0ddda08027454fdb5db77c96b94691b8274cdd    91911       2       1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f       21  133138       0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3        0       0       0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0        0       0       0



In [9]:

    
#another alias for the att filter is the c attribute on the Sample Filter
#(c is short for "column", as per sqlalchemy convention)
exp.dapply(Sample.c.asthma == 1) #only three asthma-positive samples









    Out[9]:

#SampleID                                 sample1    1234    9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd        2       0  225872
ae0ddda08027454fdb5db77c96b94691b8274cdd    91911       2       1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f       21  133138       0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3        0       0       0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0        0       0       0



In [10]:

    
#some columns may not be legal python attribute names,
#so for these we allow the [] (__getitem__) syntax
exp.dapply(Sample.att['#SampleID'] == 'sample0')









    Out[10]:

#SampleID                                 sample0
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd        0
ae0ddda08027454fdb5db77c96b94691b8274cdd        0
8f52abc02aed2ce6c63be04570a7e609f9cdac5f        0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3        0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0    86870
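The dotted and bracketed forms used in the cells above can both be supported by one small proxy object. The sketch below is an assumption about how such an accessor might work, not the library's code; `AttributeProxy`, `Column` and `ColumnPredicate` are invented names, and the rows are a subset of the mapping_df columns.

```python
class ColumnPredicate:
    """Hypothetical record of 'column == value', evaluated later."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def matching_samples(self, mapping_rows):
        return [sid for sid, row in mapping_rows.items()
                if row[self.column] == self.value]


class Column:
    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return ColumnPredicate(self.name, value)

    __hash__ = None


class AttributeProxy:
    def __getattr__(self, name):
        # Only called when normal lookup fails, so any legal identifier works.
        return Column(name)

    def __getitem__(self, name):
        # Bracket access covers names that are not legal identifiers.
        return Column(name)


att = AttributeProxy()
mapping_rows = {
    "sample0": {"#SampleID": "sample0", "group": "CRSsNP", "asthma": 0},
    "sample1": {"#SampleID": "sample1", "group": "CRSwNP", "asthma": 1},
    "sample2": {"#SampleID": "sample2", "group": "CRSsNP", "asthma": 0},
    "1234": {"#SampleID": "1234", "group": "control", "asthma": 1},
    "9876": {"#SampleID": "9876", "group": "CRSwNP", "asthma": 1},
}

print((att.asthma == 1).matching_samples(mapping_rows))
# ['sample1', '1234', '9876']
print((att["#SampleID"] == "sample0").matching_samples(mapping_rows))
# ['sample0']
```

A column named `#SampleID` cannot be written as `att.#SampleID` because `#` starts a comment in Python, which is why the bracket form is needed.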

An example of method chaining using efilter (apply in the cell below) instead of filter



In [11]:

    
exp.apply(Sample.c.asthma == 1).dapply(Sample.count > 100000) #two samples









    Out[11]:

#SampleID                                   1234    9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd       0  225872
ae0ddda08027454fdb5db77c96b94691b8274cdd       2       1
8f52abc02aed2ce6c63be04570a7e609f9cdac5f  133138       0
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3       0       0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0       0       0



In [12]:

    
# the Sample groupby filter
#the aggregate function here is the mean;
#the result is then normalized to 100 (mean relative abundance)
exp.dapply(Sample.groupby("group"))









    Out[12]:

group                                        CRSsNP     CRSwNP    control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd   0.000000  71.072695   0.000000
ae0ddda08027454fdb5db77c96b94691b8274cdd  53.619366  28.920697   0.001502
8f52abc02aed2ce6c63be04570a7e609f9cdac5f   0.000000   0.006608  99.998498
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3   0.000000   0.000000   0.000000
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0  46.380634   0.000000   0.000000
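The order of operations matters here: the raw counts are averaged within each group first, and each group column is then rescaled to sum to 100. The following plain-Python sketch (the function name is hypothetical, and dicts stand in for DataFrames) reproduces the figures above from the counts in data_df and the group column of mapping_df.

```python
# Counts per sample, listed in the OTU row order of data_df.
counts = {
    "sample0": [0, 0, 0, 0, 86870],
    "sample1": [2, 91911, 21, 0, 0],
    "sample2": [0, 100428, 0, 0, 0],
    "1234": [0, 2, 133138, 0, 0],
    "9876": [225872, 1, 0, 0, 0],
}
group_of = {
    "sample0": "CRSsNP",
    "sample1": "CRSwNP",
    "sample2": "CRSsNP",
    "1234": "control",
    "9876": "CRSwNP",
}


def grouped_relative_abundance(counts, group_of):
    # Collect the count columns belonging to each group.
    members = {}
    for sample, group in group_of.items():
        members.setdefault(group, []).append(counts[sample])

    result = {}
    for group, columns in members.items():
        # Step 1: mean count of each OTU across the group's samples.
        means = [sum(values) / len(columns) for values in zip(*columns)]
        # Step 2: rescale the means so the group column sums to 100.
        total = sum(means)
        result[group] = [100 * m / total for m in means]
    return result


rel = grouped_relative_abundance(counts, group_of)
print([round(v, 6) for v in rel["CRSsNP"]])
# [0.0, 53.619366, 0.0, 0.0, 46.380634]
```

Normalizing each sample to 100 first and averaging afterwards would give 50/50 for CRSsNP instead, since sample0 and sample2 are each dominated by a single OTU.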



In [13]:

    
# the Sample groupby_sum filter
#the aggregate function here is the sum -- no normalization is applied
exp.dapply(Sample.groupby_sum("group"))









    Out[13]:

group                                     CRSsNP  CRSwNP  control
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd       0  225874        0
ae0ddda08027454fdb5db77c96b94691b8274cdd  100428   91912        2
8f52abc02aed2ce6c63be04570a7e609f9cdac5f       0      21   133138
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3       0       0        0
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0   86870       0        0

Taxonomy filters

Taxonomy filters allow common operations on the taxonomy metadata of the Observations/OTUs.



In [14]:

    
Taxonomy = exp.Taxonomy #OR from omicexperiment.transforms.filters import Taxonomy

exp.taxonomy_df









    Out[14]:

                                          kingdom       phylum                        class                         order                         family                        genus                               species                             rank_resolution  tax                                                otu
otu
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd  k__Fungi      p__Ascomycota                 c__Eurotiomycetes             o__Eurotiales                 f__Trichocomaceae             g__Aspergillus                      s__Aspergillus bombycis             species          k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...  2f328e48f4252bbade0dd7f66b0d5bf1b09617dd
ae0ddda08027454fdb5db77c96b94691b8274cdd  k__Fungi      p__Ascomycota                 c__Eurotiomycetes             o__Eurotiales                 f__Trichocomaceae             g__Aspergillus                      s__unidentified (g__Aspergillus)    genus            k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...  ae0ddda08027454fdb5db77c96b94691b8274cdd
8f52abc02aed2ce6c63be04570a7e609f9cdac5f  k__Fungi      p__Ascomycota                 c__Dothideomycetes            o__Pleosporales               f__Pleosporaceae              g__unidentified (f__Pleosporaceae)  s__unidentified (f__Pleosporaceae)  family           k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...  8f52abc02aed2ce6c63be04570a7e609f9cdac5f
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3  k__Fungi      p__Ascomycota                 c__Eurotiomycetes             o__Eurotiales                 f__Trichocomaceae             g__Aspergillus                      s__Aspergillus flavus               species          k__Fungi;p__Ascomycota;c__Eurotiomycetes;o__Eu...  3cb3c2347cdbe128b645e432f4dcbca702e0e8e3
8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0  k__Fungi      p__Ascomycota                 c__Dothideomycetes            o__Pleosporales               f__Pleosporaceae              g__Lewia                            s__unidentified (g__Lewia)          genus            k__Fungi;p__Ascomycota;c__Dothideomycetes;o__P...  8e9a3b9a9d91e86f21da1bd57b8ae4486c78bbe0
54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5  No blast hit  p__unidentified (Unassigned)  c__unidentified (Unassigned)  o__unidentified (Unassigned)  f__unidentified (Unassigned)  g__unidentified (Unassigned)        s__unidentified (Unassigned)        NaN              No blast hit                                       54c89100c4ebc5b2fceebd3bd9a857fe07dfedb5



In [15]:

    
#1. the taxonomy groupby filter
#this filter is very important, as it is used to collapse otus by their taxonomies
#according to the taxonomic rank asked for
exp.dapply(Taxonomy.groupby('genus')) #any taxonomic rank can be passed









    Out[15]:

                                    sample0  sample1  sample2    1234    9876
genus
g__Aspergillus                            0    91913   100428       2  225873
g__Lewia                              86870        0        0       0       0
g__unidentified (f__Pleosporaceae)        0       21        0  133138       0
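Judging by the output above, collapsing by a taxonomic rank sums the counts of all OTUs that share the same label at that rank. A minimal plain-Python sketch of that operation follows; the function name is hypothetical, the OTU ids are abbreviated to their first four characters, and dicts stand in for the data_df and taxonomy_df.

```python
# Counts per OTU for sample0, sample1, sample2, 1234, 9876.
otu_counts = {
    "2f32": [0, 2, 0, 0, 225872],
    "ae0d": [0, 91911, 100428, 2, 1],
    "8f52": [0, 21, 0, 133138, 0],
    "3cb3": [0, 0, 0, 0, 0],
    "8e9a": [86870, 0, 0, 0, 0],
}
# Genus column of taxonomy_df for the same OTUs.
genus_of = {
    "2f32": "g__Aspergillus",
    "ae0d": "g__Aspergillus",
    "8f52": "g__unidentified (f__Pleosporaceae)",
    "3cb3": "g__Aspergillus",
    "8e9a": "g__Lewia",
}


def collapse_by_taxon(otu_counts, taxon_of):
    collapsed = {}
    for otu, row in otu_counts.items():
        taxon = taxon_of[otu]
        previous = collapsed.get(taxon, [0] * len(row))
        # Add this OTU's counts to the running total for its taxon.
        collapsed[taxon] = [a + b for a, b in zip(previous, row)]
    return collapsed


by_genus = collapse_by_taxon(otu_counts, genus_of)
print(by_genus["g__Aspergillus"])  # [0, 91913, 100428, 2, 225873]
```

The three Aspergillus OTUs merge into one row, so five OTU rows become three genus rows while every sample total stays the same.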



In [16]:

    
'''
We noticed above that one of the assignments was resolved only to the
family level.
We can use 2. the taxonomy attribute filters to remove OTUs that were
classified at a lower resolution than genus before continuing with
downstream analyses.
'''
genus_or_higher = exp.apply(Taxonomy.rank_resolution >= 'genus') #note efilter
genus_or_higher.apply(Taxonomy.groupby('genus')).data_df









    Out[16]:

                sample0  sample1  sample2    1234    9876
genus
g__Aspergillus        0    91913   100428       2  225873
g__Lewia          86870        0        0       0       0
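A comparison such as `>= 'genus'` only makes sense if ranks are compared by their position in the taxonomic hierarchy rather than alphabetically. The sketch below shows one way to do that. It is an illustration under stated assumptions, not the library's implementation: the `RANKS` list and the `at_least` helper are invented, OTU ids are abbreviated, and treating a missing resolution (NaN in the taxonomy_df, `None` here) as failing the test is an assumption.

```python
# Ranks ordered from least to most specific.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def at_least(resolution, minimum):
    """True if `resolution` is as specific as `minimum`, or more so."""
    if resolution not in RANKS:  # missing value, e.g. no blast hit
        return False
    return RANKS.index(resolution) >= RANKS.index(minimum)


# rank_resolution column of taxonomy_df.
rank_resolution = {
    "2f32": "species",
    "ae0d": "genus",
    "8f52": "family",
    "3cb3": "species",
    "8e9a": "genus",
    "54c8": None,
}

kept = [otu for otu, rank in rank_resolution.items() if at_least(rank, "genus")]
print(kept)  # ['2f32', 'ae0d', '3cb3', '8e9a']
```

A plain string comparison would be misleading: `"kingdom" >= "genus"` is true alphabetically, although kingdom is a far coarser rank than genus.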



In [17]:

    
#Another example of the various Taxonomy attribute filters
exp.dapply(Taxonomy.genus == 'g__Aspergillus')
#only three otus had a genus assigned as 'g__Aspergillus'









    Out[17]:

                                          sample0  sample1  sample2    1234    9876
2f328e48f4252bbade0dd7f66b0d5bf1b09617dd        0        2        0       0  225872
ae0ddda08027454fdb5db77c96b94691b8274cdd        0    91911   100428       2       1
3cb3c2347cdbe128b645e432f4dcbca702e0e8e3        0        0        0       0       0

Missing Taxonomic operations functionality

In future versions, diversity


