Transforms are objects that "transform" the Experiment object (or the data contained therein) into a new Experiment object by applying a set of data transformations.
You can thus view your workflow as a series of transformations on your Experiment object.
The transform API is based on two methods of the Experiment object: apply and dapply (short for "dataframe apply" or "data apply"). These methods apply your various Transform objects to your Experiment by calling the corresponding methods on the Transform objects themselves: __eapply__ and __dapply__.
The Transform __eapply__ method (short for "experiment apply") should always return a new Experiment object. As explained before, this paradigm is needed to allow method chaining and pipelining of transforms, i.e. the "Fluent interface". Such transforms are applied using the Experiment's apply method.
The __dapply__ method is an additional, optional method on the Transform object, provided as a shorthand for returning a particular dataframe or set of data directly when the Fluent interface is not needed. Typically, this method returns a new pandas DataFrame object (for example, the new "transformed" or "modified" data_df). If not implemented, this method should raise NotImplementedError.
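Under this contract, a minimal custom Transform can be sketched as follows. This is a hedged illustration, not the library's actual base class: the dunder spellings follow the __eapply__ call shown later in this chapter, and the with_data_df helper used to build the new experiment is hypothetical.

```python
import pandas as pd

class ColumnSum:
    """Hypothetical Transform that sums each sample column of data_df."""

    def __eapply__(self, experiment):
        # Return a NEW experiment wrapping the transformed dataframe,
        # so transforms can be chained (the Fluent interface).
        # with_data_df is a hypothetical constructor-helper for illustration.
        return experiment.with_data_df(self.__dapply__(experiment))

    def __dapply__(self, experiment):
        # Shorthand: return just the transformed DataFrame,
        # keeping sample names as columns.
        return pd.DataFrame(experiment.data_df.sum()).transpose()
```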
Examples of transforms are listed in this notebook/chapter.
In [1]:
%load_ext autoreload
%autoreload 2
#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment
mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"
exp = MicrobiomeExperiment(biom, mapping, tax)
In [2]:
exp.data_df
Out[2]:
In [3]:
exp.mapping_df
Out[3]:
The apply function can process three different types of parameters:
The first type of parameter is to pass a Transform object (or class).
The second type of parameter is a function, which accepts a pandas DataFrame (intended here to be the experiment's data_df). This type of function was made to accept a dataframe, rather than an OmicExperiment object, so that the apply functionality stays compatible with the various pandas and numpy aggregate functions such as sum, mean, etc.
The third type of parameter is a list of Transform objects.
In [4]:
from omicexperiment.transforms.general import RelativeAbundance
exp.apply(RelativeAbundance) #note how the apply method returns a new experiment
Out[4]:
In [5]:
exp.apply(RelativeAbundance).data_df #peek into the transformed data
Out[5]:
In [6]:
from omicexperiment.transforms.general import Rarefaction
exp.apply(Rarefaction(90000)).data_df
Out[6]:
In [7]:
#Here you can see how the Transform API contract works
Rarefaction(90000).__eapply__(exp)
#the above is more or less equivalent to
#exp.apply(Rarefaction(90000))
Out[7]:
In [8]:
#a shorthand on the MicrobiomeExperiment object -- the rarefy method
exp.rarefy(90000).data_df
Out[8]:
The rarefaction functions can also take a num_reps argument (default=1) to set the number of randomized subsampling repetitions to run.
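To make the idea concrete, here is a minimal numpy sketch of what rarefying one sample means: drawing n reads without replacement, in proportion to the observed counts. This is only an illustration of the concept; the library's own implementation may differ in its details.

```python
import numpy as np

def rarefy_counts(counts, n, rng=None):
    """Subsample n reads without replacement from a vector of counts."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts)
    # expand the counts into one entry per read ...
    reads = np.repeat(np.arange(len(counts)), counts)
    # ... draw n of those reads without replacement ...
    chosen = rng.choice(reads, size=n, replace=False)
    # ... and tally them back into per-observation counts
    return np.bincount(chosen, minlength=len(counts))
```

Running this several times (num_reps) and aggregating the results is what reduces the sampling noise introduced by the randomization.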
In [9]:
exp.apply(Rarefaction(n=90000, num_reps=10)).data_df #takes longer of course ;)
Out[9]:
In [10]:
# The SampleGroupBy Transform
#the aggregate function here is the mean
from omicexperiment.transforms.sample import SampleGroupBy
exp.dapply(SampleGroupBy("group"))
Out[10]:
In [11]:
# we can also change the aggfunc from mean to sum
import numpy as np
exp.dapply(SampleGroupBy("group", aggfunc=np.sum))
Out[11]:
The main argument to supply to the ClusterObservations transform is a clusters_df DataFrame with two columns (Series): the first column should be labeled 'observation' and the second column 'cluster'.
The transform reindexes your experiment's dataframe using the provided clusters_df, then performs a groupby on the 'cluster' column using the supplied aggregate function (default is np.sum).
The use case here, of course, is OTU clustering.
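The reindex-then-groupby behaviour described above can be reproduced with plain pandas, which may help clarify what the transform computes. The toy data below is illustrative only:

```python
import pandas as pd

# toy observation-by-sample table
data_df = pd.DataFrame({'sample1': [10, 5, 3], 'sample2': [0, 7, 2]},
                       index=['obsA', 'obsB', 'obsC'])
# toy clusters_df: obsA and obsC belong to the same cluster
clusters_df = pd.DataFrame({'observation': ['obsA', 'obsB', 'obsC'],
                            'cluster': ['otu1', 'otu2', 'otu1']})

# attach each observation's cluster label, then aggregate per cluster
clustered = (data_df
             .join(clusters_df.set_index('observation'))
             .groupby('cluster').sum())
```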
In [12]:
from pandas import DataFrame, Series
otus = exp.data_df.index
print([o for o in otus])
clusters_df = DataFrame({'observation': Series(otus), 'cluster': ['otu1','otu2','otu1','otu1','otu2']})
print(clusters_df)
from omicexperiment.transforms.observation import ClusterObservations
otu_cluster_transform = ClusterObservations(clusters_df)
exp.dapply(otu_cluster_transform)
Out[12]:
You supply the BinObservations transform with a list of observations and a corresponding group name. It then 'bins' (or 'collapses') these observations into a single group using the supplied aggregate function (default is np.sum). It is a subclass of the ClusterObservations transform.
The most common use case is binning a particular set of OTUs falling below a certain percentage of reads or relative abundance, or binning all OTUs except the top 10 or 20 into a group called 'Other'.
In [13]:
from omicexperiment.transforms.observation import BinObservations
exp.dapply(
    BinObservations(['ae0ddda08027454fdb5db77c96b94691b8274cdd',
                     '3cb3c2347cdbe128b645e432f4dcbca702e0e8e3',
                     '8f52abc02aed2ce6c63be04570a7e609f9cdac5f'],
                    groupnames='Other')
)
Out[13]:
In [14]:
from omicexperiment.transforms.taxonomy import TaxonomyGroupBy
#this is an important Transform, as it is used to collapse otus by their taxonomies
#according to the taxonomic rank asked for
exp.dapply(TaxonomyGroupBy('genus')) #any taxonomic rank can be passed
Out[14]:
In [15]:
exp.apply([Rarefaction(90000), RelativeAbundance]).data_df
Out[15]:
Another very basic transform counts the number of unique observations in each sample. This is essentially the "Number of Observed OTUs", or, when applied at the species level, can be described as "species richness".
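In plain pandas terms, this quantity is just a count of the nonzero entries in each sample column; a small illustrative sketch:

```python
import pandas as pd

# toy observation-by-sample table
data_df = pd.DataFrame({'sample1': [12, 0, 3], 'sample2': [5, 5, 5]},
                       index=['otuA', 'otuB', 'otuC'])

# richness per sample = number of observations with nonzero counts
richness = (data_df > 0).sum(axis=0)
```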
In [16]:
#%matplotlib inline
from omicexperiment.transforms.observation import NumberUniqueObservations
exp.apply(NumberUniqueObservations).data_df
Out[16]:
Note how, when we implement a Transform, we try to follow the convention of samples being in the columns rather than the rows (take a look at the source code!).
Passing in a function object (for example, numpy's sum or mean functions) to apply can be thought of as equivalent to the following apply operation on the data_df:
DataFrame( experiment.data_df.apply(func) )
An axis parameter is also available, and is passed through to the data_df's apply function (default axis=0, i.e. down the columns):
DataFrame( experiment.data_df.apply(func, axis=axis) )
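The equivalence above can be illustrated with plain pandas on a toy dataframe; the final transpose keeps sample names as columns, matching the convention noted earlier:

```python
import numpy as np
import pandas as pd

# toy data: two observations (rows) by two samples (columns)
data_df = pd.DataFrame({'sample1': [2, 4], 'sample2': [10, 20]})

# applying np.mean down the columns gives one aggregate value per sample
result = pd.DataFrame(data_df.apply(np.mean))  # samples end up in the index

# transposing keeps the samples as columns
result_t = result.transpose()
```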
In [17]:
import numpy as np
exp.apply(np.mean).data_df
#Note that a transpose() is done on the DataFrame inside the
#apply method, so that sample names remain as columns.
Out[17]:
The RarefactionFunction transform allows us to apply a function on a rarefied dataframe.
Let us use this functionality to apply an alpha diversity function to our data.
The scikit-bio diversity package is an excellent resource for alpha diversity functions (http://scikit-bio.org/docs/latest/diversity.html).
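For orientation, here is what an alpha diversity function like alpha.shannon computes, sketched in plain numpy. scikit-bio's shannon uses log base 2 by default; that convention is assumed here, and this sketch is for illustration only:

```python
import numpy as np

def shannon(counts):
    """Shannon index H = -sum(p * log2(p)) over nonzero proportions p."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # proportions of nonzero counts
    return -(p * np.log2(p)).sum()
```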
In [18]:
from omicexperiment.transforms.general import RarefactionFunction
from skbio.diversity import alpha
#we pass the function to the func argument
#the axis to apply to is 0 by default (i.e. on columns = samples)
shannon_rf = RarefactionFunction(n=90000, num_reps=10, func=alpha.shannon, axis=0)
shannon = exp.apply(shannon_rf).data_df
shannon
Out[18]:
In [19]:
#this time we also pass the np.mean function to the agg_rep argument, to aggregate the results
shannon_rf = RarefactionFunction(n=90000, num_reps=10, func=alpha.shannon, axis=0, agg_rep=np.mean)
shannon = exp.apply(shannon_rf).data_df
shannon #note how the result now has a single row, at the intended rarefaction level
Out[19]:
Not the most elegant results for this example dataset, but you get the point!
Compare how the two resulting dataframes are structured. Without agg_rep (Out[18]), the index is a MultiIndex with rarefaction and rep levels: at each repetition, the function is applied at the chosen rarefaction level.
Passing in the agg_rep argument (Out[19]) aggregates/collapses these repetitions.
Rarefaction curves are a very common transformation in microbiome amplicon studies, and are utilized widely in the software package qiime. In omicexperiment, rarefaction curves are implemented as a generalization of RarefactionFunction to multiple levels of rarefaction: the RarefactionCurveFunction transform.
RarefactionCurveFunction allows us to apply functions not just at one rarefaction level (like RarefactionFunction), but at multiple rarefaction levels (i.e. a curve). Its constructor signature is the same as RarefactionFunction's, but it needs an additional step argument to specify the levels of rarefaction desired between 0 and n.
The RarefactionCurveFunction constructor thus takes four main parameters: n, num_reps, step, and func.
The end result is demonstrated below. In the example, we run the "number of unique observations" transform (i.e. observed_otus, or richness) at each level of the curve.
In [20]:
#the function is called number_unique_obs in the transforms subpackage
from omicexperiment.transforms.general import RarefactionCurveFunction
from omicexperiment.transforms.observation import number_unique_obs
num_obs_curve = RarefactionCurveFunction(n=90001, num_reps=1, step=30000, func=number_unique_obs, axis=0)
exp.apply(num_obs_curve).data_df
Out[20]:
We can also use this method to calculate alpha diversity measures as a curve, analogous to the way we used them with RarefactionFunction.
In [21]:
from omicexperiment.transforms.general import RarefactionCurveFunction
#construct a curve with a step of 10000 (from 0 up to n=90000)
#note we add 1 to n to include the last level (90000), as the function uses np.arange internally
shannon_curve = RarefactionCurveFunction(n=90001, num_reps=10, step=10000, func=alpha.shannon, axis=0, agg_rep=np.mean)
exp.apply(shannon_curve).data_df
Out[21]:
In [22]:
#again, if you wish, you can omit the agg_rep argument to keep the
#results of your repetitions in the resulting dataframe
shannon_curve = RarefactionCurveFunction(n=90001, num_reps=10, step=10000, func=alpha.shannon)
exp.apply(shannon_curve).data_df
Out[22]:
For the sake of fun (and science!), you will find that we can very easily, using minimal code, plot a RarefactionCurveFunction dataframe as an actual rarefaction curve plot. The following simplistic example is a demo, awaiting a proper implementation of rarefaction curve plotting in omicexperiment!
In [23]:
%matplotlib inline
import matplotlib.pyplot as plt
#here we plot a richness (number of otus) plot.
num_obs_curve = RarefactionCurveFunction(n=90001, num_reps=1, step=30000, func=number_unique_obs, axis=0)
curve_df = exp.apply(num_obs_curve).data_df
print(curve_df.columns)
for sample in curve_df:
plt.plot(curve_df.index, curve_df[sample])
The distance matrix transforms internally utilize the cdist function from scipy.spatial.distance.
The cdist function accepts either a string naming one of the built-in metrics already provided in scipy (see http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.cdist.html), or a function that accepts two arguments and calculates the distance between them.
The DistanceMatrix transform constructor accepts the metric (string or function), as explained just above.
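As a self-contained example of what such a call computes, here is scipy's cdist applied directly to toy sample vectors. Note that cdist expects observations in rows, so each sample is a row here; the toy counts are illustrative only:

```python
import numpy as np
from scipy.spatial.distance import cdist

# three samples (rows) with counts for three observations (columns);
# the first two samples are identical, the third shares nothing with them
samples = np.array([[10.0, 0.0, 5.0],
                    [10.0, 0.0, 5.0],
                    [0.0, 15.0, 0.0]])

# pairwise Bray-Curtis distances: sum|u-v| / sum(u+v)
dm = cdist(samples, samples, metric='braycurtis')
```

Identical samples get distance 0, while samples with no shared observations get the maximum Bray-Curtis distance of 1.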
In [24]:
from omicexperiment.transforms.general import DistanceMatrix
rarefied_90000 = exp.rarefy(90000, num_reps=10)
rarefied_90000.apply(DistanceMatrix('braycurtis')).data_df
Out[24]:
You can then apply PCoA on the resulting distance matrix dataframe. Here we utilize scikit-bio's implementation.
In [25]:
from skbio.stats.ordination import pcoa
pcoa_results = pcoa(rarefied_90000.apply(DistanceMatrix('braycurtis')).data_df)
pcoa_results
Out[25]:
PCoA is also implemented in omicexperiment as a Transform. It is based on scikit-bio's excellent implementation, with minor changes.
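For intuition, PCoA is classical multidimensional scaling: double-centre the squared distance matrix, then eigendecompose it. The minimal numpy sketch below illustrates the core idea only; real implementations (including scikit-bio's) add refinements, e.g. corrections for negative eigenvalues, which this illustration omits.

```python
import numpy as np

def pcoa_sketch(dm):
    """Classical MDS on a square distance matrix dm (minimal sketch)."""
    dm = np.asarray(dm, dtype=float)
    n = dm.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (dm ** 2) @ J          # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # coordinates: eigenvectors scaled by sqrt of (non-negative) eigenvalues
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
    return eigvals, coords
```

For a distance matrix that embeds exactly in Euclidean space, the pairwise distances between the returned coordinates reproduce the input distances.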
In [26]:
from omicexperiment.transforms.ordination import PCoA
bray_curtis_90000 = rarefied_90000.apply(DistanceMatrix('braycurtis'))
pcoa_bray_curtis_90000 = bray_curtis_90000.apply(PCoA)
pcoa_bray_curtis_90000.data_df
Out[26]:
In [27]:
#the scikit-bio OrdinationResults object is stored in the experiment's metadata dict under the key 'pcoa'
print(pcoa_bray_curtis_90000.metadata['pcoa'].eigvals)
pcoa_bray_curtis_90000.metadata['pcoa']
Out[27]:
In [28]:
#Plot 2nd vs 3rd principal coordinate
from matplotlib import pyplot
pyplot.scatter(pcoa_bray_curtis_90000.data_df.loc['PC2'], pcoa_bray_curtis_90000.data_df.loc['PC3'])
Out[28]: