Transforms are objects that "transform" the Experiment object (or the data contained therein) into a new Experiment object by applying a set of data transformations.
You can thus view your workflow as a series of transformations on your Experiment object.
The transform API is based on two methods of the Experiment object: apply and dapply (short for "dataframe apply" or "data apply"). These methods apply your various Transform objects to your Experiment by calling the corresponding methods on the Transform objects themselves: __eapply__ and __dapply__.
The Transform __eapply__ method (short for "experiment apply") should always return a new Experiment object. As explained before, this paradigm is needed to allow method chaining and pipelining of transforms, i.e. the "Fluent interface". Such transforms are applied using the Experiment's apply method.
The __dapply__ method is an additional, optional method on the Transform object, provided as a shorthand for returning a particular dataframe or set of data directly when the Fluent interface is not needed. Typically, this method returns a new pandas DataFrame object (for example, the new "transformed" or "modified" data_df). If not implemented, this method should raise NotImplementedError.
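Under this contract, a minimal custom Transform can be sketched as follows. This is a hedged illustration, not the library's actual base class: the dunder spellings follow the __eapply__ call shown later in this chapter, and the with_data_df helper used to build the new experiment is hypothetical.

```python
import pandas as pd

class ColumnSum:
    """Hypothetical Transform that sums each sample column of data_df."""

    def __eapply__(self, experiment):
        # Return a NEW experiment wrapping the transformed dataframe,
        # so transforms can be chained (the Fluent interface).
        # with_data_df is a hypothetical constructor-helper for illustration.
        return experiment.with_data_df(self.__dapply__(experiment))

    def __dapply__(self, experiment):
        # Shorthand: return just the transformed DataFrame,
        # keeping sample names as columns.
        return pd.DataFrame(experiment.data_df.sum()).transpose()
```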
Examples of transforms are listed in this notebook/chapter.
In [1]:
%load_ext autoreload
%autoreload 2
#Load our data
from omicexperiment.experiment.microbiome import MicrobiomeExperiment
mapping = "example_map.tsv"
biom = "example_fungal.biom"
tax = "blast_tax_assignments.txt"
exp = MicrobiomeExperiment(biom, mapping, tax)
In [2]:
exp.data_df
Out[2]:
In [3]:
exp.mapping_df
Out[3]:
The apply function can process three different types of parameters:
The first type of parameter is to pass a Transform object (or class).
The second type of parameter is a function, which accepts a pandas DataFrame (intended here to be the experiment's data_df). This type of function was made to accept a dataframe, rather than an OmicExperiment object, so that the apply functionality stays compatible with the various pandas and numpy aggregate functions such as sum, mean, etc.
The third type of parameter is a list of Transform objects.
In [4]:
from omicexperiment.transforms.general import RelativeAbundance
exp.apply(RelativeAbundance) #note how the apply method returns a new experiment
Out[4]:
In [5]:
exp.apply(RelativeAbundance).data_df #peek into the transformed data
Out[5]:
In [6]:
from omicexperiment.transforms.general import Rarefaction
exp.apply(Rarefaction(90000)).data_df
Out[6]:
In [7]:
#Here you can see how the Transform API contract works
Rarefaction(90000).__eapply__(exp)
#the above is more or less equivalent to
#exp.apply(Rarefaction(90000))
Out[7]:
In [8]:
#a shorthand on the MicrobiomeExperiment object -- the rarefy method
exp.rarefy(90000).data_df
Out[8]:
The rarefaction functions can also take a num_reps argument (default=1) to set the number of randomized subsampling repetitions to run.
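To make the idea concrete, here is a minimal numpy sketch of what rarefying one sample means: drawing n reads without replacement, in proportion to the observed counts. This is only an illustration of the concept; the library's own implementation may differ in its details.

```python
import numpy as np

def rarefy_counts(counts, n, rng=None):
    """Subsample n reads without replacement from a vector of counts."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts)
    # expand the counts into one entry per read ...
    reads = np.repeat(np.arange(len(counts)), counts)
    # ... draw n of those reads without replacement ...
    chosen = rng.choice(reads, size=n, replace=False)
    # ... and tally them back into per-observation counts
    return np.bincount(chosen, minlength=len(counts))
```

Running this several times (num_reps) and aggregating the results is what reduces the sampling noise introduced by the randomization.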
In [9]:
exp.apply(Rarefaction(n=90000, num_reps=10)).data_df #takes longer of course ;)
Out[9]:
In [10]:
# The SampleGroupBy Transform
#the aggregate function here is the mean
from omicexperiment.transforms.sample import SampleGroupBy
exp.dapply(SampleGroupBy("group"))
Out[10]:
In [11]:
# we can also change the aggfunc from mean to sum
import numpy as np
exp.dapply(SampleGroupBy("group", aggfunc=np.sum))
Out[11]:
The main argument to supply to the ClusterObservations transform is a clusters_df DataFrame with two columns (Series): the first column should be labeled 'observation' and the second column 'cluster'.
The transform reindexes your experiment's dataframe using the provided clusters_df, then performs a groupby on the 'cluster' column using the supplied aggregate function (default is np.sum).
The use case here, of course, is OTU clustering.
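The reindex-then-groupby behaviour described above can be reproduced with plain pandas, which may help clarify what the transform computes. The toy data below is illustrative only:

```python
import pandas as pd

# toy observation-by-sample table
data_df = pd.DataFrame({'sample1': [10, 5, 3], 'sample2': [0, 7, 2]},
                       index=['obsA', 'obsB', 'obsC'])
# toy clusters_df: obsA and obsC belong to the same cluster
clusters_df = pd.DataFrame({'observation': ['obsA', 'obsB', 'obsC'],
                            'cluster': ['otu1', 'otu2', 'otu1']})

# attach each observation's cluster label, then aggregate per cluster
clustered = (data_df
             .join(clusters_df.set_index('observation'))
             .groupby('cluster').sum())
```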
In [12]:
from pandas import DataFrame, Series
otus = exp.data_df.index
print([o for o in otus])
clusters_df = DataFrame({'observation': Series(otus), 'cluster': ['otu1','otu2','otu1','otu1','otu2']})
print(clusters_df)
from omicexperiment.transforms.observation import ClusterObservations
otu_cluster_transform = ClusterObservations(clusters_df)
exp.dapply(otu_cluster_transform)
Out[12]:
You supply the BinObservations transform with a list of observations and a corresponding group name. It then 'bins' (or 'collapses') these observations into a single group using the supplied aggregate function (default is np.sum). It is a subclass of the ClusterObservations transform.
The most common use case is binning a particular set of OTUs falling below a certain percentage of reads or relative abundance, or binning all OTUs except the top 10 or 20 into a group called 'Other'.
In [13]:
from omicexperiment.transforms.observation import BinObservations
exp.dapply(
    BinObservations(['ae0ddda08027454fdb5db77c96b94691b8274cdd',
                     '3cb3c2347cdbe128b645e432f4dcbca702e0e8e3',
                     '8f52abc02aed2ce6c63be04570a7e609f9cdac5f'],
                    groupnames='Other')
)
Out[13]:
In [14]:
from omicexperiment.transforms.taxonomy import TaxonomyGroupBy
#this is an important Transform, as it is used to collapse otus by their taxonomies
#according to the taxonomic rank asked for
exp.dapply(TaxonomyGroupBy('genus')) #any taxonomic rank can be passed
Out[14]:
In [15]:
exp.apply([Rarefaction(90000), RelativeAbundance]).data_df
Out[15]:
Another very basic transform counts the number of unique observations in each sample. This is essentially the "Number of Observed OTUs", or, when applied at the species level, can be described as "species richness".
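In plain pandas terms, this quantity is just a count of the nonzero entries in each sample column; a small illustrative sketch:

```python
import pandas as pd

# toy observation-by-sample table
data_df = pd.DataFrame({'sample1': [12, 0, 3], 'sample2': [5, 5, 5]},
                       index=['otuA', 'otuB', 'otuC'])

# richness per sample = number of observations with nonzero counts
richness = (data_df > 0).sum(axis=0)
```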
In [16]:
#%matplotlib inline
from omicexperiment.transforms.observation import NumberUniqueObservations
exp.apply(NumberUniqueObservations).data_df
Out[16]:
Note how, when we implement a Transform, we try to follow the convention of samples being in the columns rather than the rows (take a look at the source code!).
Passing in a function object (for example, numpy's sum or mean functions) to apply can be thought of as equivalent to the following apply operation on the data_df:
DataFrame( experiment.data_df.apply(func) )
An axis parameter is also available, and is passed through to the data_df's apply function (default axis=0, i.e. down the columns):
DataFrame( experiment.data_df.apply(func, axis=axis) )
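The equivalence above can be illustrated with plain pandas on a toy dataframe; the final transpose keeps sample names as columns, matching the convention noted earlier:

```python
import numpy as np
import pandas as pd

# toy data: two observations (rows) by two samples (columns)
data_df = pd.DataFrame({'sample1': [2, 4], 'sample2': [10, 20]})

# applying np.mean down the columns gives one aggregate value per sample
result = pd.DataFrame(data_df.apply(np.mean))  # samples end up in the index

# transposing keeps the samples as columns
result_t = result.transpose()
```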
In [17]:
import numpy as np
exp.apply(np.mean).data_df
#Note that a transpose() is done on the DataFrame inside the
#apply method, so that sample names remain as columns.
Out[17]:
The RarefactionFunction transform allows us to apply a function on a rarefied dataframe.
Let us use this functionality to apply an alpha diversity function to our data.
The scikit-bio diversity package is an excellent resource for alpha diversity functions (http://scikit-bio.org/docs/latest/diversity.html).
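For orientation, here is what an alpha diversity function like alpha.shannon computes, sketched in plain numpy. scikit-bio's shannon uses log base 2 by default; that convention is assumed here, and this sketch is for illustration only:

```python
import numpy as np

def shannon(counts):
    """Shannon index H = -sum(p * log2(p)) over nonzero proportions p."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()  # proportions of nonzero counts
    return -(p * np.log2(p)).sum()
```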
In [18]:
from omicexperiment.transforms.general import RarefactionFunction
from skbio.diversity import alpha
#we pass the function to the func argument
#the axis to apply to is 0 by default (i.e. on columns = samples)
shannon_rf = RarefactionFunction(n=90000, num_reps=10, func=alpha.shannon, axis=0)
shannon = exp.apply(shannon_rf).data_df
shannon
Out[18]:
In [19]:
#this time we also pass the np.mean function to the agg_rep argument, to aggregate the results
shannon_rf = RarefactionFunction(n=90000, num_reps=10, func=alpha.shannon, axis=0, agg_rep=np.mean)
shannon = exp.apply(shannon_rf).data_df
shannon #note how the result now has a single row, at the intended rarefaction level
Out[19]:
Not the most elegant results for this example dataset, but you get the point!
Compare how the two resulting dataframes are structured. Without agg_rep (Out[18]), the index is a MultiIndex with rarefaction and rep levels: at each repetition, the function is applied at the chosen rarefaction level.
Passing in the agg_rep argument (Out[19]) aggregates/collapses these repetitions.
Rarefaction curves are a very common transformation in microbiome amplicon studies, and are utilized widely in the software package qiime. In omicexperiment, rarefaction curves are implemented as a generalization of RarefactionFunction to multiple levels of rarefaction: the RarefactionCurveFunction transform.
RarefactionCurveFunction allows us to apply functions not just at one rarefaction level (like RarefactionFunction), but at multiple rarefaction levels (i.e. a curve). Its constructor signature is the same as RarefactionFunction's, but it needs an additional step argument to specify the levels of rarefaction desired between 0 and n.
The RarefactionCurveFunction constructor thus takes four main parameters: n, num_reps, step, and func.
The end result is demonstrated below. In the example, we run the "number of unique observations" transform (i.e. observed_otus, or richness) at each level of the curve.
In [20]:
#the function is called number_unique_obs in the transforms subpackage
from omicexperiment.transforms.general import RarefactionCurveFunction
from omicexperiment.transforms.observation import number_unique_obs
num_obs_curve = RarefactionCurveFunction(n=90001, num_reps=1, step=30000, func=number_unique_obs, axis=0)
exp.apply(num_obs_curve).data_df
Out[20]:
We can also use this method to calculate alpha diversity measures as a curve, analogous to the way we used them with RarefactionFunction.
In [21]:
from omicexperiment.transforms.general import RarefactionCurveFunction
#construct a curve with a step of 10000 (from 0 up to n=90000)
#note we add 1 to n to include the last level (90000), as the function uses np.arange internally
shannon_curve = RarefactionCurveFunction(n=90001, num_reps=10, step=10000, func=alpha.shannon, axis=0, agg_rep=np.mean)
exp.apply(shannon_curve).data_df
Out[21]:
In [22]:
#again, if you wish, you can omit the agg_rep argument to keep the
#results of your repetitions in the resulting dataframe
shannon_curve = RarefactionCurveFunction(n=90001, num_reps=10, step=10000, func=alpha.shannon)
exp.apply(shannon_curve).data_df
Out[22]:
For the sake of fun (and science!), you will find that we can very easily, using minimal code, plot a RarefactionCurveFunction dataframe as an actual rarefaction curve plot. The following simplistic example is a demo, awaiting a proper implementation of rarefaction curve plotting in omicexperiment!
In [23]:
%matplotlib inline
import matplotlib.pyplot as plt
#here we plot a richness (number of otus) plot.
num_obs_curve = RarefactionCurveFunction(n=90001, num_reps=1, step=30000, func=number_unique_obs, axis=0)
curve_df = exp.apply(num_obs_curve).data_df
print(curve_df.columns)
for sample in curve_df:
plt.plot(curve_df.index, curve_df[sample])
The distance matrix transforms internally utilize the cdist function from scipy.spatial.distance.
The cdist function accepts either a string naming one of the built-in metrics already provided in scipy (see http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.spatial.distance.cdist.html), or a function that accepts two arguments and calculates the distance between them.
The DistanceMatrix transform constructor accepts the metric (string or function), as explained just above.
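As a self-contained example of what such a call computes, here is scipy's cdist applied directly to toy sample vectors. Note that cdist expects observations in rows, so each sample is a row here; the toy counts are illustrative only:

```python
import numpy as np
from scipy.spatial.distance import cdist

# three samples (rows) with counts for three observations (columns);
# the first two samples are identical, the third shares nothing with them
samples = np.array([[10.0, 0.0, 5.0],
                    [10.0, 0.0, 5.0],
                    [0.0, 15.0, 0.0]])

# pairwise Bray-Curtis distances: sum|u-v| / sum(u+v)
dm = cdist(samples, samples, metric='braycurtis')
```

Identical samples get distance 0, while samples with no shared observations get the maximum Bray-Curtis distance of 1.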
In [24]:
from omicexperiment.transforms.general import DistanceMatrix
rarefied_90000 = exp.rarefy(90000, num_reps=10)
rarefied_90000.apply(DistanceMatrix('braycurtis')).data_df
Out[24]:
You can then apply PCoA on the resulting distance matrix dataframe. Here we utilize scikit-bio's implementation.
In [25]:
from skbio.stats.ordination import pcoa
pcoa_results = pcoa(rarefied_90000.apply(DistanceMatrix('braycurtis')).data_df)
pcoa_results
Out[25]:
PCoA is also implemented in omicexperiment as a Transform. It is based on scikit-bio's excellent implementation, with minor changes.
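For intuition, PCoA is classical multidimensional scaling: double-centre the squared distance matrix, then eigendecompose it. The minimal numpy sketch below illustrates the core idea only; real implementations (including scikit-bio's) add refinements, e.g. corrections for negative eigenvalues, which this illustration omits.

```python
import numpy as np

def pcoa_sketch(dm):
    """Classical MDS on a square distance matrix dm (minimal sketch)."""
    dm = np.asarray(dm, dtype=float)
    n = dm.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (dm ** 2) @ J          # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # coordinates: eigenvectors scaled by sqrt of (non-negative) eigenvalues
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
    return eigvals, coords
```

For a distance matrix that embeds exactly in Euclidean space, the pairwise distances between the returned coordinates reproduce the input distances.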
In [26]:
from omicexperiment.transforms.ordination import PCoA
bray_curtis_90000 = rarefied_90000.apply(DistanceMatrix('braycurtis'))
pcoa_bray_curtis_90000 = bray_curtis_90000.apply(PCoA)
pcoa_bray_curtis_90000.data_df
Out[26]:
In [27]:
#the scikit-bio OrdinationResults object is stored in the experiment's metadata dict under the key 'pcoa'
print(pcoa_bray_curtis_90000.metadata['pcoa'].eigvals)
pcoa_bray_curtis_90000.metadata['pcoa']
Out[27]:
In [28]:
#Plot 2nd vs 3rd principal coordinate
from matplotlib import pyplot
pyplot.scatter(pcoa_bray_curtis_90000.data_df.loc['PC2'], pcoa_bray_curtis_90000.data_df.loc['PC3'])
Out[28]: