This notebook contains some experiments with collapsing replicate samples for analysis with QIIME. Ideas for this are being discussed in QIIME issue #1678.
This involves two steps:
1. collapsing sample metadata, and
2. collapsing samples in a BIOM table.
I'm approaching this by first using the sample metadata (in a pandas DataFrame) to determine which sample ids belong to a replicate group, and then using those to collapse the BIOM table.
First we'll collapse samples in the mapping file with pandas.
In [1]:
import pandas as pd
import numpy as np
In [2]:
sample_md = pd.read_csv('tiny-test/map', sep='\t')
In [3]:
sample_md
Out[3]:
First we'll group the sample ids.
In [4]:
grouped = sample_md.groupby(['subject', 'replicate'])
Next we'll aggregate. You can specify an aggregation function per metadata category. Here we want to retain the grouping of input sample ids, so I collect the sample ids in each group into a tuple while aggregating.
In [8]:
collapsed_md = grouped.agg({'days_since_epoch':'mean', 'SampleType':'first', '#SampleID':lambda x: tuple(x)})
collapsed_md
Out[8]:
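Since the Out cells aren't rendered in this export, here's a minimal standalone sketch of the same groupby/agg pattern on toy data (all column values below are hypothetical, chosen only to mirror the mapping file's columns):

```python
import pandas as pd

# Toy sample metadata mimicking the mapping file (hypothetical values).
sample_md = pd.DataFrame({
    '#SampleID': ['s1', 's2', 's3', 's4'],
    'subject': ['A', 'A', 'B', 'B'],
    'replicate': [1, 1, 1, 1],
    'SampleType': ['gut', 'gut', 'skin', 'skin'],
    'days_since_epoch': [10, 20, 30, 50],
})

collapsed_md = sample_md.groupby(['subject', 'replicate']).agg({
    'days_since_epoch': 'mean',   # average the numeric category
    'SampleType': 'first',        # keep the first value per group
    '#SampleID': tuple,           # retain the group's input sample ids
})
print(collapsed_md)
```

Here samples s1/s2 and s3/s4 collapse into one row each, with the sample ids preserved as tuples.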
In [16]:
for e in collapsed_md.index:
    print(e, set(collapsed_md['#SampleID'][e]))
In [66]:
collapsed_md = collapsed_md.reset_index()
collapsed_md
Out[66]:
In [7]:
sample_id_groups = collapsed_md['#SampleID']
print(sample_id_groups)
In [71]:
set(sample_id_groups)
Out[71]:
In [8]:
sid_to_group_id = {}
for group_id, sample_ids in sample_id_groups.items():
    for sample_id in sample_ids:
        sid_to_group_id[sample_id] = group_id
print(sid_to_group_id)
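The inversion from group id → sample ids into sample id → group id can be sketched standalone with a hypothetical group-to-ids mapping:

```python
# Hypothetical group -> sample-id tuples, like the aggregation produces.
sample_id_groups = {
    ('A', 1): ('s1', 's2'),
    ('B', 1): ('s3', 's4'),
}

# Invert the mapping so each sample id points back at its replicate group.
sid_to_group_id = {}
for group_id, sample_ids in sample_id_groups.items():
    for sample_id in sample_ids:
        sid_to_group_id[sample_id] = group_id

print(sid_to_group_id)
```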
Now we're ready to collapse replicate groups in our BIOM file.
In [9]:
from biom import load_table
In [10]:
t = load_table('tiny-test/biom')
In [33]:
print(t)
In [34]:
def partition_f(id_, md):
    # map each sample id to its replicate group id
    return sid_to_group_id[id_]

print(t.collapse(partition_f, norm=False))
By default, observation counts are summed when collapsing samples. Some other common options are illustrated here.
In [49]:
def collapse_to_first(t, axis):
    # keep the first sample's counts within each replicate group
    return np.asarray([e[0] for e in t.iter_data(axis=axis, dense=True)])

print(t.collapse(partition_f, collapse_f=collapse_to_first, norm=False, axis='sample'))
In [50]:
def collapse_to_median(t, axis):
    # take the per-observation median within each replicate group
    return np.asarray([np.median(e) for e in t.iter_data(axis=axis, dense=True)])

print(t.collapse(partition_f, collapse_f=collapse_to_median, norm=False, axis='sample'))
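If biom isn't installed, the collapsing strategies themselves can be sketched with plain numpy on a toy table (counts and replicate groups below are hypothetical):

```python
import numpy as np

# Toy table: rows are observations (OTUs), columns are samples s1..s4.
data = np.array([[1, 2, 0, 4],
                 [5, 0, 7, 1]])
# Hypothetical replicate groups: columns 0-1 form one group, 2-3 the other.
groups = {'g1': [0, 1], 'g2': [2, 3]}

# Default collapse: sum counts across each group's samples.
summed = np.column_stack([data[:, idx].sum(axis=1) for idx in groups.values()])

# collapse_to_first analogue: keep the group's first sample.
first = np.column_stack([data[:, idx[0]] for idx in groups.values()])

# collapse_to_median analogue: per-observation median within the group.
median = np.column_stack([np.median(data[:, idx], axis=1) for idx in groups.values()])

print(summed)   # one column per replicate group
print(first)
print(median)
```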
In [60]:
from biom.exception import UnknownAxisError

def collapse_to_rand(t, axis):
    # this is a little clunky - waiting for an answer here
    # http://stackoverflow.com/q/26050412/3424666
    if axis == 'sample':
        length = t.shape[0]
    elif axis == 'observation':
        length = t.shape[1]
    else:
        raise UnknownAxisError(axis)
    n = np.random.randint(length)
    return np.asarray([e[n] for e in t.iter_data(axis=axis, dense=True)])

print(t.collapse(partition_f, collapse_f=collapse_to_rand, norm=False, axis='sample'))
In [55]:
print(t.collapse(partition_f, collapse_f=collapse_to_rand, norm=False, axis='sample'))
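A numpy-only sketch of the random-replicate idea, on the same kind of toy data as above (hypothetical counts and groups); picking directly from each group's column indices sidesteps the axis-length juggling in collapse_to_rand:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy table: rows are observations, columns are samples s1..s4.
data = np.array([[1, 2, 0, 4],
                 [5, 0, 7, 1]])
# Hypothetical replicate groups, as column indices.
groups = {'g1': [0, 1], 'g2': [2, 3]}

# Pick one sample (column) uniformly at random from each replicate group.
picked = np.column_stack([data[:, rng.choice(idx)] for idx in groups.values()])
print(picked)
```

Because the choice is made independently per group, repeated runs (like the two collapse_to_rand calls here) can return different columns.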
In [56]:
print(t.collapse(partition_f, norm=False, axis='sample'))
In [57]:
print(t.collapse(partition_f, norm=True, axis='sample'))