This notebook contains some experiments with collapsing replicate samples for analysis with QIIME. Ideas for this are being discussed in QIIME issue #1678.

This involves two steps:

1. collapsing sample metadata, and
2. collapsing samples in a BIOM table. 

I'm approaching this by first using the sample metadata (in a pandas DataFrame) to determine which sample ids belong to each replicate group, and then using those groups to collapse the BIOM table.

First we'll collapse samples in the mapping file with pandas.


In [1]:
import pandas as pd
import numpy as np

In [2]:
sample_md = pd.read_csv('tiny-test/map', sep='\t')
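
Note that QIIME mapping files are tab-separated, and pandas keeps the leading # on the SampleID header, so this column is referenced as '#SampleID' below.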

In [3]:
sample_md


Out[3]:
#SampleID BarcodeSequence LinkerPrimerSequence SampleType year month day subject replicate days_since_epoch Description
0 f1 ACACTGTTCATG GTGCCAGCMGCCGCGGTAA feces 2008 10 22 1 1 14174 fecal1
1 f2 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 1 1 14175 fecal2
2 f3 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 2 1 14175 identical sequences to fecal2
3 f4 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 2 1 14175 all sequences identical, map to GG 295053 at 9...
4 f5 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 1 2 14175 derived from f3 with some changes to sequences...
5 f6 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 1 2 14175 derived from f4 with some changes to sequences...
6 p1 AACGCACGCTAG GTGCCAGCMGCCGCGGTAA L_palm 2008 10 21 1 2 14173 palm1, contains one randomly generated sequence
7 p2 ACACTGTTCATG GTGCCAGCMGCCGCGGTAA L_palm 2008 10 22 2 2 14174 palm2
8 t1 AGTGAGAGAAGC GTGCCAGCMGCCGCGGTAA Tongue 2008 10 21 2 2 14173 tongue1, contains one randomly generated sequence
9 t2 ATACTATTGCGC GTGCCAGCMGCCGCGGTAA Tongue 2008 10 22 2 2 14174 tongue2
10 not16S.1 ATACTATTGCGC GTGCCAGCMGCCGCGGTAA Other 2008 10 22 1 3 14174 randomly generated sequence plus some variants...

First we'll group the sample ids by subject and replicate.


In [4]:
grouped = sample_md.groupby(['subject', 'replicate'])

Next we'll aggregate. You can specify an aggregation function for each metadata category. Here we want to retain the grouping of input sample ids, so I collect the sample ids in each group into a tuple while aggregating. Note that aggregating with 'first' (as for SampleType here) is lossy when a group is heterogeneous: subject 1, replicate 2 contains both feces and L_palm samples, but only feces is reported below.


In [8]:
collapsed_md = grouped.agg({'days_since_epoch':'mean', 'SampleType':'first', '#SampleID':lambda x: tuple(x)})
collapsed_md


Out[8]:
days_since_epoch #SampleID SampleType
subject replicate
1 1 14174.500000 (f1, f2) feces
2 14174.333333 (f5, f6, p1) feces
3 14174.000000 (not16S.1,) Other
2 1 14175.000000 (f3, f4) feces
2 14173.666667 (p2, t1, t2) L_palm

In [16]:
for e in collapsed_md.index:
    print e, set(collapsed_md['#SampleID'][e])


(1, 1) set(['f1', 'f2'])
(1, 2) set(['p1', 'f5', 'f6'])
(1, 3) set(['not16S.1'])
(2, 1) set(['f3', 'f4'])
(2, 2) set(['p2', 't2', 't1'])
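
To make the group labels easier to work with, we'll flatten the MultiIndex back into regular columns: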

In [66]:
collapsed_md = collapsed_md.reset_index()
collapsed_md


Out[66]:
subject replicate days_since_epoch #SampleID SampleType
0 1 1 14174.500000 (f1, f2) feces
1 1 2 14174.333333 (f5, f6, p1) feces
2 1 3 14174.000000 (not16S.1,) Other
3 2 1 14175.000000 (f3, f4) feces
4 2 2 14173.666667 (p2, t1, t2) L_palm

In [7]:
sample_id_groups = collapsed_md['#SampleID']
print sample_id_groups


0        (f1, f2)
1    (f5, f6, p1)
2     (not16S.1,)
3        (f3, f4)
4    (p2, t1, t2)
Name: #SampleID, dtype: object

In [71]:
set(sample_id_groups)


Out[71]:
{('f1', 'f2'),
 ('f3', 'f4'),
 ('f5', 'f6', 'p1'),
 ('not16S.1',),
 ('p2', 't1', 't2')}
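
Next we'll invert the grouping into a dict that maps each sample id to its (positional) replicate group id: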

In [8]:
sid_to_group_id = {}
for group_id, sample_ids in sample_id_groups.iteritems():
    for sample_id in sample_ids:
        sid_to_group_id[sample_id] = group_id
print sid_to_group_id


{'p2': 4, 'f1': 0, 'f2': 0, 'p1': 1, 'f4': 3, 'f5': 1, 'f6': 1, 't2': 4, 'not16S.1': 2, 't1': 4, 'f3': 3}
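
This mapping is the basis for the partition function we'll pass to biom's collapse below.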

Now we're ready to collapse replicate groups in our BIOM file.


In [9]:
from biom import load_table

In [10]:
t = load_table('tiny-test/biom')

In [33]:
print t


# Constructed from biom file
#OTU ID	f2	f1	f3	f4	p2	p1	t1	not16S.1	t2
295053	20.0	18.0	18.0	22.0	4.0	0.0	0.0	0.0	0.0
42684	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
None11	1.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
None7	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	0.0	0.0	0.0	0.0	0.0	20.0	0.0
None5	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
None4	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	1.0	0.0	2.0	0.0	3.0
None2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0
None1	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0
879972	0.0	0.0	0.0	0.0	9.0	20.0	1.0	0.0	4.0
None9	0.0	0.0	0.0	0.0	3.0	0.0	19.0	0.0	15.0
None8	1.0	4.0	4.0	0.0	0.0	0.0	0.0	0.0	0.0
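
biom's Table.collapse takes a partition function, which is called with each sample id and its metadata and returns the group that sample belongs to: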

In [34]:
def partition_f(id_, md):
    return sid_to_group_id[id_]
print t.collapse(partition_f, norm=False)


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	38.0	0.0	0.0	40.0	4.0
42684	0.0	0.0	0.0	0.0	1.0
None11	1.0	1.0	0.0	0.0	1.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	1.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	1.0
None4	0.0	1.0	0.0	0.0	1.0
None3	0.0	0.0	0.0	0.0	6.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	14.0
None9	0.0	0.0	0.0	0.0	37.0
None8	5.0	0.0	0.0	4.0	0.0

By default, observation counts are summed within each replicate group when collapsing samples. Some other common collapse functions are illustrated below.
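
For reference, here's a minimal sketch of a custom collapse function that should reproduce the default summing behavior. Note that collapse appears to hand collapse_f the axis to iterate over, which is the opposite of the axis being collapsed, so for axis='sample' collapses we iterate over observation vectors (this is also why collapse_to_rand below checks the table shape the way it does).

In [ ]:
def collapse_to_sum(t, axis):
    # sum each observation vector across the samples in the group;
    # a hypothetical equivalent of the default collapse function
    return np.asarray([e.sum() for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_sum, norm=False, axis='sample')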

Collapse to the first sample in each group


In [49]:
def collapse_to_first(t, axis):
    # take the first sample's value for each observation vector in the group
    return np.asarray([e[0] for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_first, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	20.0	0.0	0.0	18.0	4.0
42684	0.0	0.0	0.0	0.0	1.0
None11	1.0	1.0	0.0	0.0	1.0
None10	0.0	0.0	0.0	0.0	0.0
None7	0.0	0.0	0.0	0.0	1.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	1.0
None4	0.0	1.0	0.0	0.0	1.0
None3	0.0	0.0	0.0	0.0	1.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	9.0
None9	0.0	0.0	0.0	0.0	3.0
None8	1.0	0.0	0.0	4.0	0.0

Collapse to the median value for each observation


In [50]:
def collapse_to_median(t, axis):
    # take the per-observation median across the samples in the group
    return np.asarray([np.median(e) for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_median, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	19.0	0.0	0.0	20.0	0.0
42684	0.0	0.0	0.0	0.0	0.0
None11	0.5	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	0.0
None7	0.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.0
None4	0.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	4.0
None9	0.0	0.0	0.0	0.0	15.0
None8	2.5	0.0	0.0	2.0	0.0

Collapse to a randomly selected sample


In [60]:
from biom.exception import UnknownAxisError

def collapse_to_rand(t, axis):
    # this is a little clunky - waiting for an answer here
    # http://stackoverflow.com/q/26050412/3424666
    # t.shape is (n_observations, n_samples); since the axis we receive is
    # the one to iterate over, axis == 'observation' means we're collapsing
    # samples, so we draw a random sample index
    if axis == 'sample':
        length = t.shape[0]
    elif axis == 'observation':
        length = t.shape[1]
    else:
        raise UnknownAxisError(axis)
    n = np.random.randint(length)
    return np.asarray([e[n] for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_rand, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	18.0	0.0	0.0	22.0	0.0
42684	0.0	0.0	0.0	0.0	0.0
None11	0.0	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.0
None4	0.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	1.0
None9	0.0	0.0	0.0	0.0	19.0
None8	4.0	0.0	0.0	0.0	0.0
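
Since a new random index is drawn on every call, re-running the same collapse can pick a different sample for each group: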

In [55]:
print t.collapse(partition_f, collapse_f=collapse_to_rand, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	18.0	0.0	0.0	18.0	0.0
42684	0.0	0.0	0.0	0.0	0.0
None11	0.0	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.0
None4	0.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	1.0
None9	0.0	0.0	0.0	0.0	19.0
None8	4.0	0.0	0.0	4.0	0.0

Sum counts (the default behavior)


In [56]:
print t.collapse(partition_f, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	38.0	0.0	0.0	40.0	4.0
42684	0.0	0.0	0.0	0.0	1.0
None11	1.0	1.0	0.0	0.0	1.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	1.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	1.0
None4	0.0	1.0	0.0	0.0	1.0
None3	0.0	0.0	0.0	0.0	6.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	14.0
None9	0.0	0.0	0.0	0.0	37.0
None8	5.0	0.0	0.0	4.0	0.0

Average counts (with norm=True, each group's summed counts are divided by the number of samples in the group)


In [57]:
print t.collapse(partition_f, norm=True, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	19.0	0.0	0.0	20.0	1.33333333333
42684	0.0	0.0	0.0	0.0	0.333333333333
None11	0.5	1.0	0.0	0.0	0.333333333333
None10	0.0	0.0	0.0	0.0	0.333333333333
None7	0.0	0.0	0.0	0.0	0.333333333333
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.333333333333
None4	0.0	1.0	0.0	0.0	0.333333333333
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	4.66666666667
None9	0.0	0.0	0.0	0.0	12.3333333333
None8	2.5	0.0	0.0	2.0	0.0

In [ ]: