This notebook contains some experiments with collapsing replicate samples for analysis with QIIME. Ideas for this are being discussed in QIIME issue #1678.

This involves two steps:

1. collapsing sample metadata, and
2. collapsing samples in a BIOM table. 

I'm approaching this by first using the sample metadata (in a pandas DataFrame) to determine which sample ids belong to each replicate group, and then using those groups to collapse the BIOM table.

First we'll collapse samples in the mapping file with pandas.


In [1]:
import pandas as pd
import numpy as np

In [2]:
sample_md = pd.read_csv('tiny-test/map', sep='\t')
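
Note that QIIME mapping files are tab-separated, and pandas keeps the leading # on the SampleID header, so this column is referenced as '#SampleID' below.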

In [3]:
sample_md


Out[3]:
#SampleID BarcodeSequence LinkerPrimerSequence SampleType year month day subject replicate days_since_epoch Description
0 f1 ACACTGTTCATG GTGCCAGCMGCCGCGGTAA feces 2008 10 22 1 1 14174 fecal1
1 f2 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 1 1 14175 fecal2
2 f3 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 2 1 14175 identical sequences to fecal2
3 f4 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 2 1 14175 all sequences identical, map to GG 295053 at 9...
4 f5 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 1 2 14175 derived from f3 with some changes to sequences...
5 f6 ACCAGACGATGC GTGCCAGCMGCCGCGGTAA feces 2008 10 23 1 2 14175 derived from f4 with some changes to sequences...
6 p1 AACGCACGCTAG GTGCCAGCMGCCGCGGTAA L_palm 2008 10 21 1 2 14173 palm1, contains one randomly generated sequence
7 p2 ACACTGTTCATG GTGCCAGCMGCCGCGGTAA L_palm 2008 10 22 2 2 14174 palm2
8 t1 AGTGAGAGAAGC GTGCCAGCMGCCGCGGTAA Tongue 2008 10 21 2 2 14173 tongue1, contains one randomly generated sequence
9 t2 ATACTATTGCGC GTGCCAGCMGCCGCGGTAA Tongue 2008 10 22 2 2 14174 tongue2
10 not16S.1 ATACTATTGCGC GTGCCAGCMGCCGCGGTAA Other 2008 10 22 1 3 14174 randomly generated sequence plus some variants...

First we'll group the sample ids by subject and replicate.


In [4]:
grouped = sample_md.groupby(['subject', 'replicate'])

Next we'll aggregate. You can specify an aggregation function for each metadata category. Here we want to retain the grouping of input sample ids, so I collect the sample ids in each group into a tuple while aggregating. Note that aggregating with 'first' (as for SampleType here) is lossy when a group is heterogeneous: subject 1, replicate 2 contains both feces and L_palm samples, but only feces is reported below.


In [8]:
collapsed_md = grouped.agg({'days_since_epoch':'mean', 'SampleType':'first', '#SampleID':lambda x: tuple(x)})
collapsed_md


Out[8]:
days_since_epoch #SampleID SampleType
subject replicate
1 1 14174.500000 (f1, f2) feces
2 14174.333333 (f5, f6, p1) feces
3 14174.000000 (not16S.1,) Other
2 1 14175.000000 (f3, f4) feces
2 14173.666667 (p2, t1, t2) L_palm

In [16]:
for e in collapsed_md.index:
    print e, set(collapsed_md['#SampleID'][e])


(1, 1) set(['f1', 'f2'])
(1, 2) set(['p1', 'f5', 'f6'])
(1, 3) set(['not16S.1'])
(2, 1) set(['f3', 'f4'])
(2, 2) set(['p2', 't2', 't1'])
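
To make the group labels easier to work with, we'll flatten the MultiIndex back into regular columns: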

In [66]:
collapsed_md = collapsed_md.reset_index()
collapsed_md


Out[66]:
subject replicate days_since_epoch #SampleID SampleType
0 1 1 14174.500000 (f1, f2) feces
1 1 2 14174.333333 (f5, f6, p1) feces
2 1 3 14174.000000 (not16S.1,) Other
3 2 1 14175.000000 (f3, f4) feces
4 2 2 14173.666667 (p2, t1, t2) L_palm

In [7]:
sample_id_groups = collapsed_md['#SampleID']
print sample_id_groups


0        (f1, f2)
1    (f5, f6, p1)
2     (not16S.1,)
3        (f3, f4)
4    (p2, t1, t2)
Name: #SampleID, dtype: object

In [71]:
set(sample_id_groups)


Out[71]:
{('f1', 'f2'),
 ('f3', 'f4'),
 ('f5', 'f6', 'p1'),
 ('not16S.1',),
 ('p2', 't1', 't2')}
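
Next we'll invert the grouping into a dict that maps each sample id to its (positional) replicate group id: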

In [8]:
sid_to_group_id = {}
for group_id, sample_ids in sample_id_groups.iteritems():
    for sample_id in sample_ids:
        sid_to_group_id[sample_id] = group_id
print sid_to_group_id


{'p2': 4, 'f1': 0, 'f2': 0, 'p1': 1, 'f4': 3, 'f5': 1, 'f6': 1, 't2': 4, 'not16S.1': 2, 't1': 4, 'f3': 3}
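
This mapping is the basis for the partition function we'll pass to biom's collapse below.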

Now we're ready to collapse replicate groups in our BIOM file.


In [9]:
from biom import load_table

In [10]:
t = load_table('tiny-test/biom')

In [33]:
print t


# Constructed from biom file
#OTU ID	f2	f1	f3	f4	p2	p1	t1	not16S.1	t2
295053	20.0	18.0	18.0	22.0	4.0	0.0	0.0	0.0	0.0
42684	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
None11	1.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
None7	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	0.0	0.0	0.0	0.0	0.0	20.0	0.0
None5	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
None4	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	1.0	0.0	2.0	0.0	3.0
None2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2.0	0.0
None1	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0
879972	0.0	0.0	0.0	0.0	9.0	20.0	1.0	0.0	4.0
None9	0.0	0.0	0.0	0.0	3.0	0.0	19.0	0.0	15.0
None8	1.0	4.0	4.0	0.0	0.0	0.0	0.0	0.0	0.0
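
biom's Table.collapse takes a partition function, which is called with each sample id and its metadata and returns the group that sample belongs to: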

In [34]:
def partition_f(id_, md):
    return sid_to_group_id[id_]
print t.collapse(partition_f, norm=False)


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	38.0	0.0	0.0	40.0	4.0
42684	0.0	0.0	0.0	0.0	1.0
None11	1.0	1.0	0.0	0.0	1.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	1.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	1.0
None4	0.0	1.0	0.0	0.0	1.0
None3	0.0	0.0	0.0	0.0	6.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	14.0
None9	0.0	0.0	0.0	0.0	37.0
None8	5.0	0.0	0.0	4.0	0.0

By default, observation counts are summed within each replicate group when collapsing samples. Some other common collapse functions are illustrated below.
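
For reference, here's a minimal sketch of a custom collapse function that should reproduce the default summing behavior. Note that collapse appears to hand collapse_f the axis to iterate over, which is the opposite of the axis being collapsed, so for axis='sample' collapses we iterate over observation vectors (this is also why collapse_to_rand below checks the table shape the way it does).

In [ ]:
def collapse_to_sum(t, axis):
    # sum each observation vector across the samples in the group;
    # a hypothetical equivalent of the default collapse function
    return np.asarray([e.sum() for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_sum, norm=False, axis='sample')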

Collapse to the first sample in each group


In [49]:
def collapse_to_first(t, axis):
    # take the first sample's value for each observation vector in the group
    return np.asarray([e[0] for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_first, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	20.0	0.0	0.0	18.0	4.0
42684	0.0	0.0	0.0	0.0	1.0
None11	1.0	1.0	0.0	0.0	1.0
None10	0.0	0.0	0.0	0.0	0.0
None7	0.0	0.0	0.0	0.0	1.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	1.0
None4	0.0	1.0	0.0	0.0	1.0
None3	0.0	0.0	0.0	0.0	1.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	9.0
None9	0.0	0.0	0.0	0.0	3.0
None8	1.0	0.0	0.0	4.0	0.0

Collapse to the median value for each observation


In [50]:
def collapse_to_median(t, axis):
    # take the per-observation median across the samples in the group
    return np.asarray([np.median(e) for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_median, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	19.0	0.0	0.0	20.0	0.0
42684	0.0	0.0	0.0	0.0	0.0
None11	0.5	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	0.0
None7	0.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.0
None4	0.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	4.0
None9	0.0	0.0	0.0	0.0	15.0
None8	2.5	0.0	0.0	2.0	0.0

Collapse to a randomly selected sample


In [60]:
from biom.exception import UnknownAxisError

def collapse_to_rand(t, axis):
    # this is a little clunky - waiting for an answer here
    # http://stackoverflow.com/q/26050412/3424666
    # t.shape is (n_observations, n_samples); since the axis we receive is
    # the one to iterate over, axis == 'observation' means we're collapsing
    # samples, so we draw a random sample index
    if axis == 'sample':
        length = t.shape[0]
    elif axis == 'observation':
        length = t.shape[1]
    else:
        raise UnknownAxisError(axis)
    n = np.random.randint(length)
    return np.asarray([e[n] for e in t.iter_data(axis=axis, dense=True)])

print t.collapse(partition_f, collapse_f=collapse_to_rand, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	18.0	0.0	0.0	22.0	0.0
42684	0.0	0.0	0.0	0.0	0.0
None11	0.0	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.0
None4	0.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	1.0
None9	0.0	0.0	0.0	0.0	19.0
None8	4.0	0.0	0.0	0.0	0.0
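
Since a new random index is drawn on every call, re-running the same collapse can pick a different sample for each group: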

In [55]:
print t.collapse(partition_f, collapse_f=collapse_to_rand, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	18.0	0.0	0.0	18.0	0.0
42684	0.0	0.0	0.0	0.0	0.0
None11	0.0	1.0	0.0	0.0	0.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	0.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.0
None4	0.0	1.0	0.0	0.0	0.0
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	1.0
None9	0.0	0.0	0.0	0.0	19.0
None8	4.0	0.0	0.0	4.0	0.0

Sum counts (the default behavior)


In [56]:
print t.collapse(partition_f, norm=False, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	38.0	0.0	0.0	40.0	4.0
42684	0.0	0.0	0.0	0.0	1.0
None11	1.0	1.0	0.0	0.0	1.0
None10	0.0	0.0	0.0	0.0	1.0
None7	0.0	0.0	0.0	0.0	1.0
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	1.0
None4	0.0	1.0	0.0	0.0	1.0
None3	0.0	0.0	0.0	0.0	6.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	14.0
None9	0.0	0.0	0.0	0.0	37.0
None8	5.0	0.0	0.0	4.0	0.0

Average counts (with norm=True, each group's summed counts are divided by the number of samples in the group)


In [57]:
print t.collapse(partition_f, norm=True, axis='sample')


# Constructed from biom file
#OTU ID	0	1	2	3	4
295053	19.0	0.0	0.0	20.0	1.33333333333
42684	0.0	0.0	0.0	0.0	0.333333333333
None11	0.5	1.0	0.0	0.0	0.333333333333
None10	0.0	0.0	0.0	0.0	0.333333333333
None7	0.0	0.0	0.0	0.0	0.333333333333
None6	0.0	0.0	20.0	0.0	0.0
None5	0.0	0.0	0.0	0.0	0.333333333333
None4	0.0	1.0	0.0	0.0	0.333333333333
None3	0.0	0.0	0.0	0.0	2.0
None2	0.0	0.0	2.0	0.0	0.0
None1	0.0	1.0	0.0	0.0	0.0
879972	0.0	20.0	0.0	0.0	4.66666666667
None9	0.0	0.0	0.0	0.0	12.3333333333
None8	2.5	0.0	0.0	2.0	0.0

In [ ]: